Intelligent search with AI: applying semantics with low cost and simplicity

I decided to add intelligent search to the articles on this blog without turning a simple site into a heavy platform for a simple problem. The blog works in a direct way: the content is written in Markdown, GitHub Actions runs the build, Markdown becomes HTML, and the files are sent to the server through SSH. The challenge was to apply semantic search inside that flow without a dedicated search service, without a vector database, and without depending on a paid API token for every query.

That constraint pushed the solution toward a simpler path. The site could become more sophisticated without carrying a heavy stack or creating additional costs. The main decision was to calculate ahead of time everything that could be calculated ahead of time. The index, chunks, vocabulary and part of the vectors would be generated during the build. The server would only handle what depends on the reader's query. The goal was to apply low-cost AI while preserving operational simplicity.

When articles are short, a simple word search solves a lot. When articles take more than an hour to read, the game changes and searching only by title is no longer enough. The content the reader wants may be buried inside a section in the middle of the article, and the title may not contain the term typed in the search box. Someone searches for "CAP theorem" and expects to land on the right section about consistency and availability. Someone types "hexagonil" by mistake and expects the system to understand that they may be looking for hexagonal architecture. Someone types "CAAP" and expects to find the section about CAP, without having to know that one extra letter changes everything for a machine.

This is where search stops being just a visual field in the interface. It starts involving decisions about indexing, ranking, error tolerance, index updates and execution cost.

Alfred Korzybski is known for the phrase "the map is not the territory". A search index is not the content either. It is a compact, partial and biased representation of the content. The quality of search depends on the kind of map we decide to build.

Literal word search

The simplest search treats the article as a large string and asks whether the typed text appears inside it. In JavaScript, that would look like text.toLowerCase().includes(query.toLowerCase()). It works when the searched word appears exactly in the article, but it fails when there is a typo, a plural variation, an acronym written differently, or an intention expressed with different words.

This approach is easy to implement and works for exact matches. If the article contains CAP and the reader types CAP, the result appears. The problem appears when the query does not repeat the text exactly.

A person does not always remember the exact title. Often they remember a concept, type a singular word while the article uses the plural, or miss a letter along the way. They search for "hexagonil architecture" while the text uses "hexagonal architecture". They type "CAAP" while the article says "CAP". For literal search, each small difference becomes a complete mismatch.
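
The same literal check can be sketched in Python to show how each small difference becomes a miss. The sample sentence is hypothetical and only serves the illustration.

```python
def literal_match(text: str, query: str) -> bool:
    """Case-insensitive substring check, the Python analogue of includes()."""
    return query.lower() in text.lower()


# Hypothetical sample sentence, not taken from a real article.
section = "The CAP theorem and hexagonal architecture are discussed below."

exact = literal_match(section, "CAP")                      # same text, found
typo = literal_match(section, "hexagonil architecture")    # one wrong letter
extra_letter = literal_match(section, "CAAP")              # one extra letter
```

Only the first query matches. The other two differ by a single character and return nothing at all.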

Literal search has another problem. It can say whether it found something, but it cannot say whether the result is good. An article that mentions "architecture" once in a footnote may appear next to another one with twenty sections about the topic. Without ranking, every match looks the same.

When every result looks the same, the work goes back to the reader. They need to open article after article, use the browser search, search inside the page, go back, and try another term. The search returns a list, but it does not guide the choice.

What indexed search changes

Indexed search changes when part of the work happens.

In naive search, every query scans the raw content. In indexed search, the content is analyzed beforehand. The system reads the articles during the build, splits the text into smaller parts, extracts tokens, normalizes words, creates auxiliary structures and stores everything in a format that can be queried quickly.

That changes the server's responsibility. Instead of reading all articles on every query, it queries structures that are already prepared.

In this blog, the articles already go through a build. Markdown enters on one side, HTML comes out on the other. That creates an excellent opportunity. Instead of generating the index on the server for every search, the build can generate a SQLite database together with the site.

The production server keeps a light responsibility. It receives the query, opens the SQLite database, calculates ranking and returns results. The expensive work happens earlier, in the pipeline.

flowchart LR
    MD[Articles in Markdown] --> Build[Site build]
    Build --> HTML[HTML pages]
    Build --> Index[SQLite search database]
    HTML --> Host[Server]
    Index --> Host
    User[Reader] --> API[Search layer]
    API --> Index
    API --> JSON[Results]
    JSON --> UI[Search interface]

This design preserves the simplicity of the hosting model. The site remains mostly static. Search adds a small API, but the index is just a file published together with the site.

Why index chunks instead of entire articles

For long articles, indexing the whole article as a single unit produces imprecise results.

That path is common at first. The user searches, the system returns articles. But a long article needs a finer granularity. A one-hour text may discuss five topics in depth. If the reader searches for "CAP theorem", returning only the article forces them to search again inside the page.

That is why the index was created with chunks.

A chunk is a piece of the article. It carries the article title, category, tags, date, current heading, excerpt text, URL and anchor. That allows a result to point to the correct section.

The chunk changes the experience because the result starts answering "where is this?" instead of only "which article contains something about this?".

flowchart TD
    Article[Long article] --> H2A[Section about architecture]
    Article --> H2B[Section about cache]
    Article --> H2C[Section about CAP]
    H2A --> C1[Chunk 1]
    H2A --> C2[Chunk 2]
    H2B --> C3[Chunk 3]
    H2C --> C4[Chunk 4]
    C4 --> Result[Result with a direct link to the section]

Chunk size needs care. Too small, and it loses context. Too large, and it behaves like an entire article again. I used a simple strategy: split by headings, limit by approximate word count, and add a small overlap between blocks. The overlap reduces the risk of cutting a line of reasoning in half.
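
A minimal sketch of that strategy, assuming Markdown headings marked with `#` and illustrative values for the word limit and the overlap. The function name, the dictionary shape and both numbers are assumptions, not the blog's real settings.

```python
import re


def split_into_chunks(markdown: str, max_words: int = 120, overlap: int = 20):
    """Split Markdown by headings, then into overlapping word windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")

    chunks = []
    heading = ""
    buffer = []

    def flush():
        # Turn the lines collected under the current heading into windows.
        words = " ".join(buffer).split()
        step = max_words - overlap
        start = 0
        while start < len(words):
            window = words[start:start + max_words]
            chunks.append({"heading": heading, "text": " ".join(window)})
            if start + max_words >= len(words):
                break
            start += step
        buffer.clear()

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            flush()  # close the previous section before switching heading
            heading = match.group(1).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each window repeats the last `overlap` words of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.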

That decision defines result precision before any ranking algorithm enters the picture.

Inverted index and FTS5

The classic model of textual search starts with a structure called an inverted index.

In a normal index, you start from a document and find the words inside it. In an inverted index, you start from a word and find the documents where it appears.

If three chunks contain the word "cache", the inverted index stores something like this.

cache -> chunk 12, chunk 18, chunk 44
cap -> chunk 19, chunk 20
hexagonal -> chunk 91, chunk 93, chunk 94

That inversion explains why search engines can answer quickly even with many documents. The query "cache cap" does not need to read all texts. It looks up occurrence lists.
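
As a rough illustration, an inverted index fits in a dictionary that maps each token to a set of chunk ids. The sample chunks are hypothetical, and intersecting the lists is only one possible way to answer a multi-term query. A real engine may instead take the union and leave the decision to ranking.

```python
from collections import defaultdict


def build_inverted_index(chunks: dict[int, str]) -> dict[str, set[int]]:
    """Map every token to the ids of the chunks where it appears."""
    index = defaultdict(set)
    for chunk_id, text in chunks.items():
        for token in text.lower().split():
            index[token].add(chunk_id)
    return index


def lookup(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of chunks that contain every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical chunks, keyed by chunk id.
index = build_inverted_index({
    12: "cache invalidation",
    18: "cache and cap trade-offs",
    19: "cap theorem",
})
both_terms = lookup(index, "cache cap")
```

The query never touches the chunk texts. It only reads the two occurrence lists and intersects them.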

SQLite offers FTS5, short for Full Text Search 5. It is a SQLite extension for textual search. It creates a virtual table optimized for searching terms in text fields. FTS5 was chosen for three reasons: it is available in many SQLite installations, it does not require a separate service, and it works well for a small corpus like a blog.

In the blog index, every chunk is stored twice: in a normal table and in an FTS table.

The normal table stores presentation and control data. The FTS table stores the fields used for textual search: title, description, category, tags, heading and text.

erDiagram
    CHUNKS {
        integer id
        string lang
        string slug
        string title
        string category
        string date
        string url
        string heading
        string excerpt
        string text
    }

    CHUNKS_FTS {
        string title
        string description
        string category
        string tags
        string heading
        string text
    }

    VOCABULARY {
        string lang
        string token
        integer doc_count
    }

    CHUNKS ||--|| CHUNKS_FTS : rowid

FTS5 solves the fast candidate lookup part. After that, results still need to be ordered. For textual ordering, BM25 enters.

How BM25 orders results

BM25 is a textual ranking algorithm. Ranking, in this context, means assigning a score to each result and ordering the most promising ones first. The full name, Okapi BM25, comes from the Okapi system developed at City University London.

BM25 answers a practical question. Given a query with certain terms, which chunks should appear first?

It uses three main criteria.

When a query term appears in a document, the score increases. When it appears many times, it increases more, but with saturation. The tenth occurrence is not worth ten times the first one. After a certain point, repeating the same word adds little information.

When a term is rare in the corpus, it is worth more. If the word "architecture" appears in many articles, it helps, but it does not differentiate much. If "CAP" appears in only a few chunks, it carries more information. This idea is called IDF, inverse document frequency.

When a document is very long, BM25 adjusts the score to prevent long texts from winning only because they contain more words. This matters for long articles, because a larger chunk naturally has more chances of containing any term.

BM25 can be summarized like this.

score = term rarity
      * presence in the document with saturation
      * adjustment by document length

In the complete formula, two parameters control much of the behavior. k1 controls the saturation of term frequency. With a higher k1, repeating a word in the same document keeps increasing the score for longer. With a lower k1, the gain saturates earlier. This prevents a chunk from ranking higher only because it repeated the same word many times.

The b parameter controls how much document length affects the adjustment. With a higher b, long documents are penalized more strongly. With a lower b, length matters less. In long articles, this changes ordering significantly, because a larger block has more chance of containing any word simply by volume.
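
The effect of k1 and b is easier to see with numbers. The sketch below computes the contribution of a single term using the textbook Okapi BM25 form with a non-negative IDF variant. It is an illustration, not the exact arithmetic FTS5 performs internally, and the default values 1.2 and 0.75 are the commonly used ones.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score contribution of one query term for one document."""
    # Rarity: terms found in few documents get a larger IDF.
    idf = math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)
    # Length adjustment: above-average documents are penalized according to b.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # Saturation: the fraction approaches k1 + 1 as tf grows.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)
```

With a document of average length, one occurrence makes the frequency part equal to 1.0, ten occurrences raise it to about 1.96, and it can never pass k1 + 1 = 2.2. That ceiling is the saturation described above.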

FTS5 exposes BM25 in a practical way. Instead of reimplementing the entire formula, I use the bm25 function and assign different weights per column. In the blog index, title, heading and tags have more weight than body text. That respects an editorial intuition. If a term appears in a section title, it probably describes that section better than an isolated occurrence in the middle of a paragraph.

bm25(chunks_fts, 8.0, 3.0, 2.0, 2.5, 4.0, 1.0)

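Those weights represent the relative importance of the columns and follow the order in which the FTS columns are listed above: title 8.0, description 3.0, category 2.0, tags 2.5, heading 4.0 and text 1.0. Title weighs the most, heading comes next, and body text weighs the least. Search still looks at everything, but it understands that not every occurrence has the same value.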
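
A small sketch with Python's sqlite3 module shows the weighted call in context. It assumes an SQLite build with FTS5 enabled and declares the columns in the order listed earlier. The sample rows are hypothetical. One detail worth knowing: FTS5's bm25() returns smaller (more negative) values for better matches, so the query orders ascending.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE chunks_fts USING fts5("
    "title, description, category, tags, heading, text)"
)

# Hypothetical rows: title, description, category, tags, heading, text.
rows = [
    ("Distributed systems", "", "architecture", "cap", "CAP theorem",
     "Consistency and availability trade-offs."),
    ("Caching notes", "", "performance", "cache", "Invalidation",
     "A footnote mentions the cap theorem once."),
    ("Hexagonal architecture", "", "architecture", "ports", "Adapters",
     "Ports isolate the domain."),
    ("Queues", "", "messaging", "broker", "Backpressure",
     "Consumers slow producers down."),
    ("Observability", "", "operations", "metrics", "Tracing",
     "Spans follow a request."),
]
conn.executemany("INSERT INTO chunks_fts VALUES (?, ?, ?, ?, ?, ?)", rows)

# One weight per column, in declaration order. Best match comes first.
results = conn.execute(
    "SELECT heading, bm25(chunks_fts, 8.0, 3.0, 2.0, 2.5, 4.0, 1.0) AS score "
    "FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY score",
    ("cap",),
).fetchall()
```

Both matching rows contain the term, but the one that carries it in the tags and in the heading ranks above the one that only mentions it in the body text.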

BM25 was chosen because it solves the textual part of search well without requiring new infrastructure. For this corpus size, starting with a heavier stack like Elasticsearch, a dedicated Lucene setup or an external embeddings API, including OpenAI, would be excessive. BM25 does not understand the deep meaning of the text. Even so, it usually ranks very well when the query and the document share vocabulary.

When the query uses another word, a typo, or an acronym written differently, BM25 needs to be combined with other signals.

Where textual search stumbles

Textual search stumbles in three common situations.

First, typos. The person types "hexagonil". The text contains "hexagonal". For a pure textual index, those are different terms.

Second, short acronyms. The person types "CAAP". The text contains "CAP". The visual distance is small for a person. For the index, it is another word.

Third, intention. The person types a nearby but non-identical formulation, while the text uses an acronym, an English term, a technical variation or a more specific expression. Textual search only finds what shares words with the query.

Embeddings can help in these cases, but they do not remove basic textual search problems. The first decision was to solve as much as possible with an index, ranking, term correction and cheap-to-calculate signals. Simple scenarios usually ask for simple solutions. Besides being cheaper, they are easier to explain, test and fix.

The answer was to build hybrid search.

Hybrid search as a composition of approximations

Hybrid search means combining more than one type of evidence to order results.

One signal comes from BM25, used when the words in the query appear in the text. Another comes from an enriched lexical vector, used to approximate related terms inside a known vocabulary. Another comes from fuzzy correction, used for typos. Another comes from editorial weights, because title and section usually describe the topic better than an isolated occurrence in the body.

Each signal covers part of the problem. The final ranking sums this evidence with different weights.

flowchart TD
    Query[Reader query] --> Normalize[Normalization]
    Normalize --> Correct[Vocabulary correction]
    Correct --> FTS[FTS5 and BM25 search]
    Correct --> Vector[Enriched lexical vector]
    FTS --> Merge[Score combination]
    Vector --> Merge
    Merge --> Dedup[Deduplication by article and section]
    Dedup --> Results[Ordered results]

This composition has an operational advantage. Lexical search depends only on the SQLite database and ordinary application code. There is no model to load in memory on that path. There is no exposed API key. There is no external call on each query. The cost per search stays predictable.

This path has limits. An enriched lexical vector approximates terms, corrects input and improves specific corpus cases. When the requirement becomes broader semantic comparison, the next step is to generate embeddings during the build and store those vectors in the index.

Normalization before any algorithm

Before ranking, the input needs to be prepared.

Normalization reduces superficial variations. The system converts text to lowercase, removes accents, extracts alphanumeric tokens and discards stopwords. That makes "Architecture", "architecture" and accent variations move toward similar representations.

Stopwords are frequent words with little discriminative value, such as "of", "the", "to", "with" and "and". Removing them reduces noise in the index and in the query.

This also has risk. In some domains, a short word may matter. In technology, "io", "ai", "cap", "cpu" and "ddd" matter. That is why the stopword list needs to be small and adjusted to the corpus. An aggressive generic list could remove exactly the technical terms that make the search valuable.

In a software engineering blog, acronyms, pattern names, and mixed English and Portuguese terms appear all the time. Search needs to preserve that technical vocabulary instead of applying a cleanup that is too generic.
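
The normalization steps can be sketched with the standard library alone. The stopword set below contains only the five examples mentioned above. The real list is not shown in this article, so treat it as a placeholder.

```python
import re
import unicodedata

# Deliberately small placeholder list. Short technical terms such as
# "io", "ai" or "cap" are not in it, so they survive normalization.
STOPWORDS = {"of", "the", "to", "with", "and"}


def normalize(text: str) -> list[str]:
    """Lowercase, strip accents, extract alphanumeric tokens, drop stopwords."""
    lowered = text.lower()
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    without_accents = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    tokens = re.findall(r"[a-z0-9]+", without_accents)
    return [token for token in tokens if token not in STOPWORDS]


tokens = normalize("The CAP theorem and the saturação of AI")
```

Here `tokens` ends up as `["cap", "theorem", "saturacao", "ai"]`: the acronyms survive, the accents are removed and the stopwords disappear.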

Fuzzy correction using the corpus vocabulary

Fuzzy correction is the step that tries to approximate a typed word to a word that exists in the index. It was added for cases like hexagonil and CAAP.

During the build, the index creates a corpus vocabulary. It contains tokens found in the articles and the number of chunks where each token appears.

When a query arrives, the system checks whether each term exists in the vocabulary. If it exists, it uses the term as typed. If it does not, it looks for similar candidates in the vocabulary.

That candidate search uses two ideas.

The first is trigram similarity. A trigram is a sequence of three characters. The word hexagonil generates pieces like hex, exa, xag, ago, gon, oni, nil. The word hexagonal generates many similar pieces. The intersection between those sets gives a proximity signal.

The second is edit distance, also known as Levenshtein distance. It measures how many single-character operations (insertions, removals and substitutions) are needed to transform one word into another.

CAAP to CAP has distance 1. It only needs one A removed. hexagonil to hexagonal also has distance 1, because a single substitution turns the i into an a.

The algorithm avoids correcting anything into anything. It compares only candidates with nearby length, requires minimum trigram similarity and applies an edit-distance limit based on term size. Short terms have a smaller limit, because one letter changes the meaning of an acronym a lot.
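
The two measures can be sketched as follows. The blog's own implementations are not shown in this article, so the padding scheme, the Jaccard formula and the early exit are assumptions. The function names mirror the helpers called by the correction code. Padding matters for short tokens: without it, "caap" and "cap" would share no trigram at all.

```python
def trigrams(word: str) -> set[str]:
    # Assumed padding: two leading spaces and one trailing space, so that
    # word boundaries also produce trigrams.
    padded = f"  {word} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the trigram sets of two words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def edit_distance(a: str, b: str, limit: int) -> int:
    """Levenshtein distance, capped at limit + 1 when the limit is exceeded."""
    if abs(len(a) - len(b)) > limit:
        return limit + 1
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # removal
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        # The smallest value in a row never decreases in later rows.
        if min(current) > limit:
            return limit + 1
        previous = current
    return min(previous[-1], limit + 1)
```

With this padding, "caap" and "cap" share three of six distinct trigrams, a similarity of 0.5, and both example typos sit at edit distance 1 from their corrections.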

After correction, the query carries two pieces of information: the corrected terms used internally and a correction map that can be returned in JSON.

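def correction_distance_limit(term):
    # Maximum edit distance accepted for a correction, scaled by term length.
    # Short terms (acronyms) tolerate a single edit; longer words tolerate more.
    if len(term) <= 4:
        return 1
    if len(term) <= 8:
        return 2
    return 3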


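def correction_candidates(term, vocab_map, limit):
    # Collect vocabulary tokens close enough to `term` to be a plausible fix.
    candidates = []
    for token, doc_count in vocab_map.items():
        if token == term:
            continue
        # Cheap length filter before the more expensive comparisons.
        if abs(len(token) - len(term)) > limit:
            continue
        similarity = trigram_similarity(term, token)
        if similarity < 0.28:
            continue
        distance = edit_distance(term, token, limit)
        if distance <= limit:
            # Tuple order drives the sort: smaller distance first, then higher
            # similarity, then higher corpus frequency.
            candidates.append((distance, -similarity, -doc_count, token, doc_count))
    candidates.sort()
    return candidates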


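def should_correct_existing_term(term_count, best_count):
    # Override a term that exists in the vocabulary only when it is rare and
    # the best candidate appears in clearly more chunks.
    return term_count <= 10 and best_count >= max(term_count * 2, term_count + 3)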


def correct_query_terms(conn, terms, lang):
    if not terms:
        return terms, {}

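    # Load the vocabulary for this language. Reading columns by name requires
    # a name-aware row factory, such as conn.row_factory = sqlite3.Row.
    vocabulary = conn.execute(
        "SELECT token, doc_count FROM vocabulary WHERE lang = ?",
        (lang,),
    ).fetchall()
    vocab_map = {row["token"]: row["doc_count"] for row in vocabulary}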
    corrected = []
    corrections = {}

    for term in terms:
        term_count = vocab_map.get(term)
        limit = correction_distance_limit(term)
        candidates = correction_candidates(term, vocab_map, limit)

        if term_count is not None:
            if candidates and should_correct_existing_term(term_count, candidates[0][4]):
                best = candidates[0][3]
                corrections[term] = best
                corrected.append(best)
            else:
                corrected.append(term)
            continue

        if candidates:
            best = candidates[0][3]
            corrections[term] = best
            corrected.append(best)
        else:
            corrected.append(term)

    # Drop repeated terms while keeping their original order.
    deduped = list(dict.fromkeys(corrected))
    return deduped, corrections

The code has four important gates. The first is the limit based on word length: short terms, like acronyms, only accept distance 1. The second is the minimum trigram similarity, which reduces random candidates. The third is candidate ordering by distance, similarity and corpus frequency. The fourth allows correcting even a term that exists in the vocabulary, but only when it is rare and a nearby candidate is much more frequent. That last gate handles a side effect common in technical blogs: the article about search itself mentions wrong examples, such as hexagonil and CAAP, and those examples enter the vocabulary. Because they are rare there, CAAP still falls back to CAP and hexagonil still falls back to hexagonal, without opening the door to overly aggressive corrections.

Returning the correction map in the JSON keeps the process transparent, which helps debugging. When a search looks strange, it is possible to check whether the issue is in correction, ranking or indexed content.

The enriched lexical vector

The lexical vector implemented here does not use a neural network. It transforms textual characteristics into numbers to allow an approximate comparison between query and chunk.

The idea is to transform each chunk into a sparse vector with fixed dimensions. Sparse means most positions are empty. Instead of storing 384 numbers for each chunk, we store only the positions that received some weight.

Each token creates a feature. Bigrams also create features. A bigram combines two consecutive terms, like cap_theorem or hexagonal_architecture. This helps capture compound expressions.

Character trigrams also enter the vector, with lower weight. They help with typos and nearby variations, although alone they can introduce noise. That is why their weight is low.

There is also a small concept map. When the text contains terms like architecture, hexagonal, mvc, layers and system, the vector receives related features. When it contains cache, redis, ttl, latency, it also receives conceptual approximations. It is a small, explicit, manual map that is easy to review.

That map does not try to represent the whole language. It records relationships useful to the blog's domain.

The calculation uses a technique called feature hashing. Each textual characteristic passes through a hash function and lands in a vector position.

"tok:hexagonal"             -> position 91
"bi:hexagonal_architecture" -> position 214
"tri:hex"                  -> position 37
"concept:architecture"     -> position 188

After that, the vector is normalized. Normalizing means adjusting weights so the vector length becomes 1. This allows vectors to be compared by cosine.
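
A minimal sketch of feature hashing followed by normalization. The dimension count, the bigram weight and the choice of SHA-256 are assumptions made for the example, and the concept map is left out for brevity. A stable hash matters here because Python's built-in hash() for strings changes between processes, while the build and the server must agree on positions.

```python
import hashlib
import math

DIMENSIONS = 384  # assumed vector size, chosen only for illustration


def feature_position(feature: str) -> int:
    """Map a textual feature to a stable position in the vector."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % DIMENSIONS


def lexical_vector(tokens: list[str]) -> dict[int, float]:
    """Build a sparse, L2-normalized vector from tokens, bigrams and trigrams."""
    weights: dict[int, float] = {}

    def add(feature: str, weight: float) -> None:
        position = feature_position(feature)
        weights[position] = weights.get(position, 0.0) + weight

    for token in tokens:
        add(f"tok:{token}", 1.0)
        for i in range(len(token) - 2):
            add(f"tri:{token[i:i + 3]}", 0.16)  # low weight: noisy signal
    for first, second in zip(tokens, tokens[1:]):
        add(f"bi:{first}_{second}", 1.0)        # assumed bigram weight

    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: w / norm for p, w in weights.items()} if norm else {}
```

Only the positions that received weight are stored, which is what keeps the vector sparse. Two different features can collide in the same position, a known trade-off of feature hashing.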

Cosine similarity

Up to this point, the textual index was concerned with discovering where a word appears. The vector changes the question a little. Instead of asking only "does this chunk contain this term?", it allows asking "does this chunk have signals similar to the query?".

Transforming text into a vector means transforming words and textual characteristics into numbers. One vector position may represent a token, like architecture. Another may represent a bigram, like hexagonal_architecture. Another may represent a trigram, like hex. Another may represent a manual concept, like concept:architecture.

query: hexagonil architecture

tok:architecture     -> weight 1.0
tri:hex              -> weight 0.16
tri:exa              -> weight 0.16
concept:architecture -> weight 0.8

A chunk about hexagonal architecture receives similar signals, even if the word hexagonil is wrong. This does not mean the vector "understood" the query. It means the numerical signals of the query and the chunk became close enough to justify a comparison.

Cosine similarity measures that proximity by looking at the direction of vectors.

If two texts point in similar directions in vector space, the cosine is high. If they point in different directions, it is low.

This choice has an important consequence. Cosine compares direction, not raw size. A long chunk should not win only because it has more words and therefore more signals. After normalization, what matters is the proportion of shared signals.

The value ranges between 0 and 1 when all weights are non-negative. The closer to 1, the more similar the vectors are.

similarity = dot_product(query_vector, chunk_vector)

Because the vectors are already normalized, the dot product is enough.
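
For sparse vectors stored as dictionaries of position to weight, the dot product only needs to visit positions present in the smaller vector. The representation is an assumption made for this sketch, and the sample vectors are hand-made unit vectors.

```python
def dot_product(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {position: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b.get(position, 0.0) for position, weight in a.items())


# Hand-made unit vectors: 0.6**2 + 0.8**2 == 1.0
query_vector = {1: 0.6, 2: 0.8}
close_chunk = {2: 1.0}
unrelated_chunk = {5: 1.0}

close = dot_product(query_vector, close_chunk)
unrelated = dot_product(query_vector, unrelated_chunk)
```

The chunk that shares position 2 with the query scores 0.8, and the chunk with no shared position scores 0.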

This vector signal enters ranking as a bonus. If the lightweight vector receives too much weight, it can pull results based on superficial similarity. That happened in the first attempt. hexagonil brought results without a strong relationship to hexagonal architecture because trigram similarity found weak proximity in other texts.

The correction came from two changes. First, correct terms against the vocabulary before search. Second, require a minimum similarity for purely vector-based results to enter the candidate set.

That is why the lexical vector enters with a minimum similarity threshold and controlled weight. It expands candidates, but it does not decide alone.

How the final ranking is assembled

The final ranking combines signals.

FTS5 returns candidates ordered by BM25. The system also calculates vector similarity between the query and the chunks. Then it adds weights for title, tags, category, heading, exact phrase and term coverage.

Coverage means how many query terms appear in the chunk. If the query is "cache cap", a chunk containing both terms should receive more confidence than a chunk containing only "cache".

There is also deduplication by article and section. Without it, a long article can dominate the list with several very similar chunks. The purpose of search is to help the reader decide where to go, not to show ten variations of the same paragraph.

The final result is calibrated for the reading experience. Ranking needs to respect the corpus behavior and the user's expectation. If a formula favors technically close but navigationally weak results, it needs adjustment.

In pseudocode, the ranking looks close to this.

query = normalize(input)
query = correct_by_vocabulary(query)

textual_candidates = search_by_fts5(query)
vector_candidates = search_by_lexical_vector(query)
dense_candidates = search_by_embedding(query) if available

for each candidate
  score = 0
  score += textual_weight * bm25(candidate)
  score += lexical_weight * lexical_similarity(candidate)
  score += dense_weight * dense_similarity(candidate)
  score += title_bonus_if_terms_match(candidate)
  score += heading_bonus_if_terms_match(candidate)
  score += exact_phrase_bonus(candidate)
  score += term_coverage_bonus(candidate)

sort by score
remove excessive duplicates from the same article
return the best results

The pseudocode shows why hybrid search needs calibration. Each signal has a role. BM25 organizes shared vocabulary. Fuzzy correction fixes input. Vectors approximate. Editorial weights help decide whether an occurrence appears in an important part of the article.
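
The combination and deduplication steps could look like this in Python. The Candidate fields and every weight are hypothetical, and the title, heading and exact-phrase bonuses are omitted to keep the sketch short. The sign flip assumes the textual score comes straight from FTS5, where more negative means better.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    slug: str
    heading: str
    bm25: float            # raw FTS5 value: more negative is a better match
    lexical: float = 0.0   # cosine from the enriched lexical vector
    dense: float = 0.0     # cosine from the dense embedding, if available
    covered_terms: int = 0


# Hypothetical weights; real values come from calibration against the corpus.
WEIGHTS = {"textual": 1.0, "lexical": 0.5, "dense": 0.5, "coverage": 0.3}


def final_score(c: Candidate, query_terms: int) -> float:
    score = WEIGHTS["textual"] * -c.bm25  # flip the sign so higher is better
    score += WEIGHTS["lexical"] * c.lexical
    score += WEIGHTS["dense"] * c.dense
    score += WEIGHTS["coverage"] * (c.covered_terms / max(query_terms, 1))
    return score


def rank(candidates, query_terms, limit=10):
    """Sort by combined score and keep one result per article section."""
    ordered = sorted(
        candidates, key=lambda c: final_score(c, query_terms), reverse=True
    )
    seen, results = set(), []
    for candidate in ordered:
        key = (candidate.slug, candidate.heading)
        if key in seen:
            continue  # a better chunk from this section is already listed
        seen.add(key)
        results.append(candidate)
    return results[:limit]
```

Because the list is sorted before deduplication, the chunk that survives for each section is always its best-scoring one.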

The interface needs to explain the result

Search does not end in the JSON returned by the API.

The interface decides whether the result looks useful. A result with title, category, date, section and excerpt guides the reader. A result that dumps long paragraphs creates fatigue before the click.

That is why results were compacted. The title is highlighted. Category and date appear as metadata. The section appears below, in an accent color. The excerpt has a visual limit. The entire result is clickable.

The highlight helps the eye find why the result appeared. It needs to be safe. The code does not inject HTML coming from the API. It escapes the text and only marks query tokens inside already treated content.

flowchart TD
    JSON[API JSON] --> Escape[HTML escape]
    Escape --> Tokenize[Token split]
    Tokenize --> Mark[Apply mark to terms]
    Mark --> DOM[Safe rendering]

This detail belongs to the search topic because good ranking with poor presentation produces the same feeling of failure as bad ranking. If the reader does not understand why a result appeared, they distrust the mechanism.
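
The safe highlight idea can be sketched in Python, although the blog's interface code is not shown here and likely runs in the browser. This variant locates the tokens in the raw text and escapes every piece separately, which gives the same guarantee: nothing coming from the API reaches the page as markup.

```python
import html
import re


def highlight(text: str, tokens: list[str]) -> str:
    """Escape text for HTML and wrap query tokens in <mark> tags."""
    tokens = [token for token in tokens if token]
    if not tokens:
        return html.escape(text)

    pattern = re.compile("|".join(re.escape(t) for t in tokens), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f"<mark>{html.escape(match.group())}</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


marked = highlight("<b>CAP</b> theorem", ["cap"])
```

The only tags in the output are the `<mark>` elements added by the function. The `<b>` that arrived in the text is rendered as visible characters instead of being interpreted.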

Why SQLite before Postgres or a vector database

It would also be possible to use a remote relational database or a vector database. SQLite was chosen for simplicity.

The blog content is static. Articles change in the repository. The build is already the moment when the site is generated. In that scenario, a SQLite database produced by the pipeline fits very well.

Postgres would make sense if search needed to store events, query history, statistics, preferences, user feedback or content edited directly on the server. For querying an index generated from Markdown, it would add a loading and synchronization step that would not bring much benefit at this point.

A vector database would have its place if there were many thousands or millions of chunks, or if vector comparison time became a bottleneck. For a few hundred chunks, querying SQLite and calculating scores in the application is enough.

This choice does not block evolution. The index already stores enough fields to replace or complement the ranking mechanism later.

What the pipeline started to guarantee

A search system is reliable only when index generation is part of the build.

If someone changes an article and forgets to update the index, search gets stale. If the index depends on a manual execution, someone will eventually forget. That is why the build generates the SQLite database. The deploy sends the index together with the site.

The pipeline also validates specific queries.

It checks whether the SQLite database exists and runs queries that represent important cases.

hexagonil needs to find hexagonal.

CAAP needs to find CAP.

These tests are small and protect an important part of the behavior. They prevent a future change from silently removing fuzzy correction or breaking vocabulary generation.
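
A validation step along those lines might look like the sketch below. The `search` argument is a hypothetical callable standing in for the blog's real search function, and the way results are inspected is an assumption.

```python
import os
import sqlite3


def validate_index(db_path, search):
    """Fail the build when the index is missing or key queries regress.

    `search` is a hypothetical callable: search(conn, query) returns a list
    of result texts. It stands in for the real search function.
    """
    if not os.path.exists(db_path):
        raise SystemExit(f"search index not found: {db_path}")

    expectations = {"hexagonil": "hexagonal", "CAAP": "cap"}
    conn = sqlite3.connect(db_path)
    try:
        for query, expected in expectations.items():
            results = search(conn, query)
            if not any(expected in text.lower() for text in results):
                raise SystemExit(f"query {query!r} no longer finds {expected!r}")
    finally:
        conn.close()
```

Raising SystemExit gives the process a non-zero exit code, which is enough for most CI systems to stop the deploy.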

sequenceDiagram
    participant CI as Pipeline
    participant Build as Site build
    participant Index as SQLite database
    participant Test as Search validation
    participant Host as Server

    CI->>Build: generates HTML and index
    Build->>Index: creates chunks, FTS, vocabulary and vectors
    CI->>Test: queries hexagonil
    Test->>Index: expects result with hexagonal
    CI->>Test: queries CAAP
    Test->>Index: expects result with CAP
    CI->>Host: publishes site and index

When search is part of the main experience, it deserves tests like any other behavior.

Dense embeddings as a semantic layer

The next step was to test dense embeddings without an external API.

In this model, each chunk goes through an embedding model during the build. The query also becomes a vector at search time. After that, the system compares the query vector with the chunk vectors using cosine similarity.

The design keeps the expensive work outside the request. Markdown enters the build, chunks are extracted, the open source model generates a vector for each chunk, and those vectors are stored in SQLite. At search time, the server only needs to generate a small vector for the sentence typed by the reader.

For this test, I used an open source embeddings library with a small multilingual model. It is light enough to be considered in simple hosting, supports Portuguese and English, and generates vectors small enough to be stored in SQLite without creating new infrastructure.
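
Storing a dense vector in SQLite needs nothing beyond a BLOB column. The sketch below packs 32-bit floats with the standard library. The table name and column names are hypothetical, and the two numbers stand in for the output of the embedding model, which is not reproduced here.

```python
import sqlite3
from array import array


def pack_vector(values) -> bytes:
    """Serialize floats as 32-bit values for a BLOB column."""
    return array("f", values).tobytes()


def unpack_vector(blob: bytes) -> list[float]:
    vector = array("f")
    vector.frombytes(blob)
    return list(vector)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunk_embeddings (chunk_id INTEGER PRIMARY KEY, vector BLOB)"
)
# Stand-in values; in the build they would come from the embedding model.
conn.execute(
    "INSERT INTO chunk_embeddings VALUES (?, ?)", (1, pack_vector([0.6, 0.8]))
)

blob = conn.execute(
    "SELECT vector FROM chunk_embeddings WHERE chunk_id = 1"
).fetchone()[0]
restored = unpack_vector(blob)
```

Each value costs four bytes, so a few hundred chunks with a few hundred dimensions each add well under a megabyte to the index file.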

The test showed a useful separation between types of problems. The query "slow system" found the article about progressive saturation with help from dense embedding, because the phrase describes a situation close to the article content. hexagonil and CAAP remain better cases for fuzzy correction, because they are typo and acronym variation problems. A semantic model can approximate ideas, but it should not be used to solve everything.

That is why the search remained hybrid.

flowchart TD
    Query[Reader query] --> Normalize[Normalization and tokens]
    Normalize --> Fuzzy[Fuzzy correction by vocabulary]
    Fuzzy --> BM25[FTS5 and BM25]
    Fuzzy --> Sparse[Enriched lexical vector]
    Query --> Dense[Optional dense embedding]
    BM25 --> Rank[Score combination]
    Sparse --> Rank
    Dense --> Rank
    Rank --> Results[Results with excerpt, section and direct link]

BM25 continues handling matching words. Fuzzy correction continues handling hexagonil and CAAP. The enriched lexical vector remains cheap and predictable. Dense embedding enters as another signal, especially useful when the person describes an idea with different words.

This detail matters. Good semantic search rarely comes from replacing one algorithm with another. It improves when different signals are combined with clear weights. A result that matches a term in the title, appears in a suitable section, has a good BM25 score and is also close in vector space deserves to rise. A result that only got close through embedding, with no other editorial evidence, needs to enter carefully to avoid a confusing list.

In the current index, dense embeddings are optional. If the embedding library is not available, search continues working with the previous layers. If the build runs with embeddings enabled, SQLite stores the chunk vectors. If the query runs with dense search enabled, the application generates the query vector and mixes cosine similarity into the final score.

This decision preserves an important architectural quality. The site works without a paid API and without a vector database. For a few hundred or a few thousand chunks, comparing the query vector against every vector stored in SQLite is still acceptable. Once the volume grows beyond that, it would make sense to look at FAISS, hnswlib, sqlite-vec, pgvector or another dedicated vector index.
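
As a rough illustration of that brute-force comparison, the sketch below stores vectors as float32 BLOBs and scores every row with cosine similarity. The table name `chunk_vectors`, the BLOB layout, the chunk identifiers and the three-dimensional vectors are all assumptions made for the example; the article does not describe the real schema.

```python
import math
import sqlite3
from array import array


def to_blob(vector):
    """Pack a list of floats as float32 bytes for a BLOB column."""
    return array("f", vector).tobytes()


def from_blob(blob):
    """Unpack float32 bytes back into a sequence of floats."""
    values = array("f")
    values.frombytes(blob)
    return values


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_search(conn, query_vector, limit=5):
    """Score every stored chunk against the query; fine for a few thousand rows."""
    rows = conn.execute("SELECT chunk_id, vector FROM chunk_vectors")
    scored = [(cosine(query_vector, from_blob(blob)), chunk_id) for chunk_id, blob in rows]
    return sorted(scored, reverse=True)[:limit]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunk_vectors (chunk_id TEXT PRIMARY KEY, vector BLOB NOT NULL)")
conn.executemany(
    "INSERT INTO chunk_vectors VALUES (?, ?)",
    [
        ("saturation#symptoms", to_blob([0.9, 0.1, 0.0])),
        ("cap#consistency", to_blob([0.0, 0.2, 0.9])),
    ],
)
results = dense_search(conn, [1.0, 0.0, 0.0])
```

The cost grows linearly with the number of chunks, which is why this approach only makes sense while the index stays small.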

The search is ready to grow without anticipating infrastructure the blog does not yet need.

Keeping the semantic layer cheap

flowchart TD
    Browser[Reader] --> Web[Web layer]
    Web --> Service[Local search service]
    Service --> Model[Model loaded once]
    Service --> SQLite[SQLite database]
    Service --> Response[Results]
    Web --> Fallback[Lexical fallback]

Generating embeddings in the build solves half of the cost. The other half is avoiding loading the model on every keystroke. That is why the semantic layer runs as a persistent local service. The model loads once, queries reuse that process, and the web layer keeps a lexical fallback in case the dense part is unavailable.
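
One way to picture this arrangement is the sketch below. It assumes a hypothetical `load_model` callable standing in for the embeddings library, and hypothetical `lexical_search` and `dense_search` functions; none of these names come from the real implementation. The model is loaded on first use and reused afterwards, and any failure in the dense path falls back to the lexical path.

```python
from typing import Callable, Dict, List, Optional, Sequence

Encoder = Callable[[str], Sequence[float]]


class DenseEncoder:
    """Holds a single model instance for the lifetime of the process."""

    def __init__(self, load_model: Callable[[], Encoder]) -> None:
        self._load_model = load_model
        self._model: Optional[Encoder] = None
        self.load_count = 0  # exposed only to make the reuse visible

    def encode(self, text: str) -> Sequence[float]:
        if self._model is None:
            # The expensive load happens once; later queries reuse the model.
            self._model = self._load_model()
            self.load_count += 1
        return self._model(text)


def search(
    query: str,
    encoder: DenseEncoder,
    lexical_search: Callable[[str], List[str]],
    dense_search: Callable[[str, Sequence[float]], List[str]],
) -> Dict[str, object]:
    """Use the dense path when it works, otherwise fall back to lexical search."""
    try:
        vector = encoder.encode(query)
    except Exception:
        return {"state": "lexical fallback", "results": lexical_search(query)}
    return {"state": "dense search active", "results": dense_search(query, vector)}
```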

The most direct proof comes from the search layer response. A query for "slow system" returned the article about progressive saturation with both textual and dense vector signals.

query: slow system
result: Progressive saturation
signals: textual score + dense vector score
state: dense search active

That response shows that the dense layer is participating in ranking. It also keeps expectations in the right place. Dense embedding improves intention-based queries, like "slow system". Fuzzy correction remains better for typos, like "hexagonil". Edit distance remains better for acronyms typed with one extra letter, like "CAAP".
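
To make the typo and acronym cases concrete, here is a small sketch of vocabulary-based correction using Levenshtein edit distance. The vocabulary contents, the function names and the distance threshold of 1 are assumptions for illustration; the actual correction logic in the index may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def correct(token: str, vocabulary: set, max_distance: int = 1) -> str:
    """Return the closest vocabulary term, or the token itself if none is close."""
    token = token.lower()
    if not vocabulary or token in vocabulary:
        return token
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda term: (edit_distance(token, term), term))
    return best if edit_distance(token, best) <= max_distance else token


# Hypothetical vocabulary extracted from the articles during the build.
VOCABULARY = {"cap", "hexagonal", "saturation", "consistency", "availability"}
```

Under these assumptions, `correct("hexagonil", VOCABULARY)` resolves to `hexagonal` through one substitution and `correct("CAAP", VOCABULARY)` resolves to `cap` through one deletion, with no model involved.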

When a new article is added, the build generates a new SQLite database and the deploy publishes it with the site. The service watches the database's modification time. When it sees that the index changed, it clears the cache and opens the updated version on the next query.

The resulting flow looks like this.

sequenceDiagram
    participant Build as Pipeline
    participant Site as Published files
    participant Service as Search service
    Build->>Build: generates HTML and SQLite database
    Build->>Site: publishes new package
    Service->>Site: detects database change
    Service->>Service: clears query cache
    Service->>Site: uses updated index

This decision keeps dense embeddings on the server, but avoids multiplying processes on every typed character. It respects the infrastructure without giving up the original goal.
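
The reload step, where the service notices a new database through its modification time, could be sketched roughly as follows. The class name `IndexWatcher` and its attributes are hypothetical, and the check happens lazily on each request rather than in a background thread, which is only one possible design.

```python
import os
import sqlite3


class IndexWatcher:
    """Reopens the SQLite index and clears cached results when the file changes."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path
        self.cache = {}      # query -> cached results
        self._mtime = None   # modification time of the file currently open
        self._conn = None

    def connection(self) -> sqlite3.Connection:
        mtime = os.stat(self.db_path).st_mtime_ns
        if mtime != self._mtime:
            # A new build was published: drop stale state and reopen.
            if self._conn is not None:
                self._conn.close()
            self._conn = sqlite3.connect(self.db_path)
            self.cache.clear()
            self._mtime = mtime
        return self._conn
```

Each query calls `connection()` first, so the cost of detecting a new index is one `stat` call per request.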

Search after publication

Search does not end when the endpoint responds.

After the mechanism enters the site, questions appear that code alone cannot answer. What terms do people type? Which queries return nothing? Which results appear at the top but do not receive clicks? Which words do readers use that the author almost never uses?

If this blog starts collecting search analytics, Postgres comes back into the discussion. It could store the searched term, the language, the number of results, the first result, and whether a click followed. With that, it would be possible to adjust vocabulary, create aliases and write articles where demand exists.
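
A possible shape for that events table is sketched below. To keep the example self-contained it uses SQLite from the standard library instead of Postgres, and the table name, column names and helper functions are all assumptions; nothing like this exists in the blog yet.

```python
import sqlite3

# Hypothetical schema covering the fields listed above.
SCHEMA = """
CREATE TABLE search_events (
    id INTEGER PRIMARY KEY,
    term TEXT NOT NULL,
    language TEXT NOT NULL,
    result_count INTEGER NOT NULL,
    first_result TEXT,
    clicked INTEGER NOT NULL DEFAULT 0
)
"""


def record_event(conn, term, language, result_count, first_result=None, clicked=False):
    """Store one search event."""
    conn.execute(
        "INSERT INTO search_events (term, language, result_count, first_result, clicked) "
        "VALUES (?, ?, ?, ?, ?)",
        (term, language, result_count, first_result, int(clicked)),
    )


def queries_without_results(conn):
    """List terms that returned nothing, most frequent first."""
    rows = conn.execute(
        "SELECT term, COUNT(*) AS searches FROM search_events "
        "WHERE result_count = 0 GROUP BY term ORDER BY searches DESC, term"
    )
    return rows.fetchall()
```

A report like `queries_without_results` is the kind of signal that would point to missing aliases or missing articles.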

That layer does not need to exist at first. To begin, local logs, tests for known queries and manual result review already provide enough signals. When volume justifies it, the same design can accept an events table without replacing the whole index.
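
Tests for known queries can be very small. The sketch below assumes a hypothetical `search` callable that returns article titles in ranked order. Only the "slow system" expectation comes from the test described earlier; other pairs would be added as they are verified.

```python
from typing import Callable, Dict, List

# Query -> title expected near the top of the results.
KNOWN_QUERIES: Dict[str, str] = {
    "slow system": "Progressive saturation",
}


def failing_queries(
    search: Callable[[str], List[str]],
    expectations: Dict[str, str],
    top_n: int = 3,
) -> List[str]:
    """Return the queries whose expected title is missing from the top results."""
    return [
        query
        for query, expected in expectations.items()
        if expected not in search(query)[:top_n]
    ]
```

Running this after each build gives an early warning when a change in weights or vocabulary pushes a known good result out of the first positions.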

Search reveals the reader's vocabulary. Sometimes the reader searches for a word the author never uses. That difference shows a distance between how the content was written and how it is searched.

Closing

The mechanism implemented in this blog started from a simple constraint. I wanted intelligent search without turning the hosting environment into a complex platform.

The solution ended up layered.

Markdown becomes HTML and also becomes an index. Long articles become chunks. Chunks go into SQLite. FTS5 offers fast textual search. BM25 ranks by relevance. A corpus vocabulary corrects typos. An enriched lexical vector approximates some concepts inside the domain. The interface highlights terms and points directly to the right section.

The result can still evolve. Analytics can come later. A vector database can come later. A larger model can come later. But search has already stopped being a literal string comparison. It now combines what the reader typed, the vocabulary of the articles, textual ranking, error correction and vector proximity.

This is the most important point of the implementation. Simple situations were handled with simple and cheap solutions. "hexagonil" and "CAAP" needed vocabulary and fuzzy correction, not a larger model. "slow system" benefited from dense embedding because it describes a situation, not just a word. The architecture stayed layered so that each problem pays only the cost it needs to pay.
