How to Build a Two-Stage Semantic Search Pipeline on Your Data, With Zero Code

Deven Navani

October 16, 2023

Building state-of-the-art, production-grade semantic search on top of one’s data is a significant engineering lift. 

What is semantic search?
Semantic search analyzes the intent and context behind search queries, rather than just the keywords.

What does two-stage refer to?
Stage 1: Retrieving a large set of candidates via vector search
Stage 2: Re-ranking these candidates by query relevance using a re-ranking model and returning the top candidates to the user. Re-ranking models are typically based on cross-encoders and work by taking in a query and a candidate document and outputting a relevance score between 0 and 1.
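
To make the two stages concrete, here is a minimal sketch in plain Python. It rests on loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and `rerank_score` is a toy term-overlap stand-in for a cross-encoder. Both names, the sample documents, and the `n_candidates` / `top_x` parameters are illustrative, not part of any real library.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query, doc):
    """Toy stand-in for a cross-encoder: share of query terms found in the doc (0 to 1)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def two_stage_search(query, docs, n_candidates=3, top_x=2):
    # Stage 1: retrieve a larger candidate set by vector similarity.
    q_vec = embed(query)
    candidates = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)[:n_candidates]
    # Stage 2: score each (query, candidate) pair jointly and keep the best few.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_x]


docs = [
    "how to reset a forgotten password",
    "password password password policy",
    "reset your router to factory settings",
    "quarterly sales report",
]
results = two_stage_search("reset password", docs)
```

In this toy data, the keyword-heavy "password policy" document ranks first after Stage 1, and the re-ranking step moves the document that covers both query terms to the top.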

In this blog post, I walk through what building out semantic search involves. That said, it is not worth your time to build out this generic infrastructure. 

Whatever your role is at your company (developer, ML engineer, data scientist, etc.), your time is better spent using a platform that not only provides this infrastructure out of the box, but allows for rapid experimentation with the various parameters of a semantic search system we’ll describe below.

Don’t spend your time building semantic search – spend your time making sure semantic search performs.

Preparing your Data

The first step in building out semantic search is preparing your data to be searchable. 

This involves:

1. Quick, scalable embedding generation

How quickly can you compute embeddings for a large text dataset? This typically involves deploying a distributed, autoscaling compute cluster (options include Dask, Ray, Spark, etc.).
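
As a rough single-machine illustration of the same idea, the sketch below splits a corpus into batches and embeds them in parallel with a thread pool from the standard library. It is not a substitute for a distributed cluster. `embed_batch` is a hypothetical placeholder for a real model call, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_batch(texts):
    """Hypothetical placeholder for a real embedding model call: one toy vector per text."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(texts, batch_size=2, max_workers=4):
    """Embed batches in parallel and return vectors in the original document order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with texts.
        per_batch = pool.map(embed_batch, batched(texts, batch_size))
        return [vec for batch in per_batch for vec in batch]


vectors = embed_corpus(["a b c", "hello world", "semantic search", "x"])
```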

How easily can you experiment with different parameters for computing embeddings?

These parameters include, but aren’t limited to:

  1. Text chunking strategies: Perhaps you just want 1 embedding per document in your dataset. Or maybe you want to chunk each document and compute 1 embedding per chunk. There are different ways to chunk a document, whether it’s by sentence/paragraph or some fixed number of tokens for each chunk.
  2. Embedding models: Embedding models, whether you self-host an open-source one or use a commercial model behind an API, vary in quality, speed, and cost. You’ll want to experiment with different models to determine which one is best for your use case.
  3. Pooling strategies: Pooling refers to taking a sequence of token-level embeddings and compressing them into a single embedding that represents a multi-token sequence of text. There are different pooling strategies (CLS pooling, mean pooling, etc.), and the choice can affect the quality of your semantic search.
Figure: Chunk Average and Special Token Average
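
Two of the parameters above, chunking and pooling, can be sketched in a few lines of plain Python. This is an illustration under assumptions: tokens are approximated by whitespace splitting rather than a real tokenizer, the token embeddings are made-up numbers, and CLS pooling assumes the special token sits in the first position, as it does in BERT-style models.

```python
def chunk_by_tokens(text, chunk_size, overlap=0):
    """Split whitespace tokens into fixed-size chunks, optionally overlapping."""
    tokens = text.split()
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]


def mean_pool(token_embeddings):
    """Average the token-level vectors dimension by dimension."""
    n = len(token_embeddings)
    return [sum(dim) / n for dim in zip(*token_embeddings)]


def cls_pool(token_embeddings):
    """Use the first (special-token) vector as the sequence embedding."""
    return list(token_embeddings[0])


chunks = chunk_by_tokens("one two three four five", chunk_size=2)
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]  # made-up values
mean_vector = mean_pool(token_embeddings)
cls_vector = cls_pool(token_embeddings)
```

The two pooling functions return different vectors for the same input, which is why the pooling strategy is worth treating as a parameter to experiment with.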

As new data is made available, can you continuously chunk and embed it?

2. Scalable and dynamic indexing of embeddings

What’s the largest number of embeddings you can feasibly insert into your vector similarity search index (e.g. HNSW)? 

Figure: Semantic search with text embeddings

More embeddings means a larger search index, and you’ll need to load this index into memory for fast vector search, while ideally sharding the index across multiple servers.

As new embeddings are made available, can you dynamically insert them into your index?
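
The sketch below shows what dynamic insertion means at the interface level. `FlatIndex` is a hypothetical class that performs exact brute-force search; it stands in for an approximate index such as HNSW and is only practical for small collections, but the add-then-search contract is the same.

```python
import math


class FlatIndex:
    """Exact nearest-neighbor index; a simple stand-in for an ANN index such as HNSW."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, vector):
        """Insert a new embedding without rebuilding the index."""
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        self.ids.append(item_id)
        self.vectors.append([x / norm for x in vector])  # store unit vectors

    def search(self, query, k):
        """Return the k (id, cosine similarity) pairs most similar to the query."""
        norm = math.sqrt(sum(x * x for x in query)) or 1.0
        q = [x / norm for x in query]
        scored = [(i, sum(a * b for a, b in zip(q, v))) for i, v in zip(self.ids, self.vectors)]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


index = FlatIndex()
index.add("doc-a", [1.0, 0.0])
index.add("doc-b", [0.0, 1.0])
before = index.search([1.0, 1.0], k=1)

index.add("doc-c", [1.0, 1.0])  # a new embedding arrives later
after = index.search([1.0, 1.0], k=1)
```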

3. Clean version management of search indexes

Each permutation of parameters for computing embeddings you try will result in a different set of embeddings for your data, and hence a different search index. How can you manage these different index versions, and swap between them if necessary?

In addition to this macro versioning of indexes derived from permutations of embedding techniques, each new embedding batch update to a search index represents a new micro version of that index. A new micro version may make results worse for queries that used to perform well — perhaps un-cleaned or incorrectly embedded data was inserted into your index. If so, how easily can you roll back to an earlier micro version?
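
One way to picture macro and micro versions is a small registry that keeps a history of snapshots per embedding configuration. The sketch below is hypothetical throughout: `IndexRegistry`, its methods, and the configuration names are invented for illustration, and each snapshot stores only document IDs where a real system would reference index files.

```python
class IndexRegistry:
    """Tracks macro versions (one per embedding config) and micro versions (one per batch)."""

    def __init__(self):
        self._history = {}  # macro version name -> list of micro-version snapshots

    def create(self, macro, config):
        """Register a new macro version for one embedding configuration."""
        self._history[macro] = [{"config": dict(config), "doc_ids": []}]

    def apply_batch(self, macro, doc_ids):
        """Record a batch insert as a new micro version and return its number."""
        latest = self._history[macro][-1]
        snapshot = {"config": latest["config"], "doc_ids": latest["doc_ids"] + list(doc_ids)}
        self._history[macro].append(snapshot)
        return len(self._history[macro]) - 1

    def rollback(self, macro, micro):
        """Discard every micro version newer than the given one."""
        if not 0 <= micro < len(self._history[macro]):
            raise ValueError("unknown micro version")
        del self._history[macro][micro + 1:]

    def current(self, macro):
        """Return the latest snapshot for a macro version."""
        return self._history[macro][-1]


registry = IndexRegistry()
registry.create("config-a", {"chunking": "sentence", "pooling": "mean"})
registry.create("config-b", {"chunking": "256-tokens", "pooling": "cls"})

good = registry.apply_batch("config-a", ["doc-1", "doc-2"])
bad = registry.apply_batch("config-a", ["doc-3-uncleaned"])
registry.rollback("config-a", good)  # undo the bad batch
```

Swapping between macro versions then amounts to pointing the search endpoint at a different key, while a rollback only touches the history of one configuration.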

4. Fine-tuning your embedding model

What if you want to further pre-train an embedding model on your dataset before generating embeddings? Do you have the requisite training infrastructure and compute (GPUs) to do so?

Say you perform additional pre-training of your embedding model every X months on new data. When you’re ready to swap in a newly trained model, how easily can you re-compute and re-index embeddings for the existing data you’ve already processed?

Building the Semantic Search Pipeline

Once your data has been chunked, embedded, and indexed, you need to build out the actual search pipeline and deploy it behind an API endpoint. 

Here’s what such a pipeline may look like:

Let’s walk through each execution step of the pictured pipeline and the important considerations you’ll need to keep in mind:

1. Fast embedding inference

How quickly can you compute an embedding for the user’s search query? You’ll need an inference server with the embedding model loaded into memory.

2. Fast, quality vector search

How quickly can you retrieve the approximate nearest neighbors to the query embedding, without sacrificing retrieval quality?

The index will need to be loaded into memory. If your index is sharded (which it should be for large datasets), you’ll need to query each shard and reduce the results.
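
The reduce step can be sketched as a merge of per-shard result lists. The example assumes each shard has already returned its own hits as `(score, doc_id)` pairs sorted by descending score; the shard contents and the `merge_shard_results` name are made up for illustration.

```python
import heapq
from itertools import islice


def merge_shard_results(per_shard_results, k):
    """Reduce per-shard hit lists (each sorted by descending score) into a global top-k."""
    merged = heapq.merge(*per_shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


# Hypothetical hits returned by three shards for the same query embedding.
shard_a = [(0.92, "a1"), (0.40, "a2")]
shard_b = [(0.88, "b1"), (0.75, "b2")]
shard_c = [(0.95, "c1")]

top_hits = merge_shard_results([shard_a, shard_b, shard_c], k=3)
```

For the global top-k to be correct, each shard needs to return its own top k hits (or everything it has, if fewer), since any one shard could hold all of the best matches.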

3. Fast, customizable re-ranking

How quickly can you re-rank the N candidates returned by vector search and return the top X results, where X < N? Similar to the embedding model, the re-ranking model will need to be loaded into memory.

If you have the data to do so, can you easily fine-tune your re-ranking model of choice?

Ideally, the inference time for this entire pipeline should be a couple hundred milliseconds. The pipeline should be deployed behind a performant API endpoint with load-balancing.
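
To check a latency budget like this, it helps to time each stage separately. The sketch below is one simple way to do that, assuming each stage is a function that takes the previous stage's output; the three lambda stages are trivial placeholders, not real models.

```python
import time


def timed_pipeline(query, stages):
    """Run stages in order, feeding each output to the next, and record latency in ms."""
    timings = {}
    value = query
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        timings[name] = (time.perf_counter() - start) * 1000.0
    timings["total"] = sum(timings.values())
    return value, timings


# Placeholder stages standing in for embedding inference, vector search, and re-ranking.
stages = [
    ("embed", lambda query: [float(len(query))]),
    ("retrieve", lambda vector: ["d1", "d2", "d3"]),
    ("rerank", lambda candidates: candidates[:2]),
]
result, timings = timed_pipeline("reset password", stages)
```

A per-stage breakdown makes it easier to see whether embedding inference, vector search, or re-ranking is consuming most of the budget.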

Semantic Search with Graft

Graft is a Modern AI platform, and one of its core offerings is production-grade semantic search on your data, without a single line of code.

Here is a video demo of building semantic search in Graft:

So, why should you consider Graft over rolling out your own infrastructure?

  1. Simplicity: With Graft, the complexity we described above of creating a semantic search pipeline is abstracted away. There's no need to worry about embedding generation, model hosting, index management, or deployment intricacies. You get a robust, ready-to-use solution right out-of-the-box.
  2. Unified Platform: Why bother trying to stitch together Hugging Face/OpenAI embedding models, LangChain, Pinecone, etc. when you have a production-ready solution like Graft that offers everything these point solutions offer and more?
  3. Scalability: As your dataset grows, Graft dynamically scales with your needs, ensuring that you don't have to manage infrastructure or worry about performance bottlenecks.
  4. Rapid Experimentation: Graft allows for rapid testing of different embedding and re-ranking models, chunking strategies, pooling approaches, and other parameters. You can quickly iterate and determine the best parameters without having to code each variant.
  5. Clean Index Management: Graft’s platform automatically manages versions of your search indexes. If you want to switch between embedding models, or an update doesn’t provide the desired results, reverting to a different index version is straightforward.
  6. Continuous Learning: As new data becomes available or as embedding models get updated, Graft seamlessly integrates these changes, ensuring your search remains top-notch and up-to-date.
  7. Zero Maintenance: With Graft, you free yourself from the ongoing maintenance overhead. Whether it's software updates, security patches, or ensuring high availability, Graft has you covered.
Graft's Modern AI Platform

As data volumes grow and user expectations rise, having a robust, fast, and accurate search becomes imperative.

While building and maintaining such a system in-house is possible, it requires significant resources, expertise, and ongoing commitment. With Graft, you get a state-of-the-art semantic search solution without the headaches of building and maintaining it yourself.

Channel your time, attention, and resources away from deployment and ML challenges that are orthogonal to your company’s business needs and towards delivering exceptional value to your users.

If semantic search with Graft sounds interesting, feel free to reach out! I’d love to show you more of Graft and chat about your individual use cases.

Check out our Deep Dive into Semantic Search with Modern AI.

Last Updated

November 6, 2023

Deven Navani

Senior Software Engineer

Deven is a full-stack software engineer with previous internships at Observe, Meta, Microsoft, and Splunk. He earned degrees in EECS and Business Administration from UC Berkeley's M.E.T. program.
