Introduction to RAG

Retrieval-Augmented Generation (RAG) systems have become the backbone of modern AI applications that need to combine the power of large language models (LLMs) with up-to-date, reliable, and custom information.

But why is RAG such a big deal? And what’s actually going on under the hood? Let’s take a walk through the world of RAG, vector databases, and some of the more advanced techniques that are shaping the next generation of AI systems.


# Why is RAG good (or even necessary)?

If you’ve played around with LLMs, you’ll know they’re prone to hallucination: confidently making things up, sometimes with hilarious, sometimes with disastrous results. RAG is the antidote: it lets you inject real, up-to-date, and custom data into the LLM’s workflow. Here’s why that matters:

  • Minimizing hallucination: By grounding answers in retrieved documents, RAG systems can provide more accurate, less speculative responses.
  • Providing sources: You get citations, so you can check where the answer came from (and call out the model if it’s bluffing).
  • Custom/private data: LLMs don’t know about your company’s internal wiki, your latest research, or your private notes. RAG lets you bring that data into the conversation.
  • Custom knowledge bases: Build specialized assistants for legal, medical, technical, or any other domain.
  • Compliance and traceability: For regulated industries, being able to show your work (and your sources) is a must.
  • Exploring deep relationships: RAG isn’t just about surface-level search; it can help you uncover connections between entities, concepts, and events.
  • Up-to-date information: LLMs are frozen in time at their last training cut-off. RAG lets you plug in the latest news, docs, or research.

# Types of RAG

Not all RAG systems are created equal. Here are a few flavors you might encounter:

  • Basic RAG: The classic setup—retrieve relevant chunks from a single vector database, stuff them into the LLM prompt, and let the model generate an answer.
  • Multi-hop RAG: Sometimes, answering a question requires chaining together multiple retrievals (e.g., find a person’s employer, then retrieve info about that company).
  • Hybrid RAG: Combine dense (vector) and sparse (keyword/BM25) retrieval for better coverage; a small fusion sketch follows this list.
  • Graph-based RAG: Use knowledge graphs to navigate relationships and context, not just raw text.
  • Domain-specific RAG: Tailor retrieval and generation to a specific field (legal, medical, etc.), often with custom chunking, metadata, ranking, and even specialized embedding models. For example, Voyage offers models like voyage-law-2 for legal retrieval, voyage-finance-2 for finance, and others optimized for code or general-purpose use, enabling much better results in those domains.
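
The hybrid flavor is simple to sketch. Below is a minimal, hypothetical example that merges a dense (vector) ranking and a sparse (BM25/keyword) ranking with reciprocal rank fusion (RRF); the document IDs and the two input rankings are made up, standing in for whatever your embedding search and keyword index would return.

```python
# Minimal sketch of hybrid retrieval: merge a dense (vector) ranking and a
# sparse (keyword/BM25) ranking with reciprocal rank fusion (RRF).
# The doc IDs below are made up for illustration.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Combine several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); k dampens the very top of each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_42", "doc_7", "doc_3"]    # from vector similarity search
sparse_hits = ["doc_7", "doc_99", "doc_42"]  # from BM25 / keyword search

print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['doc_7', 'doc_42', 'doc_99', 'doc_3']: the docs both retrievers agree on rise to the top
```

Documents that both retrievers agree on bubble up, which is exactly the coverage benefit hybrid RAG is after.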

# Vector Databases: The Engine Room

At the heart of most RAG systems is a vector database. But what’s actually going on here?

  • Embeddings: Every chunk of text (or image, or code) is turned into a high-dimensional vector, which is a long list of numbers that captures its meaning. This is done using an embedding model (OpenAI, Cohere, BAAI, etc.).
  • High-dimensional vectors: These aren’t your average 2D or 3D points. We’re talking 384, 768, or even 1536 dimensions. Individual dimensions rarely map to clean human concepts, but you can loosely picture one tracking “formality,” another “topic,” another “sentiment,” or some more abstract feature. For example, two texts about sports might sit close along a “topic”-like direction but differ along a “sentiment” or “formality” one. The idea is that similar meanings end up close together in this hyperspace, because their values across many dimensions align.
  • Dense vs sparse vectors: Dense vectors are what you get from neural embeddings; sparse vectors are more like classic keyword counts. Hybrid systems use both.
  • Similarity metrics: To find the most relevant chunks, the database uses metrics like cosine similarity, inner product, or Euclidean distance (see the sketch after this list).
  • Why is it fast? Vector DBs use clever indexing (such as Approximate Nearest Neighbor, or ANN, indexes) to quickly find the closest vectors, even in huge datasets. This is much faster than running an LLM over all your data, which the limited context window would rule out anyway.
  • Limitations: Vector search isn’t perfect: semantic drift, context window limits, and embedding quality all matter.
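
To make the similarity-metric bullet concrete, here is a tiny, hypothetical sketch of cosine similarity over toy 4-dimensional vectors. Real embeddings come from an embedding model and have hundreds of dimensions, and a vector DB replaces the brute-force loop below with an ANN index.

```python
import numpy as np

# Toy 4-dimensional "embeddings"; real ones have hundreds of dimensions and
# come from an embedding model (OpenAI, Cohere, BAAI, etc.).
query = np.array([0.9, 0.1, 0.3, 0.0])
chunks = {
    "chunk_a": np.array([0.8, 0.2, 0.4, 0.1]),  # points in a similar direction to the query
    "chunk_b": np.array([0.0, 0.9, 0.1, 0.7]),  # points somewhere else entirely
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Close to 1.0 means "same direction" (very similar), close to 0.0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Brute-force nearest neighbour; a vector DB replaces this loop with an ANN index.
ranked = sorted(chunks, key=lambda c: cosine_similarity(query, chunks[c]), reverse=True)
print(ranked)  # ['chunk_a', 'chunk_b']
```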

Popular vector DBs include Milvus, Pinecone, Chroma, Weaviate, and Qdrant. Each has its own strengths, tradeoffs, and quirks—open source vs managed, scalability, integrations, etc.


# Superlinked: The Vector Compute Layer

Most people have heard of Pinecone or Milvus, but fewer know about Superlinked. They provide a vector compute layer that sits between your data and your vector database. This lets you combine text, images, and structured metadata into multi-modal vectors, and run complex, multi-objective queries (e.g., relevance + freshness + popularity). Superlinked aims to make vector search smarter and more flexible, especially for use cases like recommendations, analytics, and advanced RAG.
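
To give a feel for what "relevance + freshness + popularity" might look like in practice, here is a generic, hypothetical scoring function. This is not Superlinked's API; the weights, the decay constant, and the click field are invented purely for illustration.

```python
import math

# Hypothetical multi-objective score mixing relevance, freshness, and popularity.
# Not Superlinked's API: the weights and fields are made-up assumptions.

def combined_score(similarity: float, age_days: float, clicks: int,
                   w_rel: float = 0.6, w_fresh: float = 0.3, w_pop: float = 0.1) -> float:
    freshness = math.exp(-age_days / 30.0)   # newer documents score higher
    popularity = math.log1p(clicks) / 10.0   # squash raw click counts
    return w_rel * similarity + w_fresh * freshness + w_pop * popularity

# A slightly less similar but fresh, popular document can outrank a stale one.
print(combined_score(similarity=0.82, age_days=2, clicks=150))   # ~0.82
print(combined_score(similarity=0.90, age_days=400, clicks=3))   # ~0.55
```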


# The Two Main Phases: Indexing and Querying

RAG systems have two main jobs (a minimal end-to-end sketch follows the list):

  1. Indexing:
    • Chunk your documents (how you split them matters a lot)
    • Embed each chunk (turn it into a vector)
    • Store the vectors (and metadata) in your vector DB
    • Optionally, add extra metadata for filtering, permissions, etc.
  2. Querying:
    • Take the user’s question, embed it
    • Retrieve the most relevant chunks (using similarity search, filters, etc.)
    • (Optionally) rerank or filter the results
    • Construct the context window for the LLM
    • Generate the answer
    • (Optionally) collect feedback to improve future retrievals
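
Here is a deliberately tiny, self-contained sketch of both phases. The bag-of-words embed() function and the in-memory list stand in for a real embedding model and a real vector database, and the final prompt is where the LLM call would go; none of the names refer to any specific library.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words vector. A real system calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# --- Phase 1: indexing ---------------------------------------------------
documents = [
    "RAG grounds LLM answers in retrieved documents.",
    "Vector databases index embeddings for fast similarity search.",
]
index = [{"text": chunk, "vector": embed(chunk)} for chunk in documents]

# --- Phase 2: querying ---------------------------------------------------
question = "How do vector databases help?"
q_vec = embed(question)
top = sorted(index, key=lambda item: cosine(q_vec, item["vector"]), reverse=True)[:1]

context = "\n".join(item["text"] for item in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # in a real system, this prompt goes to the LLM
```

In practice, chunking strategy, metadata filters, and reranking all slot into this same skeleton.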

# Advanced RAG Techniques

There’s a lot more you can do:

  • Query enhancement/expansion: Reformulate the user’s query, run multiple queries, or use LLMs to generate better search terms, so you can surface chunks that the user’s initial query would miss but that are semantically relevant to a correct, truthful answer.
  • Reranking: Use cross-encoders or LLMs to rerank retrieved chunks for better relevance (a cross-encoder sketch follows this list).
  • Retrieval fusion: Combine results from multiple retrieval strategies (dense, sparse, graph, etc).
  • Context compression: Summarize or filter retrieved chunks to fit more information into the LLM’s context window.
  • Memory-augmented RAG: Store and retrieve episodic or long-term memory for more personalized or persistent conversations.
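
As an example of the reranking step, here is a short sketch using a cross-encoder from the sentence-transformers library (assuming it is installed). The query and candidate chunks are invented, and in practice the candidates would come from your first-stage vector or hybrid retrieval.

```python
from sentence_transformers import CrossEncoder

query = "What were the side effects of the new approval policy?"
candidates = [
    "The policy introduced stricter reporting requirements for vendors.",
    "Our cafeteria menu changes every Tuesday.",
    "Several teams reported longer approval times after the policy rollout.",
]

# A small cross-encoder scores each (query, chunk) pair jointly; slower than
# vector search, but usually more accurate for the final ordering.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([[query, chunk] for chunk in candidates])

reranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # likely the chunk about longer approval times
```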

# Graph RAG (and Microsoft’s take)

One of the most exciting directions is Graph RAG. Instead of just retrieving isolated chunks, you use a knowledge graph to navigate relationships—entities, events, connections. Microsoft’s Graph RAG is a great example: it links retrieved facts, lets you reason over them, and supports more complex, multi-hop questions. This is especially powerful for research, compliance, and any domain where relationships matter as much as facts.
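
As a toy illustration of why graphs help, here is a hypothetical two-hop lookup over a hand-built knowledge graph. A real Graph RAG system (Microsoft's included) extracts entities and relationships from your documents automatically rather than hard-coding them, but the traversal idea is the same.

```python
# A tiny hand-built knowledge graph: (entity, relation) -> entity.
# Purely illustrative; real systems extract these triples from documents.
knowledge_graph = {
    ("Ada Lovelace", "worked_with"): "Charles Babbage",
    ("Charles Babbage", "designed"): "Analytical Engine",
}

def hop(entity: str, relation: str) -> str | None:
    return knowledge_graph.get((entity, relation))

# Multi-hop question: "What did the person Ada Lovelace worked with design?"
collaborator = hop("Ada Lovelace", "worked_with")
designed = hop(collaborator, "designed") if collaborator else None
print(designed)  # Analytical Engine
```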


# Other Advanced RAG Techniques

  • Tool-augmented RAG: Combine retrieval with external tools (calculators, code execution, third-party APIs).
  • Multi-modal RAG: Retrieve and reason over text, images, code, and more.
  • Evaluation and benchmarking: Use metrics and datasets to measure retrieval and generation quality (a small recall@k sketch follows this list).
  • Privacy and security: Control what data is indexed, retrieved, and shown to the LLM; handle permissions and redaction.
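
For the evaluation bullet, here is a minimal sketch of one common retrieval metric, recall@k: out of the chunks a human marked relevant for a question, how many appear in the top-k results? The document IDs are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# One relevant doc ("d1") retrieved out of two ("d1", "d9") -> 0.5
print(recall_at_k(["d3", "d8", "d1"], relevant={"d1", "d9"}, k=3))
```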

# Wrapping up

RAG is a practical, evolving set of techniques that make LLMs more useful, reliable, and grounded in reality. Whether you’re building a chatbot, a search engine, or a research assistant, understanding the moving parts of RAG (and the ecosystem around it) is key to building something that actually works.

If you want to go deeper, check out the docs and blogs for the vector DBs mentioned above, and also the new players like Superlinked. The field is moving fast, and the best RAG systems of tomorrow will be the ones that combine smart retrieval, flexible compute, and a healthy dose of skepticism about what the LLM is telling you.

Written on July 15, 2025

If you notice anything wrong with this post (factual error, rude tone, bad grammar, typo, etc.), and you feel like giving feedback, please do so by contacting me at hello@samu.space. Thank you!