I wanted to build a basic RAG (retrieval-augmented generation) application on top of a locally deployed LLM served by Ollama, avoiding frameworks like LangChain as much as possible.

Prerequisites: the Qwen3-0.6B model running locally via Ollama.
I used uv to initialize the project:

```shell
uv init
```
I added the required libraries with:

```shell
uv add chromadb transformers pypdf numpy
```
This is the project structure I followed:

```
.
├── README.md
├── data
│   ├── 4dfb26a1-b0f6-403b-992f-08109a9cd0a6
│   │   ├── data_level0.bin
│   │   ├── header.bin
│   │   ├── length.bin
│   │   └── link_lists.bin
│   ├── Machine_Learning_System_Design.pdf
│   └── chroma.sqlite3
├── main.py
├── pyproject.toml
├── src
│   ├── db_test.py
│   ├── extract_pdf.py
│   ├── qwen_embedding.py
│   └── retrieve.py
└── uv.lock
```
- First, extract the text from the PDF file using the pypdf library.
```python
from pypdf import PdfReader

with open(pdf_file, 'rb') as f:
    content = PdfReader(stream=f)
    for page in content.pages:
        # Extract text
        text = page.extract_text()
```
- Normalize the text to clean it up: lower-case it, remove redundant information, strip extra newline characters, etc.
```python
import re

def normalize_text(text: str) -> str:
    # change to lower case
    text = text.lower()
    # re-join words hyphenated across line breaks
    text = text.replace("-\n", "")
    # remove URLs
    text = re.sub(r'(?:https?://|www\.)[^\s]+', '', text)
    # remove details within parentheses
    text = re.sub(r'\([^)]*\)', '', text)
    return text
```
- Chunk the text based on the paragraphs within each page. The page number and the chunk number within the page can be easily derived from the metadata.
```python
# Chunk text
chunks = [chunk.replace("\n", " ") for chunk in text.split("\n ")]
text_dict[page.page_number] = chunks
```
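Each chunk needs a unique id and metadata (page number, chunk number) before it goes into ChromaDB. A minimal sketch of that bookkeeping — the helper name `build_chunks` is mine, not from the original code:

```python
def build_chunks(text_dict: dict) -> tuple:
    """Flatten {page_number: [chunks]} into the parallel lists
    that ChromaDB's collection.add() expects."""
    ids, documents, metadatas = [], [], []
    for page_number, chunks in text_dict.items():
        for chunk_number, chunk in enumerate(chunks):
            if not chunk.strip():
                continue  # skip empty chunks
            ids.append(f"page{page_number}-chunk{chunk_number}")
            documents.append(chunk)
            metadatas.append({"page": page_number, "chunk": chunk_number})
    return ids, documents, metadatas

ids, documents, metadatas = build_chunks(
    {1: ["first paragraph", "second paragraph"], 2: ["another page"]}
)
print(ids)  # ['page1-chunk0', 'page1-chunk1', 'page2-chunk0']
```

These lists can then be passed straight to `collection.add(ids=ids, documents=documents, metadatas=metadatas)`, which is what makes the page/chunk lookup from metadata trivial later.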
- For each chunk, generate embeddings. Since I was using the Qwen3-0.6B model, I could not use the default embedding function when creating the collection. Instead, I implemented a custom `EmbeddingFunction` for ChromaDB to generate embeddings for all the chunks. I also wanted the collection to persist across runs, so I used a `PersistentClient`.
```python
import requests
from chromadb import Documents, EmbeddingFunction, Embeddings

class QwenEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        embedding_list = []
        # embed each document via the local Ollama embeddings endpoint
        for i in input:
            result = requests.post(
                "http://localhost:11434/api/embeddings",
                json={"model": "qwen3:0.6b", "prompt": i},
            )
            embedding_list.append(result.json()["embedding"])
        return embedding_list
```
- Once the embeddings are generated, check the count to verify that the chunks were stored in the vector DB.
```python
import chromadb

chroma_client = chromadb.PersistentClient(path="../data")
collection = chroma_client.get_or_create_collection(
    name="system_design",
)
print(collection.count())
print(collection.peek(50))
```
- Time for retrieval. Use a natural-language query to generate an embedding and find the nearest matches in the vector store.
```python
collection = chroma_client.get_or_create_collection(
    name="system_design",
    embedding_function=QwenEmbeddingFunction(),
)
context = collection.query(query_texts=[user_query], n_results=1)
print(context)
```
- Generate the response based on the retrieved context and the user query.
user_query= """ What is a baseline solution? """
chroma_client = chromadb.PersistentClient(path="../data")
collection = chroma_client.get_or_create_collection(
name="system_design",
embedding_function=QwenEmbeddingFunction(),
)
context = collection.query(query_texts=[user_query], n_results=1)
print(context)
input_prompt = f"Given the context\n {context['documents'][0][0]}, answer the following\n {user_query}"
r = requests.post("http://localhost:11434/api/chat", json={
"model": "qwen3:0.6b",
"messages": [{"role": "user", "content": input_prompt}],
"stream": False
})
print(r.json())
```
{'model': 'qwen3:0.6b', 'created_at': '2025-10-22T20:03:35.446415986Z', 'message': {'role': 'assistant', 'content': "<think>\nOkay, the user is asking for a baseline solution in the context of a problem decompositioning, and they mentioned it should take a few hundred milliseconds. Let me think about how to approach this.\n\nFirst, I need to recall what a baseline solution is. From what I remember, a baseline solution refers to the most basic or optimal solution that serves as a reference point for comparison. It's usually the simplest or most efficient approach. So, for example, if a problem is broken down into parts, the baseline solution could be the initial steps or the simplest possible way to achieve the goal.\n\nBut wait, the user also provided some context about the problem taking a few hundred milliseconds. That probably means the solution is efficient. So maybe the baseline solution is the most efficient method for that specific problem set. I should make sure that the answer ties in with the time constraints mentioned. Maybe the baseline solution is the initial steps or the simplest approach that's optimal given the time limit.\n\nAlso, I should check if there's any specific terminology or framework that defines baseline solutions. In problem decomposition, it's about breaking down a problem into components. The baseline could be the foundational part of that decomposition. So, the answer should explain that the baseline solution is the simplest or optimal part of the decomposition process, optimized for the given time constraints.\n</think>\n\nA **baseline solution** refers to the simplest, most efficient, or optimal approach to a problem, often serving as a reference for comparison. In the context of problem decompositioning, it likely refers to the foundational or simplest steps or components that define the decomposition process. If the problem is broken down into manageable parts, the baseline solution would be the most straightforward or optimal approach to achieve the goal within the given time constraints. For example, if the decomposition involves multiple steps with a fixed time limit, the baseline solution would be the initial, optimal steps that ensure the overall process is efficient and effective."}, 'done_reason': 'stop', 'done': True, 'total_duration': 10403433381, 'load_duration': 13218815, 'prompt_eval_count': 46, 'prompt_eval_duration': 166526197, 'eval_count': 390, 'eval_duration': 10222473768}
```
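The raw response above carries timing metadata plus the `<think>...</think>` reasoning block that Qwen3 emits before its answer. For display you usually want just the final answer text. A small helper for that — `extract_answer` is a name of my own, and it assumes the `/api/chat` response shape shown above:

```python
import re

def extract_answer(response: dict) -> str:
    """Pull the assistant text out of an Ollama /api/chat response
    and strip the <think>...</think> block that Qwen3 emits."""
    content = response["message"]["content"]
    # drop the reasoning block, keep only the final answer
    return re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()

# hypothetical response trimmed down to the fields the helper reads
sample = {
    "message": {
        "role": "assistant",
        "content": "<think>reasoning...</think>\n\nA baseline solution is the simplest reference approach.",
    }
}
print(extract_answer(sample))  # A baseline solution is the simplest reference approach.
```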