I wanted to build a basic RAG (retrieval-augmented generation) application on top of a locally deployed LLM served by Ollama, avoiding frameworks like LangChain as much as possible.

Prerequisites: the Qwen3-0.6B model running locally via Ollama.
I used uv to initialize the project:

```shell
uv init
```
I added the required libraries with:

```shell
uv add chromadb transformers pypdf numpy
```
This is the project structure I followed:

```
.
├── README.md
├── data
│   ├── 4dfb26a1-b0f6-403b-992f-08109a9cd0a6
│   │   ├── data_level0.bin
│   │   ├── header.bin
│   │   ├── length.bin
│   │   └── link_lists.bin
│   ├── Machine_Learning_System_Design.pdf
│   └── chroma.sqlite3
├── main.py
├── pyproject.toml
├── src
│   ├── db_test.py
│   ├── extract_pdf.py
│   ├── qwen_embedding.py
│   └── retrieve.py
└── uv.lock
```
- First, extract the text from the PDF file using the pypdf library.
```python
from pypdf import PdfReader

with open(pdf_file, 'rb') as f:
    content = PdfReader(stream=f)
    for page in content.pages:
        # Extract text
        text = page.extract_text()
```
- Normalize the text to clean it up: lower-case it, remove redundant information, strip extra newline characters, etc.
```python
import re

def normalize_text(text: str) -> str:
    # change to lower case
    text = text.lower()
    # re-join words hyphenated across line breaks
    text = text.replace("-\n", "")
    # remove URLs
    text = re.sub(r'(?:https?://|www\.)[^\s]+', '', text)
    # remove details within parentheses
    text = re.sub(r'\([^)]*\)', '', text)
    return text
```
- Chunk the text based on the paragraphs within each page. The page number and the chunk number within the page can be easily derived from the metadata.
```python
# Chunk text
chunks = [chunk.replace("\n", " ") for chunk in text.split("\n ")]
text_dict[page.page_number] = chunks
```
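Each chunk needs a unique id and metadata (page number, chunk number) before it goes into ChromaDB. A minimal sketch of that bookkeeping — the helper name `build_chunks` is mine, not from the original code:

```python
def build_chunks(text_dict: dict) -> tuple:
    """Flatten {page_number: [chunks]} into the parallel lists
    that ChromaDB's collection.add() expects."""
    ids, documents, metadatas = [], [], []
    for page_number, chunks in text_dict.items():
        for chunk_number, chunk in enumerate(chunks):
            if not chunk.strip():
                continue  # skip empty chunks
            ids.append(f"page{page_number}-chunk{chunk_number}")
            documents.append(chunk)
            metadatas.append({"page": page_number, "chunk": chunk_number})
    return ids, documents, metadatas

ids, documents, metadatas = build_chunks(
    {1: ["first paragraph", "second paragraph"], 2: ["another page"]}
)
print(ids)  # ['page1-chunk0', 'page1-chunk1', 'page2-chunk0']
```

These lists can then be passed straight to `collection.add(ids=ids, documents=documents, metadatas=metadatas)`, which is what makes the page/chunk lookup from metadata trivial later.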
- For each chunk, generate embeddings. Since I was using the Qwen3-0.6B model, I could not use the default embedding function when creating the collection. Instead, I implemented a custom `EmbeddingFunction` for ChromaDB to generate embeddings for all the chunks. I also wanted the collection to persist across runs, so I used a `PersistentClient`.
```python
import requests
from chromadb import Documents, EmbeddingFunction, Embeddings

class QwenEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        embedding_list = []
        # embed each document via the local Ollama embeddings endpoint
        for i in input:
            result = requests.post(
                "http://localhost:11434/api/embeddings",
                json={"model": "qwen3:0.6b", "prompt": i},
            )
            embedding_list.append(result.json()["embedding"])
        return embedding_list
```
- Once the embeddings are generated, check the count to verify that the chunks were stored in the vector DB.
```python
import chromadb

chroma_client = chromadb.PersistentClient(path="../data")
collection = chroma_client.get_or_create_collection(
    name="system_design",
)
print(collection.count())
print(collection.peek(50))
```
- Time for retrieval. Use a natural-language query to generate an embedding and find the nearest matches in the vector store.
```python
collection = chroma_client.get_or_create_collection(
    name="system_design",
    embedding_function=QwenEmbeddingFunction(),
)
context = collection.query(query_texts=[user_query], n_results=1)
print(context)
```
- Generate the response based on the retrieved context and the user query.
user_query= """ What is a baseline solution? """
chroma_client = chromadb.PersistentClient(path="../data")
collection = chroma_client.get_or_create_collection(
name="system_design",
embedding_function=QwenEmbeddingFunction(),
)
context = collection.query(query_texts=[user_query], n_results=1)
print(context)
input_prompt = f"Given the context\n {context['documents'][0][0]}, answer the following\n {user_query}"
r = requests.post("http://localhost:11434/api/chat", json={
"model": "qwen3:0.6b",
"messages": [{"role": "user", "content": input_prompt}],
"stream": False
})
print(r.json())
```
{'model': 'qwen3:0.6b', 'created_at': '2025-10-22T20:03:35.446415986Z', 'message': {'role': 'assistant', 'content': "<think>\nOkay, the user is asking for a baseline solution in the context of a problem decompositioning, and they mentioned it should take a few hundred milliseconds. Let me think about how to approach this.\n\nFirst, I need to recall what a baseline solution is. From what I remember, a baseline solution refers to the most basic or optimal solution that serves as a reference point for comparison. It's usually the simplest or most efficient approach. So, for example, if a problem is broken down into parts, the baseline solution could be the initial steps or the simplest possible way to achieve the goal.\n\nBut wait, the user also provided some context about the problem taking a few hundred milliseconds. That probably means the solution is efficient. So maybe the baseline solution is the most efficient method for that specific problem set. I should make sure that the answer ties in with the time constraints mentioned. Maybe the baseline solution is the initial steps or the simplest approach that's optimal given the time limit.\n\nAlso, I should check if there's any specific terminology or framework that defines baseline solutions. In problem decomposition, it's about breaking down a problem into components. The baseline could be the foundational part of that decomposition. So, the answer should explain that the baseline solution is the simplest or optimal part of the decomposition process, optimized for the given time constraints.\n</think>\n\nA **baseline solution** refers to the simplest, most efficient, or optimal approach to a problem, often serving as a reference for comparison. In the context of problem decompositioning, it likely refers to the foundational or simplest steps or components that define the decomposition process. If the problem is broken down into manageable parts, the baseline solution would be the most straightforward or optimal approach to achieve the goal within the given time constraints. For example, if the decomposition involves multiple steps with a fixed time limit, the baseline solution would be the initial, optimal steps that ensure the overall process is efficient and effective."}, 'done_reason': 'stop', 'done': True, 'total_duration': 10403433381, 'load_duration': 13218815, 'prompt_eval_count': 46, 'prompt_eval_duration': 166526197, 'eval_count': 390, 'eval_duration': 10222473768}
```
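The raw response above carries timing metadata plus the `<think>...</think>` reasoning block that Qwen3 emits before its answer. For display you usually want just the final answer text. A small helper for that — `extract_answer` is a name of my own, and it assumes the `/api/chat` response shape shown above:

```python
import re

def extract_answer(response: dict) -> str:
    """Pull the assistant text out of an Ollama /api/chat response
    and strip the <think>...</think> block that Qwen3 emits."""
    content = response["message"]["content"]
    # drop the reasoning block, keep only the final answer
    return re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()

# hypothetical response trimmed down to the fields the helper reads
sample = {
    "message": {
        "role": "assistant",
        "content": "<think>reasoning...</think>\n\nA baseline solution is the simplest reference approach.",
    }
}
print(extract_answer(sample))  # A baseline solution is the simplest reference approach.
```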