Local LLM with Retrieval-Augmented Generation
Let’s build a simple RAG application using a local LLM through Ollama.
Edit (2025-03-26): Added some words about next steps in conclusion.
Edit (2025-03-25): I re-ran the example with a clean database and the results are better. I also cleaned up the code a bit.
Over the past few months I have been running local LLMs on my computer with varying results, ranging from ‘unusable’ to ‘pretty good’. Local LLMs are becoming more powerful, but they don’t inherently “know” everything. They’re trained on massive datasets, but those datasets are typically static. To make LLMs truly useful for specific tasks, you often need to augment them with your own data: data that’s constantly changing, specific to your domain, or simply not included in the LLM’s original training. Retrieval-Augmented Generation (RAG) bridges this gap by embedding context information into a vector database, which is later queried to provide relevant context to the LLM so that it can answer questions beyond its original training dataset. In this short article, we’ll see how to build a very primitive local AI chatbot powered by Ollama with RAG capabilities.
The source code used in this post is available here.
Ingredients and set-up
In order to build our AI chatbot, we need the following ingredients:
- Ollama, to facilitate access to different LLMs.
- Chroma DB, our vector storage for our context.
- Langchain, to facilitate the integration of our RAG application with the LLM.
First, we need to install Ollama. On Arch Linux, it is in the extra repository. If you happen to have an NVIDIA GPU that supports CUDA, you’re in luck: you can speed up your inference by a significant factor! Otherwise, you can still just use the CPU. I have a (rather old) NVIDIA GPU, so I’m installing ollama-cuda as well.
pacman -S ollama ollama-cuda
Next, we need some LLM models to run locally. First, start the Ollama service.
systemctl start ollama.service
You can check whether the service is running with curl:
$ curl http://localhost:11434
Ollama is running%
Then, you need to get some LLM models. I have some installed that I use regularly:
$ ollama list
NAME                     ID              SIZE      MODIFIED
llama3.1:8b              46e0c10c039e    4.9 GB    2 hours ago
gemma3:12b               6fd036cefda5    8.1 GB    10 days ago
phi4-mini:latest         78fad5d182a7    2.5 GB    10 days ago
deepseek-coder-v2:16b    63fb193b3a9b    8.9 GB    10 days ago
llama3.2:1b              baf6a787fdff    1.3 GB    4 weeks ago
mistral-small:24b        8039dd90c113    14 GB     7 weeks ago
phi4:latest              ac896e5b8b34    9.1 GB    7 weeks ago
I would recommend using llama3.1:8b (link) for this experiment, especially if you run a machine with little RAM. It is quite compact and works reasonably well. Additionally, some models, like gemma3:12b, do not support embeddings, so you need to choose carefully.
Get llama3.1:8b with:
ollama pull llama3.1:8b
That’s it for our LLM. Now, let’s create a new directory for our RAG chatbot project, and let’s install the dependencies.
mkdir rag-chat && cd rag-chat
pipenv install ollama langchain langchain-ollama chromadb overrides
This should install python-ollama, Langchain together with the langchain-ollama integration, and Chroma DB.
Ollama model setup
The code we display below is available in this repository. Now, let’s create a file, rag-chat.py.
First, our chatbot will list all the available models and will ask for the model we want to use. This part is simple:
from langchain_ollama import OllamaEmbeddings, OllamaLLM
import chromadb, requests, os, ollama
OLLAMA_URL = "http://localhost:11434"
models = ollama.list()
model_names = []
for m in models.models:
    model_names.append(m.model)
print("Available models:\n -", '\n - '.join(model_names))
# Ask the user the LLM model to use
llm_model = input("\nModel to use: ")
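As a small optional refinement (not part of the original script), you could validate the user input and re-prompt until the name matches one of the listed models:
# Optional: re-prompt until the chosen model is actually available locally.
while llm_model not in model_names:
    print(f"Model '{llm_model}' is not in the list above.")
    llm_model = input("Model to use: ")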
Vector storage
Chroma DB is a vector database designed and optimized for storing and searching vector embeddings, which are crucial in RAG. Vector embeddings are numerical representations of data (text, images, audio, etc.). They capture the meaning or semantic information of the data. They are generated by machine learning models (often transformer models like those used in LLMs). These models are trained to map similar data points to nearby points in a high-dimensional vector space. That’s precisely what we need to do with our context data for it to be useful to the LLM. And to that purpose we use Chroma DB.
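To make the idea of embeddings more concrete, here is a small standalone sketch (separate from the chatbot script, assuming Ollama is running and llama3.1:8b is pulled) that embeds three sentences with OllamaEmbeddings and compares them with cosine similarity; the two astronomy-related sentences should score closer to each other than to the unrelated one:
# Standalone sketch: compare texts via their embedding vectors (assumes llama3.1:8b is pulled).
import math
from langchain_ollama import OllamaEmbeddings

emb = OllamaEmbeddings(model="llama3.1:8b", base_url="http://localhost:11434")

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1, v2, v3 = emb.embed_documents([
    "JWST images exoplanet atmospheres in the mid-infrared.",
    "The James Webb Space Telescope observes planets around other stars.",
    "My cat sleeps on the sofa all day.",
])

print(cosine(v1, v2))  # expected to be higher (related sentences)
print(cosine(v1, v3))  # expected to be lower (unrelated sentences)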
So, we need to initialize the Chroma DB client with persistent storage, and get it ready to embed our context data. We decide to persist our storage to disk, in the directory chroma_db/.
# Configure ChromaDB
# Initialize the ChromaDB client with persistent storage in the current directory
chroma_client = chromadb.PersistentClient(path=os.path.join(os.getcwd(), "chroma_db"))
# Custom embedding function for ChromaDB (using Ollama)
class ChromaDBEmbeddingFunction:
"""
Custom embedding function for ChromaDB using embeddings from Ollama.
"""
def __init__(self, langchain_embeddings):
self.langchain_embeddings = langchain_embeddings
def __call__(self, input):
# Ensure the input is in a list format for processing
if isinstance(input, str):
input = [input]
return self.langchain_embeddings.embed_documents(input)
# Initialize the embedding using Ollama embeddings, with the chosen model
embedding = ChromaDBEmbeddingFunction(
OllamaEmbeddings(
model=llm_model,
base_url=OLLAMA_URL
)
)
# Define a collection for the RAG workflow
collection_name = "rag_collection_demo"
collection = chroma_client.get_or_create_collection(
name=collection_name,
metadata={"description": "A collection for RAG with Ollama"},
embedding_function=embedding # Embedding function we defined before
)
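If you want to verify that the persistent collection was created correctly (this check is not part of the original script), Chroma collections expose a count() method:
# Optional sanity check: the collection is empty on the very first run.
print(f"Collection '{collection.name}' currently holds {collection.count()} documents.")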
Provide context for retrieval
Once we have Chroma DB set up, with our collection and embedding function, we need to populate it. Here, we’ll just use an array with text, but you could fetch this information from files easily.
In my example, I use (1) a completely made-up text, (2) the abstract of a paper about characterizing exoplanet atmospheres with JWST, and (3) some information about the new Gaia Sky website. The first is fictional, so the model cannot possibly know it. The second is a paper that came out well after the model was published. The third describes the new Gaia Sky website, which was also created after the model.
# Function to add documents to the ChromaDB collection
def add_documents_to_collection(documents, ids):
"""
Add documents to the ChromaDB collection.
Args:
documents (list of str): The documents to add.
ids (list of str): Unique IDs for the documents.
"""
collection.add(
documents=documents,
ids=ids
)
# Example: Add sample documents to the collection
documents = [
    "The Mittius is a sphere of radius 2 that is used to disperse light in all "
    "directions. The Mittius is very powerful and sometimes emits light in various "
    "wavelengths on its own. It is a completely fictional object whose only purpose "
    "is testing RAG in a local LLM.",
    "The newly accessible mid-infrared (MIR) window offered by the James Webb Space "
    "Telescope (JWST) for exoplanet imaging is expected to provide valuable information to "
    "characterize their atmospheres. In particular, coronagraphs on board the JWST "
    "Mid-InfraRed instrument (MIRI) are capable of imaging the coldest directly "
    "imaged giant planets at the wavelengths where they emit most of their flux. The "
    "MIRI coronagraphs have been specially designed to detect the NH3 absorption "
    "around 10.5 microns, which has been predicted by atmospheric models. We aim to "
    "assess the presence of NH3 while refining the atmospheric parameters of one of the "
    "coldest companions detected by directly imaging GJ 504 b. Its mass is still a matter "
    "of debate and depending on the host star age estimate, the companion could either "
    "be placed in the brown dwarf regime or in the young Jovian planet regime. We present "
    "an analysis of MIRI coronagraphic observations of the GJ 504 system. We took advantage "
    "of previous observations of reference stars to build a library of images and to "
    "perform a more efficient subtraction of the stellar diffraction pattern. We detected "
    "the presence of NH3 at 12.5 sigma in the atmosphere, in line with atmospheric model "
    "expectations for a planetary-mass object and observed in brown dwarfs within a "
    "similar temperature range. The best-fit model with Exo-REM provides updated values "
    "of its atmospheric parameters, yielding a temperature of Teff = 512 K and radius of "
    "R = 1.08 RJup. These observations demonstrate the capability of MIRI coronagraphs to "
    "detect NH3 and to provide the first MIR observations of one of the coldest directly "
    "imaged companions. Overall, NH3 is a key molecule for characterizing the atmospheres "
    "of cold planets, offering valuable insights into their surface gravity. These "
    "observations provide valuable information for spectroscopic observations planned "
    "with JWST.",
    "Gaia Sky has a new website built with Hugo. It contains download pages for all new "
    "and old versions of the software, and a full listing of all the catalogs and datasets "
    "offered with the software. The datasets can be downloaded in-app with the provided "
    "dataset manager."
]
doc_ids = ["doc_mittius", "doc_paper_jwst", "doc_gaiasky_web"]
# Documents only need to be added once or whenever an update is required.
# This line of code is included for demonstration purposes:
add_documents_to_collection(documents, doc_ids)
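As noted above, the documents could just as easily come from files. Here is a minimal sketch (not part of the original script) that reads every .txt file in a hypothetical context/ directory and adds it to the collection, using the file name as the document ID:
# Sketch: ingest all .txt files from a (hypothetical) context/ directory.
# Relies on the os module already imported at the top of the script.
import glob

file_documents, file_ids = [], []
for path in sorted(glob.glob("context/*.txt")):
    with open(path, encoding="utf-8") as f:
        file_documents.append(f.read())
    file_ids.append(os.path.basename(path))

if file_documents:
    add_documents_to_collection(file_documents, file_ids)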
Finally, the chatbot logic
Now we have our vector storage set up. Let’s build the chat logic!
# Function to query the ChromaDB collection
def query_chromadb(query_text, n_results=3):
    """
    Query the ChromaDB collection for relevant documents.

    Args:
        query_text (str): The input query.
        n_results (int): The number of top results to return.

    Returns:
        list of dict: The top matching documents and their metadata.
    """
    results = collection.query(
        query_texts=[query_text],
        n_results=n_results
    )
    return results["documents"], results["metadatas"]
# Function to interact with the Ollama LLM
def query_ollama(prompt):
    """
    Send a query to Ollama and retrieve the response.

    Args:
        prompt (str): The input prompt for Ollama.

    Returns:
        str: The response from Ollama.
    """
    llm = OllamaLLM(model=llm_model)
    return llm.invoke(prompt)
# RAG pipeline: Combine ChromaDB and Ollama for Retrieval-Augmented Generation
def rag_pipeline(query_text):
    """
    Perform Retrieval-Augmented Generation (RAG) by combining ChromaDB and Ollama.

    Args:
        query_text (str): The input query.

    Returns:
        str: The generated response from Ollama augmented with retrieved context.
    """
    # Step 1: Retrieve relevant documents from ChromaDB
    retrieved_docs, metadata = query_chromadb(query_text)
    context = " ".join(retrieved_docs[0]) if retrieved_docs else "No relevant documents found."

    # Step 2: Send the query along with the context to Ollama
    augmented_prompt = f"Context: {context}\nQuestion: {query_text}\nAnswer: "
    # Uncomment the next line if you want to see the context
    # print(augmented_prompt)
    response = query_ollama(augmented_prompt)
    return response
# User query loop
while True:
    query = input("Ask a question (or type 'exit' to quit): ")
    if query.lower() == "exit":
        break
    response = rag_pipeline(query)
    print(response)
That’s it! In this part, we ask the user to provide a query. The script then uses this query to fetch relevant context from our local Chroma DB, adds that context to the prompt (in augmented_prompt), and sends the augmented prompt to Ollama. Finally, we wait for the response and print it out.
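As an optional tweak (not part of the original script), LangChain runnables such as OllamaLLM also support streaming, so you could print the answer token by token instead of waiting for the whole response. A sketch of a streaming variant of query_ollama, with a hypothetical name:
# Sketch: stream the answer chunk by chunk for a more responsive chat feel.
def query_ollama_streaming(prompt):
    llm = OllamaLLM(model=llm_model)
    for chunk in llm.stream(prompt):
        print(chunk, end="", flush=True)
    print()  # final newline once the stream ends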
Here is an example output. I ask it about JWST and exoplanets, the imaginary object ‘Mittius’, and the new Gaia Sky website. I have highlighted the response lines:
$ rag-chat.py
Available models:
- llama3.1:8b
- gemma3:12b
- phi4-mini:latest
- deepseek-coder-v2:16b
- llama3.2:1b
- mistral-small:24b
- phi4:latest
Model to use: llama3.1:8b
Insert of existing embedding ID: doc_mittius
Insert of existing embedding ID: doc_paper_jwst
Insert of existing embedding ID: doc_gaiasky_web
Add of existing embedding ID: doc_mittius
Add of existing embedding ID: doc_paper_jwst
Add of existing embedding ID: doc_gaiasky_web
Ask a question (or type 'exit' to quit): JWST and exoplanets.
The James Webb Space Telescope (JWST) will play a significant role in exoplanet
research by providing new capabilities for characterizing their atmospheres.
The telescope's Mid-InfraRed instrument (MIRI) coronagraphs are capable of
imaging cold directly imaged giant planets and detecting specific molecules,
such as ammonia (NH3), which is a key indicator of surface gravity.
In the provided text, JWST is mentioned in the context of:
1. The newly accessible mid-infrared (MIR) window offered by JWST for exoplanet
imaging.
2. The MIRI coronagraphs on board JWST are capable of detecting NH3 around
10.5 microns.
3. Future spectroscopic observations planned with JWST will be guided by the
valuable information obtained from these observations.
Overall, JWST is expected to provide new insights into exoplanet atmospheres
and help scientists refine their understanding of these distant worlds.
Ask a question (or type 'exit' to quit): What is the mittius?
A completely fictional object used to test Reasoning and Argument Generation
(RAG) in a local Large Language Model (LLM).
Ask a question (or type 'exit' to quit): What is the radius of the Mittius?
The answer can be found in the first sentence of the text:
"The Mittius is a sphere of radius 2 that is used to disperse light in all directions."
So, the radius of the Mittius is 2.
Ask a question (or type 'exit' to quit): How was the new Gaia Sky website built?
The new Gaia Sky website was built with Hugo.
Ask a question (or type 'exit' to quit): exit
As you can see, even this small 8B-parameter model pretty much nails the answers about all three documents. In the third question (the radius of the Mittius), it even quotes the exact sentence from which it sourced the answer. Good work, Llama3.1!
Conclusion
As we’ve seen, with very little effort we can build a rudimentary RAG system on top of Ollama. This enables us to inject context information into our queries in an automated manner, with the help of Chroma DB. This post only covers the basic notions required to get RAG working in a bare-bones manner. Further concepts to explore include different embedding models, quantization, system prompts, and more.
In our small test, we’ve used the Llama3.1 8B model, which is rather small. Using a larger model, like Gemma3 (12B), DeepSeek-R1 (14B), or even Mistral-small (24B), should improve the results at the expense of performance.
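On the embedding front, note that in this post the chat model also produces the embeddings, which keeps the setup simple but is usually not optimal. Ollama offers dedicated embedding models such as nomic-embed-text; swapping one in only requires changing the model passed to OllamaEmbeddings, as in this sketch (the existing collection would need to be rebuilt, since embeddings from different models are not compatible):
# Sketch: use a dedicated embedding model for the vector store,
# while keeping the chat model (llm_model) for generation.
# Requires: ollama pull nomic-embed-text
embedding = ChromaDBEmbeddingFunction(
    OllamaEmbeddings(
        model="nomic-embed-text",
        base_url=OLLAMA_URL
    )
)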
The code in this post is partially based on this Medium article.