
RAG and Tools for Managing Collections

With LangChain, ChromaDB, HuggingFace Embeddings and OpenAI

Introduction

I previously wrote about implementing RAG (Retrieval Augmented Generation) with OpenAI embeddings and ChromaDB as the vector store. This post extends that discussion and adds a different flavor by using HuggingFace embeddings, with some additional details on managing document collections. My previous post on RAG provides a more comprehensive overview if needed; if you have read it, you can skip the introductory bits and head straight to the code.

Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) provides a simple method for supplementing an LLM’s training data with specialized and up-to-date information, enhancing its ability to deliver contextually relevant responses. Implementing RAG typically follows a structured process. Supplemental sources (user files, proprietary company data, product data, research papers and so on) are loaded and pre-processed into raw text, which is then split into smaller chunks called documents.

These chunks of text are then transformed into vector representations - word embeddings - which capture the semantics or meaning of the text. During the embedding process a word, sentence or whole text file is converted into a sequence of numbers (a vector), such that similar texts will have similar vector representations. Using semantic similarity searches, the embeddings provide an efficient means of finding texts that are most relevant and similar in context.

Now here comes the clever bit: before the user query is submitted to the chat model, the system searches through the stored embeddings to find the document chunks that are most closely associated with the query, based on their semantic similarity to the input. These document chunks likely contain the most pertinent information and are retrieved and included with the original user query. This augmented query is sent to the LLM, providing additional context and information to improve the generated response. The LLM thus generates output based on its (1) underlying language understanding and (2) the augmented prompt containing information provided by the RAG system.
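The retrieve-then-augment step can be sketched in a few lines of plain Python. This is only a toy illustration: the word-overlap scoring function and the prompt template are made up for the example, whereas real systems score with embedding similarity against a vector store.

```python
def score(query, chunk):
    """Toy relevance score: count of shared words (real systems
    use similarity between embedding vectors instead)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def augment(query, chunks, k=2):
    """Retrieve the k most relevant chunks and prepend them to the query."""
    top = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
    context = "\n".join(top)
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "The Huey helicopter was widely used in Vietnam.",
    "Trenches were roofed over with boards and earth.",
    "Sentry duty became torture once the rain set in.",
]
prompt = augment("What was the helicopter used in Vietnam?", chunks, k=1)
print(prompt)
```

The augmented prompt now carries the most relevant chunk as context, which is exactly what gets sent to the chat model in a real RAG pipeline.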

Libraries

This post uses LangChain as the main framework. The word embeddings are done using a HuggingFace embedding model and the main chat model is provided by OpenAI. Embeddings are stored in a ChromaDB vector database which easily integrates with LangChain, and the database is saved locally using SQLite. For reference, here are some of the package versions used in the subsequent code.

## numpy 1.26.4
## pandas 2.2.3
## keras 2.13.1
## tensorflow 2.13.1
## langchain 0.3.9
## chromadb 0.5.15
## langchain_openai 0.2.3
## langchain_chroma 0.1.4
## langchain_community 0.3.3
from langchain_chroma import Chroma # Vector database for embeddings
from langchain_openai import ChatOpenAI # Chat model
from langchain_community.document_loaders import TextLoader # Document loading
from langchain.text_splitter import CharacterTextSplitter # Document chunking
# from langchain_openai import OpenAIEmbeddings # Embedding model
from langchain_huggingface import HuggingFaceEmbeddings # Embedding model

APIs

This post uses OpenAI models, so you’ll need a paid account to use their API if you want to run the code. There used to be a free tier, but this is no longer the case, so a small amount of credit will be needed in your account. There are plenty of resources online showing you how to sign up. Once you have an API key, you will have to make it available in your development environment. If you are using a conda environment, you can add it to the environment’s variables and then read it with os.environ.
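As a sketch, storing the key inside a conda environment looks something like the following (the key value and the environment name `my_env` are placeholders):

```shell
# Store the API key as a variable inside the conda environment
conda env config vars set OPENAI_KEY="sk-..."   # substitute your own key
# Re-activate the environment so the variable takes effect
conda activate my_env
```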

import os
api_key = os.environ.get('OPENAI_KEY')
print(api_key) # Don't share this obviously !

First we establish a connection to the OpenAI language model.

# Connection to OpenAI API
llm = ChatOpenAI(openai_api_key = api_key)

Document Splitting

Next is a document splitter, which breaks the input data into chunks prior to vectorization (i.e. creating the embeddings). Two important parameters are chunk_size, which determines the number of characters in each split, and chunk_overlap, which reduces potential information loss at the chunk boundaries. The choice of these parameters impacts model performance, so they need to be determined experimentally for each application.

# Initialize a splitter object
# Note chunk size is CHARACTERS not WORDS
text_splitter = CharacterTextSplitter(
    separator = ".",   # Split on a full-stop
    chunk_size = 250,  # The split is done at the nearest delimiter
    chunk_overlap = 50 
)
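To build intuition for how chunk_size and chunk_overlap interact, here is a toy fixed-width chunker in plain Python. Note this is only an illustration, not LangChain’s implementation: CharacterTextSplitter splits at the separator, so real chunks vary in length.

```python
def naive_chunk(text, chunk_size, chunk_overlap):
    """Illustrative fixed-width chunker: each chunk starts
    (chunk_size - chunk_overlap) characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_chunk("abcdefghij" * 5, chunk_size=20, chunk_overlap=5)
print(len(chunks))                      # 4 chunks for 50 characters
print(chunks[0][15:] == chunks[1][:5])  # True: the 5-character overlap is shared
```

A larger overlap means more duplicated text across chunks (and more embeddings to store), but less risk of a sentence being severed from its context.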

For the purposes of this example, I set up two separate collections in order to demonstrate how to navigate between them. I used extracts from two military autobiographies: To the Limit: An Air Cav Huey Pilot in Vietnam by Tom Johnson and Storm of Steel by Ernst Jünger. A general-purpose LLM would not have been trained on these texts, so we can use these to demonstrate RAG in action.

Extract from To the Limit:

“war, many individual Rotary Wing Classes suffered even heavier losses than that of the average helicopter crew. I graduated 16th of 286 men in Warrant Officer Rotary Wing Aviation Class 67-5. In the twelve months after we received our Army aviator wings at Fort Rucker, Alabama, 1 out of every 13 of us died in South Vietnam. Of those who died, the average time in country was 165 days and the average age was 23.11 years.’ 1, Statistical data provided by Vietnam Helicopter Pilots Association. Made possible only through the individual efforts of Gary Roush and the other members of the Data Base Committee as listed in the 2004 Membership Directory. The An Lao Valley Incident Tonight, this 19-year-old will most likely leave us. His wounds are massive. Like others before him, he thrashes about on the hard alumi- num floor bathed in blood and suffering. Repeatedly, he calls for his mother, not his God. At great risk to ourselves, we will push our flying abilities and this helicopter to the brink of disaster trying to save him. Although cloaked in darkness, we must fly hard and low, cutting every corner. In spite of all my efforts, he will likely make the transition from life to death” - Tom A. Johnson

Extract from Storm of Steel:

“Guard duty was either in the trench or else in one of the numerous forward posts that were connected to the line by long, buried saps; a type of insurance that was later given up, because of their exposed position.The endless, exhausting spells of sentry duty were bearable so long as the weather happened to be fine, or even frosty; but it became torture once the rain set in in January. Once the wet had saturated the canvas sheeting overhead, and your coat and uniform, and trickled down your body for hours on end, you got into a mood that nothing could lighten, not even the sound of the splashing feet of the man coming towards you to relieve you. Dawn lit exhausted, clay-smeared figures who, pale and teeth chattering, flung themselves down on the mouldy straw of their dripping dugouts.Those dugouts! They were holes hacked into the chalk, facing the trench, roofed over with boards and a few shovelfuls of earth. If it had been raining, they would drip for days afterwards; a desperate waggishness kitted them out with names like ‘Stalactite Cavern’, ‘Men’s Public Baths’, and other such. If several men wanted to rest at the same time, they had no” - Ernst Jünger

Document Loading and Chunking

Now we load the custom documents and split them into chunks using the previously defined splitting function.

# Load the first file
loader = TextLoader("limitsnippet.txt", encoding='cp1252')
# Split into chunks
doc_1 = loader.load_and_split(
    text_splitter = text_splitter)

The output is a list of Document objects, one per chunk. Each Document contains metadata and the actual text.

# Number of document chunks
print(len(doc_1))
## 10
# Inspect one of the chunks
print(doc_1[5].page_content)
## Although cloaked in darkness, we must fly hard and low, cutting every corner. In spite of all my efforts, he will likely make the transition from life to death. His soul will depart, and his earthly body will finally lie in peace, without pain

The page_content component is extracted and stored in a simple list that will be fed into the embedding model.

# Initialize a list
doc_1_content = []

# Simple loop to extract the page contents for each document chunk
for i in range(0,len(doc_1)):
    doc_1_content.append(doc_1[i].page_content) 
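The same extraction can also be written as a list comprehension. Using a minimal stand-in for the LangChain Document class to keep the snippet self-contained:

```python
from dataclasses import dataclass

@dataclass
class Chunk:  # stand-in for LangChain's Document, for illustration only
    page_content: str

doc = [Chunk("first chunk"), Chunk("second chunk")]

# Equivalent of the loop above in one line
doc_content = [chunk.page_content for chunk in doc]
print(doc_content)  # ['first chunk', 'second chunk']
```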

Embedding

Next, define the embedding function that uses a sentence transformers embedding model. Sentence transformers is a “Python module for accessing, using and training… embedding models”.

# Embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Load model
# The first time you run this, your system will download the model, so it can take a little time
hf_embed_func = HuggingFaceEmbeddings(model_name=model_name)

The embedding function is called to make the embeddings. The output is a list of vectors, one per chunk, each with 384 dimensions, i.e. each chunk of text is represented by 384 numbers.

# Make word vectors
doc_1_embeddings = hf_embed_func.embed_documents(doc_1_content)

The preceding process is then repeated for the second document extract.

# Load the second file
loader = TextLoader("steelsnippet.txt", encoding='cp1252')
# Split into chunks
doc_2 = loader.load_and_split(
    text_splitter = text_splitter)
## Created a chunk of size 308, which is longer than the specified 250
## Created a chunk of size 264, which is longer than the specified 250
# Inspect one of the chunks
print(doc_2[4].page_content)
## Guard duty was either in the trench or else in one of the numerous forward posts that were connected to the line by long, buried saps; a type of insurance that was later given up, because of their exposed position
# Initialize a list
doc_2_content = []

# Simple loop to extract the page contents for each document chunk
for i in range(0,len(doc_2)):
    doc_2_content.append(doc_2[i].page_content) 

# Make word vectors
doc_2_embeddings = hf_embed_func.embed_documents(doc_2_content)

Vector Storage

The resulting embeddings are stored in a vector database. There are two alternatives for using Chroma: the LangChain Chroma functions, or a Chroma client. I used the latter, since I found it a bit more intuitive, but the more skilled practitioner might prefer the LangChain implementation. Importantly, different embedding models produce vectors of different sizes, and a collection can only store embeddings of a single size, so make sure you don’t mix them up.
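Since mixing vector sizes will cause errors at insertion time, a simple guard can verify a batch before adding it. This helper is a sketch, not part of the Chroma API:

```python
def embedding_dim(embeddings):
    """Return the common dimensionality of a batch of embeddings,
    raising if the batch mixes vector sizes."""
    dims = {len(vec) for vec in embeddings}
    if len(dims) != 1:
        raise ValueError(f"Mixed embedding sizes found: {sorted(dims)}")
    return dims.pop()

# Toy batch of two 384-dimensional vectors, as produced by all-MiniLM-L6-v2
batch = [[0.1] * 384, [0.2] * 384]
print(embedding_dim(batch))  # 384
```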

import chromadb

# The persistent client will save output to file in a ./chroma directory
# unless another location is specified by the user
client = chromadb.PersistentClient()

Embeddings are added to collections, which are logical groupings of documents: a collection can be created per user, per document author, per document topic and so on. See Chroma DB documentation for more details on collections.

# Create or fetch a collection object (if it already exists)
# Alternatively, use 'create_collection' function
collection_1 = client.get_or_create_collection("first_collection")

The document chunks are added to the collection object and stored in the vector database, along with a unique identifier per chunk, the embeddings and user-defined metadata. The metadata is supplied as a list containing one dictionary of key-value pairs per chunk, e.g. metadatas = [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}, ...].

collection_1.add(
    ids = [f"{i}" for i in range(0,len(doc_1_content))], # Add a simple index/identifier for each vector
    embeddings = doc_1_embeddings, # This is the embedding generated previously
    documents= doc_1_content, # The document chunks 
    metadatas=[{"Index":f"{i}", # Custom metadata
        "Author":"Johnson",
        "Tag":"Autobiography"} for i in range(0,len(doc_1_content))])

Running the add() function multiple times will add multiple copies of the documents to the vector database, so keep this in mind. Duplicates can be removed after a similarity search, but the simplest approach may be to delete the collection and start again. Obviously this will be unattractive if the vector database is large, so plan ahead.
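One defensive pattern, sketched here in plain Python independently of the Chroma API, is to track which IDs are already stored and only add the new chunks:

```python
# IDs already present in the store (in practice, e.g. collection.get()['ids'])
existing_ids = {"0", "1", "2"}

# Incoming chunks keyed by their intended IDs
incoming = {str(i): f"chunk {i}" for i in range(5)}

# Keep only the chunks whose IDs are not already stored
to_add = {k: v for k, v in incoming.items() if k not in existing_ids}
print(sorted(to_add))  # ['3', '4']
```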

Much like a regular database, a document can be updated in the vector database using its ID as the identifier.

collection_1.update(
    ids=["1"],
    documents=["This is a meaningless update"],
    metadatas=[{"Index":1, "Author":"Johnson", "Tag":"Autobiography"}]
)

Similarly, we can add a second collection.

collection_2 = client.get_or_create_collection("second_collection")

collection_2.add(
    ids = [f"{i}" for i in range(0,len(doc_2_content))], 
    embeddings = doc_2_embeddings, 
    documents= doc_2_content, # The document chunks 
    metadatas=[{"Index":f"{i}", # Custom metadata
        "Author":"Jünger",
        "Tag":"Autobiography"} for i in range(0,len(doc_2_content))])

Managing Collections

Deleting a collection is simple enough by reference to the collection name; individual records can likewise be removed from a collection with its delete() function, passing the record IDs.

# Delete collection - NOT RUN
client.delete_collection("collection_name")

The count() function returns the number of records in a collection, and the count_collections() function returns the number of collections in a database.

# Entries in collection 1
print(collection_1.count())
## 10
# Entries in collection 2
print(collection_2.count())
## 10
# Number of collections
print(client.count_collections())
## 2

Collections in a database can be listed simply with reference to the client.

# List of collections (ID and collection name)
client.list_collections()
## [Collection(id=5cb237c1-2e80-46cc-b169-430fb6d85c23, name=first_collection), Collection(id=c70e00b7-0fce-473d-97c2-b113750c4148, name=second_collection)]

The contents of a collection can be inspected using peek(), which will show the first 10 items, or peek(n), which will show the first n entries. The output shows the id, the embedding vector, the document text and the metadata. This output is rather verbose, since it shows the full embeddings (the get() function accepts an include argument if you want to limit which fields are returned).

# First entry in collection 1
collection_1.peek(1)
## {'ids': ['0'], 'embeddings': array([[ 2.27102116e-02,  6.14032941e-03, -4.80835959e-02,
##          3.86925302e-02, -3.70215885e-02,  9.53903235e-03,
##         ... (remaining embedding values truncated for brevity) ...
##         -5.31343222e-02, -1.20867379e-01,  6.36211262e-05]]), 'documents': ['war, many individual Rotary Wing Classes suffered even heavier losses than that of the average helicopter crew. I graduated 16th of 286 men in Warrant Officer Rotary Wing Aviation Class 67-5'], 'uris': None, 'data': None, 'metadatas': [{'Author': 'Johnson', 'Index': '0', 'Tag': 'Autobiography'}], 'included': [<IncludeEnum.embeddings: 'embeddings'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

Querying

We can run a simple query on a collection object, but this should not be confused with querying a chat model. The query command creates an embedding of the query text and returns the specified number of results, based on the query’s similarity to the embeddings stored in the collection. Note also that the query runs against a single collection, not the whole vector database, which may contain many collections.
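Under the hood, this ranking amounts to comparing the query vector against every stored vector. A toy version with two-dimensional "embeddings" follows; cosine similarity is used here as one common measure, though the metric a Chroma collection actually uses is configurable:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy stored embeddings and a toy query embedding (2-D for readability)
store = {"doc_a": [1.0, 0.0], "doc_b": [0.7, 0.7], "doc_c": [0.0, 1.0]}
query = [0.9, 0.1]

# Rank documents by similarity to the query, best match first
ranked = sorted(store, key=lambda k: cosine(store[k], query), reverse=True)
print(ranked)  # ['doc_a', 'doc_b', 'doc_c']
```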

results = collection_2.query(query_texts=["What were names of dugouts?"],
                             n_results=3) # Show the three closest matches

for item in results['documents'][0]: # The documents are in a nested list
  print(item)
## Those dugouts! They were holes hacked into the chalk, facing the trench, roofed over with boards and a few shovelfuls of earth
## Dawn lit exhausted, clay-smeared figures who, pale and teeth chattering, flung themselves down on the mouldy straw of their dripping dugouts
## If it had been raining, they would drip for days afterwards; a desperate waggishness kitted them out with names like ‘Stalactite Cavern’, ‘Men’s Public Baths’, and other such. If several men wanted to rest at the same time, they had no op

In addition to a text query, we can specify the embedding values that we want to match against. The n_results parameter controls how many matches are returned.

# This trivial example is not executed
collection_2.query(query_embeddings=[[0, 0, 0, 0, 0, .....]], # A vector with 384 dimensions would be specified
                   where_document={"$contains":"dugouts"},
                   n_results=3) # Show the three closest matches

Instead of querying the collection object, we can query the collection by performing a similarity search on a Chroma class object. It may not be technically correct, but I think of this object as behaving much like a database connection.

# Instantiate a vector store object 
vector_db = Chroma(collection_name="second_collection",
                  client=client,
                  embedding_function=hf_embed_func) # This is the embedding function, NOT the embeddings

results = vector_db.similarity_search("What were names of dugouts?",
                                         k = 2) # The default gives 4 closest matches
for item in results:
  print(item.page_content)
## Those dugouts! They were holes hacked into the chalk, facing the trench, roofed over with boards and a few shovelfuls of earth
## Dawn lit exhausted, clay-smeared figures who, pale and teeth chattering, flung themselves down on the mouldy straw of their dripping dugouts

Filtering

It is, of course, useful to be able to filter the collections based on the contents of the metadata or of the documents themselves.

Filtering on metadata:

collection_1.get(where={"Author": "Johnson"})
## {'ids': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], 'embeddings': None, 'documents': ['war, many individual Rotary Wing Classes suffered even heavier losses than that of the average helicopter crew. I graduated 16th of 286 men in Warrant Officer Rotary Wing Aviation Class 67-5', 'This is a meaningless update', "11 years.' 1, Statistical data provided by Vietnam Helicopter Pilots Association. Made possible only through the individual efforts of Gary Roush and the other members of the Data Base Committee as listed in the 2004 Membership Directory", 'The An Lao Valley Incident Tonight, this 19-year-old will most likely leave us. His wounds are massive. Like others before him, he thrashes about on the hard alumi- num floor bathed in blood and suffering', 'Repeatedly, he calls for his mother, not his God. At great risk to ourselves, we will push our flying abilities and this helicopter to the brink of disaster trying to save him', 'Although cloaked in darkness, we must fly hard and low, cutting every corner. In spite of all my efforts, he will likely make the transition from life to death. His soul will depart, and his earthly body will finally lie in peace, without pain', 'September 5, 1967 At 0430 hours the company night clerk awakens me and advises me that Major Eugene Beyer,! A Company’s commanding officer, has picked me to “volunteer” for an emergency night resupply mission', 'Not fully awake, I plant my feet over the side of my cot and push the mosquito netting aside while attempting to comprehend the rapid briefing being given by the clerk', 'I “roger” as though I actually under- stand all he has said, then reach across the wooden pallets covering the dirt floor between my bunk and that of Warrant Officer James Arthur Johansen. Shaking him awake, I ask him to go with me', 'Shaking him awake, I ask him to go with me. Though more asleep than awake, he agrees. 1. Eugene Beyer was later promoted to a colonel. 
He is now retired'], 'uris': None, 'data': None, 'metadatas': [{'Author': 'Johnson', 'Index': '0', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': 1, 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '2', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '3', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '4', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '5', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '6', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '7', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '8', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '9', 'Tag': 'Autobiography'}], 'included': [<IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

Filtering on multiple metadata values requires an $or operator. Other possible operators include $and, $contains and $not_contains.

collection_1.get(where={"$or": [{"Author": "Johnson"}, {"Author": "Jünger"}]})
## {'ids': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], 'embeddings': None, 'documents': ['war, many individual Rotary Wing Classes suffered even heavier losses than that of the average helicopter crew. I graduated 16th of 286 men in Warrant Officer Rotary Wing Aviation Class 67-5', 'This is a meaningless update', "11 years.' 1, Statistical data provided by Vietnam Helicopter Pilots Association. Made possible only through the individual efforts of Gary Roush and the other members of the Data Base Committee as listed in the 2004 Membership Directory", 'The An Lao Valley Incident Tonight, this 19-year-old will most likely leave us. His wounds are massive. Like others before him, he thrashes about on the hard alumi- num floor bathed in blood and suffering', 'Repeatedly, he calls for his mother, not his God. At great risk to ourselves, we will push our flying abilities and this helicopter to the brink of disaster trying to save him', 'Although cloaked in darkness, we must fly hard and low, cutting every corner. In spite of all my efforts, he will likely make the transition from life to death. His soul will depart, and his earthly body will finally lie in peace, without pain', 'September 5, 1967 At 0430 hours the company night clerk awakens me and advises me that Major Eugene Beyer,! A Company’s commanding officer, has picked me to “volunteer” for an emergency night resupply mission', 'Not fully awake, I plant my feet over the side of my cot and push the mosquito netting aside while attempting to comprehend the rapid briefing being given by the clerk', 'I “roger” as though I actually under- stand all he has said, then reach across the wooden pallets covering the dirt floor between my bunk and that of Warrant Officer James Arthur Johansen. Shaking him awake, I ask him to go with me', 'Shaking him awake, I ask him to go with me. Though more asleep than awake, he agrees. 1. Eugene Beyer was later promoted to a colonel. 
He is now retired'], 'uris': None, 'data': None, 'metadatas': [{'Author': 'Johnson', 'Index': '0', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': 1, 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '2', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '3', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '4', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '5', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '6', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '7', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '8', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '9', 'Tag': 'Autobiography'}], 'included': [<IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}
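The $or filter behaves like an any-match predicate over the metadata. Sketched in plain Python (the chunk records below are made up for illustration):

```python
chunks = [
    {"doc": "the helicopter lifted off", "meta": {"Author": "Johnson"}},
    {"doc": "rain in the trench", "meta": {"Author": "Jünger"}},
    {"doc": "an unrelated note", "meta": {"Author": "Someone"}},
]

# Equivalent of where={"$or": [{"Author": "Johnson"}, {"Author": "Jünger"}]}
wanted = ("Johnson", "Jünger")
hits = [c for c in chunks if any(c["meta"]["Author"] == a for a in wanted)]
print(len(hits))  # 2
```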

Filtering on keyword in documents:

collection_1.get(where_document={"$contains": "helicopter"})
## {'ids': ['0', '4'], 'embeddings': None, 'documents': ['war, many individual Rotary Wing Classes suffered even heavier losses than that of the average helicopter crew. I graduated 16th of 286 men in Warrant Officer Rotary Wing Aviation Class 67-5', 'Repeatedly, he calls for his mother, not his God. At great risk to ourselves, we will push our flying abilities and this helicopter to the brink of disaster trying to save him'], 'uris': None, 'data': None, 'metadatas': [{'Author': 'Johnson', 'Index': '0', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '4', 'Tag': 'Autobiography'}], 'included': [<IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

Filtering on multiple keywords in documents:

# Filtering on multiple keywords with the $or operator
collection_1.get(where_document={"$or": [{"$contains": "helicopter"}, {"$contains": "darkness"}]})
## {'ids': ['0', '4', '5'], 'embeddings': None, 'documents': ['war, many individual Rotary Wing Classes suffered even heavier losses than that of the average helicopter crew. I graduated 16th of 286 men in Warrant Officer Rotary Wing Aviation Class 67-5', 'Repeatedly, he calls for his mother, not his God. At great risk to ourselves, we will push our flying abilities and this helicopter to the brink of disaster trying to save him', 'Although cloaked in darkness, we must fly hard and low, cutting every corner. In spite of all my efforts, he will likely make the transition from life to death. His soul will depart, and his earthly body will finally lie in peace, without pain'], 'uris': None, 'data': None, 'metadatas': [{'Author': 'Johnson', 'Index': '0', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '4', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '5', 'Tag': 'Autobiography'}], 'included': [<IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}
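If you query with varying keyword sets, the `$or` clause can be built programmatically. The helper below is my own convenience sketch, not part of the ChromaDB API; note that `$or` is expected to wrap at least two conditions, so a single keyword is returned as a bare `$contains` clause.

```python
def contains_any(keywords):
    """Build a ChromaDB where_document clause that matches documents
    containing any of the given keywords."""
    clauses = [{"$contains": kw} for kw in keywords]
    # $or must wrap two or more conditions, so a single keyword
    # is returned as a bare $contains clause instead
    return clauses[0] if len(clauses) == 1 else {"$or": clauses}

# Equivalent to the query above:
# collection_1.get(where_document=contains_any(["helicopter", "darkness"]))
```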

Combining both metadata values and document keywords:

collection_1.get(where_document={"$contains": "helicopter"}, where={"Tag": "Autobiography"})
## {'ids': ['0', '4'], 'embeddings': None, 'documents': ['war, many individual Rotary Wing Classes suffered even heavier losses than that of the average helicopter crew. I graduated 16th of 286 men in Warrant Officer Rotary Wing Aviation Class 67-5', 'Repeatedly, he calls for his mother, not his God. At great risk to ourselves, we will push our flying abilities and this helicopter to the brink of disaster trying to save him'], 'uris': None, 'data': None, 'metadatas': [{'Author': 'Johnson', 'Index': '0', 'Tag': 'Autobiography'}, {'Author': 'Johnson', 'Index': '4', 'Tag': 'Autobiography'}], 'included': [<IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}

Finally, we can use a retriever to apply the filter, drawing on this Stack Overflow question. This filter structure can combine multiple document-keyword and metadata conditions in a single query.

# Defined earlier but repeated for clarity
# Instantiate a vector store object - acts as a database connection
vector_db = Chroma(collection_name="first_collection",
                   client=client,
                   embedding_function=hf_embed_func)  # This is the embedding function, NOT the embeddings

# Define a plain (unfiltered) retriever for talking to the database
retriever = vector_db.as_retriever()

# Define lists of metadata
author_list = ['Johnson']
tag_list = ['Autobiography']

# Create filter dictionaries for each list
author_filter = {"Author": {"$in": author_list}}
tag_filter = {"Tag": {"$in": tag_list}}

# Combine filters using $and or $or operators
combined_filter = {
    "$and": [
        author_filter,
        tag_filter
    ]
}

# Create the retriever with the combined filter
base_retriever = vector_db.as_retriever(search_kwargs={'k': 4, 'filter': combined_filter})

# Perform the query
query = "Where did Army pilots train?"
results = base_retriever.invoke(query)

# Print the results
for result in results:
    print(result)
## page_content='11 years.' 1, Statistical data provided by Vietnam Helicopter Pilots Association. Made possible only through the individual efforts of Gary Roush and the other members of the Data Base Committee as listed in the 2004 Membership Directory' metadata={'Author': 'Johnson', 'Index': '2', 'Tag': 'Autobiography'}
## page_content='war, many individual Rotary Wing Classes suffered even heavier losses than that of the average helicopter crew. I graduated 16th of 286 men in Warrant Officer Rotary Wing Aviation Class 67-5' metadata={'Author': 'Johnson', 'Index': '0', 'Tag': 'Autobiography'}
## page_content='Shaking him awake, I ask him to go with me. Though more asleep than awake, he agrees. 1. Eugene Beyer was later promoted to a colonel. He is now retired' metadata={'Author': 'Johnson', 'Index': '9', 'Tag': 'Autobiography'}
## page_content='September 5, 1967 At 0430 hours the company night clerk awakens me and advises me that Major Eugene Beyer,! A Company’s commanding officer, has picked me to “volunteer” for an emergency night resupply mission' metadata={'Author': 'Johnson', 'Index': '6', 'Tag': 'Autobiography'}
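The combined filter built by hand above can also be generalized. The helper below is my own sketch (not a LangChain or ChromaDB API): it assembles the same `$in`/`$and` structure from any number of metadata fields, which is handy when the set of filter fields varies at runtime.

```python
def metadata_filter(**allowed):
    """Build a ChromaDB metadata filter from field -> list of allowed values.
    A single field becomes a bare $in clause; multiple fields are ANDed."""
    clauses = [{field: {"$in": values}} for field, values in allowed.items()]
    # $and must wrap two or more conditions, so a single field
    # is returned as a bare $in clause instead
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

combined_filter = metadata_filter(Author=["Johnson"], Tag=["Autobiography"])
# Same structure as the hand-built filter above:
# {"$and": [{"Author": {"$in": ["Johnson"]}}, {"Tag": {"$in": ["Autobiography"]}}]}
```

Keyword arguments preserve insertion order in Python 3.7+, so the clause order matches the order the fields are passed in.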

Conclusion

This post has shown how to implement RAG using LangChain and ChromaDB with HuggingFace embeddings. Once the data has been loaded, chunked and embedded, there are a number of ways to manipulate, update, query and filter document collections in the vector database.


Licensed under CC BY-NC-SA 4.0