Does Your Data Have a Librarian?
Imagine you have a giant pile of documents and you want to ask questions about them.
Agents are clever, but out of the box they do not know your stuff (your company’s files, your PDFs, your notes, etc.), and you cannot simply paste 500 pages into an LLM and prompt it; it often will not fit.
So, before an agent can even answer a prompt, something has to find the handful of relevant pages and hand only those over. Essentially, every complex dataset should have its own personal librarian, working in the background, ready to supply the agent/LLM with the relevant files for the prompts you give it.
This whole technique already has a name, RAG (Retrieval-Augmented Generation), but you can think of it as: find the right pages first, then let the agent write the answer.
This is why RAG is big, and why choosing the right embedding model matters: once it is integrated into your production setup, it tends to stay there for a long time. Enter sentence-transformers/all-MiniLM-L6-v2. It is the most downloaded model on all of Hugging Face for a reason, and this letter unpacks why and how it’s used.
Think of it as hiring a very fast librarian. Let your mind expand to all the things you never knew were possible with RAG, especially in the agentic world.
The first stage is just to set up the library, and it is done once. You cut all your documents into small chunks, about a paragraph each. The model reads each chunk and gives it a kind of “fingerprint”: a code that captures what the paragraph is about, not just the words in it.
All those fingerprints get filed away in a special cabinet built for this (called a vector database).
Now, every paragraph you own has been quietly tagged by meaning.
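The library setup described above can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions, not a production setup: `toy_fingerprint` is a hypothetical stand-in that only counts words, so it runs anywhere but does not capture meaning the way the real model does, and the plain `cabinet` list stands in for a vector database. With the sentence-transformers package installed, the real fingerprint would come from `SentenceTransformer("all-MiniLM-L6-v2").encode(chunk)` instead. The sample document text is made up.

```python
import math
import re
from collections import Counter


def split_into_chunks(document: str) -> list[str]:
    """Cut a document into paragraph-sized chunks (split on blank lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def toy_fingerprint(text: str, vocabulary: list[str]) -> list[float]:
    """Hypothetical stand-in for the real model: word counts scaled to length 1."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vector = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def build_cabinet(document: str) -> tuple[list[str], list[dict]]:
    """Chunk the document, fingerprint every chunk, and file the results."""
    chunks = split_into_chunks(document)
    # The toy fingerprint needs a fixed word list; the real model does not.
    vocabulary = sorted({w for c in chunks for w in re.findall(r"[a-z]+", c.lower())})
    cabinet = [{"text": c, "fingerprint": toy_fingerprint(c, vocabulary)} for c in chunks]
    return vocabulary, cabinet


# Made-up sample document with three paragraphs.
document = """Employees accrue paid vacation every month.

Expense reports are due by the fifth of each month.

Laptops must be returned on the last day of employment."""

vocabulary, cabinet = build_cabinet(document)
print(len(cabinet), "chunks filed;", len(vocabulary), "numbers per fingerprint")
```

The shape is the point here: one fingerprint per chunk, stored next to the text it came from, so the text can be handed over later when its fingerprint matches.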
Second stage: someone asks a question in real time. The model reads the question and gives it a fingerprint too. It then compares that fingerprint against everything in the cabinet (the vector DB) and pulls out the few paragraphs that match most closely, that is, the ones most likely to hold the right answer or data.
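One common way to measure how closely two fingerprints match is cosine similarity, which scores two lists of numbers by how much they point in the same direction. Here is a minimal sketch, assuming made-up three-number fingerprints (the real all-MiniLM-L6-v2 fingerprints have 384 numbers each); the `cabinet` contents and the `top_matches` helper are hypothetical, and a real vector database would do this search for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Score two fingerprints: near 1.0 means same direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(question_fp: list[float], cabinet: list[tuple], k: int = 2) -> list[tuple]:
    """Return the k (score, text) pairs whose fingerprints best match the question."""
    scored = [(cosine_similarity(question_fp, fp), text) for text, fp in cabinet]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up fingerprints, for illustration only.
cabinet = [
    ("Paragraph about vacation policy", [0.9, 0.1, 0.0]),
    ("Paragraph about expense reports", [0.1, 0.9, 0.1]),
    ("Paragraph about laptop returns", [0.0, 0.2, 0.9]),
]
question_fingerprint = [0.8, 0.2, 0.1]  # made-up fingerprint of a vacation question

results = top_matches(question_fingerprint, cabinet)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The vacation paragraph comes out on top because its numbers line up most closely with the question's numbers, regardless of which words either one used.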
Those few paragraphs get handed to the agent or LLM along with the question, essentially saying “Here’s all the relevant material, now answer the user’s prompt.” Think of a librarian helping someone search with the Dewey Decimal System, except it’s an agent doing that for the mountains of data you wouldn’t want to read through yourself.
The agent or LLM will then respond based on real, specific paragraphs instead of guessing.
Every app that says you can chat with your data is really two halves: find the right pages (the RAG half), then write the eloquent answer (the agent/LLM half).
That little model above is my favorite for this because of its functionality and smaller size. It functions as the finder, not the talker. It is the librarian who quietly pulls the three right pages off the shelf so the famous agent can do the talking.
It will never get the credit, but almost nothing works without it.
So why is that model above the most downloaded model on Hugging Face right now?
It’s small and cheap. It runs fast on an ordinary computer. This is why builders reach for it without hesitation.
It’s good enough. For most real-world jobs, its results are plenty accurate.
It is the default. This is the bigger one. The most popular free tools and tutorials for building these apps use this exact model as their built-in starting point. So every time a dev follows a tutorial, runs a test, or rebuilds their app, this model gets pulled down automatically (usually over and over).
Last point, think of LangChain. If you are a builder, you have probably heard of it; if not, you probably haven’t. Either way, it makes a good example.
You hand LangChain a stack of documents, say, your company’s 500 page handbook.
LangChain is the thing that chops it into pieces; these are the chunks I mentioned. For each of those chunks, LangChain calls all-MiniLM-L6-v2 and says “give me the meaning fingerprint of this paragraph.” The model hands back that fingerprint as a vector, which is just a list of numbers.
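To make the chopping step concrete, here is a minimal plain-Python sketch of one common approach: fixed-size word windows that overlap a little, so a sentence cut at a boundary still appears whole in one of the chunks. This is not LangChain's own code (LangChain ships its own text splitters); the `chunk_words` helper, the chunk size of 50 words, and the overlap of 10 words are all assumptions chosen for illustration, and the "handbook" is placeholder text.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of up to chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window slides each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Placeholder "handbook" of 120 numbered words, so the overlap is easy to see.
handbook = " ".join(f"word{i}" for i in range(120))
chunks = chunk_words(handbook, chunk_size=50, overlap=10)

for number, chunk in enumerate(chunks, start=1):
    first, last = chunk.split()[0], chunk.split()[-1]
    print(f"chunk {number}: {first} ... {last}")
```

Each chunk would then be sent to the model for its fingerprint. Chunk size is a tuning choice: smaller chunks match more precisely, larger ones carry more context.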
LangChain then takes all those fingerprints and files them in the vector database we mentioned earlier. Now let’s say an employee asks, “How many sick days do I get?”
Once the user starts prompting, LangChain runs the question through the exact same model to get its own fingerprint for the prompt.
The database compares that fingerprint with the stored ones and hands back the 3-4 most relevant chunks (the ones about the user’s sick leave question), even if the employee handbook never uses the words “sick days” and instead says “paid medical absence”. That is the meaning/fingerprint matching at work!
LangChain then takes those retrieved paragraphs and staples them onto the question before sending it to the agent/LLM, roughly like: “Using the following text from the handbook: […the relevant paragraphs…], answer the question: How many sick days do I get?” This all happens in the background, which is why it’s used by basically everyone building in the AI realm and thanked by no one.
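The stapling step is ordinary string assembly. Here is a minimal sketch, assuming the retrieved paragraphs arrive as a list of strings; the `build_prompt` helper and the handbook sentences are hypothetical, and real frameworks use their own prompt templates.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Staple the retrieved paragraphs onto the question, numbered for readability."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Using the following text from the handbook:\n\n"
        f"{context}\n\n"
        f"Answer the question: {question}"
    )


# Hypothetical paragraphs, as if the vector database had just returned them.
retrieved = [
    "Employees are entitled to paid medical absence each calendar year.",
    "Paid medical absence must be reported to a manager.",
]

prompt = build_prompt("How many sick days do I get?", retrieved)
print(prompt)
```

The resulting string is what actually reaches the agent/LLM, so the model only ever sees the few paragraphs the librarian picked, never the whole handbook.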
all-MiniLM-L6-v2 is the translator at both ends: it turns your documents into fingerprints once, and it turns each incoming question into a fingerprint so that something like LangChain can match it against them before anything reaches your agent.
LangChain is the conveyor belt moving everything between the translator, filing cabinet and the agent.
LangChain + all-MiniLM-L6-v2 = your very own data librarian.
What do you think?
God willing, see you in the next letter.
GRACE & PEACE







