When ChatGPT launched in 2022, chatbots were everywhere and seemed to be the answer to everything. Three years on, have they really solved our problems, or are we still searching?
I’ve been working hands-on with these tools in the Microsoft ecosystem and would like to share some thoughts. Even if you’re not a tech person, if you want a clear understanding of the topic, just dive in.
⏱️ Reading time: 4 min
The Dream: A “ChatGPT Experience” for Internal Documents
Within large organizations, you’ve probably heard people discussing the dream of a “ChatGPT experience” for internal documents: a sort of super-specialized Google that answers any question perfectly. LLMs are impressive, but are they perfect? Are they accurate enough for company-grade usage?
The Structural Limit of RAG
RAG stands for Retrieval-Augmented Generation. It’s a buzzword, but the core idea is simple: you feed a “GPT brain” with pieces of knowledge retrieved from a database, called an index, made up of document slices called “chunks”.
For those worried that their chatbot is “learning” from their knowledge base: rest assured, that’s not how it works. Each time you ask a question, the chatbot spins up a fresh GPT instance, loading the conversation history and the most relevant chunks retrieved from your index.
It’s not remembering your data; it’s just simulating a conversation using what you provide, and all of that happens in real time.
If you get this, you understand the core concept of RAG:
- The LLM doesn’t know your documents by heart.
- It only “knows” what you give it at each prompt, based on what the retrieval system finds relevant.
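The loop described above can be sketched in a few lines of Python. This is a toy, not the real thing: retrieval here is simple word overlap (real systems use embeddings and a vector index), and the `retrieve` and `build_prompt` names are invented for illustration. The point it shows is that the model only ever sees what gets assembled into the prompt at each turn.

```python
def words(text):
    # Normalise to lowercase words without trailing punctuation.
    return {w.strip(".,?!").lower() for w in text.split()}

def retrieve(question, chunks, top_k=2):
    # Score each chunk by how many words it shares with the question.
    q = words(question)
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:top_k]

def build_prompt(question, history, chunks):
    # Everything the model "knows" for this turn is assembled right here:
    # no memory of your index, no learning between questions.
    return (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(chunks) + "\n\n"
        "History:\n" + "\n".join(history) + "\n\n"
        "Question: " + question
    )

chunks = [
    "Expense reports are due on the 5th of each month.",
    "The cafeteria opens at 8am.",
    "Travel expenses require a manager approval.",
]
context = retrieve("When are expense reports due?", chunks)
prompt = build_prompt("When are expense reports due?", [], context)
```

Run it twice with different questions and you get two independent prompts; nothing persists between calls, which is exactly why the chatbot is not “learning” your documents.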
The Catch: Chunks and Context
But here’s the catch:
Chatbots are only as good as the chunks they retrieve and the amount of data they can fit into the “GPT brain”. The process involves converting your documents into high-dimensional vectors called embeddings, storing them in a vector database, searching for the closest matches to your query, and hoping the system selects the right context for the model to generate a useful answer. If the relevant information isn’t in the retrieved chunks, the answer will miss the mark, no matter how powerful the LLM is.
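The “closest match” step can be sketched like this. The three-dimensional vectors and chunk names below are made up purely to show the math; real embeddings come from an embedding model and have hundreds or thousands of dimensions. What matters is that the geometrically closest chunk wins, whether or not it actually contains the answer.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical chunk embeddings; in reality these are produced by a model.
index = {
    "chunk about vacation policy": [0.9, 0.1, 0.0],
    "chunk about expense reports": [0.1, 0.9, 0.2],
    "chunk about office hours":    [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.1]  # pretend this encodes "how do I file expenses?"

# Pick whichever chunk is closest to the query in vector space.
best = max(index, key=lambda name: cosine_similarity(query, index[name]))
```

If the real answer lives in a chunk that was never embedded, or is just far away in this space, no LLM downstream can recover it.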
The Trap: Over-Optimizing Chunking and Indexing
And here is the trap:
Commonly, when we have a problem, we solve it. So here, we want to improve the index and the quality of the chunks we retrieve.
For example, we can add the ability to filter the index before retrieving chunks, increasing our chances of getting the right ones. We can also slice documents manually, finding the right separators in the text, crafting the perfect chunks, and then storing those chunks in the index. This is work you may want to do, and it costs a huge amount of time, whereas by default you can rely on Azure AI Search’s built-in capabilities, which do the chunking job for you in a few clicks. The result will not be as good as your handmade indexing, but the differences are, in the end, minor.
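To make the “manual slicing” concrete, here is a minimal sketch of separator-based chunking. The choice of markdown headings as the separator is just an assumption for the example; the real work lies in picking separators that suit your documents.

```python
def chunk_by_headings(document):
    # Split a document into chunks, starting a new chunk at each
    # "## " heading so related text stays together.
    chunks, current = [], []
    for line in document.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = """## Vacation policy
Employees get 25 days per year.

## Expenses
Reports are due on the 5th."""

pieces = chunk_by_headings(doc)
```

Compare this to blind fixed-size slicing, which can cut a policy in half mid-sentence; this is the quality gain handmade chunking buys you, at the cost of your time.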
So what can you do? We have a problem; we should do something about it, right?
It’s counterintuitive, but personally, I would do nothing. Of course, I would stay up to date by using the latest models to improve performance and context window, but that’s it. I would stick with the out-of-the-box setup as much as possible, because while you spend time fixing structural issues, the industry is evolving very fast and your problem may simply disappear with time.
Another Trap: Testing and Evaluating Chatbots
There’s another trap I want to highlight: it’s surprisingly difficult to test a chatbot and judge the accuracy of its answers.
Often, we rely on a subject matter expert (SME) to provide the “expected” answer for comparison. But sometimes, the SME’s answer is actually outdated or off, and the chatbot, pulling from the latest documentation, is correct.
Even more common is a mismatch in expectations. The chatbot answers the question literally and concisely, while the SME expects a longer or more detailed response. The SME might mark the bot as “wrong,” when in fact it answered exactly what was asked.
And here’s something people often overlook: even the interface can impact how the answer is perceived. A well-designed interface can make a concise answer feel clear and authoritative, while a clunky interface might make even a correct answer seem less trustworthy or useful.
The takeaway:
Testing chatbots isn’t just about matching answers to expectations. The way a question is asked, and even how the answer is presented, shapes our perception. For fair evaluation, make sure questions are clear and judge answers based on what was actually asked.
The Final Word
I believe the value of chatbots lies in their general adoption. A not-perfect chatbot in every pocket will probably bring more value to an organisation than a perfect chatbot in a few pockets, so spend your time wisely! (and embrace low code 🤭)