In a previous episode, I ran quite a few tests and experiments to understand how to code for an MO5.
I had asked the AI to summarize what we had learned into Markdown files. The underlying idea was to reuse that knowledge across my new MO5 projects without having to copy the .md files into each repository.
Spoiler alert: vibe coding costs money 😢
The RAG server idea
One simple way to share knowledge and context is through a RAG server.
A RAG server (Retrieval-Augmented Generation) is basically a search API, but one designed to feed targeted context to AI assistants (MO5 knowledge, in my case).
At first, I had a very naïve view of the implementation:
- store documents in a database
- perform a keyword search (like SQL LIKE)
- return the passages containing those words through an API
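In code, that naive version boils down to something like this (a quick C# sketch; the Passage record and the in-memory list stand in for a real database table):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A passage of documentation; in the real version this would be a database row.
public record Passage(string Document, string Text);

public static class NaiveSearch
{
    // Keyword search: keep every passage containing at least one word of the query.
    // Roughly the equivalent of one SQL "WHERE Text LIKE '%word%'" per word.
    public static List<Passage> Search(IEnumerable<Passage> passages, string query)
    {
        var words = query.ToLowerInvariant()
                         .Split(' ', StringSplitOptions.RemoveEmptyEntries);

        return passages
            .Where(p => words.Any(w => p.Text.ToLowerInvariant().Contains(w)))
            .ToList();
    }
}
```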
After doing some research, this naïve vision turned out to be full of drawbacks.
Why keyword search doesn’t work
- No semantic understanding: if the user asks “How do I authenticate a user?” but the document talks about session management and JWT tokens, no words match—even though the content is relevant.
- Same word, different meaning: the meaning of “public key” is not the same in cryptography, networking, or databases. Lexical search cannot disambiguate context.
- Fragile to rephrasing: plurals, synonyms, paraphrases, typos, etc.
In short, keyword search does not understand meaning.
It fails as soon as you rephrase or express a concept differently. It’s clearly not the right model.
Chunking, embeddings, and black magic
Once you start digging into RAG servers, the terms chunks, embeddings, and cosine similarity quickly show up everywhere.
Before working on this project, I had no idea these concepts even existed (and yet we use them every day).
A RAG system:
- uses embeddings to represent the meaning of text as vectors
- compares these vectors using cosine similarity to retrieve semantically close passages, even without identical words
- relies on chunking to index coherent pieces rather than entire documents
Result:
- better context
- less noise
- fewer hallucinations
- far more reliable answers than with simple text search
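Chunking, at least, is not black magic: a minimal fixed-size version with overlap looks like this (a sketch; real implementations usually cut on sentence or heading boundaries instead of raw character counts):

```csharp
using System;
using System.Collections.Generic;

public static class Chunker
{
    // Split a document into overlapping, fixed-size chunks so that each indexed
    // piece stays coherent. The sizes are arbitrary here; tune them to your docs.
    public static IEnumerable<string> Chunk(string text, int size = 800, int overlap = 100)
    {
        for (int start = 0; start < text.Length; start += size - overlap)
        {
            int length = Math.Min(size, text.Length - start);
            yield return text.Substring(start, length);
            if (start + length >= text.Length) yield break;
        }
    }
}
```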
For me, the truly magical part is embeddings: giving a numerical representation to the meaning of a text.
It may be obvious to some, but not to me: the idea that someone managed to mathematically formalize the meaning of a sentence is, quite simply, mind-blowing 😄
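And once every chunk has its vector, retrieval is just "find the vectors closest to the query vector". Cosine similarity, the usual metric, fits in a few lines of C#:

```csharp
using System;

public static class Similarity
{
    // Cosine similarity between two embedding vectors of the same length:
    // close to 1 = same direction (similar meaning), close to 0 = unrelated.
    public static double Cosine(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```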
OpenAI or local?
When I discussed my project with ChatGPT, it obviously recommended interfacing with OpenAI for embeddings.
Even though an AI service performs better (better semantics, faster, multilingual), I wanted to keep things as cheap as possible.
My goal is to deploy this API on the Internet so it can be used by the community. If it ever became truly popular, AI costs could quickly limit my ambitions.
The target architecture must be able to abstract the embeddings implementation:
- a local, “home-made” provider
- or an AI API like OpenAI / Azure OpenAI (you never know, I might change my mind)
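In practice, I picture this as a small interface that every provider implements (the names below are mine, not an established API):

```csharp
using System.Threading.Tasks;

// One interface, several implementations: a local TF-IDF provider today,
// an OpenAI / Azure OpenAI provider tomorrow if I change my mind.
public interface IEmbeddingProvider
{
    Task<float[]> EmbedAsync(string text);
}
```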
Local embeddings
TF-IDF
TF-IDF (Term Frequency – Inverse Document Frequency) is a classic technique for generating embeddings.
In short:
- TF: how often does a word appear in the text?
- IDF: is this word rare or common across all documents?
A word that is rare across the corpus but frequent in a given document is considered important for that document's meaning.
Advantages:
- everything is computed locally
- no external API
- no AI cost
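For the curious, here is roughly what a minimal TF-IDF vectorizer looks like in C# (a simplified sketch, not my actual implementation: naive tokenization, no stop words, no stemming):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Build TF-IDF vectors for a small corpus. Every document becomes a vector
// indexed by the corpus vocabulary; rare-but-present words get a high weight.
public class TfIdfVectorizer
{
    private readonly List<string> _vocabulary = new();
    private readonly Dictionary<string, double> _idf = new();

    private static string[] Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '.', ',', ';', ':', '\n' }, StringSplitOptions.RemoveEmptyEntries);

    public void Fit(IReadOnlyList<string> documents)
    {
        var tokenized = documents.Select(Tokenize).ToList();
        _vocabulary.Clear();
        _vocabulary.AddRange(tokenized.SelectMany(t => t).Distinct());

        foreach (var term in _vocabulary)
        {
            int docsWithTerm = tokenized.Count(doc => doc.Contains(term));
            // Smoothed IDF: the rarer the term across the corpus, the higher the weight.
            _idf[term] = Math.Log((double)documents.Count / (1 + docsWithTerm)) + 1;
        }
    }

    public float[] Transform(string text)
    {
        var tokens = Tokenize(text);
        var vector = new float[_vocabulary.Count];
        if (tokens.Length == 0) return vector;

        for (int i = 0; i < _vocabulary.Count; i++)
        {
            double tf = tokens.Count(t => t == _vocabulary[i]) / (double)tokens.Length;
            vector[i] = (float)(tf * _idf[_vocabulary[i]]);
        }
        return vector;
    }
}
```

Wrapping something like this behind the IEmbeddingProvider interface above is then just a matter of calling Fit at indexing time and Transform for each query.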
Neural model
Another option: a pre-trained neural model.
The idea:
- a deep learning model transforms text into a dense vector
- semantically similar texts have similar vectors, even with different words
This is usually more accurate in terms of relevance, but:
- heavier
- often based on Python scripts
- slower response times
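One common way to plug such a model behind the same abstraction is to run it as a small local service (often a Python script) and call it over HTTP from the C# API. The endpoint, port, and JSON shape below are hypothetical, purely to illustrate the wiring:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

// Hypothetical provider delegating embedding to a local neural model exposed
// over HTTP (e.g. a small Python service). URL and payload are made up.
public class NeuralEmbeddingProvider : IEmbeddingProvider
{
    private readonly HttpClient _http = new() { BaseAddress = new Uri("http://localhost:8001") };

    public async Task<float[]> EmbedAsync(string text)
    {
        var response = await _http.PostAsJsonAsync("/embed", new { text });
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadFromJsonAsync<float[]>() ?? Array.Empty<float>();
    }
}
```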
Deployment on Raspberry Pi
At the end of November, I started coding with Augment (Indie Plan subscription at $20/month, 40,000 credits).
API setup, unit tests, TF-IDF implementation, neural model—everything was going smoothly.
After setting up my NAS, I moved the sources to my new server and wanted to deploy the API there.
I asked Augment to create a Docker image for deployment on a Raspberry Pi (I deployed the TF-IDF implementation).
We spent the evening together:
- unsuitable images
- API bugs
- configuration issues
But around quarter to pumpkin (11:45 PM for those who don’t get the Cinderella reference), everything was working and deployed on the NAS.
Response time: ~50 ms. Very decent for a Raspberry Pi.
The neural model… and the cold shower
The next day, I thought that more relevant results would be better.
So I asked Augment to deploy the neural model.
The entire afternoon was spent on it:
- incompatible images
- Python version issues
- bugs in the C# implementation
- deployment time
When I finally looked up, it was dark outside. It was a little after 6 PM.
Good news:
- everything worked
- more relevant answers
Bad news:
- 40 seconds per response
Conclusion: hosting a neural model on a Raspberry Pi is not the idea of the century…
Rollback to the simple, fast, and efficient model.
The real cost of vibe coding
To build this API, I delegated almost everything to Augment.
I pushed vibe coding very far, even asking it to run compilation and deployment commands for me (peak laziness).
Technically, it works very well.
The project is functional.
But the downside:
- no pride: it’s not really my work
- I didn’t learn anything deeply
- vibe coding is expensive: in one evening and one afternoon, I almost burned through my entire monthly quota
I estimate having spent around 47,000 credits on this project (over the course of three half-days, while my monthly quota is 40,000 credits), and that’s not even mentioning my carbon footprint…

Conclusion
When moving to production, I will probably need to rely on an AI like OpenAI or Azure OpenAI to achieve better semantic analysis performance. You can’t do everything on your own, and sometimes you have to accept delegating to the professionals 😄.
A functional project, technically successful, but with a rather bitter personal aftertaste. Next time, I’ll be more involved and won’t let the AI do everything. After all, it’s my project, not its own 😄