Following the previous episode (/mo5-blog/days/day-6-rag-server/), I wanted to go one step further:
deploy my RAG server on the Internet.

The goal was twofold:

  • make the server accessible from the outside
  • give coding agents (Copilot, Augment, etc.) precise context to help with MO5 development (the project is described here: https://retrocomputing-ai.cloud/)

Deploying a RAG, but not just an API

To connect coding agents with my RAG server, I created an MCP server.
It acts as a standardized interface between AI tools (Copilot, Augment, etc.) and my RAG API.

In practice, setting up the MCP server was almost trivial.
I did not have to modify anything fundamental in the RAG; all the building blocks already existed.

The MCP server is neither a new engine nor a complex layer.
It is simply a small interface / protocol layer that acts as an intermediary:

  • it receives structured requests from the coding agent
  • it adapts them to the format expected by the RAG API
  • it returns responses in a format the agent can use directly

All the intelligence therefore remains on the RAG side.
The MCP only translates and orchestrates.
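This "translate and orchestrate" role can be sketched in a few lines of Python. The endpoint path (/search), field names (query, topK, results, source, text), and default top-k value below are illustrative assumptions, not the actual API contract — the real server is at https://github.com/thlg057/mo5-mcp-server and the real endpoints are documented in the Swagger UI:

```python
# Sketch of the MCP "translation" role (hypothetical endpoint and field names;
# the real MCP server lives at https://github.com/thlg057/mo5-mcp-server).
import json
import urllib.request

RAG_API_URL = "https://retrocomputing-ai.cloud/api"  # base URL of the RAG API


def to_rag_request(tool_args: dict) -> dict:
    """Adapt an agent tool call to the payload the RAG API expects."""
    return {"query": tool_args["question"], "topK": tool_args.get("top_k", 5)}


def to_agent_response(rag_json: dict) -> str:
    """Flatten RAG results into text a coding agent can consume directly."""
    chunks = rag_json.get("results", [])
    return "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)


def search(tool_args: dict) -> str:
    """Forward one call end to end: agent -> RAG API -> agent (network call)."""
    payload = json.dumps(to_rag_request(tool_args)).encode()
    req = urllib.request.Request(
        f"{RAG_API_URL}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return to_agent_response(json.load(resp))
```

All the retrieval logic stays behind the API; the MCP layer only reshapes requests and responses.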

Since I was exposing something on the Internet anyway, I did not want to deploy just a raw API.
With a domain name, I might as well use it to:

  • host a small HTML site explaining how to configure MCP with Copilot or Augment
  • properly expose the API
  • document everything with Swagger

The MCP server sources, along with a Markdown page explaining how it works, are available here:

👉 https://github.com/thlg057/mo5-mcp-server


Hosting choice

I wanted to be able to deploy my Docker image easily.
Looking at what was available, I quickly narrowed the options down to VPS offerings.

After comparing prices and configurations, I chose:

  • Hostinger – KVM1 plan
  • 4.99 € / month
  • 1 vCPU
  • 4 GB of RAM

It is not a large configuration, but it should be sufficient for my usage.
A nice bonus: the domain name is included, perfect to give the project a real URL.

Hostinger


Deployment architecture

For URL management, I chose Caddy.

The idea is simple:

  • / → redirect to my blog
  • /api → my RAG API
  • /swagger → interactive API documentation
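
The routing above can be sketched as a minimal Caddyfile. The container name (rag-api), port (8080), and blog URL are placeholders, not the actual production config:

```caddyfile
# Hypothetical Caddyfile sketch: upstream names, port, and blog URL
# are assumptions, not the real deployed configuration.
retrocomputing-ai.cloud {
    handle /api/* {
        reverse_proxy rag-api:8080
    }
    handle /swagger* {
        reverse_proxy rag-api:8080
    }
    handle {
        redir https://example.com/mo5-blog/ temporary
    }
}
```

Caddy also handles the HTTPS certificates automatically, which is one less thing to configure on the VPS.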

On the server side, I prepared a deployment directory with:

  • the blog (Hugo)
  • the API sources
  • a docker-compose.yml
  • the Caddy configuration

Everything was copied to the Hostinger VPS and installed with no particular issues; the whole process is simple and well designed.


First shock: performance

On my Raspberry Pi 4, I had already noticed an issue: around 30 seconds to get a response.

According to Augment, the diagnosis was clear: the Raspberry was simply not powerful enough.

I therefore expected a clear improvement on the VPS.

First production test… disaster: still ~30 seconds, and not very good results 😬.

It was clearly time to dig deeper…


1. Performance and architecture (speed)

The problem

Initially, I was using a local embedding service:

  • the .NET code called Python scripts
  • the model was loaded on the fly

Result:

  • the model was loaded for each request
  • CPU saturation
  • unstable application

At first, I thought the issue came from the database (poorly optimized queries, missing indexes, etc.).

After adding quite a lot of logs, the verdict was clear: the bottleneck was not the database but the generation of chunks and embeddings.

Analysis

Loading a deep learning model, even a “small” one like E5, is a heavy operation. Doing it for every request is completely inefficient.

What was needed was an architecture where the model stays “warm”, loaded only once in memory.

What was implemented

  • Dedicated microservice
    An independent Python API, based on FastAPI, running in its own Docker container.

  • Single model loading
    The multilingual-e5-small model is loaded only once at service startup.

  • HTTP communication
    The .NET server now talks to the embedding service through simple, fast JSON-over-HTTP requests.

Result

Processing time went:

  • from several seconds per chunk
  • to a few milliseconds

Now things start to feel much better.


2. Search quality (SimilarityScore)

The problem

With the initial model:

  • similarity scores around 0.60
  • often poorly relevant results
  • generic sections like “Errors to avoid” showing up all the time

In short, the AI struggled to understand MO5-specific technical nuances.
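
For reference, the SimilarityScore values above are presumably cosine similarity between embedding vectors (the usual metric with E5-style models — an assumption on my part, shown here with toy 3-dimensional vectors):

```python
# Cosine similarity between two embedding vectors (assumed to be what
# SimilarityScore measures); toy 3-dimensional vectors for illustration.
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between vectors a and b (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

A score of 0.60 therefore means "vaguely related", which matches the mediocre results I was seeing.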

Analysis

Two main causes:

1. The model

The original model did not perform well enough on:

  • technical language
  • multilingual content (French / English)

2. Loss of context

Once split into chunks, the engine:

  • saw a list of instructions
  • but forgot which document and which section they came from

For example, it no longer knew whether it was about:

  • text mode
  • or graphics mode

What was implemented

  • Model change
    Switched to intfloat/multilingual-e5-small; the baseline score jumped from 0.61 to 0.86.

  • Semantic enrichment
    The C# code was modified to inject the document title and the section title into each chunk sent to the AI.

  • Markdown cleanup
    Removal of characters like #, **, etc., to keep only plain text during indexing.
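
The enrichment and cleanup steps can be sketched as follows. The real code is C#; the function names, the exact cleanup rules, and the "title - section" format below are illustrative, not the actual implementation:

```python
# Sketch of the two fixes (the real code is C#; names and formats are illustrative).
import re


def clean_markdown(text: str) -> str:
    """Strip Markdown markers (#, **, *, `) to keep plain text for indexing."""
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"\*{1,2}([^*]+)\*{1,2}", r"\1", text)        # bold / italic
    return text.replace("`", "").strip()


def enrich_chunk(doc_title: str, section: str, chunk: str) -> str:
    """Prepend document and section titles so each chunk keeps its context."""
    return f"{doc_title} - {section}\n{clean_markdown(chunk)}"
```

After this change, a chunk about sprite registers no longer floats free: it carries "which document, which section" into the embedding, so the engine can tell text mode from graphics mode.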

Result

Technical documents now consistently rank number one for hardware-related queries (for example NMI).
The AI finally understands the global context of each page.


Empirical fine-tuning

I ran many tests to refine the behavior:

  • removing #, *, etc.
  • chunk size
  • overlap size
  • order of contextual fields injected

Everything was done in an empirical way, through testing and comparisons.
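
For readers unfamiliar with the chunk-size / overlap knobs, here is what such a splitter looks like. The sizes below (and the character-based splitting itself) are placeholders for illustration — the values I actually settled on came out of the testing described above:

```python
# Illustrative chunker: the real chunk/overlap sizes were tuned empirically;
# the defaults here (400 chars, 80 overlap) are placeholders, not the final ones.
def chunk_text(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into fixed-size chunks, each sharing `overlap` chars with its neighbor."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Larger chunks keep more context per embedding but dilute the signal; more overlap reduces the risk of cutting a sentence in half at a chunk boundary. There is no universal right answer, hence the empirical approach.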


Valuable help from AIs

I do not know much about neural models.
On that front:

  • Gemini helped me a lot
    • deploying the Python service
    • choosing the multilingual-e5-small model

Honestly, without this help, it would have taken much longer (and probably been more painful 😅).


Testing the RAG in practice

To test the RAG concretely, the easiest way is to use the following site: 👉 https://retrocomputing-ai.cloud/

It is a blog page that explains step by step how to use the server via a coding agent (Copilot, Augment, etc.), relying on the MCP server.

If you just want to explore the API without using an agent, the Swagger documentation is available here: 👉 https://retrocomputing-ai.cloud/swagger

You will find the complete list of endpoints, request formats, and example calls to quickly test the RAG.


Current documentation status

The documentation used by the RAG is still being updated.

Following my latest explorations of the MO5 codebase, especially everything related to graphics modes, I am currently reviewing and enriching the documentation files.

This means that:

  • some parts are already very precise (especially hardware-related)
  • others will continue to evolve as content is added
  • RAG results will keep improving as the documentation grows

In short, the server is operational, but the content it relies on is still alive (and that is also what makes the experience interesting).


Conclusion

This deployment helped me understand one essential thing:

A RAG that “works” is not necessarily a RAG that is usable.

Between:

  • architecture
  • performance
  • embedding quality
  • injected context

there are many parameters to adjust.

But once the right choices are made, the gain is immediate and really satisfying.

More in the next episode 🙂