Scaling LLM-Powered RAG (a.k.a. AI-Enabled) Applications
LLM-powered RAG or AI applications don’t scale smoothly by default. Rate limits, response accuracy, and prompt consistency can break performance and reliability if not handled properly.
Before you deploy or demo your system, be ready for questions like:
- How do you handle rate limits?
- How do you ensure responses improve over time?
- How do you test prompt changes without breaking things?
Rate Limits
LLM APIs typically enforce rate limits, usually defined as requests per minute (RPM). If you run into rate-limit errors, here are a few approaches to consider:
Retry with Delay (Exponential Backoff)
Instead of hammering the API continuously, space out your retries. With Spring's @Retryable annotation, you can automatically retry failed calls. Adding exponential backoff increases the delay each time, helping you avoid hitting the limit repeatedly.
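Here is a minimal sketch using Spring Retry (the `retryFor` attribute is from Spring Retry 2.x; older versions use `value` instead). `LlmClient` is a hypothetical wrapper around your provider's SDK, not a real library class:

```java
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;
import org.springframework.web.client.HttpClientErrorException;

// Hypothetical wrapper around your LLM provider's SDK
interface LlmClient {
    String complete(String prompt);
}

@Service
public class ChatService {

    private final LlmClient llmClient;

    public ChatService(LlmClient llmClient) {
        this.llmClient = llmClient;
    }

    // Retry up to 5 times when the provider returns HTTP 429,
    // waiting 1s, 2s, 4s, 8s between attempts (exponential backoff).
    // Requires @EnableRetry on a @Configuration class.
    @Retryable(
            retryFor = HttpClientErrorException.TooManyRequests.class,
            maxAttempts = 5,
            backoff = @Backoff(delay = 1000, multiplier = 2))
    public String ask(String prompt) {
        return llmClient.complete(prompt);
    }
}
```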
Use a Fallback Client
If your main LLM is throttled, switch to a backup provider (e.g., Anthropic, Google Vertex AI, Cohere). With Resilience4j's @CircuitBreaker annotation, which integrates with Spring Boot, you can detect repeated failures and route requests to a secondary provider. This keeps your system running smoothly without overwhelming one service.
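A sketch using Resilience4j's Spring Boot starter, reusing the `LlmClient` placeholder from above. The breaker's thresholds would live in configuration (e.g., `resilience4j.circuitbreaker.instances.primaryLlm.*` properties):

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

@Service
public class ResilientChatService {

    private final LlmClient primary;   // e.g., OpenAI (placeholder wiring)
    private final LlmClient secondary; // e.g., Anthropic (placeholder wiring)

    public ResilientChatService(LlmClient primary, LlmClient secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    // After repeated failures, the "primaryLlm" breaker opens and calls
    // are short-circuited straight to the fallback method below.
    @CircuitBreaker(name = "primaryLlm", fallbackMethod = "askSecondary")
    public String ask(String prompt) {
        return primary.complete(prompt);
    }

    // The fallback must match the original signature plus a Throwable parameter
    private String askSecondary(String prompt, Throwable cause) {
        return secondary.complete(prompt);
    }
}
```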
Use a Queue
If real-time responses aren’t critical, queue incoming requests and process them as capacity frees up. Tools like Redis, Kafka, or in-memory queues can store requests until the rate limit allows more traffic. This way, you won’t lose any requests, and you avoid excessive retries that could worsen rate limiting.
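For illustration, here is a minimal in-memory version that drains one request per interval so the RPM budget is never exceeded. In production you would back this with Redis or Kafka so queued requests survive restarts; `LlmClient` is the same placeholder as above:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LlmRequestQueue {

    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final LlmClient llmClient; // placeholder wrapper from earlier

    public LlmRequestQueue(LlmClient llmClient, int requestsPerMinute) {
        this.llmClient = llmClient;
        // Process one queued request per interval, staying under the RPM limit
        long intervalMillis = 60_000L / requestsPerMinute;
        ScheduledExecutorService worker = Executors.newSingleThreadScheduledExecutor();
        worker.scheduleAtFixedRate(this::processNext, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    public void submit(String prompt) {
        pending.offer(prompt); // the request is queued, never dropped
    }

    private void processNext() {
        String prompt = pending.poll();
        if (prompt != null) {
            String response = llmClient.complete(prompt);
            System.out.println(response); // hand off to a callback or websocket in practice
        }
    }
}
```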
Feedback
LLMs are more about approximation than precision. So, how do we refine those approximations? Collecting feedback through likes and dislikes is a start, but it's the dislikes, especially those with detailed comments, that give us real insight into what went wrong.
Where do these comments go?
- Each row in the vector database typically has an `id`, an embedding, and some metadata.
- Whenever you receive a dislike, store it in a separate table (say, `tblfeedback`) along with any accompanying comments.
- Link these entries to the relevant `id` or query from your vector database.
- Next time you run a similar query, check `tblfeedback` for any related `id` or comment. If it exists, pass that information to the LLM so it can refine or correct the response.
- If multiple comments pile up for the same `id`, it's a signal that you should either fix the data manually or adjust your prompt.
This feedback loop ensures you continuously improve your application’s responses — even when the underlying LLM can’t deliver perfect precision on its own.
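A sketch of the storage side, assuming a hypothetical `tblfeedback` table with `vector_id`, `query`, and `comment` columns (the schema is illustrative, not prescribed):

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;

import java.util.List;

@Repository
public class FeedbackRepository {

    private final JdbcTemplate jdbc;

    public FeedbackRepository(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Store a dislike, linked to the vector-store row that produced the answer
    public void saveDislike(String vectorId, String query, String comment) {
        jdbc.update(
                "INSERT INTO tblfeedback (vector_id, query, comment) VALUES (?, ?, ?)",
                vectorId, query, comment);
    }

    // Fetch past complaints for this row so they can be appended to the prompt
    public List<String> commentsFor(String vectorId) {
        return jdbc.queryForList(
                "SELECT comment FROM tblfeedback WHERE vector_id = ?",
                String.class, vectorId);
    }
}
```

Before answering a similar query, call `commentsFor(id)` and append whatever it returns to the prompt (for example, as "known issues to avoid"), so the LLM can steer away from past mistakes.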
Test-Driven Prompt Changes
Once your LLM-powered application is in production, you’ll inevitably tweak the prompt to handle new requirements. But how do you ensure these changes don’t break existing functionality?
A simple approach: test prompts like you test code.
Steps to Validate Prompt Changes:
- Mock the RAG Responses: to save on costs (and simplify testing), mock the RAG responses or function calls instead of actually querying your vector database.
- Call the LLM with the New Prompt: pass the mocked data and the updated prompt to your LLM, generating a new response.
- Compare the Old and New Responses: convert both responses to embeddings and measure their similarity. If the cosine similarity score is above a threshold you configure (0.9 is a reasonable starting point), you can assume the new prompt hasn't drastically broken existing behavior.
This ensures each new prompt version meets previous expectations, so you can confidently evolve your AI system without breaking old use cases.
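A sketch of such a regression test with JUnit 5. The `callLlm` and `embed` helpers are placeholders you would wire to your chat and embedding clients, and the threshold is yours to tune:

```java
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertTrue;

class PromptRegressionTest {

    private static final double SIMILARITY_THRESHOLD = 0.9; // tune per use case

    @Test
    void newPromptStaysCloseToBaseline() {
        String mockedContext = "Order #123 shipped on 2024-01-02."; // mocked RAG hit
        String question = "When did my order ship?";

        String oldAnswer = callLlm("old-prompt-template", mockedContext, question);
        String newAnswer = callLlm("new-prompt-template", mockedContext, question);

        double similarity = cosineSimilarity(embed(oldAnswer), embed(newAnswer));
        assertTrue(similarity > SIMILARITY_THRESHOLD,
                "Prompt change drifted too far: similarity=" + similarity);
    }

    // Placeholder: call your chat model with the given prompt template
    private String callLlm(String template, String context, String question) {
        throw new UnsupportedOperationException("wire to your LLM client");
    }

    // Placeholder: embed text, using the same embedding model for both answers
    private double[] embed(String text) {
        throw new UnsupportedOperationException("wire to your embedding client");
    }

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```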
Final Thoughts
Scaling AI applications requires handling rate limits, refining feedback loops, and testing prompt changes to avoid surprises.
For performance, Java and Golang are better suited than Python. Since LLMs are primarily offered as services, integrating them with Java is straightforward — especially with Spring AI, which is nearing its 1.0 release. If high performance is your priority, Golang is a great choice due to its efficiency and speed.
Check out Spring AI here: https://spring.io/projects/spring-ai
And finally, a shameless self-promotion: my book on Spring AI → Spring AI For Organization