Scaling LLM-Wrapped RAG (a.k.a. AI-Enabled) Applications

4 min read · Feb 25, 2025

--

LLM-powered RAG or AI applications don’t scale smoothly by default. Rate limits, response accuracy, and prompt consistency can break performance and reliability if not handled properly.

Before you deploy or demo your system, be ready for questions like:

  • How do you handle rate limits?
  • How do you ensure responses improve over time?
  • How do you test prompt changes without breaking things?

Rate Limits

LLMs often have rate limits — usually defined as requests per minute (RPM). If you run into rate-limit errors, here are a few approaches to consider:

Retry with Delay (Exponential Backoff)

Instead of hammering the API continuously, space out your retries. With Spring’s @Retryable annotation, you can automatically retry failed calls. Adding exponential backoff increases the delay each time, helping you avoid hitting the limit repeatedly.
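As a rough sketch of what Spring Retry's @Retryable with @Backoff(delay = ..., multiplier = 2) does declaratively, here is the backoff loop in plain Java. The class and method names are illustrative, not from any library:

```java
import java.util.function.Supplier;

public class BackoffRetry {
    // Retries the call up to maxAttempts times, doubling the wait
    // after each failure (exponential backoff) instead of hammering the API.
    public static <T> T callWithBackoff(Supplier<T> call, int maxAttempts, long initialDelayMs) {
        long delay = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) { // e.g. an HTTP 429 from the LLM provider
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException(ie);
                    }
                    delay *= 2; // wait twice as long before the next attempt
                }
            }
        }
        throw last;
    }
}
```

With Spring Retry you would get the same behavior by annotating the client method instead of writing the loop yourself.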

Use a Fallback Client

If your main LLM is throttled, switch to a backup provider (e.g., Anthropic, Google Vertex AI, Cohere). With Resilience4j’s @CircuitBreaker annotation (which integrates with Spring Boot), you can detect repeated failures and route requests to a secondary provider. This keeps your system running smoothly without overwhelming one service.
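The routing idea can be sketched in plain Java as a hand-rolled stand-in for Resilience4j’s circuit breaker. FallbackClient and its threshold parameter are illustrative names, not part of any library:

```java
import java.util.function.Supplier;

public class FallbackClient {
    private int consecutiveFailures = 0;
    private final int threshold;

    public FallbackClient(int threshold) {
        this.threshold = threshold;
    }

    // Tries the primary LLM provider first; once it has failed `threshold`
    // times in a row, the "circuit opens" and requests go straight to backup.
    public String complete(Supplier<String> primary, Supplier<String> backup) {
        if (consecutiveFailures >= threshold) {
            return backup.get(); // circuit open: skip the throttled provider
        }
        try {
            String answer = primary.get();
            consecutiveFailures = 0; // a success closes the circuit again
            return answer;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            return backup.get();
        }
    }
}
```

Resilience4j adds the pieces this sketch leaves out, such as half-open probing and time-based recovery, so prefer the library in production.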

Use a Queue

If real-time responses aren’t critical, queue incoming requests and process them as capacity frees up. Tools like Redis, Kafka, or in-memory queues can store requests until the rate limit allows more traffic. This way, you won’t lose any requests, and you avoid excessive retries that could worsen rate limiting.
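For the in-memory variant, a minimal sketch might look like the following. RequestQueue and drainBatch are hypothetical names; the idea is simply that each processing window takes at most RPM requests and leaves the rest queued:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class RequestQueue {
    // FIFO buffer of prompts waiting for LLM capacity.
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();

    public void enqueue(String prompt) {
        pending.add(prompt);
    }

    // Takes at most `rpm` requests for the current minute's window;
    // anything beyond the rate limit stays queued for the next window.
    public List<String> drainBatch(int rpm) {
        List<String> batch = new ArrayList<>();
        String prompt;
        while (batch.size() < rpm && (prompt = pending.poll()) != null) {
            batch.add(prompt);
        }
        return batch;
    }

    public int size() {
        return pending.size();
    }
}
```

A scheduler would call drainBatch once per minute and send the batch to the LLM; with Redis or Kafka the queue survives restarts, which this in-memory version does not.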

Feedback

LLMs are always more about approximation than precision. So, how do we refine those approximations? Collecting feedback through likes and dislikes is a start, but it’s the dislikes — especially with detailed comments — that give us real insight into what went wrong.

Where do these comments go?

  • Each row in the vector database typically has an id, an embedding, and some metadata.
  • Whenever you receive a dislike, store it in a separate table — say tblfeedback — along with any accompanying comments.
  • Link these entries to the relevant id or query from your vector database.
  • Next time you run a similar query, check tblfeedback for any related id or comment. If it exists, pass that information to the LLM so it can refine or correct the response.
  • If multiple comments pile up for the same id, it’s a signal that you should either fix the data manually or adjust your prompt.

This feedback loop ensures you continuously improve your application’s responses — even when the underlying LLM can’t deliver perfect precision on its own.
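The loop above can be sketched with an in-memory stand-in for the tblfeedback table. FeedbackStore, correctionContext, and needsManualReview are illustrative names under the assumption that feedback is keyed by the vector row’s id:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FeedbackStore {
    // In-memory stand-in for tblfeedback: vector-row id -> dislike comments.
    private final Map<String, List<String>> byId = new HashMap<>();

    public void recordDislike(String vectorId, String comment) {
        byId.computeIfAbsent(vectorId, k -> new ArrayList<>()).add(comment);
    }

    // Extra context to pass to the LLM when the same id is retrieved again,
    // so it can correct the issues users complained about.
    public String correctionContext(String vectorId) {
        List<String> comments = byId.getOrDefault(vectorId, List.of());
        if (comments.isEmpty()) return "";
        return "Previous answers for this content were disliked. Avoid these issues: "
                + String.join("; ", comments);
    }

    // Many complaints piling up on one id signal that the source data
    // or the prompt needs a manual fix, not another automated retry.
    public boolean needsManualReview(String vectorId, int threshold) {
        return byId.getOrDefault(vectorId, List.of()).size() >= threshold;
    }
}
```

In a real system the map would be a database table linked to the vector store’s ids, but the lookup-then-augment flow is the same.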

Test-Driven Prompt Changes

Once your LLM-powered application is in production, you’ll inevitably tweak the prompt to handle new requirements. But how do you ensure these changes don’t break existing functionality?

A simple approach: test prompts like you test code.

Steps to Validate Prompt Changes:

  1. Mock the RAG Responses
    To save on costs (and simplify testing), mock the RAG responses or function calls instead of actually querying your vector database.
  2. Call the LLM with the New Prompt
    Pass the mocked data and the updated prompt to your LLM, generating a new response.
  3. Compare the Old and New Responses
    Convert both responses to embeddings and measure their similarity. If the cosine similarity score is above 0.9 (tune this threshold for your own use case), you can assume the new prompt hasn’t drastically broken existing behavior.

This ensures each new prompt version meets previous expectations, so you can confidently evolve your AI system without breaking old use cases.
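The comparison step boils down to cosine similarity over the two embedding vectors. A minimal sketch, assuming the embeddings have already been fetched from your embedding model (PromptRegression and promptChangeSafe are illustrative names):

```java
public class PromptRegression {
    // Cosine similarity between two embedding vectors of equal length:
    // dot(a, b) / (|a| * |b|), ranging from -1 to 1.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // The regression check: the new prompt's response must stay
    // semantically close to the old prompt's response.
    public static boolean promptChangeSafe(double[] oldEmb, double[] newEmb, double threshold) {
        return cosine(oldEmb, newEmb) >= threshold;
    }
}
```

Wire this into your test suite so every prompt change runs against the mocked RAG fixtures before it ships.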

Test Suite for LLM

Final Thoughts

Scaling AI applications requires handling rate limits, refining feedback loops, and testing prompt changes to avoid surprises.

For performance, Java and Golang are better suited than Python. Since LLMs are primarily offered as services, integrating them with Java is straightforward — especially with Spring AI, which is nearing its 1.0 release. If high performance is your priority, Golang is a great choice due to its efficiency and speed.

Check out the Spring AI project for more.

And finally, a shameless self-promotion: my book on Spring AI, Spring AI For Organization.


Written by Muthukumaran Navaneethakrishnan

Software engineer working on Java, JavaScript, Golang & Clojure