We are going through an era of products that need to provide prompt-based interfaces without having the resources to self-host LLMs. And since talking to a computer in plain English is way easier (from a user’s perspective) than clicking around (YMMV), as product builders and engineers we will face the same issue more often than not:
Company pays for an amount of tokens, which happen to not be unlimited. Product users then consume all those tokens. All subsequent requests fail with a “you’re too poor to buy more tokens” error. Users start complaining.
There’s no point in sending another request (or another token) to that LLM until you either get richer, or until your quota gets reset and you can start prompting again.
History repeats itself, and while LLMs are great and intelligent and all that, for all practical purposes they can, and I’d advocate should, be treated as any other dependency over the network:
- LLMs can be slow, and your requests can timeout
- LLM responses can fail, not necessarily due to your input
- The LLM can rate limit and/or throttle you
- The LLM can go offline for whatever reason, and come back online whenever it feels like
Our goal here, as builders, is to build resilient systems that stay predictable under these circumstances.
Retries Aren’t Silver Bullets
A few days ago, I posted a story on LinkedIn about how being rate limited by an API role-playing as an LLM resulted in high error rates.
The naive, perhaps tempting solution might be to retry. But retries, even with backoff, can make things worse when the upstream is rate limiting you. A retry is not always recovery. Sometimes, it is unnecessary pressure.
If every failed request immediately becomes another request, the system does not recover. It amplifies the problem. What it needs instead is backpressure.
Luckily, for the use case we had, the service depending on the LLM didn’t have a time constraint. Assume that air resistance is negligible and that customers don’t care how long the response takes to come back, as long as the result is useful.
To work around these limitations, the solution we can go for is a combination of two patterns, which the title probably gave away too soon:
- Circuit Breaker, to stop spamming the LLM when it is clearly pushing back
- Durable Queue, to make sure work is not lost and can be retried later
The circuit breaker protects the upstream. The queue protects our work. Together, they give us controlled retries instead of chaos.
Implementing a Circuit Breaker
A circuit breaker answers a single question: Should we call the third-party LLM API right now?
As per some paraphrased definition, a service client should invoke a remote service via a proxy that functions in a similar fashion to an electrical circuit breaker. When the number of consecutive failures crosses a threshold, the breaker trips: the circuit opens, and for the duration of a timeout period all attempts to invoke the remote service fail immediately. After the timeout expires, the breaker allows a limited number of test requests to pass through; the circuit is now half-open. If those requests succeed, the breaker resumes normal operation and the circuit closes. Otherwise, if there is a failure, the timeout period begins again.
Here’s a simple implementation of this whole logic in Go:
package breaker

import (
	"context"
	"errors"
	"sync"
	"time"
)

var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitState string

const (
	StateClosed   CircuitState = "closed"
	StateOpen     CircuitState = "open"
	StateHalfOpen CircuitState = "half_open"
)

type CircuitBreaker struct {
	mu          sync.Mutex
	state       CircuitState
	failures    int
	maxFailures int
	openedAt    time.Time
	cooldown    time.Duration
	shouldTrip  func(error) bool
}

func NewCircuitBreaker(maxFailures int, cooldown time.Duration, shouldTrip func(error) bool) *CircuitBreaker {
	return &CircuitBreaker{
		state:       StateClosed,
		maxFailures: maxFailures,
		cooldown:    cooldown,
		shouldTrip:  shouldTrip,
	}
}

// Do runs fn if the breaker allows it, and records the outcome.
func (b *CircuitBreaker) Do(ctx context.Context, fn func(context.Context) error) error {
	if err := b.beforeCall(); err != nil {
		return err
	}

	err := fn(ctx)
	if err == nil {
		b.recordSuccess()
		return nil
	}

	// Only count failures the caller considers trip-worthy.
	if b.shouldTrip == nil || b.shouldTrip(err) {
		b.recordFailure()
	}
	return err
}

func (b *CircuitBreaker) beforeCall() error {
	b.mu.Lock()
	defer b.mu.Unlock()

	if b.state == StateOpen {
		if time.Since(b.openedAt) >= b.cooldown {
			// Cooldown elapsed: go half-open. Note this simple version
			// lets every request through while half-open, not just one probe.
			b.state = StateHalfOpen
			return nil
		}
		return ErrCircuitOpen
	}
	return nil
}

func (b *CircuitBreaker) recordSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()

	b.failures = 0
	b.state = StateClosed
}

func (b *CircuitBreaker) recordFailure() {
	b.mu.Lock()
	defer b.mu.Unlock()

	b.failures++
	// A single failure while half-open re-opens the circuit immediately.
	if b.state == StateHalfOpen || b.failures >= b.maxFailures {
		b.state = StateOpen
		b.openedAt = time.Now()
	}
}
circuit_breaker.go
The breaker does not retry anything. We simply ask it to Do something for us, and it decides whether we are allowed to hit the upstream dependency or not.
Also, this breaker is in-memory, and while it can be shared by multiple workers (see the mutex), it does not survive restarts. If that’s what you need (and most likely you do), look into persisting the breaker’s state and using distributed locking instead of mutexes.
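To make the idea concrete, here is one hypothetical shape that could take. BreakerState and BreakerStore are made-up names for this sketch, not a library API; the store could be backed by Redis, Postgres, or anything that supports atomic updates:

// BreakerState and BreakerStore sketch a restart-surviving breaker.
// They are illustrative only; names and shapes are assumptions.
type BreakerState struct {
	State    CircuitState
	Failures int
	OpenedAt time.Time
}

type BreakerStore interface {
	// Load fetches the state shared by all worker instances.
	Load(ctx context.Context) (BreakerState, error)
	// Update applies fn atomically, e.g. under a distributed lock or a
	// compare-and-set, replacing the in-process mutex.
	Update(ctx context.Context, fn func(*BreakerState)) error
}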
Be lazy, retry later
Following the principle of “not doing today what can be put off till tomorrow”, instead of tying the payload’s lifecycle to that of an HTTP request, we persist work to retry when it benefits us:
type Job struct {
	ID      string
	Payload Payload
	Attempt int
}

type Payload struct {
	Prompt string
}

type Queue interface {
	Enqueue(ctx context.Context, payload Payload) error
	// ClaimNext returns the next due job, or nil if there is none.
	ClaimNext(ctx context.Context) (*Job, error)
	Complete(ctx context.Context, jobID string) error
	RetryLater(ctx context.Context, jobID string, next time.Time) error
}
queue.go
This Queue interface isn’t tied to a specific persistence system. You can implement it on top of a database table, a message queue, or a stream you can read from later. The implementation doesn’t matter; only the guarantee does: the payload survives failure and can be retried at a later time.
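As an illustration, here’s a minimal sketch of a Postgres-backed implementation. It assumes a hypothetical jobs table with id, prompt, attempt, status, and run_at columns; the schema and column names are mine, not a prescription. FOR UPDATE SKIP LOCKED lets multiple workers claim jobs without stepping on each other:

package queue

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// PostgresQueue is a sketch, not a drop-in library.
type PostgresQueue struct {
	db *sql.DB
}

func (q *PostgresQueue) Enqueue(ctx context.Context, p Payload) error {
	_, err := q.db.ExecContext(ctx,
		`INSERT INTO jobs (prompt, attempt, status, run_at)
		 VALUES ($1, 0, 'pending', now())`, p.Prompt)
	return err
}

func (q *PostgresQueue) ClaimNext(ctx context.Context) (*Job, error) {
	var j Job
	// SKIP LOCKED makes concurrent workers claim different rows
	// instead of blocking on the same one.
	err := q.db.QueryRowContext(ctx,
		`UPDATE jobs SET status = 'claimed'
		 WHERE id = (
			SELECT id FROM jobs
			WHERE status = 'pending' AND run_at <= now()
			ORDER BY run_at
			LIMIT 1
			FOR UPDATE SKIP LOCKED
		 )
		 RETURNING id, prompt, attempt`).
		Scan(&j.ID, &j.Payload.Prompt, &j.Attempt)
	if errors.Is(err, sql.ErrNoRows) {
		return nil, nil // nothing due right now
	}
	if err != nil {
		return nil, err
	}
	return &j, nil
}

func (q *PostgresQueue) Complete(ctx context.Context, jobID string) error {
	_, err := q.db.ExecContext(ctx,
		`UPDATE jobs SET status = 'done' WHERE id = $1`, jobID)
	return err
}

func (q *PostgresQueue) RetryLater(ctx context.Context, jobID string, next time.Time) error {
	_, err := q.db.ExecContext(ctx,
		`UPDATE jobs
		 SET status = 'pending', attempt = attempt + 1, run_at = $2
		 WHERE id = $1`, jobID, next)
	return err
}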
Wiring it all together
We use a Worker instance to combine the two patterns. The worker is responsible for processing prompts, and it uses the CircuitBreaker to decide whether to call the LLM or not. If the breaker is open, or if the call fails, it enqueues the job to be retried later.
// LLMResponse is whatever your client returns; its shape doesn't matter here.
type LLMClient interface {
	ProcessPrompt(ctx context.Context, text string) (LLMResponse, error)
}

type Worker struct {
	queue   Queue
	breaker *CircuitBreaker
	llm     LLMClient
}
A worker calls the LLMClient when the CircuitBreaker is closed, and in case of failure it enqueues the Job (along with the Payload) to be retried at a future time. We can build a breaker that opens on specific errors, like:
breaker := NewCircuitBreaker(
	5,              // max consecutive failures before tripping
	15*time.Minute, // cooldown period while open
	func(err error) bool { // trip the breaker for HTTP 429 and 5xx
		var httpErr *HTTPError
		if errors.As(err, &httpErr) {
			return httpErr.StatusCode == http.StatusTooManyRequests || // HTTP 429
				httpErr.StatusCode >= http.StatusInternalServerError // HTTP 5xx
		}
		return false
	},
)
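One note: HTTPError isn’t defined anywhere in these snippets; it stands in for whatever error type your LLM client returns. A minimal placeholder could look like this:

// HTTPError is a hypothetical error type carrying the upstream status code;
// swap in whatever your LLM client actually returns. Requires the "fmt" import.
type HTTPError struct {
	StatusCode int
}

func (e *HTTPError) Error() string {
	return fmt.Sprintf("llm upstream returned HTTP %d", e.StatusCode)
}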
With the modules we now have, we can implement a ProcessJob behavior that enables the worker to process a prompt or enqueue for later:
func (w *Worker) ProcessJob(ctx context.Context) error {
	job, err := w.queue.ClaimNext(ctx)
	if err != nil {
		return err
	}
	if job == nil { // no job to claim; feels weird but I'll allow it
		return nil
	}

	var llmRes LLMResponse
	err = w.breaker.Do(ctx, func(ctx context.Context) error {
		res, err := w.llm.ProcessPrompt(ctx, job.Payload.Prompt)
		if err != nil {
			return err
		}
		llmRes = res
		return nil
	})
	if err != nil {
		// Breaker open or call failed: park the job instead of hammering.
		return w.retryLater(ctx, job)
	}

	if err := w.queue.Complete(ctx, job.ID); err != nil {
		return err
	}

	// TODO: process LLM Response
	_ = llmRes
	return nil
}

func (w *Worker) retryLater(ctx context.Context, job *Job) error {
	next := time.Now().Add(backoff(job.Attempt)) // implementation left to the reader
	return w.queue.RetryLater(ctx, job.ID, next)
}
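The backoff implementation is left to the reader above, but for illustration, one common choice is exponential backoff with full jitter, so parked jobs don’t all wake up at once. The constants here are assumptions to tune against your quota:

// backoff grows exponentially with the attempt count, caps at an hour,
// and applies full jitter so retries spread out instead of stampeding.
// Needs the "math/rand" import.
func backoff(attempt int) time.Duration {
	if attempt > 10 {
		attempt = 10 // clamp so the shift below cannot overflow
	}
	d := time.Minute << attempt // 1m, 2m, 4m, ...
	if d > time.Hour {
		d = time.Hour
	}
	return time.Duration(rand.Int63n(int64(d)))
}

Driving the worker can then be as simple as a polling loop; graceful shutdown, pacing, and metrics are left out for brevity:

for {
	if err := w.ProcessJob(ctx); err != nil {
		log.Printf("process job: %v", err)
	}
	time.Sleep(time.Second) // naive pacing; adjust to your throughput
}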
Takeaways
If you’re not a Go developer, the main takeaway here is to not think of LLMs as special citizens from a systems perspective. They are unreliable network dependencies. If you treat them like magic, your system will break. If you treat them like any other dependency in a distributed system, you can use all the tools you already know work, and your system will survive.
The prompt is only half of the solution. The execution model is just as important.