AI Agent Error Handling Patterns: Building Resilient Systems
AI agents fail. Models time out, APIs rate limit, prompts hit token limits, and responses arrive in unexpected formats. The difference between a prototype and a production system isn't the happy path—it's how gracefully you handle errors. This guide covers proven patterns for building resilient AI agents that fail gracefully and recover automatically.
Why Error Handling Matters More for AI Agents
Traditional software errors are deterministic—the same input produces the same error. AI agents introduce unique failure modes:
- Non-deterministic outputs: The same prompt can produce different formats, lengths, and quality levels
- Model instability: Rate limits, capacity issues, and timeout fluctuations
- Context window limits: Unexpected token count growth breaks requests
- Hallucinated responses: Plausible-looking but incorrect outputs require validation
- Cascading failures: One agent's error propagates through multi-step workflows
The Error Handling Hierarchy
Level 1: Retries with Backoff
The first line of defense for transient failures:
- Exponential backoff: Wait 1s, 2s, 4s, 8s between retries
- Jitter: Add randomness to prevent thundering herd
- Max retries: Typically 3-5 attempts before giving up
- Retryable errors: Timeouts, rate limits (429), server errors (5xx)
- Non-retryable: Auth failures (401), bad requests (400), not found (404)
Implementation pattern:
// Retryable: timeouts, rate limits (429), server errors (5xx).
// Everything else is rethrown immediately.
const isRetryable = (error) =>
  error.status === 429 || error.status >= 500 || error.code === 'ETIMEDOUT';
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (!isRetryable(error) || i === maxRetries - 1) throw error;
      // Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
      const delay = Math.pow(2, i) * 1000 + Math.random() * 1000;
      await sleep(delay);
    }
  }
}
Level 2: Fallback Models
When the primary model fails, route to a backup:
- Cascade strategy: GPT-4 → Claude → GPT-3.5 → local model
- Cost vs quality tradeoff: Accept lower quality for availability
- Model capabilities: Ensure fallback can handle the task
- Monitoring: Track fallback frequency as health metric
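A minimal cascade can be sketched as a loop over an ordered model list. Here `callModel` is a hypothetical adapter that takes a model name and a prompt and returns a completion; any real client would be wired in behind that interface.

```javascript
// Fallback cascade sketch: try each model in priority order,
// returning the first success along with which model produced it.
async function withFallback(models, prompt, callModel) {
  let lastError;
  for (const model of models) {
    try {
      return { model, output: await callModel(model, prompt) };
    } catch (error) {
      lastError = error; // record and try the next model in the cascade
    }
  }
  throw lastError; // every model in the cascade failed
}
```

Logging which model actually served each request gives you the fallback-rate health metric mentioned above for free.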
Level 3: Response Validation
Catch hallucinations and malformed outputs:
- Schema validation: JSON schema for structured outputs
- Content checks: Verify required fields, valid ranges, logical consistency
- Confidence thresholds: Reject low-confidence responses
- Human escalation: Route uncertain responses to review queue
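As a sketch of schema and range validation, here is a hand-rolled check for a hypothetical booking response; in practice a schema library such as Zod or Ajv would replace the manual field checks.

```javascript
// Minimal validator for a structured model response. The field names
// (`destination`, `nights`) and the 1-90 night range are illustrative.
function validateBookingResponse(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: 'malformed JSON' };
  }
  if (typeof parsed.destination !== 'string' || !parsed.destination) {
    return { ok: false, reason: 'missing destination' };
  }
  if (!Number.isFinite(parsed.nights) || parsed.nights < 1 || parsed.nights > 90) {
    return { ok: false, reason: 'nights out of range' };
  }
  return { ok: true, value: parsed };
}
```

Responses that fail these checks are good candidates for a retry with a corrective prompt, or for the human review queue.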
Level 4: Graceful Degradation
When all else fails, provide reduced functionality:
- Cached responses: Return previously computed results
- Simplified mode: Basic functionality without advanced features
- Transparent failure: Clear error message with retry options
- Manual fallback: Route to human agents or alternative processes
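The cached-response strategy can be sketched with any Map-like store keyed by the request; the store and the `fetchLive` callback here are assumptions standing in for your real cache and model call.

```javascript
// Degradation sketch: serve a previously computed answer when the live
// call fails, and surface the error transparently when nothing is cached.
async function withCacheFallback(key, cache, fetchLive) {
  try {
    const fresh = await fetchLive();
    cache.set(key, fresh); // keep the cache warm for the next failure
    return { source: 'live', value: fresh };
  } catch (error) {
    if (cache.has(key)) {
      return { source: 'cache', value: cache.get(key) }; // stale but useful
    }
    throw error; // nothing cached: fail transparently
  }
}
```

Tagging the result with its `source` lets the UI disclose that a response may be stale, which keeps the degradation honest.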
Pattern 1: Circuit Breaker
Prevent cascading failures by failing fast when a service is degraded:
States
- Closed: Normal operation, requests flow through
- Open: Failure threshold exceeded, all requests fail immediately
- Half-Open: Testing if service recovered, limited requests allowed
Configuration
- Failure threshold: 5 failures in 30 seconds
- Open duration: 60 seconds before attempting recovery
- Half-open requests: 3 test requests to verify recovery
Implementation:
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failures = 0;
    this.threshold = threshold;
    this.timeout = timeout; // how long to stay OPEN before probing recovery
    this.state = 'CLOSED';
    this.lastFailure = null;
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.timeout) {
        this.state = 'HALF-OPEN'; // allow a probe to test recovery
      } else {
        throw new Error('Circuit breaker is OPEN'); // fail fast
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'OPEN';
    }
  }
}
Pattern 2: Timeout Budget
Allocate time across multiple operations, failing fast once the budget is exhausted:
- Total budget: 30 seconds for complete request
- Per-operation: Model call (20s), validation (2s), post-processing (3s)
- Early termination: Cancel remaining operations if budget exceeded
- User experience: Return partial results or cached response
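One way to sketch a shared deadline is a small budget object that races each step against whatever time remains; the step labels and durations here are illustrative, not prescriptive.

```javascript
// Timeout-budget sketch: one deadline shared by every step of a request.
function makeBudget(totalMs) {
  const deadline = Date.now() + totalMs;
  return {
    remaining: () => deadline - Date.now(),
    // Race a step against the remaining budget, failing fast at zero.
    run(label, promiseFactory) {
      const left = deadline - Date.now();
      if (left <= 0) {
        return Promise.reject(new Error(`budget exhausted before ${label}`));
      }
      return Promise.race([
        promiseFactory(),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error(`${label} exceeded budget`)), left)
        ),
      ]);
    },
  };
}
```

A caller would chain steps like `budget.run('model call', ...)` then `budget.run('validation', ...)`; a slow model call automatically shrinks the time available to everything after it.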
Pattern 3: Dead Letter Queue
Capture failed requests for analysis and retry:
- Capture context: Input, timestamp, error type, attempt count
- Retry queue: Automatically retry with exponential backoff
- Manual review: Route persistent failures to human review
- Analytics: Track failure patterns to identify systemic issues
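An in-memory version conveys the shape of the pattern; a real deployment would back the queue with durable storage such as SQS, Kafka, or a database table.

```javascript
// Dead letter queue sketch: capture failure context, then drain entries
// back through a processor, re-queueing anything that fails again.
class DeadLetterQueue {
  constructor() {
    this.entries = [];
  }
  capture(input, error, attempts) {
    this.entries.push({
      input,
      error: String(error), // error type/message for later analysis
      attempts,
      timestamp: Date.now(),
    });
  }
  async drain(processor) {
    const pending = this.entries.splice(0); // take everything currently queued
    for (const entry of pending) {
      try {
        await processor(entry);
      } catch (error) {
        this.capture(entry.input, error, entry.attempts + 1); // re-queue
      }
    }
  }
}
```

Entries whose attempt count keeps climbing are the ones to route to manual review, and grouping entries by `error` is a cheap first pass at the failure-pattern analytics mentioned above.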
Pattern 4: Bulkhead
Isolate failures to prevent total system collapse:
- Connection pools: Separate pools for critical vs non-critical operations
- Rate limiters: Per-tenant or per-endpoint limits
- Resource isolation: Dedicated resources for high-priority workflows
- Graceful degradation: Non-essential features fail without affecting core
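The simplest bulkhead is a counting semaphore around one dependency, so a slow service cannot absorb every worker. This is a minimal sketch; production systems usually layer queue limits and timeouts on top.

```javascript
// Bulkhead sketch: cap concurrent calls to a dependency; extra callers
// wait in a FIFO queue until a slot frees up.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
    this.waiting = [];
  }
  async execute(fn) {
    if (this.active >= this.maxConcurrent) {
      await new Promise((resolve) => this.waiting.push(resolve));
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
      const next = this.waiting.shift();
      if (next) next(); // admit the next queued caller
    }
  }
}
```

Giving critical and non-critical operations separate `Bulkhead` instances is what keeps a flood of low-priority work from starving the core workflow.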
Pattern 5: Semantic Validation
Validate AI responses for correctness, not just format:
Validation Types
- Format validation: JSON schema, required fields, data types
- Range validation: Numbers within expected bounds
- Logic validation: Responses make logical sense (e.g., end date after start date)
- Factual validation: Cross-reference with trusted sources
- Consistency validation: Response aligns with conversation history
Validation Pipeline
- Parse response into structured format
- Apply schema validation
- Check business rules and constraints
- Verify against knowledge base (if applicable)
- Flag low-confidence responses for review
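The pipeline above can be sketched as a parse step followed by a sequence of named checks; the `answer`/`confidence` fields and the 0.7 threshold are illustrative, and the knowledge-base step is omitted for brevity.

```javascript
// Validation pipeline sketch: each check returns null on pass or a
// reason string on failure, and the first failure short-circuits.
function runValidationPipeline(raw, checks) {
  let parsed;
  try {
    parsed = JSON.parse(raw); // step 1: parse into a structured format
  } catch {
    return { valid: false, stage: 'parse', reason: 'malformed JSON' };
  }
  for (const { stage, check } of checks) {
    const reason = check(parsed);
    if (reason) return { valid: false, stage, reason };
  }
  return { valid: true, value: parsed };
}

const checks = [
  { stage: 'schema', check: (r) => (typeof r.answer === 'string' ? null : 'answer must be a string') },
  { stage: 'business', check: (r) => (r.confidence >= 0.7 ? null : 'confidence below threshold') },
];
```

Returning the failing `stage` makes it easy to route outcomes: schema failures trigger a retry, while low-confidence results go to the review queue.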
Error Classification Framework
Transient Errors (Retry)
- Network timeouts
- Rate limits (HTTP 429)
- Model capacity issues
- Temporary service unavailability (HTTP 503)
Persistent Errors (Escalate)
- Authentication failures (HTTP 401)
- Authorization errors (HTTP 403)
- Invalid requests (HTTP 400)
- Resource not found (HTTP 404)
Model Errors (Fallback)
- Token limit exceeded
- Content policy violations
- Context window overflow
- Response quality below threshold
Application Errors (Handle)
- Invalid user input
- Missing required context
- Business rule violations
- Unsupported operations
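The four classes map naturally onto a single dispatch function. Here `error.status` is assumed to carry the HTTP status from your API client, and the `code` strings are illustrative stand-ins for whatever your model provider actually returns.

```javascript
// Classification sketch: map an error to the action the sections above
// prescribe — retry, escalate, fallback, or handle in the application.
function classifyError(error) {
  const status = error.status;
  if (status === 429 || status === 503 || error.code === 'ETIMEDOUT') {
    return 'retry'; // transient: backoff and try again
  }
  if (status === 400 || status === 401 || status === 403 || status === 404) {
    return 'escalate'; // persistent: retrying will not help
  }
  if (error.code === 'context_length_exceeded' || error.code === 'content_policy') {
    return 'fallback'; // model-level: route to a backup model or strategy
  }
  return 'handle'; // application-level: validate input, surface to caller
}
```

Centralizing this decision in one function keeps retry loops, circuit breakers, and fallback cascades agreeing on what each error means.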
Observability for Error Handling
Key Metrics
- Error rate: Failed requests / total requests
- Retry rate: Requests requiring retry
- Fallback rate: Requests using backup models
- Circuit breaker trips: Frequency of OPEN state
- DLQ depth: Unprocessed failed requests
- Mean time to recovery: Average time from failure to resolution
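A few in-process counters are enough to compute the rates above; a real system would export them to a metrics backend such as Prometheus or Datadog rather than keep them in memory.

```javascript
// Minimal in-process counters for error, retry, and fallback rates.
class ErrorMetrics {
  constructor() {
    this.counts = { total: 0, failed: 0, retried: 0, fellBack: 0 };
  }
  record({ failed = false, retried = false, fellBack = false } = {}) {
    this.counts.total++;
    if (failed) this.counts.failed++;
    if (retried) this.counts.retried++;
    if (fellBack) this.counts.fellBack++;
  }
  errorRate() {
    return this.counts.total ? this.counts.failed / this.counts.total : 0;
  }
}
```

Comparing `errorRate()` against the alerting thresholds below over a sliding window is the simplest way to turn these counters into pages.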
Alerting Thresholds
- Error rate > 5% for 5 minutes
- Circuit breaker OPEN for > 2 minutes
- DLQ depth > 100 messages
- Fallback rate > 20% of traffic
Testing Error Handling
Chaos Engineering
- Latency injection: Add delays to test timeout handling
- Error injection: Force specific error codes
- Resource exhaustion: Test behavior under load
- Model degradation: Test with lower-quality model responses
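Error injection can be as small as a wrapper that fails a configurable fraction of calls with a chosen error, which lets tests exercise the retry, circuit breaker, and fallback paths without touching the real service. The injected 429 here is just one example error.

```javascript
// Chaos sketch: wrap any async function so a fraction of calls throw.
function withChaos(fn, {
  failureRate = 0.2,
  makeError = () => Object.assign(new Error('injected failure'), { status: 429 }),
} = {}) {
  return async (...args) => {
    if (Math.random() < failureRate) throw makeError();
    return fn(...args);
  };
}
```

Setting `failureRate` to 1 in a test pins down the all-failures path deterministically, while intermediate rates are better suited to soak testing.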
Test Scenarios
- Model API returns 429 rate limit
- Response exceeds token limit
- Model returns malformed JSON
- Network timeout during request
- All fallback models unavailable
Common Mistakes
- Infinite retries: Always set maximum retry count
- No backoff: Retrying immediately compounds the problem
- Generic error messages: Users need actionable information
- Swallowing errors: Log all failures for debugging
- Over-engineering: Start simple, add complexity as needed
- Ignoring partial failures: Handle degraded responses appropriately
Best Practices Summary
- Implement retries with exponential backoff and jitter
- Use circuit breakers to fail fast during outages
- Always have fallback models configured
- Validate responses semantically, not just syntactically
- Capture failed requests in dead letter queues
- Isolate critical operations with bulkheads
- Monitor error rates and alert proactively
- Test error scenarios with chaos engineering
- Provide clear, actionable error messages to users
- Document error codes and resolution steps
Conclusion
Robust error handling transforms AI agents from fragile prototypes into production-ready systems. By implementing the patterns covered—retries with backoff, circuit breakers, fallback models, semantic validation, and graceful degradation—you build agents that fail gracefully and recover automatically.
Remember: users don't remember when everything works perfectly. They remember how you handled failure. Make it count.