AI Agent Error Handling Patterns: Building Resilient Systems
AI agents fail. Models time out, APIs rate limit, prompts hit token limits, and responses arrive in unexpected formats. The difference between a prototype and a production system isn't the happy path—it's how gracefully you handle errors. This guide covers proven patterns for building resilient AI agents that fail gracefully and recover automatically.
Why Error Handling Matters More for AI Agents
Traditional software errors are deterministic—the same input produces the same error. AI agents introduce unique failure modes:
- Non-deterministic outputs: The same prompt can produce different formats, lengths, and quality levels
- Model instability: Rate limits, capacity issues, and timeout fluctuations
- Context window limits: Unexpected token count growth breaks requests
- Hallucinated responses: Plausible-looking but incorrect outputs require validation
- Cascading failures: One agent's error propagates through multi-step workflows
The Error Handling Hierarchy
Level 1: Retries with Backoff
The first line of defense for transient failures:
- Exponential backoff: Wait 1s, 2s, 4s, 8s between retries
- Jitter: Add randomness to prevent thundering herd
- Max retries: Typically 3-5 attempts before giving up
- Retryable errors: Timeouts, rate limits (429), server errors (5xx)
- Non-retryable: Auth failures (401), bad requests (400), not found (404)
Implementation pattern:
// Retryable: timeouts, rate limits (429), server errors (5xx).
// Everything else is rethrown immediately.
const isRetryable = (error) =>
  error.status === 429 || error.status >= 500 || error.code === 'ETIMEDOUT';
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (!isRetryable(error) || i === maxRetries - 1) throw error;
      // Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
      const delay = Math.pow(2, i) * 1000 + Math.random() * 1000;
      await sleep(delay);
    }
  }
}
Level 2: Fallback Models
When the primary model fails, route to a backup:
- Cascade strategy: GPT-4 → Claude → GPT-3.5 → local model
- Cost vs quality tradeoff: Accept lower quality for availability
- Model capabilities: Ensure fallback can handle the task
- Monitoring: Track fallback frequency as health metric
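A minimal cascade can be sketched as a loop over an ordered model list. Here `callModel` is a hypothetical adapter that takes a model name and a prompt and returns a completion; any real client would be wired in behind that interface.

```javascript
// Fallback cascade sketch: try each model in priority order,
// returning the first success along with which model produced it.
async function withFallback(models, prompt, callModel) {
  let lastError;
  for (const model of models) {
    try {
      return { model, output: await callModel(model, prompt) };
    } catch (error) {
      lastError = error; // record and try the next model in the cascade
    }
  }
  throw lastError; // every model in the cascade failed
}
```

Logging which model actually served each request gives you the fallback-rate health metric mentioned above for free.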
Level 3: Response Validation
Catch hallucinations and malformed outputs:
- Schema validation: JSON schema for structured outputs
- Content checks: Verify required fields, valid ranges, logical consistency
- Confidence thresholds: Reject low-confidence responses
- Human escalation: Route uncertain responses to review queue
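As a sketch of schema and range validation, here is a hand-rolled check for a hypothetical booking response; in practice a schema library such as Zod or Ajv would replace the manual field checks.

```javascript
// Minimal validator for a structured model response. The field names
// (`destination`, `nights`) and the 1-90 night range are illustrative.
function validateBookingResponse(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: 'malformed JSON' };
  }
  if (typeof parsed.destination !== 'string' || !parsed.destination) {
    return { ok: false, reason: 'missing destination' };
  }
  if (!Number.isFinite(parsed.nights) || parsed.nights < 1 || parsed.nights > 90) {
    return { ok: false, reason: 'nights out of range' };
  }
  return { ok: true, value: parsed };
}
```

Responses that fail these checks are good candidates for a retry with a corrective prompt, or for the human review queue.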
Level 4: Graceful Degradation
When all else fails, provide reduced functionality:
- Cached responses: Return previously computed results
- Simplified mode: Basic functionality without advanced features
- Transparent failure: Clear error message with retry options
- Manual fallback: Route to human agents or alternative processes
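The cached-response strategy can be sketched with any Map-like store keyed by the request; the store and the `fetchLive` callback here are assumptions standing in for your real cache and model call.

```javascript
// Degradation sketch: serve a previously computed answer when the live
// call fails, and surface the error transparently when nothing is cached.
async function withCacheFallback(key, cache, fetchLive) {
  try {
    const fresh = await fetchLive();
    cache.set(key, fresh); // keep the cache warm for the next failure
    return { source: 'live', value: fresh };
  } catch (error) {
    if (cache.has(key)) {
      return { source: 'cache', value: cache.get(key) }; // stale but useful
    }
    throw error; // nothing cached: fail transparently
  }
}
```

Tagging the result with its `source` lets the UI disclose that a response may be stale, which keeps the degradation honest.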
Pattern 1: Circuit Breaker
Prevent cascading failures by failing fast when a service is degraded:
States
- Closed: Normal operation, requests flow through
- Open: Failure threshold exceeded, all requests fail immediately
- Half-Open: Testing if service recovered, limited requests allowed
Configuration
- Failure threshold: 5 failures in 30 seconds
- Open duration: 60 seconds before attempting recovery
- Half-open requests: 3 test requests to verify recovery
Implementation:
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failures = 0;
    this.threshold = threshold;
    this.timeout = timeout; // how long to stay OPEN before probing recovery
    this.state = 'CLOSED';
    this.lastFailure = null;
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.timeout) {
        this.state = 'HALF-OPEN'; // allow a probe to test recovery
      } else {
        throw new Error('Circuit breaker is OPEN'); // fail fast
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'OPEN';
    }
  }
}
Pattern 2: Timeout Budget
Allocate time across multiple operations, failing fast once the budget is exhausted:
- Total budget: 30 seconds for complete request
- Per-operation: Model call (20s), validation (2s), post-processing (3s)
- Early termination: Cancel remaining operations if budget exceeded
- User experience: Return partial results or cached response
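One way to sketch a shared deadline is a small budget object that races each step against whatever time remains; the step labels and durations here are illustrative, not prescriptive.

```javascript
// Timeout-budget sketch: one deadline shared by every step of a request.
function makeBudget(totalMs) {
  const deadline = Date.now() + totalMs;
  return {
    remaining: () => deadline - Date.now(),
    // Race a step against the remaining budget, failing fast at zero.
    run(label, promiseFactory) {
      const left = deadline - Date.now();
      if (left <= 0) {
        return Promise.reject(new Error(`budget exhausted before ${label}`));
      }
      return Promise.race([
        promiseFactory(),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error(`${label} exceeded budget`)), left)
        ),
      ]);
    },
  };
}
```

A caller would chain steps like `budget.run('model call', ...)` then `budget.run('validation', ...)`; a slow model call automatically shrinks the time available to everything after it.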
Pattern 3: Dead Letter Queue
Capture failed requests for analysis and retry:
- Capture context: Input, timestamp, error type, attempt count
- Retry queue: Automatically retry with exponential backoff
- Manual review: Route persistent failures to human review
- Analytics: Track failure patterns to identify systemic issues
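An in-memory version conveys the shape of the pattern; a real deployment would back the queue with durable storage such as SQS, Kafka, or a database table.

```javascript
// Dead letter queue sketch: capture failure context, then drain entries
// back through a processor, re-queueing anything that fails again.
class DeadLetterQueue {
  constructor() {
    this.entries = [];
  }
  capture(input, error, attempts) {
    this.entries.push({
      input,
      error: String(error), // error type/message for later analysis
      attempts,
      timestamp: Date.now(),
    });
  }
  async drain(processor) {
    const pending = this.entries.splice(0); // take everything currently queued
    for (const entry of pending) {
      try {
        await processor(entry);
      } catch (error) {
        this.capture(entry.input, error, entry.attempts + 1); // re-queue
      }
    }
  }
}
```

Entries whose attempt count keeps climbing are the ones to route to manual review, and grouping entries by `error` is a cheap first pass at the failure-pattern analytics mentioned above.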
Pattern 4: Bulkhead
Isolate failures to prevent total system collapse:
- Connection pools: Separate pools for critical vs non-critical operations
- Rate limiters: Per-tenant or per-endpoint limits
- Resource isolation: Dedicated resources for high-priority workflows
- Graceful degradation: Non-essential features fail without affecting core
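The simplest bulkhead is a counting semaphore around one dependency, so a slow service cannot absorb every worker. This is a minimal sketch; production systems usually layer queue limits and timeouts on top.

```javascript
// Bulkhead sketch: cap concurrent calls to a dependency; extra callers
// wait in a FIFO queue until a slot frees up.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
    this.waiting = [];
  }
  async execute(fn) {
    if (this.active >= this.maxConcurrent) {
      await new Promise((resolve) => this.waiting.push(resolve));
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
      const next = this.waiting.shift();
      if (next) next(); // admit the next queued caller
    }
  }
}
```

Giving critical and non-critical operations separate `Bulkhead` instances is what keeps a flood of low-priority work from starving the core workflow.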
Pattern 5: Semantic Validation
Validate AI responses for correctness, not just format:
Validation Types
- Format validation: JSON schema, required fields, data types
- Range validation: Numbers within expected bounds
- Logic validation: Responses make logical sense (e.g., end date after start date)
- Factual validation: Cross-reference with trusted sources
- Consistency validation: Response aligns with conversation history
Validation Pipeline
- Parse response into structured format
- Apply schema validation
- Check business rules and constraints
- Verify against knowledge base (if applicable)
- Flag low-confidence responses for review
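The pipeline above can be sketched as a parse step followed by a sequence of named checks; the `answer`/`confidence` fields and the 0.7 threshold are illustrative, and the knowledge-base step is omitted for brevity.

```javascript
// Validation pipeline sketch: each check returns null on pass or a
// reason string on failure, and the first failure short-circuits.
function runValidationPipeline(raw, checks) {
  let parsed;
  try {
    parsed = JSON.parse(raw); // step 1: parse into a structured format
  } catch {
    return { valid: false, stage: 'parse', reason: 'malformed JSON' };
  }
  for (const { stage, check } of checks) {
    const reason = check(parsed);
    if (reason) return { valid: false, stage, reason };
  }
  return { valid: true, value: parsed };
}

const checks = [
  { stage: 'schema', check: (r) => (typeof r.answer === 'string' ? null : 'answer must be a string') },
  { stage: 'business', check: (r) => (r.confidence >= 0.7 ? null : 'confidence below threshold') },
];
```

Returning the failing `stage` makes it easy to route outcomes: schema failures trigger a retry, while low-confidence results go to the review queue.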
Error Classification Framework
Transient Errors (Retry)
- Network timeouts
- Rate limits (HTTP 429)
- Model capacity issues
- Temporary service unavailability (HTTP 503)
Persistent Errors (Escalate)
- Authentication failures (HTTP 401)
- Authorization errors (HTTP 403)
- Invalid requests (HTTP 400)
- Resource not found (HTTP 404)
Model Errors (Fallback)
- Token limit exceeded
- Content policy violations
- Context window overflow
- Response quality below threshold
Application Errors (Handle)
- Invalid user input
- Missing required context
- Business rule violations
- Unsupported operations
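The four classes map naturally onto a single dispatch function. Here `error.status` is assumed to carry the HTTP status from your API client, and the `code` strings are illustrative stand-ins for whatever your model provider actually returns.

```javascript
// Classification sketch: map an error to the action the sections above
// prescribe — retry, escalate, fallback, or handle in the application.
function classifyError(error) {
  const status = error.status;
  if (status === 429 || status === 503 || error.code === 'ETIMEDOUT') {
    return 'retry'; // transient: backoff and try again
  }
  if (status === 400 || status === 401 || status === 403 || status === 404) {
    return 'escalate'; // persistent: retrying will not help
  }
  if (error.code === 'context_length_exceeded' || error.code === 'content_policy') {
    return 'fallback'; // model-level: route to a backup model or strategy
  }
  return 'handle'; // application-level: validate input, surface to caller
}
```

Centralizing this decision in one function keeps retry loops, circuit breakers, and fallback cascades agreeing on what each error means.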
Observability for Error Handling
Key Metrics
- Error rate: Failed requests / total requests
- Retry rate: Requests requiring retry
- Fallback rate: Requests using backup models
- Circuit breaker trips: Frequency of OPEN state
- DLQ depth: Unprocessed failed requests
- Mean time to recovery: Average time from failure to resolution
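A few in-process counters are enough to compute the rates above; a real system would export them to a metrics backend such as Prometheus or Datadog rather than keep them in memory.

```javascript
// Minimal in-process counters for error, retry, and fallback rates.
class ErrorMetrics {
  constructor() {
    this.counts = { total: 0, failed: 0, retried: 0, fellBack: 0 };
  }
  record({ failed = false, retried = false, fellBack = false } = {}) {
    this.counts.total++;
    if (failed) this.counts.failed++;
    if (retried) this.counts.retried++;
    if (fellBack) this.counts.fellBack++;
  }
  errorRate() {
    return this.counts.total ? this.counts.failed / this.counts.total : 0;
  }
}
```

Comparing `errorRate()` against the alerting thresholds below over a sliding window is the simplest way to turn these counters into pages.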
Alerting Thresholds
- Error rate > 5% for 5 minutes
- Circuit breaker OPEN for > 2 minutes
- DLQ depth > 100 messages
- Fallback rate > 20% of traffic
Testing Error Handling
Chaos Engineering
- Latency injection: Add delays to test timeout handling
- Error injection: Force specific error codes
- Resource exhaustion: Test behavior under load
- Model degradation: Test with lower-quality model responses
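Error injection can be as small as a wrapper that fails a configurable fraction of calls with a chosen error, which lets tests exercise the retry, circuit breaker, and fallback paths without touching the real service. The injected 429 here is just one example error.

```javascript
// Chaos sketch: wrap any async function so a fraction of calls throw.
function withChaos(fn, {
  failureRate = 0.2,
  makeError = () => Object.assign(new Error('injected failure'), { status: 429 }),
} = {}) {
  return async (...args) => {
    if (Math.random() < failureRate) throw makeError();
    return fn(...args);
  };
}
```

Setting `failureRate` to 1 in a test pins down the all-failures path deterministically, while intermediate rates are better suited to soak testing.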
Test Scenarios
- Model API returns 429 rate limit
- Response exceeds token limit
- Model returns malformed JSON
- Network timeout during request
- All fallback models unavailable
Common Mistakes
- Infinite retries: Always set maximum retry count
- No backoff: Retrying immediately compounds the problem
- Generic error messages: Users need actionable information
- Swallowing errors: Log all failures for debugging
- Over-engineering: Start simple, add complexity as needed
- Ignoring partial failures: Handle degraded responses appropriately
Best Practices Summary
- Implement retries with exponential backoff and jitter
- Use circuit breakers to fail fast during outages
- Always have fallback models configured
- Validate responses semantically, not just syntactically
- Capture failed requests in dead letter queues
- Isolate critical operations with bulkheads
- Monitor error rates and alert proactively
- Test error scenarios with chaos engineering
- Provide clear, actionable error messages to users
- Document error codes and resolution steps
Conclusion
Robust error handling transforms AI agents from fragile prototypes into production-ready systems. By implementing the patterns covered—retries with backoff, circuit breakers, fallback models, semantic validation, and graceful degradation—you build agents that fail gracefully and recover automatically.
Remember: users don't remember when everything works perfectly. They remember how you handled failure. Make it count.