AI agents will fail. The question is whether they fail gracefully or catastrophically.
Most agent implementations treat errors as afterthoughts—add a try-catch, log the error, move on. That works for demos. It destroys production systems.
This guide covers battle-tested error handling patterns that keep your agents running when APIs time out, models hallucinate, and external services crumble. Every pattern includes implementation details, code examples, and real-world lessons from production deployments.
Traditional software fails deterministically: input X always produces error Y. AI agents introduce probabilistic failures: the same input can succeed, fail, or return a malformed response depending on model load, sampling, and the health of upstream services.
Different errors require different handling strategies. Categorize first, then respond.
| Category | Examples | Transient? | Primary Strategy |
|---|---|---|---|
| Network/Infrastructure | Timeouts, connection refused, DNS failures | Yes | Retry with backoff |
| Rate Limiting | 429 errors, quota exceeded | Yes | Respect headers + queue |
| Model Errors | Invalid response, malformed JSON, context overflow | Sometimes | Re-prompt + validate |
| Tool/API Errors | Invalid parameters, auth failures, business logic errors | No | Fallback or user input |
| State Errors | Lost context, corrupted memory, invalid transitions | No | Recovery + re-initialize |
| Business Logic | Policy violations, insufficient data, constraint failures | No | User clarification |
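The table above can be encoded as a small classifier that tags an error before any handling runs. This is a minimal sketch: the `status` and `code` fields are assumptions based on typical HTTP client error shapes, not a specific SDK.

```javascript
// Map a raw error to a category and whether retrying is worthwhile.
// Field names (status, code) follow common HTTP client conventions.
function categorizeError(error) {
  if (['ETIMEDOUT', 'ECONNRESET', 'ENOTFOUND'].includes(error.code)) {
    return { category: 'network', transient: true };
  }
  if (error.status === 429) {
    return { category: 'rate_limit', transient: true };
  }
  if (error.status === 503) {
    return { category: 'model_overload', transient: true };
  }
  if (error.status === 401 || error.status === 403) {
    return { category: 'auth', transient: false };
  }
  if (error instanceof SyntaxError || error.message?.includes('JSON')) {
    // Malformed model output: not retryable as-is, but a re-prompt may fix it
    return { category: 'model_error', transient: false, reprompt: true };
  }
  return { category: 'unknown', transient: false };
}
```

Categorizing first keeps the retry, fallback, and escalation logic downstream simple: each layer switches on the category instead of re-parsing raw errors.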
Most AI agent errors are transient. Network blips, model overload, rate limits—these resolve themselves. Automatic retry with intelligent backoff handles 60-80% of errors without user impact.
// Minimal sleep helper used by the retry loops below
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(fn, options = {}) {
const {
maxAttempts = 3,
baseDelay = 1000, // 1 second
maxDelay = 30000, // 30 seconds
jitter = true, // Add randomness
retryableErrors = ['ETIMEDOUT', 'ECONNRESET', '429', '503']
} = options;
let lastError;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error;
// Check if error is retryable
const isRetryable = retryableErrors.some(e =>
error.code === e || error.message?.includes(e)
);
if (!isRetryable || attempt === maxAttempts) {
throw error;
}
// Calculate delay with exponential backoff
let delay = Math.min(baseDelay * Math.pow(2, attempt - 1), maxDelay);
// Add jitter to prevent thundering herd
if (jitter) {
delay = delay * (0.5 + Math.random() * 0.5);
}
console.log(`Attempt ${attempt} failed, retrying in ${Math.round(delay)}ms`);
await sleep(delay);
}
}
throw lastError;
}
// Usage
const response = await withRetry(
() => openai.chat.completions.create({ ... }),
{ maxAttempts: 5, baseDelay: 2000 }
);
const retryStrategies = {
// Rate limits: Use Retry-After header
rateLimit: {
shouldRetry: (error) => error.status === 429,
getDelay: (error, attempt) => {
const retryAfter = error.headers?.['retry-after'];
return retryAfter ? parseInt(retryAfter, 10) * 1000 : 60000;
},
maxAttempts: 3
},
// Network errors: Exponential backoff
network: {
shouldRetry: (error) => ['ETIMEDOUT', 'ECONNRESET', 'ENOTFOUND'].includes(error.code),
getDelay: (error, attempt) => Math.min(1000 * Math.pow(2, attempt), 30000),
maxAttempts: 5
},
// Model overload: Longer delays
modelOverload: {
shouldRetry: (error) => error.status === 503 || error.message?.includes('overloaded'),
getDelay: (error, attempt) => Math.min(5000 * Math.pow(2, attempt), 120000),
maxAttempts: 3
},
// Context overflow: Cannot retry, need different strategy
contextOverflow: {
// matches() lets this non-retryable strategy still claim the error
matches: (error) => error.code === 'context_length_exceeded' || error.message?.includes('context length'),
shouldRetry: () => false,
handle: (error) => ({ action: 'truncate_context' })
}
};
async function smartRetry(fn, context = {}) {
let lastError;
for (let attempt = 1; attempt <= 10; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error;
// Find a matching strategy; matches() takes precedence so that
// non-retryable strategies (e.g. contextOverflow) can still claim the error
const strategy = Object.values(retryStrategies).find(
s => (s.matches || s.shouldRetry)(error)
);
// Strategies that cannot retry but define special handling
if (strategy?.handle && !strategy.shouldRetry(error)) {
return strategy.handle(error);
}
if (!strategy || attempt >= strategy.maxAttempts) {
throw error;
}
const delay = strategy.getDelay(error, attempt);
console.log(`[Attempt ${attempt}] ${error.message}. Waiting ${delay}ms`);
await sleep(delay);
}
}
throw lastError;
}
When a service is down, retrying is futile—and potentially harmful (DDoS amplification). Circuit breakers detect sustained failures and "trip" to prevent cascading damage.
class CircuitBreaker {
constructor(options = {}) {
this.failureThreshold = options.failureThreshold || 5;
this.successThreshold = options.successThreshold || 2;
this.timeout = options.timeout || 60000; // 1 minute
this.state = 'CLOSED';
this.failures = 0;
this.successes = 0;
this.lastFailureTime = null;
}
async execute(fn) {
if (this.state === 'OPEN') {
if (Date.now() - this.lastFailureTime > this.timeout) {
this.state = 'HALF_OPEN';
this.successes = 0;
} else {
throw new Error('Circuit breaker is OPEN');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
onSuccess() {
this.failures = 0;
if (this.state === 'HALF_OPEN') {
this.successes++;
if (this.successes >= this.successThreshold) {
this.state = 'CLOSED';
console.log('Circuit breaker: CLOSED (service recovered)');
}
}
}
onFailure() {
this.failures++;
this.lastFailureTime = Date.now();
if (this.state === 'HALF_OPEN') {
this.state = 'OPEN';
console.log('Circuit breaker: OPEN (service still failing)');
} else if (this.failures >= this.failureThreshold) {
this.state = 'OPEN';
console.log(`Circuit breaker: OPEN (${this.failures} consecutive failures)`);
}
}
getStatus() {
return {
state: this.state,
failures: this.failures,
lastFailureTime: this.lastFailureTime
};
}
}
// Usage: Create one circuit breaker per external service
const openaiBreaker = new CircuitBreaker({ failureThreshold: 3, timeout: 30000 });
const databaseBreaker = new CircuitBreaker({ failureThreshold: 5, timeout: 60000 });
// Wrap API calls
async function callOpenAI(prompt) {
return openaiBreaker.execute(async () => {
const response = await openai.chat.completions.create({
model: 'gpt-4',
messages: [{ role: 'user', content: prompt }]
});
return response.choices[0].message.content;
});
}
class ResilientAgent {
constructor() {
this.circuitBreakers = {
llm: new CircuitBreaker({ failureThreshold: 3 }),
database: new CircuitBreaker({ failureThreshold: 5 }),
tools: new CircuitBreaker({ failureThreshold: 2 })
};
}
async processQuery(query) {
const results = {
llm: null,
database: null,
tools: []
};
// Try LLM with circuit breaker
try {
results.llm = await this.circuitBreakers.llm.execute(
() => this.callLLM(query)
);
} catch (error) {
if (error.message.includes('Circuit breaker is OPEN')) {
// LLM is down, use fallback
results.llm = await this.getFallbackResponse(query);
} else {
throw error;
}
}
// Try database with circuit breaker
try {
results.database = await this.circuitBreakers.database.execute(
() => this.queryDatabase(query)
);
} catch (error) {
if (error.message.includes('Circuit breaker is OPEN')) {
// Database is down, use cached data
results.database = await this.getCachedData(query);
} else {
throw error; // Unexpected errors should surface, not be swallowed
}
}
return this.synthesizeResponse(results);
}
}
When retries and circuit breakers fail, you need fallback strategies. The goal: deliver degraded but functional service rather than complete failure.
class FallbackChain {
constructor(strategies) {
this.strategies = strategies; // Ordered by preference
}
async execute(context) {
const errors = [];
for (const [index, strategy] of this.strategies.entries()) {
try {
const result = await strategy.execute(context);
// Track which fallback was used
result.fallbackLevel = index;
result.fallbackName = strategy.name;
if (index > 0) {
console.log(`Using fallback: ${strategy.name} (level ${index})`);
result.degraded = true;
}
return result;
} catch (error) {
errors.push({ strategy: strategy.name, error: error.message });
console.log(`Fallback ${strategy.name} failed: ${error.message}`);
}
}
// All fallbacks exhausted
throw new Error(`All fallbacks failed: ${JSON.stringify(errors)}`);
}
}
// Example: Query resolution with fallbacks
const queryResolver = new FallbackChain([
{
name: 'primary_llm',
execute: async (ctx) => ({
answer: await callGPT4(ctx.query),
confidence: 0.95
})
},
{
name: 'secondary_llm',
execute: async (ctx) => ({
answer: await callClaude(ctx.query),
confidence: 0.90
})
},
{
name: 'cached_response',
execute: async (ctx) => {
const cached = await cache.get(similarQueryKey(ctx.query));
if (!cached) throw new Error('No cached response');
return { answer: cached, confidence: 0.70 };
}
},
{
name: 'knowledge_base_search',
execute: async (ctx) => ({
answer: await searchKnowledgeBase(ctx.query),
confidence: 0.60
})
},
{
name: 'static_apology',
execute: async (ctx) => ({
answer: "I'm experiencing technical difficulties. Please try again in a moment, or contact support if urgent.",
confidence: 0,
requiresHumanAttention: true
})
}
]);
// Usage
const result = await queryResolver.execute({ query: userQuery });
if (result.degraded) {
// Log degradation for monitoring
metrics.increment('agent.degraded_response', { level: result.fallbackLevel });
}
| Primary Model | Fallback Model | Use Case |
|---|---|---|
| GPT-4 | Claude 3 Opus | Complex reasoning |
| GPT-4 | GPT-3.5 Turbo | Speed-critical, simpler tasks |
| Claude 3 Opus | GPT-4 | Long context tasks |
| Any cloud model | Local LLaMA/Mistral | Privacy-sensitive, offline mode |
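One way to wire the table above into code is an ordered model list tried in sequence. This is a sketch: `callModel` is a hypothetical adapter over whatever SDK clients you actually use, and the model names are placeholders.

```javascript
// Try each model in order; return the first successful completion.
// callModel(model, prompt) is an assumed adapter over your real clients.
async function completeWithFallback(prompt, models, callModel) {
  const errors = [];
  for (const model of models) {
    try {
      return { model, text: await callModel(model, prompt) };
    } catch (error) {
      errors.push({ model, message: error.message });
    }
  }
  throw new Error(`All models failed: ${JSON.stringify(errors)}`);
}
```

Returning the model name alongside the text lets callers log which tier actually served the request, which feeds directly into the degradation metrics discussed later.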
Not all features are equal. When systems fail, disable non-essential features while preserving core functionality.
const FEATURE_TIERS = {
essential: {
features: ['basic_response', 'conversation_history'],
requiredServices: ['llm', 'database'],
fallbackMessage: null // Must work, no fallback
},
important: {
features: ['tool_execution', 'contextual_memory'],
requiredServices: ['llm', 'database', 'tool_service'],
fallbackMessage: "Some features are temporarily unavailable."
},
enhanced: {
features: ['image_analysis', 'web_browsing', 'advanced_reasoning'],
requiredServices: ['llm', 'database', 'vision_api', 'web_search'],
fallbackMessage: "Advanced features are currently disabled."
}
};
class GracefulDegradation {
constructor() {
this.serviceStatus = new Map();
}
updateServiceStatus(service, isHealthy) {
this.serviceStatus.set(service, isHealthy);
}
getAvailableFeatures() {
const available = [];
const degraded = [];
for (const [tier, config] of Object.entries(FEATURE_TIERS)) {
const allServicesUp = config.requiredServices.every(
service => this.serviceStatus.get(service) !== false
);
if (allServicesUp) {
available.push(...config.features);
} else {
degraded.push(...config.features);
}
}
return { available, degraded };
}
canExecute(feature) {
const { available } = this.getAvailableFeatures();
return available.includes(feature);
}
async executeWithDegradation(feature, fn, fallback = null) {
if (this.canExecute(feature)) {
try {
return await fn();
} catch (error) {
// Feature failed; fall back to the degraded value
console.error(`Feature ${feature} failed:`, error.message);
return fallback;
}
}
return fallback;
}
}
// Usage in agent
class ProductionAgent {
constructor() {
this.degradation = new GracefulDegradation();
}
async processWithFeatures(query) {
const response = {
answer: null,
features: {}
};
// Essential: Always try
response.answer = await this.generateBasicResponse(query);
// Important: Try if available
if (this.degradation.canExecute('tool_execution')) {
response.features.tools = await this.degradation.executeWithDegradation(
'tool_execution',
() => this.executeTools(query),
[]
);
}
// Enhanced: Nice to have
if (this.degradation.canExecute('image_analysis') && query.hasImage) {
response.features.imageInsights = await this.degradation.executeWithDegradation(
'image_analysis',
() => this.analyzeImage(query.image),
null
);
}
// Add degradation notice if needed
const { degraded } = this.degradation.getAvailableFeatures();
if (degraded.length > 0) {
response.degradedFeatures = degraded;
response.notice = "Some features are temporarily unavailable.";
}
return response;
}
}
AI agents maintain state across interactions. When errors occur mid-task, you need recovery patterns to continue without losing context.
class StatefulAgent {
constructor() {
this.state = {
conversationId: null,
context: [],
pendingActions: [],
completedActions: [],
checkpoints: []
};
}
// Create checkpoint before risky operations
async createCheckpoint(label) {
const checkpoint = {
id: generateId(),
label,
timestamp: Date.now(),
state: JSON.parse(JSON.stringify(this.state))
};
this.state.checkpoints.push(checkpoint);
// Persist to durable storage
await persistence.saveCheckpoint(checkpoint);
return checkpoint.id;
}
// Restore from checkpoint on failure
async restoreFromCheckpoint(checkpointId) {
const checkpoint = await persistence.getCheckpoint(checkpointId);
if (!checkpoint) {
throw new Error(`Checkpoint ${checkpointId} not found`);
}
this.state = checkpoint.state;
console.log(`Restored to checkpoint: ${checkpoint.label}`);
return true;
}
async executeWithRecovery(action, options = {}) {
const checkpointId = await this.createCheckpoint(`before_${action.type}`);
try {
const result = await this.executeAction(action);
this.state.completedActions.push(action);
return result;
} catch (error) {
console.error(`Action failed: ${action.type}`, error);
// Restore state
await this.restoreFromCheckpoint(checkpointId);
// Add failure to context for awareness
this.state.context.push({
role: 'system',
content: `Previous action failed: ${action.type}. Error: ${error.message}`
});
// Either retry with different approach or ask user
if (options.retryWithAlternative) {
return this.executeWithRecovery(
options.retryWithAlternative,
{ ...options, retryWithAlternative: null }
);
}
throw error;
}
}
}
// Error subclass signalling that a task was interrupted but can be resumed
class RecoverableError extends Error {
constructor(message, meta = {}) {
super(message);
this.name = 'RecoverableError';
this.meta = meta;
}
}
class RecoverableTask {
constructor(taskFn, options = {}) {
this.taskFn = taskFn;
this.options = {
maxResumptions: options.maxResumptions || 3,
persistState: options.persistState ?? true, // ?? so an explicit false is respected
...options
};
this.state = {
id: options.id || generateId(), // Stable id used by the persistence layer
status: 'pending',
progress: 0,
steps: [],
errors: [],
resumptions: 0
};
}
async execute() {
// Try to resume from persisted state
if (this.options.persistState) {
const savedState = await this.loadState();
if (savedState && savedState.status === 'interrupted') {
this.state = savedState;
console.log(`Resuming task from step ${this.state.steps.length}`);
}
}
try {
const result = await this.taskFn({
state: this.state,
reportProgress: (progress, step) => this.updateProgress(progress, step),
isInterrupted: () => this.state.status === 'interrupted'
});
this.state.status = 'completed';
await this.clearState();
return result;
} catch (error) {
this.state.errors.push({
timestamp: Date.now(),
error: error.message,
stack: error.stack
});
if (this.state.resumptions < this.options.maxResumptions) {
this.state.status = 'interrupted';
this.state.resumptions++;
await this.saveState();
throw new RecoverableError(
'Task interrupted. Can be resumed.',
{ taskId: this.state.id, resumptionsLeft: this.options.maxResumptions - this.state.resumptions }
);
}
this.state.status = 'failed';
throw error;
}
}
updateProgress(progress, step) {
this.state.progress = progress;
if (step) {
this.state.steps.push({ timestamp: Date.now(), description: step });
}
if (this.options.persistState) {
this.saveState().catch(() => {}); // Fire-and-forget; persistence errors must not break progress
}
}
async saveState() {
await persistence.saveTask(this.state.id, this.state);
}
// loadTask/deleteTask are assumed counterparts to saveTask on the same persistence layer
async loadState() {
return persistence.loadTask(this.state.id);
}
async clearState() {
await persistence.deleteTask(this.state.id);
}
}
When errors occur, preserving context is critical for debugging and recovery. Log enough to reconstruct what happened.
class AgentError extends Error {
constructor(message, context = {}) {
super(message);
this.name = 'AgentError';
this.context = {
timestamp: Date.now(),
conversationId: context.conversationId,
agentState: context.agentState,
action: context.action,
input: context.input,
previousActions: context.previousActions,
environment: {
model: context.model,
temperature: context.temperature,
tokensUsed: context.tokensUsed
},
recovery: context.recovery // Recovery options
};
}
toLog() {
return {
name: this.name,
message: this.message,
context: this.context,
stack: this.stack
};
}
}
// Enhanced error logging
function logAgentError(error, additionalContext = {}) {
const logEntry = {
timestamp: Date.now(),
error: error.toLog ? error.toLog() : {
name: error.name,
message: error.message,
stack: error.stack
},
context: additionalContext,
sessionId: getSessionId(),
agentVersion: getAgentVersion()
};
// Log to multiple destinations
console.error(JSON.stringify(logEntry, null, 2));
// Structured logging service
logger.error('agent_error', logEntry);
// For critical errors, send to monitoring
if (error.context?.severity === 'critical') {
monitoring.alert(logEntry);
}
}
You can't fix what you don't measure. Implement comprehensive error monitoring.
| Metric | Description | Alert Threshold |
|---|---|---|
| Error Rate | Errors / Total Requests | > 5% |
| Error Rate by Type | Errors segmented by category | Varies by type |
| Mean Time to Recovery | Average time from error to resolution | > 5 minutes (P0) |
| Cascade Rate | Errors that trigger secondary errors | > 10% |
| Fallback Usage | % of requests using degraded mode | > 20% |
| Circuit Breaker Trips | How often breakers open | > 3/hour per service |
class ErrorMonitor {
constructor() {
this.errorCounts = new Map();
this.alertThresholds = {
error_rate: { threshold: 0.05, window: 60000 },
cascade_rate: { threshold: 0.10, window: 300000 },
circuit_breaker_trips: { threshold: 3, window: 3600000 }
};
}
recordError(error, context) {
const errorType = this.categorizeError(error);
const key = `${errorType}:${Math.floor(Date.now() / 60000)}`;
this.errorCounts.set(key, (this.errorCounts.get(key) || 0) + 1);
// Check if thresholds exceeded
this.checkThresholds(errorType);
// Log to observability
this.logError(error, context);
}
checkThresholds(errorType) {
const now = Date.now();
for (const [metric, config] of Object.entries(this.alertThresholds)) {
const windowStart = now - config.window;
// Count errors in window
let count = 0;
for (const [key, value] of this.errorCounts) {
const [, timestamp] = key.split(':');
if (parseInt(timestamp, 10) * 60000 >= windowStart) {
count += value;
}
}
// Check threshold
if (count > config.threshold * 100) { // Assuming 100 requests baseline
this.triggerAlert(metric, { count, threshold: config.threshold });
}
}
}
triggerAlert(metric, data) {
const alert = {
timestamp: Date.now(),
metric,
data,
severity: data.count > data.threshold * 2 ? 'critical' : 'warning'
};
// Send to alerting system
alerting.notify(alert);
// Log for audit
console.warn('ALERT:', JSON.stringify(alert));
}
}
Our team has built error-resilient AI agents for 100+ production deployments. We can implement these patterns in your stack in days, not months.
Setup packages start at $99.
**How many retry attempts should I configure?** 3-5 retries for most cases. More than 5 provides diminishing returns and increases latency. Use exponential backoff with jitter to prevent thundering herd problems.
**When should I use circuit breakers instead of simple retries?** Use circuit breakers when: (1) The service is critical and failures are costly, (2) You have multiple instances making requests, (3) You need to protect a struggling service from additional load. Use simple retries for idempotent operations on isolated failures.
**How should an agent recover from mid-task failures?** Implement checkpoints before each step. On failure, restore to the last checkpoint and either retry with different parameters or ask the user for guidance. Never leave the agent in an undefined state.
**Should I show raw errors to users?** Generally no. Translate technical errors into user-friendly messages. Only expose error details when: (1) User action can resolve it (invalid input), (2) User needs context (service down), (3) Regulatory requirement (audit trail). Never expose internal system details.
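A small translation layer keeps internal details out of user-facing copy. This is a sketch: the category key and the message catalogue are illustrative, not a fixed API.

```javascript
// Translate internal error categories into safe, user-friendly messages.
// userCategory is assumed to be set by an upstream categorization step.
const USER_MESSAGES = {
  rate_limit: "We're handling a lot of requests right now. Please try again shortly.",
  service_down: 'This feature is temporarily unavailable.',
  invalid_input: 'That input could not be processed. Please check it and retry.',
};

function toUserMessage(error) {
  const key = error.userCategory;
  // Default message intentionally reveals nothing about internals
  return USER_MESSAGES[key] || 'Something went wrong on our side. Please try again.';
}
```

The full technical error still goes to your structured logs; only the translated string reaches the user.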
**How do I test error handling?** Chaos engineering: inject failures at every layer (network, API, model, database). Use fault injection proxies to simulate rate limits, timeouts, and malformed responses. Test recovery paths as rigorously as happy paths.
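Fault injection can start simpler than a full proxy: wrap a client call so it randomly throws the failures you want exercised. This is a sketch; the rate and the error shapes are assumptions you would tune per test.

```javascript
// Wrap any async function so it fails with injected errors at a given rate.
// Useful in tests to exercise retry, circuit breaker, and fallback paths.
function withFaultInjection(fn, { rate = 0.2, faults = [] } = {}) {
  return async (...args) => {
    if (faults.length > 0 && Math.random() < rate) {
      const fault = faults[Math.floor(Math.random() * faults.length)];
      // Copy extra fields (status, code, ...) onto the thrown Error
      throw Object.assign(new Error(fault.message), fault);
    }
    return fn(...args);
  };
}
```

Setting `rate: 1` in a test forces the failure path deterministically, so you can assert that your retry and fallback code actually fires.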