When your AI agent fails at 3 AM, you don't have time to google solutions. This playbook gives you step-by-step procedures for the 7 most common AI failures—from hallucinations to security breaches. Print it. Bookmark it. Know it before you need it.
The Golden Rule of AI Emergencies
STOP. CAPTURE. ASSESS. ESCALATE.
Before you fix anything, you must:
- STOP the agent from making things worse
- CAPTURE logs, state, and context
- ASSESS severity and scope of impact
- ESCALATE to the right people if needed
Most failures become disasters because people skip step 1.
Scenario 1: AI Hallucination (False Success)
What It Looks Like
Your agent reports task completion, but no files exist. Or outputs are nonsensical. Or decisions make no logical sense.
Immediate Actions (First 5 Minutes)
- Halt the agent
- Disable cron jobs/scheduled tasks
- Revoke API access temporarily
- Stop any downstream processes consuming agent output
- Capture evidence
- Save all logs from the past 24 hours
- Export agent state (memory, context, configuration)
- Screenshot error messages or unexpected outputs
- Verify scope
- Check: Did the agent actually create/modify files?
- Check: Did the agent send emails, make API calls, or trigger actions?
- Check: Are other agents or systems affected?
Recovery Steps (Next 30 Minutes)
- Identify affected data
- List all outputs generated during the hallucination period
- Mark them for manual review or rollback
- Rollback if possible
- Revert to last known good state (you have backups, right?)
- If no rollback, manually correct critical outputs
- Root cause analysis
- Was the prompt ambiguous?
- Did the agent lack necessary context?
- Is there a model limitation or bug?
- Add guardrails before re-enabling
- Implement output verification (filesystem checks)
- Add human review for high-stakes outputs
- Create tests for this specific scenario
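The filesystem check above can be sketched as a post-run verification step. This is a minimal example, not a complete framework; the function name and the idea of collecting "claimed" paths from the agent's report are illustrative:

```python
from pathlib import Path

def verify_outputs(claimed_paths):
    """Cross-check an agent's claimed outputs against the filesystem.

    Returns the paths the agent claimed to create that do not actually
    exist or are empty -- a common signal of a false-success hallucination.
    """
    missing = []
    for p in map(Path, claimed_paths):
        if not p.exists() or p.stat().st_size == 0:
            missing.append(str(p))
    return missing

# Illustrative usage: halt downstream processing on any missing output.
# missing = verify_outputs(["reports/summary.md", "data/output.csv"])
# if missing:
#     raise RuntimeError(f"Agent claimed outputs that don't exist: {missing}")
```

Run this check between the agent's "done" report and anything that consumes its output, so a hallucinated success never propagates.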
Scenario 2: Silent Death (Agent Stops Working)
What It Looks Like
Your cron job hasn't run in 3 days. No alerts triggered. The agent just... stopped. No error messages. No logs.
Immediate Actions
- Check agent health
- Is the process running? (`ps aux | grep agent`)
- Is the cron job still scheduled? (`crontab -l`)
- Are there zombie processes?
- Check infrastructure
- Is the server/container running?
- Are network connections working?
- Is disk space available?
- Check recent changes
- Did someone modify the agent code?
- Were environment variables changed?
- Did API keys expire or rotate?
Recovery Steps
- Restart the agent
- Manually trigger the agent to test functionality
- If it works, add to monitoring dashboard
- If it fails, debug with verbose logging enabled
- Implement watchdog monitoring
- Create a separate system that checks for agent activity
- Alert if no activity for 2x expected interval
- Use external monitoring service (not self-monitoring)
- Add self-healing
- Create auto-restart scripts for common failure modes
- Implement health checks that trigger recovery
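One minimal watchdog pattern, assuming the agent touches a heartbeat file on every successful cycle (the file path, interval, and `send_alert` helper are illustrative, not part of any specific tool):

```python
import time
from pathlib import Path

def agent_is_alive(heartbeat: Path, expected_interval: float,
                   now: float = None) -> bool:
    """True if the heartbeat file was touched within 2x the expected interval.

    This check must run from a *separate* process (its own cron entry,
    ideally on another host) -- a dead agent cannot report itself dead.
    """
    now = time.time() if now is None else now
    if not heartbeat.exists():
        return False
    return (now - heartbeat.stat().st_mtime) <= 2 * expected_interval

# Illustrative usage from a watchdog cron job:
# if not agent_is_alive(Path("/var/run/agent.heartbeat"), expected_interval=600):
#     send_alert("agent silent for 2x expected interval")  # hypothetical helper
```

The "2x expected interval" threshold matches the alerting rule above and tolerates a single slow run without paging anyone.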
Scenario 3: Amnesic Loop (Agent Repeats Mistakes)
What It Looks Like
The agent makes the same mistake repeatedly. It doesn't learn from feedback. Every session is a blank slate.
Immediate Actions
- Stop the agent — Don't let it keep making the same mistake
- Check memory systems
- Is the vector database accessible?
- Is feedback being stored?
- Is context being retrieved before decisions?
- Review feedback.json
- Does it contain the rejection reasons?
- Is the agent reading it before acting?
Recovery Steps
- Fix the memory pipeline
- Ensure feedback is being written to storage
- Verify retrieval is working (test with known queries)
- Check context window isn't truncating critical memory
- Add explicit memory checks
- Before major decisions, force retrieval of similar past cases
- Log memory retrieval to verify it's happening
- Create decision logging
- Log every decision with: context used, alternatives considered, outcome
- This creates audit trail for debugging amnesia
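The decision log above can be as simple as an append-only JSON Lines file. The field names here are illustrative, not a required schema:

```python
import json
import time

def log_decision(path, context_used, alternatives, chosen, outcome=None):
    """Append one decision record as a JSON line, building an audit trail.

    An empty `context_used` on a decision that should have drawn on past
    feedback is direct evidence of the amnesia described above.
    """
    record = {
        "timestamp": time.time(),
        "context_used": context_used,   # memory/feedback the agent retrieved
        "alternatives": alternatives,   # options it considered
        "chosen": chosen,               # what it actually did
        "outcome": outcome,             # filled in later once known
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Because each line is independent JSON, the log survives crashes mid-write and can be grepped or loaded line by line during an incident.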
Scenario 4: Security Breach (Unauthorized Access)
What It Looks Like
Agent accessed data it shouldn't. Or API keys are compromised. Or agent is being used maliciously.
Immediate Actions (First Minute)
- Kill the agent immediately (`pkill -9 agent_process`)
- Disable all API keys the agent uses
- Revoke network access
- Rotate all credentials
- Generate new API keys
- Change database passwords
- Update secrets manager
- Assess breach scope
- What data was accessed?
- What actions were taken?
- Are other systems compromised?
Recovery Steps
- Security audit
- Review all agent access logs
- Check for lateral movement
- Identify attack vector
- Report if required
- GDPR: 72-hour breach notification requirement
- Check state/federal requirements
- Document everything for legal/compliance
- Implement safeguards
- Reduce agent permissions (principle of least privilege)
- Add rate limiting and anomaly detection
- Implement approval workflows for sensitive actions
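The rate-limiting safeguard can be sketched as a sliding-window limiter wrapped around agent actions; the thresholds are illustrative, and a production version would also emit an alert when actions are denied:

```python
import time
from collections import deque

class RateLimiter:
    """Deny actions beyond `max_actions` per `window_seconds` sliding window."""

    def __init__(self, max_actions: int, window_seconds: float):
        self.max_actions = max_actions
        self.window = window_seconds
        self.events = deque()  # timestamps of recently allowed actions

    def allow(self, now: float = None) -> bool:
        now = time.time() if now is None else now
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) >= self.max_actions:
            return False  # over the limit: block, and ideally alert
        self.events.append(now)
        return True
```

A sudden burst of denials is itself an anomaly signal worth paging on, since a compromised agent often shows up first as an unusual action rate.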
Scenario 5: Cost Overrun (Budget Explosion)
What It Looks Like
Your $500/month AI budget hit $5,000. Agent is making excessive API calls. Token usage is through the roof.
Immediate Actions
- Impose hard budget limits
- Set API provider budget caps
- Implement circuit breakers at 80% budget
- Disable agent if budget exceeded
- Analyze cost sources
- Which operations are most expensive?
- Is there a runaway loop?
- Are retries happening excessively?
- Implement cost controls
- Add token counting before expensive operations
- Switch to cheaper models for simple tasks
- Implement aggressive caching
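The 80% circuit breaker above can be sketched as a spend tracker that warns early and trips hard. The figures mirror the $500 budget from this scenario; how you wire "tripped" to actually disabling the agent depends on your stack:

```python
class BudgetBreaker:
    """Warns at `warn_fraction` of the budget, trips at the hard limit."""

    def __init__(self, hard_limit: float, warn_fraction: float = 0.8):
        self.hard_limit = hard_limit
        self.warn_threshold = hard_limit * warn_fraction
        self.spent = 0.0

    def record(self, cost: float) -> str:
        """Record one operation's cost; returns 'ok', 'warning', or 'tripped'."""
        self.spent += cost
        if self.spent >= self.hard_limit:
            return "tripped"   # disable the agent until a human intervenes
        if self.spent >= self.warn_threshold:
            return "warning"   # alert: 80% of budget consumed
        return "ok"
```

Call `record()` before, not after, each expensive API call if you want the breaker to prevent the overrun rather than merely report it.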
Scenario 6: Data Corruption (Agent Destroys Data)
What It Looks Like
Agent deleted important files. Or modified database records incorrectly. Or overwrote configuration.
Immediate Actions
- Stop the agent — Prevent further damage
- Assess corruption scope
- What data was affected?
- Is there a clean backup?
- Can changes be reversed?
- Restore from backup
- Use most recent clean backup
- Verify backup integrity before restore
- Document what was lost
Prevention for Next Time
- Implement write-ahead logging for all agent changes
- Add approval workflow for destructive operations
- Create sandbox environments for testing
- Enable version control for all agent-modified files
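The approval workflow for destructive operations can be a thin gate that refuses to run flagged operations without explicit human sign-off. The operation names and the callable-based interface are illustrative:

```python
# Operations that must never run without a human in the loop (illustrative set).
DESTRUCTIVE_OPS = {"delete_file", "drop_table", "overwrite_config"}

def execute(op_name: str, action, approved: bool = False):
    """Run `action` only if `op_name` is non-destructive or explicitly approved."""
    if op_name in DESTRUCTIVE_OPS and not approved:
        raise PermissionError(
            f"'{op_name}' is destructive and requires human approval"
        )
    return action()
```

Routing every agent action through one chokepoint like this also gives you a single place to add the write-ahead logging mentioned above.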
Scenario 7: Regulatory Violation (Compliance Breach)
What It Looks Like
Agent processed PII without consent. Or retained data too long. Or shared data with unauthorized parties.
Immediate Actions
- Stop the violating process
- Disable the agent function causing the violation
- Prevent further data processing
- Document everything
- What data was affected?
- Who was impacted?
- What was the violation?
- When did it start/end?
- Consult legal/compliance
- Determine notification requirements
- Assess penalties and remediation
- Create incident report
- Implement safeguards
- Add PII detection before processing
- Implement data retention automation
- Create consent verification workflows
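A first-pass PII screen can use simple regex heuristics before any processing. This is a rough sketch only: the patterns cover obvious US-style formats, will miss plenty, and a real deployment would use a dedicated detection service:

```python
import re

# Illustrative patterns -- intentionally loose, tuned for recall over precision.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def detect_pii(text: str):
    """Return the PII categories found in `text` (empty list if none)."""
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(text)]
```

Run this on every input before the agent touches it, and route anything flagged to the consent-verification workflow instead of normal processing.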
Post-Incident Checklist
After handling any emergency, complete this checklist within 48 hours:
- ☐ Incident documented in detail
- ☐ Root cause identified and documented
- ☐ Prevention measures implemented
- ☐ Monitoring/alerting added for this scenario
- ☐ Runbook updated with lessons learned
- ☐ Team briefed if applicable
- ☐ Users notified if affected
- ☐ Compliance/legal review completed if required
Don't Wait for the Emergency
Most AI failures are preventable with proper setup. Our professional agent setup includes emergency response planning, monitoring, and guardrails from day one.
Get Professional Setup