Bad documentation kills AI projects. When your AI expert leaves, when you need to debug a failing agent, when you want to scale to new use cases—without documentation, you're starting from zero. This guide shows you exactly what to document and how, so your AI agents remain maintainable, scalable, and transferable.
AI agents aren't like traditional software. They're probabilistic, context-dependent, and constantly evolving. This creates unique documentation challenges:
| Aspect | Traditional Software | AI Agents |
|---|---|---|
| Behavior | Deterministic (same input = same output) | Probabilistic (same input = different possible outputs) |
| Logic | Explicit code rules | Implicit patterns in training data + prompts |
| Testing | Deterministic unit tests with known expected outputs | Statistical evaluation; edge cases hard to enumerate |
| Updates | Version control for code | Version control for prompts, data, AND code |
| Debugging | Stack traces, logs | Need to reconstruct context, prompt, model state |
This means you need to document intent, not just implementation. Future maintainers need to understand why the agent behaves a certain way, not just what it does.
High-level documentation for stakeholders and new team members:
The most critical documentation for AI agents. You need to capture:
Document your agent's knowledge sources:
If you use fine-tuning or few-shot examples:
Operational documentation for keeping the agent running:
Prompts are the "code" of AI agents. Treat them like code: version control, comments, and testing.
# Prompt: Customer Service - Order Status Query
Version: 3.2
Last Updated: 2026-02-15
Author: Sarah Chen
Status: Production
## Purpose
Handle customer inquiries about order status, shipping, and delivery.
## When Used
- Triggered by: order_status intent
- Fallback from: general_inquiry when order number detected
## Prompt Text
[Your system prompt here]
## Context Variables
- {{customer_name}}: Customer's first name (from CRM)
- {{order_number}}: Extracted order number (validated format)
- {{order_status}}: Current status from OMS (enum: processing, shipped, delivered, returned)
- {{tracking_number}}: Tracking number if shipped
- {{estimated_delivery}}: Estimated delivery date
## Examples
### Example 1: Shipped Order
Input: "Where's my order #12345?"
Context: order_status = "shipped", tracking_number = "1Z999AA10123456784"
Expected Output: "Hi Sarah! Your order #12345 is on its way. Track it here: [tracking link]. Estimated delivery: Feb 20."
### Example 2: Processing Order
Input: "Is order 67890 shipped yet?"
Context: order_status = "processing"
Expected Output: "Your order #67890 is still being prepared. We'll email you when it ships (typically within 2 business days)."
### Example 3: Invalid Order Number
Input: "Check order ABC"
Context: order_number validation failed
Expected Output: "I couldn't find that order. Can you double-check the order number? It should look like #12345."
## Known Limitations
- Can only query orders from last 12 months
- No international tracking for economy shipping
- Returns/refunds require human escalation
## Performance Metrics
- Accuracy: 94% (based on human review of 500 conversations)
- CSAT: 4.2/5.0
- Escalation rate: 6%
## Change History
- v3.2 (2026-02-15): Added estimated delivery mention for shipped orders
- v3.1 (2026-02-01): Fixed issue with invalid order number handling
- v3.0 (2026-01-15): Major rewrite for tone consistency
- v2.0 (2025-12-01): Added context variables for personalization
💡 Pro tip: Store prompts in Git alongside your code. Use semantic versioning (v1.0, v1.1, v2.0) and tag releases. This makes rollback easy and creates an audit trail.
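Once prompts live in Git as plain files, rendering them is just template substitution. Here's a minimal sketch of filling `{{variable}}` placeholders like the context variables above; the function name and strict missing-variable check are illustrative choices, not part of any specific framework:

```python
import re

def render_prompt(template: str, context: dict) -> str:
    """Substitute {{variable}} placeholders; fail loudly if a variable is missing."""
    def repl(match):
        key = match.group(1)
        if key not in context:
            # Failing hard beats silently shipping "Hi {{customer_name}}!" to a user.
            raise KeyError(f"Missing context variable: {key}")
        return str(context[key])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", repl, template)

template = "Hi {{customer_name}}! Your order #{{order_number}} is {{order_status}}."
print(render_prompt(template, {
    "customer_name": "Sarah",
    "order_number": "12345",
    "order_status": "shipped",
}))
# Hi Sarah! Your order #12345 is shipped.
```

Raising on missing variables turns a documentation gap (an undocumented context variable) into an immediate error instead of a confusing user-facing reply.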
Document how you test prompts before deployment:
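Because outputs are probabilistic, prompt tests usually assert *properties* of the reply rather than exact strings. A sketch of what such checks might look like for the order-status prompt documented above (the function and length budget are illustrative assumptions, not a standard API):

```python
def check_order_status_reply(reply: str, context: dict) -> list[str]:
    """Property-based checks for an order-status reply; returns a list of failures."""
    failures = []
    if context["order_number"] not in reply:
        failures.append("reply must echo the order number")
    if context["order_status"] == "shipped" and context["tracking_number"] not in reply:
        failures.append("shipped orders must include the tracking number")
    if len(reply) > 400:
        failures.append("reply exceeds length budget")
    return failures

ctx = {"order_number": "12345", "order_status": "shipped",
       "tracking_number": "1Z999AA10123456784"}
reply = "Hi Sarah! Your order #12345 is on its way. Track it: 1Z999AA10123456784."
assert check_order_status_reply(reply, ctx) == []
```

Documenting these checks alongside the prompt means a future maintainer can re-run them after any prompt change and know exactly which invariants the rewrite broke.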
Your knowledge base is the agent's reference library. Document it thoroughly.
| Source | Type | Update Frequency | Priority | Owner |
|---|---|---|---|---|
| Product Catalog | Database | Real-time | Highest | Catalog Team |
| FAQ Database | Notion | Weekly | High | Support Team |
| Return Policy | | Monthly | High | Legal Team |
| Shipping Rates | API | Real-time | Medium | Logistics Team |
| Size Guides | Static HTML | Quarterly | Low | Merchandising |
Track what your agent doesn't know:
⚠️ Document knowledge gaps: Every "I don't know" response should trigger a review. Is this a permanent gap (outside scope) or a fixable gap (missing documentation)? Log these and review monthly.
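That monthly review only works if the gaps are captured somewhere. A minimal sketch of a gap log, assuming an in-memory list stands in for whatever store (database table, analytics event) you actually use:

```python
import datetime

GAP_LOG = []  # stand-in for a database table or analytics event stream

def log_knowledge_gap(query: str, retrieved_docs: int, classification: str = "unreviewed"):
    """Record every 'I don't know' response so the monthly review can triage it."""
    GAP_LOG.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "query": query,
        "retrieved_docs": retrieved_docs,
        # Reviewers later reclassify as "permanent" (out of scope)
        # or "fixable" (missing documentation).
        "classification": classification,
    })

log_knowledge_gap("Do you price-match competitors?", retrieved_docs=0)
```

Capturing `retrieved_docs` alongside the query helps the reviewer distinguish retrieval failures (documents exist but weren't found) from genuine gaps (nothing to retrieve).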
If you fine-tune models or use few-shot examples, document your data thoroughly.
If humans label your data, document the guidelines they followed:
# Dataset: Customer Service Intent Classification
Version: 2.1
Created: 2025-11-15
Last Updated: 2026-02-10
## Overview
- Purpose: Train intent classification model for customer service bot
- Size: 15,000 labeled customer queries
- Labels: 12 intent categories
## Data Sources
- 8,000 historical chat transcripts (2024-2025)
- 4,000 email subject lines (2024-2025)
- 3,000 synthetic examples (generated by GPT-4)
## Label Distribution
- check_order_status: 3,200 (21%)
- return_request: 2,100 (14%)
- product_question: 1,900 (13%)
- shipping_inquiry: 1,700 (11%)
- payment_issue: 1,500 (10%)
- account_help: 1,400 (9%)
- complaint: 1,200 (8%)
- general_inquiry: 1,000 (7%)
- [other intents: 1,000 total]
## Labeling Process
- Labelers: 5 trained annotators
- Guidelines: /docs/labeling/intent-guidelines-v2.md
- Inter-rater reliability: Cohen's κ = 0.87 (almost perfect agreement by the common Landis–Koch benchmarks)
- Quality check: 10% random samples reviewed by senior annotator
## Known Limitations
- Under-represents non-English queries (only 3% of dataset)
- Heavy on e-commerce contexts, light on B2B scenarios
- Synthetic examples may not capture real phrasing diversity
## Performance
- Test set accuracy: 91.3%
- F1 score (macro): 0.89
- Lowest-performing intent: "complaint" (F1 = 0.78)
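The inter-rater reliability figure above is worth knowing how to reproduce when you re-label data. Cohen's kappa for two annotators can be computed in a few lines of plain Python (libraries like scikit-learn offer the same metric; this sketch just makes the formula explicit):

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(freq_a) | set(freq_b))
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Perfect agreement yields 1.0, and agreement no better than chance yields 0.0, which is why κ = 0.87 is a far stronger claim than "87% of labels matched."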
This is your runbook for keeping the agent healthy.
| Failure Mode | Symptoms | Debugging Steps | Fix |
|---|---|---|---|
| High latency | Response time > 10s | Check API status, context length, retrieval latency | Reduce context, optimize retrieval, add caching |
| Accuracy drop | Accuracy < 85% | Review recent conversations, check for prompt drift | Revert prompt, update training data |
| Cost spike | Daily cost > 2x baseline | Check query volume, token usage, model version | Add rate limiting, optimize prompts |
| Knowledge stale | Outdated information in responses | Check knowledge base sync status | Trigger manual sync, update sync schedule |
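The runbook table above can double as an automated health check. A sketch, assuming hypothetical metric names and the thresholds from the table (your monitoring stack will expose its own):

```python
def triage(metrics: dict, baseline_daily_cost: float) -> list[str]:
    """Map live metrics to the runbook's failure modes; returns triggered alerts."""
    alerts = []
    if metrics["p95_latency_s"] > 10:
        alerts.append("high_latency: check API status, context length, retrieval latency")
    if metrics["accuracy"] < 0.85:
        alerts.append("accuracy_drop: review recent conversations for prompt drift")
    if metrics["daily_cost"] > 2 * baseline_daily_cost:
        alerts.append("cost_spike: check query volume, token usage, model version")
    return alerts

# Healthy day: nothing fires.
assert triage({"p95_latency_s": 3.0, "accuracy": 0.93, "daily_cost": 40.0},
              baseline_daily_cost=50.0) == []
```

Encoding the thresholds in code keeps the runbook and the alerting from drifting apart: when you change one, the diff forces you to change the other.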
When should humans get involved?
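Whatever your answer, write the triggers down as explicit rules rather than tribal knowledge. A purely illustrative sketch (the intents, thresholds, and signal names are assumptions, not recommendations):

```python
def should_escalate(intent: str, confidence: float, turns: int, sentiment: float) -> bool:
    """Illustrative escalation rules; the real triggers belong in your escalation doc."""
    if intent in {"return_request", "complaint"} and sentiment < -0.5:
        return True   # upset customer on a sensitive intent
    if confidence < 0.6:
        return True   # classifier unsure which flow applies
    if turns > 6:
        return True   # conversation going in circles
    return False
```

The point is not these particular numbers but that each threshold is documented, versioned, and reviewable, exactly like a prompt.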
| Category | Tools | Use Case |
|---|---|---|
| Version Control | Git, GitHub, GitLab | Prompt versioning, change history |
| Documentation | Notion, Confluence, GitBook | System overview, knowledge base docs |
| Prompt Management | LangSmith, PromptLayer, Humanloop | Prompt testing, versioning, monitoring |
| Data Documentation | DataHub, Amundsen, Data Catalog | Dataset metadata, lineage |
| Monitoring | Datadog, Grafana, LangSmith | Performance metrics, alerting |
# [Agent Name] Documentation
## Overview
- **Purpose:** [What problem does this agent solve?]
- **Users:** [Who interacts with this agent?]
- **Success Metrics:** [How do you measure success?]
- **Owner:** [Who maintains this agent?]
## Architecture
[High-level diagram or description]
## Prompts
- [Link to prompt repository]
- Key prompts: [List main prompts with brief descriptions]
## Knowledge Base
- Sources: [List knowledge sources]
- Update frequency: [How often is knowledge refreshed?]
- Known gaps: [What information is missing?]
## Training Data
- Datasets: [List datasets with links to documentation]
- Labeling guidelines: [Link to labeling guide]
## Monitoring
- Dashboard: [Link to monitoring dashboard]
- Alert thresholds: [List key thresholds]
- Escalation: [Link to escalation guide]
## Maintenance
- Update process: [Link to runbook]
- Common issues: [Link to troubleshooting guide]
- On-call: [Who to contact for emergencies?]
## Change Log
| Date | Change | Author |
|------|--------|--------|
| YYYY-MM-DD | [Description] | [Name] |
Real users don't follow the script. Document edge cases, failure modes, and fallback behaviors.
Prompts change. Without version history, you can't debug old conversations or roll back bad changes.
Document the context in which prompts are used. A prompt that works in one scenario may fail in another.
Documentation that's not updated is worse than no documentation (it's misleading). Set a quarterly review schedule.
Don't just document what the agent does—document why it does it that way. Future maintainers need intent, not just implementation.
Every piece of documentation needs an owner. Who updates it? Who reviews it? Who answers questions about it?
Don't document everything. Focus on high-value documentation: prompts, knowledge base, failure modes, update procedures.
I offer AI agent setup packages that include comprehensive documentation from day one. Don't wait until you're debugging a production incident to realize you needed better docs.
View Setup Packages →