March 22, 2026 · 7 min read
Building Failure Protocols Into Your OpenClaw Agent
Why most agent configs fail silently. How to write Severity 1 vs Severity 2 failure protocols in AGENTS.md. Rollback rules, the three-attempt rule, and post-mortems.
Why Agents Fail Silently
Most AI agent configurations have no failure protocols. The agent tries something, hits an error, and either retries indefinitely, moves on silently, or stops with no explanation. None of these outcomes are acceptable in a production system.
Silent failures are the worst. The agent completes what looks like a successful session. You check in the next morning. Nothing that was supposed to happen happened — and you have no log of what went wrong or why.
This is a configuration problem, not a model problem. Every production AGENTS.md should include explicit failure protocols. Here's how to write them.
Severity Levels: The Foundation
Before writing any failure protocol, you need a severity classification system. Without it, every error gets treated the same way — which means either everything escalates (noisy and useless) or nothing does (dangerous).
Here's a two-level system that works for most operators:
```
Severity Levels
### Severity 1 — Recoverable Failure
Definition: The task failed but the system is stable. No data was lost,
no external action was taken incorrectly, and the failure can be retried
or skipped without downstream impact.
Examples:
- API timeout on a non-critical read
- File not found (expected to exist but didn't)
- Draft generated but review queue is unavailable
Protocol:
1. Log the failure to /logs/errors.md with timestamp, task, and error detail
2. Retry per the retry policy (up to three attempts with increasing waits)
3. If all retries fail, skip the task and flag in morning brief
4. Do not interrupt operator unless pattern repeats 3+ times
### Severity 2 — Critical Failure
Definition: The task failure has downstream impact, data may be affected,
an external action may have been taken incorrectly, or the failure affects
a Tier 3 action.
Examples:
- Email sent to wrong recipient or with wrong content
- File deleted or overwritten unexpectedly
- Billing action triggered without clear authorization
- Agent cannot determine whether an action completed or not
Protocol:
1. STOP all further execution immediately
2. Log full context to /logs/critical.md
3. Notify operator via Telegram immediately
4. Do not retry. Do not proceed. Wait for human review.
```
The key distinction: Severity 1 failures are self-managed by the agent. Severity 2 failures stop everything and escalate.
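That dispatch logic takes only a few lines of Python. This is a minimal sketch, not OpenClaw's implementation: the `Failure` dataclass and `handle` function are hypothetical names, and only the `/logs/errors.md` and `/logs/critical.md` paths come from the protocol above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from pathlib import Path

class Severity(Enum):
    RECOVERABLE = 1   # Severity 1: system stable, retry or skip
    CRITICAL = 2      # Severity 2: stop everything and escalate

@dataclass
class Failure:
    task: str
    error: str
    severity: Severity

def handle(failure: Failure, log_dir: Path = Path("/logs")) -> str:
    """Dispatch a failure to the protocol its severity requires."""
    stamp = datetime.now(timezone.utc).isoformat()
    line = f"- {stamp} | {failure.task} | {failure.error}\n"
    if failure.severity is Severity.CRITICAL:
        # Severity 2: log full context, halt, wait for human review
        with (log_dir / "critical.md").open("a") as f:
            f.write(line)
        return "HALT"  # caller must stop execution and notify operator
    # Severity 1: log the failure and let the retry policy take over
    with (log_dir / "errors.md").open("a") as f:
        f.write(line)
    return "RETRY"
```

The return value is deliberately a signal, not an action: the severity classifier decides *what* happens next, and the surrounding loop carries it out.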
The Three-Attempt Rule
For Severity 1 failures, the three-attempt rule prevents infinite retry loops while still allowing for transient errors:
```
Retry Policy
For any Severity 1 failure:
- Attempt 1: Immediate retry
- Attempt 2: Retry after 5-minute wait
- Attempt 3: Retry after 15-minute wait
- After 3 failures: Mark task as FAILED, log to /logs/errors.md,
add to morning brief as item needing attention
- Do not attempt more than 3 times without operator input
```
Three attempts covers the most common transient issues (brief API downtime, temporary file locks, rate limiting) without creating loops that mask a real problem.
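In code, the policy above is a small retry loop with fixed backoff. A sketch under assumptions: `run_with_retries` is a hypothetical helper name, and the 0/300/900-second waits mirror the immediate, 5-minute, and 15-minute attempts.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)

# Waits before each attempt, per the three-attempt rule:
# immediate, then 5 minutes, then 15 minutes.
BACKOFF_SECONDS = [0, 300, 900]

def run_with_retries(task, attempt_task, waits=BACKOFF_SECONDS):
    """Run attempt_task up to three times; then mark the task FAILED."""
    for attempt, wait in enumerate(waits, start=1):
        time.sleep(wait)
        try:
            return attempt_task()
        except Exception as exc:   # Severity 1 failures only
            logging.warning("attempt %d/%d for %s failed: %s",
                            attempt, len(waits), task, exc)
    # After 3 failures: stop retrying, surface in the morning brief instead
    logging.error("task %s marked FAILED after %d attempts", task, len(waits))
    return None
```

Note that a Severity 2 failure should never enter this loop; classify first, retry second.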
Rollback Rules
Some actions need explicit rollback protocols — defined steps the agent takes when something goes wrong mid-execution.
```
Rollback Rules
### File Modifications
If any file write fails mid-operation:
- Do not delete the partial file
- Save partial output to /tmp/rollback/ with timestamp
- Log original file state if available
- Notify operator before attempting any cleanup
### Multi-Step Sequences
If a multi-step sequence fails partway through:
- Log completed steps and failed step to /logs/errors.md
- Do not proceed to later steps
- Do not attempt to reverse completed steps without explicit operator instruction
- Report exact state: what completed, what failed, what was not attempted
### External Actions (Tier 2/3)
If an external action (email, post, API call) returns an error or ambiguous status:
- Treat as Severity 2 immediately
- Log the attempt, the response received, and the ambiguity
- Do not retry external actions automatically
- Require operator confirmation before any retry
```
The external actions rule is critical. When you're not sure whether an email was actually sent, retrying is the worst thing you can do. Log the ambiguity and ask.
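The multi-step rule above reduces to: record the exact state at the point of failure, then stop. A minimal sketch; the `run_sequence` name and the log-entry shape are assumptions for illustration, not part of the config.

```python
def run_sequence(steps, log):
    """Run named steps in order; on first failure, record state and stop.

    steps: list of (name, callable) pairs.
    log:   list that receives one state record on failure.
    """
    completed = []
    remaining = [name for name, _ in steps]
    for name, step in steps:
        remaining.pop(0)
        try:
            step()
        except Exception as exc:
            # Report exact state: what completed, what failed,
            # what was not attempted. Completed steps are NOT
            # reversed without explicit operator instruction.
            log.append({"completed": completed[:],
                        "failed": name,
                        "error": str(exc),
                        "not_attempted": remaining[:]})
            return False
        completed.append(name)
    return True
```

The important design choice is that the function returns rather than raises: the caller gets a clean stop, and the state record carries everything the operator needs for review.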
The Post-Mortem Pattern
For any Severity 2 failure, require a brief post-mortem log entry before resuming normal operations:
```
Post-Mortem Format
After any Severity 2 failure is resolved, log to /logs/post-mortems.md:
Date:
Task that failed:
What happened:
Root cause (if known):
Steps taken to resolve:
Rule or protocol change needed (if any):
```
This creates an audit trail and, over time, a pattern library. If the same failure mode shows up three times in your post-mortem log, it's a signal to update AGENTS.md with a preventive rule.
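Because the format is fixed, appending an entry can be automated. A sketch assuming a hypothetical `log_post_mortem` helper; only the field names and the `/logs/post-mortems.md` path come from the format above.

```python
from datetime import date
from pathlib import Path

POST_MORTEM_TEMPLATE = """\
Date: {date}
Task that failed: {task}
What happened: {what}
Root cause (if known): {cause}
Steps taken to resolve: {steps}
Rule or protocol change needed (if any): {change}

"""

def log_post_mortem(task, what, cause, steps, change,
                    path=Path("/logs/post-mortems.md")):
    """Append one entry; run after every resolved Severity 2 failure."""
    entry = POST_MORTEM_TEMPLATE.format(
        date=date.today().isoformat(), task=task, what=what,
        cause=cause, steps=steps, change=change)
    with path.open("a") as f:
        f.write(entry)
    return entry
```

Since entries accumulate in one file, a quick grep for a repeated "Root cause" line is enough to spot the three-strikes pattern that should trigger an AGENTS.md update.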
What a Production AGENTS.md Failure Section Looks Like
Put it all together and a complete failure protocols section looks like this — roughly 40-60 lines, clear severity definitions, explicit protocols, and rollback rules. The first time it catches something before it becomes a real problem, you'll understand why it's worth writing.
Operators who skip failure protocols are betting that nothing will go wrong. Production systems can't afford that bet.
Ready to Deploy Your Operator?
The Solopreneur Operator Kit includes all 14 files — pre-built and ready to configure in 30 minutes.
Get Your Operator Kit — $49
One-time purchase. 30-day money-back guarantee.