What Broke When I Let My AI Co-Founder Run Without Me
The quiet failures of an autonomous agent, and the guardrails, checkpoints, and kill switch I added so autonomy stopped being a liability.
My AI co-founder told a prospect we offered a service we don’t offer. Confidently. In writing. I found out because the prospect replied asking about the pricing for a thing that does not exist.
That was the morning I stopped treating autonomy as a feature and started treating it as a liability I had to manage.
I’ve written before about building an AI co-founder that runs research, ops, and outreach while I sleep. That part is true and it still works. What I left out is the part where I let it run too far ahead of me, and the specific, unglamorous ways it broke. This is that issue. Not the demo reel. The maintenance log.
TL;DR
- The dangerous AI failures are quiet, not loud: confident wrong answers, not crashes.
- My CRM got three hallucinated contact records before I noticed, because nothing errored.
- A working automation drifted over six weeks until it was doing the wrong thing reliably.
- The fix was not better prompts. It was guardrails, human checkpoints, and a kill switch.
- Autonomy is safe in proportion to how fast you can catch it being wrong.
The failures that don’t announce themselves
When software breaks the normal way, you get a red error, a stack trace, a 500. You know. An AI agent operating autonomously breaks differently. It keeps going. It produces something that looks right, sounds right, and is wrong.
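Three things broke for me, and not one of them threw an error: the agent promised a prospect a service we don’t offer, it wrote three hallucinated contact records into my CRM, and a routing automation drifted for six weeks.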
The CRM one stung the most. I trust my pipeline data to make decisions. For about ten days, I was making decisions on records a machine invented, and the records looked exactly like the real ones because the machine is good at looking right. As a nurse, I’ve watched a normal-looking vitals chart hide a patient who’s actually crashing. Same shape of problem. The reassuring readout is the dangerous one.
Why drift is worse than a crash
A crash is a gift. It’s loud, it’s dated, it stops the line. You fix it and move on.
Drift is the opposite. It’s a working system slowly optimizing toward the wrong outcome while every individual day looks fine. My routing automation didn’t fail on a Tuesday. It got 2% more wrong each week for six weeks, and because I was watching outputs instead of behavior, I only caught it when a real prospect landed in the “cold, ignore” bucket.
- Week 0: Shipped and correct. Routing matched my rules exactly.
- Week 2: First silent miss. Edge-case lead mis-tagged, unnoticed.
- Week 4: Pattern compounds. Agent generalizes the wrong rule.
- Week 6: Real cost lands. Warm prospect routed to cold bucket.
The lesson isn’t that agents are unreliable. It’s that I was measuring the wrong thing. I checked whether the automation ran. I didn’t check whether it ran correctly. Those are different audits and I’d collapsed them into one.
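The gap between "it ran" and "it ran correctly" can be made concrete. This is a minimal sketch, not the setup described here: it assumes the routing rules can be written down as a plain function and that each agent decision is recorded alongside the lead it applied to. The `Lead` fields, the 14-day threshold, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    name: str
    replied: bool            # hypothetical signal: the prospect has replied
    days_since_contact: int  # hypothetical signal: recency of last touch


def expected_bucket(lead: Lead) -> str:
    """Hand-written routing rule (illustrative only)."""
    if lead.replied or lead.days_since_contact <= 14:
        return "warm"
    return "cold"


def audit_routing(decisions: list[tuple[Lead, str]]) -> list[tuple[Lead, str, str]]:
    """Return (lead, agent_bucket, expected_bucket) for every disagreement."""
    mismatches = []
    for lead, agent_bucket in decisions:
        want = expected_bucket(lead)
        if agent_bucket != want:
            mismatches.append((lead, agent_bucket, want))
    return mismatches


def mismatch_rate(decisions: list[tuple[Lead, str]]) -> float:
    """Share of decisions that disagree with the written rule."""
    if not decisions:
        return 0.0
    return len(audit_routing(decisions)) / len(decisions)
```

A liveness check only asks whether `decisions` is non-empty. This check asks whether `mismatch_rate` is creeping upward from one week to the next, which is the signal drift actually leaves behind.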
What I actually added
I didn’t fix this with a cleverer prompt. Better prompts reduce how often it’s wrong. They do nothing about the times it’s wrong anyway and nobody’s looking. The fix was structural: assume it will be wrong, and make wrong cheap to catch and cheap to undo.
Three changes did most of the work.
Confidence gates on writes, not reads. The agent can read anything. But any action that writes to the CRM, sends an external message, or spends money now routes to a daily approval queue. Reading is free. Writing to the real world is gated. That single distinction killed the hallucinated-contact problem.
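The read/write split can be sketched as a small gate that every action passes through. This is an illustration under assumptions, not the actual implementation: the action kinds, the `Gate` class, and the `approve`/`execute` callables are hypothetical names, and a real queue would persist to disk or a database rather than live in memory.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical action kinds that touch the real world and so need approval.
WRITE_KINDS = {"crm_write", "send_message", "spend_money"}


@dataclass
class Action:
    kind: str
    payload: dict[str, Any]


@dataclass
class Gate:
    pending: list[Action] = field(default_factory=list)

    def submit(self, action: Action, execute: Callable[[Action], Any]) -> Any:
        """Run reads immediately; park writes in the approval queue."""
        if action.kind in WRITE_KINDS:
            self.pending.append(action)
            return None  # nothing has happened in the real world yet
        return execute(action)

    def review(self, approve: Callable[[Action], bool],
               execute: Callable[[Action], Any]) -> list[Any]:
        """Daily review: execute approved actions, drop the rest."""
        results = [execute(a) for a in self.pending if approve(a)]
        self.pending.clear()
        return results
```

The design choice that matters is that `submit` is the only path to `execute`. If the agent can reach the CRM client directly, the queue is decoration.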
A behavior audit, weekly, separate from output. Every Friday I read a sample of what the agent decided and why, not just what it produced. Ten minutes. This is the check that catches drift, because drift is invisible in any single output and obvious across ten of them.
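Pulling the Friday sample can be as simple as the sketch below. It assumes the agent already writes a decision log where each entry records when it acted, what it did, and the reason it gave; the dictionary keys `when`, `action`, and `reason` are hypothetical, as are the function names.

```python
import random


def sample_for_audit(decisions: list[dict], k: int = 10, seed=None) -> list[dict]:
    """Pick up to k decisions at random so the audit isn't biased
    toward the most recent or most visible ones."""
    if len(decisions) <= k:
        return list(decisions)
    rng = random.Random(seed)  # pass a seed to make the sample reproducible
    return rng.sample(decisions, k)


def format_for_review(decision: dict) -> str:
    """One line per decision, with the stated reasoning next to the action."""
    return f"{decision['when']} | {decision['action']} | why: {decision['reason']}"
```

Random sampling is the point. Reading only the decisions that surfaced on their own means reading the ones that already looked wrong.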
A real kill switch. One command halts every agent and every scheduled run. Before, “stop everything” meant me frantically disabling workflows one at a time while it kept acting. Now it’s one line. I’ve used it twice. Both times I was glad it existed.
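One common way to get a single-command halt is a flag file that every agent and scheduled run checks before acting. The sketch below assumes that approach; it is not necessarily how the switch described above is built, and `KillSwitch`, `AgentHalted`, and `run_step` are hypothetical names. It is also cooperative: it only stops code that actually calls `check()`.

```python
from pathlib import Path


class AgentHalted(RuntimeError):
    """Raised when a step is attempted while the kill switch is engaged."""


class KillSwitch:
    def __init__(self, flag_path: str):
        self.flag = Path(flag_path)

    def engage(self) -> None:
        """The 'one line': create the flag file."""
        self.flag.touch()

    def release(self) -> None:
        self.flag.unlink(missing_ok=True)

    def engaged(self) -> bool:
        return self.flag.exists()

    def check(self) -> None:
        if self.engaged():
            raise AgentHalted(f"kill switch engaged: {self.flag}")


def run_step(switch: KillSwitch, step):
    """Wrap every agent action and scheduled job so it checks the switch first."""
    switch.check()
    return step()
```

Because the state lives in a file, a step that is already mid-flight will finish, but nothing new starts until `release()` is called.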
Running the agent before and after guardrails
- Before: Agent writes directly to CRM, sends outreach, and adapts workflows unsupervised. Failures are silent and found days later by accident.
- After: Writes and sends route through a daily approval queue. Weekly behavior audit catches drift. One command halts everything. Failures are caught in hours, not days.
What I’d tell you I’m still unsure about
I might be over-gating. The approval queue adds friction, and friction is the thing that made autonomy worth building in the first place. There’s a real chance I’ve traded too much speed for safety, and that a more confident operator would gate less. I don’t know where the line is yet. I’m watching whether the queue becomes a rubber stamp I click without reading, because a checkpoint you don’t actually check is theater, not a control.
What I am sure of: the goal was never a fully autonomous co-founder. It was a co-founder I could trust to act without me watching every move, which is a different thing. Trust isn’t the absence of oversight. It’s oversight that’s fast and cheap enough that I’m willing to let it run.
Do this before you let an agent run unwatched
Guardrails to add this week
- Separate read permissions from write permissions: gate every write to the real world
- Route high-stakes actions (CRM, money, external messages) to an approval queue
- Schedule a 10-minute weekly behavior audit, not just an output check
- Build one kill switch that halts every agent and scheduled run at once
- Log every agent action so any mistake is reversible, not forensic
If you only do one, do the kill switch. Everything else lets you catch failures faster. The kill switch lets you stop them. You want both, but you want that one first.