Strategy · By Chelsea Hulin · June 13, 2026 · 7 min read

The Handoff Is the Hard Part, Not the Prompt

The model usually works. The places where humans hand work to it, and it hands work back, are where your AI workflow actually breaks.

A client called me last month furious that “the AI is making things up.” We dug in. The model was fine. What was broken was the thing nobody photographs for the demo reel: the moment a human dumps a half-finished thought into the system, and the moment the system hands its answer back to a person who has no idea what to do with it.

That is the handoff. And after building AI workflows for clinics, contractors, and professional-services shops for two years, I’ll say it plainly: most failed AI workflows are not a model problem. They are a handoff problem. Fix the handoffs and you fix roughly four out of five “bad output” complaints without touching the prompt.

TL;DR

The model is rarely the failure point. The two handoffs around it are: messy human input going in, and AI output landing on a human who can’t act on it.
I’ve seen structured intake alone fix about 80% of “the AI is wrong” complaints, because most of those were “the AI got garbage and guessed.”
A handoff breaks in one of three places: the intake (input), the seam (format mismatch between steps), or the landing (output a human can’t use).
The cheapest fix is almost never a better model. It’s a required-fields form and a defined output shape.
Action this week: pick one live workflow, time one full run, and write down exactly what crosses each handoff line.

The model is the part that works

Here’s the reframe that changed how I build. When an operator says “the AI keeps getting it wrong,” I no longer open the prompt first. I trace the data.

Nine times out of ten the prompt is reasonable. What’s feeding it is a free-text field where a front-desk person typed “new pt, knee thing, prob needs referral” and expected a clean clinical summary on the other end. The model did its job. It took vague input and produced confident output, because that’s what these systems do. Garbage in does not produce an error. It produces garbage out, stated beautifully.

The model didn’t hallucinate. It answered the question it was actually asked, which was a worse question than anyone intended.

A workflow is a relay race, not a sprint. The model runs one leg. The dropped baton is almost always at an exchange, not mid-stride.

Where the handoff actually breaks

There are three exchange points, and each fails in its own way.

Where workflows actually break

1. Human input: free-text, assumptions, missing fields
2. Intake handoff: Break 1, garbage in
3. AI step: usually fine
4. Output handoff: Break 2, no defined shape
5. Next human: Break 3, can’t act on it

Break 1: the intake. A person gives the AI less than they think they’re giving it. They carry context in their head (which patient, which job site, last week’s call) and never type it. The model fills the gaps with plausible guesses. This is the single biggest source of “wrong” output I see.

Break 2: the seam between steps. One AI step hands to the next, or to an automation, in a format the next step didn’t expect. The summary step writes a paragraph; the CRM step needed five named fields. The data is technically correct and operationally useless.

Break 3: the landing. The output is genuinely good, and the human receiving it still can’t move. It’s a wall of prose when they needed a yes/no and a next action. So they ignore it, redo the work by hand, and conclude the AI “doesn’t really save time.”
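Of the three, Break 2 is the easiest to catch mechanically. Here is a minimal sketch, assuming the summary step is asked to return named fields rather than a paragraph. The five field names and the `check_seam` helper are hypothetical, chosen for illustration and not taken from any particular CRM:

```python
# Hypothetical contract: the named fields the downstream (CRM) step expects.
CRM_FIELDS = {
    "contact_name": str,
    "company": str,
    "summary": str,
    "next_step": str,
    "follow_up_date": str,
}


def check_seam(payload, contract=CRM_FIELDS):
    """Return a list of problems; an empty list means the payload may cross the seam."""
    if not isinstance(payload, dict):
        # A paragraph of prose lands here: technically correct, operationally useless.
        return [f"expected named fields, got {type(payload).__name__}"]
    problems = []
    for name, expected_type in contract.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            problems.append(f"wrong type for {name}: {type(payload[name]).__name__}")
    return problems
```

Run between steps, a check like this turns a silent format mismatch into a visible stop: a prose paragraph comes back with one problem, and a payload missing `follow_up_date` comes back with that field named.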

Structuring the intake fixes most of it

Here’s the lived case. A three-location dental group came to me convinced their insurance-verification assistant was unreliable. Their words: “It’s wrong maybe a third of the time.” That’s a damning number for anything touching billing.

I didn’t change the model and I didn’t rewrite the prompt. I changed the front door. The intake had been a single free-text box. We replaced it with a short required-field form: patient name, DOB, carrier, member ID, procedure code. Five fields. If a field was blank, the workflow stopped and asked, instead of proceeding and guessing.
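The logic of that front door is small enough to sketch. This is an illustration of the idea, not the dental group's actual form: the key names mirror the five fields above, and `validate_intake` and `IntakeBlocked` are hypothetical names.

```python
# Hypothetical keys for the five required fields described above.
REQUIRED_FIELDS = ("patient_name", "dob", "carrier", "member_id", "procedure_code")


class IntakeBlocked(Exception):
    """Raised when the intake is missing fields the AI step cannot work without."""

    def __init__(self, missing):
        self.missing = list(missing)
        super().__init__("Missing required fields: " + ", ".join(self.missing))


def validate_intake(form: dict) -> dict:
    """Return a cleaned copy of the form, or raise IntakeBlocked listing the blanks."""
    # Treat absent keys, None, and whitespace-only values as blank.
    cleaned = {name: str(form.get(name) or "").strip() for name in REQUIRED_FIELDS}
    missing = [name for name, value in cleaned.items() if not value]
    if missing:
        # Stop and ask instead of letting the model guess.
        raise IntakeBlocked(missing)
    return cleaned
```

The design choice that matters is the exception: the workflow cannot continue past a blank, so the question goes back to the person who has the answer.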

Comparison: same model, same prompt, different front door

Before: a free-text box (‘verify Johnson, cleaning, BCBS I think’). The AI guesses the rest. About a third of results flagged wrong.

After: five required fields, and blanks block submission. The AI works from clean inputs. The flagged-wrong rate dropped to under 6% over the next month.

The accuracy problem was never the model. It was that a human, in a hurry, was handing it incomplete information, and the system was too polite to say “I’m missing the member ID.” Once the intake refused to pass garbage across the line, the so-called accuracy problem mostly evaporated.

~80% of “bad AI output” I trace back to the intake handoff, not the model

That 80% is my own number from my own builds, not a study. Your mileage will vary by how messy your inputs are today. But the direction is consistent: the worse your intake discipline, the more the model gets blamed for problems it didn’t create.

What didn’t work

I’ll be honest about the version of this I got wrong first. My early instinct was to over-engineer the intake: twelve fields, conditional logic, validation on everything. The front-desk team hated it, filled it in carelessly to get past it, and we were right back to garbage in, now with extra steps.

The fix that stuck was the minimum set of fields the AI genuinely cannot do its job without, and nothing more. Make the form longer than the task needs and people defeat it. Required intake is a discipline, not a fortress.

The output handoff also resists a pure-prose fix. Telling the model “be concise” doesn’t reliably make output actionable. Defining the exact output shape does: a one-line verdict, the three data points behind it, and the recommended next action, in that order, every time. When the landing format is fixed, the receiving human acts on muscle memory instead of re-reading.
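One way to hold that shape steady is to make it a type the workflow has to fill in, not a style the prompt requests. A minimal sketch follows. The `Landing` class, its field names, and the `VERDICT:` / `NEXT:` labels are assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Landing:
    verdict: str       # one line, ideally a yes/no
    evidence: tuple    # exactly three data points behind the verdict
    next_action: str   # what the reader should do next

    def __post_init__(self):
        # Refuse output that does not match the agreed shape.
        if "\n" in self.verdict.strip():
            raise ValueError("verdict must fit on one line")
        if len(self.evidence) != 3:
            raise ValueError("evidence must contain exactly three data points")
        if not self.next_action.strip():
            raise ValueError("next_action is required")

    def render(self) -> str:
        # Same order every time: verdict, evidence, next action.
        lines = [f"VERDICT: {self.verdict.strip()}"]
        lines += [f"- {point}" for point in self.evidence]
        lines.append(f"NEXT: {self.next_action.strip()}")
        return "\n".join(lines)
```

If the model's answer cannot be parsed into these three parts, that failure is caught at the handoff and never reaches the reader as a wall of prose.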

The map is the work

This isn’t a prompt-engineering problem, and that’s good news, because prompt-tweaking is a casino. Defining handoffs is engineering you can actually finish. Andrew Ng has made a version of this point about agentic workflows: reliability comes from the structure around the model, not from squeezing a better answer out of a single call (see his DeepLearning.AI “The Batch” writing on agentic design). The seams are the system.

So before you swap models, raise your spend, or rewrite a prompt for the fifth time, map the relay. Find the exchange points. Decide what’s allowed to cross each line.

Do this with one workflow this week

Pick one live AI workflow that’s been disappointing you
Time one full run, start to finish, and note where a human touches it
At the intake: list the fields the AI truly needs, then make them required and block on blanks
At the seam: write the exact output shape each step must produce for the next step
At the landing: rewrite the final output as verdict, then evidence, then next action
Re-run it with clean inputs before you blame the model
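For the parts of a run that pass through code, the “write down what crosses each line” step can be automated. This is a rough sketch under that assumption: `HandoffTrace` is a hypothetical helper, and any handoff that happens in a person's head still has to be noted by hand.

```python
import time


class HandoffTrace:
    """Records what crosses each handoff line during one run, with elapsed time."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the trace can be tested without waiting.
        self._clock = clock
        self._start = clock()
        self.events = []

    def record(self, handoff: str, payload) -> None:
        """Store the handoff name, seconds since the run started, and the payload."""
        self.events.append({
            "handoff": handoff,
            "elapsed_s": round(self._clock() - self._start, 3),
            "payload": payload,
        })

    def report(self) -> str:
        """One line per handoff, in the order they happened."""
        return "\n".join(
            f"{e['elapsed_s']:>8.3f}s  {e['handoff']:<8}  {e['payload']!r}"
            for e in self.events
        )
```

Call `record("intake", ...)`, `record("seam", ...)`, and `record("landing", ...)` at the three exchange points, then read the report next to the complaint. Thin payloads at the intake usually show up at a glance.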

I could be wrong about the exact 80%. It might be 70% for your shop, or 90%. What I’m confident about is the order of operations: fix the handoffs first, judge the model second. In two years I have not once found the reverse to be the cheaper path.
