verification field guide · spoke 02 · free tool

How to set a signed AI baseline.

The number the person who owns the P&L signs before the pilot. Part of the Verification Field Guide — the buyer's-side playbook for not getting burned by industrial AI. Free, and free to use on Monday.

Why unsigned baselines make pilots fade instead of fail

Most industrial AI pilots don't end in a decision. They end in a shrug. Six months in, the vendor demos something impressive, everyone nods, the champion moves teams, and the thing quietly stops being talked about. Nobody killed it. Nobody scaled it. It just faded.

That is not bad luck. It is the predictable result of running a test you never set. RAND interviewed 65 practitioners for a report that cites estimates putting the failure rate of AI projects above 80%, and the number-one cause those practitioners named wasn't the models, the data, or the compute. It was "misunderstandings and miscommunications about the intent and purpose of the project." (rand.org, RRA2680-1.) MIT's NANDA project found that roughly 95% of organizations can show no measured P&L impact from their generative-AI spend. (primary PDF.) Read those two findings together and the mechanism is obvious: almost nobody wrote down what "worked" would mean, so almost nobody could tell whether it did.

You cannot pass a test you never set. A pilot without a baseline has no failure state, and a test you can't fail is a test you can't pass either. There is no line the after-number has to clear, so there is nothing to judge it against. The vendor says it's working; you have no independent number that says otherwise; the initiative drifts until attention moves on. The baseline is the line. Without it, "did the pilot work?" has no answer, only opinions, and the loudest opinion in the room usually belongs to the party being paid.

A signed baseline is the cheapest control in the entire AI-buying process. It costs you a few hours and one uncomfortable conversation, and in exchange it converts a fade into a verdict: the pilot either beat the number or it didn't, and you knew the number before you knew the answer.

What a signed baseline actually is

A signed baseline is four things, and it isn't a baseline if any one of them is missing:

The current metric — the one number the AI is supposed to move, defined exactly enough that two people measuring it would get the same figure. Not a target, not an aspiration, not the vendor's benchmark. The current value.
Measured on your data — pulled from your own system of record or sampled from your own operation, over a stated window. Never the vendor's demo numbers, never an industry average, never a figure someone remembers from a meeting.
With a stated method — how the number was produced, and whether it's [counted] (from a system of record) or [estimated] (modeled or sampled). A baseline whose method you can't state is a rumor with a decimal point.
Signed by the P&L owner, and dated — the person who owns the profit-and-loss the metric touches puts their name to it as fair, before any target or threshold is negotiated, with a date on it.

Miss any of the first three and you have a guess. Miss the fourth and you have a number the vendor can renegotiate the moment their results come in short. The rest of this guide is how to produce all four.

The method — measuring the number honestly

Setting a baseline is a discipline, not a spreadsheet. The rules below are the same ones the Problem-Definition Audit uses to produce a sponsor-signed baseline on a live engagement; they work exactly as well on your own, run internally.

1. Name the one metric — exactly.

Write the metric as a sentence so precise that it has a single unambiguous value. "Improve quality" is not a metric. "The false-positive rate on Line 3's automated visual inspection (the parts the current system rejects that a re-inspection confirms were actually good, as a percentage of all parts inspected) over the trailing quarter" is a metric. If you can't write it that tightly, you don't yet understand what the pilot is for, and that is a finding worth having before you spend a dollar.

2. Measure it, best source first.

Three methods, in strict order of trust. Reach for the weaker one only when the stronger one genuinely isn't available:

System-of-record pull (best). The number is already logged somewhere — the QA log, the MES, the CMMS, the ERP, the historian. Pull it. No new instrumentation, least bias. Most "we can't measure this" claims dissolve once someone actually queries the system that's been recording it all along.
Time-and-motion sample (good). For a task nothing logs — minutes per manual review, re-work touches per shift — run a 2–3 week stopwatch sample of representative people doing the real work. Capture the median and the spread, not just an average; the outliers are usually where the money is.
Structured self-report (weakest — label it). Ask people how long something takes or how often it recurs. Use it only to triangulate the other two, never as the sole source behind a dollar claim. Always tag it [estimated].
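For the time-and-motion method, the "median and the spread" can be summarized in a few lines. The sketch below is illustrative only: the stopwatch readings are invented, the function name summarize is hypothetical, and quartiles are just one reasonable way to express spread.

```python
import statistics

# Hypothetical stopwatch sample: minutes per manual review, one entry per
# observed task. These readings are invented for illustration only.
minutes_per_review = [4.0, 5.5, 5.0, 6.0, 4.5, 5.0, 7.0, 5.5, 19.0, 5.0, 6.5, 4.5]


def summarize(sample):
    """Return the size, mean, median, quartiles, and maximum of a sample."""
    p25, _, p75 = statistics.quantiles(sample, n=4, method="inclusive")
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "p25": p25,
        "p75": p75,
        "max": max(sample),
    }


summary = summarize(minutes_per_review)
print(f"n: {summary['n']}")
for name in ("mean", "median", "p25", "p75", "max"):
    # A sample is modeled evidence, so every figure carries the [estimated] tag.
    print(f"{name}: {summary[name]:.2f} min [estimated]")
```

In this invented sample a single 19-minute review pulls the mean well above the median, which is why an average on its own would misstate the typical task and hide the outlier worth investigating.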

3. Tag every figure [counted] or [estimated].

Beside each number, mark whether it came from a system of record ([counted]) or from a model or sample ([estimated]). This one habit does more work than any other: counted numbers carry the headline and the pass/fail line; estimated numbers can support the story but never lead it. It also stops the most common baseline lie — dressing a guess up as a fact by omitting where it came from.
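One way to make the tag impossible to omit is to store it with the number. The following is a minimal sketch, not a prescribed tool: the Figure class and its field names are hypothetical, and the two values are taken from this guide's synthetic worked example.

```python
from dataclasses import dataclass

COUNTED = "counted"      # pulled from a system of record
ESTIMATED = "estimated"  # modeled, sampled, or self-reported


@dataclass(frozen=True)
class Figure:
    name: str
    value: float
    unit: str
    provenance: str  # must be COUNTED or ESTIMATED
    source: str      # where the number came from

    def __post_init__(self):
        # Refuse any figure that arrives without a valid tag.
        if self.provenance not in (COUNTED, ESTIMATED):
            raise ValueError(f"{self.name}: tag it [counted] or [estimated]")

    def can_gate_pass_fail(self) -> bool:
        # Only counted figures may carry the headline and the pass/fail line.
        return self.provenance == COUNTED

    def label(self) -> str:
        return f"{self.name}: {self.value}{self.unit} [{self.provenance}]"


figures = [
    Figure("false-positive rate", 3.35, "%", COUNTED, "QA log"),
    Figure("false-reject cost per quarter", 58000, " USD", ESTIMATED,
           "modeled labor rate"),
]

gating = [f for f in figures if f.can_gate_pass_fail()]
for f in figures:
    role = "gates pass/fail" if f in gating else "supporting only"
    print(f"{f.label()} ({role})")
```

Because the constructor rejects an untagged figure, a guess cannot be dressed up as a fact simply by leaving out where it came from.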

4. State the window and the confounders.

Write down the exact period the baseline covers and anything happening in that period that could move the number for reasons that have nothing to do with AI: a new hire, a product-mix shift, a seasonal spike, a line that was down for a week. You are not trying to eliminate confounders — you're putting them on the record so nobody can later credit AI for something else, or blame it for a bad quarter it didn't cause.

5. Get it signed — before any target is set.

The P&L owner reviews the number and its method and signs it as fair. Crucially, this happens before anyone negotiates what "good" looks like. If you set the target first and the baseline second, you'll unconsciously pick a baseline that makes the target easy. Sign the honest current-state number while nobody yet knows how the pilot will do — that sequencing is the whole point.

The one rule that catches the most trouble: if a metric cannot be baselined cleanly, it does not enter the pilot's success test. A number you can't measure now, you can't prove was moved later — so scoring a pilot against it means the pilot can be declared a success no matter what it actually did. When a metric won't baseline, the honest move is to say so in writing and either instrument it first or pick one you can measure.

A worked example (synthetic)

Everything below is a clean-room example built for instruction — a fictional plant, invented numbers. A real baseline uses only your own records; substitute yours.

The initiative. A vendor has pitched a mid-market food manufacturer an AI vision system to replace the rules-based camera on Line 3's finished-goods inspection. The pitch: "cut false rejects and save the scrap." The COO likes it. Before signing anything, the plant manager sets a baseline on the one number the system is supposed to move.

The metric, named exactly. The false-positive rate on Line 3's automated visual inspection — the percentage of units the current system rejects that a manual re-inspection confirms were good product — measured over the trailing quarter, per unit inspected.

The measurement. The QA log already records every automated reject and every re-inspection disposition (the plant re-inspects all rejects before scrapping, because scrap is expensive). That's a system-of-record pull — [counted]. Over the quarter: 41,900 units inspected, 2,320 rejected by the camera, and 1,404 of those confirmed good on re-inspection. False-positive rate = 1,404 ÷ 41,900 = 3.35% [counted]. The team also estimates the scrap-and-re-inspection labor cost of those false rejects at roughly $58,000 for the quarter — but that figure leans on a modeled labor rate, so it's tagged [estimated] and kept out of the pass/fail line.
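The arithmetic is worth re-running, because the same three counts support two different ratios. The sketch below uses only the synthetic figures above; the variable names are illustrative.

```python
# Counts from the synthetic Q2 example above.
units_inspected = 41_900
camera_rejects = 2_320
confirmed_good_rejects = 1_404

# The baseline as defined: false rejects per unit inspected.
fp_rate_per_unit = confirmed_good_rejects / units_inspected

# A different ratio from the same log: the share of rejects that were false.
share_of_rejects_false = confirmed_good_rejects / camera_rejects

print(f"False rejects per unit inspected: {fp_rate_per_unit:.2%}")    # 3.35%
print(f"Share of rejects that were good:  {share_of_rejects_false:.1%}")  # 60.5%
```

Both numbers are counted and both are true, but they differ by more than an order of magnitude. That is why the metric definition has to name its denominator: here the baseline is the per-unit figure, 3.35%.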

The window and confounders. Window: Q2, the full quarter. Noted confounders: a supplier change in May shifted the incoming-material color slightly, which plausibly inflated false rejects for ~three weeks; and Line 3 ran a new SKU for the last month. Both are on the record so the after-number can be read fairly.

The signature. The plant manager — who owns Line 3's scrap budget and answers for its yield — signs the baseline as fair and dates it, before the vendor is told what threshold the pilot has to beat. Only after that signature does the team set the target: the pilot must get false positives to ≤ 1.5% [counted] on a frozen sample drawn from the same line, with no increase in missed defects — a second baseline captured the same way, so the vendor can't buy a lower false-positive rate by quietly waving bad product through.

Notice what the signed baseline just bought: the pilot now has a line it must clear (3.35% → 1.5%), the after-number is measured the same counted way as the before, the confounders are logged, and the person accountable for the P&L put their name to the starting point before anyone knew the ending point. When the pilot reports back, "did it work?" has a real answer instead of a demo.
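The pass/fail rule in this example can be written down before any results exist. In the sketch below, only the 1.5% threshold and the guardrail rule come from the example. Every count is invented for illustration, including the missed-defect baseline, which the example does not state. The names Rate and verdict are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    events: int  # e.g. false rejects, or missed defects
    units: int   # units inspected in the same sample

    @property
    def value(self) -> float:
        return self.events / self.units


def verdict(after_fp, baseline_missed, after_missed, fp_threshold=0.015):
    """PASS only if the threshold is met AND the guardrail metric held."""
    beat_threshold = after_fp.value <= fp_threshold
    guardrail_held = after_missed.value <= baseline_missed.value
    return "PASS" if beat_threshold and guardrail_held else "FAIL"


# Invented guardrail baseline: 42 missed defects in 41,900 units (~0.10%).
baseline_missed = Rate(events=42, units=41_900)

# Invented result A: 1.2% false positives, missed defects roughly flat.
result_a = verdict(Rate(120, 10_000), baseline_missed, Rate(9, 10_000))

# Invented result B: 0.9% false positives, but missed defects rose to 0.25%.
result_b = verdict(Rate(90, 10_000), baseline_missed, Rate(25, 10_000))

print("Result A:", result_a)
print("Result B:", result_b)
```

Result B posts the lower false-positive rate and still fails, because the guardrail catches a system that improves one number by waving bad product through.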

The template — a baseline worksheet you can fill in today

Copy this, fill in the brackets for your own metric, and route it to the P&L owner for signature before the pilot starts. One worksheet per metric. If a pilot is supposed to move two numbers, do two.

SIGNED AI BASELINE WORKSHEET
Initiative: [name of the proposed AI pilot]
Prepared by: [your name / role]          Date prepared: [YYYY-MM-DD]

1. THE METRIC (exact definition)
   The one number this pilot is supposed to move, written so
   two people measuring it would get the same figure:
   [e.g. "false-positive rate on Line 3 automated inspection —
    % of units the camera rejects that re-inspection confirms
    were good — per unit inspected"]

   Direction that counts as improvement:  [lower / higher]

2. CURRENT VALUE (the baseline)
   Value:            [e.g. 3.35%]
   Counted or estimated?   [ [counted] / [estimated] ]
   (If [estimated], it may report but must NOT gate pass/fail.)

3. MEASUREMENT METHOD
   How this value was produced:
   [ ] System-of-record pull   [ ] Time-and-motion sample   [ ] Self-report (triangulate only)
   Exact calculation:  [numerator ÷ denominator, e.g.
                        1,404 confirmed-good rejects ÷ 41,900 units]
   Median / spread (if sampled):  [median ____ , p25 ____ , p75 ____ ]

4. DATA SOURCE
   Where the numbers came from:  [system name + report/query,
                                  e.g. "QA log, reject-disposition export"]
   Who can re-run this pull:      [name / role]

5. SAMPLE WINDOW & CONFOUNDERS
   Window measured:   [start date] to [end date]
   Known confounders in this window (things that could move the
   number for non-AI reasons — record them, don't remove them):
   - [e.g. supplier color change in May, ~3 weeks]
   - [e.g. new SKU on the line for the last month]

6. THE TARGET (set AFTER this baseline is signed)
   Threshold the pilot must beat:  [e.g. "≤ 1.5% [counted]"]
   Guardrail metric (so the fix can't create a worse problem):
   [e.g. "no increase in missed-defect rate, baselined the same way"]

7. SPONSOR SIGN-OFF  (the control — see below)
   "This baseline is a fair statement of current performance."
   P&L owner (owns the profit/loss this metric touches):
   Name / role:  [__________________________]
   Signature:    [__________________________]
   Date signed:  [YYYY-MM-DD]   ← must predate the pilot start
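If the worksheet lives in a tracker rather than on paper, its sequencing rules can be checked mechanically. This is a rough sketch under stated assumptions: the class, field, and function names are hypothetical, the 2025 dates are invented, and a same-day signature and target are allowed because calendar dates cannot order two events on the same day.

```python
from dataclasses import dataclass, field, replace
from datetime import date
from typing import List, Optional


@dataclass
class BaselineWorksheet:
    metric_definition: str
    value: float
    provenance: str                      # "counted" or "estimated"
    method: str                          # how the value was produced
    window_start: date
    window_end: date
    confounders: List[str] = field(default_factory=list)
    signer_role: str = ""
    signed_on: Optional[date] = None
    target_set_on: Optional[date] = None
    pilot_start: Optional[date] = None


def problems(ws: BaselineWorksheet) -> List[str]:
    """Return every reason this worksheet is not yet a usable signed baseline."""
    issues = []
    if not ws.metric_definition.strip():
        issues.append("metric is not defined")
    if not ws.method.strip():
        issues.append("measurement method is not stated")
    if ws.provenance not in ("counted", "estimated"):
        issues.append("value is not tagged [counted] or [estimated]")
    elif ws.provenance == "estimated":
        issues.append("[estimated] value may report but must not gate pass/fail")
    if ws.window_end < ws.window_start:
        issues.append("sample window ends before it starts")
    if ws.signed_on is None or not ws.signer_role.strip():
        issues.append("no dated P&L-owner signature")
        return issues
    # Assumption: only a target dated strictly before the signature is rejected.
    if ws.target_set_on is not None and ws.target_set_on < ws.signed_on:
        issues.append("target was set before the baseline was signed")
    if ws.pilot_start is not None and ws.pilot_start <= ws.signed_on:
        issues.append("signature does not predate the pilot start")
    return issues


worksheet = BaselineWorksheet(
    metric_definition="false-positive rate on Line 3, per unit inspected",
    value=3.35,
    provenance="counted",
    method="system-of-record pull: 1,404 / 41,900",
    window_start=date(2025, 4, 1),       # invented year for the Q2 window
    window_end=date(2025, 6, 30),
    confounders=["supplier color change in May", "new SKU in the last month"],
    signer_role="plant manager",
    signed_on=date(2025, 7, 8),          # invented dates from here down
    target_set_on=date(2025, 7, 10),
    pilot_start=date(2025, 8, 1),
)
print(problems(worksheet) or "ready for the pilot")

# The same worksheet signed two weeks after the pilot began.
signed_late = replace(worksheet, signed_on=date(2025, 8, 15))
print(problems(signed_late))
```

The second case reports two problems at once, a target that predates the signature and a signature that postdates the pilot start. That is the pattern a retroactively negotiated baseline leaves behind.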

The discipline: it's the signature, not the number, that's the control

Here is the part most teams get wrong. They treat the baseline as a measurement exercise — get the number right, move on. But a number sitting in a spreadsheet controls nothing. Anyone can revise it, forget it, or explain it away once the pilot's results are in. The control isn't the number. It's the signature.

The signature does three things a bare number cannot:

It fixes the number in time. A dated signature is a record that this was the agreed starting point before anyone knew the ending point. That sequence — baseline first, result second — is what makes the after-number unarguable. Without it, the baseline is always retroactively negotiable, and it will get negotiated the moment it's inconvenient.
It puts the number on the right desk. The signature has to come from the person who owns the P&L the metric touches: not the pilot's champion, and never the vendor. The champion wants the pilot to succeed and will drift toward a flattering baseline; the vendor wants an easy bar and will drift toward whichever baseline sets it lowest. Only the P&L owner has a stake in the number being true, because they're the one who inherits whatever it proves. Whose signature you require is itself a control: it forces the number onto the one desk with a reason to keep it honest.
It converts a private assumption into a shared commitment. Before signature, the baseline is one person's opinion. After, it's an agreement on the record that everyone — you, the sponsor, the vendor — is measured against. Nobody gets to quietly swap the yardstick later.

This is the same discipline audited industries have used for a century: no structure gets built from a drawing nobody checked and signed. A signature is a person staking their name on "this is right." Apply that to the one number your AI spend is supposed to move, and you've installed the cheapest, highest-leverage control in the whole decision. The Industrial AI Scoreboard scores whether your organization does this by habit; this worksheet is how you start.

FAQ

What is an AI baseline metric?

The current, measured value of the one number the AI is supposed to move — measured on your own data, with a stated method, before the pilot starts. Not a target, not the vendor's benchmark, not an industry average. The false-positive rate on your inspection line last quarter is a baseline; "we want to cut false positives" is a wish.

How do you baseline an AI project?

Name the one metric the pilot is supposed to move, defined exactly. Pull its current value from a system of record where you can; take a time-and-motion sample where you can't; use self-report only to triangulate. Tag every figure [counted] or [estimated]. Note the sample window and known confounders. Then have the person who owns the P&L it touches sign it as fair, and date it — before any target or threshold is negotiated.

Why does the baseline have to be signed before the pilot?

Because a baseline set after you've seen the pilot's results isn't a baseline — it's a number chosen to make the result look good. Signing it first, before anyone knows how the pilot performs, is what makes the after-number unarguable. The signature converts a private assumption into a shared commitment nobody can quietly revise once the numbers are in.

Who should sign the AI baseline?

The person who owns the P&L the metric touches (the plant manager, the ops VP, the CFO for their line), not the person running the pilot and not the vendor. The pilot's owner is incentivized to pick a flattering baseline; the vendor is incentivized to pick one that makes the bar easy to clear. Only the P&L owner has a reason to want the number to be true, because they are the one who has to live with what it proves.

What if the metric can't be measured cleanly?

That is itself the finding. A metric you cannot baseline cleanly should not enter the pilot's success test — you would be judging the after against a guess. If the number lives nowhere in a system of record and can't be sampled, the honest move is to say so in writing and either instrument it first or pick a metric you can actually measure. A pilot scored against an un-measurable baseline can be declared a success no matter what it does.

Where this fits

This is spoke 02 of the Verification Field Guide, Tektari's buyer's-side playbook for industrial AI. A signed baseline is one control; it works best alongside the others — a written acceptance test your team can run without the vendor, and the whole-organization read the Industrial AI Scoreboard gives you in twenty questions. If you want the baseline set for you on one real initiative — sponsor-signed, on your data, with a written acceptance test attached — that's the Problem-Definition Audit: three weeks, $12,500 fixed, and $2,500 of the fee is only invoiced when the scoped pilot passes the test your team runs. We sell no implementation, so the number we hand you isn't a pitch for the next thing.

See where you rank — take the Scoreboard

twenty questions · free · no sales call

See which band you're in.

The Field Guide is free and stays free. If you want your own organization scored against the four disciplines that decide whether industrial AI survives — problem definition, data plumbing, vendor exposure, workforce readiness — start with the Scoreboard.

