How to write an AI acceptance test.
The acceptance test is the one page that decides whether "delivered" means what you think it means — or what the vendor needs it to mean to get paid. This is the buyer's-side version: a threshold table you fill in, a measurement procedure your own team runs on your own data, a sampling rule the vendor can't cherry-pick, a binary pass/fail, and the contract clause that makes it stick. Two fill-in-the-blank templates below — one for a back-office workflow, one for predictive maintenance.
the test nobody wrote is the failure
More than 80% of AI projects fail — roughly twice the rate of IT projects that don't involve AI — and the number-one documented cause isn't the model. It's misunderstanding what the project was for: RAND interviewed 65 practitioners and put the top failure at "misunderstandings and miscommunications about the intent and purpose of the project." (rand.org, RRA2680-1.) The dominant outcome, when a project does ship, is a pilot with no number to show for it: MIT's NANDA project found roughly 95% of organizations show no measured P&L impact from their generative-AI spend. (primary PDF — preliminary and contested, so read it as "almost nobody can show the money," not "95% fail.")
"No measured impact" is not a technology problem. It's a measurement problem — and it starts before the pilot, at the acceptance test that nobody wrote. Here is the mechanism, in three moves:
- 1 The vendor defines "delivered." The word "accepted" appears in the SOW with no test behind it, so it means whatever the vendor's status update says it means. The demo looked great; the invoice is due; nobody wrote down the line the system actually had to cross.
- 2 The vendor grades their own homework. Performance gets "measured" on data the vendor prepared, on items they saw during setup, scored by the people who built it. "95% accurate" is true — of that data, that day.
- 3 The vendor gets paid either way. Because "delivered" was never a number your team could compute, there is no clean fail state. The pilot doesn't get killed; it fades. You pay, you learn little, and the failure never shows up as a failure — it shows up in the 95%.
The acceptance test is the counter-move to all three. It takes the pen back: you define "delivered," as a number; your team measures it, on your data; and the pass/fail rule is arithmetic, not opinion. You do not need to understand the model to write it. You need to know your own workflow, your own baseline, and what "worth paying for" is in hours or dollars. This guide is the how.
This is spoke 03 of the Verification Field Guide — the buyer's-side playbook for not getting burned by industrial AI. Sibling tools: the free Industrial AI Scoreboard tells you which discipline you're weakest in; the Problem-Definition Audit is where Tektari writes this exact test for you, with part of our own fee riding on it.
the anatomy of an acceptance test
Five parts. Each one closes a specific way the test can be gamed. Skip any of them and the vendor writes the missing part for you — in their favour, when you're not looking.
What "good enough" is, as numbers.
Each metric gets a worst tolerable and a target, both set against a baseline the sponsor has signed as fair. The worst-tolerable column is the line below which you walk; the target is what you're actually paying for. Only [counted] metrics gate the decision — an [estimated] number can be reported, never used to pass or fail.
Measured on YOUR data, by YOUR team.
Before the pilot runs, your own people adjudicate the correct answer for every item in the sample — the golden set — by someone who isn't the pilot's operator. The system runs in its normal configuration, no hand-tuning per item. The test is executable from the document alone, with the vendor not in the room.
So it can't be cherry-picked.
Items are drawn from your live system of record — never demo data, never anything the vendor saw during setup or tuning. A minimum size, random within the window, stratified to match your real mix, drawn and frozen before the system touches it. The sample decides the score; the vendor doesn't get to choose it after seeing results.
Binary. No "substantially met."
Pass = every primary threshold met on the sample. Fail = any single primary threshold missed. No weighting, no averaging — averaging is exactly how a system launders one critical miss behind strong numbers elsewhere. And the dispute path: re-adjudicate against the frozen golden set; ties go to the buyer.
Every test needs a written answer to "what if we disagree on the result," or the whole thing collapses into an argument the day the numbers land. Keep it mechanical: a scoring dispute (did we score these items right?) is re-adjudicated against the frozen golden set by your named adjudicator, whose call stands — the golden set was frozen precisely so this is arithmetic, not debate. A procedure dispute (was the sample or run done wrong?) gets exactly one re-run on a fresh sample, same thresholds; that result is final. Unresolved after one re-run: the vendor doesn't get the pass. Ties go to the buyer — an acceptance test is your instrument, not a collections mechanism for the vendor.
template · class 1 — a document / back-office workflow
Use this for document triage and routing, quote or order intake and entry, reporting rollups, invoice/PO matching — text-and-records workflows, the class the failure literature shows carries the best measured ROI. Fill in every [bracketed blank]. Copy it into your SOW as an exhibit.
Threshold table
ACCEPTANCE TEST — CLASS 1: [workflow name, e.g. supplier-invoice matching]
Workflow (exact): [one sentence: receipt of X -> completed output Y]
Baseline signed by: [sponsor name] on [date] ("this baseline is fair")
RULES (frozen before pilot starts):
- Every gating threshold gates on a [counted] metric from a system of
record or a countable test sample. An [estimated] metric may appear as
a secondary, reported line — it can never gate the decision.
- Every target must beat the baseline by a margin the sponsor agrees is
worth the pilot's cost.
- Thresholds are frozen in writing before the pilot processes any item.
# | Metric | Definition (exact) | Baseline [counted] | Worst tolerable | Target | Gating?
----+-----------------------+---------------------------------------------+---------------------+---------------------+---------------------+---------
T1 | Correct-handling rate | % of sample items handled correctly vs. the | [___]% (human, from | >= [___]% | >= [___]% | PRIMARY
| | adjudicated golden-set answer | [QA log/source]) | | | gates
T2 | Exception rate | % of items the system flags for a human, | [___]% or n/a | <= [___]% AND 100% | <= [___]% | PRIMARY
| | AND % of flagged items that reach a queue | | of flags reach a | | gates
| | | | human queue | |
T3 | Cycle time per item | Median elapsed time, receipt -> completed | [___] min | <= [___] min median | <= [___] min median | PRIMARY
| | output, on the measurement sample | | | | gates
T4 | Critical-error rate | % of sample with a defined CRITICAL error | [___]% | <= baseline, hard | <= [___]% | PRIMARY
| | (wrong customer / wrong amount / wrong | | cap [___]% | | gates
| | routing class — enumerate: [___]) | | | |
T5 | Hours-saved projection| Modeled from T3 x volume, with attribution | [estimated] | reported | reported | secondary
| | and adoption discounts applied | | | | never gates
PASS = every PRIMARY threshold meets at least its "worst tolerable" value on
the measurement sample.
FAIL = any single PRIMARY threshold misses its "worst tolerable" value.
Binary. No weighting, no averaging, no "substantially met."How to read the two columns. Worst tolerable is the line the system must clear to be accepted at all — below it, you walk, and no other number saves the pilot. Target is what "worth the money" looks like and what you negotiate the price and the renewal against. Keep them separate on purpose: a vendor who quietly aims for "worst tolerable" and calls it done is telling you something, and the two-column table makes that visible instead of letting a single "≥ 95%" hide the difference.
Measurement-on-your-data procedure
PROCEDURE — CLASS 1 1. WHO RUNS IT The team named in the pilot definition: [names/roles]. The vendor is not present and touches no data during the test. The test must be executable from this document alone. 2. GOLDEN SET (built before the pilot runs) [Named adjudicator], who is NOT the pilot's operator, records the correct answer for every item in the measurement sample — what the correct triage / entry / rollup IS. The golden set is frozen and dated: [date]. 3. RUN The pilot system processes the frozen sample in its normal operating configuration. No hand-tuning per item. No cherry-picked reprocessing. 4. SCORE Each item scored against the golden set per the T1–T4 definitions. Recorded one row per item on the attached scoring sheet — the raw sheet, not a summary, is the record. 5. REPORT Results computed per threshold, dated, sent to both parties. The pass/fail rule is applied mechanically to the numbers.
Sampling rule
SAMPLING RULE — CLASS 1 (frozen before the pilot processes any item)
SOURCE Items drawn from our live system of record for [workflow].
NEVER vendor-supplied demo data. NEVER items the vendor saw
during setup or tuning.
SIZE Minimum N = 100 items OR four consecutive weeks of actual
volume, whichever is larger.
(If four weeks of volume is under 100 items, this workflow is
too thin for this test — say so; do not shrink the sample.)
SELECTION Random within the window [date range], stratified to match the
baseline mix: [document types / order sizes / customers /
report periods]. Record selection method and seed/date: [___].
FREEZE Sample drawn and frozen BEFORE the system processes it, and held
out of any vendor tuning.template · class 2 — predictive maintenance / condition monitoring
Use this for vibration- and condition-based failure prediction, alarm triage, work-order recommendation from sensor trends, and defer-vs-pull calls on rotating assets — the class where a fluent model most easily launders a real alarm into "monitor later." The key difference from Class 1: the golden set is a frozen event ledger of what the assets actually did, adjudicated against inspection and work-order records — never against the model's own confidence score. Worked figures below are synthetic (a fictional reliability twin) and must never be read as real.
Threshold table
ACCEPTANCE TEST — CLASS 2: [asset population, e.g. line-3 gearboxes + drives]
Current triage practice (the baseline): [calendar PM / tech reads raw trends]
Baseline signed by: [sponsor name] on [date] ("this baseline is fair")
RULES (frozen before pilot starts):
- Every gating threshold gates on a [counted] outcome adjudicated against a
frozen EVENT LEDGER of what the assets actually did (failure, degradation
confirmed by inspection/work-order, or healthy-through-window) — NOT
against the model's own confidence score.
- Every target must beat the current triage practice by a margin the
sponsor agrees is worth the pilot's cost.
- Thresholds frozen in writing before the pilot scores any asset-window.
# | Metric | Definition (exact) | Baseline [counted] | Worst tolerable | Target | Gating?
----+------------------------+---------------------------------------------+---------------------+---------------------+---------------------+---------
T1 | Detection sensitivity | % of adjudicated true degradation/failure | [___]% (current | >= [___]% | >= [___]% | PRIMARY
| (true-positive rate) | events the system flagged BEFORE the | catch rate, from | | | gates
| | lead-time horizon [e.g. ISO 20816 Zone C] | WO + failure recs) | | |
T2 | False-alarm rate | % of the system's "act now / pull" alarms | [___]% | <= [___]% | <= [___]% | PRIMARY
| | that adjudication shows were NOT real | | | | gates
| | (asset ran healthy through the window) | | | |
T3 | Lead time before | Median time, first correct flag -> | [___] days | >= [___] days | >= [___] days | PRIMARY
| functional failure | adjudicated failure/forced-removal, across | | median | median | gates
| | true positives (the usable P-F warning) | | | |
T4 | Critical-miss rate | % of adjudicated TIER-ZERO events (long- | [___] of last [n] | 0 misses, hard cap | 0 misses | PRIMARY
| | lead, no-spare assets — enumerate: [___]) | | [___]% | | gates
| | the system failed to flag in time | | | |
T5 | Downtime / cost- | Modeled from T1/T3 x criticality and | [estimated] | reported | reported | secondary
| avoidance projection | downtime exposure, with attribution and | | | | never gates
| | adoption discounts (NOT the vendor's | | | |
| | headline avoided-cost claim) | | | |
PASS = every PRIMARY threshold meets at least its "worst tolerable" value.
FAIL = any single PRIMARY threshold misses. A system that hits sensitivity
but floods the team with false alarms (T2), or catches routine assets
but misses one tier-zero event (T4), FAILS. Binary.
NOTE ON T4: a defer-vs-pull decision on a tier-zero asset is a HUMAN sign-off,
not a model output. T4 measures only whether the system FLAGGED IN TIME. This
test never certifies the model to make the pull decision itself.Measurement-on-your-data procedure
PROCEDURE — CLASS 2 1. WHO RUNS IT The team named in the pilot definition — typically [reliability engineer / PdM analyst] plus the [CMMS owner]. The vendor is not present and touches no data. Executable from this document alone. 2. GOLDEN SET = THE FROZEN EVENT LEDGER (built before the pilot runs) For every asset-window in the sample, [named reliability engineer], who is NOT the pilot's operator, adjudicates what the asset ACTUALLY DID: true degradation/failure (confirmed by inspection, work-order closure, or an ISO 14224 failure record) or healthy-through-window. Adjudication uses the governing instrumented and physical record (vibration trend, oil analysis, tear-down, closed work order) — NEVER the pilot model's own output. Ledger frozen and dated: [date]. (Prospective run on live assets: seal each predicted state BEFORE the outcome is known, then adjudicate against what actually happened.) 3. RUN The pilot scores the frozen sample in its normal configuration. No hand- tuning per asset. No cherry-picked reprocessing. No threshold changes after the ledger is frozen. 4. SCORE Each asset-window scored against the frozen ledger per T1–T4 (flagged in time / missed / false alarm; lead time computed only on true positives). One row per asset-window on the attached scoring sheet — the raw sheet is the record. 5. REPORT Results computed per threshold, dated, sent to both parties. Pass/fail rule applied mechanically.
Sampling rule
SAMPLING RULE — CLASS 2 (sample + ledger frozen before the pilot scores)
SOURCE Asset-windows drawn from our live condition-monitoring history and
CMMS for [target asset population]. NEVER vendor-supplied demo
signals. NEVER assets or windows the vendor saw during setup/tuning.
SIZE Minimum N = 30 adjudicated events (confirmed degradation/failure
cases) OR the full trailing 12 months of the population's alarm-
and-failure history, whichever is larger.
(Fewer than 30 adjudicable events in 12 months = too event-thin
for a defensible test on this template. Say so; do not score
against a handful of events and call it validated.)
SELECTION ALL adjudicated true events in the window are included (never
sample away failures), PLUS a stratified draw of healthy asset-
windows to set the false-alarm denominator — stratified to the
baseline asset mix: [criticality tier / asset class / duty cycle].
Record selection method and window dates: [___].
FREEZE Sample and event ledger drawn and frozen BEFORE the system scores
them (or, prospective run, each prediction sealed before its
outcome is known), held out of any vendor tuning.The illustrative figures in these tables (catch rates, zone boundaries, day counts) are placeholders and, in Tektari's own teaching material, come from a fictional reliability twin — an invented plant, invented tags, invented dollars — never a real operation. When you fill this in, every number comes from your own asset register, alarm history, and adjudicated events. Nothing here is a claim about any real plant, and nothing you keep should be either.
the clause: "'delivered' means this test passes on our data"
A perfect acceptance test does nothing if the SOW still lets "accepted" mean "the vendor said so." One clause fixes that: it makes the test the only definition of delivered, and ties money to it. Draft business language — have your own counsel adapt and confirm it.
SOW — ACCEPTANCE (attach the acceptance test as Exhibit A) 1. DEFINITION OF DELIVERED. "Delivered" and "Accepted," wherever used in this Statement of Work, mean solely and exactly that the system achieves a PASS on the Acceptance Test set out in Exhibit A, measured on [Buyer]'s data by [Buyer]'s personnel per the procedure, sampling rule, and pass/fail definition stated in that Exhibit. No other event, sign-off, demonstration, status report, or elapsed time constitutes acceptance. 2. WHO MEASURES. [Buyer] runs the Acceptance Test. [Vendor] performs no measurement and is not present during the test. The sample is drawn from [Buyer]'s live system of record after the effective date, and excludes any item [Vendor] accessed during setup, configuration, or tuning. 3. FROZEN BEFORE PILOT. The thresholds, procedure, sampling rule, and pass/ fail definition in Exhibit A are frozen in writing before the pilot processes any item, and are not altered thereafter except by written agreement of both Parties. 4. CONSEQUENCE OF PASS/FAIL. [ milestone payment / renewal / production roll- out ] of $[___] is [ due / triggered ] only upon a PASS. On a FAIL, [Buyer] may, at its sole option, [ require one remediation cycle and re-test / terminate for non-acceptance and owe nothing further / ___ ]. 5. DISPUTE PATH. Scoring disputes are re-adjudicated against the frozen golden set by [Buyer]'s named adjudicator, whose determination stands. Procedure disputes are resolved by one re-run on a fresh sample drawn per Exhibit A, same thresholds; that result is final. If unresolved after one re-run, the result is deemed a FAIL for purposes of Section 4. [ ties go to Buyer ] 6. WHAT [BUYER] KEEPS. On termination for any reason, [Buyer] retains [ the frozen golden set, the scoring records, the baseline sheet, and (___) ] as its property, usable with any other vendor or an internal build.
The three lines that carry the clause. Section 1 kills the vendor's private definition of "delivered." Section 2 stops the vendor grading their own homework — your team measures, on a post-signing sample the vendor never saw. Section 4 removes "paid either way" — money moves on a pass, not on a milestone date. Everything else is plumbing. If you get only those three into the contract, you've closed the three failure moves from section 01. And note section 6: whatever the outcome, you keep the golden set and the baseline — so a walk-away leaves you with a test you can run against the next vendor, not a dependency on this one.
this is exactly what the audit produces — with our own fee riding on it
Everything above is the buyer's-side playbook, given away, because a firm that sells no implementation and takes no vendor money can publish the one thing vendors won't. But writing a good acceptance test for a real workflow is work: measuring an honest baseline, setting thresholds that beat it by a margin worth paying for, freezing a sample your vendor can't game, and getting your sponsor to sign the number before anyone negotiates a threshold. That is the Problem-Definition Audit — three weeks, $12,500 fixed, one AI decision turned into a sponsor-signed baseline and a written acceptance test your team can run without us.
And here is the part that proves we mean it: $2,500 of that fee is at risk against this test. The final tranche is invoiced only when the scoped pilot passes the acceptance test your team runs on your own data — never if it fails, never if the pilot is never run. We put our own money on the same pass/fail line we just taught you to write. A "don't run this pilot" finding is a fee-costing recommendation for us; that is the point.
- $10,000 Invoiced at kickoff, before Week 1.
- $2,500 Invoiced only when the scoped pilot passes its pre-agreed written acceptance test — never invoiced if the pilot never runs, and never if it fails.
frequently asked
What is an AI acceptance test?
An AI acceptance test is the written definition of "delivered" for an AI system: the specific, measurable thresholds the system must hit on your own data, the procedure to measure them, the sampling rule that decides which items get tested, and a binary pass/fail rule — all agreed and frozen in writing before the pilot runs. Without it, the vendor defines "delivered," grades their own homework, and gets paid either way. With it, whether the system passes is a matter of arithmetic your own team can run, not a matter of the vendor's opinion.
Who writes the acceptance test — the buyer or the vendor?
The buyer, or an independent party the buyer controls — never the vendor alone. The whole point is to take the pen back. A vendor-written acceptance test measures what the vendor's system happens to be good at, on data the vendor got to see, scored by the vendor. You do not have to know the technology to write the test; you have to know your own workflow, your own baseline number, and what "worth paying for" means in hours or dollars. The vendor can review and negotiate thresholds, but the buyer holds the pen and the buyer's team runs the test.
How do I test an AI vendor's claims before I sign?
Do not test the demo. Write the acceptance test first, attach it to the SOW as the definition of "delivered," and require that the test runs on a sample drawn from your live system of record after signing — items the vendor never saw during setup or tuning. Set every gating threshold against a baseline number your sponsor has signed as fair. Make the pass/fail rule binary: every primary threshold met is a pass; any single one missed is a fail. That turns "our AI is 95% accurate" from a claim into a number your own people compute on your own data.
What should an AI acceptance test include in a statement of work?
Attach the acceptance test as an exhibit and make the SOW say, in the acceptance clause, that "delivered" and "accepted" mean exactly that the exhibit's test passes on the buyer's data — with no other definition of acceptance, and payment or renewal tied to it. Include the threshold table, the measurement procedure, the sampling rule, the binary pass/fail definition, and a dispute path where ties go to the buyer. Freeze all of it before the pilot starts, and state that the sample is drawn from live records the vendor did not see during setup or tuning.
What is a good pass/fail rule for an AI acceptance test?
Binary and un-averaged: a pass means every primary (gating) threshold is met on the measurement sample; a fail means any single primary threshold is missed. No weighting, no averaging across thresholds, no "substantially met." Averaging lets a system launder a critical miss — a wrong-customer error, or a missed failure on a high-criticality asset — behind strong numbers elsewhere. Keep secondary metrics like projected hours saved as reported-only lines that never gate the decision.
How do I keep the vendor from cherry-picking the test sample?
Write the sampling rule into the test and freeze it. Draw items from your live system of record — never vendor-supplied demo data, and never items the vendor saw during setup or tuning. Set a minimum sample size large enough to mean something. Select at random within the window, stratified to match your real mix, and record the selection method and date. Draw and freeze the sample before the system processes it, and hold it out of any vendor tuning. The sample decides the score; the vendor does not get to choose it after seeing their results.
the rest of the field guide
Five spokes, one hub. The acceptance test is the crown jewel; the others cover the decisions around it.
See where you rank first.
The acceptance test is one of four disciplines the free Industrial AI Scoreboard measures — twenty scored questions, no sales call. See which one is your weakest before you spend a dollar.
Take the ScoreboardThat doesn't look like an email — try name@company.com