Step 01 · Source
Find a real published study.
Every workflow begins with a peer-reviewed paper, a government survey, or an industry benchmark report. We extract the verbatim question, the cited population, and the published outcome — those three are the ground truth we will calibrate against. The full list of papers we've indexed is on the benchmark page.
Step 02 · Population
Build the population from real distributions, not guesses.
We start with audited population microdata: age, household composition, income, education, occupation, geography, ethnicity. Each simulated agent is drawn from those distributions, so the simulated population matches the exact joint frequencies of the published study's population. No synthetic personas, no LLM-imagined demographics. Full provenance of the underlying datasets is available under NDA; see the footer.
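As an illustration of what matching joint frequencies can look like in practice, here is a minimal sketch. It assumes the study population is summarised as respondent counts per demographic cell, and it uses largest-remainder allocation to decide how many agents each cell receives. The cell labels, the counts, and the function name `allocate_agents` are hypothetical; this is not a description of the production pipeline.

```python
# Hypothetical respondent counts per joint demographic cell
# (age band, tenure, geography). Real cells would come from microdata.
STUDY_CELLS = {
    ("18-34", "renter", "urban"): 300,
    ("18-34", "owner", "urban"): 100,
    ("35-64", "renter", "rural"): 150,
    ("35-64", "owner", "rural"): 450,
}


def allocate_agents(cells, n_agents):
    """Split n_agents across cells in proportion to the cell counts.

    Uses largest-remainder allocation with integer arithmetic, so the
    result always sums to n_agents and is as close to the joint
    frequencies as whole agents allow.
    """
    total = sum(cells.values())
    # Whole agents each cell is owed, plus the fractional part left over.
    base = {cell: (n_agents * count) // total for cell, count in cells.items()}
    remainder = {cell: (n_agents * count) % total for cell, count in cells.items()}
    leftover = n_agents - sum(base.values())
    # Hand the remaining agents to the cells with the largest remainders.
    ranked = sorted(cells, key=lambda cell: remainder[cell], reverse=True)
    for cell in ranked[:leftover]:
        base[cell] += 1
    return base


if __name__ == "__main__":
    for cell, agents in allocate_agents(STUDY_CELLS, 20).items():
        print(cell, agents)
```

Allocating over whole cells, rather than sampling age, tenure, and geography separately, keeps the correlations between attributes that independent marginal draws would lose.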
Step 03 · Ask the question
Run the published stimulus, unchanged.
The agents see the question the way real respondents saw it. Word order and framing are preserved verbatim. We do not re-prompt or paraphrase the stimulus to nudge accuracy; every run is judged against the literal published wording.
Step 04 · Calibrate
Three independent runs. Take the worst.
Each workflow is run three times against the same population (K=3). For each run we compute accuracy from the per-KPI MAPE (mean absolute percentage error), and we require both the mean and the minimum across the three runs to clear a 90% floor before the workflow is allowed to publish. A single lucky run never makes it through. Variance is real and the gate is honest about it.
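The gate can be sketched in a few lines. This example assumes that a run's accuracy is defined as 1 minus the MAPE taken over that run's KPIs, and that the mean and minimum are taken across the three runs; the text does not spell out either detail, so treat both as assumptions. The function names and the KPI values are hypothetical.

```python
from statistics import mean


def mape_accuracy(simulated, published):
    """Accuracy for one run, assumed here to be 1 - MAPE over its KPIs."""
    errors = [abs(s - p) / abs(p) for s, p in zip(simulated, published)]
    return 1.0 - mean(errors)


def passes_gate(run_accuracies, floor=0.90):
    """True only if both the mean and the worst run clear the floor."""
    return mean(run_accuracies) >= floor and min(run_accuracies) >= floor


if __name__ == "__main__":
    # Hypothetical published KPI values and three simulated runs (K=3).
    published = [0.40, 0.25, 0.10]
    runs = [
        [0.42, 0.24, 0.10],
        [0.38, 0.26, 0.11],
        [0.48, 0.20, 0.10],
    ]
    accuracies = [mape_accuracy(run, published) for run in runs]
    print([round(a, 4) for a in accuracies])
    print("publish" if passes_gate(accuracies) else "hold")
```

In this made-up example the mean accuracy across runs is above 90%, but the third run falls below it, so the workflow is held. Because the minimum can never exceed the mean, the worst run is always the binding check, which is what "take the worst" amounts to.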
Step 05 · Publish
Wins and misses, same prominence.
Every workflow that clears the gate gets an auto-generated case study with the per-agent transcripts, segment breakdown, top objections, and round-by-round sentiment. Every workflow that misses gets the same treatment on the benchmark page — we publish the miss, the delta, and the methodology version that produced it. The miss disappears from the customer surface only after a new calibration loop closes it.