Methodology · in the clear

Demographically grounded. Inspectable. Replayable.

Two hundred strangers don't represent a country. Two hundred thousand demographically grounded agents do. The methodology is the agreement. The benchmark is the audit.

Step 01 · Source

Find a real published study.

Every workflow begins with a peer-reviewed paper, a government survey, or an industry benchmark report. We extract the verbatim question, the cited population, and the published outcome — those three are the ground truth we will calibrate against. The full list of papers we've indexed is on the benchmark page.

Step 02 · Population

Build the population from real distributions, not guesses.

We start with audited population microdata: age, household composition, income, education, occupation, geography, ethnicity. Each simulated agent is drawn from those distributions so that the agent population matches the joint frequencies of the published study's population. No synthetic personas, no LLM-imagined demographics. Full provenance of the underlying datasets is available under NDA; see the footer.
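A minimal sketch of what drawing from a joint distribution means, assuming a small hypothetical frequency table (the cell labels, shares, and the `draw_agents` name are illustrative, not our production schema). Each agent is drawn as a whole cell, so correlations between attributes survive; sampling each attribute independently would lose them. Random draws only approximate the target shares, so matching them exactly would need a quota-style allocation instead.

```python
import random
from collections import Counter

# Hypothetical joint frequency table: each key is one full demographic cell,
# each value is that cell's share of the target population (shares sum to 1).
JOINT_SHARES = {
    ("18-34", "renter", "under_50k"): 0.30,
    ("18-34", "owner", "50k_plus"): 0.10,
    ("35-64", "renter", "under_50k"): 0.15,
    ("35-64", "owner", "50k_plus"): 0.45,
}


def draw_agents(joint_shares, n, seed=0):
    """Draw n agents as whole cells, weighted by each cell's joint share."""
    rng = random.Random(seed)  # seeded so the draw is replayable
    cells = list(joint_shares)
    weights = [joint_shares[cell] for cell in cells]
    return rng.choices(cells, weights=weights, k=n)


agents = draw_agents(JOINT_SHARES, n=20_000)
observed = Counter(agents)
for cell, share in JOINT_SHARES.items():
    drawn = observed[cell] / len(agents)
    print(cell, f"target={share:.2f}", f"drawn={drawn:.3f}")
```

Seeding the generator is what makes a population replayable: the same seed and the same table give the same agents.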

Step 03 · Ask the question

Run the published stimulus, unchanged.

The agents see the question the way real respondents saw it. Word order and framing are preserved verbatim. We do not re-prompt or paraphrase the stimulus to nudge accuracy; every run is judged against the literal published wording.

Step 04 · Calibrate

Three independent runs. Take the worst.

Each workflow is run three times against the same population (K=3). For each run we compute per-KPI accuracy from MAPE (mean absolute percent error). Before the workflow is allowed to publish, the mean across the three runs must clear the 90% floor and the worst run must clear the floor minus 2 percentage points. A single lucky run never makes it through. Variance is real and the gate is honest about it.
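For readers who want the arithmetic, here is a minimal sketch of the error measure. It assumes the standard definition of absolute percent error and assumes that accuracy is reported as 100 minus that error. The function names and the sample numbers are illustrative; this is not the exporter's code.

```python
def abs_pct_error(predicted, published):
    """Absolute percent error of one predicted KPI against its published value."""
    if published == 0:
        raise ValueError("percent error is undefined for a published value of 0")
    return abs(predicted - published) * 100.0 / abs(published)


def mape(pairs):
    """Mean absolute percent error over (predicted, published) pairs."""
    errors = [abs_pct_error(pred, pub) for pred, pub in pairs]
    return sum(errors) / len(errors)


def accuracy_pct(predicted, published):
    """Assumed convention: accuracy = 100 - absolute percent error, floored at 0."""
    return max(0.0, 100.0 - abs_pct_error(predicted, published))


# Illustrative numbers only: one KPI published at 40.0, predicted in three runs.
published_value = 40.0
run_predictions = [38.0, 41.0, 36.5]
per_run_accuracy = [accuracy_pct(p, published_value) for p in run_predictions]
print(per_run_accuracy)
```

Percent error is relative to the published value, so a 2-point miss on a KPI of 40 costs more accuracy than the same miss on a KPI of 80.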

Step 05 · Publish

Wins and misses, same prominence.

Every workflow that clears the gate gets an auto-generated case study with the per-agent transcripts, segment breakdown, top objections, and round-by-round sentiment. Every workflow that misses gets the same treatment on the benchmark page — we publish the miss, the delta, and the methodology version that produced it. The miss disappears from the customer surface only after a new calibration loop closes it.

Accuracy, in three measures.

Every claim about accuracy on this site can be traced back to one of the three rows below. There is no fourth number we are hiding.

Measure	Plain English	Source
MAPE (mean absolute percent error)	Per-KPI distance between predicted and published, expressed as a percent.	Mathematically equivalent to the CLI exporter's _mape_pct.
K-sample gate (K=3)	Mean of three independent runs must clear floor; minimum must clear floor − 2 percentage points.	CLAUDE.md I-4.
Default accuracy floor	90% per published KPI; configurable per workflow when justified.	docs/launch/LAUNCH_READINESS.md.
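The gate in the table above can be sketched in a few lines. This is an illustration under stated assumptions, not the gate's actual source: the function name and parameter names are hypothetical, and the inputs are per-run accuracy percentages for one KPI.

```python
from statistics import mean


def passes_k_sample_gate(run_accuracies, floor=90.0, min_slack_pp=2.0, k=3):
    """Apply the K-sample gate as the table states it.

    The mean of the K runs must reach the floor, and the worst run must
    reach the floor minus min_slack_pp percentage points.
    """
    if len(run_accuracies) != k:
        raise ValueError(f"expected {k} runs, got {len(run_accuracies)}")
    mean_ok = mean(run_accuracies) >= floor
    worst_ok = min(run_accuracies) >= floor - min_slack_pp
    return mean_ok and worst_ok


print(passes_k_sample_gate([95.0, 97.5, 91.25]))  # all runs comfortably above the floor
print(passes_k_sample_gate([99.0, 99.0, 87.0]))   # strong mean, but the worst run is below 88
```

The second call is the case the gate exists for: two excellent runs cannot carry one run that falls below the minimum.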

The dimensions, exactly.

The population layer pulls from audited demographic distributions; the question layer comes from the cited paper. Both are linked from every case study. Full provenance of the underlying datasets is available under NDA.

Dimension	What we use it for
Age × income × geography × household	Joint demographic backbone — the cohorts every workflow weights against.
Education × occupation × industry × wage	Workplace and economic distributions for B2B, professional-services, and labor-market scenarios.
Ethnicity × language × nativity × tenure	Population diversity bands used for region-specific weighting.
Published study population	The paper's own target population (defines the weighting on top of the demographic backbone).
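To make "weighting on top of the demographic backbone" concrete, here is a minimal sketch that assumes simple post-stratification: each cell's weight is its share in the study's target population divided by its share among the agents. The function names and the two-cell example are hypothetical, and the production weighting may differ.

```python
from collections import Counter


def cell_weights(agent_cells, target_shares):
    """Weight per cell = target share / share observed among the agents."""
    counts = Counter(agent_cells)
    n = len(agent_cells)
    weights = {}
    for cell, target in target_shares.items():
        if counts[cell] == 0:
            # Assumption: a target cell with no agents cannot be represented.
            raise ValueError(f"no agents in cell {cell!r}")
        weights[cell] = target / (counts[cell] / n)
    return weights


def weighted_share(agent_cells, answers, weights):
    """Weighted share of agents whose answer is True.

    Agents in cells outside the target population get weight 0.
    """
    total = sum(weights.get(cell, 0.0) for cell in agent_cells)
    yes = sum(weights.get(cell, 0.0) for cell, ans in zip(agent_cells, answers) if ans)
    return yes / total


# Illustrative only: the backbone is 50/50 urban/rural, the study targets 80/20.
cells = ["urban", "urban", "rural", "rural"]
answers = [True, True, False, False]
weights = cell_weights(cells, {"urban": 0.8, "rural": 0.2})
print(weights)
print(weighted_share(cells, answers, weights))
```

Unweighted, half of these agents say yes. After weighting to the study's population, the estimate moves to the urban-heavy share the paper actually sampled.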

Full data provenance

The underlying datasets, vendor agreements, refresh cadences, and license terms are documented in the data provenance brief. The brief is available to existing or prospective customers under NDA: email security@simulix.com and we'll send the watermarked PDF within one business day.

Known limitations

What this approach is not.

• Simulix is not a substitute for IRB-reviewed primary research. It is a way to ask one careful question a thousand ways before you commission a panel.
• Predictions are only as good as the demographic distribution the workflow targets. If a study's population is not representable in our audited distributions, the workflow is rejected at calibration time.
• LLM behaviour drifts as upstream model versions change. We run K=3 every week against the same gate and re-publish when accuracy moves more than 2 percentage points.
• We do not simulate non-U.S. populations at launch. EU coverage is on the roadmap; see the changelog.
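The weekly drift check in the third limitation above reduces to one comparison. A minimal sketch, assuming accuracy is tracked as a percentage and that "moves" means change in either direction; the function name is hypothetical.

```python
def needs_republish(published_accuracy, latest_accuracy, threshold_pp=2.0):
    """True when accuracy moved by more than threshold_pp percentage points."""
    return abs(latest_accuracy - published_accuracy) > threshold_pp


print(needs_republish(94.0, 91.5))  # moved 2.5 points down
print(needs_republish(94.0, 92.0))  # moved exactly 2.0 points, not more than 2
```

The comparison is symmetric on purpose: an unexplained jump upward is as much a sign of upstream drift as a drop.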

The proof

Public benchmark ledger.

Every prediction, every outcome, every miss recorded with the same prominence as every hit.

The receipts

Auto-generated case studies.

One per workflow that clears the gate. Per-agent transcripts, segment breakdowns, round-by-round sentiment.