White Paper · April 2026

Beyond Replication

What It Actually Means to Evaluate an AI-Generated Health Economic Model


Executive Summary

When organizations evaluate AI-enabled health economic modeling tools, the question most frequently posed is whether the AI can replicate a published model. The appeal is understandable: replication points to a known output, invites a straightforward comparison, and produces a clean verdict. This white paper argues that the replication standard is the wrong benchmark, that it conflates two fundamentally different competencies, and that procurement decisions built around it risk systematically selecting for the wrong capabilities.

The replication standard draws its apparent rigor from double-programming practices originally developed in clinical trial analysis, where independent implementations of the same specification are compared to detect coding errors. Applied to AI-generated health economic models, the analogy breaks down. When GPT-4 was provided with detailed methodological descriptions and parameter values from published oncology models, it reproduced the published incremental cost-effectiveness ratios (ICERs) to within 1% across a majority of runs. That result is technically impressive, but it demonstrates something narrow: the ability to implement a pre-specified model accurately when given its full specification. The analytical task that health economic modelers actually perform, constructing a defensible model from a decision problem and a clinical evidence base, is a different challenge entirely. The replication question conflates these two competencies, treating one as evidence of the other.
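To make the distinction concrete, the sketch below (Python, with invented numbers) shows the arithmetic a replication exercise ultimately tests: recomputing the ICER from a fully specified set of costs and QALYs and checking it against the published figure. Passing this check says nothing about whether the model that produced those totals was the right one to build.

```python
# Hypothetical illustration of the replication check. All figures are invented;
# the point is what the test does and does not examine.

published_icer = 48_500.0                         # published ICER (cost per QALY)

# Totals produced by re-implementing the published model from its full specification
cost_new, cost_comparator = 91_200.0, 52_300.0    # discounted lifetime costs
qaly_new, qaly_comparator = 2.41, 1.61            # discounted QALYs

reproduced_icer = (cost_new - cost_comparator) / (qaly_new - qaly_comparator)
relative_error = abs(reproduced_icer - published_icer) / published_icer

print(f"Reproduced ICER: {reproduced_icer:,.0f} per QALY")
print(f"Within 1% of the published value: {relative_error <= 0.01}")
```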

The case against replication as a gold standard deepens when the status of published models is examined directly. Every cost-effectiveness model embeds structural choices that are not determined by evidence alone: health state definitions, cycle lengths, time horizons, extrapolation methods, and comparator specifications. Bojke et al. demonstrated that cost-effectiveness conclusions are highly sensitive to these choices and that legitimate, experienced modelers will resolve them differently when presented with the same clinical evidence base. Fewer than 16% of health technology assessment reports explicitly address structural uncertainty, meaning the published record systematically understates the range of defensible answers any given decision problem admits. Parameter derivation choices, including utility source selection and cost inflation methods, are frequently underdocumented, and software architecture introduces further variability that is almost never reported. Published models are therefore one coordinate in a space of defensible models, not a ground truth against which AI outputs can be measured.
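The sensitivity to structural choices is easy to demonstrate. The toy Python model below (a three-state cohort model with invented transition probabilities, costs, and utilities) holds every parameter fixed and varies only the time horizon, one of the structural choices listed above; the resulting ICER shifts even though nothing about the evidence has changed.

```python
# Toy three-state Markov cohort model (progression-free, progressed, dead).
# All inputs are invented; no half-cycle correction. The only thing varied
# below is the time horizon, a structural choice, yet the ICER moves.

ANNUAL_DISCOUNT = 0.035

def run_model(p_progress, p_die, cost_per_year, u_pf, u_prog, horizon_years):
    """Return discounted total costs and QALYs for one strategy."""
    pf, prog = 1.0, 0.0                      # cohort starts progression-free
    total_cost = total_qaly = 0.0
    for year in range(horizon_years):
        disc = 1.0 / (1.0 + ANNUAL_DISCOUNT) ** year
        total_cost += disc * cost_per_year * (pf + prog)       # costs accrue while alive
        total_qaly += disc * (u_pf * pf + u_prog * prog)       # utility-weighted life years
        # Annual transitions: progression-free may progress or die; progressed may die
        pf, prog = pf * (1 - p_progress - p_die), prog * (1 - p_die) + pf * p_progress
    return total_cost, total_qaly

for horizon in (10, 40):                     # two defensible horizons, same evidence
    c0, q0 = run_model(0.30, 0.05, 12_000, 0.80, 0.55, horizon)   # comparator
    c1, q1 = run_model(0.18, 0.04, 16_000, 0.80, 0.55, horizon)   # new therapy
    print(f"{horizon}-year horizon: ICER = {(c1 - c0) / (q1 - q0):,.0f} per QALY")
```

The same exercise applies to cycle length, extrapolation method, and comparator choice; the evaluative question is whether each choice is justified, not whether it matches the published one.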

The appropriate evaluative standard follows from this analysis. What should be assessed is whether the AI-generated model is structurally defensible, not whether it is numerically identical to a prior implementation. This means interrogating whether the conceptual model is appropriate for the decision problem, whether input data were sourced and applied with documented justification, whether structural choices are transparent and contestable, and whether uncertainty is characterized honestly. The AdViSHE validation framework, developed by 47 international health economics experts, evaluates models across precisely these dimensions. Reproducing a published ICER addresses, at best, a narrow slice of that framework.

For HEOR, Medical Affairs, and Market Access leaders charged with evaluating AI modeling tools, the practical implication is direct. The evaluation question should shift from "Does this AI produce our number?" to "Does this AI build defensible models?" Those are separable questions, and only the latter predicts whether the tool will deliver durable analytical value across the decision problems an organization will actually face.

Key Takeaways

  • Flawed Benchmark: The replication standard conflates the ability to implement a pre-specified mathematical model with the analytical task of constructing a defensible model from a clinical evidence base.
  • Structural Uncertainty: Published models are not ground truths; they embed countless structural and parameter choices that legitimate modelers resolve differently. Relying on them as gold standards ignores this accepted variation.
  • Integrated Evaluation Framework: Evaluation should focus on structural defensibility—transparency, source justification, and characterization of uncertainty—aligning with established guidelines like the AdViSHE validation framework.
  • Changing the Question: Procurement and evaluation questions should shift from "Does this AI produce our number?" to "Does this AI build defensible models?"

Download the Full Paper

Get the PDF version for offline reading and sharing with your team.


Developed by Aide Solutions LLC.