Methods Benchmark Can Aid Trust In Observational Research, Per Recent OHDSI Study

Martijn Schuemie

The prevalence of electronic healthcare data gives researchers the opportunity to study the effects of medical treatments. However, confidence in the results of such observational research is typically low, in part because different studies of the same question often produce conflicting results, even when they use the same data. We need to answer the question “to what extent can we trust observational research?”

Led by Martijn Schuemie, OHDSI researchers recently published “How Confident Are We About Observational Findings in Healthcare: A Benchmark Study” in the Harvard Data Science Review to tackle this important issue. This paper presents the OHDSI Methods Benchmark to evaluate five methods commonly used in observational research (new-user cohort, self-controlled cohort, case-control, case-crossover, and self-controlled case series designs) over a network of four large databases standardized to the OMOP Common Data Model.

Using negative and positive controls (questions where the answer is known), together with a set of metrics and open-source software tools developed within the OHDSI community, the research team determined that most commonly used approaches to observational effect estimation fall short of expected confidence levels. Selection bias, confounding, and misspecification are among the sources of systematic error that plague the validity of potentially important findings within the healthcare community.

How can we trust observational findings moving forward? One solution is a technique developed within OHDSI called ‘empirical calibration’, which adjusts the results, and the confidence we can have in them, based on what was observed for a set of control questions.
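
The idea behind calibration can be illustrated with a small sketch. It is not the OHDSI implementation and does not use the API of any OHDSI package: the function names and the negative-control numbers are hypothetical, and the systematic-error distribution is estimated with a simple method-of-moments shortcut rather than a full likelihood fit. It assumes that estimates for negative controls (true relative risk of 1) are normally distributed on the log scale around a bias term, and uses that distribution to recompute a p-value for a new estimate.

```python
import math
from statistics import mean, variance


def fit_null(log_rrs, ses):
    """Estimate the systematic-error distribution N(mu, sigma^2) from
    negative-control estimates (simplified moment-based fit, an assumption)."""
    mu = mean(log_rrs)
    # Observed spread is roughly systematic variance plus sampling variance.
    sigma2 = max(0.0, variance(log_rrs) - mean(se ** 2 for se in ses))
    return mu, math.sqrt(sigma2)


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - cdf)


def traditional_p(log_rr, se):
    """Uncalibrated p-value: assumes the only error is random error."""
    return two_sided_p(log_rr / se)


def calibrated_p(log_rr, se, mu, sigma):
    """p-value against the empirical null, which includes systematic error."""
    return two_sided_p((log_rr - mu) / math.sqrt(sigma ** 2 + se ** 2))


# Hypothetical negative controls: the true log relative risk is 0 for each,
# yet the estimates lean positive, which suggests residual bias.
negative_log_rrs = [0.10, 0.35, 0.25, 0.40, 0.05, 0.30, 0.20, 0.45]
negative_ses = [0.10] * len(negative_log_rrs)

mu, sigma = fit_null(negative_log_rrs, negative_ses)

# A new estimate of interest: relative risk 1.5 with standard error 0.10.
log_rr, se = math.log(1.5), 0.10
print("traditional p:", traditional_p(log_rr, se))
print("calibrated p: ", calibrated_p(log_rr, se, mu, sigma))
```

With these made-up inputs, the uncalibrated p-value falls well below 0.05 while the calibrated one does not, because the negative controls show that this hypothetical design tends to produce positive estimates even when no effect exists.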

“Our results show that simply assuming that an observational study design will produce the right answer is little more than wishful thinking,” Schuemie says. “For every study, we need to measure the potential for bias through the use of controls, and calibrate our estimates accordingly.”

Through the Benchmark, the researchers showed that empirical calibration makes it possible to distinguish between study designs that merely produce noise and those that are informative. Self-controlled designs in particular, such as the self-controlled case series, performed best in many scenarios, although there is no silver bullet.

The methods evaluated in this paper are part of the OHDSI Methods Library, a set of open-source R packages that is available for all data standardized to the OMOP Common Data Model. The OHDSI community believes in the values of open science and transparency, and all results are publicly available in its GitHub repositories.

Additional authors are Soledad Cepeda, Marc A. Suchard, Jianxiao Yang, Yuxi Tian, Alejandro Schuler, Patrick B. Ryan, David Madigan, and George Hripcsak.