Merck & Co., Inc., Kenilworth, NJ, USA


Since the COVID-19 pandemic and vaccine rollout, interest in and need for real-world studies of infectious diseases and vaccines have increased dramatically. The OHDSI community and the OMOP common data model support robust observational studies across multiple datasets and institutions. However, quality issues with vaccine-related concepts in the OMOP vocabulary pose a significant barrier to efficient, high-quality studies. Following our previous quality assessment of vaccine vocabularies and concepts related to influenza, pneumococcal disease, and shingles, we expand the evaluation to all vaccine types, focusing on the “Maps to” relationships between source concepts and standard vaccine concepts.


The OMOP vocabulary defines a set of source_concept_id – standard_concept_id pairs, referred to as “mappings”, in the concept_relationship table. To evaluate all vaccine mappings, we started by identifying all vaccine concepts in the OMOP vocabulary (v5.0 29-OCT-2020) using a three-step rule-based approach. First, we combined iterative regular-expression pattern search and manual review to identify a starting set of vaccine concepts. Next, we added concepts that were related to the starting set through any relationship defined in the OMOP vocabulary (e.g., Maps to, Mapped from, RxNorm-CVX, Trade name of, Brand name of). Finally, we added all descendants of any concept identified in the previous two steps. After each step, we manually reviewed the vaccine concept list for accuracy and completeness.
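The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually run for this study: the toy rows, the regular expression, and the function names are all hypothetical, and the manual review after each step is not represented. Only the table names (concept, concept_relationship, concept_ancestor) come from the OMOP vocabulary.

```python
import re

# Toy stand-ins for OMOP vocabulary tables; every row is made up for illustration.
concept = {  # concept_id -> concept_name
    1: "influenza virus vaccine",
    2: "Fluzone",
    3: "Fluzone 0.5 mL prefilled syringe",
    4: "aspirin 81 mg oral tablet",
}
concept_relationship = [  # (concept_id_1, relationship_id, concept_id_2)
    (2, "Trade name of", 1),
    (4, "Maps to", 4),
]
concept_ancestor = [  # (ancestor_concept_id, descendant_concept_id)
    (2, 3),
]

# Hypothetical pattern; the real search was iterative and manually reviewed.
VACCINE_PATTERN = re.compile(r"vaccin|immuni[sz]ation", re.IGNORECASE)


def seed_by_pattern(concepts, pattern):
    """Step 1: concepts whose name matches the vaccine pattern."""
    return {cid for cid, name in concepts.items() if pattern.search(name)}


def expand_by_relationship(seed, relationships):
    """Step 2: add concepts linked to the seed set by any relationship."""
    related = set(seed)
    for concept_id_1, _relationship_id, concept_id_2 in relationships:
        if concept_id_1 in seed:
            related.add(concept_id_2)
        if concept_id_2 in seed:
            related.add(concept_id_1)
    return related


def add_descendants(concept_ids, ancestors):
    """Step 3: add descendants of every concept found so far.

    concept_ancestor stores transitive ancestor-descendant pairs,
    so a single pass over the table is enough.
    """
    return concept_ids | {d for a, d in ancestors if a in concept_ids}


seed = seed_by_pattern(concept, VACCINE_PATTERN)
related = expand_by_relationship(seed, concept_relationship)
vaccine_concepts = add_descendants(related, concept_ancestor)
```

In this toy example the brand name is picked up only through its relationship to the matched concept, and the packaged product only as a descendant of the brand name, which is why the three steps are applied in sequence.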

Once the comprehensive list of vaccine concepts was complete, we extracted all “Maps to” relationships from the concept_relationship table in which both the source and target concept IDs were in our set of vaccine concepts. We then counted occurrences of each source_concept_id – standard_concept_id pair in the drug_exposure and procedure_occurrence tables of five large OMOP CDM databases to which we have access at Merck: Truven CCAE, Optum Clinformatics, Humana, Premier, and CPRD.
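A minimal sketch of this extraction and counting step is shown here against an in-memory SQLite database. It is an assumption-laden illustration rather than the queries used in the study: the vaccine_concept helper table, the function name, and all rows are hypothetical, while the remaining table and column names follow the OMOP CDM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT);
-- Hypothetical helper table holding the vaccine concept set.
CREATE TABLE vaccine_concept (concept_id INTEGER PRIMARY KEY);
CREATE TABLE drug_exposure (
    drug_source_concept_id INTEGER, drug_concept_id INTEGER);
CREATE TABLE procedure_occurrence (
    procedure_source_concept_id INTEGER, procedure_concept_id INTEGER);

-- Made-up rows for illustration only.
INSERT INTO vaccine_concept VALUES (10), (11), (20), (21), (40), (41);
INSERT INTO concept_relationship VALUES
    (10, 11, 'Maps to'),
    (20, 21, 'Maps to'),
    (40, 41, 'Maps to'),      -- valid vaccine mapping that is never used
    (10, 99, 'Maps to'),      -- target 99 is not a vaccine concept
    (11, 10, 'Mapped from');  -- not a 'Maps to' relationship
INSERT INTO drug_exposure VALUES (10, 11), (10, 11), (30, 31);
INSERT INTO procedure_occurrence VALUES (20, 21);
""")

MAPPING_USAGE_SQL = """
WITH vaccine_mapping AS (
    SELECT cr.concept_id_1 AS source_concept_id,
           cr.concept_id_2 AS standard_concept_id
    FROM concept_relationship cr
    JOIN vaccine_concept s ON s.concept_id = cr.concept_id_1
    JOIN vaccine_concept t ON t.concept_id = cr.concept_id_2
    WHERE cr.relationship_id = 'Maps to'
),
occurrence AS (
    SELECT drug_source_concept_id AS source_concept_id,
           drug_concept_id AS standard_concept_id
    FROM drug_exposure
    UNION ALL
    SELECT procedure_source_concept_id, procedure_concept_id
    FROM procedure_occurrence
)
SELECT m.source_concept_id,
       m.standard_concept_id,
       COUNT(o.source_concept_id) AS n_records
FROM vaccine_mapping m
LEFT JOIN occurrence o
       ON o.source_concept_id = m.source_concept_id
      AND o.standard_concept_id = m.standard_concept_id
GROUP BY m.source_concept_id, m.standard_concept_id
ORDER BY m.source_concept_id
"""


def count_mapping_usage(connection):
    """Return (source_concept_id, standard_concept_id, n_records) rows."""
    return connection.execute(MAPPING_USAGE_SQL).fetchall()


usage = count_mapping_usage(conn)
```

The LEFT JOIN keeps mappings with zero records, so the same result can be filtered to the pairs that occur at least once in a given database.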

Vaccine mappings that occurred at least once in one of the five databases were manually reviewed for accuracy by a clinical expert. The findings were then communicated to the OHDSI vocabulary team, resulting in the correction of incorrect mappings, discussion of vaccine-specific mapping issues, and the formation of a new OHDSI vaccine vocabulary workgroup.


We found 15,932 vaccine-related concepts in 32 vocabularies in the OMOP concept table (Table 1). From these concepts we extracted 15,220 “Maps to” relationships and reviewed 1,170 source_concept_id – standard_concept_id pairs with at least one occurrence in any of the five CDM datasets we have access to. The clinical expert on our team identified potential problems with 104 mappings (8.89% of the mappings reviewed), as summarized in Table 2.
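The percentages reported alongside counts here and in the tables are simple ratios. As a small worked example, the helper below (a hypothetical name, not part of any OHDSI tooling) reproduces two of the reported figures from counts stated in this abstract.

```python
def count_with_percent(numerator, denominator, decimals=1):
    """Format a count and its share of a total, e.g. '1,000 (70.1%)'."""
    percent = 100 * numerator / denominator
    return f"{numerator:,} ({percent:.{decimals}f}%)"


# Mappings flagged by the clinical expert, out of all mappings reviewed.
flagged = count_with_percent(104, 1170, decimals=2)

# NDC vaccine concepts used in the real-world datasets (Table 1).
ndc_used = count_with_percent(1000, 1427)
```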

Table 1. Vaccine vocabulary usage in 5 real-world databases

| Vocabulary | Number of vaccine concepts | Number (%) of vaccine concepts used in real-world datasets |
| --- | --- | --- |
| NDC | 1,427 | 1,000 (70.1%) |
| RxNorm | 2,920 | 420 (14.4%) |
| Read | 257 | 257 (100%) |
| SNOMED | 474 | 213 (44.9%) |
| Gemscript | 289 | 148 (51.2%) |
| RxNormExtension | 8,229 | 130 (1.6%) |
| CVX | 161 | 120 (74.5%) |
| CPT4 | 136 | 118 (86.8%) |
| HCPCS | 22 | 22 (100%) |
| ICD9Proc | 6 | 6 (100%) |
| ICD10PCS | 2 | 2 (100%) |
| AMT | 118 | 0 (0%) |
| ATC | 66 | 0 (0%) |
| BDPM | 128 | 0 (0%) |
| CIEL | 61 | 0 (0%) |
| CTD | 10 | 0 (0%) |
| dm+d | 356 | 0 (0%) |
| DPD | 130 | 0 (0%) |
| GCN_SEQNO | 220 | 0 (0%) |
| GGR | 171 | 0 (0%) |
| HemOnc | 3 | 0 (0%) |
| ICD9ProcCN | 4 | 0 (0%) |
| JMDC | 26 | 0 (0%) |
| KDC | 3 | 0 (0%) |
| MeSH | 32 | 0 (0%) |
| Multum | 53 | 0 (0%) |
| NCCD | 8 | 0 (0%) |
| NDFRT | 30 | 0 (0%) |
| Nebraska Lexicon | 49 | 0 (0%) |
| OPCS4 | 1 | 0 (0%) |
| SPL | 261 | 0 (0%) |
| VA Product | 176 | 0 (0%) |

Table 2. Identified mapping issues

| Mapping issue category | Definition and example | Number (%) of mapping issues |
| --- | --- | --- |
| Lack of complete mapping | A vaccine mapping that did not capture all components or ingredients of the vaccine. Example: [45488921] "Third low dose diphtheria, tetanus and inactivated polio vaccination" maps to [529411] "tetanus" and [529303] "diphtheria" but not polio. | 68 (65.4%) |
| Incorrect mapping | A vaccine mapping where the standard concept is not synonymous with the source concept. Example: [21601291] "hemophilus influenzae B, purified antigen conjugated; systemic" maps to [515671] "Neisseria meningitidis". | 21 (20.1%) |
| Imprecise mapping | A vaccine mapping where important information is either removed or added. Example: [2213439] "Influenza virus vaccine, trivalent (IIV3), split virus, 0.25 mL dosage, for intramuscular use" maps to [40213153] "Influenza, seasonal, injectable", which drops information about dosage, route, and valence. | 6 (5.8%) |
| Questionable mapping | A vaccine mapping that is not necessarily incorrect but should be reviewed by the vocabulary team. Example: [2213449] "Rabies vaccine, for intramuscular use" maps to CVX concept [40213209] "rabies vaccine, unspecified formulation". CVX concept [40213208] "rabies, intramuscular injection" is a standard concept that would be a better fit but has been retired by CDC. | 9 (8.7%) |


Retrieving all vaccine-related concepts and evaluating their “Maps to” relationships provides a fundamental understanding of how vocabulary quality affects health outcomes research and paves the way toward a better ontology for vaccine-related concepts. We found that only 15% of vaccine concepts and 12% of all vaccine “Maps to” relationships we identified in the OMOP vocabulary occurred in at least one of five commonly used real-world databases, which suggests an opportunity to prioritize future vaccine vocabulary quality improvement work. Further discussion of the mapping issues in the OHDSI vaccine vocabulary workgroup has led to concrete improvements in how vaccine concepts are represented in the OMOP vocabulary.


The complete lists of all vaccine concepts and all vaccine mappings we identified can be downloaded as CSV files using the links below.