1. MRL, Merck & Co., Inc., Kenilworth, NJ 07033, USA
  2. Odysseus Data Services, Cambridge, MA 02142, USA
  3. Corresponding author

Abstract

Vaccination against communicable diseases is one of the pillars of a modern healthcare system and has contributed to longer, healthier lives for people around the world. Given the widespread use and broad efficacy of vaccines, there is significant interest in conducting vaccine-related health outcomes research. Moreover, vaccine development poses unique challenges compared to other therapeutic modalities, and some of these challenges can be addressed using observational studies.

In most electronic health records (EHR) and claims databases in the United States, vaccine administration is recorded as a “procedure” but is treated as a “drug” in the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). Given this difference in coding, studying vaccines using data converted to the OMOP CDM can increase the chance of errors during analysis. In order to ensure reliable and reproducible analyses from observational clinical data, we performed a careful examination and evaluation of vaccine-related concepts in the OMOP vocabularies and their associated mappings from source concepts to standard concepts using three different vaccines.

We identified several issues that could affect the quality of vaccine-related health outcomes studies using the OMOP CDM. These include (1) the assignment of vaccine codes to the procedure, drug, or observation domain, (2) changes in the “standard concept” status of some concepts over time, and (3) issues common to many medical ontologies, such as the lack of a complete hierarchy, of exact one-to-one mappings, and of clear naming conventions. We believe it is valuable to document these findings and communicate them to the OHDSI community in the hope of identifying opportunities for future improvement.

Background

Vaccination plays a major part in controlling and eliminating infectious diseases and is considered to be the most cost-effective medical intervention against them [1]. Although vaccine efficacy and safety are typically evaluated in randomized controlled trials (RCTs), observational studies are also critical to confirm RCT results, determine vaccine effectiveness, and identify adverse drug events (ADEs). Multiple EHR and claims databases can be used for an observational study, but different datasets often use different coding schemes and data formats. Using the OMOP CDM, researchers can transform multiple datasets into a common format with a common coding scheme. This enables the efficient use of diverse data sources, even across multiple organizations, for identifying robust effectiveness and safety signals from observational studies, including studies of vaccines [2]. For example, Boland et al. studied vaccine-related ADEs using a set of OMOP CDM data and identified a statistically significant association between swine flu vaccination and certain ADEs [3]. In this study, we were motivated to thoroughly and systematically review vaccine-related concepts in the OMOP vocabulary and the mapping from source concepts to standard concepts. We chose to study vaccinations against three diseases: influenza, pneumococcal disease, and shingles (herpes zoster, caused by reactivation of the varicella-zoster virus). The influenza vaccine is updated annually, while vaccines against pneumococcal disease and shingles both have long protection periods. Our goal is to evaluate the use of the OMOP CDM for vaccine-related longitudinal outcome studies.

Methods

We identified vaccination concepts from commonly used vocabularies (e.g. NDC, CPT4, HCPCS, RxNorm, ICD9Proc, and ICD10PCS) through multiple rounds of keyword searching and manual inspection. The OMOP vocabulary tables were downloaded from Athena on September 10, 2020. We found a total of 2,092 concepts, including 1,825 influenza, 221 pneumococcal, and 46 shingles vaccine concepts (Supplemental Table S1). We then traced how these source concepts were mapped to standard concepts and organized the concepts by domain and hierarchy. To measure the impact of the issues, we also examined the occurrences of these concepts in the Truven MarketScan® Commercial Claims and Encounters (CCAE) and Medicare Supplemental (MDCR) Databases (1/1/2011 to 9/30/2019).

Results

Procedure codes are mapped to other domains in OMOP vocabularies

When building a cohort, a user usually starts by filtering events in one of the domains: drug, procedure, measurement, or observation. In raw data, vaccination is often recorded using procedure codes, e.g. CPT4 and HCPCS. Therefore, the user may use only procedure domain concepts to define a vaccination event. However, in OMOP vocabularies, those vaccine procedure codes can be assigned to the procedure, drug, or observation domain (Table 1). This may confuse less-experienced users. Using only the procedure domain will miss most of the vaccination records. Users who noticed "Drugs administered as part of a Procedure, such as chemotherapy or vaccines" in the documentation of the DRUG_EXPOSURE table and search for vaccine concepts only in the drug domain will still miss a significant number of records. The issue can be avoided by using all three domains to search concepts and define events. Proper documentation of this nuance, and training on it, would help researchers search for and select the most comprehensive and meaningful concepts for their study.

Table 1. OMOP domain of vaccine concepts and their occurrences in Truven data
Vocabulary  Assigned Domain  No. of Source Codes  No. of Occurrences  Percentage of Occurrences
Influenza vaccine
ATC         Drug             2                    0                   0.0%
CPT4        Drug             29                   49,295,355          69.8%
CPT4        Observation      2                    66,843              0.1%
CVX         Drug             19                   0                   0.0%
HCPCS       Drug             11                   11,104,383          15.7%
HCPCS       Observation      3                    1,476,645           2.1%
ICD10PCS    Procedure        2                    1,786               0.0%
ICD9Proc    Procedure        1                    0                   0.0%
NDC         Drug             624                  8,664,479           12.3%
RxNorm      Drug             1,132                0                   0.0%
Pneumococcal vaccine
ATC         Drug             1                    0                   0.0%
CPT4        Drug             3                    13,894,808          76.6%
CPT4        Observation      1                    2,548,639           14.1%
CVX         Drug             5                    0                   0.0%
HCPCS       Drug             2                    1,420,506           7.8%
HCPCS       Observation      1                    3,911               0.0%
NDC         Drug             43                   269,080             1.5%
RxNorm      Drug             165                  0                   0.0%
Shingles vaccine
CPT4        Drug             2                    1,543,985           51.7%
CVX         Drug             2                    0                   0.0%
NDC         Drug             19                   1,444,024           48.3%
RxNorm      Drug             23                   0                   0.0%
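
To make the coverage gap in Table 1 concrete, the following minimal Python sketch sums the influenza occurrence counts by assigned domain. The counts are copied from Table 1 (vocabularies with zero occurrences are omitted); the names `TABLE1_INFLUENZA` and `coverage_by_domain` are illustrative and are not part of any OHDSI tool.

```python
from collections import defaultdict

# (vocabulary, assigned domain, occurrences) for influenza vaccine codes,
# copied from Table 1; rows with zero occurrences are left out.
TABLE1_INFLUENZA = [
    ("CPT4", "Drug", 49_295_355),
    ("CPT4", "Observation", 66_843),
    ("HCPCS", "Drug", 11_104_383),
    ("HCPCS", "Observation", 1_476_645),
    ("ICD10PCS", "Procedure", 1_786),
    ("NDC", "Drug", 8_664_479),
]


def coverage_by_domain(rows):
    """Return {domain: share of all occurrences} for the given rows."""
    totals = defaultdict(int)
    for _vocabulary, domain, occurrences in rows:
        totals[domain] += occurrences
    grand_total = sum(totals.values())
    return {domain: n / grand_total for domain, n in totals.items()}


influenza_coverage = coverage_by_domain(TABLE1_INFLUENZA)
for domain, share in sorted(influenza_coverage.items()):
    print(f"{domain:<12}{share:.3%}")
```

On these counts, the Procedure domain alone captures well under 0.1% of influenza vaccine records, and the Drug domain alone leaves out roughly 2% of records that sit in the Observation domain.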

Inappropriate mapping from non-standard concepts to standard concepts

Some concepts in OMOP vocabularies were selected or created as the 'standard' representation of clinical events. For example, MESH code D001281, CIEL code 148203, SNOMED code 49436004, ICD9CM code 427.31, and Read code G573000 all define “Atrial fibrillation” in the condition domain, but only the SNOMED concept is standard [4]. Standard concepts serve as the primary basis for all standardized analytics, and users should use them to define events and query the OMOP data. If a user has a specific code in mind (e.g. ICD9CM code 427.31), they need to find the corresponding standard concept (e.g. SNOMED code 49436004) using Athena or Atlas. Also, when converting data to the OMOP CDM format, all source concept codes must be mapped to standard concepts. Ideally, the standard concepts contain the same information as the source codes, so that no information is lost during the OMOP conversion. However, we found several instances of inappropriate mappings from non-standard concepts to standard concepts that may cause information loss or a change in meaning (Supplemental Table S2) and summarized them in Table 2. Fifteen of the 31 influenza vaccine CPT4 codes were mapped to less granular standard concepts, and the majority of the influenza vaccine records in our Truven data were affected.

Table 2. Summary of the inappropriate mappings from non-standard concepts to standard concepts
Vocabulary  Inappropriate Mapping           No. of Source Codes  No. of Occurrences  Percentage of Occurrences
Influenza vaccine
CPT4        map to a less granular concept  15                   46,448,324          65.8%
HCPCS       map to a less granular concept  2                    2,332               0.0%
NDC         different dosage or volume      1                    0                   0.0%
NDC         map to a less granular concept  24                   69                  0.0%
NDC         wrong mapping                   3                    0                   0.0%
RxNorm      map to a less granular concept  38                   0                   0.0%
RxNorm      wrong mapping                   2                    0                   0.0%
Pneumococcal vaccine
NDC         different dosage or volume      1                    0                   0.0%
RxNorm      different dosage or volume      6                    0                   0.0%
RxNorm      map to a more granular concept  1                    0                   0.0%
Shingles vaccine
CPT4        map to a less granular concept  1                    318,001             10.6%
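
The lookup from a source code to its standard concept, described above for atrial fibrillation, can be sketched as a query over the OMOP `concept` and `concept_relationship` tables using the 'Maps to' relationship. The sketch below is a minimal illustration, not a full vocabulary load: it builds a reduced in-memory copy of the two tables with only a subset of their columns, and the `concept_id` values 1 and 2 are placeholders rather than real OMOP identifiers. The function name `standard_concepts_for` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    domain_id TEXT,
    vocabulary_id TEXT,
    concept_code TEXT,
    standard_concept TEXT      -- 'S' for standard, NULL for non-standard
);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER,
    concept_id_2 INTEGER,
    relationship_id TEXT
);
-- concept_id values 1 and 2 are placeholders, not real OMOP identifiers.
INSERT INTO concept VALUES
    (1, 'Atrial fibrillation', 'Condition', 'ICD9CM', '427.31', NULL),
    (2, 'Atrial fibrillation', 'Condition', 'SNOMED', '49436004', 'S');
INSERT INTO concept_relationship VALUES (1, 2, 'Maps to');
""")


def standard_concepts_for(vocabulary_id, concept_code):
    """Follow 'Maps to' from a source code to its standard concept(s)."""
    query = """
        SELECT target.vocabulary_id, target.concept_code, target.domain_id
        FROM concept AS source
        JOIN concept_relationship AS cr
          ON cr.concept_id_1 = source.concept_id
         AND cr.relationship_id = 'Maps to'
        JOIN concept AS target
          ON target.concept_id = cr.concept_id_2
         AND target.standard_concept = 'S'
        WHERE source.vocabulary_id = ? AND source.concept_code = ?
    """
    return conn.execute(query, (vocabulary_id, concept_code)).fetchall()


print(standard_concepts_for("ICD9CM", "427.31"))
```

Returning the target's `domain_id` alongside its code is deliberate: for vaccine codes, the domain of the mapped standard concept determines which CDM table the record lands in, so it is worth inspecting during the same lookup.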

Unclear hierarchy of standard concepts

The hierarchical structure of standard and classification concepts helps users navigate the concept relationships and allows them to query and retrieve concepts more easily. Atlas users can define a concept set using a single high-level concept and then include all its descendant concepts with a single click, so that they do not have to manually search and filter hundreds or even thousands of concepts. For example, by searching "influenza vaccine" in Atlas with Truven data, we found that ATC code J07BB is the top concept sorted by descendant record count (DRC). A user may define a concept set using this concept and expect its descendants to include all concepts related to influenza vaccine. However, the hierarchy is not as complete as a user might expect, so relying on the high-level concept alone could miss many concepts that should be, but are not, its descendants.

First, we took all the standard concepts in Table S1 for each vaccine type and checked whether they shared a common ancestor that could be used as a high-level concept to query all concepts for that vaccine type. The hierarchy cannot cross domains, so we focused on the drug domain for simplicity and because it covers most vaccine-related records. Unfortunately, we did not identify any high-level concept whose descendants cover all relevant standard concepts (Table 3). For example, among influenza vaccine related ancestors, the ATC classification code J07BB, "Influenza vaccines", has the highest number of descendants (825), which is still fewer than the 859 standard concepts for influenza vaccine that we identified in total. If a user defined influenza vaccine events using this ATC code as the high-level concept, 34 concepts would be missed.

Table 3. Top five ancestor concepts shared by all standard concepts in each vaccine type
Ancestor Concept ID  Ancestor Concept Name                                             No. of Standard Concepts Identified¹  No. of Identified Concepts Sharing the Ancestor
Influenza vaccine
                     VACCINES                                                          859                                   825
                     VIRAL VACCINES                                                    859                                   825
                     Influenza vaccines                                                859                                   825
                     ANTIINFECTIVES FOR SYSTEMIC USE                                   859                                   825
                     influenza, inactivated, split virus or surface antigen; systemic  859                                   761
Pneumococcal vaccine
                     VACCINES                                                          119                                   114
                     BACTERIAL VACCINES                                                119                                   114
                     Pneumococcal vaccines                                             119                                   114
                     ANTIINFECTIVES FOR SYSTEMIC USE                                   119                                   114
                     pneumococcus, purified polysaccharides antigen; systemic          119                                   63
Shingles vaccine
                     VACCINES                                                          15                                    13
                     VIRAL VACCINES                                                    15                                    13
                     Varicella zoster vaccines                                         15                                    13
                     ANTIINFECTIVES FOR SYSTEMIC USE                                   15                                    13
                     zoster vaccine recombinant                                        15                                    10

¹ Only including the standard concepts in the drug domain among what we identified in Table S1. Ideally, there should be an ancestor that has all the concepts as descendants in one vaccine type.
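
The coverage check behind Table 3 can be sketched as a set difference between the concepts identified for a vaccine and the descendants of a candidate ancestor, read from (ancestor, descendant) pairs such as those in the OMOP `concept_ancestor` table. The snippet below is a minimal sketch on toy data: all concept IDs are placeholders, and the function name `missing_from_ancestor` is hypothetical.

```python
def missing_from_ancestor(concept_set, ancestor_id, ancestor_pairs):
    """Return the concepts in concept_set that are not descendants of ancestor_id.

    ancestor_pairs is an iterable of (ancestor_concept_id, descendant_concept_id)
    tuples, i.e. two columns of a concept_ancestor-style table.
    """
    descendants = {desc for anc, desc in ancestor_pairs if anc == ancestor_id}
    return set(concept_set) - descendants


# Toy hierarchy with placeholder IDs: concept 100 stands in for a high-level
# class, 101 and 102 are linked to it, and 103 hangs under a different ancestor.
ancestor_pairs = [(100, 100), (100, 101), (100, 102), (200, 200), (200, 103)]
identified_concepts = {101, 102, 103}

missing = missing_from_ancestor(identified_concepts, 100, ancestor_pairs)
print(sorted(missing))
```

Applied to the influenza concepts in Table S1 with J07BB as the ancestor, a check of this kind would return the 34 standard concepts that a descendant-based concept set would miss.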

Furthermore, if the hierarchy accurately represents the relationships among concepts, it is expected that a less granular concept should have relatively more descendants. To check this, we counted the number of descendants of the standard and classification concepts in Table S1. We found some cases where a less granular concept has fewer descendants than a more granular concept (Table 4). For instance, “Influenza, seasonal, injectable, preservative free (40213154)” has 142 descendants, while “Influenza, seasonal, injectable (40213153)”, a broader concept, has fewer descendants (92).

Table 4. Cases where a less granular concept has fewer descendants than a more granular concept
Concept ID  Concept Name                                                                    No. of Descendants
Influenza vaccine
40213154    Influenza, seasonal, injectable, preservative free                              142
40213153    Influenza, seasonal, injectable                                                 92
            influenza virus vaccine, unspecified formulation                                1
            influenza virus vaccine, whole virus                                            1
            Admin of influenza vaccine                                                      1
Pneumococcal vaccine
            pneumococcal polysaccharide vaccine, 23 valent                                  17
            pneumococcal vaccine, unspecified formulation                                   1
Shingles vaccine
            varicella-zoster virus vaccine live (Oka-Merck) strain 29800 UNT/ML [Zostavax]  9
            Zostavax Injectable Product                                                     5
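
The granularity check can be written down as a comparison of descendant counts for pairs of concepts where one is expected to be broader than the other. The minimal sketch below uses the two influenza concepts and counts quoted in the text; which concept of a pair is broader is a judgment made from the concept names, and the function name `inconsistent_pairs` is hypothetical.

```python
# Descendant counts reported in the text for two influenza vaccine concepts.
descendant_count = {
    40213153: 92,   # Influenza, seasonal, injectable (broader)
    40213154: 142,  # Influenza, seasonal, injectable, preservative free (narrower)
}

# (broader, narrower) pairs to check; the ordering is an assumption based on
# the concept names, not something read from the vocabulary tables.
broader_narrower_pairs = [(40213153, 40213154)]


def inconsistent_pairs(pairs, counts):
    """Return the pairs in which the broader concept has fewer descendants."""
    return [(broad, narrow) for broad, narrow in pairs if counts[broad] < counts[narrow]]


flagged = inconsistent_pairs(broader_narrower_pairs, descendant_count)
print(flagged)
```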

As a result of the incomplete hierarchy, users cannot rely on it to easily build high-quality vaccine cohorts.

Complex naming conventions for standard concepts

Many source vocabularies do not have a clear or common naming convention for vaccine concepts. OMOP concept names may contain the vaccine type, the drug brand name, the virus ingredient, or any mixture of these. This makes it difficult to include all relevant concepts by keyword searching and filtering; a few examples are listed below. Furthermore, the unclear hierarchy makes it even harder to properly define a vaccine concept set. Users have to manually verify thousands of concepts, which is error-prone and time-consuming. If the OHDSI community used a single, clear naming convention when selecting and creating standard concepts, vaccine-related concept sets and cohorts could be defined with significantly greater efficiency and quality.

  • Shingles and chickenpox are both caused by the varicella-zoster virus, and both shingles and chickenpox vaccine concepts may contain the string "varicella-zoster virus" in their names, so it is tricky to distinguish the two vaccines. One way is to use the drug brand name, because there are only two approved shingles vaccines in the US, Zostavax and Shingrix. If no drug brand name was available, we filtered on "glycoprotein E" or "recombinant", because Shingrix has a unique ingredient, recombinant varicella zoster virus (VZV) glycoprotein E, that is not shared with chickenpox vaccines. However, since Varilrix, a chickenpox vaccine, and Zostavax, a shingles vaccine, both have live varicella-zoster virus in their ingredients, we had to exclude many concepts. In some vocabularies, e.g. CVX, "varicella" means specifically chickenpox and "zoster" represents shingles, but not all vocabularies adopt this naming convention.
  • The pneumococcal vaccine concept names may include "pneumococcal vaccine", "pneumococcal conjugate vaccine", "pneumococcal polysaccharide vaccine", "Prevnar", "Pneumovax", and "Streptococcus pneumoniae" followed by specific serotypes. Searching on the keyword "pneumococcal vaccine" alone does not retrieve all relevant concepts.
  • In addition to suffering from the same concept name inconsistency issues as pneumococcal and shingles vaccines, influenza vaccine concepts are even more complex to query because there are so many of them and new ones are added every year. For example, searching "influenza" will also retrieve "Haemophilus influenzae" and "parainfluenza" related concepts; some concepts do not even have "influenza" in their names, e.g. 40756945 H1N1 immunization administration; and many concepts whose names contain influenza vaccine information are irrelevant and should be excluded (Supplemental Table S3).
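
The keyword pitfalls listed above for influenza can be illustrated with a first-pass name screen that combines an inclusion pattern with an exclusion pattern. This is a minimal sketch, not the procedure used in this study: the two patterns cover only the examples mentioned in the text, the last two concept names are hypothetical, and any result would still need manual review.

```python
import re

# A name must mention influenza (or H1N1, which lacks the word "influenza") ...
INCLUDE = re.compile(r"influenza|h1n1", re.IGNORECASE)
# ... but "influenza" also occurs inside these unrelated terms.
EXCLUDE = re.compile(r"haemophilus influenzae|parainfluenza", re.IGNORECASE)


def is_candidate_influenza_vaccine(concept_name):
    """First-pass keyword screen; the result still needs manual review."""
    return bool(INCLUDE.search(concept_name)) and not EXCLUDE.search(concept_name)


names = [
    "Influenza, seasonal, injectable",           # concept name from Table 4
    "H1N1 immunization administration",          # concept 40756945, cited above
    "Haemophilus influenzae type b vaccine",     # hypothetical name
    "parainfluenza virus antibody measurement",  # hypothetical name
]
candidates = [name for name in names if is_candidate_influenza_vaccine(name)]
print(candidates)
```

A plain search for "influenza" would have kept the last two names and dropped the H1N1 concept, which is the behavior the bullet above warns about.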

Frequent vocabulary changes

Understandably, the concept and concept relationship tables need frequent updates to include new concepts and improve mapping quality. However, as far as we know, there is no online archive that stores historical versions to enable reproducibility and traceability. This could be a potential hazard for reproducible longitudinal studies across multiple years and network studies when the participants use different vocabulary versions.

Both domain assignment and source to standard concept mapping can change over time, leading to differences in cohort definitions and analysis results depending on when the vocabulary tables were downloaded from Athena. Comparing two versions of the concept tables that we downloaded from Athena on different dates (04/30/2020 versus 09/10/2020), we found that the domain of five influenza vaccine concepts changed from Procedure or Observation to Drug (Table 5). We also found that some non-standard concepts were mapped to different standard concepts in different versions of the concept_relationship table, and that some standard concepts in the older version became non-standard in the later version (Supplemental Table S4). Vaccine-related cohort definitions created using one version of the concept tables may be inadequate if used on an OMOP CDM created using a different version of the OMOP vocabulary tables. This could affect network studies in which participants use the same cohort definition but their OMOP data were created using different versions of the OMOP vocabularies.

Table 5. Domain changes of influenza vaccine concepts and their occurrences in Truven data
Concept ID  Domain  Previous Domain  Concept Name                                                                                       Vocabulary  Concept Code  No. of Occurrences
            Drug    Procedure        Influenza virus vaccine, pandemic formulation, H1N1                                                CPT4        90663         360
            Drug    Procedure        Immunization, active; influenza virus vaccine. (Deprecated)                                        CPT4        90724         357
            Drug    Observation      Admin of influenza vaccine                                                                         HCPCS       Q0034         0
            Drug    Procedure        H1N1 immunization administration (intramuscular, intranasal), including counseling when performed  CPT4        90470         1,139
            Drug    Procedure        Influenza virus vaccine, whole virus, for intramuscular or jet injection use (Deprecated)          CPT4        90659         301
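
A comparison of two vocabulary snapshots like the one summarized in Table 5 can be sketched as a dictionary diff keyed on vocabulary and concept code. The snippet below is a minimal illustration under stated assumptions: each snapshot is reduced to a mapping from (vocabulary_id, concept_code) to domain_id, the first three entries follow Table 5, and the fourth is a made-up unchanged concept included for contrast. The function name `domain_changes` is hypothetical.

```python
def domain_changes(old_domains, new_domains):
    """Return {key: (old_domain, new_domain)} for concepts whose domain differs.

    Only keys present in both snapshots are compared.
    """
    changes = {}
    for key, old_domain in old_domains.items():
        new_domain = new_domains.get(key)
        if new_domain is not None and new_domain != old_domain:
            changes[key] = (old_domain, new_domain)
    return changes


# Keys are (vocabulary_id, concept_code); values are domain_id.
snapshot_2020_04_30 = {
    ("CPT4", "90663"): "Procedure",
    ("CPT4", "90470"): "Procedure",
    ("HCPCS", "Q0034"): "Observation",
    ("EXAMPLE", "X0001"): "Drug",  # made-up concept that does not change
}
snapshot_2020_09_10 = {
    ("CPT4", "90663"): "Drug",
    ("CPT4", "90470"): "Drug",
    ("HCPCS", "Q0034"): "Drug",
    ("EXAMPLE", "X0001"): "Drug",
}

changes = domain_changes(snapshot_2020_04_30, snapshot_2020_09_10)
for (vocabulary, code), (old, new) in sorted(changes.items()):
    print(f"{vocabulary} {code}: {old} -> {new}")
```

Keying on vocabulary and concept code rather than on names keeps the comparison stable when a concept is renamed between versions; the same pattern could be applied to the standard_concept flag to detect concepts that lost their standard status.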

As changes accumulate over time, the reproducibility and compatibility of analyses can be significantly impacted. One of our internal Truven data sets (01/01/1996-09/30/2018) was converted to OMOP CDM using a copy of the OMOP vocabulary previously downloaded from Athena in 2019. We calculated the number and percentage of vaccine records in this dataset affected by changes in source to standard concept mapping between the 2019 and 2020 vocabulary versions (Table 6). We found that 89.2% of influenza and 89.5% of pneumococcal vaccine records no longer contained standard concepts, which means if a user defines the cohort using the most up-to-date vocabulary, the majority of influenza and pneumococcal vaccine records will be excluded.

Table 6. Number and percentage of records affected by the source to standard mapping changes in Truven data with 2019 OMOP version
Vocabulary  Standard Concept Changed?  No. of Source Codes  No. of Occurrences  Percentage of Occurrences
Influenza vaccine
CPT4        Yes                        24                   73,492,151          88.0%
CPT4        No                         2                    61,254              0.1%
HCPCS       Yes                        1                    675,791             0.8%
HCPCS       No                         1                    1,389,722           1.7%
NDC         Yes                        29                   329,507             0.4%
NDC         No                         265                  7,554,915           9.0%
Pneumococcal vaccine
CPT4        Yes                        3                    24,018,150          89.3%
CPT4        No                         1                    2,626,536           9.8%
HCPCS       Yes                        1                    3,496               0.0%
NDC         Yes                        4                    38,825              0.1%
NDC         No                         11                   206,652             0.8%
Shingles vaccine
NDC         No                         7                    1,147,974           100.0%
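
The 89.2% and 89.5% figures quoted above follow directly from the occurrence counts in Table 6. The short sketch below reproduces the arithmetic; the counts are copied from Table 6, and the names `TABLE6` and `share_changed` are illustrative.

```python
# (vocabulary, standard concept changed?, occurrences), copied from Table 6.
TABLE6 = {
    "Influenza vaccine": [
        ("CPT4", True, 73_492_151), ("CPT4", False, 61_254),
        ("HCPCS", True, 675_791), ("HCPCS", False, 1_389_722),
        ("NDC", True, 329_507), ("NDC", False, 7_554_915),
    ],
    "Pneumococcal vaccine": [
        ("CPT4", True, 24_018_150), ("CPT4", False, 2_626_536),
        ("HCPCS", True, 3_496),
        ("NDC", True, 38_825), ("NDC", False, 206_652),
    ],
    "Shingles vaccine": [("NDC", False, 1_147_974)],
}


def share_changed(rows):
    """Fraction of occurrences whose source code maps to a changed standard concept."""
    total = sum(n for _vocab, _changed, n in rows)
    changed = sum(n for _vocab, is_changed, n in rows if is_changed)
    return changed / total


affected = {vaccine: share_changed(rows) for vaccine, rows in TABLE6.items()}
for vaccine, share in affected.items():
    print(f"{vaccine}: {share:.1%}")
```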

The vocabulary changes significantly increase the work required for quality control and for building analytic workflows with backward and forward compatibility. If an institute wants to update the vocabularies, it must not only update the vocabulary tables but also redo the OMOP conversion for all of its data sets to keep everything consistent. Additionally, cohort and concept set definitions may need to be changed as a result of changes in source to standard concept mappings, as we observed with influenza and pneumococcal vaccinations. This burden may discourage many institutes from adopting the OMOP CDM, keeping the data and vocabulary up-to-date, or relying on standard concepts in analyses.

Discussion

The complexity of vaccine concepts, especially the seasonal influenza virus vaccine, requires a revisit of the concept mappings. Epidemiologists in the OHDSI community could work together to further discuss and define some recommended concept sets for each vaccine, improve concept relationship mapping, monitor the impact of concept status and relationship changes, provide additional training to users, and document the limitations of the current standard concepts for reliable and reproducible observational studies.

References

  1. The value of vaccines. Nature Reviews Microbiology. 2008;6(1):2.
  2. Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc. 2012;19(1):54-60.
  3. Boland MR, Tatonetti NP. Are all vaccines created equal? Using electronic health records to discover vaccines associated with clinician-coded adverse events. AMIA Summits on Translational Science Proceedings. 2015;2015:196.
  4. The Book of OHDSI. Observational Health Data Sciences and Informatics.

Supplemental Tables

Table S1: Vaccine concepts we found in the selected vocabularies
Table S2: Inappropriate mappings from non-standard vaccine concepts to standard concepts
Table S3: Irrelevant concepts that contain influenza vaccine information in their names
Table S4: Standard concept mapping changes of the vaccine concepts