ETL creation best practices

This document describes some of the best practices we have developed over the years when trying to create an ETL (Extract, Transform, Load) process to convert data into the OMOP Common Data Model (CDM). We have found it best to split the process into four distinct activities:

  1. Data experts and CDM experts together design the ETL
  2. People with medical knowledge create the code mappings
  3. A technical person implements the ETL
  4. All are involved in quality control

1. Data experts and CDM experts together design the ETL

Designing the ETL requires in-depth knowledge of the source data, but it also requires knowledge of the CDM, and having someone with experience in past ETLs to the OMOP CDM can speed up the design activity. Ideally, the data and CDM experts should sit down together at the same location in a one- or two-day session.

We have developed two tools that have proven to be helpful for this activity: White Rabbit and Rabbit-in-a-Hat.

White Rabbit

Often the documentation of the source data is not detailed enough to fully design the ETL, and on many occasions the documentation has even been found to be inconsistent with the real data! To make sure we have the right starting point, White Rabbit can create a scan of the source data. This scan contains the following information:

  1. The list of tables in the source database
  2. The list of fields per table
  3. The list of distinct values found in a field (unique or infrequently occurring values are not included)
  4. The frequency with which a value occurs

The lists of tables and fields give insight into the structure of the data, whereas the values and their frequencies can help identify how information is recorded (e.g. whether gender is encoded as ‘m’ and ‘f’ or ‘1’ and ‘2’, or whether ICD-9 codes are stored with or without periods), including any possible special values (e.g. ‘999’, which could be used to indicate missing numbers).
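
To give a feel for what such a scan is based on, the queries below are a minimal sketch of the kind of profiling that White Rabbit automates across all tables and fields. The table name ‘patient’ and column name ‘gender’ are hypothetical, the catalog query syntax varies per database system, and the real tool additionally suppresses infrequent values to protect privacy.

  -- Sketch only: the kind of profiling a scan report is built from.
  -- 'patient' and 'gender' are hypothetical source names.

  -- List of tables and fields (information_schema syntax varies per DBMS)
  SELECT table_name, column_name
  FROM information_schema.columns
  WHERE table_schema = 'source_db';

  -- Distinct values and their frequencies for one field
  SELECT gender, COUNT(*) AS frequency
  FROM patient
  GROUP BY gender
  ORDER BY frequency DESC;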

Rabbit-in-a-Hat

When the data and CDM experts sit down together, they would like to have the information on both the source data and the CDM at their fingertips. This is why Rabbit-in-a-Hat was developed. This tool is designed to be projected on a screen during an interactive session, and shows both the source data (as derived from the White Rabbit scan) and the CDM. It allows connections to be drawn between tables and fields, and comments to be added about the logic of the transformations.

We have found it best to first define the relationships between the tables in the source data and the CDM, and only later go back to define the relationships between fields. Another best practice is to record not only the final design decisions, but also the reasoning that led to them. Otherwise, the discussions may later have been forgotten and have to be repeated.

At the end of the session, when the design in Rabbit-in-a-Hat is complete, a Word document is automatically generated that follows the OMOP template for ETL documentation. This document should contain sufficient detail to be the full specifications for implementing the ETL.

2. People with medical knowledge create the code mappings

Often the source data uses coding systems that are not in the OMOP Vocabulary. In this case, a mapping has to be created from the source coding system to one of the appropriate OMOP standard vocabularies (for example, for drugs the OMOP standard vocabulary is RxNorm, and for conditions it is SNOMED).

Mapping can be a daunting task, especially when there are many codes in the source coding system. There are several things that can be done to make the task easier:

  • Focus on the most frequently used codes. A code that is never used, or used only once or twice, is not worth the effort of mapping, since it will rarely, if ever, affect a real study.
  • Make use of existing information whenever possible. For example, many national drug coding systems have been mapped to ATC. Although ATC is not detailed enough for many purposes, the concept relationships between ATC and RxNorm can be used to make good guesses at the right RxNorm codes (a sketch follows below this list).
  • Use Usagi.
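
As an illustration of the ATC shortcut mentioned above, the query below sketches how the OMOP vocabulary tables could be used to list RxNorm candidates for one ATC class. The relationship_id value is an assumption that may differ between vocabulary versions, so verify it against the CONCEPT_RELATIONSHIP table in your vocabulary download.

  -- Sketch: find RxNorm concepts related to a given ATC class.
  -- The relationship_id 'ATC - RxNorm' is an assumption; check your
  -- vocabulary version for the exact value.
  SELECT rx.concept_id, rx.concept_name
  FROM concept atc
  JOIN concept_relationship cr
    ON cr.concept_id_1 = atc.concept_id
   AND cr.relationship_id = 'ATC - RxNorm'
  JOIN concept rx
    ON rx.concept_id = cr.concept_id_2
   AND rx.vocabulary_id = 'RxNorm'
  WHERE atc.vocabulary_id = 'ATC'
    AND atc.concept_code = 'N02BE01';   -- example: paracetamol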

Usagi is a tool to aid the manual process of creating a code mapping. It can suggest mappings based on the textual similarity of code descriptions. If the source codes are only available in a foreign language, we have found that Google Translate often gives surprisingly good translations of the terms into English. Usagi allows the user to search for the appropriate target concept if the automated suggestion is not correct. Finally, the user can indicate which mappings are approved for use in the ETL.
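
Once mappings are approved, they still have to find their way into the ETL. One common pattern, sketched below, is to load the approved mappings into the CDM's SOURCE_TO_CONCEPT_MAP table (Usagi can export in this format) and join the source codes against it during the load. The source table and vocabulary name used here are hypothetical.

  -- Sketch: translate local drug codes to standard concepts using an
  -- approved mapping loaded into SOURCE_TO_CONCEPT_MAP.
  -- 'source_prescriptions' and 'LOCAL_DRUGS' are hypothetical names.
  SELECT s.patient_id,
         s.local_drug_code,
         m.target_concept_id        -- standard concept to store in the CDM
  FROM source_prescriptions s
  JOIN source_to_concept_map m
    ON m.source_code = s.local_drug_code
   AND m.source_vocabulary_id = 'LOCAL_DRUGS'
   AND m.invalid_reason IS NULL;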

3. A technical person implements the ETL

Once the design and code mappings are completed, the ETL process can be implemented in a piece of software. The choice of technology to implement the ETL depends greatly on the expertise available at the site. We have seen people use SQL, SAS, C#, Java, and Kettle. All have their advantages and disadvantages, and none are usable if there is nobody at the site who is familiar with the technology.

(Note: after several independent attempts, we have given up on developing the ‘ultimate’ user-friendly ETL tool. Such tools invariably work well for 80% of the ETL, but for the remaining 20% some low-level code has to be written that is specific to the source database.)
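
To make the flavour of an implementation concrete, here is a small, purely illustrative SQL fragment that populates the CDM person table from a hypothetical source table. The source-side column names and the ‘m’/‘f’ coding are assumptions; a real ETL consists of many such statements plus all the logic captured in the design document.

  -- Illustrative only: load the CDM person table from a hypothetical
  -- source table 'patient'; source column names are assumptions.
  INSERT INTO person (person_id, gender_concept_id, year_of_birth,
                      race_concept_id, ethnicity_concept_id,
                      person_source_value, gender_source_value)
  SELECT p.patient_id,
         CASE p.sex WHEN 'm' THEN 8507   -- standard concept: male
                    WHEN 'f' THEN 8532   -- standard concept: female
                    ELSE 0 END,          -- 0 = no matching concept
         EXTRACT(YEAR FROM p.date_of_birth),
         0,                              -- race unknown in this sketch
         0,                              -- ethnicity unknown in this sketch
         p.patient_id,
         p.sex
  FROM patient p;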

4. All are involved in quality control

We can only use the converted data if we are confident about the quality of the ETL. We have found that there are several ways to evaluate the quality of the data in the CDM instance. Interestingly, on more than one occasion, the issues we identified were not due to the ETL, but were already present in the source data!

We recommend at least the following steps in quality control (in no particular order):

  • Review of the ETL design document, computer code, and code mappings. Any one person can make mistakes, so at least one other person should always review what the first person did.
  • Manually compare all information on a sample of persons in the source and target data.
  • Compare overall counts in the source and target data (a sketch follows after this list).
  • Replicate a study that has already been performed on the source data on the CDM version.
  • Use Achilles.
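
As an example of the counts comparison mentioned above, queries along the lines of the sketch below can be run on both sides. The source table name is hypothetical, and the counts will often differ slightly because the ETL typically excludes some records (for example persons without a valid birth date); the point is that every difference should be explainable.

  -- Sketch: compare person counts between source and CDM.
  -- 'patient' is a hypothetical source table name.
  SELECT COUNT(*) AS source_persons FROM patient;
  SELECT COUNT(*) AS cdm_persons    FROM person;

  -- The same idea applies per domain, e.g. source diagnoses vs. rows in
  -- condition_occurrence.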

Achilles is a tool for exploring descriptive statistics such as patient demographics and the prevalence of conditions, drugs and procedures. It allows one to easily spot a wide range of data-quality issues, for example a weird age distribution that is the result of erroneous parsing of the date-of-birth, or time periods where a particular type of data is missing.

One part of Achilles is Achilles Heel, a report that lists errors (data quality issues that we believe are so serious they have to be resolved) and warnings (potential data quality issues that at least need to be investigated and explained). The errors and warnings come from a long list of pre-programmed checks, for example testing whether people have events before birth or after death, whether concept identifiers refer to existing concepts, etc.
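
The sketch below shows the general shape of one such check (condition records that start before the person's year of birth), using the standard CDM person and condition_occurrence tables. It is not the actual Achilles Heel code, merely an illustration of the idea; a non-zero count would warrant investigation.

  -- Illustration of a check in the spirit of Achilles Heel: count
  -- condition records that start before the person's year of birth.
  SELECT COUNT(*) AS conditions_before_birth
  FROM condition_occurrence co
  JOIN person p
    ON p.person_id = co.person_id
  WHERE EXTRACT(YEAR FROM co.condition_start_date) < p.year_of_birth;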
