This document describes some of the best practices we have developed over the years when trying to create an ETL (Extract, Transform, Load) process to convert data into the OMOP Common Data Model (CDM). We have found it best to split the process into four distinct activities:

1. Data experts and CDM experts together design the ETL.
2. Code mappings from source codes to the OMOP Vocabulary are created.
3. A technical person implements the ETL.
4. Everyone is involved in quality control.
Designing the ETL requires in-depth knowledge of the source data, but it also requires knowledge of the CDM, and having someone with experience in past ETLs to the OMOP CDM can speed up the design activity. Ideally, the data and CDM experts should sit down together at the same location in a one- or two-day session.
Often the documentation of the source data is not detailed enough to fully design the ETL, and on many occasions the documentation has even been found to be inconsistent with the real data! To make sure we have the right starting point, White Rabbit can create a scan of the source data. This scan contains the following information:
The list of tables and fields gives insight into the structure of the data, whereas the values and their frequencies can help identify how information is recorded (e.g. whether gender is encoded as ‘m’ and ‘f’ or ‘1’ and ‘2’, or whether ICD-9 codes are stored with or without periods), including any possible special values (e.g. ‘999’, which could be used to identify missing numbers).
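The core of such a scan is a frequency tally per field. As a minimal sketch (assuming the source records are available as a list of dicts; this is an illustration, not White Rabbit's actual implementation):

```python
from collections import Counter

def scan_field(rows, field):
    """Tally the distinct values of one field, as a scan report would.

    `rows` is assumed to be a list of dicts, one per source record.
    """
    counts = Counter(row.get(field) for row in rows)
    # Most frequent values first, so oddities like '999' stand out.
    return counts.most_common()

# Hypothetical source records:
rows = [{"gender": "m"}, {"gender": "f"}, {"gender": "m"}, {"gender": "999"}]
print(scan_field(rows, "gender"))  # [('m', 2), ('f', 1), ('999', 1)]
```

Sorting by descending frequency is what makes rare special values (like the ‘999’ above) easy to spot at the bottom of the list.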
When the data and CDM experts sit down together, they would like to have the information on both the source data and the CDM at their fingertips. This is why Rabbit-in-a-Hat was developed. This tool is designed to be projected on a screen during an interactive session, and shows both the source data (as derived from the White Rabbit scan) and the CDM. It allows connections to be drawn between tables and fields, and comments to be added regarding the logic of the transformations.
We have found it best to first define the relationships between the tables in the source data and the CDM, and only later go back to define the relationships between fields. Another best practice is to record not only the final design decisions that were made, but also the reasoning used to come to those decisions. Otherwise, the discussions may later be forgotten and have to be repeated.
At the end of the session, when the design in Rabbit-in-a-Hat is complete, a Word document is automatically generated that follows the OMOP template for ETL documentation. This document should contain sufficient detail to be the full specifications for implementing the ETL.
Often the source data uses coding systems that are not in the OMOP Vocabulary. In this case, a mapping has to be created from the source coding system to one of the appropriate OMOP standard vocabularies (for example, for drugs the OMOP standard vocabulary is RxNorm, and for conditions it is SNOMED).
Mapping can be a daunting task, especially when there are many codes in the source coding system. There are several things that can be done to make the task easier:
Usagi is a tool to aid the manual process of creating a code mapping. It can make suggested mappings based on textual similarity of code descriptions. If the source codes are only available in a foreign language, we have found that Google Translate often gives surprisingly good translations of the terms into English. Usagi allows the user to search for the appropriate target concepts if the automated suggestion is not correct. Finally, the user can indicate which mappings are approved to be used in the ETL.
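To illustrate the idea of similarity-based suggestions, here is a toy stand-in using Python's built-in `difflib` (Usagi's actual scoring is more sophisticated; the concept ids and names below are merely examples):

```python
from difflib import SequenceMatcher

def suggest_mapping(source_term, target_concepts):
    """Rank candidate target concepts by textual similarity to a source term.

    `target_concepts` is a hypothetical dict of concept_id -> concept_name.
    Returns (score, concept_id, concept_name) tuples, best match first.
    """
    scored = [
        (SequenceMatcher(None, source_term.lower(), name.lower()).ratio(), cid, name)
        for cid, name in target_concepts.items()
    ]
    return sorted(scored, reverse=True)

concepts = {316866: "Hypertensive disorder", 201820: "Diabetes mellitus"}
best = suggest_mapping("hypertension", concepts)[0]
print(best)  # the hypertension concept ranks first
```

A human reviewer still has to approve each suggestion; string similarity alone cannot distinguish clinically different concepts with similar names.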
Once the design and code mappings are completed, the ETL process can be implemented in a piece of software. The choice of technology to implement the ETL depends greatly on the expertise available at the site. We have seen people use SQL, SAS, C#, Java, and Kettle. All have their advantages and disadvantages, and none are usable if there is nobody at the site who is familiar with the technology.
(Note: after several independent attempts, we have given up on developing the ‘ultimate’ user-friendly ETL tool. Such tools invariably work well for 80% of the ETL, but the remaining 20% requires low-level code specific to the source database.)
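Whatever the technology, the heart of the implementation is translating source values into standard CDM concepts. A minimal sketch of one such step, mapping a source gender code into the OMOP `gender_concept_id` (8507 and 8532 are the standard OMOP concepts for MALE and FEMALE; the source codes ‘m’/‘f’ and field names are assumptions about a hypothetical source):

```python
# Standard OMOP concept ids for gender.
GENDER_MAP = {"m": 8507, "f": 8532}  # MALE, FEMALE

def to_person(source_row):
    """Transform one hypothetical source record into a CDM PERSON-like dict."""
    return {
        "person_id": source_row["patient_id"],
        # Fall back to 0 ("No matching concept") for unmapped codes.
        "gender_concept_id": GENDER_MAP.get(source_row["gender"], 0),
        "year_of_birth": int(source_row["dob"][:4]),
    }

print(to_person({"patient_id": 1, "gender": "f", "dob": "1975-03-02"}))
```

The same pattern, a lookup table produced by the code-mapping activity plus a fallback for unmapped values, recurs throughout a typical ETL, whether it is written in SQL, Java, or anything else.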
We can only use the converted data if we are confident about the quality of the ETL. We have found that there are several ways to evaluate the quality of the data in the CDM instance. Interestingly, on more than one occasion, the issues we identified were not due to the ETL, but were already present in the source data!
We recommend at least the following steps in quality control (in no particular order):
Achilles is a tool for exploring descriptive statistics such as patient demographics and the prevalence of conditions, drugs and procedures. It allows one to easily spot a wide range of data-quality issues, for example a weird age distribution that is the result of erroneous parsing of the date-of-birth, or time periods where a particular type of data is missing.
One part of Achilles is Achilles Heel, a report that lists errors (data quality issues we believe are so serious that they have to be resolved) and warnings (potential data quality issues that at least need to be investigated and explained). The errors and warnings come from a long list of pre-programmed checks, for example testing whether people have events before birth or after death, whether concept identifiers refer to existing concepts, etc.
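A check like "events before birth or after death" can be sketched in a few lines. This is an illustration of the kind of rule Heel applies, not the real Achilles code; the record layout is an assumption:

```python
from datetime import date

def check_event_dates(person, events):
    """Flag events that fall outside a person's lifespan (a Heel-style check).

    `person` and `events` are hypothetical dicts, not the real Achilles API.
    """
    problems = []
    for e in events:
        if e["date"] < person["birth_date"]:
            problems.append((e["id"], "event before birth"))
        if person.get("death_date") and e["date"] > person["death_date"]:
            problems.append((e["id"], "event after death"))
    return problems

person = {"birth_date": date(1980, 1, 1), "death_date": date(2020, 6, 1)}
events = [
    {"id": 1, "date": date(1979, 12, 31)},  # before birth -> error
    {"id": 2, "date": date(2000, 5, 5)},    # within lifespan -> fine
    {"id": 3, "date": date(2021, 1, 1)},    # after death -> error
]
print(check_event_dates(person, events))
```

In practice such checks run as set-based queries over the whole CDM instance, and, as noted above, a flagged record may reveal a problem in the source data rather than in the ETL itself.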