User Tools

Site Tools


documentation:vocabulary:data_etl

Data ETL

The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data:

  • Retrieve the source code from the record.
  • Find the Concept that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the CONCEPT table. Sometimes, the source code needs to be manipulated to find the right match. For example, ICD-9-CM codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or NDC codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each Vocabulary for specific recommendations of the lookup process.
  • Map to a Standard Concept (standard_concept='S') by retrieving all the active (invalid_reason field should be NULL) records from the CONCEPT_RELATIONSHIP table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='Maps to'. If the source code is a local code and you have a SOURCE_TO_CONCEPT_MAP table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three.
  • Write a record in the corresponding CDM table for each destination Standard Concept based on the content of the domain_id field. Place the ID of the Standard Concept into the <domain>_concept_id, the Source Concept into the <domain>_source_concept_id and the source code into the <domain>_source_value field. Most data tables require a start_date and a Type Concept, and some have more fields that need consideration. See the details of each table in the CDM Specifications. The corresponding table/field combination for each Domain is as follows:
domain_idCDM tableFieldComment
GenericAnyAnyGeneric Concepts can be in any field that ends in concept_id.
GenderPERSONgender_concept_id
RacePERSONrace_concept_id
EthnicityPERSONethnicity_concept_id
VisitVISIT_OCCURRENCEvisit_concept_id
ProcedurePROCEDURE_OCCURRENCEprocedure_concept_id
ModifierPROCEDURE_OCCURRENCEmodifier_concept_id
DrugDRUG_EXPOSUREdrug_concept_id
RouteDRUG_EXPOSUREroute_concept_id
UnitMEASUREMENT or OBSERVATION or SPECIMENunit_concept_idUnits are used in different contexts. *
DeviceDEVICE_EXPOSUREdevice_concept_id
ConditionCONDITION_OCCURRENCEcondition_concept_id
MeasurementMEASUREMENTmeasurement_concept_id
Meas Value OperatorMEASUREMENToperator_concept_id
Meas ValueMEASUREMENTvalue_as_concept_id
ObservationOBSERVATIONobservation_concept_id
RelationshipFACT_RELATIONSHIPrelationship_concept_id
Place of ServiceCARE_SITEplace_of_service_concept_id
Provider SpecialtyPROVIDERspecialty_concept_id
CurrencyMEASUREMENT or OBSERVATION or SPECIMENcurrency_concept_idCurrency values appear in any of the *_COST tables. *
Revenue CodePROCEDURE_COSTrevenue_code_concept_id
SpecimenSPECIMENspecimen_concept_id
Spec Anatomic SiteSPECIMEN or MEASUREMENT or OBSERVATIONanatomic_site_concept_id or value_as_concept_id or value_as_concept_idAnatomical Site Concepts are used to characterize the origin of a Specimen, but also the result of a Measurement or Observation. *
Spec Disease StatusSPECIMENdisease_status_concept_id

* If there is more than one potential destination table the ETL needs to identify the context in which a Standard Concept is used, and select the right table from this table.

  • For some Source Concepts there is no mapping to a Standard Concept (there is no record in the CONCEPT_RELATIONSHIP or SOURCE_TO_CONCEPT_MAP tables), usually because the Source Concept is too generic or otherwise ill-defined. In these cases, the domain_id field of the Source Concept itself should be used to place the record in the right table. Some Source Concepts have combination Domains, such as “Device/Procedure”. In these cases, write a record into each of the combination Domain table (in this case to the DEVICE_OCCURRENCE and PROCEDURE_OCCURRENCE tables). As the Standard Concept write Concept 0 (concept_id=0) into the respective <domain>_concept_id field.

If in the mapping process a Concept of the “Domain” or “Metadata” Domains are retrieved, an error has snuck into the mapping table. Please report those cases in the CDM-Builder.

The same is true if a Type Concept Domain, like “Obs Period Type”, “Death Type”, “Visit Type”, “Procedure Type”, etc. is produced. Though these Type Concepts are valid concepts and have to be placed into the <domain>_type_concept_id field of the respective CDM tables, they cannot be the result of the mapping process. They denote the origin of the record and the selection of Type Concepts should be hard-wired into the ETL process.

documentation/vocabulary/data_etl.txt · Last modified: 2015/01/24 14:23 by cgreich