User Tools

Site Tools


documentation:vocabulary:data_etl

This is an old revision of the document!


Data ETL

The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data:

  • Retrieve the source code from the record.
  • Find the Concept that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the CONCEPT table. Sometimes, the source code needs to be manipulated to find the right match. For example, ICD-9-CM codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or NDC codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each vocabulary for specific recommendations of the lookup process.
  • Map to a Standard Concept (standard_concept='S') by retrieving all the active (invalid_reason field should be NULL) records from the CONCEPT_RELATIONSHIP table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='Maps to'. If the source code is a local code and you have a SOURCE_TO_CONCEPT_MAP table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three.
  • Write a record in the corresponding CDM table for each destination Standard Concept based on the content of the domain_id field. Place the ID of the Standard Concept into the <domain>_concept_id, the Source Concept into the <domain>_source_concept_id and the source code into the <domain>_source_value field. Most data tables require a start_date and a Type Concept, and some have more fields that need consideration. See the details of each table in the CDM Specifications. The corresponding table/field combination for each Domain is as follows:
domain_idCDM tableFieldComment
GenericAnyAnyGeneric Concepts can be in any field that ends in concept_id.
GenderPERSONgender_concept_id
RacePERSONrace_concept_id
EthnicityPERSONethnicity_concept_id
VisitVISIT_OCCURRENCEvisit_concept_id
ProcedurePROCEDURE_OCCURRENCEprocedure_concept_id
ModifierPROCEDURE_OCCURRENCEmodifier_concept_id
DrugDRUG_EXPOSUREdrug_concept_id
RouteDRUG_EXPOSUREroute_concept_id
UnitMEASUREMENT or OBSERVATION or SPECIMENunit_concept_idUnits are used in different contexts. *
DeviceDEVICE_EXPOSUREdevice_concept_id
ConditionCONDITION_OCCURRENCEcondition_concept_id
MeasurementMEASUREMENTmeasurement_concept_id
Meas Value OperatorMEASUREMENToperator_concept_id
Meas ValueMEASUREMENTvalue_as_concept_id
ObservationOBSERVATIONobservation_concept_id
RelationshipFACT_RELATIONSHIPrelationship_concept_id
Place of ServiceCARE_SITEplace_of_service_concept_id
Provider SpecialtyPROVIDERspecialty_concept_id
CurrencyVISIT_COST or PROCEDURE_COST or DRUG_COST or DEVICE_COSTcurrency_concept_idCurrency values appear in any of the *_COST tables. *
Revenue CodePROCEDURE_COSTrevenue_code_concept_id
SpecimenSPECIMENspecimen_concept_id
Spec Anatomic SiteSPECIMEN or MEASUREMENT or OBSERVATIONanatomic_site_concept_id or value_as_concept_id or value_as_concept_idAnatomical Site Concepts are used to characterize the origin of a Specimen, but also the result of a Measurement or Observation. *
Spec Disease StatusSPECIMENdisease_status_concept_id
  • For some Source Concepts there is no mapping to a Standard Concept (there is no record in the CONCEPT_RELATIONSHIP or SOURCE_TO_CONCEPT_MAP tables), usually because the Source Concept is too generic or otherwise ill-defined. In these cases, the domain_id field of the Source Concept itself should be used to place the record in the right table. Some Source Concepts have combination Domains, such as “Device/Procedure”. In these cases, write a record into each of the combination Domain table (in this case to the DEVICE_OCCURRENCE and PROCEDURE_OCCURRENCE tables). As the Standard Concept write Concept 0 (concept_id=0) into the respective *_concept_id field.

If the Domain of the destination Concept is “Domain” or “Metadata”, an error has occurred in the construction of the mapping table. Please report in the CDM-Builder forum if you believe there is such an error in the Standardized Vocabularies data.

The same is true if you find a Type Concept Domain, like “Obs Period Type”, “Death Type”, “Visit Type”, “Procedure Type”, etc.These Type Concepts are valid concepts and have to be placed into the respective *_type_concept_id field of the respective CDM tables. However, they should never be introduced as part of a mapping process, as there is no equivalent information in the source data, because they denote the origin of the record. Type Concepts are hard-wired in the ETL process, because it depends on the structure of the source data how to assign Type Concepts to records. Please also report if you find a situation like that.

documentation/vocabulary/data_etl.1422104234.txt.gz · Last modified: 2015/01/24 12:57 by cgreich