User Tools

Site Tools


documentation:vocabulary:data_etl

This is an old revision of the document!


Data ETL

The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data:

  • Retrieve the source code from the record.
  • Find the Concept that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the CONCEPT table. Sometimes, the source code needs to be manipulated to find the right match. For example, ICD-9-CM codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or NDC codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each vocabulary for specific recommendations of the lookup process.
  • Once the Concept corresponding to the source code is established, map it to a Standard Concept (standard_concept='S') by retrieving all the active (invalid_reason field should be NULL) records from the CONCEPT_RELATIONSHIP table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='Maps to'. If the source code is a local code and you have a SOURCE_TO_CONCEPT_MAP table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three.
  • For each destination Standard Concepts, determine the content of the domain_id field and write a record in the corresponding CDM table. Most Standardized Clinical Data Tables require a start_date and a Type Concept (see below). Some have more fields that need be filled or left empty. See the details of each table in the CDM Specifications. The corresponding table/field combination for each Domain is as follows:

^domain_id^CDM table^Field^Comment^

GenericAnyAnyGeneric Concepts can be in any field that ends in concept_id.
GenderPERSONgender_concept_id
RacePERSONrace_concept_id
EthnicityPERSONethnicity_concept_id
VisitVISIT_OCCURRENCEvisit_concept_id
ProcedurePROCEDURE_OCCURRENCEprocedure_concept_id
ModifierPROCEDURE_OCCURRENCEmodifier_concept_id
DrugDRUG_EXPOSUREdrug_concept_id
RouteDRUG_EXPOSUREroute_concept_id
UnitMEASUREMENT or OBSERVATION or SPECIMENunit_concept_idUnits are used in different contexts. *
DeviceDEVICE_EXPOSUREdevice_concept_id
ConditionCONDITION_OCCURRENCEcondition_concept_id
MeasurementMEASUREMENTmeasurement_concept_id
Meas Value OperatorMEASUREMENToperator_concept_id
Meas ValueMEASUREMENTvalue_as_concept_id
ObservationOBSERVATIONobservation_concept_id
RelationshipFACT_RELATIONSHIPrelationship_concept_id
Place of ServiceCARE_SITEplace_of_service_concept_id
Provider SpecialtyPROVIDERspecialty_concept_id
CurrencyVISIT_COST or PROCEDURE_COST or DRUG_COST or DEVICE_COSTcurrency_concept_idCurrency values could appear in any of the *_COST tables. *
Revenue CodePROCEDURE_COSTrevenue_code_concept_id
SpecimenSPECIMENspecimen_concept_id
Spec Anatomic SiteSPECIMEN or MEASUREMENT or OBSERVATIONanatomic_site_concept_id or value_as_concept_id or value_as_concept_idAnatomical site Concepts can be used to characterize the origin of a Specimen, or the result of a Measurement or Observation. *
Spec Disease StatusSPECIMENdisease_status_concept_id
  • For some Source Concepts, the Standardized Vocabularies do not provide a mapping to a Standard Concept (there is no record in the CONCEPT_RELATIONSHIP or SOURCE_TO_CONCEPT_MAP tables), usually because the Source Concept is too generic or otherwise ill-defined. In these cases, the domain_id field of the Source Concept itself should be used to place the record in the right table. Some Source Concepts have combination Domains, such as “Device/Procedure”. In these cases, write a record into each of the combination Domain table (in this case to the DEVICE_OCCURRENCE and PROCEDURE_OCCURRENCE tables). As the Standard Concept write Concept 0 (concept_id=0) into the respective *_concept_id field.

If the Domain of the destination Concept is “Domain” or “Metadata”, an error has occurred in the construction of the mapping table. Please report in the CDM-Builder forum if you believe there is such an error in the Standardized Vocabularies data.

The same is true if you find a Type Concept Domain, like “Obs Period Type”, “Death Type”, “Visit Type”, “Procedure Type”, etc.These Type Concepts are valid concepts and have to be placed into the respective *_type_concept_id field of the respective CDM tables. However, they should never be introduced as part of a mapping process, as there is no equivalent information in the source data, because they denote the origin of where a current record is coming from. Type Concepts are hard-wired in the ETL process, because it depends on the structure of the source data how to assign Type Concepts to records. Please also report if you find a situation like that.

documentation/vocabulary/data_etl.1422103191.txt.gz · Last modified: 2015/01/24 12:39 by cgreich