User Tools

Site Tools


documentation:vocabulary:data_etl

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Next revision
Previous revision
Last revision Both sides next revision
documentation:vocabulary:data_etl [2015/01/24 12:39]
cgreich created
documentation:vocabulary:data_etl [2015/01/24 13:09]
cgreich
Line 1: Line 1:
 ===== Data ETL ===== ===== Data ETL =====
 The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data: The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data:
-  * Retrieve the source code from the record. +  ​* ** Retrieve the source code** from the record. 
-  * Find the Concept that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the [[documentation:​cdm:​concept|CONCEPT]] table. Sometimes, the source code needs to be manipulated to find the right match. For example, [[documentation:​vocabulary:​icd9cm|ICD-9-CM]] codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or [[documentation:​vocabulary:​ndc|NDC]] codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each vocabulary for specific recommendations of the lookup process. +  ​* **Find the Concept** that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the [[documentation:​cdm:​concept|CONCEPT]] table. Sometimes, the source code needs to be manipulated to find the right match. For example, [[documentation:​vocabulary:​icd9cm|ICD-9-CM]] codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or [[documentation:​vocabulary:​ndc|NDC]] codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each [[documentation:​vocabulary:​vocabularies|Vocabulary]] ​for specific recommendations of the lookup process. 
-  * Once the Concept corresponding to the source code is established,​ map it to a Standard Concept (standard_concept='​S'​) by retrieving all the active (invalid_reason field should be NULL) records from the [[documentation:​cdm:​concept_relationship|CONCEPT_RELATIONSHIP]] table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='​Maps to'. If the source code is a local code and you have a [[documentation:​cdm:​source_to_concept_map|SOURCE_TO_CONCEPT_MAP]] table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three. +  * **Map to a Standard Concept** (standard_concept='​S'​) by retrieving all the active (invalid_reason field should be NULL) records from the [[documentation:​cdm:​concept_relationship|CONCEPT_RELATIONSHIP]] table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='​Maps to'. If the source code is a local code and you have a [[documentation:​cdm:​source_to_concept_map|SOURCE_TO_CONCEPT_MAP]] table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three. 
-  * For each destination Standard ​Concepts, determine ​the content of the domain_id field and write a record in the corresponding CDM table. Most Standardized Clinical Data Tables ​require a start_date and a Type Concept ​(see below). Some have more fields that need be filled or left empty. See the details of each table in the [[documentation:​cdm:​standardized_clinical_data_tables|CDM Specifications]]. The corresponding table/field combination for each Domain is as follows:+  * **Write a record in the corresponding CDM table** for each destination Standard ​Concept based on the content of the domain_id field. Place the ID of the Standard Concept into the <​domain>​_concept_id,​ the Source Concept into the <​domain>​_source_concept_id ​and the source code into the <​domain>​_source_value field. Most data tables ​require a start_date and a Type Concept, and some have more fields that need consideration. See the details of each table in the [[documentation:​cdm:​standardized_clinical_data_tables|CDM Specifications]]. The corresponding table/field combination for each Domain is as follows: 
 ^domain_id^CDM table^Field^Comment^ ^domain_id^CDM table^Field^Comment^
 |Generic|Any|Any|Generic Concepts can be in any field that ends in concept_id.| |Generic|Any|Any|Generic Concepts can be in any field that ends in concept_id.|
Line 25: Line 26:
 |Place of Service|[[documentation:​cdm:​care_site|CARE_SITE]]|place_of_service_concept_id| | |Place of Service|[[documentation:​cdm:​care_site|CARE_SITE]]|place_of_service_concept_id| |
 |Provider Specialty|[[documentation:​cdm:​provider|PROVIDER]]|specialty_concept_id| | |Provider Specialty|[[documentation:​cdm:​provider|PROVIDER]]|specialty_concept_id| |
-|Currency|[[documentation:​cdm:​visit_cost|VISIT_COST]] or [[documentation:​cdm:​procedure_cost|PROCEDURE_COST]] or [[documentation:​cdm:​drug_cost|DRUG_COST]] or [[documentation:​cdm:​device_cost|DEVICE_COST]]|currency_concept_id|Currency values ​could appear in any of the *_COST tables. *|+|Currency|[[documentation:​cdm:​visit_cost|VISIT_COST]] or [[documentation:​cdm:​procedure_cost|PROCEDURE_COST]] or [[documentation:​cdm:​drug_cost|DRUG_COST]] or [[documentation:​cdm:​device_cost|DEVICE_COST]]|currency_concept_id|Currency values appear in any of the *_COST tables. *|
 |Revenue Code|[[documentation:​cdm:​procedure_cost|PROCEDURE_COST]]|revenue_code_concept_id| | |Revenue Code|[[documentation:​cdm:​procedure_cost|PROCEDURE_COST]]|revenue_code_concept_id| |
 |Specimen|[[documentation:​cdm:​specimen|SPECIMEN]]|specimen_concept_id| | |Specimen|[[documentation:​cdm:​specimen|SPECIMEN]]|specimen_concept_id| |
-|Spec Anatomic Site|[[documentation:​cdm:​specimen|SPECIMEN]] or [[documentation:​cdm:​measurement|MEASUREMENT]] or [[documentation:​cdm:​observation|OBSERVATION]]|anatomic_site_concept_id or value_as_concept_id or value_as_concept_id|Anatomical ​site Concepts ​can be used to characterize the origin of a Specimen, ​or the result of a Measurement or Observation. *|+|Spec Anatomic Site|[[documentation:​cdm:​specimen|SPECIMEN]] or [[documentation:​cdm:​measurement|MEASUREMENT]] or [[documentation:​cdm:​observation|OBSERVATION]]|anatomic_site_concept_id or value_as_concept_id or value_as_concept_id|Anatomical ​Site Concepts ​are used to characterize the origin of a Specimen, ​but also the result of a Measurement or Observation. *|
 |Spec Disease Status|[[documentation:​cdm:​specimen|SPECIMEN]]|disease_status_concept_id| | |Spec Disease Status|[[documentation:​cdm:​specimen|SPECIMEN]]|disease_status_concept_id| |
  
-  ​For some Source Concepts, the Standardized Vocabularies do not provide a mapping to a Standard Concept (there is no record in the CONCEPT_RELATIONSHIP or SOURCE_TO_CONCEPT_MAP tables), usually because the Source Concept is too generic or otherwise ill-defined. In these cases, the domain_id field of the Source Concept itself should be used to place the record ​in the right table. Some Source Concepts have combination Domains, such as "​Device/​Procedure"​. In these cases, write record into each of the combination Domain table (in this case to the DEVICE_OCCURRENCE and PROCEDURE_OCCURRENCE tables). As the Standard ​Concept write Concept 0 (concept_id=0) into the respective *_concept_id field. +If there is more than one potential destination table the ETL needs to identify ​the context ​in which a Standard Concept is usedand select ​the right table from this table.
- +
-If the Domain of the destination ​Concept is "​Domain"​ or "​Metadata"​an error has occurred in the construction of the mapping ​table. Please report in the [[http://​forums.ohdsi.org/​c/​cdm-builders/​l/​latest|CDM-Builder]] forum if you believe there is such an error in the Standardized Vocabularies data.+
  
-The same is true if you find Type Concept ​Domain, like "Obs Period Type", "Death Type", "Visit Type", "​Procedure Type", etc.These Type Concepts are valid concepts and have to be placed into the respective ​*_type_concept_id field of the respective CDM tables. However, they should never be introduced as part of a mapping process, as there is no equivalent information ​in the source data, because ​they denote ​the origin of where a current record ​is coming fromType Concepts are hard-wired in the ETL processbecause it depends on the structure ​of the source data how to assign Type Concepts to recordsPlease also report if you find a situation like that.+  * For some Source Concepts there is **no mapping to Standard ​Concept** (there is no record ​in the CONCEPT_RELATIONSHIP or SOURCE_TO_CONCEPT_MAP tables)usually ​because the Source Concept ​is too generic or otherwise ill-definedIn these cases, the domain_id field of the Source Concept itself should be used to place the record in the right table. Some Source ​Concepts ​have combination Domains, such as "​Device/​Procedure"​. In these cases, write a record into each of the combination Domain table (in this case to the DEVICE_OCCURRENCE and PROCEDURE_OCCURRENCE tables)As the Standard Concept write Concept 0 (concept_id=0) into the respective <​domain>​_concept_id field.
  
 +If in the mapping process a Concept of the "​Domain"​ or "​Metadata"​ Domains are retrieved, an error has snuck into the mapping table. Please report those cases in the [[http://​forums.ohdsi.org/​c/​cdm-builders/​l/​latest|CDM-Builder]].
  
 +The same is true if a Type Concept Domain, like "Obs Period Type", "Death Type", "Visit Type", "​Procedure Type", etc. is produced. Though these Type Concepts are valid concepts and have to be placed into the <​domain>​_type_concept_id field of the respective CDM tables, they cannot be the result of the mapping process. They denote the origin of the record and the selection of Type Concepts should be hard-wired into the ETL process.
documentation/vocabulary/data_etl.txt · Last modified: 2015/01/24 14:23 by cgreich