User Tools

Site Tools


documentation:vocabulary:data_etl

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Last revision Both sides next revision
documentation:vocabulary:data_etl [2015/01/24 12:46]
cgreich
documentation:vocabulary:data_etl [2015/01/24 13:09]
cgreich
Line 2: Line 2:
 The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data: The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data:
   * ** Retrieve the source code** from the record.   * ** Retrieve the source code** from the record.
-  * **Find the Concept** that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the [[documentation:​cdm:​concept|CONCEPT]] table. Sometimes, the source code needs to be manipulated to find the right match. For example, [[documentation:​vocabulary:​icd9cm|ICD-9-CM]] codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or [[documentation:​vocabulary:​ndc|NDC]] codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each vocabulary for specific recommendations of the lookup process.+  * **Find the Concept** that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the [[documentation:​cdm:​concept|CONCEPT]] table. Sometimes, the source code needs to be manipulated to find the right match. For example, [[documentation:​vocabulary:​icd9cm|ICD-9-CM]] codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or [[documentation:​vocabulary:​ndc|NDC]] codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each [[documentation:​vocabulary:​vocabularies|Vocabulary]] ​for specific recommendations of the lookup process.
   * **Map to a Standard Concept** (standard_concept='​S'​) by retrieving all the active (invalid_reason field should be NULL) records from the [[documentation:​cdm:​concept_relationship|CONCEPT_RELATIONSHIP]] table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='​Maps to'. If the source code is a local code and you have a [[documentation:​cdm:​source_to_concept_map|SOURCE_TO_CONCEPT_MAP]] table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three.   * **Map to a Standard Concept** (standard_concept='​S'​) by retrieving all the active (invalid_reason field should be NULL) records from the [[documentation:​cdm:​concept_relationship|CONCEPT_RELATIONSHIP]] table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='​Maps to'. If the source code is a local code and you have a [[documentation:​cdm:​source_to_concept_map|SOURCE_TO_CONCEPT_MAP]] table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three.
-  * **Write a record in the corresponding CDM table** for each destination Standard Concept based on the content of the domain_id field. Most Standardized Clinical Data Tables ​require a start_date and a Type Concept ​(see below). Some have more fields that need consideration. See the details of each table in the [[documentation:​cdm:​standardized_clinical_data_tables|CDM Specifications]]. The corresponding table/field combination for each Domain is as follows:+  * **Write a record in the corresponding CDM table** for each destination Standard Concept based on the content of the domain_id ​field. Place the ID of the Standard Concept into the <​domain>​_concept_id,​ the Source Concept into the <​domain>​_source_concept_id and the source code into the <​domain>​_source_value ​field. Most data tables ​require a start_date and a Type Concept, and some have more fields that need consideration. See the details of each table in the [[documentation:​cdm:​standardized_clinical_data_tables|CDM Specifications]]. The corresponding table/field combination for each Domain is as follows:
  
 ^domain_id^CDM table^Field^Comment^ ^domain_id^CDM table^Field^Comment^
Line 26: Line 26:
 |Place of Service|[[documentation:​cdm:​care_site|CARE_SITE]]|place_of_service_concept_id| | |Place of Service|[[documentation:​cdm:​care_site|CARE_SITE]]|place_of_service_concept_id| |
 |Provider Specialty|[[documentation:​cdm:​provider|PROVIDER]]|specialty_concept_id| | |Provider Specialty|[[documentation:​cdm:​provider|PROVIDER]]|specialty_concept_id| |
-|Currency|[[documentation:​cdm:​visit_cost|VISIT_COST]] or [[documentation:​cdm:​procedure_cost|PROCEDURE_COST]] or [[documentation:​cdm:​drug_cost|DRUG_COST]] or [[documentation:​cdm:​device_cost|DEVICE_COST]]|currency_concept_id|Currency values ​could appear in any of the *_COST tables. *|+|Currency|[[documentation:​cdm:​visit_cost|VISIT_COST]] or [[documentation:​cdm:​procedure_cost|PROCEDURE_COST]] or [[documentation:​cdm:​drug_cost|DRUG_COST]] or [[documentation:​cdm:​device_cost|DEVICE_COST]]|currency_concept_id|Currency values appear in any of the *_COST tables. *|
 |Revenue Code|[[documentation:​cdm:​procedure_cost|PROCEDURE_COST]]|revenue_code_concept_id| | |Revenue Code|[[documentation:​cdm:​procedure_cost|PROCEDURE_COST]]|revenue_code_concept_id| |
 |Specimen|[[documentation:​cdm:​specimen|SPECIMEN]]|specimen_concept_id| | |Specimen|[[documentation:​cdm:​specimen|SPECIMEN]]|specimen_concept_id| |
-|Spec Anatomic Site|[[documentation:​cdm:​specimen|SPECIMEN]] or [[documentation:​cdm:​measurement|MEASUREMENT]] or [[documentation:​cdm:​observation|OBSERVATION]]|anatomic_site_concept_id or value_as_concept_id or value_as_concept_id|Anatomical ​site Concepts ​can be used to characterize the origin of a Specimen, ​or the result of a Measurement or Observation. *|+|Spec Anatomic Site|[[documentation:​cdm:​specimen|SPECIMEN]] or [[documentation:​cdm:​measurement|MEASUREMENT]] or [[documentation:​cdm:​observation|OBSERVATION]]|anatomic_site_concept_id or value_as_concept_id or value_as_concept_id|Anatomical ​Site Concepts ​are used to characterize the origin of a Specimen, ​but also the result of a Measurement or Observation. *|
 |Spec Disease Status|[[documentation:​cdm:​specimen|SPECIMEN]]|disease_status_concept_id| | |Spec Disease Status|[[documentation:​cdm:​specimen|SPECIMEN]]|disease_status_concept_id| |
  
-  ​For some Source Concepts, ​the Standardized Vocabularies do not provide a mapping ​to a Standard Concept ​(there ​is no record in the CONCEPT_RELATIONSHIP or SOURCE_TO_CONCEPT_MAP tables)usually because the Source Concept is too generic or otherwise ill-defined. In these cases, the domain_id field of the Source Concept itself should be used to place the record in the right table. Some Source Concepts have combination Domains, such as "​Device/​Procedure"​. In these cases, write a record into each of the combination Domain ​table (in this case to the DEVICE_OCCURRENCE and PROCEDURE_OCCURRENCE tables). As the Standard Concept write Concept 0 (concept_id=0) into the respective *_concept_id field.+If there is more than one potential destination table the ETL needs to identify the context in which a Standard Concept is usedand select ​the right table from this table.
  
-If the Domain of the destination ​Concept is "​Domain" ​or "​Metadata"​an error has occurred in the construction ​of the mapping table. Please report in the [[http://​forums.ohdsi.org/​c/​cdm-builders/​l/​latest|CDM-Builder]] forum if you believe there is such an error in the Standardized Vocabularies data. +  * For some Source Concepts there is **no mapping to a Standard Concept** (there is no record in the CONCEPT_RELATIONSHIP or SOURCE_TO_CONCEPT_MAP tables), usually because ​the Source ​Concept is too generic ​or otherwise ill-defined. In these cases, the domain_id field of the Source Concept itself should be used to place the record ​in the right tableSome Source Concepts have combination Domainssuch as "Device/Procedure"​. In these caseswrite a record ​into each of the combination Domain table (in this case to the DEVICE_OCCURRENCE and PROCEDURE_OCCURRENCE tables)As the Standard Concept write Concept 0 (concept_id=0) into the respective <​domain>​_concept_id field.
- +
-The same is true if you find a Type Concept Domain, like "Obs Period Type", "Death Type", "Visit Type", "​Procedure ​Type", ​etc.These Type Concepts are valid concepts and have to be placed ​into the respective *_type_concept_id field of the respective CDM tables. However, they should never be introduced as part of a mapping process, as there is no equivalent information ​in the source data, because they denote the origin of where a current record is coming fromType Concepts are hard-wired in the ETL process, because it depends on the structure of the source data how to assign Type Concepts to records. Please also report if you find a situation like that.+
  
 +If in the mapping process a Concept of the "​Domain"​ or "​Metadata"​ Domains are retrieved, an error has snuck into the mapping table. Please report those cases in the [[http://​forums.ohdsi.org/​c/​cdm-builders/​l/​latest|CDM-Builder]].
  
 +The same is true if a Type Concept Domain, like "Obs Period Type", "Death Type", "Visit Type", "​Procedure Type", etc. is produced. Though these Type Concepts are valid concepts and have to be placed into the <​domain>​_type_concept_id field of the respective CDM tables, they cannot be the result of the mapping process. They denote the origin of the record and the selection of Type Concepts should be hard-wired into the ETL process.
documentation/vocabulary/data_etl.txt · Last modified: 2015/01/24 14:23 by cgreich