User Tools

Site Tools


documentation:vocabulary:data_etl

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision Both sides next revision
documentation:vocabulary:data_etl [2015/01/24 13:03]
cgreich
documentation:vocabulary:data_etl [2015/01/24 13:05]
cgreich
Line 2: Line 2:
 The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data: The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data:
   * ** Retrieve the source code** from the record.   * ** Retrieve the source code** from the record.
-  * **Find the Concept** that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the [[documentation:​cdm:​concept|CONCEPT]] table. Sometimes, the source code needs to be manipulated to find the right match. For example, [[documentation:​vocabulary:​icd9cm|ICD-9-CM]] codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or [[documentation:​vocabulary:​ndc|NDC]] codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each vocabulary for specific recommendations of the lookup process.+  * **Find the Concept** that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the [[documentation:​cdm:​concept|CONCEPT]] table. Sometimes, the source code needs to be manipulated to find the right match. For example, [[documentation:​vocabulary:​icd9cm|ICD-9-CM]] codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or [[documentation:​vocabulary:​ndc|NDC]] codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each [[documentation:​vocabulary:​vocabularies|Vocabulary]] ​for specific recommendations of the lookup process.
   * **Map to a Standard Concept** (standard_concept='​S'​) by retrieving all the active (invalid_reason field should be NULL) records from the [[documentation:​cdm:​concept_relationship|CONCEPT_RELATIONSHIP]] table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='​Maps to'. If the source code is a local code and you have a [[documentation:​cdm:​source_to_concept_map|SOURCE_TO_CONCEPT_MAP]] table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three.   * **Map to a Standard Concept** (standard_concept='​S'​) by retrieving all the active (invalid_reason field should be NULL) records from the [[documentation:​cdm:​concept_relationship|CONCEPT_RELATIONSHIP]] table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='​Maps to'. If the source code is a local code and you have a [[documentation:​cdm:​source_to_concept_map|SOURCE_TO_CONCEPT_MAP]] table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three.
   * **Write a record in the corresponding CDM table** for each destination Standard Concept based on the content of the domain_id field. Place the ID of the Standard Concept into the <​domain>​_concept_id,​ the Source Concept into the <​domain>​_source_concept_id and the source code into the <​domain>​_source_value field. Most data tables require a start_date and a Type Concept, and some have more fields that need consideration. See the details of each table in the [[documentation:​cdm:​standardized_clinical_data_tables|CDM Specifications]]. The corresponding table/field combination for each Domain is as follows:   * **Write a record in the corresponding CDM table** for each destination Standard Concept based on the content of the domain_id field. Place the ID of the Standard Concept into the <​domain>​_concept_id,​ the Source Concept into the <​domain>​_source_concept_id and the source code into the <​domain>​_source_value field. Most data tables require a start_date and a Type Concept, and some have more fields that need consideration. See the details of each table in the [[documentation:​cdm:​standardized_clinical_data_tables|CDM Specifications]]. The corresponding table/field combination for each Domain is as follows:
documentation/vocabulary/data_etl.txt ยท Last modified: 2015/01/24 14:23 by cgreich