This shows you the differences between two versions of the page.
Both sides previous revision Previous revision | Next revision Both sides next revision | ||
documentation:vocabulary:data_etl [2015/01/24 13:03] cgreich |
documentation:vocabulary:data_etl [2015/01/24 13:05] cgreich |
||
---|---|---|---|
Line 2: | Line 2: | ||
The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data: | The most important impact the Standardized Vocabularies have on the ETL process from raw to CDM-formatted data is the Domain of each Concept. Irrespective from which source table a record comes, or what coding scheme it is represented by, the destination table will be determined by the domain_id of the respective Concept. Any ETL will have to follow the following logic for process every record in the source data: | ||
* ** Retrieve the source code** from the record. | * ** Retrieve the source code** from the record. | ||
- | * **Find the Concept** that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the [[documentation:cdm:concept|CONCEPT]] table. Sometimes, the source code needs to be manipulated to find the right match. For example, [[documentation:vocabulary:icd9cm|ICD-9-CM]] codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or [[documentation:vocabulary:ndc|NDC]] codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each vocabulary for specific recommendations of the lookup process. | + | * **Find the Concept** that corresponds to the source code. In most cases, that is done by looking up the source code in the concept_code field with the correct content of the vocabulary_id field in the [[documentation:cdm:concept|CONCEPT]] table. Sometimes, the source code needs to be manipulated to find the right match. For example, [[documentation:vocabulary:icd9cm|ICD-9-CM]] codes have a dot after the 2nd or 3rd character, but in source data they are often stored without the dot. Or [[documentation:vocabulary:ndc|NDC]] codes come in a variety of formats (with dashes and asterisks or without, 9 or 11 digit), making the mapping process more complicated. Look into the specification of each [[documentation:vocabulary:vocabularies|Vocabulary]] for specific recommendations of the lookup process. |
* **Map to a Standard Concept** (standard_concept='S') by retrieving all the active (invalid_reason field should be NULL) records from the [[documentation:cdm:concept_relationship|CONCEPT_RELATIONSHIP]] table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='Maps to'. If the source code is a local code and you have a [[documentation:cdm:source_to_concept_map|SOURCE_TO_CONCEPT_MAP]] table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three. | * **Map to a Standard Concept** (standard_concept='S') by retrieving all the active (invalid_reason field should be NULL) records from the [[documentation:cdm:concept_relationship|CONCEPT_RELATIONSHIP]] table. Use concept_id_1 for the Concept you want to map and concept_id_2 for the destination Concept, with relationship_id='Maps to'. If the source code is a local code and you have a [[documentation:cdm:source_to_concept_map|SOURCE_TO_CONCEPT_MAP]] table for these, determine the destination Concept from the target_concept_id field. The destination Concept could be the Concept itself if it happens to be a Standard Concept, or a Concept in another vocabulary. In most cases, a source code maps to a single destination Standard Concept, but in some cases it could be two or three. | ||
* **Write a record in the corresponding CDM table** for each destination Standard Concept based on the content of the domain_id field. Place the ID of the Standard Concept into the <domain>_concept_id, the Source Concept into the <domain>_source_concept_id and the source code into the <domain>_source_value field. Most data tables require a start_date and a Type Concept, and some have more fields that need consideration. See the details of each table in the [[documentation:cdm:standardized_clinical_data_tables|CDM Specifications]]. The corresponding table/field combination for each Domain is as follows: | * **Write a record in the corresponding CDM table** for each destination Standard Concept based on the content of the domain_id field. Place the ID of the Standard Concept into the <domain>_concept_id, the Source Concept into the <domain>_source_concept_id and the source code into the <domain>_source_value field. Most data tables require a start_date and a Type Concept, and some have more fields that need consideration. See the details of each table in the [[documentation:cdm:standardized_clinical_data_tables|CDM Specifications]]. The corresponding table/field combination for each Domain is as follows: |