''!! This cookbook is outdated - do not follow these instructions !!''

Vocabulary users download vocabularies from Athena: [[https://athena.ohdsi.org/]]

----

====== Background ======

The CDM Vocabulary is maintained by the Athena vocabulary team led by Christian Reich. This page provides a high-level guide or "cookbook" for the ETL process that the team follows to periodically update the CDM vocabulary when new source vocabulary data sets are released. This ETL process will be improved and extended with new vocabulary sources over time. It is somewhat complex: it has very specific prerequisites, it requires a good knowledge of the CDM schema and the source vocabulary data sets, and it includes some manual steps. This ETL is not intended to be executed by typical users of the CDM Vocabulary. If you just need a copy of the CDM Vocabulary, go to the CDM vocabulary download website here: [[http://www.ohdsi.org/web/athena/]].

====== OMOP Vocabulary ETL data update cookbook ======

==== Prerequisites ====

==== Network connectivity ====

Internet access is required to download source dataset files from external websites and to download the ETL source code from GitHub: https://github.com/OHDSI/Vocabulary-v5.0

==== Utilities ====

A zip file decompression program is required (e.g. 7-Zip for Windows or gzip/gunzip for Linux).

==== DBMS ====

An Oracle 12 DBMS is required. Note: Oracle XE cannot be used because it has an 11 GB database size limitation. In order to minimize network latency for the database loads and data transformation SQL scripts, it is recommended to host the Oracle DBMS and the data set files to be loaded on the same server.

  * Create a database/user called PRODV5. This database schema is used for the data set and concept staging tables and is where the final output V5 CDM vocabulary tables will be created and populated.
  * Create a database/user called PRODV4. This database schema is where the final output V4 CDM vocabulary tables will be created and populated.
  * Create a database/user called umls. This database schema is for the UMLS data feed staging tables.
  * Note: as a general principle when running the create_source_tables.sql scripts in this ETL process, you will get the best performance by first running the create table statements, then loading the tables, and then running the create index statements. The reason is that it is faster to add an index to a populated table than to load data into a table that already has an index.

==== Hardware ====

2 GB of RAM on the server (minimum) and 400 GB of disk space for the database and source files.

==== An existing copy of the Vocabulary tables populated with vocabulary data in your Oracle 12 database ====

This ETL process is an update process which merges new data into an existing populated vocabulary database. If you want to refresh V4 vocabulary data you will need both the V4 and V5 vocabularies (V4 is refreshed from V5). If you only want to refresh V5 vocabulary data then V4 is not required and you can skip the V4 update step in the ETL process. If you have no existing vocabulary data you can begin by downloading the V4 and/or V5 vocabulary export files from here: [[http://www.ohdsi.org/web/athena/]].

V4.5 Oracle DDL and CTL statements can be found here: [[https://github.com/OHDSI/CommonDataModel/tree/master/Version4/Oracle]]

V5 Oracle DDL and CTL statements can be found here: https://github.com/OHDSI/CommonDataModel/tree/master/Oracle

Note: the Oracle CTL statements for the vocabulary data loads are in the vocabimport sub-directories.
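The schema creation described in the DBMS prerequisites above could be scripted roughly as follows. This is only a sketch: it assumes you are connected as SYSTEM to the pluggable database you intend to use, and the passwords, tablespace and grants shown here are illustrative, not part of the official scripts.

<code bash>
# Sketch only: create the three schemas named in this cookbook.
# Assumes a local Oracle 12 instance and that the SYSTEM password is exported
# as $SYSTEM_PWD; passwords, tablespace and grants below are illustrative.
sqlplus -s system/"$SYSTEM_PWD" <<'SQL'
CREATE USER prodv5 IDENTIFIED BY "ChangeMe1" QUOTA UNLIMITED ON users;
CREATE USER prodv4 IDENTIFIED BY "ChangeMe2" QUOTA UNLIMITED ON users;
CREATE USER umls   IDENTIFIED BY "ChangeMe3" QUOTA UNLIMITED ON users;
GRANT CONNECT, RESOURCE, CREATE VIEW, CREATE MATERIALIZED VIEW TO prodv5, prodv4, umls;
-- v5-to-v4.sql (Step 20) reads the V5 tables from the PRODV4 session,
-- so PRODV4 needs read access to the PRODV5 schema (per-table grants also work).
GRANT SELECT ANY TABLE TO prodv4;
EXIT
SQL
</code>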
This table shows the web sites to visit and the files to download for this ETL process.

^ Data Feed ^ Website ^ Website Download Page ^ Example Download File Name ^
| UMLS | UMLS website | http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html | http://download.nlm.nih.gov/umls/kss/2014AB/full/2014ab-1-meta.nlm |
| UMLS | UMLS website | http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html | http://download.nlm.nih.gov/umls/kss/2014AB/full/2014ab-2-meta.nlm |
| SNOMED | UMLS website | http://www.nlm.nih.gov/research/umls/licensedcontent/snomedctfiles.html | http://download.nlm.nih.gov/umls/kss/IHTSDO20150131/SnomedCT_RF2Release_INT_20150131.zip |
| SNOMED | UK HSCIC website | https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/26/subpack/102/releases | https://isd.hscic.gov.uk/artefact/trud3/1kkxe04jo40gw5q3xs0obolnvx/SNOMEDCT2/18.0.0/UK_SCT2CLFULL/uk_sct2clfull_18.0.0_20141001000001.zip |
| SNOMED | UK HSCIC website | https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/6/subpack/24/releases | https://isd.hscic.gov.uk/artefact/trud3/15sgy3n9hin6pvu7rwv11go46t/NHSBSA/6.1.0/NHSBSA_DMD/nhsbsa_dmd_6.1.0_20150608000001.zip |
| SNOMED | UK HSCIC website | https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/6/subpack/25/releases | https://isd.hscic.gov.uk/artefact/trud3/1mebfzrdolgi1kwge9xh0ddgtl/NHSBSA/6.3.0/NHSBSA_DMDBONUS/nhsbsa_dmdbonus_6.3.0_20150622000001.zip |
| RxNorm, NDFRT, VA Product, VA Class, ATC, MeSH, ICD10, GCN_SEQNO, ETC, Indication | UMLS website | http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html | http://download.nlm.nih.gov/umls/kss/rxnorm/RxNorm_full_03022015.zip |
| HCPCS | CMS website | http://www.cms.gov/Medicare/Coding/HCPCSReleaseCodeSets/Alpha-Numeric-HCPCS.html | https://www.cms.gov/Medicare/Coding/HCPCSReleaseCodeSets/Downloads/2016-Alpha-Numeric-HCPCS-File.zip |
| ICD9CM and ICD9Proc | CMS website | http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html | http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/Downloads/ICD-9-CM-v32-master-descriptions.zip |
| ICD10CM | CDC website | http://www.cdc.gov/nchs/icd/icd10cm.htm | ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/20xx/ICD10CM_FY20xx_code_descriptions.zip |
| LOINC | LOINC website | https://loinc.org/downloads/files/loinc-table-csv-text-format | http://loinc.org/downloads/files/loinc-table-csv-text-format/loinc-table-file-csv-text-format/download |
| LOINC | LOINC website | https://loinc.org/downloads/files/loinc-multiaxial-hierarchy | http://loinc.org/downloads/files/loinc-multiaxial-hierarchy/loinc-multiaxial-hierarchy-file/download |
| LOINC | LOINC website | https://loinc.org/downloads/files/loinc-panels-and-forms-file | http://loinc.org/downloads/files/loinc-panels-and-forms-file/loinc-panels-and-forms-file-all-selected-panels-and-forms/download |
| LOINC | LOINC website | https://loinc.org/news/draft-loinc-snomed-ct-mappings-and-expression-associations-now-available.html | http://loinc.org/downloads/files/loinc-snomed-ct-expression-associations-technology-preview/loinc-snomed-ct-expression-association-and-map-sets-file-technology-preview-rf2-format/download |
| LOINC | UMLS website | http://www.nlm.nih.gov/research/umls/mapping_projects/loinc_to_cpt_map.html | http://download.nlm.nih.gov/umls/kss/mappings/LNC215_TO_CPT2005/LNC215_TO_CPT2005_MAPPINGS.zip |
| NDC-SPL | FDA website | http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm | http://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/UCM070838.zip |
| NDC-SPL | Dailymed website | http://dailymed.nlm.nih.gov/dailymed/spl-resources-all-drug-labels.cfm | ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_rx_part1.zip |
| NDC-SPL | Dailymed website | http://dailymed.nlm.nih.gov/dailymed/spl-resources-all-mapping-files.cfm | ftp://public.nlm.nih.gov/nlmdata/.dailymed/rxnorm_mappings.zip |
| READ | UK HSCIC website | https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/1/subpack/21/releases | select the latest complete released file on the web page |
| READ | UK HSCIC website | https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/9/subpack/9/releases | select the latest complete released file on the web page |
| DRG | CMS website | http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Acute-Inpatient-Files-for-Download.html | select the latest complete released file on the web page |

==== Licenses/registrations required for access to some source datasets ====

  * UMLS license
  * MedDRA license
  * User registration at the LOINC website
  * User registration at the UK HSCIC website

==== Vocabulary Updates ETL scripts ====

Create a directory, e.g. /vocabETL, on your server (on a Windows server use the name C:\vocabETL and make similar Windows file path substitutions in the subsequent instructions in this document).

Use a web browser or the command-line program wget to download the following zip file into the vocabETL directory: https://github.com/OHDSI/Vocabulary-v5.0/archive/master.zip

Unzip the master.zip file using 7-Zip on Windows or unzip on Linux. This will result in the following directory structure:

/Vocabulary-v5.0-master/
  * CPT4
  * HCPCS
  * ICD10
  * ICD9CM
  * ICD10CM
  * ICD9Proc
  * LOINC
  * MedDRA
  * NDC_SPL
  * Read
  * RxNorm
  * SNOMED
  * UMLS
  * DRG
  * MeSH
  * working
  * Final_Assembly
  * Conversion_from_v4

There is one sub-directory containing the ETL scripts for each data feed, plus three additional sub-directories explained below.

==== "Working" directory ====

There are a number of scripts in this sub-directory (also see the process overview diagram later in this document).
  * 'DevV5_DDL.sql' – creates the V5 CDM vocabulary tables database schema.
  * 'generic_update.sql' – populates and merges the data from the staging tables into the output V5 CDM vocabulary schema tables.
  * 'manual_changes.sql' and 'manual_changes_15-Mar-2015.sql' – manual scripts that make some vocabulary data inserts/updates in the CDM vocabulary tables.
  * 'v5-to-v4.sql' – transforms the V5 CDM vocabulary table data into V4 format data to populate the V4 CDM database vocabulary tables.
  * Additionally there are some utility scripts that do NOT need to be run as part of the vocabulary ETL process:
    * 'devv4-to-prodv4.sql' and 'devv5-to-prodv5.sql' – migrate CDM V4 or V5 database schema changes from the development database environment to the production database environment.
    * 'fast_recreate_schema.sql' – rapidly recreates the core CDM vocabulary tables database schema: concept, concept_relationship and concept_synonym.

=== "Final_Assembly" directory ===

The two scripts in this sub-directory are run to populate the CONCEPT_ANCESTOR and DRUG_STRENGTH CDM vocabulary tables from the staging tables.

=== "Conversion_from_v4" directory ===

The 'Conversion_from_v4' scripts are NOT part of the ongoing vocabulary ETL process and should NOT be run.
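The download and unpack steps described at the start of this section can also be scripted. A minimal sketch, assuming wget and unzip are installed and /vocabETL is used as the target directory:

<code bash>
# Sketch: fetch and unpack the ETL scripts into /vocabETL (adjust the path for Windows).
mkdir -p /vocabETL
cd /vocabETL
wget https://github.com/OHDSI/Vocabulary-v5.0/archive/master.zip
unzip master.zip
# One sub-directory per data feed plus working, Final_Assembly and Conversion_from_v4:
ls Vocabulary-v5.0-master
</code>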
==== ETL Process Overview ====

The UMLS knowledge sources, SNOMED and RxNorm are the core set of vocabularies that must be loaded first. The processing of the other vocabularies depends on first having those three vocabularies loaded. See the diagram below for a high-level overview of the process.

{{:documentation:vocabulary:cdm_vocab_etl_overview.png?300|}}

==== ETL Process Steps ====

==== Step 1 ====

=== Overview ===
Download the UMLS datasets from the UMLS website. Decompress the downloaded files. Run the ETL script to create the UMLS database staging tables schema and load the datasets into the UMLS staging tables.

=== Details ===
  - Log into the UMLS website.
  - Navigate to the page http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html and download both the files YYYYab-1-meta.nlm and YYYYab-2-meta.nlm into the vocabETL/Vocabulary-v5.0-master/UMLS sub-directory.
  - An example pair of file names would be 2014ab-1-meta.nlm and 2014ab-2-meta.nlm.
  - The two files are actually zip files even though they do not have a .zip file extension.
  - Connect to the PRODV5 SQL database in an Oracle SQL client and run the create_source_tables.sql script to create the UMLS staging tables (skip this step if the staging tables already exist). Each table has the same name as the UMLS dataset that will populate it (see the files listed in the step below).
  - Extract (unzip) the YYYYab-1-meta.nlm and YYYYab-2-meta.nlm files into the current directory (vocabETL/Vocabulary-v5.0-master/UMLS), then decompress the extracted .gz files with gzip -d (Windows) or gunzip (Linux). Note: there are multiple sets of extracted files; each set of files must be concatenated together to create a single file.

A. Linux commands to unzip and concatenate the files (run in a shell), where YYYY is the year (e.g. 2014):

cd /vocabETL/Vocabulary-v5.0-master/UMLS/YYYYAB/META
unzip YYYYab-1-meta.nlm
unzip YYYYab-2-meta.nlm
gunzip *.gz
cat MRCONSO.RRF.aa MRCONSO.RRF.ab > MRCONSO.RRF
cat MRHIER.RRF.aa MRHIER.RRF.ab > MRHIER.RRF
cat MRREL.RRF.aa MRREL.RRF.ab MRREL.RRF.ac MRREL.RRF.ad > MRREL.RRF
cat MRSAT.RRF.aa MRSAT.RRF.ab MRSAT.RRF.ac MRSAT.RRF.ad > MRSAT.RRF

B. Alternative commands for Windows (run in PowerShell), where YYYY is the year (e.g. 2014):

cd C:\vocabETL\Vocabulary-v5.0-master\UMLS\YYYYAB\META
unzip *-meta.nlm
gzip -d MRCONSO.RRF.aa.gz MRCONSO.RRF.ab.gz MRHIER.RRF.aa.gz MRHIER.RRF.ab.gz MRREL.RRF.aa.gz MRREL.RRF.ab.gz MRREL.RRF.ac.gz MRREL.RRF.ad.gz MRSAT.RRF.aa.gz MRSAT.RRF.ab.gz MRSAT.RRF.ac.gz MRSAT.RRF.ad.gz
cmd /c copy MRCONSO.RRF.aa + MRCONSO.RRF.ab MRCONSO.RRF
cmd /c copy MRHIER.RRF.aa + MRHIER.RRF.ab MRHIER.RRF
cmd /c copy MRREL.RRF.aa + MRREL.RRF.ab + MRREL.RRF.ac + MRREL.RRF.ad MRREL.RRF
cmd /c copy MRSAT.RRF.aa + MRSAT.RRF.ab + MRSAT.RRF.ac + MRSAT.RRF.ad MRSAT.RRF

Run the following commands in a Linux shell or Windows command line to load each file into the PRODV5 staging table of the same name. Note: each file has a separate control file in the vocabETL/Vocabulary-v5.0-master/UMLS sub-directory, and the full path to the data sets to be loaded must be specified in the command (supply the database password of the PRODV5 user after the slash in the sqlldr statements below). Note: you can review and then ignore any single .bad record for each file; it just contains the Windows EOF character added by the batch copy statements above.
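The six loader invocations listed below can also be wrapped in a small loop. This is only a sketch; it assumes sqlldr is on the PATH, that it is run from the UMLS sub-directory containing the .ctl and concatenated .RRF files, and that the PRODV5 password is exported as an environment variable (the $PRODV5_PWD name is illustrative):

<code bash>
# Sketch: run the six UMLS loads in a loop and report any rejected records.
# Assumes the .ctl files and the concatenated .RRF files are in the current
# directory and that the PRODV5 password is exported as $PRODV5_PWD.
for t in MRCONSO MRHIER MRMAP MRREL MRSAT MRSMAP; do
  sqlldr "PRODV5/${PRODV5_PWD}" CONTROL=${t}.ctl LOG=${t}.log BAD=${t}.bad
  # Any single rejected record per file is just the Windows EOF character
  # added by the copy commands above and can be ignored.
  [ -f ${t}.bad ] && echo "${t}: $(wc -l < ${t}.bad) record(s) rejected"
done
</code>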
sqlldr 'PRODV5/' CONTROL=MRCONSO.ctl LOG=MRCONSO.log BAD=MRCONSO.bad
sqlldr 'PRODV5/' CONTROL=MRHIER.ctl LOG=MRHIER.log BAD=MRHIER.bad
sqlldr 'PRODV5/' CONTROL=MRMAP.ctl LOG=MRMAP.log BAD=MRMAP.bad
sqlldr 'PRODV5/' CONTROL=MRREL.ctl LOG=MRREL.log BAD=MRREL.bad
sqlldr 'PRODV5/' CONTROL=MRSAT.ctl LOG=MRSAT.log BAD=MRSAT.bad
sqlldr 'PRODV5/' CONTROL=MRSMAP.ctl LOG=MRSMAP.log BAD=MRSMAP.bad

==== Step 2 ====

=== Overview ===
Download the SNOMED datasets from the websites listed in the table above. Decompress the downloaded files. Run the ETL script to create the staging tables schema and load the datasets.

=== Notes ===
  * Step 14 of load_stage.sql updates concept_stage from concept. Before running this step, run the statement below so that the column sizes match and Oracle will use the concept_code indexes for the join:

alter table concept_stage modify (concept_code varchar2(50));

The international SNOMED CT and the UK SNOMED CT releases are loaded into separate staging tables and then unioned together in views, ready for processing in the load_stage.sql step below.

=== Details ===
  - Run create_source_tables.sql.
  - Download the international SNOMED file SnomedCT_Release_INT_YYYYMMDD.zip from http://www.nlm.nih.gov/research/umls/licensedcontent/snomedctfiles.html.
  - Extract the release date from the file name.
  - Extract the following files from the folder SnomedCT_Release_INT_YYYYMMDD\RF2Release\Full\Terminology:
    * sct2_Concept_Full_INT_YYYYMMDD.txt
    * sct2_Description_Full-en_INT_YYYYMMDD.txt
    * sct2_Relationship_Full_INT_YYYYMMDD.txt
    Remove the date from the file names.
  - Load them into SCT2_CONCEPT_FULL_INT, SCT2_DESC_FULL_EN_INT and SCT2_RELA_FULL_INT. Use the control files of the same name.
  - Download the British SNOMED file SNOMEDCT2_XX.0.0_YYYYMMDD000001.zip from https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/26/subpack/102/releases
  - Extract the release date from the file name.
  - Extract the following files from the folder SnomedCT2_GB1000000_YYYYMMDD\RF2Release\Full\Terminology into a working folder:
    * sct2_Concept_Full_GB1000000_YYYYMMDD.txt
    * sct2_Description_Full-en-GB_GB1000000_YYYYMMDD.txt
    * sct2_Relationship_Full_GB1000000_YYYYMMDD.txt
    Remove the date from the file names.
  - Load them into SCT2_CONCEPT_FULL_UK, SCT2_DESC_FULL_UK and SCT2_RELA_FULL_UK. Use the control files in Vocabulary-v5.0\01-SNOMED.
  - Extract der2_cRefset_AssociationReferenceFull_INT_YYYYMMDD.txt from SnomedCT_Release_INT_YYYYMMDD\RF2Release\Full\Refset\Content and der2_cRefset_AssociationReferenceFull_GB1000000_YYYYMMDD.txt from SnomedCT2_GB1000000_YYYYMMDD\RF2Release\Full\Refset\Content. Remove the date from the file names.
  - Load them into der2_cRefset_AssRefFull_INT and der2_cRefset_AssRefFull_UK (rename the data files to remove any YYYYMMDD suffix so that they match the file names in the .ctl files).
  - Add DM+D: download nhsbsa_dmd_X.X.X_xxxxxxxxxxxxxx.zip from https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/6/subpack/24/releases
  - Extract f_ampp2_xxxxxxx.xml, f_amp2_xxxxxxx.xml, f_vmpp2_xxxxxxx.xml, f_vmp2_xxxxxxx.xml, f_lookup2_xxxxxxx.xml, f_vtm2_xxxxxxx.xml and f_ingredient2_xxxxxxx.xml.
  - Load them into f_ampp2, f_amp2, f_vmpp2, f_vmp2, f_lookup2, f_vtm2 and f_ingredient2. Use the control files of the same name.
  - Download nhsbsa_dmdbonus_X.X.X_YYYYMMDDXXXXXX.zip from https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/6/subpack/25/releases
  - Extract weekXXYYYY-rX_X-BNF.zip/f_bnf1_XXXXXXX.xml and rename it to dmdbonus.xml.
  - Load dmdbonus.xml using dmdbonus.ctl.

Staging table load statements:

sqlldr PRODV5/ CONTROL=SCT2_CONCEPT_FULL_INT.ctl LOG=SCT2_CONCEPT_FULL_INT.log BAD=SCT2_CONCEPT_FULL_INT.bad
sqlldr PRODV5/ CONTROL=SCT2_DESC_FULL_EN_INT.ctl LOG=SCT2_DESC_FULL_EN_INT.log BAD=SCT2_DESC_FULL_EN_INT.bad
sqlldr PRODV5/ CONTROL=SCT2_RELA_FULL_INT.ctl LOG=SCT2_RELA_FULL_INT.log BAD=SCT2_RELA_FULL_INT.bad
sqlldr PRODV5/ CONTROL=SCT2_CONCEPT_FULL_UK.ctl LOG=SCT2_CONCEPT_FULL_UK.log BAD=SCT2_CONCEPT_FULL_UK.bad
sqlldr PRODV5/ CONTROL=SCT2_DESC_FULL_UK.ctl LOG=SCT2_DESC_FULL_UK.log BAD=SCT2_DESC_FULL_UK.bad
sqlldr PRODV5/ CONTROL=SCT2_RELA_FULL_UK.ctl LOG=SCT2_RELA_FULL_UK.log BAD=SCT2_RELA_FULL_UK.bad
sqlldr PRODV5/ CONTROL=der2_cRefset_AssRefFull_INT.ctl LOG=der2_cRefset_AssRefFull_INT.log BAD=der2_cRefset_AssRefFull_INT.bad
sqlldr PRODV5/ CONTROL=der2_cRefset_AssRefFull_UK.ctl LOG=der2_cRefset_AssRefFull_UK.log BAD=der2_cRefset_AssRefFull_UK.bad
sqlldr PRODV5/ CONTROL=f_ampp2.ctl LOG=f_ampp2.log BAD=f_ampp2.bad
sqlldr PRODV5/ CONTROL=f_vmpp2.ctl LOG=f_vmpp2.log BAD=f_vmpp2.bad
sqlldr PRODV5/ CONTROL=f_amp2.ctl LOG=f_amp2.log BAD=f_amp2.bad
sqlldr PRODV5/ CONTROL=f_vmp2.ctl LOG=f_vmp2.log BAD=f_vmp2.bad
sqlldr PRODV5/ CONTROL=f_vtm2.ctl LOG=f_vtm2.log BAD=f_vtm2.bad
sqlldr PRODV5/ CONTROL=f_ingredient2.ctl LOG=f_ingredient2.log BAD=f_ingredient2.bad
sqlldr PRODV5/ CONTROL=f_lookup2.ctl LOG=f_lookup2.log BAD=f_lookup2.bad
sqlldr PRODV5/ CONTROL=dmdbonus.ctl LOG=dmdbonus.log BAD=dmdbonus.bad

  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

==== Step 3 ====

=== Overview ===
Download the RxNorm datasets from the website. Decompress the downloaded files. Run the ETL script to create the database staging tables schema and load the datasets.

=== Notes ===
Update the "update vocabulary ... latest_update" statement in load_stage.sql to the correct date for the file you are loading; make this change in load_stage.sql for each data feed.

This ETL step processes the RxNorm, NDFRT, VA Product, VA Class and ATC vocabularies.

At step 15 of load_stage.sql you must run the generic_update.sql script (from the "working" directory) to load the RxNorm concepts from the concept staging tables. You also need to run that same script after the load_stage.sql script finishes.

=== Details ===
  - Run create_source_tables.sql.
  - Download RxNorm_full_MMDDYYYY.zip from http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html
  - Extract the files from the rrf folder and load them into the following tables, using the control files of the same name:
    * RXNATOMARCHIVE
    * RXNCONSO
    * RXNCUI
    * RXNCUICHANGES
    * RXNDOC
    * RXNREL
    * RXNSAB
    * RXNSAT
    * RXNSTY
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).
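Step 3 above and most of the remaining data feed steps repeat the same four-part pattern: create the staging tables, load the source files with sqlldr, run the feed's load_stage.sql, and run generic_update.sql from the working directory. The sketch below illustrates that pattern only; the $PRODV5_PWD variable and the assumption that every data file name matches its .ctl file are illustrative, not part of the official scripts.

<code bash>
# Sketch of the per-feed pattern (illustrative only; it does not replace the
# step-by-step instructions below). Assumes the feed's .ctl and data files sit
# in its sub-directory and that the PRODV5 password is exported as $PRODV5_PWD.
FEED=RxNorm    # any data feed sub-directory of Vocabulary-v5.0-master
cd /vocabETL/Vocabulary-v5.0-master/$FEED
sqlplus "PRODV5/${PRODV5_PWD}" @create_source_tables.sql   # 1. create the staging tables (skip if they exist)
for ctl in *.ctl; do                                        # 2. load each source file
  sqlldr "PRODV5/${PRODV5_PWD}" CONTROL=$ctl LOG=${ctl%.ctl}.log BAD=${ctl%.ctl}.bad
done
sqlplus "PRODV5/${PRODV5_PWD}" @load_stage.sql              # 3. populate the *_stage tables
cd ../working
sqlplus "PRODV5/${PRODV5_PWD}" @generic_update.sql          # 4. merge the staged data into the V5 tables
</code>

Individual steps deviate from this pattern in places (Excel exports, manual Medical Coder files, explicit DATA= arguments, steps that skip generic_update.sql), so treat it as a mental model rather than a script to run verbatim.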
==== Step 4 ====

=== Overview ===
Download the HCPCS datasets from the website. Decompress the downloaded files. Run the ETL script to create the staging tables schema and load the datasets.

=== Notes ===
Steps 9 and 10 of load_stage.sql are completely manual – a select statement generates the set of Medical Coder rows, which needs to be exported as a file, followed by the instruction "Append resulting file from Medical Coder (in concept_relationship_stage format) to concept_relationship_stage". In the future this could be converted into an insert append statement.

Note: the original HCPCS HCPYYYY_CONTR_ANWEB_V2.txt file is a fixed-field-format file that will not load with the ANWEB_V2.ctl control file, which requires a comma-separated file, so see the details below on creating a csv file to use for the load.

=== Details ===
  - Run create_source_tables.sql.
  - Download the latest file from http://www.cms.gov/Medicare/Coding/HCPCSReleaseCodeSets/Alpha-Numeric-HCPCS.html, file name YYYY-Alpha-Numeric-HCPCS-File.zip.
  - Extract HCPCYYYY_CONTR_ANWEB.xlsx.
  - Open the above file in Excel and export a csv file (with double-quoted fields) called ANWEB_V2.csv for loading into the ANWEB_V2 staging table.
  - Load the file data into ANWEB_V2. Use the control file of the same name.
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

Staging table load statement:

sqlldr PRODV5/ CONTROL=ANWEB_V2.ctl DATA=HCPC2015_CONTR_ANWEB_v2.tsv LOG=ANWEB_V2.log BAD=ANWEB_V2.bad

==== Step 5 ====

=== Overview ===
Download the ICD9CM datasets. Decompress the downloaded files. Run the ETL script to create the staging tables schema and load the datasets.

=== Details ===
  - Run create_source_tables.sql.
  - Download ICD-9-CM-vXX-master-descriptions.zip from http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html
  - Extract CMSXX_DESC_LONG_DX.txt and CMSXX_DESC_SHORT_DX.txt.
  - Load them into CMS_DESC_LONG_DX and CMS_DESC_SHORT_DX. Use the control files of the same name.
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

Staging table load statements:

sqlldr PRODV5/ CONTROL=CMS32_DESC_LONG_DX.ctl LOG=CMS32_DESC_LONG_DX.log BAD=CMS32_DESC_LONG_DX.bad
sqlldr PRODV5/ CONTROL=CMS32_DESC_SHORT_DX.ctl LOG=CMS32_DESC_SHORT_DX.log BAD=CMS32_DESC_SHORT_DX.bad

==== Step 6 ====

=== Overview ===
Download the ICD9Proc datasets. Decompress the downloaded files. Run the ETL script to create the staging tables schema and load the datasets.

=== Notes ===
Steps 8 and 9 of load_stage.sql are completely manual – a select statement generates the set of Medical Coder rows, which need to be exported as a file called 'concept_relationship_manual9cm'. In the future these two steps could be converted into an insert append sub-select statement.

=== Details ===
  - Run create_source_tables.sql.
  - Download ICD-9-CM-vXX-master-descriptions.zip from http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html
  - Extract CMSXX_DESC_LONG_SG.txt and CMSXX_DESC_SHORT_SG.txt.
  - Load them into CMS_DESC_LONG_SG and CMS_DESC_SHORT_SG. Use the control files of the same name.
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

Staging table load statements:

sqlldr PRODV5/ CONTROL=CMS32_DESC_LONG_SG.ctl LOG=CMS32_DESC_LONG_SG.log BAD=CMS32_DESC_LONG_SG.bad
sqlldr PRODV5/ CONTROL=CMS32_DESC_SHORT_SG.ctl LOG=CMS32_DESC_SHORT_SG.log BAD=CMS32_DESC_SHORT_SG.bad
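The manual "export the Medical Coder rows to a file" operations mentioned in the notes for Steps 4 and 6 above can be done with a simple SQL*Plus spool. This is only a sketch: the SELECT is a placeholder for the statement given in the relevant load_stage.sql, the output file name follows the Step 6 note, and $PRODV5_PWD is an assumed environment variable.

<code bash>
# Sketch: spool the "Medical Coder" result set to a flat file for later appending
# to concept_relationship_stage (see the notes for Steps 4 and 6 above).
# The SELECT is a placeholder; paste in the statement from the relevant load_stage.sql.
sqlplus -s "PRODV5/${PRODV5_PWD}" <<'SQL'
SET HEADING OFF FEEDBACK OFF PAGESIZE 0 LINESIZE 4000 TRIMSPOOL ON
SPOOL concept_relationship_manual9cm.txt
-- placeholder: SELECT ... FROM ... ;  (the statement given in load_stage.sql)
SPOOL OFF
EXIT
SQL
</code>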
==== Step 7 ====

=== Overview ===
Download the LOINC datasets. Decompress the downloaded files. Run the ETL script to create the staging tables schema and load the datasets.

=== Notes ===
Add the missing create table statement for LOINC_HIERARCHY, and add the missing columns to the LOINC_ANSWERS create statement, in create_source_tables.sql. The two added/updated tables are shown below:

CREATE TABLE LOINC_HIERARCHY
(
  PATH_TO_ROOT      VARCHAR2(4000),
  SEQUENCE          NUMBER(38,0),
  IMMEDIATE_PARENT  VARCHAR2(255),
  CODE              VARCHAR2(255),
  CODE_TEXT         VARCHAR2(255)
);

CREATE TABLE LOINC
(
  LOINC_NUM VARCHAR2(10), COMPONENT VARCHAR2(255), PROPERTY VARCHAR2(30), TIME_ASPCT VARCHAR2(15),
  SYSTEM VARCHAR2(100), SCALE_TYP VARCHAR2(30), METHOD_TYP VARCHAR2(50), CLASS VARCHAR2(20),
  SOURCE VARCHAR2(8), DATE_LAST_CHANGED DATE, CHNG_TYPE VARCHAR2(3), COMMENTS CLOB,
  STATUS VARCHAR2(11), CONSUMER_NAME VARCHAR2(255), MOLAR_MASS VARCHAR2(13), CLASSTYPE VARCHAR2(20),
  FORMULA VARCHAR2(255), SPECIES VARCHAR2(20), EXMPL_ANSWERS CLOB, ACSSYM CLOB,
  BASE_NAME VARCHAR2(50), NAACCR_ID VARCHAR2(20), CODE_TABLE VARCHAR2(10), SURVEY_QUEST_TEXT CLOB,
  SURVEY_QUEST_SRC VARCHAR2(50), UNITSREQUIRED VARCHAR2(1), SUBMITTED_UNITS VARCHAR2(30), RELATEDNAMES2 CLOB,
  SHORTNAME VARCHAR2(40), ORDER_OBS VARCHAR2(15), CDISC_COMMON_TESTS VARCHAR2(1), HL7_FIELD_SUBFIELD_ID VARCHAR2(50),
  EXTERNAL_COPYRIGHT_NOTICE CLOB, EXAMPLE_UNITS VARCHAR2(255), LONG_COMMON_NAME VARCHAR2(255), HL7_V2_DATATYPE VARCHAR2(255),
  HL7_V3_DATATYPE VARCHAR2(255), CURATED_RANGE_AND_UNITS CLOB, DOCUMENT_SECTION VARCHAR2(255), EXAMPLE_UCUM_UNITS VARCHAR2(255),
  EXAMPLE_SI_UCUM_UNITS VARCHAR2(255), STATUS_REASON VARCHAR2(9), STATUS_TEXT CLOB, CHANGE_REASON_PUBLIC CLOB,
  COMMON_TEST_RANK VARCHAR2(20), COMMON_ORDER_RANK VARCHAR2(20), COMMON_SI_TEST_RANK VARCHAR2(20), HL7_ATTACHMENT_STRUCTURE VARCHAR2(15)
);

=== Details ===
  - Run create_source_tables.sql.
  - Download the full set (http://loinc.org/downloads/files/loinc-table-csv-text-format/loinc-table-file-csv-text-format/download) and the multiaxial hierarchy (http://loinc.org/downloads/files/loinc-multiaxial-hierarchy/loinc-multiaxial-hierarchy-file/download).
  - Extract loinc.csv, map_to.csv, source_organization.csv and LOINC_250_MULTI-AXIAL_HIERARCHY.CSV.
  - Load them into LOINC, MAP_TO, SOURCE_ORGANIZATION and LOINC_HIERARCHY. Use the control files of the same name.
  - Load LOINC Answers and LOINC Forms:
  - Download LOINC_XXX_PanelsAndForms.zip from http://loinc.org/downloads/files/loinc-panels-and-forms-file/loinc-panels-and-forms-file-all-selected-panels-and-forms/download
  - Open LOINC_XXX_PanelsAndForms.xlsx and load the worksheet "ANSWERS" into the table LOINC_ANSWERS (clear the columns after DisplayText, save as Unicode text (UTF-8 without BOM) and use loinc_answers.ctl), and the worksheet "FORMS" into the table LOINC_FORMS (clear the columns after Loinc, save as Unicode text (UTF-8 without BOM) and use loinc_forms.ctl).
  - Open loinc_class.csv and load it into the table loinc_class.
  - Download SnomedCT_LOINC_TechnologyPreview_INT_xxxxxxxx.zip from https://loinc.org/news/draft-loinc-snomed-ct-mappings-and-expression-associations-now-available.html
  - Extract \RF2Release\Full\Refset\Content\xder2_scccRefset_MapCorrelationOriginFull_INT_xxxxxxxx.txt
  - Load it into scccRefset_MapCorrOrFull_INT using xder2_scccRefset_MapCorrelationOriginFull_INT.ctl.
  - Download LNCxxx_TO_CPT2005_MAPPINGS.zip from http://www.nlm.nih.gov/research/umls/mapping_projects/loinc_to_cpt_map.html
  - Extract MRSMAP.RRF and load it into CPT_MRSMAP using CPT_MRSMAP.ctl.
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

Staging table load statements:

sqlldr PRODV5/ CONTROL=LOINC.ctl LOG=LOINC.log BAD=LOINC.bad
sqlldr PRODV5/ CONTROL=MAP_TO.ctl LOG=MAP_TO.log BAD=MAP_TO.bad
sqlldr PRODV5/ CONTROL=SOURCE_ORGANIZATION.ctl LOG=SOURCE_ORGANIZATION.log BAD=SOURCE_ORGANIZATION.bad
sqlldr PRODV5/ CONTROL=LOINC_HIERARCHY.ctl LOG=LOINC_HIERARCHY.log BAD=LOINC_HIERARCHY.bad
sqlldr PRODV5/ CONTROL=LOINC_CLASS.ctl LOG=LOINC_CLASS.log BAD=LOINC_CLASS.bad
sqlldr PRODV5/ CONTROL=LOINC_ANSWERS.ctl LOG=LOINC_ANSWERS.log BAD=LOINC_ANSWERS.bad
sqlldr PRODV5/ CONTROL=xder2_scccRefset_MapCorrelationOriginFull_INT.ctl LOG=xder2_scccRefset_MapCorrelationOriginFull_INT.log BAD=xder2_scccRefset_MapCorrelationOriginFull_INT.bad
sqlldr PRODV5/ CONTROL=CPT_MRSMAP.ctl LOG=CPT_MRSMAP.log BAD=CPT_MRSMAP.bad

==== Step 8 ====

=== Overview ===
Download the MedDRA datasets. Decompress the downloaded files. Run the ETL script to create the staging tables schema and load the datasets. You must have a MedDRA license to download the data set. The processing of this data set follows the same approach as the other data feeds.

=== Details ===
  - Run create_source_tables.sql.
  - Download the current MedDRA release (English) from https://www.meddra.org/user?destination=downloads
  - Extract the files from the "MedAscii" folder and load them into the tables, using the control files of the same name:
    * hlgt.asc
    * hlgt_hlt.asc
    * hlt.asc
    * hlt_pt.asc
    * intl_ord.asc
    * llt.asc
    * mdhier.asc
    * pt.asc
    * SMQ_Content.asc
    * SMQ_List.asc
    * soc.asc
    * soc_hlgt.asc
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

==== Step 9 ====

=== Overview ===
Download the NDC-SPL datasets. Decompress the downloaded files. Run the ETL script to create the staging tables schema and load the datasets.

=== Notes ===
This data feed loads the NDC vocabulary and the SPL vocabularies.

=== Details ===
  - Run create_source_tables.sql.
  - Download the NDC code distribution file: open the site http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm
  - Download the latest NDC Database File.
  - Extract the product.txt file.
  - Load product.txt into PRODUCT using the control file of the same name.
  - Download the additional sources for SPL concepts and relationships from http://dailymed.nlm.nih.gov/dailymed/spl-resources-all-drug-labels.cfm and http://dailymed.nlm.nih.gov/dailymed/spl-resources-all-mapping-files.cfm:
    * Full Releases of HUMAN PRESCRIPTION LABELS, HUMAN OTC LABELS, HOMEOPATHIC LABELS and REMAINDER LABELS (first link)
    * SPL-RXNORM MAPPINGS (second link)
  - Extract the LABELS using unzipxml.sh and load the xml files using loadxml.ctl.
  - Extract rxnorm_mappings.zip and load rxnorm_mappings.txt using rxnorm_mappings.ctl.
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

Staging table load statement:

sqlldr PRODV5/ CONTROL=PRODUCT.ctl LOG=PRODUCT.log BAD=PRODUCT.bad

==== Step 10 ====

=== Overview ===
Download the READ code datasets. Decompress the downloaded files. Run the ETL script to create the staging tables schema and load the datasets.

=== Details ===
  - Run create_source_tables.sql.
  - Download the Read V2 (CTV2) code distribution file from https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/1/subpack/21/releases
  - Download the latest release.
  - Extract the nhs_readv2_xxxxx\V2\Unified\Keyv2.all file.
  - Download the Read V2 (CTV2) code to SNOMED CT mapping file nhs_datamigration_VV.0.0_YYYYMMDD000001.zip from https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/9/subpack/9/releases (subscription "NHS Data Migration"). "VV" stands for the READ CODE version number (this script loads READ CODE V2) and "YYYYMMDD" stands for the release date.
  - Download the latest release.
  - Extract nhs_datamigration_xxxxx\Mapping Tables\Updated\Clinically Assured\rcsctmap2_uk_YYYYMMDD000001.txt
  - Load Keyv2.all and rcsctmap2_uk_YYYYMMDD000001.txt into KEYV2 and RCSCTMAP2_UK using the control files of the same name.
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

Staging table load statements:

sqlldr PRODV5/ CONTROL=Keyv2.ctl LOG=Keyv2.log BAD=Keyv2.bad
sqlldr PRODV5/ CONTROL=rcsctmap2_uk.ctl LOG=rcsctmap2_uk.log BAD=rcsctmap2_uk.bad

==== Step 11 ====

=== Overview ===
Download the DRG dataset. Decompress the downloaded files. Run the ETL script to create the staging tables schema and load the datasets.

=== Details ===
  - Run create_source_tables.sql.
  - Download the "Files for FY" archives (Table 5 only) from http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Acute-Inpatient-Files-for-Download.html
    * 2011: http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Downloads/FY_2011_FR_Table_5.zip
    * 2012: http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Downloads/FY_12_NPRM_Table_5.zip
    * 2013: http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Downloads/FY_13_FR_Table_5.zip
    * 2014: http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Downloads/FY_14_FR_Table_5.zip
    * 2015: http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Downloads/FY2015-NPRM-Table-5.zip
  - Extract the *.txt files from the archives.
  - Rename them to FY2011.txt, FY2012.txt, ..., FY2015.txt and load them sequentially using the control files of the same names.
  - Run load_stage.sql.
  - generic_update.sql is NOT needed for this step.

Staging table load statements:

sqlldr PRODV5/ CONTROL=FY2011.ctl LOG=FY2011.log BAD=FY2011.bad
sqlldr PRODV5/ CONTROL=FY2012.ctl LOG=FY2012.log BAD=FY2012.bad
sqlldr PRODV5/ CONTROL=FY2013.ctl LOG=FY2013.log BAD=FY2013.bad
sqlldr PRODV5/ CONTROL=FY2014.ctl LOG=FY2014.log BAD=FY2014.bad
sqlldr PRODV5/ CONTROL=FY2015.ctl LOG=FY2015.log BAD=FY2015.bad

==== Step 12 ====

=== Overview ===
The MeSH datasets are contained in the UMLS data already loaded.

=== Details ===
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

==== Step 13 ====

=== Overview ===
The ICD10 datasets are contained in the UMLS data already loaded.

=== Details ===
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

==== Step 14 ====

=== Overview ===
The GCN_SEQNO datasets are contained in the RxNorm data already loaded.

=== Details ===
  - Run create_source_tables.sql.
  - Load NDDF_PRODUCT_INFO.TXT into NDDF_PRODUCT_INFO using the control file of the same name.
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

==== Step 15 ====

=== Overview ===
The ETC datasets are contained in "ETC sources.zip".

=== Details ===
  - Run create_source_tables.sql.
  - Unpack "ETC sources.zip".
  - Load all TXT files using the control files of the same name.
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

==== Step 16 ====

=== Overview ===
The Indication datasets are contained in "Indication sources.zip".

=== Details ===
  - Run create_source_tables.sql.
  - Unpack "Indication sources.zip".
  - Load all TXT files using the control files of the same name.
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

==== Step 17 ====

=== Overview ===
Download the ICD10CM datasets. Decompress the downloaded files. Run the ETL script to create the staging tables schema and load the datasets.

=== Details ===
  - Run create_source_tables.sql.
  - Download ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/20xx/ICD10CM_FY20xx_code_descriptions.zip
  - Extract icd10cm_order_20xx.txt and rename it to icd10cm.txt.
  - Load it into icd10cm_table with icd10cm.ctl.
  - Unzip icd10cm2snomed.zip and load it into icd10cm2snomed with icd10cm2snomed.ctl.
  - Run load_stage.sql.
  - Run generic_update.sql (from the working directory).

==== Step 18 ====

=== Overview ===
In the Final_Assembly directory, run the two SQL scripts to populate the concept_ancestor and drug_strength tables.

=== Details ===
  - Run concept_ancestor.sql.
  - Run drug_strength.sql.

==== Step 19 ====

=== Overview ===
In the "working" directory, run the manual change scripts.

=== Note ===
These scripts are collections of ad-hoc inserts/updates that have been applied over time. Before running each block of SQL statements, review them against the current V5 vocabulary table values to avoid potential duplicate inserts or inappropriate updates. Contact Christian Reich and the Athena vocabulary team for more information on these scripts.

=== Details ===
  - Run manual_changes.sql.
  - Run manual_changes_15-Mar-2015.sql.

==== Step 20 ====

=== Overview ===
In the "working" directory, run the script that updates the V4 schema tables with the new data in the V5 schema tables.

=== Notes ===
The V4 tables and the V5 tables should already exist and be populated with data. The PRODV4 user should have been granted read access to the PRODV5 schema tables.

=== Details ===
  - FYI, the V4 schema is here: https://github.com/OHDSI/CommonDataModel/blob/master/Version4/Oracle/CDM%20V4%20DDL.sql
  - Run v5-to-v4.sql to update the PRODV4 tables with the new data from the PRODV5 tables.
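Steps 18 to 20 can be run back-to-back from the command line. This is only a sketch: it assumes the scripts are run from the repository directories, that the PRODV5/PRODV4 passwords are exported as $PRODV5_PWD/$PRODV4_PWD, and that v5-to-v4.sql is executed in the PRODV4 schema (which, as noted above, must already have read access to the PRODV5 tables).

<code bash>
# Sketch: final assembly (Step 18), manual changes (Step 19) and the V4 refresh (Step 20).
cd /vocabETL/Vocabulary-v5.0-master/Final_Assembly
sqlplus "PRODV5/${PRODV5_PWD}" @concept_ancestor.sql
sqlplus "PRODV5/${PRODV5_PWD}" @drug_strength.sql
cd ../working
# Review each block of the manual-change scripts before running them (see the Step 19 note).
sqlplus "PRODV5/${PRODV5_PWD}" @manual_changes.sql
sqlplus "PRODV5/${PRODV5_PWD}" @manual_changes_15-Mar-2015.sql
sqlplus "PRODV4/${PRODV4_PWD}" @v5-to-v4.sql
</code>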