This model decouples the data sources from the various copies of the sources that might have been processed in many different ways. It also decouples what can be said about an evidence item (i.e., the semantic tags) from the information artifact. All of this allows for greater flexibility with respect to inclusion of sources and
  
This table shows the web sites to visit and the files to download for this ETL process.
Note. There are additional feeds under development, but the ones shown below are the initial core set.
  
^ Data Feed ^ Website ^ Website Download Page ^ Example Download File Name ^ Notes ^
  
This directory contains the zipped source code to deploy and run a URL shortening service on a local server for resolving shortened URLs.

==== “Terminology-mappings” directory ====

This directory contains a number of data sets ultimately used to map terminologies from the source terms to the OMOP CDM standard terms.

==== ETL Process Overview ====
==== ETL Process Steps ====
  
==== Prerequisites ====
Ensure that all of the prerequisite hardware, DBMSs, and the URL Shortener service have been deployed.
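
The following is a minimal sketch of how the deployments could be sanity-checked from a shell before starting any feed. It assumes default local ports, that the Virtuoso, MySQL, and PostgreSQL command-line clients are on the PATH, and that you substitute your own credentials; the URL Shortener check is omitted because its endpoint depends on how that service was deployed.

<code>
# Virtuoso: should print server status if the SQL port (default 1111) is up
isql 1111 dba <password> exec="status();"

# MySQL: used by the URL Shortener and the SemMED staging load
mysql -u root -p -e "SELECT VERSION();"

# PostgreSQL: hosts the LAERTES public schema tables
psql -U postgres -c "SELECT version();"
</code>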
  
==== EU SPC ADR data feed ====
European Union Adverse Drug Reactions from Summary of Product Characteristics (EU SPC) Database Import.
  
=== Overview ===
  
  * Download the EU SPC data sets (see the data sources table)
  * Run Python scripts to convert the data into RDF N-Triples graph data
  * Load the RDF N-Triples graph data into the Virtuoso database (see the loader sketch after this list)
  * Manually run a Virtuoso SPARQL query to export the drug/HOI combinations along with the adverse event counts into an export file
  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command-line client
  * Run a Python script to load the export file into the PostgreSQL public schema database
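
As a rough illustration of the Virtuoso load step, the generated N-Triples files can be registered with the built-in bulk loader from isql. The directory path and graph IRI below are placeholders; the actual output locations and graph names are defined by the conversion scripts in the GitHub repository, and the directory must be listed in Virtuoso's DirsAllowed setting.

<code>
# Register the N-Triples files for loading into a named graph (placeholder path and graph IRI)
isql 1111 dba <password> exec="ld_dir('/path/to/euspc-ntriples', '*.nt', 'http://example.org/graph/euspc');"

# Run the bulk loader and persist the result
isql 1111 dba <password> exec="rdf_loader_run();"
isql 1111 dba <password> exec="checkpoint;"
</code>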
  
=== Details ===
  
The details for this data feed are documented and maintained here:
https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/EuSPC
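
The conversion and loading scripts referenced in the Overview live in the OHDSI KnowledgeBase repository; one way to obtain them locally is a plain git clone (repository URL taken from the link above, directory layout as published there):

<code>
git clone https://github.com/OHDSI/KnowledgeBase.git
cd KnowledgeBase/LAERTES/EuSPC
</code>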
  
==== PUBMED MEDLINE data feed ====
MEDLINE records for indexed literature reporting adverse drug events.
  
=== Overview ===
  
  * Download the PUBMED/MEDLINE data sets (see the data sources table)
  * Run Python scripts to convert the data into RDF N-Triples graph data
  * Load the RDF N-Triples graph data into the Virtuoso database
  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command-line client
  * Manually run a Virtuoso SPARQL query to export the drug/HOI combinations along with the adverse event counts into an export file (see the endpoint sketch after this list)
  * Run a Python script to load the export file into the PostgreSQL public schema database
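
The SPARQL export is run manually against the Virtuoso SPARQL endpoint. The sketch below shows the general shape of exporting query results as CSV over HTTP; the host, port, output file name, and the query itself are placeholders, and the real drug/HOI export queries are maintained with this feed in the GitHub repository.

<code>
# POST a query to the local Virtuoso SPARQL endpoint and save the result as CSV
curl -o pubmed-drug-hoi-export.csv \
     --data-urlencode "query=SELECT ?g (COUNT(*) AS ?triples) WHERE { GRAPH ?g { ?s ?p ?o } } GROUP BY ?g" \
     --data-urlencode "format=text/csv" \
     http://localhost:8890/sparql
</code>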
  
=== Details ===
  
The details for this data feed are documented and maintained here:
https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/PubMed

==== SPLICER SPL data feed ====
SPLICER Natural Language Processing extracted Adverse Drug Events from FDA Structured Product Labels (SPLs).
  
=== Overview ===
  
  * Download the SPLICER data sets (see the data sources table)
  * Run Python scripts to convert the data into RDF N-Triples graph data
  * Load the RDF N-Triples graph data into the Virtuoso database
  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command-line client
  * Manually run a Virtuoso SPARQL query to export the drug/HOI combinations along with the adverse event counts into an export file
  * Run a Python script to load the export file into the PostgreSQL public schema database (a manual-load sketch follows this list)
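
The repository's Python script handles the PostgreSQL load, but an equivalent manual load with psql looks roughly like the following. The database, role, table, and file names are placeholders and should be replaced with the names used by your LAERTES deployment; a tab-delimited export is assumed (the default for PostgreSQL's text COPY format).

<code>
# Bulk-load the exported drug/HOI file into a staging table in the public schema (placeholder names)
psql -U laertes -d laertes \
     -c "\copy public.splicer_drug_hoi_evidence FROM 'splicer-export.tsv' WITH (FORMAT text)"
</code>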
  
=== Details ===
  
The details for this data feed are documented and maintained here:
https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/SPLICER

==== SemMED data feed ====
The Semantic MEDLINE Database is a repository of semantic predications (subject-predicate-object triples) extracted by SemRep, a semantic interpreter of biomedical text.
  
=== Overview ===
  
-Downloadthe ICD9CM datasets. Decompress ​the gzipped files. ​Run the ETL script ​to create ​the  ​staging tables schema and load the datasets.+  * Download the SemMED Database MySQL data dump file (see data sources table) 
 +  * Load the MySQL database dump file into a MySQL database (this typically takes hours to load) 
 +  * Run python scripts to convert ​the data into RDF ntriple graph data 
 +  * Load the RDF ntriple graph data into the Virtuoso database 
 +  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command line client 
 +  * Manually run Virtuoso SPARQL query to export ​the drug/hoi combinations along with the adverse event counts into an export file 
 +  * Run Python script to load the export file into the PostgreSQL public schema database
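
Restoring the SemMED dump is a standard MySQL import. The database name and dump file name below are placeholders for whichever SemMedDB release was downloaded; expect the import to run for several hours on typical hardware.

<code>
# Create a target database and stream the downloaded dump into it (placeholder names)
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS semmed DEFAULT CHARACTER SET utf8;"
mysql -u root -p semmed < semmed_dump.sql
</code>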
  
=== Details ===
  
The details for this data feed are documented and maintained here:
https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/SemMED