documentation:laertes_etl

Differences

This shows you the differences between two versions of the page.

Previous revision: documentation:laertes_etl [2015/05/26 18:18] lee [Step 1]
Current revision: documentation:laertes_etl [2015/06/23 10:58] lee
Line 33: Line 33:
 This model decouples the data sources from the various copies of the
 sources that might have been processed in many different ways. It also
-decouples what can be said about and evidence item (i.e., the semantic
+decouples what can be said about an evidence item (i.e., the semantic
 tags) from the information artifact. All of this allows for greater
 flexibility with respect to inclusion of sources and
Line 173: Line 173:
 ==== “Terminology-mappings” directory ====

-This directory contains a number of data sets used to ultimately map terminologies from the sources terms to the OMOP CDM standard terms.
+This directory contains a number of data sets ultimately used to map terminologies from the source terms to the OMOP CDM standard terms.

 ==== ETL Process Overview ====
Line 186: Line 186:
 Ensure that all of the prerequisite hardware, DBMSs and the URL Shortener service have been deployed.

-==== Step 1 ====
+==== EU SPC ADR data feed ====
 European Union Adverse Drug Reactions from Summary of Product Characteristics (EU SPC) Database Import

 === Overview ===
  
-  * Download, the EU SPC datasets
-  * Convert the data into RDF ntriple graph data
+  * Download the EU SPC data sets (see data sources table)
+  * Run Python scripts to convert the data into RDF ntriple graph data
   * Load the RDF ntriple graph data into the Virtuoso database (a bulk-load sketch is shown below)
-  * Use the Virtuoso SPARQL functionality to run a SPARQL query exporting the drug/hoi combinations along with the adverse event counts into an export file
-  * Load the export file into the PostgreSQL public schema database
+  * Manually run a Virtuoso SPARQL query to export the drug/hoi combinations along with the adverse event counts into an export file
+  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command line client
+  * Run a Python script to load the export file into the PostgreSQL public schema database
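For reference, a minimal sketch of the Virtuoso bulk-load and sanity-check step, run from the isql command line. The directory, file mask and graph IRI are illustrative assumptions; the real values come from the EuSPC scripts in the LAERTES repository.

<code>
-- Register the ntriple files for bulk loading (directory and graph IRI are examples only)
ld_dir('/data/laertes/euspc', '*.nt', 'http://example.org/graph/euspc');
-- Run the bulk loader and make the load durable
rdf_loader_run();
checkpoint;
-- Quick sanity check that triples landed in the target graph
SPARQL SELECT COUNT(*) FROM <http://example.org/graph/euspc> WHERE { ?s ?p ?o };
</code>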
  
 === Details ===
Line 201: Line 202:
 The details for this data feed are documented and maintained here:
 https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/EuSPC
-==== Step 2 ==== 
-  
-=== Overview === 
- 
-Download the SNOMED datasets from the website. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
- 
-=== Notes === 
- 
-  * Step 14 of load_stage.sql: update concept_stage from concept. Before running this step, run the statement below so that the column sizes match and Oracle will use the concept_code indexes for the join.
-<​code>​ 
-alter table concept_stage modify ​ (concept_code varchar2(50));​ 
-</​code>​ 
- 
-The international SNOMED-CT and the UK SNOMED-CT will be loaded into separate staging tables and then unioned together in views, ready for processing in the load_stage.sql step below (a sketch of such a view follows).
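For illustration, a combined view over the international and UK staging tables might look like the sketch below; the view name is hypothetical, and the real view definitions are created by the scripts in Vocabulary-v5.0\01-SNOMED.

<code>
-- Hypothetical sketch: expose both releases to load_stage.sql through one view
CREATE OR REPLACE VIEW sct2_concept_full_merged AS
SELECT * FROM sct2_concept_full_int
UNION
SELECT * FROM sct2_concept_full_uk;
</code>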
-              
-=== Details === 
- 
-  - Run create_source_tables.sql  ​ 
-  - Download the international SNOMED file SnomedCT_Release_INT_YYYYMMDD.zip from http://​www.nlm.nih.gov/​research/​umls/​licensedcontent/​snomedctfiles.html. 
-  - Extract the release date from the file name. 
-  - Extract the following files from the folder: 
-    * SnomedCT_Release_INT_YYYYMMDD\RF2Release\Full\Terminology:​ 
-    * sct2_Concept_Full_INT_YYYYMMDD.txt 
-    * sct2_Description_Full-en_INT_YYYYMMDD.txt 
-    * sct2_Relationship_Full_INT_YYYYMMDD.txt Remove date from file name. 
-  - Load them into SCT2_CONCEPT_FULL_INT,​ SCT2_DESC_FULL_EN_INT and SCT2_RELA_FULL_INT. Use the control files of the same name. 
-  - Download the British SNOMED file SNOMEDCT2_XX.0.0_YYYYMMDD000001.zip from https://​isd.hscic.gov.uk/​trud3/​user/​authenticated/​group/​0/​pack/​26/​subpack/​102/​releases 
-  - Extract the release date from the file name. 
-  - Extract the following files from the folder SnomedCT2_GB1000000_YYYYMMDD\RF2Release\Full\Terminology into a working folder: 
-    * sct2_Concept_Full_GB1000000_YYYYMMDD.txt
-    * sct2_Description_Full-en-GB_GB1000000_YYYYMMDD.txt
-    * sct2_Relationship_Full_GB1000000_YYYYMMDD.txt Remove date from file name.
-  - Load them into SCT2_CONCEPT_FULL_UK, SCT2_DESC_FULL_UK, SCT2_RELA_FULL_UK. Use the control files in Vocabulary-v5.0\01-SNOMED
-  - Extract der2_cRefset_AssociationReferenceFull_INT_YYYYMMDD.txt from SnomedCT_Release_INT_YYYYMMDD\RF2Release\Full\Refset\Content and der2_cRefset_AssociationReferenceFull_GB1000000_YYYYMMDD.txt from SnomedCT2_GB1000000_YYYYMMDD\RF2Release\Full\Refset\Content. Remove date from file name.
-  - Load them into der2_cRefset_AssRefFull_INT and der2_cRefset_AssRefFull_UK. (Rename the data files to remove any YYYYMMDD suffix so they match the file names in the .ctl files.)
- 
-Staging table load statements 
- 
-<​code>​ 
-sqlldr PRODV5/<​password>​ CONTROL=SCT2_CONCEPT_FULL_INT.ctl LOG=SCT2_CONCEPT_FULL_INT.log BAD=SCT2_CONCEPT_FULL_INT.bad ​ 
-sqlldr PRODV5/<​password>​ CONTROL=SCT2_DESC_FULL_EN_INT.ctl LOG=SCT2_DESC_FULL_EN_INT.log BAD=SCT2_DESC_FULL_EN_INT.bad ​ 
-sqlldr PRODV5/<​password> ​ CONTROL=SCT2_RELA_FULL_INT.ctl LOG=SCT2_RELA_FULL_INT.log BAD=SCT2_RELA_FULL_INT.bad ​ 
-sqlldr PRODV5/<password> CONTROL=SCT2_CONCEPT_FULL_UK.ctl LOG=SCT2_CONCEPT_FULL_UK.log BAD=SCT2_CONCEPT_FULL_UK.bad
-sqlldr PRODV5/<​password>​ CONTROL=SCT2_DESC_FULL_UK.ctl LOG=SCT2_DESC_FULL_UK.log BAD=SCT2_DESC_FULL_UK.bad ​ 
-sqlldr PRODV5/<​password>​ CONTROL=SCT2_RELA_FULL_UK.ctl LOG=SCT2_RELA_FULL_UK.log BAD=SCT2_RELA_FULL_UK.bad ​ 
-sqlldr PRODV5/<​password>​ CONTROL=der2_cRefset_AssRefFull_INT.ctl LOG=der2_cRefset_AssRefFull_INT.log BAD=der2_cRefset_AssRefFull_INT.bad ​ 
-sqlldr PRODV5/<​password>​ CONTROL=der2_cRefset_AssRefFull_UK.ctl LOG=der2_cRefset_AssRefFull_UK.log BAD=der2_cRefset_AssRefFull_UK.bad ​ 
-</​code>​ 
- 
-  - Run load_stage.sql 
-  - Run generic_update.sql (from working directory) 
  
-==== Step 3 ====
+==== PUBMED / MEDLINE data feed ====
+MEDLINE records for indexed literature reporting adverse drug events
  
 === Overview ===
  
-Download the RxNorm datasets from the website. Decompress the gzipped files. Run the ETL script to create the database staging tables schema and load the datasets.
-
-=== Notes ===
-
-Update the vocabulary latest_update date statement in create_source_tables.sql to the correct date for the file you are loading. Make this change in load_stage.sql for each data feed.
-
-This ETL step processes the RxNorm, ATC, NDFRT, VA Product and VA Class vocabularies.
-
-At step 15 of create_source_tables.sql you must run the generic_update.sql script (from the "working" directory) to load the RxNorm concepts from the concept staging tables. You also need to run that same script after the load_stage.sql script finishes.
+  * Download the PUBMED/MEDLINE data sets (see data sources table)
+  * Run Python scripts to convert the data into RDF ntriple graph data
+  * Load the RDF ntriple graph data into the Virtuoso database
+  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command line client
+  * Manually run a Virtuoso SPARQL query to export the drug/hoi combinations along with the adverse event counts into an export file
+  * Run a Python script to load the export file into the PostgreSQL public schema database (see the sketch below)
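The final load is performed by a Python script in the LAERTES repository; purely as an illustration of what it does, here is a hedged psql equivalent in which the table and file names are hypothetical.

<code>
-- Hypothetical sketch only: bulk-load the exported drug/HOI count file into the public schema
-- (run via psql; for a client-side file use \copy instead)
COPY public.pubmed_drug_hoi_counts
FROM '/data/laertes/pubmed-drug-hoi-counts.tsv'
WITH (FORMAT text, DELIMITER E'\t');
</code>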
  
 === Details ===
  
-  - Run create_source_tables.sql
-  - Download RxNorm_full_MMDDYYYY.zip from http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html
-  - Extract the files from the rrf folder and load them into the tables RXNATOMARCHIVE, RXNCONSO, RXNCUI, RXNCUICHANGES, RXNDOC, RXNREL, RXNSAB, RXNSAT and RXNSTY. Use the control files of the same name.
-  - Run load_stage.sql
-  - Run generic_update.sql (from working directory)
-
-==== Step 4 ====
+The details for this data feed are documented and maintained here:
+https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/PubMed
+==== SPLICER SPL data feed ====
+SPLICER Natural Language Processing-extracted Adverse Drug Events from FDA Structured Product Labels (SPLs)
  
 === Overview ===
  
-Download the HCPCS datasets from the website. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
-
-=== Notes ===
-
-load_stage.sql steps 9 and 10 are completely manual: a select statement to generate the set of Medical Coder rows, which needs to be exported as a file, followed by the instruction “Append resulting file from Medical Coder (in concept_relationship_stage format) to concept_relationship_stage”. In the future this could be converted into an insert append statement. Note: the original HCPCS HCPYYYY_CONTR_ANWEB_V2.txt file is a fixed-field-format file that will not load with the ANWEB_V2.ctl file, which requires a comma-separated file, so see the details below on creating a CSV file to use for the load.
+  * Download the SPLICER data sets (see data sources table)
+  * Run Python scripts to convert the data into RDF ntriple graph data
+  * Load the RDF ntriple graph data into the Virtuoso database
+  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command line client (see the sketch below)
+  * Manually run a Virtuoso SPARQL query to export the drug/hoi combinations along with the adverse event counts into an export file
+  * Run a Python script to load the export file into the PostgreSQL public schema database
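The URL Shortener load is done by hand with the MySQL command line client; a rough sketch follows, in which the database, table and column names are assumptions rather than the real URL Shortener schema.

<code>
-- Hypothetical sketch only: the real layout is defined by the URL Shortener service
USE urlshortener;
LOAD DATA LOCAL INFILE '/data/laertes/splicer-annotation-uris.tsv'
INTO TABLE short_urls
FIELDS TERMINATED BY '\t'
(short_code, long_url);
</code>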
  
 === Details ===
  
-  - Run create_source_tables.sql
-  - Download the latest file YYYY-Annual-Alpha-Numeric-HCPCS-File.zip from http://www.cms.gov/Medicare/Coding/HCPCSReleaseCodeSets/Alpha-Numeric-HCPCS.html
-  - Extract HCPCYYYY_CONTR_ANWEB_v2.xlsx
-  - Open the above file in Excel and export a CSV file (with double-quoted fields) called HCPC2015_CONTR_ANWEB_v2.csv for loading into the ANWEB_V2 staging table.
-  - Load the file data into ANWEB_V2. Use the control file of the same name.
-  - Run load_stage.sql
-  - Run generic_update.sql (from working directory)
-
-Staging table load statements
-
-<code>
-sqlldr PRODV5/<password> CONTROL=ANWEB_V2.ctl DATA=HCPC2015_CONTR_ANWEB_v2.csv LOG=ANWEB_V2.log BAD=ANWEB_V2.bad
-</code>
-
-==== Step 5 ====
+The details for this data feed are documented and maintained here:
+https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/SPLICER
+==== SemMED data feed ====
+The Semantic MEDLINE Database is a repository of semantic predications (subject-predicate-object triples) extracted by SemRep, a semantic interpreter of biomedical text.
  
 === Overview ===
  
-Downloadthe ICD9CM datasets. Decompress ​the gzipped files. ​Run the ETL script ​to create ​the  ​staging tables schema and load the datasets.+  * Download the SemMED Database MySQL data dump file (see data sources table) 
 +  * Load the MySQL database dump file into a MySQL database (this typically takes hours to load) 
 +  * Run python scripts to convert ​the data into RDF ntriple graph data 
 +  * Load the RDF ntriple graph data into the Virtuoso database 
 +  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command line client 
 +  * Manually run Virtuoso SPARQL query to export ​the drug/hoi combinations along with the adverse event counts into an export file 
 +  * Run Python script to load the export file into the PostgreSQL public schema database
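A minimal sketch of the dump load from the MySQL command line client, assuming illustrative database and dump file names (the real file name depends on the SemMedDB release that was downloaded):

<code>
-- Create a target database and replay the dump; this can take several hours
CREATE DATABASE semmed;
USE semmed;
SOURCE /data/laertes/semmed_dump.sql;
</code>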
  
 === Details ===
  
+The details for this data feed are documented and maintained here:
+https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/SemMED
-  - Run create_source_tables.sql
-  - Download ICD-9-CM-vXX-master-descriptions.zip from http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html
-  - Extract CMSXX_DESC_LONG_DX.txt and CMSXX_DESC_SHORT_DX.txt +
-  - Load them into CMS_DESC_LONG_DX and CMS_DESC_SHORT_DX. Use the control files of the same name. +
-  - Run load_stage.sql +
-  - Run generic_update.sql (from working directory) +
- +
-Staging table load statements +
-<​code>​ +
-sqlldr PRODV5/<​password>​ CONTROL=CMS32_DESC_LONG_DX.ctl LOG=CMS32_DESC_LONG_DX.log BAD=CMS32_DESC_LONG_DX.bad  +
-sqlldr PRODV5/<​password>​ CONTROL=CMS32_DESC_SHORT_DX.ctl LOG=CMS32_DESC_SHORT_DX.log BAD=CMS32_DESC_SHORT_DX.bad  +
-</​code>​ +
- +
-==== Step 6 ==== +
- +
-=== Overview === +
- +
-Download the ICD9Proc datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
- +
-=== Notes === +
- +
-load_stage.sql steps 8 and 9 are completely manual: a select statement to generate the set of Medical Coder rows, which need to be exported as a file called 'concept_relationship_manual9cm'. In the future these two steps could be converted into an insert append sub-select statement (see the sketch below).
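As the note suggests, the manual export/append could eventually become a direct insert; a rough sketch follows, assuming the Medical Coder output has first been loaded into a work table with the same layout as concept_relationship_stage (the work table name is hypothetical).

<code>
-- Hypothetical: medical_coder_output must have the concept_relationship_stage layout
INSERT /*+ APPEND */ INTO concept_relationship_stage
SELECT * FROM medical_coder_output;
COMMIT;
</code>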
- +
-  - Run create_source_tables.sql +
-  - Download from ICD-9-CM-vXX-master-descriptions.zip from http://​www.cms.gov/​Medicare/​Coding/​ICD9ProviderDiagnosticCodes/​codes.html +
-  - Extract CMSXX_DESC_LONG_SG.txt and CMSXX_DESC_SHORT_SG.txt +
-  - Load them into CMS_DESC_LONG_SG and CMS_DESC_SHORT_SG. Use the control files of the same name. +
-  - Run load_stage.sql +
-  - Run generic_update.sql (from working directory) +
- +
-Staging table load statements +
- +
-<​code>​sqlldr PRODV5/<​password>​ CONTROL=CMS32_DESC_LONG_SG.ctl LOG=CMS32_DESC_LONG_SG.log BAD=CMS32_DESC_LONG_SG.bad  +
-sqlldr PRODV5/<​password>​ CONTROL=CMS32_DESC_SHORT_SG.ctl LOG=CMS32_DESC_SHORT_SG.log BAD=CMS32_DESC_SHORT_SG.bad  +
-</​code>​ +
- +
-==== Step 7 ==== +
- +
-=== Overview === +
- +
-Download the LOINC datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
- +
-=== Notes === +
- +
-Add the missing create table statement for LOINC_HIERARCHY and add the missing columns to the LOINC_ANSWERS create statement in create_source_tables.sql. The two added/updated tables are shown below:
-<​code>​ +
-CREATE TABLE LOINC_HIERARCHY +
-+
-  PATH_TO_ROOT ​ VARCHAR2(4000),​  +
-  SEQUENCE NUMBER(38,​0),​  +
-  IMMEDIATE_PARENT VARCHAR2(255),​  +
-  CODE VARCHAR2(255),​  +
-  CODE_TEXT VARCHAR2(255) +
-); +
- +
- +
-CREATE TABLE LOINC +
-+
-  LOINC_NUM ​                 VARCHAR2(10),​ +
-  COMPONENT ​                 VARCHAR2(255),​ +
-  PROPERTY ​                  ​VARCHAR2(30),​ +
-  TIME_ASPCT ​                ​VARCHAR2(15),​ +
-  SYSTEM ​                    ​VARCHAR2(100),​ +
-  SCALE_TYP ​                 VARCHAR2(30),​ +
-  METHOD_TYP ​                ​VARCHAR2(50),​ +
-  CLASS                      VARCHAR2(20),​ +
-  SOURCE ​                    ​VARCHAR2(8),​ +
-  DATE_LAST_CHANGED ​         DATE, +
-  CHNG_TYPE ​                 VARCHAR2(3),​ +
-  COMMENTS ​                  ​CLOB,​ +
-  STATUS ​                    ​VARCHAR2(11),​ +
-  CONSUMER_NAME ​             VARCHAR2(255),​ +
-  MOLAR_MASS ​                ​VARCHAR2(13),​ +
-  CLASSTYPE ​                 VARCHAR2(20),​ +
-  FORMULA ​                   VARCHAR2(255),​ +
-  SPECIES ​                   VARCHAR2(20),​ +
-  EXMPL_ANSWERS ​             CLOB, +
-  ACSSYM ​                    ​CLOB,​ +
-  BASE_NAME ​                 VARCHAR2(50),​ +
-  NAACCR_ID ​                 VARCHAR2(20),​ +
-  CODE_TABLE ​                ​VARCHAR2(10),​ +
-  SURVEY_QUEST_TEXT ​         CLOB, +
-  SURVEY_QUEST_SRC ​          ​VARCHAR2(50),​ +
-  UNITSREQUIRED ​             VARCHAR2(1),​ +
-  SUBMITTED_UNITS ​           VARCHAR2(30),​ +
-  RELATEDNAMES2 ​             CLOB, +
-  SHORTNAME ​                 VARCHAR2(40),​ +
-  ORDER_OBS ​                 VARCHAR2(15),​ +
-  CDISC_COMMON_TESTS ​        ​VARCHAR2(1),​ +
-  HL7_FIELD_SUBFIELD_ID ​     VARCHAR2(50),​ +
-  EXTERNAL_COPYRIGHT_NOTICE ​ CLOB, +
-  EXAMPLE_UNITS ​             VARCHAR2(255),​ +
-  LONG_COMMON_NAME ​          ​VARCHAR2(255),​ +
-  HL7_V2_DATATYPE ​           VARCHAR2(255),​ +
-  HL7_V3_DATATYPE ​           VARCHAR2(255),​ +
-  CURATED_RANGE_AND_UNITS ​   CLOB, +
-  DOCUMENT_SECTION ​          ​VARCHAR2(255),​ +
-  EXAMPLE_UCUM_UNITS ​        ​VARCHAR2(255),​ +
-  EXAMPLE_SI_UCUM_UNITS ​     VARCHAR2(255),​ +
-  STATUS_REASON ​             VARCHAR2(9),​ +
-  STATUS_TEXT ​               CLOB, +
-  CHANGE_REASON_PUBLIC ​      ​CLOB,​ +
-  COMMON_TEST_RANK ​          ​VARCHAR2(20),​ +
-  COMMON_ORDER_RANK ​         VARCHAR2(20),​ +
-  COMMON_SI_TEST_RANK ​       VARCHAR2(20),​ +
-  HL7_ATTACHMENT_STRUCTURE ​  ​VARCHAR2(15) +
-); +
-</​code>​ +
- +
-=== Details === +
- +
-  - Run create_source_tables.sql +
-  - Download full set (http://​loinc.org/​downloads/​files/​loinc-table-csv-text-format/​loinc-table-file-csv-text-format/​download) and multiaxial hierarchy (http://​loinc.org/​downloads/​files/​loinc-multiaxial-hierarchy/​loinc-multiaxial-hierarchy-file/​download) +
-  - Extract loinc.csv, map_to.csv, source_organization.csv and LOINC_250_MULTI-AXIAL_HIERARCHY.CSV +
-  - Load them into LOINC, MAP_TO, SOURCE_ORGANIZATION and LOINC_HIERARCHY. Use the control files of the same name. +
-  - Load LOINC Answers - Load LOINC_XXX_SELECTED_FORMS.zip from http://​loinc.org/​downloads/​files/​loinc-panels-and-forms-file/​loinc-panels-and-forms-file-all-selected-panels-and-forms/​download +
-  - Open LOINC_XXX_SELECTED_FORMS.xlsx and load worksheet "​ANSWERS"​ to table LOINC_ANSWERS +
-  - Open loinc_class.csv and load it into table loinc_class +
-  - Run load_stage.sql +
-  - Run generic_update.sql (from working directory) +
- +
-Staging table load statements +
-<​code>​ +
-sqlldr PRODV5/<password> CONTROL=LOINC.ctl LOG=LOINC.log BAD=LOINC.bad
-sqlldr PRODV5/<​password>​ CONTROL=MAP_TO.ctl LOG=MAP_TO.log BAD=MAP_TO.bad  +
-sqlldr PRODV5/<​password>​ CONTROL=SOURCE_ORGANIZATION.ctl LOG=SOURCE_ORGANIZATION.log BAD=SOURCE_ORGANIZATION.bad  +
-sqlldr PRODV5/<​password>​ CONTROL=LOINC_HIERARCHY.ctl LOG=LOINC_HIERARCHY.log BAD=LOINC_HIERARCHY.bad  +
-sqlldr PRODV5/<​password>​ CONTROL=LOINC_CLASS.ctl LOG=LOINC_CLASS.log BAD=LOINC_CLASS.bad  +
-sqlldr PRODV5/<​password>​ CONTROL=LOINC_ANSWERS.ctl LOG=LOINC_ANSWERS.log BAD=LOINC_ANSWERS.bad  +
-</​code>​ +
-==== Step 8 ==== +
- +
-=== Overview === +
- +
-Download the MEDDRA datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
- +
-You must have a MedDRA license to download the data set.  The processing of this data set follows the same approach as the other data feeds. +
- +
-=== Details === +
- +
-  - Run create_source_tables.sql +
-  - Download the current MedDRA release (English) from https://www.meddra.org/user?destination=downloads
-  - Extract the files hlgt.asc, hlgt_hlt.asc, hlt.asc, hlt_pt.asc, intl_ord.asc, llt.asc, mdhier.asc, pt.asc, SMQ_Content.asc, SMQ_List.asc, soc.asc and soc_hlgt.asc from the "MedAscii" folder and load them into the tables of the same name. Use the control files of the same name (see the load sketch below).
-  - Run load_stage.sql +
-  - Run generic_update.sql (from working directory) +
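Staging table load statements (illustrative; these follow the same sqlldr pattern as the other feeds and assume control files named after each .asc file):

<code>
sqlldr PRODV5/<password> CONTROL=hlgt.ctl LOG=hlgt.log BAD=hlgt.bad
sqlldr PRODV5/<password> CONTROL=hlgt_hlt.ctl LOG=hlgt_hlt.log BAD=hlgt_hlt.bad
sqlldr PRODV5/<password> CONTROL=hlt.ctl LOG=hlt.log BAD=hlt.bad
sqlldr PRODV5/<password> CONTROL=hlt_pt.ctl LOG=hlt_pt.log BAD=hlt_pt.bad
sqlldr PRODV5/<password> CONTROL=intl_ord.ctl LOG=intl_ord.log BAD=intl_ord.bad
sqlldr PRODV5/<password> CONTROL=llt.ctl LOG=llt.log BAD=llt.bad
sqlldr PRODV5/<password> CONTROL=mdhier.ctl LOG=mdhier.log BAD=mdhier.bad
sqlldr PRODV5/<password> CONTROL=pt.ctl LOG=pt.log BAD=pt.bad
sqlldr PRODV5/<password> CONTROL=SMQ_Content.ctl LOG=SMQ_Content.log BAD=SMQ_Content.bad
sqlldr PRODV5/<password> CONTROL=SMQ_List.ctl LOG=SMQ_List.log BAD=SMQ_List.bad
sqlldr PRODV5/<password> CONTROL=soc.ctl LOG=soc.log BAD=soc.bad
sqlldr PRODV5/<password> CONTROL=soc_hlgt.ctl LOG=soc_hlgt.log BAD=soc_hlgt.bad
</code>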
- +
-==== Step 9 ==== +
- +
-=== Overview ===
- +
-Download the NDC-SPL datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
- +
-=== Notes === +
- +
-This data feed loads the NDC vocabulary and the SPL vocabularies. ​  +
- +
-=== Details === +
- +
-  - Run create_source_tables.sql +
-  - Open the NDC code distribution site http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm
-  - Download the latest NDC Database File. +
-  - Extract the product.txt file. +
-  - Load product.txt into PRODUCT using control file of the same name +
-  - Run load_stage.sql +
-  - Run generic_update.sql (from working directory) +
- +
-Staging table load statements +
- +
-<​code>​ +
-sqlldr PRODV5/<​password>​ CONTROL=PRODUCT.ctl LOG=PRODUCT.log BAD=PRODUCT.bad +
-</​code>​ +
- +
-==== Step 10 ==== +
- +
-=== Overview === +
- +
-Download the READ code datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.
- +
-=== Details === +
- +
-  - Run create_source_tables.sql +
-  - Download the Read V2 (CTV2) code distribution file from https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/1/subpack/21/releases.
-  - Download the latest release xx. +
-  - Extract the nhs_readv2_xxxxx\V2\Unified\Keyv2.all file. +
-  - Download Read V2 (CTV2) code to SNOMED CT mapping file nhs_datamigration_VV.0.0_YYYYMMDD000001.zip from https://​isd.hscic.gov.uk/​trud3/​user/​authenticated/​group/​0/​pack/​9/​subpack/​9/​releases (Subscription "NHS Data Migration"​). "​VV"​ stands for the READ CODE version number. ​ This script loads READ CODE V2.  "​YYYYMMDD"​ stands for the release date. +
-  - Download the latest release xx. +
-  - Extract nhs_datamigration_xxxxx\Mapping Tables\Updated\Clinically Assured\rcsctmap2_uk_YYYYMMDD000001.txt   +
-  - Load Keyv2.all and rcsctmap2_uk_YYYYMMDD000001.txt  into KEYV2 and RCSCTMAP2_UK using control files of the same name +
-  - Run load_stage.sql +
-  - Run generic_update.sql (from working directory) +
- +
-Staging table load statements
- +
-<​code>​ +
-sqlldr PRODV5/<​password>​ CONTROL=Keyv2.ctl LOG=Keyv2.log BAD=Keyv2.bad  +
-sqlldr PRODV5/<​password>​ CONTROL=rcsctmap2_uk.ctl LOG=rcsctmap2_uk.log BAD=rcsctmap2_uk.bad  +
-</​code>​ +
- +
-==== Step 11 ==== +
- +
-=== Overview === +
- +
-In the final_assembly directory, run the two SQL scripts to populate the concept_ancestor and drug_strength tables. +
- +
-=== Details === +
- +
-  - Run concept_ancestor.sql +
-  - Run drug_strength.sql +
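One way to run them is from SQL*Plus, using the same account as the staging loads (assuming the scripts are run from the final_assembly directory):

<code>
sqlplus PRODV5/<password> @concept_ancestor.sql
sqlplus PRODV5/<password> @drug_strength.sql
</code>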
- +
-==== Step 12 ==== +
- +
-=== Overview === +
- +
-In the "​working"​ directory, run the manual change scripts. +
- +
-=== Note === +
- +
-These scripts are collections of ad-hoc inserts and updates that have been applied over time. Before running each block of SQL statements, review them against the current V5 vocabulary table values to avoid potential duplicate inserts or inappropriate updates. Contact Christian Reich and the Athena vocabulary team for more info on these scripts.
- +
-  - Run manual_changes.sql +
-  - Run manual_changes_15-Mar-2015.sql  +
- +
-==== Step 13 ==== +
- +
-=== Overview === +
- +
-In the "​working"​ directory run a script that will update the V4 schema tables with the new data in the V5 schema tables. +
- +
-=== Notes === +
- +
-The v4 tables and the v5 tables should already exist and be populated with data. The prodv4 user should have been granted read access to the prodv5 schema tables. +
- +
-=== Details === +
- +
-  - FYI. V4 schema is here: https://​github.com/​OHDSI/​CommonDataModel/blob/master/Version4/Oracle/​CDM%20V4%20DDL.sql +
-  - Run v5-to-v4.sql to update the prodv4 tables with the new data from the prodv5 tables (see the example below)
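For example, from SQL*Plus (the account is an assumption; since the script reads the prodv5 tables and writes the prodv4 tables, it is typically run as the prodv4 user that was granted read access above):

<code>
sqlplus PRODV4/<password> @v5-to-v4.sql
</code>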