This is an old revision of the document!

Background

LAERTES: Large scale Adverse Effects Related to Treatment Evidence Standardization

LAERTES provides an evidence base of a wide variety of sources with information relevant for assessing associations between drugs and health outcomes of interest. Please see the following publication for a description of the goal and use cases driving the system:

Boyce. RD., Ryan. PB., Noren. N., et al., Bridging islands of information to establish an integrated knowledge base of drugs and health outcomes of interest. Drug Safety. 2014. Volume 37, Issue 8 (2014), Page 557-567. DOI: 10.1007/s40264-014-0189-0, PubMed PMID: 24985530. PMCID: PMC4134480.http://link.springer.com/article/10.1007%2Fs40264-014-0189-0.

Conceptually, the data model is a hybrid relational/RDF schema where the RDF component applies the Open Data Annotation (OA) model (http://www.openannotation.org/spec/core/) to represent specific evidence items for or against an association between a drug and a health outcome of interest (HOIs) from any of a variety of sources, and the relational component provides a summary and index for drug-HOI evidence represented in the RDF component.

The relational component extends the OHDSI Standard Vocabulary, enumerates the evidence data sources, and provides counts for the records associated with every drug-HOI pair in each source noting the modality of the association (i.e., a positive or negative association). Associated with the counts from a given source is a URL that can be used in the RDF component to pull back a list of OA records typed using custom OHDSI evidence types. Each OA record provides data that client programs can use to render data about the source of the evidence (the “target”) and the semantic tags used to identify the record as a source of evidence for a drug-HOI pair (the “body” or “bodies”).

This model decouples the data sources from the various copies of the sources that might have been processed in many different ways. It also decouples what can be said about and evidence item (i.e., the semantic tags) from the information artifact. All of this allows for greater flexibility with respect to inclusion of sources and analysis. Moreover, the 'drill down' use case can be supported by following URIs provided by in the 'evidence target' table to Linked Data resources.

This model will be developed to support all of the sources here: https://docs.google.com/document/d/13UwoqjPyqKr-MLpcflzNs8WD5Za4Ynqnue1xWU2cgaA/edit#

NOTE: The OHDSI standard vocabulary may require extensions from the following terminologies:

This wiki page provides a high level guide or “cookbook” for the ETL process that the LAERTES team follows to periodically update the LAERTES database when new source data sets are released.

This ETL process will be improved and extended with new data sources over time. It is somewhat complex, it has very specific pre-requisites, it requires a good knowledge of the source data sets and it includes some manual steps.

This ETL is not intended to be executed by typical consumers of LAERTES data, instead they should use the following approaches:

Call the LAERTES evidence services within the WebAPI (this option is currently available)
User Interfaces will be developed for LAERTES evidence
A copy of the LAERTES database may eventually be made available for download

LAERTES ETL cookbook

Prerequisites

Network connectivity

Internet access is required to download source dataset files from external websites and to download the ETL source code and schema from github: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES

Utilities

A zip file decompression program is required (e.g. 7-zip for windows or gzip/gunzip for linux)

DBMS

PostgreSQL DBMS

Notes. The ETL process combines the source data into a PostgreSQL database and then the last step populates the final tables in ANSI standard DDL tables which can be created and loaded in the three supported OHDSI DBMSs: Oracle, PostgreSQL and SQL Server.

* Note. As a general principle to be followed when running the ETL scripts You will get the best performance by first running the create table statements, then loading the tables and then running the create index statements. The reason is that it is faster to add an index to a populated table than to load data into a table that already has an existing index.

MySQL DBMS

A MySQL database for the SemMedDB database
A MySQL database for the URL shortening service that must run on a local server for resolving URLs

Notes.

SemMedDB, a repository of semantic predications (subject-predicate-object triples) extracted from the entire set of PubMed citations is only available as a MySQL database dump file. Therefore a MySQL database is required to load the data for ETL processing. The SemMedDB data is relatively large. (80 to 100 GBs).

Virtuoso DBMS

Notes.

LAERTES is a hybrid relational/RDF schema model. The RDF schema is created and maintained in a Virtuoso server.

* Note. instructions on faceted browser installation: http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtFacetBrowserInstallConfig

Hardware

PostgreSQL DBMS server

4 Gig of RAM (minimum)
300 GB disk space for database and source files

Virtuoso server

4 Gig of RAM (minimum)
300 GB disk space for database and source files

MySQL DBMS server

4 Gig of RAM (minimum)
200 GB disk space for database and source files

ETL and URL shortening service server

4 Gig of RAM (minimum)
300 GB disk space for database and source files

Source Data Sets

This table shows the web sites to visit and the files to download for this ETL process. Note. There are additional feeds under development but the ones shown below are the initial core set.

Data Feed	Website	Website Download Page	Example Download File Name	Notes
EU SPC ADRs	PROTECT Website	http://www.imi-protect.eu/adverseDrugReactions.shtml	http://www.imi-protect.eu/documents/Finalrepository_2Sep2014_DLP30June2013.xls	A structured Excel database of all adverse drug reactions (ADRs) listed in section 4.8 of the Summary of Product Characteristics (SPC) of medicinal products authorised in the EU according to the centralised procedure
PubMed/MEDLINE using MeSH tags ADEs	NLM ftp server	NLM ftp server	Releases folder, MedlineXmlToDatabase*.zip files	MEDLINE records for indexed literature reporting adverse drug events
SPLICER extracted US SPL ADEs	SPLICER Data set provided to OHDSI by Regenstrief	SPLICER Data set provided to OHDSI by Regenstrief	SPLICER Data set provided to OHDSI by Regenstrief	Structured Product Label Information Coder and Extractor (SPLICER) natural language processing to extract ADE data from FDA mandated Structured Product Labels (SPL)
NLM's SemMedDB (semantic representation of PubMed content)	http://skr3.nlm.nih.gov/	http://skr3.nlm.nih.gov/	semmedVER24_2_WHOLEDB_to06302014.sql	NLM's SemMedDB (semantic representation of PubMed content)

Licenses/registrations required for access to some source datasets

MedDRA license
SPLICER NLP US SPL dataset: If you are a commercial organization, please contact Regenstrief Institute for licensing information Regenstrief Institute
NLM licence (lease PubMed/MEDLINE data): http://www.nlm.nih.gov/databases/license/weblic/index.html

LAERTES ETL scripts

Use a web browser or the command-line program wget to download the following zip file containing all the ETL scripts:

https://github.com/OHDSI/KnowledgeBase/archive/master.zip unzip the master.zip file using 7-Zip on Windows or unzip on Linux.

This will result in the following directory structure:

/KnowledgeBase-master/LAERTES

CTD
ClinicalTrials.gov
EuSPC
PVSignals
PubMed
SIDER
SPLICER
Schema
SemMED
URLShortener
Vigibase
terminology-mappings

There is one sub-directory containing the ETL scripts for each data feed, plus the Schema, URLShortener and terminology-mappings sub-directories explained below.

“URLShortener” directory

This directory contains the zipped source code to deploy and run a local URL shortening service on a local server for resolving URLs

“Terminology-mappings” directory

This directory contains a number of data sets used to ultimately map terminologies from the sources terms to the OMOP CDM standard terms.

ETL Process Overview

ETL Process Steps

Prerequisites

Ensure that all of the prerequisite hardware, DBMSs and the URL Shortener service have been deployed.

EU SPC ADR data feed

European Union Adverse Drug Reactions from Summary of Product Characteristics (EU SPC) Database Import

Overview

Download, the EU SPC data sets (see data sources table)
Run python scripts to convert the data into RDF ntriple graph data
Load the RDF ntriple graph data into the Virtuoso database
Manually run Virtuoso SPARQL query to export the drug/hoi combinations along with the adverse event counts into an export file
Run Python script to load the export file into the PostgreSQL public schema database

Details

The details for this data feed are documented and maintained here: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/EuSPC

PUBMED/MEDLINE data feed

MEDLINE records for indexed literature reporting adverse drug events

Overview

Download, the PUBMED/MEDLINE data sets (see data sources table)
Run python scripts to convert the data into RDF ntriple graph data
Load the RDF ntriple graph data into the Virtuoso database
Manually run Virtuoso SPARQL query to export the drug/hoi combinations along with the adverse event counts into an export file
Run Python script to load the export file into the PostgreSQL public schema database

Details

The details for this data feed are documented and maintained here: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/PubMed

SPLICER SPL data feed

SPLICER Natural Language Processing extracted Adverse Drug Events from FDA Structured Product Labels (SPLs)

Overview

Download the SPLICER data sets (see data sources table)
Run python scripts to convert the data into RDF ntriple graph data
Load the RDF ntriple graph data into the Virtuoso database
Manually run Virtuoso SPARQL query to export the drug/hoi combinations along with the adverse event counts into an export file
Run Python script to load the export file into the PostgreSQL public schema database

Details

The details for this data feed are documented and maintained here: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/SPLICER

SemMED data feed

The Semantic MEDLINE Database is a repository of semantic predications (subject-predicate-object triples) extracted by SemRep, a semantic interpreter of biomedical text.

Overview

Download the SemMED Database MySQL data dump file (see data sources table)
Load the MySQL database dump file into a MySQL database (this typically takes hours to load)
Run python scripts to convert the data into RDF ntriple graph data
Load the RDF ntriple graph data into the Virtuoso database
Manually run Virtuoso SPARQL query to export the drug/hoi combinations along with the adverse event counts into an export file
Run Python script to load the export file into the PostgreSQL public schema database

Details

The details for this data feed are documented and maintained here: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/SemMED

Step 4

Overview

Download the HCPCS datasets from the website. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.

Notes

load_stage.sql Steps 9 and 10 are completely manual – a select statement to generate the set of Medical Coder rows which needs to be exported as a file followed by an instruction “Append resulting file from Medical Coder (in concept_relationship_stage format) to concept_relationship_stage”. In the future this could be converted into an insert append statement. Note. The original HCPCS HCPYYYY_CONTR_ANWEB_V2.txt file is a fixed field format file that will not load with the ANWEB_V2.ctl file which requires a comma separated file so see details below on creating a csv file to use for the load.

Details

Run create_source_tables.sql
Download the latest file http://www.cms.gov/Medicare/Coding/HCPCSReleaseCodeSets/Alpha-Numeric-HCPCS.html, file name YYYY-Annual-Alpha-Numeric-HCPCS-File.zip
Extract HCPCYYYY_CONTR_ANWEB_v2.xlsx
Open the above file in excel and export a csv file (with double quoted fields) called HCPC2015_CONTR_ANWEB_v2.csv for loading into ANWEB_V2 staging table.
Load the file data into ANWEB_V2. Use the control files of the same name.
Run load_stage.sql
Run generic_update.sql (from working directory)

Staging table load statements

sqlldr PRODV5/<password> CONTROL=ANWEB_V2.ctl DATA=HCPC2015_CONTR_ANWEB_v2.tsv LOG=ANWEB_V2.log BAD=ANWEB_V2.bad

Step 5

Overview

Download, the ICD9CM datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.

Details

Run create_source_tables.sql
Download from ICD-9-CM-vXX-master-descriptions.zip from http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html
Extract CMSXX_DESC_LONG_DX.txt and CMSXX_DESC_SHORT_DX.txt
Load them into CMS_DESC_LONG_DX and CMS_DESC_SHORT_DX. Use the control files of the same name.
Run load_stage.sql
Run generic_update.sql (from working directory)

Staging table load statements

sqlldr PRODV5/<password> CONTROL=CMS32_DESC_LONG_DX.ctl LOG=CMS32_DESC_LONG_DX.log BAD=CMS32_DESC_LONG_DX.bad 
sqlldr PRODV5/<password> CONTROL=CMS32_DESC_SHORT_DX.ctl LOG=CMS32_DESC_SHORT_DX.log BAD=CMS32_DESC_SHORT_DX.bad

Step 6

Overview

Download, the ICD9Proc datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.

Notes

load_stage.sql Steps 8 and 9 are completely manual – a select statement to generate the set of Medical Coder rows which need to be exported as a file called 'concept_relationship_manual9cm'. In the future these two steps could be converted into an insert append sub-select statement.

Run create_source_tables.sql
Download from ICD-9-CM-vXX-master-descriptions.zip from http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html
Extract CMSXX_DESC_LONG_SG.txt and CMSXX_DESC_SHORT_SG.txt
Load them into CMS_DESC_LONG_SG and CMS_DESC_SHORT_SG. Use the control files of the same name.
Run load_stage.sql
Run generic_update.sql (from working directory)

Staging table load statements

sqlldr PRODV5/<password> CONTROL=CMS32_DESC_LONG_SG.ctl LOG=CMS32_DESC_LONG_SG.log BAD=CMS32_DESC_LONG_SG.bad 
sqlldr PRODV5/<password> CONTROL=CMS32_DESC_SHORT_SG.ctl LOG=CMS32_DESC_SHORT_SG.log BAD=CMS32_DESC_SHORT_SG.bad

Step 7

Overview

Download, the LOINC datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.

Notes

Add the missing create table statement for LOINC_HIERARCHY and add missing columns to LOINC_ANSWERS create statement in create_source_tables.sql, The added/update two tables are shown below:

CREATE TABLE LOINC_HIERARCHY
(
  PATH_TO_ROOT  VARCHAR2(4000), 
  SEQUENCE NUMBER(38,0), 
  IMMEDIATE_PARENT VARCHAR2(255), 
  CODE VARCHAR2(255), 
  CODE_TEXT VARCHAR2(255)
);


CREATE TABLE LOINC
(
  LOINC_NUM                  VARCHAR2(10),
  COMPONENT                  VARCHAR2(255),
  PROPERTY                   VARCHAR2(30),
  TIME_ASPCT                 VARCHAR2(15),
  SYSTEM                     VARCHAR2(100),
  SCALE_TYP                  VARCHAR2(30),
  METHOD_TYP                 VARCHAR2(50),
  CLASS                      VARCHAR2(20),
  SOURCE                     VARCHAR2(8),
  DATE_LAST_CHANGED          DATE,
  CHNG_TYPE                  VARCHAR2(3),
  COMMENTS                   CLOB,
  STATUS                     VARCHAR2(11),
  CONSUMER_NAME              VARCHAR2(255),
  MOLAR_MASS                 VARCHAR2(13),
  CLASSTYPE                  VARCHAR2(20),
  FORMULA                    VARCHAR2(255),
  SPECIES                    VARCHAR2(20),
  EXMPL_ANSWERS              CLOB,
  ACSSYM                     CLOB,
  BASE_NAME                  VARCHAR2(50),
  NAACCR_ID                  VARCHAR2(20),
  CODE_TABLE                 VARCHAR2(10),
  SURVEY_QUEST_TEXT          CLOB,
  SURVEY_QUEST_SRC           VARCHAR2(50),
  UNITSREQUIRED              VARCHAR2(1),
  SUBMITTED_UNITS            VARCHAR2(30),
  RELATEDNAMES2              CLOB,
  SHORTNAME                  VARCHAR2(40),
  ORDER_OBS                  VARCHAR2(15),
  CDISC_COMMON_TESTS         VARCHAR2(1),
  HL7_FIELD_SUBFIELD_ID      VARCHAR2(50),
  EXTERNAL_COPYRIGHT_NOTICE  CLOB,
  EXAMPLE_UNITS              VARCHAR2(255),
  LONG_COMMON_NAME           VARCHAR2(255),
  HL7_V2_DATATYPE            VARCHAR2(255),
  HL7_V3_DATATYPE            VARCHAR2(255),
  CURATED_RANGE_AND_UNITS    CLOB,
  DOCUMENT_SECTION           VARCHAR2(255),
  EXAMPLE_UCUM_UNITS         VARCHAR2(255),
  EXAMPLE_SI_UCUM_UNITS      VARCHAR2(255),
  STATUS_REASON              VARCHAR2(9),
  STATUS_TEXT                CLOB,
  CHANGE_REASON_PUBLIC       CLOB,
  COMMON_TEST_RANK           VARCHAR2(20),
  COMMON_ORDER_RANK          VARCHAR2(20),
  COMMON_SI_TEST_RANK        VARCHAR2(20),
  HL7_ATTACHMENT_STRUCTURE   VARCHAR2(15)
);

Details

Run create_source_tables.sql
Download full set (http://loinc.org/downloads/files/loinc-table-csv-text-format/loinc-table-file-csv-text-format/download) and multiaxial hierarchy (http://loinc.org/downloads/files/loinc-multiaxial-hierarchy/loinc-multiaxial-hierarchy-file/download)
Extract loinc.csv, map_to.csv, source_organization.csv and LOINC_250_MULTI-AXIAL_HIERARCHY.CSV
Load them into LOINC, MAP_TO, SOURCE_ORGANIZATION and LOINC_HIERARCHY. Use the control files of the same name.
Load LOINC Answers - Load LOINC_XXX_SELECTED_FORMS.zip from http://loinc.org/downloads/files/loinc-panels-and-forms-file/loinc-panels-and-forms-file-all-selected-panels-and-forms/download
Open LOINC_XXX_SELECTED_FORMS.xlsx and load worksheet “ANSWERS” to table LOINC_ANSWERS
Open loinc_class.csv and load it into table loinc_class
Run load_stage.sql
Run generic_update.sql (from working directory)

Staging table load statements

sqlldr PRODV5<password> CONTROL=LOINC.ctl LOG=LOINC.log BAD=LOINC.bad 
sqlldr PRODV5/<password> CONTROL=MAP_TO.ctl LOG=MAP_TO.log BAD=MAP_TO.bad 
sqlldr PRODV5/<password> CONTROL=SOURCE_ORGANIZATION.ctl LOG=SOURCE_ORGANIZATION.log BAD=SOURCE_ORGANIZATION.bad 
sqlldr PRODV5/<password> CONTROL=LOINC_HIERARCHY.ctl LOG=LOINC_HIERARCHY.log BAD=LOINC_HIERARCHY.bad 
sqlldr PRODV5/<password> CONTROL=LOINC_CLASS.ctl LOG=LOINC_CLASS.log BAD=LOINC_CLASS.bad 
sqlldr PRODV5/<password> CONTROL=LOINC_ANSWERS.ctl LOG=LOINC_ANSWERS.log BAD=LOINC_ANSWERS.bad

Step 8

Overview

Download, the MEDDRA datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.

You must have a MedDRA license to download the data set. The processing of this data set follows the same approach as the other data feeds.

Details

Run create_source_tables.sql
Download the current Meddra from https://www.meddra.org/user?destination=downloads (english)
Extract and load into the tables from the folder “MedAscii”. Use the control files of the same name hlgt.asc hlgt_hlt.asc hlt.asc hlt_pt.asc intl_ord.asc llt.asc mdhier.asc pt.asc SMQ_Content.asc SMQ_List.asc soc.asc soc_hlgt.asc
Run load_stage.sql
Run generic_update.sql (from working directory)

Step 9

overview

Download, the NDC-SPL datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.

Notes

This data feed loads the NDC vocabulary and the SPL vocabularies.

Details

Run create_source_tables.sql
Download NDC code distrbution file Open the site http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm
Download the latest NDC Database File.
Extract the product.txt file.
Load product.txt into PRODUCT using control file of the same name
Run load_stage.sql
Run generic_update.sql (from working directory)

Staging table load statements

sqlldr PRODV5/<password> CONTROL=PRODUCT.ctl LOG=PRODUCT.log BAD=PRODUCT.bad

Step 10

Overview

Download, the READ code datasets. Decompress the gzipped files. Run the ETL script to create the staging tables schema and load the datasets.

Details

Run create_source_tables.sql
Download Read V2 (CTV2) code distrbution file from the site https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/1/subpack/21/releases.
Download the latest release xx.
Extract the nhs_readv2_xxxxx\V2\Unified\Keyv2.all file.
Download Read V2 (CTV2) code to SNOMED CT mapping file nhs_datamigration_VV.0.0_YYYYMMDD000001.zip from https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/9/subpack/9/releases (Subscription “NHS Data Migration”). “VV” stands for the READ CODE version number. This script loads READ CODE V2. “YYYYMMDD” stands for the release date.
Download the latest release xx.
Extract nhs_datamigration_xxxxx\Mapping Tables\Updated\Clinically Assured\rcsctmap2_uk_YYYYMMDD000001.txt
Load Keyv2.all and rcsctmap2_uk_YYYYMMDD000001.txt into KEYV2 and RCSCTMAP2_UK using control files of the same name
Run load_stage.sql
Run generic_update.sql (from working directory)

staging table load statements

sqlldr PRODV5/<password> CONTROL=Keyv2.ctl LOG=Keyv2.log BAD=Keyv2.bad 
sqlldr PRODV5/<password> CONTROL=rcsctmap2_uk.ctl LOG=rcsctmap2_uk.log BAD=rcsctmap2_uk.bad

Step 11

Overview

In the final_assembly directory, run the two SQL scripts to populate the concept_ancestor and drug_strength tables.

Details

Run concept_ancestor.sql
Run drug_strength.sql

Step 12

Overview

In the “working” directory, run the manual change scripts.

Note

These scripts are collections of ad-hoc inserts / updates that have been applied over time. Before running each block of sql statements review them against the current V5 vocabulary table values to avoid potential duplicate inserts or inappropriate updates. Contact Christian Reich and the Athena vocabulary team for more info on these scripts.

Run manual_changes.sql
Run manual_changes_15-Mar-2015.sql

Step 13

Overview

In the “working” directory run a script that will update the V4 schema tables with the new data in the V5 schema tables.

Notes

The v4 tables and the v5 tables should already exist and be populated with data. The prodv4 user should have been granted read access to the prodv5 schema tables.

Details

FYI. V4 schema is here: https://github.com/OHDSI/CommonDataModel/blob/master/Version4/Oracle/CDM%20V4%20DDL.sql
run v5-to-v4.sql to update the prodv4 tables with the new data from the prodv5 tables

Observational Health Data Sciences and Informatics

User Tools

Site Tools

Sidebar

Documentation

Table of Contents

Background

LAERTES: Large scale Adverse Effects Related to Treatment Evidence Standardization

LAERTES ETL cookbook

Prerequisites

Network connectivity

Utilities

DBMS

PostgreSQL DBMS

MySQL DBMS

Virtuoso DBMS

Hardware

Source Data Sets

Licenses/registrations required for access to some source datasets

LAERTES ETL scripts

“URLShortener” directory

“Terminology-mappings” directory

ETL Process Overview

ETL Process Steps

Prerequisites

EU SPC ADR data feed

Overview

Details

PUBMED/MEDLINE data feed

Overview

Details

SPLICER SPL data feed

Overview

Details

SemMED data feed

Overview

Details

Step 4

Overview

Notes

Details

Step 5

Overview

Details

Step 6

Overview

Notes

Step 7

Overview

Notes

Details

Step 8

Overview

Details

Step 9

overview

Notes

Details

Step 10

Overview

Details

Step 11

Overview

Details

Step 12

Overview

Note

Step 13

Overview

Notes

Details

Page Tools