====== Background ======

===== LAERTES: Large-scale Adverse Effects Related to Treatment Evidence Standardization =====

LAERTES provides an evidence base drawn from a wide variety of sources with information relevant for assessing associations between drugs and health outcomes of interest (HOIs). Please see the following publication for a description of the goals and use cases driving the system:

Boyce RD, Ryan PB, Norén GN, et al. Bridging islands of information to establish an integrated knowledge base of drugs and health outcomes of interest. Drug Safety. 2014;37(8):557-567. DOI: 10.1007/s40264-014-0189-0. PubMed PMID: 24985530. PMCID: PMC4134480. http://link.springer.com/article/10.1007%2Fs40264-014-0189-0

Conceptually, the data model is a hybrid relational/RDF schema. The RDF component applies the Open Annotation (OA) data model (http://www.openannotation.org/spec/core/) to represent specific evidence items for or against an association between a drug and an HOI from any of a variety of sources, while the relational component provides a summary of and index into the drug-HOI evidence represented in the RDF component. The relational component extends the OHDSI Standard Vocabulary, enumerates the evidence data sources, and provides counts of the records associated with every drug-HOI pair in each source, noting the modality of the association (i.e., a positive or negative association). Associated with the counts from a given source is a URL that can be used in the RDF component to pull back a list of OA records typed using custom OHDSI evidence types. Each OA record provides data that client programs can use to render information about the source of the evidence (the "target") and the semantic tags used to identify the record as a source of evidence for a drug-HOI pair (the "body" or "bodies").

This model decouples the data sources from the various copies of the sources that might have been processed in many different ways.
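To make the target/body structure described above concrete, the following sketch serializes one hypothetical evidence item as an OA annotation in N-Triples. The OA and RDF namespace URIs are real; every other URI (the `example.org` namespace, the drug and HOI tag URIs) is an illustrative placeholder, not the actual LAERTES or OHDSI vocabulary.

```python
# A sketch (not the actual LAERTES vocabulary) of one evidence item as
# an Open Annotation record in N-Triples: the target points at the
# evidence source, the body carries the drug and HOI semantic tags.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OA = "http://www.w3.org/ns/oa#"
EX = "http://example.org/laertes/"  # placeholder namespace

def oa_record(annot_id, target_uri, drug_uri, hoi_uri):
    """Return the N-Triples lines for one drug-HOI annotation."""
    annot = f"{EX}annotation/{annot_id}"
    body = f"{EX}body/{annot_id}"
    return [
        f"<{annot}> <{RDF_TYPE}> <{OA}Annotation> .",
        f"<{annot}> <{OA}hasTarget> <{target_uri}> .",
        f"<{annot}> <{OA}hasBody> <{body}> .",
        f"<{body}> <{EX}vocab/drug> <{drug_uri}> .",
        f"<{body}> <{EX}vocab/hoi> <{hoi_uri}> .",
    ]

triples = oa_record(
    1,
    "http://www.ncbi.nlm.nih.gov/pubmed/24985530",       # the "target"
    "http://example.org/laertes/placeholder/drug/123",   # body tag: drug
    "http://example.org/laertes/placeholder/hoi/456",    # body tag: HOI
)
print("\n".join(triples))
```

A client rendering evidence would dereference the target URI for display and read the body tags to know which drug-HOI pair the record supports.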
It also decouples what can be said about an evidence item (i.e., the semantic tags) from the information artifact itself. All of this allows for greater flexibility with respect to the inclusion of sources and analysis. Moreover, the 'drill down' use case can be supported by following URIs provided in the 'evidence target' table to Linked Data resources.

This model will be developed to support all of the sources listed here: https://docs.google.com/document/d/13UwoqjPyqKr-MLpcflzNs8WD5Za4Ynqnue1xWU2cgaA/edit#

NOTE: The OHDSI Standard Vocabulary may require extensions from the following terminologies:
  * http://www.openannotation.org/spec/core/#Terminology
  * http://xmlns.com/foaf/spec/
  * http://www.w3.org/TR/prov-o/
  * http://www.w3.org/TR/skos-reference/

This wiki page provides a high-level guide or "cookbook" for the ETL process that the LAERTES team follows to periodically update the LAERTES database when new source data sets are released. The ETL process will be improved and extended with new data sources over time. It is somewhat complex: it has very specific prerequisites, requires a good knowledge of the source data sets, and includes some manual steps. This ETL is not intended to be executed by typical consumers of LAERTES data; instead, they should use one of the following approaches:
  * Call the LAERTES evidence services within the WebAPI (this option is currently available)
  * User interfaces will be developed for LAERTES evidence
  * A copy of the LAERTES database may eventually be made available for download

====== LAERTES ETL cookbook ======

==== Prerequisites ====

==== Network connectivity ====

Internet access is required to download source dataset files from external websites and to download the ETL source code and schema from GitHub: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES

==== Utilities ====

A zip file decompression program is required (e.g., 7-Zip for Windows or gzip/gunzip for Linux).

==== DBMS ====

=== PostgreSQL DBMS ===

Notes:
The ETL process combines the source data in a PostgreSQL database; the last step then populates the final tables, defined in ANSI-standard DDL, which can be created and loaded in the three supported OHDSI DBMSs: Oracle, PostgreSQL, and SQL Server.

  * Note: As a general principle when running the ETL scripts, you will get the best performance by first running the create table statements, then loading the tables, and then running the create index statements. The reason is that it is faster to add an index to a populated table than to load data into a table that already has an index.

=== MySQL DBMS ===

  * A MySQL database for the SemMedDB database
  * A MySQL database for the URL shortening service that must run on a local server for resolving URLs

Notes:

SemMedDB, a repository of semantic predications (subject-predicate-object triples) extracted from the entire set of PubMed citations, is only available as a MySQL database dump file. Therefore a MySQL database is required to load the data for ETL processing. The SemMedDB data is relatively large (80 to 100 GB).

=== Virtuoso DBMS ===

Notes:

LAERTES is a hybrid relational/RDF schema model. The RDF schema is created and maintained in a Virtuoso server.

  * Note: instructions on faceted browser installation: http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtFacetBrowserInstallConfig

==== Hardware ====

PostgreSQL DBMS server
  * 4 GB of RAM (minimum)
  * 300 GB of disk space for database and source files

Virtuoso server
  * 4 GB of RAM (minimum)
  * 300 GB of disk space for database and source files

MySQL DBMS server
  * 4 GB of RAM (minimum)
  * 200 GB of disk space for database and source files

ETL and URL shortening service server
  * 4 GB of RAM (minimum)
  * 300 GB of disk space for database and source files

==== Source Data Sets ====

This table shows the websites to visit and the files to download for this ETL process.

Note:
There are additional feeds under development, but the ones shown below are the initial core set.

^ Data Feed ^ Website ^ Website Download Page ^ Example Download File Name ^ Notes ^
| EU SPC ADRs | PROTECT website | http://www.imi-protect.eu/adverseDrugReactions.shtml | http://www.imi-protect.eu/documents/Finalrepository_2Sep2014_DLP30June2013.xls | A structured Excel database of all adverse drug reactions (ADRs) listed in section 4.8 of the Summary of Product Characteristics (SPC) of medicinal products authorised in the EU according to the centralised procedure |
| PubMed/MEDLINE ADEs from MeSH tags | NLM FTP server | NLM FTP server | Releases folder, MedlineXmlToDatabase*.zip files | MEDLINE records for indexed literature reporting adverse drug events |
| SPLICER-extracted US SPL ADEs | SPLICER data set provided to OHDSI by Regenstrief | SPLICER data set provided to OHDSI by Regenstrief | SPLICER data set provided to OHDSI by Regenstrief | Structured Product Label Information Coder and Extractor (SPLICER) natural language processing used to extract ADE data from FDA-mandated Structured Product Labels (SPLs) |
| NLM's SemMedDB (semantic representation of PubMed content) | http://skr3.nlm.nih.gov/ | http://skr3.nlm.nih.gov/ | semmedVER24_2_WHOLEDB_to06302014.sql | NLM's SemMedDB (semantic representation of PubMed content) |

==== Licenses/registrations required for access to some source datasets ====

  * MedDRA license
  * SPLICER NLP US SPL dataset: if you are a commercial organization, please contact the Regenstrief Institute for licensing information: [[allenkat@regenstrief.org|Regenstrief Institute]]
  * NLM license (to lease PubMed/MEDLINE data): http://www.nlm.nih.gov/databases/license/weblic/index.html

==== LAERTES ETL scripts ====

Use a web browser or the command-line program wget to download the following zip file containing all the ETL scripts: https://github.com/OHDSI/KnowledgeBase/archive/master.zip

Unzip the master.zip file using 7-Zip on Windows or unzip on Linux.
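The download-and-unzip step can also be scripted. A minimal sketch using only the Python standard library (the `download`/`unpack` helper names are this sketch's own, and the download call requires internet access):

```python
# Minimal sketch: fetch the ETL scripts archive and unpack it using
# only the Python standard library (an alternative to wget + 7-Zip).
import io
import urllib.request
import zipfile

MASTER_ZIP = "https://github.com/OHDSI/KnowledgeBase/archive/master.zip"

def download(url):
    """Fetch a URL and return its raw bytes (requires internet access)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def unpack(zip_bytes, dest):
    """Extract the archive into dest and return the member names."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(dest)
        return zf.namelist()

# Actual use (network required):
#   unpack(download(MASTER_ZIP), ".")
```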
This will result in the following directory structure:

/KnowledgeBase-master/LAERTES
  * CTD
  * ClinicalTrials.gov
  * EuSPC
  * PVSignals
  * PubMed
  * SIDER
  * SPLICER
  * Schema
  * SemMED
  * URLShortener
  * Vigibase
  * terminology-mappings

There is one sub-directory containing the ETL scripts for each data feed, plus the Schema, URLShortener, and terminology-mappings sub-directories explained below.

==== “URLShortener” directory ====

This directory contains the zipped source code to deploy and run a URL shortening service on a local server for resolving URLs.

==== “terminology-mappings” directory ====

This directory contains a number of data sets ultimately used to map the source terms to the OMOP CDM standard terms.

==== ETL Process Overview ====

{{:documentation:laertes_etl_overview.png?300|}}

{{:documentation:laertes_load_merged_counts.png?300|}}

==== ETL Process Steps ====

==== Prerequisites ====

Ensure that all of the prerequisite hardware, DBMSs, and the URL Shortener service have been deployed.
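The URL Shortener service's core job is a reversible mapping between a database row id (stored with the full annotation URI in MySQL) and a short token. As a sketch only, here is one common scheme for that mapping, base-62 encoding of the integer id; the actual service in the URLShortener directory may use a different scheme:

```python
# Sketch of the id <-> token mapping at the heart of a URL shortener:
# encode an integer row id as a short base-62 token and decode it back.
# This is illustrative; the deployed service's scheme may differ.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n):
    """Encode a non-negative integer as a base-62 token."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def decode(token):
    """Decode a base-62 token back to the integer id."""
    n = 0
    for ch in token:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode(125))            # "21"
print(decode(encode(99999)))  # 99999
```

The short token is what appears in the evidence-count URLs; resolving it is a single indexed lookup of the decoded id.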
==== EU SPC ADR data feed ====

European Union Adverse Drug Reactions from Summary of Product Characteristics (EU SPC) Database Import

=== Overview ===

  * Download the EU SPC data sets (see the data sources table)
  * Run Python scripts to convert the data into RDF N-Triples graph data
  * Load the RDF N-Triples graph data into the Virtuoso database
  * Manually run a Virtuoso SPARQL query to export the drug/HOI combinations along with the adverse event counts into an export file
  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command-line client
  * Run a Python script to load the export file into the PostgreSQL public schema database

=== Details ===

The details for this data feed are documented and maintained here: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/EuSPC

==== PubMed/MEDLINE data feed ====

MEDLINE records for indexed literature reporting adverse drug events

=== Overview ===

  * Download the PubMed/MEDLINE data sets (see the data sources table)
  * Run Python scripts to convert the data into RDF N-Triples graph data
  * Load the RDF N-Triples graph data into the Virtuoso database
  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command-line client
  * Manually run a Virtuoso SPARQL query to export the drug/HOI combinations along with the adverse event counts into an export file
  * Run a Python script to load the export file into the PostgreSQL public schema database

=== Details ===

The details for this data feed are documented and maintained here: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/PubMed

==== SPLICER SPL data feed ====

SPLICER Natural Language Processing extracted Adverse Drug Events from FDA Structured Product Labels (SPLs)

=== Overview ===

  * Download the SPLICER data sets (see the data sources table)
  * Run Python scripts to convert the data into RDF N-Triples graph data
  * Load the RDF N-Triples graph data into the Virtuoso database
  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command-line client
  * Manually run a Virtuoso SPARQL query to export the drug/HOI combinations along with the adverse event counts into an export file
  * Run a Python script to load the export file into the PostgreSQL public schema database

=== Details ===

The details for this data feed are documented and maintained here: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/SPLICER

==== SemMED data feed ====

The Semantic MEDLINE Database (SemMedDB) is a repository of semantic predications (subject-predicate-object triples) extracted by SemRep, a semantic interpreter of biomedical text.

=== Overview ===

  * Download the SemMED database MySQL data dump file (see the data sources table)
  * Load the MySQL database dump file into a MySQL database (this typically takes hours)
  * Run Python scripts to convert the data into RDF N-Triples graph data
  * Load the RDF N-Triples graph data into the Virtuoso database
  * Manually load the annotation URIs into the URL Shortener MySQL database using the MySQL command-line client
  * Manually run a Virtuoso SPARQL query to export the drug/HOI combinations along with the adverse event counts into an export file
  * Run a Python script to load the export file into the PostgreSQL public schema database

=== Details ===

The details for this data feed are documented and maintained here: https://github.com/OHDSI/KnowledgeBase/tree/master/LAERTES/SemMED
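Every feed above ends with the same shape of load step: parse the exported drug/HOI count file and insert the rows into the relational summary tables. A hedged sketch of the parsing half, assuming a tab-separated layout of drug concept id, HOI concept id, record count, and evidence-list URL; the real column layout is defined by the per-source scripts in the repository:

```python
# Sketch of the final per-feed load step's parsing: read a tab-separated
# export of drug/HOI counts. The four-column layout assumed here is for
# illustration only; see the per-source scripts for the actual layout.
import csv
import io

def parse_counts(tsv_text):
    """Yield (drug_id, hoi_id, count, evidence_url) tuples."""
    for drug_id, hoi_id, count, url in csv.reader(
            io.StringIO(tsv_text), delimiter="\t"):
        yield int(drug_id), int(hoi_id), int(count), url

sample = "1125315\t35708093\t42\thttp://example.org/evidence/abc\n"
rows = list(parse_counts(sample))
print(rows[0])  # (1125315, 35708093, 42, 'http://example.org/evidence/abc')
```

The evidence-list URL in each row is what lets a client follow the count back to the individual OA records in the RDF store.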