====== WhiteRabbit ======

  
WhiteRabbit is a software tool to help prepare for ETLs (Extraction, Transformation, Loading) of
longitudinal healthcare databases into the [[documentation:cdm:single-page|Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM)]]. The source data can be in
comma-separated text files, or in a database (MySQL, SQL Server, Oracle, PostgreSQL, Microsoft APS, Microsoft Access, Amazon RedShift, Google BigQuery). Note that the CDM will need to be in one of a limited set of database platforms (SQL Server, Oracle, PostgreSQL, Microsoft APS, Amazon RedShift, Google BigQuery).
  
WhiteRabbit’s main function is to perform a scan of the source data, providing detailed
information on the tables, fields, and values that appear in a field. This scan will generate a
report that can be used as a reference when designing the ETL, for instance by using the Rabbit-In-a-Hat
tool. WhiteRabbit differs from standard data profiling tools in that it attempts to
prevent the display of personally identifiable information (PII) data values in the generated scan report.
===== Installation and support =====
  
All source code and installation instructions are available on GitHub: https://github.com/OHDSI/WhiteRabbit
  
Any bugs/issues/enhancements should be posted to the GitHub repository: https://github.com/OHDSI/WhiteRabbit/issues
  
Any questions/comments/feedback/discussion can be posted on the OHDSI Developer Forum: http://forums.ohdsi.org/c/developers
  
===== Using the Application Functions =====
  
==== Specifying the Location of Data ====
  
Here you can specify the location of the source data. The following source types are supported:
delimited text files, MySQL, Oracle, SQL Server, PostgreSQL, and Google BigQuery. Below are connection
instructions for each type of data source. Once you have entered the necessary
information, the “Test connection” button can be used to verify that a connection can be made.
  
**Delimited text files**
  * **//Delimiter://** specifies the delimiter that separates columns; the default is ‘,’ and you can write ‘tab’ for tab-delimited files.
  
  
**MySQL**
  * //**Server location:**// the name or IP address of the server running MySQL. You can also specify the port (ex: <host>:<port>), which defaults to 3306.
  * //**User name:**// name of the user used to log into the server
  * //**Password:**// password for the supplied user name
  
**SQL Server**
  * //**Server location:**// the name or IP address of the server running SQL Server. You can also specify the port (ex: <host>:<port>), which defaults to 1433.
  * //**User name:**// name of the user used to log into the server. Optionally, the domain can be specified as <domain>/<user> (e.g. 'MyDomain/Joe')
  * //**Password:**// password for the supplied user name
To use Windows authentication with SQL Server, the JDBC driver's native authentication library also needs to be installed:
  - Download the .exe from http://msdn.microsoft.com/en-us/sqlserver/aa937724.aspx.
  - Run it, thereby extracting its contents to a folder.
  - In the extracted folder you will find the file //sqljdbc_4.0/enu/auth/x64/sqljdbc_auth.dll// (64-bits) or //sqljdbc_4.0/enu/auth/x86/sqljdbc_auth.dll// (32-bits), which needs to be moved to a location on the system path, for example to //c:/windows/system32//.

**MS APS**

At this time WhiteRabbit does not run on this platform.
  
**PostgreSQL**

  * //**Server location:**// this field contains the host name and database name (<host>/<database>)
  * //**User name:**// name of the user used to log into the server
  * //**Password:**// password for the supplied user name
  * //**Database name:**// this field contains the schema containing the source tables
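
As an illustration, a PostgreSQL connection might be filled in as follows (all values below are hypothetical):

  Server location: localhost/source_db
  User name: joe
  Password: secret
  Database name: public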

**Google BigQuery**

Google BigQuery (GBQ) supports two different connection/authentication methods: application default credentials and service account authentication. The former method is considered more secure because it writes auditing events to Stackdriver. The specific method used is determined by the arguments provided to the configuration panel, as described below.

Authentication via application default credentials:

When using application default credentials authentication, you must run the following gcloud command in the user account only once: 'gcloud auth application-default login' (do not include the single quote characters). An application key is written to ~/.config/gcloud/application_default_credentials.json.
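
For example, on a machine with the Google Cloud SDK installed, this one-time setup and a quick check that the key file was created look like this (the path assumes the default gcloud configuration directory):

  gcloud auth application-default login
  ls ~/.config/gcloud/application_default_credentials.json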

  * //**Server location:**// name of the GBQ ProjectID
  * //**User name:**// not used
  * //**Password:**// not used
  * //**Database name:**// data set name within the ProjectID named in the Server location field

Authentication via service account credentials:
  * //**Server location:**// name of the GBQ ProjectID
  * //**User name:**// OAuth service account email address
  * //**Password:**// OAuth private key path (file location of the private key JSON file); must be a valid full file pathname
  * //**Database name:**// data set name within the ProjectID named in the Server location field
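
As an illustration, a service account connection might be filled in as follows (the project, service account, and file names below are hypothetical):

  Server location: my-gbq-project
  User name: whiterabbit-scan@my-gbq-project.iam.gserviceaccount.com
  Password: /home/etl/keys/my-gbq-project-key.json
  Database name: source_claims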
  
==== Scanning a Database ====
  
Once all settings are completed, press the “Scan tables” button. After the scan is completed, the report will be written to the working folder.

=== Running from the command line ===

For various reasons one may prefer to run WhiteRabbit from the command line. This is possible by specifying all of the options one would normally select in the user interface in an .ini file. An example .ini file can be found [[https://github.com/OHDSI/WhiteRabbit/blob/master/iniFileExamples/WhiteRabbit.ini|here]]. The .ini file is then referenced when calling WhiteRabbit from the command line:

  java -jar WhiteRabbit.jar -ini WhiteRabbit.ini
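
For illustration only, such an .ini file might look roughly like the sketch below; the key names and values here are assumptions made for the purpose of the example, and the example file linked above in the WhiteRabbit repository is the authoritative reference:

  WORKING_FOLDER = C:/WhiteRabbit
  DATA_TYPE = PostgreSQL
  SERVER_LOCATION = 127.0.0.1/source_db
  USER_NAME = joe
  PASSWORD = secret
  DATABASE_NAME = public
  TABLES_TO_SCAN = *
  MIN_CELL_COUNT = 5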
  
=== Reading the Scan ===
  
After the scan is completed, a “ScanReport” Excel document will be created in the working folder location selected earlier. The document will have multiple tabs: an “Overview” tab and one tab for each database table or delimited text file selected for the scan. The “Overview” tab will tell you about each table selected, what the columns in each table are, the data type of the columns, the amount of data within the table, the number of rows scanned, and the fraction of data that is empty. Below is an example image of the “Overview” tab. Column A lists the table the information is about, Column B the column being reviewed, Column C the type of the column, Column E the number of rows (however, with text files it will return -1), Column F how many of the N rows were scanned, and Column G how many rows are empty.
  
{{ :documentation:software:whiterabbitscreen-readingthescan.png |}}
  
For a tab that describes a single table, the column names from the source table (or delimited text file) will be across the columns of the Excel tab. Each source table column will generate two columns in the Excel. One column will list all distinct values whose count is greater than the “Min cell count” set at the time of the scan (see the section on performing the scan). If a list of unique values was truncated, the last value in the list will be “List truncated”; this indicates that there are one or more additional unique source values that appear fewer times than the number entered in the “Min cell count” setting. Next to each distinct value will be a second column that contains the frequency, or the number of times that value occurs in the data. These two columns (distinct values and frequency) will repeat for all the source columns in the table profiled in the workbook.
  
{{ :documentation:software:whiterabbitscreen-readingthescanex.png |}}
  
To open an existing Rabbit-In-a-Hat document use //File --> Open ETL specs//.

==== Selecting Desired CDM Version ====

Rabbit-In-a-Hat allows you to select which CDM version (v4 or v5) you'd like to build your ETL specification against.

See the graphic below for how to select your desired CDM:
{{http://i.imgur.com/LOqhp7H.gif|Switching between CDMv4 and CDMv5}}

The CDM version can be changed at any time, but beware that some of your existing mappings may be lost in the process. By default, Rabbit-In-a-Hat will attempt to preserve as many mappings between the source data and the newly selected CDM as possible. When a new CDM is selected, Rabbit-In-a-Hat will drop any mappings whose CDM table or CDM column name no longer exists.

For instance, when switching from CDMv4 to CDMv5, a mapping from source data to CDM person.person_source_value will be kept because the person table has person_source_value in both CDMv4 and CDMv5. However, person.associated_provider_id exists only in CDMv4 and has been renamed to [[http://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm:person|person.provider_id in CDMv5]], so that mapping will not be kept when switching between these two CDMs.

==== Loading in a Custom CDM ====

There are times when users might need to load in a customized version of the CDM, for instance if they are sandboxing new features. To load in a custom CDM schema, you must first create a CSV file that uses the same format as [[https://github.com/OHDSI/WhiteRabbit/blob/master/src/org/ohdsi/rabbitInAHat/dataModel/CDMV5.csv|the existing CDMv5 schema file]].
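
A minimal sketch of what such a file could look like is shown below. The column headers and rows are illustrative assumptions only; copy the exact header line from the linked CDMV5.csv file and follow its layout:

  TABLE_NAME,FIELD_NAME,IS_NULLABLE,DATA_TYPE,DESCRIPTION
  my_custom_table,my_custom_table_id,NO,INTEGER,Primary key of the custom table
  my_custom_table,person_id,NO,INTEGER,Reference to the person table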

Once you have created the CSV file, load it into RiaH as shown below:
{{http://i.imgur.com/Tn9NKL3.gif|Loading a custom CDM schema}}

Please note that the name of the file you load becomes the label that appears above the target tables; for example, "My Super File.csv" will create the label "My Super File" above the target tables, so name your CSV accordingly.
  
===== Connecting Source Tables to CDM Tables =====
===== Connecting Source Fields to CDM Fields =====
  
Double clicking on an arrow connecting a source and CDM table will open a //Fields// pane below the selected arrow. The //Fields// pane will have all the source table and CDM fields and is meant to make the specific column mappings between tables. Hovering over a source field will generate an arrow head that can then be selected and dragged to its corresponding CDM field. For example, in the //drug_claims// to //drug_exposure// table mapping example, the source data owners know that //patient_id// is the patient identifier and corresponds to //CDM.person_id//. Also, just as before, the arrow can be selected and //Logic// and //Comments// can be added.
  
{{ :documentation:software:rabbitinahat-fields.png |}}

If you select a source table's orange box, Rabbit-In-a-Hat will display the values that the source data has for that table. This is meant to help in understanding the source data and what logic may be required to handle the data in the ETL. In the example below, //ndcnum// is selected and the raw NDC codes are displayed starting with the most frequent (note that if a “Min cell count” was set for the WhiteRabbit scan, values below that frequency will not show).

{{ :documentation:software:rabbitinahat-fieldex.png |}}

Continue this process until all source columns necessary in all mapped tables have been mapped to the corresponding CDM columns. Not all source columns must be mapped into a CDM column, and not all CDM columns require a mapping. One source column may supply information to multiple CDM columns, and one CDM column can receive information from multiple source columns.

===== Generating an ETL Document =====

To generate an ETL MS Word document, use //File --> Generate ETL document// and select a location to save it. It may take a moment before the document is finished. This document will contain all notes from Rabbit-In-a-Hat.

Once the information is in the document, any update must be made either in Rabbit-In-a-Hat (regenerating the document afterwards) or directly in the document. If you make changes in the document, Rabbit-In-a-Hat will not read those changes back into the tool. However, it is common to generate the document with the core mapping information and then fill in more detail within the document.

Once the document is completed, it should be shared with the individuals who plan to implement the code to execute the ETL.

===== Generating a testing framework =====

To make sure the ETL process is working as specified, it is highly recommended to create [[https://en.wikipedia.org/wiki/Unit_testing|unit tests]] that evaluate the behavior of the ETL process. To efficiently create a set of unit tests, Rabbit-In-a-Hat can [[documentation:software:whiterabbit:test_framework|generate a testing framework]].

====== Best Practices ======

The following lists best practices in using WhiteRabbit and Rabbit-In-a-Hat to manage your ETL documentation process:
  * **Overall Process:**
    * When going through the ETL exercise, it is critical to get all key stakeholders in the room, such as the data owners and the individuals who plan to perform or manage research using the CDM. We have found that different stakeholders have different perspectives on the source data, and the conversation that occurs improves the ETL process.
  * **WhiteRabbit:**
    * If it is known that some tables from the source data will not be needed, do not include them in the scan. If there is any question as to whether a table is necessary, it is better to include it.
  * **Rabbit-In-a-Hat:**
    * Start with mapping tables to tables and then dive into how one table's fields map into CDM fields. In other words, stay at the table level until all tables are mapped, and then start to map fields.
    * If your source data does not contain certain information, you do not need to impute or generate information to fulfil a CDM requirement. For example, if your source data does not contain an end date for when a medication was stopped, you do not need to populate the DRUG_EXPOSURE.DRUG_EXPOSURE_END_DATE column.
    * Derived CDM tables, like DRUG_ERA, typically will not receive a mapping from the source data because they are generated from the CDM data (in this case DRUG_ERA is generated from DRUG_EXPOSURE).