What do I need to do an OMOP conversion? Separate the process into modules:

Pre-analysis
- Data from the source (or at least the range of values for each field).
- Data dictionary, including:
  - taxonomies
  - OMOP vocabulary (reference codes)
  - whether any fields are masked or modified for privacy, and if so, what the algorithm is
- Use min and max to measure the range of values for each field (a profiling sketch follows this section).
- Define how to handle exceptions.
- Identify a subject matter expert for the source.
- Identify an OMOP subject matter expert.
- Use cases for the destination.
- List of standard questions:
  - How many records are there?
  - How often is the data refreshed?
  - Is the conversion feasible?
- Draft ETL specs, covering:
  - privacy requirements
  - the definition of a visit
  - specific codes that are of interest; ensure these are mapped
  - business rules for handling conflicting information (e.g. visits after patient death)
  - specific metrics that are of interest

Meeting with subject matter experts

Post-analysis
- Revised data dictionary.
- Initial ETL spec, including:
  - business rules for mapping in a detailed specification, preferably in a computer-readable format, like [[https://www.ohdsi.org/analytic-tools/whiterabbit-for-etl-design/?ModPagespeed=noscript|White Rabbit]]
  - a view of the mapping, preferably in a computer-readable format, like [[https://www.ohdsi.org/analytic-tools/whiterabbit-for-etl-design/?ModPagespeed=noscript|Rabbit-In-a-Hat]]
  - any additional mapping needed:
    - custom or local mapping of organizational codes
    - country-specific codes not already in the OHDSI vocabulary, like OHIP codes
    - what to do with old codes that have been deactivated or invalidated but have no replacement codes (see the vocabulary query after this section)
    - new values to add for publication in the next version of the vocabulary
- Preferred development methodology.

ETL Conversion
- CDM Builder.
- SQL development.
- Sample size for development (see the sampling sketch after this section).
- Major pieces of work (Scrum epics), one per table:
  - dimension tables first
  - era tables (e.g. drug_era, condition_era) always last, because they are derived
- Define the range of data to convert:
  - observation period - first/last?
  - EMR vs. claims
  - use min and max to measure the range for observations and for each field
  - exclude reversals for claims
  - exclude invalid dates, and identify the broken processes that generate them to help remove them
  - use the start and end dates of vocabulary items
  - document what to do with records that are missing required fields (maybe you have a medical coder who can code from a description field)
  - document what to do with records that have fields with invalid values
- Software lifecycle: how do you develop, test, and accept for production?
  - Jenkins for automated builds
  - SVN for source code control
- How do you manage the effort and cost of converting millions of patient records in terabytes of data?
  - Use a sample subset of the total data, sized by number of patients, amount of data, or processing time:
    - develop and test using a small sample of the entire data (150 thousand patients)
    - run business acceptance tests using a larger sample (1 million patients)
    - run production using the entire data (millions of patients)
- Define the destination location(s).
- Always get the latest vocabularies before each refresh (development, test, or production run).
  - Where do you get the most recent list of codes?
  - Frequency or schedule of reviews.
  - How do you become aware of updates to the CDM?
  - How do you become aware of updates to the vocabularies?
- Partitioning for parallelism to optimize performance.
- Guidelines for incremental updates.
- Reusable code/tables - an intermediate model?
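Throughout this page, "use min and max" means simple field profiling. White Rabbit produces this kind of scan automatically; a minimal hand-rolled equivalent looks like the sketch below, where the table and column names (source_visits, patient_id, visit_start_date, patient_year_of_birth) are hypothetical stand-ins for your source schema.

<code sql>
-- Profile the range of values for the fields of one source table.
-- All table and column names here are hypothetical examples.
SELECT COUNT(*)                    AS record_count,
       COUNT(DISTINCT patient_id)  AS patient_count,
       MIN(visit_start_date)       AS min_visit_start,
       MAX(visit_start_date)       AS max_visit_start,
       MIN(patient_year_of_birth)  AS min_year_of_birth,
       MAX(patient_year_of_birth)  AS max_year_of_birth
FROM source_visits;
</code>

Values outside the expected range (e.g. visit dates in the future, or before the source system existed) become input to the exception-handling and invalid-date rules above.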
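For the "deactivated or invalidated codes with no replacement" item, one way to enumerate the problem codes is to check invalid_reason in the OMOP concept table and look for a surviving 'Maps to' relationship. The sketch below assumes a hypothetical staging table source_codes(code, record_count) of distinct codes from the source, and uses 'ICD9CM' purely as an example vocabulary.

<code sql>
-- Source codes that are invalid in the OMOP vocabulary and have no
-- 'Maps to' replacement. source_codes is a hypothetical staging table.
SELECT sc.code,
       sc.record_count
FROM source_codes sc
JOIN concept c
  ON  c.concept_code  = sc.code
  AND c.vocabulary_id = 'ICD9CM'   -- example vocabulary only
WHERE c.invalid_reason IS NOT NULL -- deactivated/invalidated code
  AND NOT EXISTS (
        SELECT 1
        FROM concept_relationship cr
        WHERE cr.concept_id_1    = c.concept_id
          AND cr.relationship_id = 'Maps to'
          AND cr.invalid_reason IS NULL);
</code>

Codes returned by this query need a documented business rule: map them by hand, load them as unmapped, or propose new vocabulary entries as noted above.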
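For the sample-subset strategy, sample by patient rather than by row so that every sampled patient keeps a complete record set. A sketch, with hypothetical source names and the 150-thousand-patient development size from the list above:

<code sql>
-- Draw a development sample of ~150,000 patients (hypothetical names).
-- Sampling whole patients keeps each person's history intact.
CREATE TABLE dev_patient_sample AS
SELECT patient_id
FROM (SELECT DISTINCT patient_id FROM source_visits) p
ORDER BY random()   -- engine-specific: RAND(), NEWID(), etc. elsewhere
LIMIT 150000;       -- 1,000,000 for the acceptance-test subset
</code>

Every downstream extract then joins to dev_patient_sample; a deterministic variant (e.g. hashing patient_id) makes the sample reproducible between runs.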
Quality Assurance (QA)
- How do we ensure the ETL is good?
  - Metrics for success, for each source and destination table:
    - top mapped codes, with record counts
    - top unmapped codes, with record counts (see the sketch at the end of this page)
    - variance between the previous and current run (see the sketch at the end of this page):
      - count of records
      - % of records with mapped codes
      - % of unique codes that are mapped
    - for select fields (e.g. demographics), a histogram of values
  - Compare actual to expected results.
  - Ensure referential integrity on platforms that do not enforce it (see the sketch at the end of this page).
  - It would be awesome to compare the histogram of values for a source field with that of the equivalent destination field.
  - It would be awesome to show improvements between runs due to better mapping and coding.
  - It would be nice to show the average number of conditions per visit.
  - Note that some de-identification processes introduce variance in dates or ID values.
  - How do we get business units to participate?
  - How do we get approval from business units?
  - Validate the destination data with the use cases, run the same use cases against the source data, and investigate or accept any variance.
  - Standard model checks that are independent of data or volume.
  - Automatic vs. manual checks.
  - Frequency or schedule of reviews.
- Tools:
  - Achilles
  - AutoSys
  - Oozie

Operation
- Guidance for archiving.
- Tools:
  - monitoring, e.g. Kibana
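A sketch of the "top unmapped codes" QA metric above, run against the finished CDM: unmapped records carry a standard concept_id of 0, with the original code preserved in the *_source_value column. condition_occurrence is used as the example domain table; the LIMIT syntax is engine-specific.

<code sql>
-- Top unmapped source codes in the converted CDM, by record count.
-- condition_occurrence is one example; repeat per domain table.
SELECT condition_source_value,
       COUNT(*) AS record_count
FROM condition_occurrence
WHERE condition_concept_id = 0
GROUP BY condition_source_value
ORDER BY record_count DESC
LIMIT 25;
</code>

Grouping on condition_concept_id with condition_concept_id <> 0 gives the matching top-mapped-codes report.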
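The run-over-run variance metrics are easiest if each run persists its summary numbers. A sketch, assuming a hypothetical summary table etl_run_stats(run_id, table_name, record_count, mapped_pct) written at the end of every run; :current_run and :previous_run are bind-parameter placeholders.

<code sql>
-- Variance between the previous and current run, per CDM table.
-- etl_run_stats is a hypothetical per-run summary table.
SELECT cur.table_name,
       prev.record_count                     AS prev_records,
       cur.record_count                      AS cur_records,
       cur.record_count - prev.record_count  AS record_delta,
       cur.mapped_pct   - prev.mapped_pct    AS mapped_pct_delta
FROM etl_run_stats cur
JOIN etl_run_stats prev
  ON prev.table_name = cur.table_name
WHERE cur.run_id  = :current_run
  AND prev.run_id = :previous_run;
</code>

A positive mapped_pct_delta is the "improvements between runs" signal wished for above; remember that de-identification jitter in dates or IDs can produce harmless variance here.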
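Finally, a sketch of the referential-integrity check for platforms (e.g. Hive) that do not enforce foreign keys: an anti-join from each event table back to person.

<code sql>
-- CDM events whose person_id has no matching row in person.
-- Repeat the pattern for each event table and each foreign key.
SELECT COUNT(*) AS orphan_records
FROM condition_occurrence co
LEFT JOIN person p
  ON p.person_id = co.person_id
WHERE p.person_id IS NULL;
</code>

A non-zero count is an ETL defect: either the person row was dropped or the event was attributed to a patient who does not exist in the destination.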