What do I need to do an OMOP conversion?

Separate the process into modules.

Pre-analysis

  1. Data from the source (or at least the range of values for each field).
  2. Data dictionary including:
    1. taxonomies
    2. OMOP vocabulary (reference codes)
    3. are any fields masked or modified for privacy and if so, what is the algorithm?
  3. Use min and max to measure the range of observations and of each field (see the profiling sketch after this list).
    1. Define how to handle exceptions
  4. Identify subject matter expert for source.
  5. Identify OMOP subject matter expert.
  6. Use cases for destination.
  7. List of standard questions:
    1. Number of records
    2. How often is data refreshed?
    3. Is the conversion feasible?
  8. Draft ETL Specs.
  9. Privacy requirements
  10. Definition of visit
  11. Specific codes that are of interest. Ensure these are mapped.
  12. Business rules for handling conflicting information (e.g. visits after patient death).
  13. Specific metrics that are of interest.
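
A minimal sketch of the min/max profiling in step 3, assuming a hypothetical source table source.claims with service_date and patient_age columns; substitute your own schema:

    -- Profile the range of each field; table and column names are assumptions.
    SELECT MIN(service_date) AS min_service_date,
           MAX(service_date) AS max_service_date,
           MIN(patient_age)  AS min_patient_age,
           MAX(patient_age)  AS max_patient_age,
           COUNT(*)          AS record_count
    FROM   source.claims;

Out-of-range results (e.g. dates before 1900 or in the future) are the exceptions that step 3.1 asks you to define handling for.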

Meeting with subject matter experts

Post-analysis

  1. Revised Data dictionary
  2. Initial ETL Spec including:
    1. Business rules for mapping in a detailed specification, preferably in a computer-readable format (e.g. White Rabbit).
    2. View of the mapping, preferably in a computer-readable format (e.g. Rabbit-In-a-Hat).
  3. Identify any additional mapping needed:
    1. custom or local mapping of organizational codes (see the mapping sketch after this list)
    2. country-specific codes not already in the OHDSI vocabulary, such as OHIP codes
  4. What to do with old codes that have been deactivated or invalidated but have no replacement codes
    1. add new values to be published in next version of vocabulary
  5. Preferred development methodology
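
One common home for the custom or local mappings in step 3 is the OMOP SOURCE_TO_CONCEPT_MAP table. A sketch, where the 'OHIP' vocabulary id, the example code, and the target concept are assumptions for illustration:

    -- Register a local code against a standard concept; values are illustrative.
    INSERT INTO source_to_concept_map
      (source_code, source_concept_id, source_vocabulary_id,
       source_code_description, target_concept_id, target_vocabulary_id,
       valid_start_date, valid_end_date, invalid_reason)
    VALUES
      ('A007', 0, 'OHIP', 'Intermediate assessment',
       9202, 'Visit', DATE '1970-01-01', DATE '2099-12-31', NULL);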

ETL Conversion

  1. CDM Builder
  2. SQL Development
  3. Sample size for development
  4. Major pieces of work (Scrum Epics)
    1. Each table
    2. Dimension table first
    3. Era tables (e.g. drug_era, condition_era) always last because they are derived
  5. Define range of data to convert
    1. Observation period
      1. first/last dates?
    2. EMR vs Claims
    3. Use min and max to measure the range of observations and of each field.
      1. exclude reversals for claims
      2. exclude invalid dates (see the date-filtering sketch after this list)
      3. identify the broken upstream processes that generate invalid dates, so the dates can be removed systematically
    4. Use the start and end dates of vocabulary items (see the vocabulary-validity sketch after this list)
    5. Document what to do with records that are missing required fields
      1. maybe you have a medical coder who can code from a description field
    6. Document what to do with records that have fields with invalid values
  6. Software lifecycle
    1. How do you develop, test, and accept for production?
      1. Jenkins for automated build
      2. SVN for source code control
    2. How do you manage the effort and cost of converting millions of patient records in terabytes of data?
      1. Use a sample subset of the total data based on number of patients, amount of data, or processing time (see the sampling sketch after this list)
      2. develop and test using a small sample of the entire data (e.g. 150,000 patients)
      3. business acceptance test using a larger sample of the entire data (e.g. 1 million patients)
      4. production run using entire data (millions of patients)
    3. Define destination location(s)
    4. Always get the latest vocabularies before each refresh (development, test, or production run)
    5. Where do you get the most recent list of codes?
    6. Frequency or schedule of reviews
  7. How do you become aware of updates to CDM?
  8. How do you become aware of updates to vocabularies?
  9. Partitioning for parallelism to optimize performance
  10. Guidelines for incremental update
  11. Reusable code/Tables
  12. Intermediate model?
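
A sketch of the date filtering in step 5, assuming a hypothetical source.claims table with service_date and claim_status columns and a 1990 cutoff; adjust names and bounds to your source:

    -- Exclude reversals and invalid or out-of-range dates before loading.
    SELECT *
    FROM   source.claims
    WHERE  claim_status <> 'REVERSAL'          -- drop claim reversals
      AND  service_date IS NOT NULL
      AND  service_date BETWEEN DATE '1990-01-01' AND CURRENT_DATE;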
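
For step 5.4, the vocabulary's own validity dates can flag deactivated codes. A sketch against the standard CONCEPT table, with 'ICD10CM' as an example vocabulary:

    -- List codes that are no longer valid in the current vocabulary.
    SELECT concept_code, concept_name, valid_end_date, invalid_reason
    FROM   concept
    WHERE  vocabulary_id = 'ICD10CM'
      AND  invalid_reason IS NOT NULL;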
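
A sketch of the patient-based sampling in step 6.2, using a modulus on a numeric patient id to draw a deterministic ~1% development subset; the table, column, and technique are assumptions (a hash works as well for non-numeric ids):

    -- Deterministic 1% sample keyed on the patient, so all of a
    -- patient's records stay together across tables.
    CREATE TABLE dev_sample_patients AS
    SELECT patient_id
    FROM   source.patients
    WHERE  MOD(patient_id, 100) = 0;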

Quality Assurance (QA)

  1. How do we ensure the ETL is good?
    1. metrics for success
    2. for each source and destination table
      1. top mapped codes with count of records
      2. top unmapped codes with count of records (see the code-count sketch after this list)
      3. variance between previous and current run
      4. count of records
      5. % of records with mapped codes
      6. % of unique codes that are mapped
      7. for select fields (demographics), show histogram of values
    3. compare actual to expected results
    4. ensure referential integrity on platforms that do not enforce it (see the orphan-check sketch after this list)
    5. it would be awesome to compare the histogram of values for each source field with its destination equivalent
    6. it would be awesome to show improvements between runs due to better mapping and coding
    7. it would be nice to show the average number of conditions per visit
    8. some deidentification processes introduce variance in dates or id values
  2. How do we get business units to participate?
  3. How do we get approval from business units?
  4. Validate destination data with use cases and compare against source data with use cases. Investigate or accept variance.
  5. Standard model checks that are independent of data or volume
  6. automatic vs manual checks
  7. Frequency or schedule of reviews
  8. Tools
    1. Achilles
    2. Autosys
    3. Oozie
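
A sketch of the mapped/unmapped code counts and mapping percentage from step 1.2, using condition_occurrence as the example table (concept_id 0 means unmapped in OMOP):

    -- Top unmapped source codes with record counts.
    SELECT condition_source_value, COUNT(*) AS records
    FROM   condition_occurrence
    WHERE  condition_concept_id = 0
    GROUP  BY condition_source_value
    ORDER  BY records DESC;

    -- Percentage of records with mapped codes.
    SELECT 100.0 * SUM(CASE WHEN condition_concept_id <> 0 THEN 1 ELSE 0 END)
           / COUNT(*) AS pct_mapped
    FROM   condition_occurrence;

Run the same queries per table and compare against the previous run to get the variance metric in step 1.2.3.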
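
For the referential integrity check in step 1.4, an anti-join finds orphaned records on platforms that do not enforce foreign keys; condition_occurrence vs. person is one example pair:

    -- Condition records whose person is missing from the person table.
    SELECT co.condition_occurrence_id
    FROM   condition_occurrence co
    LEFT   JOIN person p ON p.person_id = co.person_id
    WHERE  p.person_id IS NULL;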

Operation

  1. Guidance for archiving
  2. Tools
    1. monitoring
    2. Kibana