What do I need to do an OMOP conversion? Separate the process into modules:

Pre-analysis
- Data from the source (or at least the range of values for each field).
- Data dictionary, including:
  - taxonomies
  - OMOP vocabulary (reference codes)
  - whether any fields are masked or modified for privacy, and if so, what the algorithm is
- Use min and max to measure the range of values for each field (a profiling sketch follows this section).
- Define how to handle exceptions.
- Identify a subject matter expert for the source.
- Identify an OMOP subject matter expert.
- Use cases for the destination.
- List of standard questions:
  - How many records are there?
  - How often is the data refreshed?
  - Is the conversion feasible?
- Draft ETL specs, covering:
  - privacy requirements
  - the definition of a visit
  - specific codes that are of interest; ensure these are mapped
  - business rules for handling conflicting information (e.g. visits after patient death)
  - specific metrics that are of interest

Meeting with subject matter experts

Post-analysis
- Revised data dictionary.
- Initial ETL spec, including:
  - business rules for mapping in a detailed specification, preferably in a computer-readable format, like [[https://www.ohdsi.org/analytic-tools/whiterabbit-for-etl-design/?ModPagespeed=noscript|White Rabbit]]
  - a view of the mapping, preferably in a computer-readable format, like [[https://www.ohdsi.org/analytic-tools/whiterabbit-for-etl-design/?ModPagespeed=noscript|Rabbit-In-a-Hat]]
  - any additional mapping needed:
    - custom or local mapping of organizational codes
    - country-specific codes not already in the OHDSI vocabulary, like OHIP codes
    - what to do with old codes that have been deactivated or invalidated but have no replacement codes (see the vocabulary query after this section)
    - new values to add for publication in the next version of the vocabulary
- Preferred development methodology.

ETL Conversion
- CDM Builder.
- SQL development.
- Sample size for development (see the sampling sketch after this section).
- Major pieces of work (Scrum epics), one per table:
  - dimension tables first
  - era tables (e.g. drug_era, condition_era) always last, because they are derived
- Define the range of data to convert:
  - observation period - first/last?
  - EMR vs. claims
  - use min and max to measure the range for observations and for each field
  - exclude reversals for claims
  - exclude invalid dates, and identify the broken processes that generate them to help remove them
  - use the start and end dates of vocabulary items
  - document what to do with records that are missing required fields (maybe you have a medical coder who can code from a description field)
  - document what to do with records that have fields with invalid values
- Software lifecycle: how do you develop, test, and accept for production?
  - Jenkins for automated builds
  - SVN for source code control
- How do you manage the effort and cost of converting millions of patient records in terabytes of data?
  - Use a sample subset of the total data, sized by number of patients, amount of data, or processing time:
    - develop and test using a small sample of the entire data (150 thousand patients)
    - run business acceptance tests using a larger sample (1 million patients)
    - run production using the entire data (millions of patients)
- Define the destination location(s).
- Always get the latest vocabularies before each refresh (development, test, or production run).
  - Where do you get the most recent list of codes?
  - Frequency or schedule of reviews.
  - How do you become aware of updates to the CDM?
  - How do you become aware of updates to the vocabularies?
- Partitioning for parallelism to optimize performance.
- Guidelines for incremental updates.
- Reusable code/tables - an intermediate model?
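Throughout this page, "use min and max" means simple field profiling. White Rabbit produces this kind of scan automatically; a minimal hand-rolled equivalent looks like the sketch below, where the table and column names (source_visits, patient_id, visit_start_date, patient_year_of_birth) are hypothetical stand-ins for your source schema.

<code sql>
-- Profile the range of values for the fields of one source table.
-- All table and column names here are hypothetical examples.
SELECT COUNT(*)                    AS record_count,
       COUNT(DISTINCT patient_id)  AS patient_count,
       MIN(visit_start_date)       AS min_visit_start,
       MAX(visit_start_date)       AS max_visit_start,
       MIN(patient_year_of_birth)  AS min_year_of_birth,
       MAX(patient_year_of_birth)  AS max_year_of_birth
FROM source_visits;
</code>

Values outside the expected range (e.g. visit dates in the future, or before the source system existed) become input to the exception-handling and invalid-date rules above.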
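For the "deactivated or invalidated codes with no replacement" item, one way to enumerate the problem codes is to check invalid_reason in the OMOP concept table and look for a surviving 'Maps to' relationship. The sketch below assumes a hypothetical staging table source_codes(code, record_count) of distinct codes from the source, and uses 'ICD9CM' purely as an example vocabulary.

<code sql>
-- Source codes that are invalid in the OMOP vocabulary and have no
-- 'Maps to' replacement. source_codes is a hypothetical staging table.
SELECT sc.code,
       sc.record_count
FROM source_codes sc
JOIN concept c
  ON  c.concept_code  = sc.code
  AND c.vocabulary_id = 'ICD9CM'   -- example vocabulary only
WHERE c.invalid_reason IS NOT NULL -- deactivated/invalidated code
  AND NOT EXISTS (
        SELECT 1
        FROM concept_relationship cr
        WHERE cr.concept_id_1    = c.concept_id
          AND cr.relationship_id = 'Maps to'
          AND cr.invalid_reason IS NULL);
</code>

Codes returned by this query need a documented business rule: map them by hand, load them as unmapped, or propose new vocabulary entries as noted above.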
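For the sample-subset strategy, sample by patient rather than by row so that every sampled patient keeps a complete record set. A sketch, with hypothetical source names and the 150-thousand-patient development size from the list above:

<code sql>
-- Draw a development sample of ~150,000 patients (hypothetical names).
-- Sampling whole patients keeps each person's history intact.
CREATE TABLE dev_patient_sample AS
SELECT patient_id
FROM (SELECT DISTINCT patient_id FROM source_visits) p
ORDER BY random()   -- engine-specific: RAND(), NEWID(), etc. elsewhere
LIMIT 150000;       -- 1,000,000 for the acceptance-test subset
</code>

Every downstream extract then joins to dev_patient_sample; a deterministic variant (e.g. hashing patient_id) makes the sample reproducible between runs.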
Quality Assurance (QA)
- How do we ensure the ETL is good?
  - Metrics for success, for each source and destination table:
    - top mapped codes, with record counts
    - top unmapped codes, with record counts (see the sketch at the end of this page)
    - variance between the previous and current run (see the sketch at the end of this page):
      - count of records
      - % of records with mapped codes
      - % of unique codes that are mapped
    - for select fields (e.g. demographics), a histogram of values
  - Compare actual to expected results.
  - Ensure referential integrity on platforms that do not enforce it (see the sketch at the end of this page).
  - It would be awesome to compare the histogram of values for a source field with that of the equivalent destination field.
  - It would be awesome to show improvements between runs due to better mapping and coding.
  - It would be nice to show the average number of conditions per visit.
  - Note that some de-identification processes introduce variance in dates or ID values.
  - How do we get business units to participate?
  - How do we get approval from business units?
  - Validate the destination data with the use cases, run the same use cases against the source data, and investigate or accept any variance.
  - Standard model checks that are independent of data or volume.
  - Automatic vs. manual checks.
  - Frequency or schedule of reviews.
- Tools:
  - Achilles
  - AutoSys
  - Oozie

Operation
- Guidance for archiving.
- Tools:
  - monitoring, e.g. Kibana
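A sketch of the "top unmapped codes" QA metric above, run against the finished CDM: unmapped records carry a standard concept_id of 0, with the original code preserved in the *_source_value column. condition_occurrence is used as the example domain table; the LIMIT syntax is engine-specific.

<code sql>
-- Top unmapped source codes in the converted CDM, by record count.
-- condition_occurrence is one example; repeat per domain table.
SELECT condition_source_value,
       COUNT(*) AS record_count
FROM condition_occurrence
WHERE condition_concept_id = 0
GROUP BY condition_source_value
ORDER BY record_count DESC
LIMIT 25;
</code>

Grouping on condition_concept_id with condition_concept_id <> 0 gives the matching top-mapped-codes report.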
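The run-over-run variance metrics are easiest if each run persists its summary numbers. A sketch, assuming a hypothetical summary table etl_run_stats(run_id, table_name, record_count, mapped_pct) written at the end of every run; :current_run and :previous_run are bind-parameter placeholders.

<code sql>
-- Variance between the previous and current run, per CDM table.
-- etl_run_stats is a hypothetical per-run summary table.
SELECT cur.table_name,
       prev.record_count                     AS prev_records,
       cur.record_count                      AS cur_records,
       cur.record_count - prev.record_count  AS record_delta,
       cur.mapped_pct   - prev.mapped_pct    AS mapped_pct_delta
FROM etl_run_stats cur
JOIN etl_run_stats prev
  ON prev.table_name = cur.table_name
WHERE cur.run_id  = :current_run
  AND prev.run_id = :previous_run;
</code>

A positive mapped_pct_delta is the "improvements between runs" signal wished for above; remember that de-identification jitter in dates or IDs can produce harmless variance here.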
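Finally, a sketch of the referential-integrity check for platforms (e.g. Hive) that do not enforce foreign keys: an anti-join from each event table back to person.

<code sql>
-- CDM events whose person_id has no matching row in person.
-- Repeat the pattern for each event table and each foreign key.
SELECT COUNT(*) AS orphan_records
FROM condition_occurrence co
LEFT JOIN person p
  ON p.person_id = co.person_id
WHERE p.person_id IS NULL;
</code>

A non-zero count is an ETL defect: either the person row was dropped or the event was attributed to a patient who does not exist in the destination.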