Genomic Data in the CDM

I would like to join.

Hello everyone. I am quite happy that a lot of people share these needs. Yu Rang and I shall upload our poster tables to GitHub by next week. I do believe that these efforts can help realize precision medicine.

sign me up as well, please

Sign me up too.

I would like to join!

Since NGS in cancer patients began to be covered by the Korean national insurance system this year, many Korean hospitals in the OHDSI network are just beginning to collect genomic data from cancer patients.

18,000+ cases of Foundation Medicine (FMI) single nucleotide variants are now available in the NCI Genomic Data Commons (GDC). If the clinical and genomic information in such open databases can be converted into the CDM, it can be combined with real-world CDM data to create strong evidence.

I think the oncology working group can cooperate with the genomic data group on this ambitious goal, since until now it has been hard to convert the clinical information in open genomic databases into OMOP-CDM.
@rimma @clairblacketer

I would like to join also.

I would like to join as well

I’m interested to join as well

Please sign me up too!

Sign me up please!

Please sign me up too.

Thank you everyone for the interest! Please fill out the doodle poll with your availability. Right now the times are all in Eastern Standard Time, so please keep that in mind when filling it out.

Clair

Just some thoughts to get the conversation going.

Data model:
Vertical/EAV (similar to the existing data model of OHDSI), horizontal (Kyu Pyo Kim’s model), or a hybrid model (Seng Chan You’s model).
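
To make the vertical versus horizontal distinction concrete, here is a minimal sketch. All field names (`gene_symbol`, `hgvs_c`, `zygosity`, `variant_id`) are hypothetical and chosen only for illustration; they are not taken from any of the proposed models.

```python
from dataclasses import dataclass, asdict

# Hypothetical field names, for illustration only.
@dataclass
class VariantWide:
    """Horizontal layout: one row per variant, one column per attribute."""
    person_id: int
    gene_symbol: str
    hgvs_c: str
    zygosity: str

@dataclass
class VariantFact:
    """Vertical/EAV layout: one row per (variant, attribute) pair."""
    person_id: int
    variant_id: int
    attribute: str
    value: str

def to_eav(row: VariantWide, variant_id: int) -> list[VariantFact]:
    """Unpivot a horizontal row into EAV rows, keeping person_id as the key."""
    attrs = asdict(row)
    person_id = attrs.pop("person_id")
    return [VariantFact(person_id, variant_id, k, str(v)) for k, v in attrs.items()]

wide = VariantWide(person_id=1, gene_symbol="BRCA1", hgvs_c="c.68_69del",
                   zygosity="heterozygous")
eav_rows = to_eav(wide, variant_id=100)
for fact in eav_rows:
    print(fact)
```

The horizontal row is easier to query but needs a schema change for every new attribute; the EAV rows absorb new attributes without a schema change at the cost of more joins. A hybrid model would keep the stable attributes as columns and push the rest into EAV rows.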

Basic variant info / Annotations:
What annotations should we include? Just enough data to define a variant, or some annotation data as well? The argument for including annotation data is that it is extremely useful for searching for variants or understanding why they were classified as pathogenic at the time. The argument against is that annotations can become obsolete and should be left in external data sources. Semi-static annotation data includes gene name, functional type, accession, rsID, quality scores from the chromatograms, etc. Rapidly changing annotation data includes MAF and versions of pathogenicity scores (e.g., CADD and PolyPhen).

Interpretation / Report:
Should the interpretation be a single field with everything concatenated into one string, or should we define multiple fields? Genetic pathology reports are highly variable, and we have to create a data model that accommodates all of them without becoming overly complex.

Scope:
What types of variants should be stored? My opinion is that storing genetic data in OHDSI should be limited to variants with a clinical interpretation. These would include genetic reports from pathology or variants identified by a robust clinical decision support pipeline with high confidence of pathogenicity. OHDSI is not meant to have thousands of variants per individual; there are plenty of other systems that are meant to deal with that kind of research data (DNANexus, GeneInsight, cBioPortal, etc.).

Use cases:

  • Search for all conditions in patients with a pathogenic variant in a specific gene. (Find new links between phenotypes and genes.)

  • Discover variants which have been changed from pathogenic to benign. This may be important for patient notifications.

  • New discoveries in PGx

  • Let’s come up with more!
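
As a rough sketch of the first use case above, the query below joins the standard `condition_occurrence` table to a hypothetical `variant_occurrence` table. The columns `gene_symbol` and `clinical_significance` are assumptions made for illustration, and the concept IDs are made-up placeholders, not real OMOP concepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
-- Hypothetical table and columns; not an agreed OMOP CDM extension.
CREATE TABLE variant_occurrence (
    person_id INTEGER, gene_symbol TEXT, clinical_significance TEXT);
-- Placeholder concept IDs, not real OMOP concepts.
INSERT INTO condition_occurrence VALUES (1, 1001), (1, 1002), (2, 1001), (3, 1003);
INSERT INTO variant_occurrence VALUES
    (1, 'BRCA1', 'pathogenic'),
    (2, 'BRCA1', 'benign'),
    (3, 'TP53', 'pathogenic');
""")

def conditions_for_gene(conn, gene):
    """Count distinct carriers of a pathogenic variant in `gene` per condition concept."""
    sql = """
        SELECT c.condition_concept_id, COUNT(DISTINCT c.person_id) AS n_persons
        FROM variant_occurrence v
        JOIN condition_occurrence c ON c.person_id = v.person_id
        WHERE v.gene_symbol = ? AND v.clinical_significance = 'pathogenic'
        GROUP BY c.condition_concept_id
        ORDER BY c.condition_concept_id
    """
    return conn.execute(sql, (gene,)).fetchall()

brca1_conditions = conditions_for_gene(conn, "BRCA1")
print(brca1_conditions)
```

Only person 1 carries a pathogenic BRCA1 variant in this toy data, so person 2's benign variant contributes nothing to the counts.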

I’ve written a document that tries to capture some of these thoughts. I’ve also created a spreadsheet of the potential fields of the data model here. The “Proposed” sheet shows a list of proposed fields, and their FHIR counterparts. This is an aggregation of fields that I have seen across many pathology reports. I’ve also created a brief presentation which includes notes for each slide. Slides 7-9 describe trying to use the existing CDM with minimal changes. Slide 10 describes a potential horizontal data model.

These documents are missing the refinements or alternate models presented by @KKP1122, Yurang_Park, & @SCYou. I will add them soon and try to represent their work as best as I can. If anyone would like for me to add you as an editor to these documents, then please message me.

Looking forward to the discussion.

Could you send us a link to your poster / data model? I can’t find it in the list of posters from the symposium.

Clair,

Thank you for organizing, sign me in please.

David,

One important use case and modeling consideration: an absence of a variant may be as important for the analysis as its presence. The usual OMOP CDM convention “no record” means “NO” would not work in this case.

Looking forward to the discussion.

Thank you.

Love the fact that this would be use case driven, and we don’t try to boil the ocean or recreate another variant calling and storage mechanism. Or fall into the “attic trap”, trying to store any potentially useful information by all means.

Don’t understand this. Whom do you want to notify of what?

Please do. I’d put in:

  • Create hypothesis generation or testing methods for connecting variants to any type of phenotype (could be Condition, but also timing of things, severity, pharmacological effect, etc.). We are the only ones who would be able to pull that off.

Can you define where we get that from? Generally, in OMOP CDM we have no verbatim texts (with very few exceptions), so annotations would have to be conceptualized.

How do you make that call?

With help from @ShinSeojeong and my colleagues, our first draft of the Genetic CDM has been released on Google Drive.
This model was developed on the basis of the ISO standard for reporting NGS results (ISO/TS 20428, ‘Health Informatics-Data elements and their metadata for describing structured clinical genomic sequence information in electronic health records’).

This is our first draft and we need your thorough review and comments!
I agree with @rimma's comment: it is important to know that ‘there is no mutation in certain genes’. We need to figure out how to record which genes were targeted in targeted NGS.
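
A minimal sketch of why the list of target genes matters: with it, a missing variant record can be split into "tested, nothing found" and "never tested". The structures `panel_genes` and `reported_variants` are hypothetical stand-ins for whatever tables the model ends up with.

```python
from enum import Enum

class GeneStatus(Enum):
    VARIANT_FOUND = "variant found"
    TESTED_NEGATIVE = "tested, no variant reported"
    NOT_TESTED = "not tested"

# Hypothetical data: genes covered by each person's panel, and genes with a reported variant.
panel_genes = {1: {"EGFR", "KRAS", "TP53"}, 2: {"EGFR"}}
reported_variants = {1: {"KRAS"}, 2: set()}

def gene_status(person_id: int, gene: str) -> GeneStatus:
    """Distinguish a true negative from a gene that was never assayed."""
    if gene in reported_variants.get(person_id, set()):
        return GeneStatus.VARIANT_FOUND
    if gene in panel_genes.get(person_id, set()):
        return GeneStatus.TESTED_NEGATIVE
    return GeneStatus.NOT_TESTED

print(gene_status(1, "KRAS"))   # variant reported
print(gene_status(1, "EGFR"))   # on the panel, nothing reported
print(gene_status(2, "KRAS"))   # not on this person's panel
```

Without the panel information, the second and third cases would both look like "no record" and could not be told apart.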

Thank you for your comments, @davidfasel.
Basic variant info / Annotations:
Basically, I agree with David’s thoughts. Annotation data is useful, but it changes rapidly and is very large. I would also prefer to leave this data in external data sources, if possible. That is the reason why we added a separate table for annotation or basic variant info.
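
One possible benefit of keeping annotations in their own table is that dated annotation versions can be compared to find reclassified variants, as in the use case raised earlier in the thread. This is only a sketch: the record layout, the variant keys, and the classifications are all made up for illustration.

```python
from datetime import date

# Hypothetical versioned annotation records: (variant_key, annotation_date, classification).
annotations = [
    ("GENE1:c.100A>G", date(2016, 1, 1), "pathogenic"),
    ("GENE1:c.100A>G", date(2018, 1, 1), "benign"),
    ("GENE2:c.200C>T", date(2016, 1, 1), "pathogenic"),
]

def downgraded_variants(records):
    """Return variant keys whose earliest class was pathogenic and latest is benign."""
    history = {}
    # Sort by variant key, then by date, so each history list is chronological.
    for key, when, cls in sorted(records, key=lambda r: (r[0], r[1])):
        history.setdefault(key, []).append(cls)
    return [k for k, h in history.items()
            if h[0] == "pathogenic" and h[-1] == "benign"]

downgraded = downgraded_variants(annotations)
print(downgraded)
```

Patients carrying a variant in the returned list could then be looked up through the variant table.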

Interpretation/Report
I think we can store the original pathology report and the genetic pathology report in the ‘note’ table of the existing CDM.

Scope
As @Christian_Reich said, I think it would be hard to define ‘limited to variants with a clinical interpretation’ (in our model, the information on clinical implication should be stored in the ‘variant_annotation’ table). I don’t think data on thousands of variants is in itself overwhelming compared with what the CDM already holds. We store every single device, medication, and note in the CDM now. The current variant_occurrence table in our model has 23 columns, and most patients have a single NGS result.

Use case
An ongoing project of mine is developing machine learning models to predict outcomes in cancer patients from combined genomic and clinical data. Owing to the great contributions of @Rijnbeek, @jennareps, @schuemie and their colleagues, it won’t be that hard to build this by modifying the feature extraction package and using the patient-level prediction package.

Another ambitious goal of mine is converting existing open genomic databases of cancer patients into OMOP-CDM. This would make it possible to leverage accumulated genomic and clinical data to generate better evidence. Collaboration with the oncology group is absolutely essential for this goal, in order to capture information from existing oncology registries in OMOP-CDM.

Please sign me up.
