Get Real: Combining clinical and genetic covariates in a synthetic dataset.

Motivation and Goals

We want to develop a script that generates realistic synthetic datasets for hands-on learning in BD2K workshops.

Learning Objectives

In generating the data, we have the following learning objectives:

Learn the difficulties and challenges of working with both clinical data and genetic data
Learn the strengths/weaknesses of machine learning algorithms for classification in a “safe context”
Highlight known issues with integrating clinical and genetic data

Methods

Reverse-engineering a Decision Tree

Our inspiration for this data set is the LOOK-AHEAD study, which integrated both clinical and genetic covariates for predicting low/high cardiovascular risk over 10 years.

The basic idea behind the data generation script is that we use a decision tree to define cardiovascular risk groups.

Our first step is to rank the variables by importance, which determines the level at which each variable appears in the decision tree.

Defining the decision tree

We use the decision tree as a communication tool to define realistic risk groups. For example, Risk Group 1 consists of Caucasians with high hypertension who are over 50 years of age.

[Figure: risk group tree]
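As a minimal sketch of how one path through the tree could be written down, the example below encodes Risk Group 1 as a set of covariate restrictions and checks whether a patient satisfies them. The names `RISK_GROUPS` and `in_risk_group` are hypothetical, hypertension is treated as a simple TRUE/FALSE flag, and the age cutoff is assumed to be strict; none of these details come from the actual script.

```python
# Hypothetical encoding of a risk group as restrictions on covariates.
# Only Risk Group 1 is shown, using the restrictions named in the text.
RISK_GROUPS = {
    1: {
        "race": lambda value: value == "Caucasian",
        "hypertension": lambda value: value is True,  # assumed boolean flag
        "age": lambda value: value > 50,              # assumed strict cutoff
    },
}


def in_risk_group(patient, group_id):
    """Return True if the patient meets every restriction of the group."""
    rules = RISK_GROUPS[group_id]
    return all(check(patient[name]) for name, check in rules.items())


patient = {"race": "Caucasian", "hypertension": True, "age": 63}
print(in_risk_group(patient, 1))  # prints True
```

Writing each group as a dictionary of checks keeps the tree readable: covariates that a group does not mention are simply unrestricted.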

Estimating frequencies of risk groups

Taken together, the risk groups in the decision tree define a complete set of patient categories that we can sample from. The frequency of each risk group can then be estimated from actual patient data.

[Figure: frequency table]
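One simple way to estimate these frequencies, assuming each real patient has already been assigned a risk group label, is to count the labels and normalize. The function name `estimate_frequencies` and the labels in `observed` are made up for illustration; they are not real patient data.

```python
from collections import Counter


def estimate_frequencies(group_labels):
    """Return the relative frequency of each risk group label."""
    counts = Counter(group_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one labeled patient")
    return {group: n / total for group, n in sorted(counts.items())}


# Made-up labels, one per patient (illustration only).
observed = [1, 1, 2, 6, 6, 6, 2, 1]
freqs = estimate_frequencies(observed)
print(freqs)
```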

Sampling the Risk Group Space

Once we have frequencies for each risk group, we can use these frequencies as a set of probabilities that we sample from to define our patient population.
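The sampling step can be sketched with the Python standard library as follows. The frequencies shown are placeholders rather than estimates from real data, and `sample_risk_groups` is a hypothetical helper name.

```python
import random


def sample_risk_groups(frequencies, n_patients, seed=None):
    """Draw one risk group per synthetic patient, weighted by frequency."""
    rng = random.Random(seed)  # seeded generator for reproducible datasets
    groups = list(frequencies)
    weights = [frequencies[group] for group in groups]
    return rng.choices(groups, weights=weights, k=n_patients)


# Placeholder frequencies (illustration only).
frequencies = {1: 0.375, 2: 0.25, 6: 0.375}
population = sample_risk_groups(frequencies, n_patients=1000, seed=42)
print(population[:10])
```

Fixing the seed lets every workshop participant regenerate the same synthetic population.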

Each risk group restricts some of the clinical and genetic covariates and leaves the others free. For example, Risk Group 6 has the restrictions of Age < 50, BMI > 25, and Type 2 Diabetes = TRUE, but is unrestricted in the other covariates (for example, Race can be Non-Caucasian/Caucasian, Hypertension = TRUE/FALSE, Smoking = Y/N, SNP1 = AA/GG/AG).
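A rough sketch of how a single patient in Risk Group 6 might be generated: restricted covariates are forced to satisfy the group's rules, and unrestricted ones are drawn freely. Several details here are assumptions made only for illustration: the lower age bound (18), the upper BMI bound (45.0), and the uniform sampling of unrestricted covariates. `sample_patient` and `GROUP_6` are hypothetical names.

```python
import random

# Levels of the categorical covariates, as listed in the text.
CATEGORICAL_LEVELS = {
    "race": ["Caucasian", "Non-Caucasian"],
    "hypertension": [True, False],
    "smoking": ["Y", "N"],
    "snp1": ["AA", "GG", "AG"],
    "type2_diabetes": [True, False],
}

# Risk Group 6 restrictions from the text; outer numeric bounds are assumed.
GROUP_6 = {
    "fixed": {"type2_diabetes": True},
    "age_range": (18, 49),       # Age < 50; lower bound is an assumption
    "bmi_range": (25.1, 45.0),   # BMI > 25; upper bound is an assumption
}


def sample_patient(group, rng):
    """Sample one patient that satisfies the group's restrictions."""
    # Start with every categorical covariate drawn uniformly (assumption).
    patient = {
        name: rng.choice(levels) for name, levels in CATEGORICAL_LEVELS.items()
    }
    # Overwrite the covariates that the risk group fixes.
    patient.update(group["fixed"])
    low, high = group["age_range"]
    patient["age"] = rng.randint(low, high)  # inclusive on both ends
    low, high = group["bmi_range"]
    patient["bmi"] = round(rng.uniform(low, high), 1)
    return patient


rng = random.Random(0)
print(sample_patient(GROUP_6, rng))
```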

Genetic Covariates

Genetic covariates are handled like the clinical covariates discussed above, except that their genotype frequencies depend on race.
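One possible way to express this dependence is a lookup table of genotype frequencies keyed by race, sampled after the patient's race is known. The frequencies in `SNP1_FREQS` are invented purely for illustration and do not describe any real SNP; `sample_genotype` is a hypothetical helper.

```python
import random

# Invented genotype frequencies for SNP1 by race (NOT real data).
SNP1_FREQS = {
    "Caucasian": {"AA": 0.50, "AG": 0.40, "GG": 0.10},
    "Non-Caucasian": {"AA": 0.20, "AG": 0.50, "GG": 0.30},
}


def sample_genotype(race, rng):
    """Draw a SNP1 genotype using the frequency table for the given race."""
    table = SNP1_FREQS[race]
    genotypes = list(table)
    weights = list(table.values())
    return rng.choices(genotypes, weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_genotype("Caucasian", rng))
```

Because the genotype is drawn conditional on race, the two covariates are correlated in the resulting dataset, which is one of the integration issues learners are meant to encounter.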

Discussion Questions

We welcome any input and thoughts on the following discussion questions:

Decision trees provide a framework for integrating clinical and genetic data types, but they are overly simplistic.
- How difficult should we make the problem?
- What are the tradeoffs between learning and difficulty?
How can we incorporate other techniques such as natural language processing (NLP) into this task?
What are other scenarios we can model?
What extra covariates should we include for clinical and genetic data?
- Should they be extraneous or collinear with other variables?

Availability and Feedback

Once the script is finalized, it will be available on GitHub. Link to come soon.

We encourage feedback! laderast@ohsu.edu