Motivation and Goals

We want to develop a script that generates realistic synthetic datasets for hands-on learning in BD2K workshops.

Learning Objectives

In generating the data, we have the following learning objectives:

Methods

Reverse-engineering a Decision Tree

Our inspiration for this dataset is the LOOK-AHEAD study, which integrated both clinical and genetic covariates to predict low or high cardiovascular risk over 10 years.

The basic idea behind the data generation script is that we use a decision tree to define cardiovascular risk groups.

Our first step is to define the importance of each variable, which determines the level at which that variable appears in the decision tree.
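
As a rough illustration of this step, the sketch below ranks variables by an importance score and maps each one to a tree level. The variable names and scores are hypothetical placeholders, and the rule that the most important variable sits at the top of the tree is an assumption made for this example.

```python
# Hypothetical importance scores; the actual values are not given in the text.
variable_importance = {
    "race": 0.40,
    "hypertension": 0.30,
    "age": 0.20,
    "bmi": 0.10,
}


def tree_levels(importance):
    """Map each variable to a tree level, with level 1 as the root split.

    Assumes that a higher importance score places a variable closer to the root.
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    return {name: level for level, name in enumerate(ranked, start=1)}


levels = tree_levels(variable_importance)
```

A real decision tree may split on different variables in different branches, so a single level per variable is a simplification.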

Defining the decision tree

We use the decision tree as a communication tool for defining realistic risk groups. For example, risk group 1 consists of Caucasians who have High Hypertension and are over 50 years of age.
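
One possible way to encode such a risk group is as a set of per-covariate constraints. The following is a minimal sketch, not the project's actual script: the names `RISK_GROUPS`, `matches`, and `assign_risk_group` are hypothetical, and hypertension is encoded as a boolean for simplicity.

```python
# Each risk group maps covariate names to a predicate the patient must satisfy.
# Only risk group 1 from the example is encoded; the field names are assumptions.
RISK_GROUPS = {
    1: {
        "race": lambda value: value == "Caucasian",
        "hypertension": lambda value: value is True,
        "age": lambda value: value > 50,
    },
}


def matches(patient, constraints):
    """Return True if the patient satisfies every constraint of a risk group."""
    return all(test(patient[name]) for name, test in constraints.items())


def assign_risk_group(patient, groups=RISK_GROUPS):
    """Return the id of the first matching risk group, or None if none match."""
    for group_id, constraints in groups.items():
        if matches(patient, constraints):
            return group_id
    return None


older_patient = {"race": "Caucasian", "hypertension": True, "age": 62}
younger_patient = {"race": "Caucasian", "hypertension": True, "age": 40}
```

Here `assign_risk_group(older_patient)` returns 1, while `younger_patient` falls outside the single group defined in this sketch and gets `None`.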

risk-group-tree

Estimating frequencies of risk groups

Taken together, the risk groups in the decision tree define a complete set of patient categories that we can sample from. The frequency of each risk group can then be estimated from actual patient data.
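
A minimal sketch of the frequency estimate, assuming each real patient has already been labeled with a risk group id. The `observed` list is made-up illustration data, not study data.

```python
from collections import Counter


def estimate_frequencies(group_labels):
    """Return the relative frequency of each risk group in a list of labels."""
    counts = Counter(group_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one labeled patient")
    return {group: n / total for group, n in sorted(counts.items())}


# Made-up risk group labels for ten patients.
observed = [1, 1, 2, 3, 3, 3, 6, 6, 6, 6]
freqs = estimate_frequencies(observed)
```

The resulting frequencies sum to 1, so they can be used directly as sampling probabilities.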

frequency-table

Sampling the Risk Group Space

Once we have frequencies for each risk group, we can treat them as sampling probabilities: each synthetic patient is assigned a risk group drawn according to these frequencies.
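
The sampling step could look like the sketch below, which uses Python's standard `random.choices` for weighted sampling. The frequency values and the function name `sample_risk_groups` are hypothetical.

```python
import random


def sample_risk_groups(frequencies, n_patients, seed=None):
    """Draw one risk group id per synthetic patient, weighted by frequency."""
    rng = random.Random(seed)
    groups = list(frequencies)
    weights = [frequencies[group] for group in groups]
    return rng.choices(groups, weights=weights, k=n_patients)


# Hypothetical risk group frequencies.
frequencies = {1: 0.2, 2: 0.1, 3: 0.3, 6: 0.4}
population = sample_risk_groups(frequencies, n_patients=1000, seed=42)
```

Fixing the seed makes the generated population reproducible, which is convenient for workshop material.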

Each risk group restricts some of the clinical and genetic covariates and leaves the rest free. For example, Risk Group 6 requires Age < 50, BMI > 25, and Type 2 Diabetes = TRUE, but is unrestricted in the other covariates (for example, Race can be Non-Caucasian/Caucasian, Hypertension = TRUE/FALSE, Smoking = Y/N, SNP1 = AA/GG/AG).
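
To illustrate, the sketch below generates one synthetic patient for Risk Group 6: the restricted covariates are drawn only from their allowed ranges, and the unrestricted ones are drawn freely. The numeric bounds (minimum age 18, maximum BMI 45) and the uniform draws are assumptions for this example; a realistic script would use distributions estimated from data.

```python
import random


def sample_group6_patient(rng):
    """Generate one patient satisfying the Risk Group 6 restrictions."""
    return {
        "risk_group": 6,
        # Restricted covariates
        "age": rng.randint(18, 49),                 # Age < 50; lower bound assumed
        "bmi": round(rng.uniform(25.1, 45.0), 1),   # BMI > 25; upper bound assumed
        "type2_diabetes": True,
        # Unrestricted covariates (uniform draws are a placeholder)
        "race": rng.choice(["Caucasian", "Non-Caucasian"]),
        "hypertension": rng.choice([True, False]),
        "smoking": rng.choice(["Y", "N"]),
        "snp1": rng.choice(["AA", "AG", "GG"]),
    }


rng = random.Random(0)
patients = [sample_group6_patient(rng) for _ in range(500)]
```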

Genetic Covariates

Genetic covariates are handled like the clinical covariates discussed above, except that their frequencies depend on race.
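
A minimal sketch of race-dependent genotype sampling follows. The genotype frequencies in `SNP1_FREQ_BY_RACE` are invented for illustration and do not come from any study; the function name `sample_genotype` is likewise hypothetical.

```python
import random

# Hypothetical SNP1 genotype frequencies for each race category.
SNP1_FREQ_BY_RACE = {
    "Caucasian": {"AA": 0.5, "AG": 0.4, "GG": 0.1},
    "Non-Caucasian": {"AA": 0.2, "AG": 0.5, "GG": 0.3},
}


def sample_genotype(race, rng, table=SNP1_FREQ_BY_RACE):
    """Draw a SNP1 genotype using the frequencies for the given race."""
    freqs = table[race]
    genotypes = list(freqs)
    weights = [freqs[g] for g in genotypes]
    return rng.choices(genotypes, weights=weights, k=1)[0]


rng = random.Random(1)
caucasian_draws = [sample_genotype("Caucasian", rng) for _ in range(5000)]
non_caucasian_draws = [sample_genotype("Non-Caucasian", rng) for _ in range(5000)]
```

With these made-up frequencies, the AA genotype should appear noticeably more often among the Caucasian draws than among the Non-Caucasian draws.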

Discussion Questions

We welcome any input and thoughts on the following discussion questions:

Availability and Feedback

Once we have defined the script, it will be available via GitHub. Link to come soon.

We encourage feedback! laderast@ohsu.edu