We want to develop a script that generates realistic synthetic datasets for hands-on learning in BD2K workshops.
In generating the data, we have the following learning objectives:
Our inspiration for this data set is the LOOK-AHEAD study, which integrated both clinical and genetic covariates for predicting low/high cardiovascular risk over 10 years.
The basic idea behind the data generation script is that we use a decision tree to define cardiovascular risk groups.
Our first step is to rank the variables by importance, which determines the level at which each variable appears in the decision tree.
We use the decision tree as a communication tool to define realistic risk groups. For example, Risk Group 1 consists of Caucasians with high hypertension who are over 50 years of age.
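One way to picture a risk group is as the list of conditions on the path from the root of the tree to a leaf. The sketch below encodes Risk Group 1 from the example above; the field names (`race`, `hypertension`, `age`), the boolean coding of hypertension, and the helper `in_risk_group` are assumptions made for illustration, not part of the actual script.

```python
import operator

# Risk Group 1 as described in the text: Caucasian, hypertension, over 50.
# Each rule is (variable, comparison, value); the names are hypothetical.
RISK_GROUP_1 = [
    ("race", operator.eq, "Caucasian"),
    ("hypertension", operator.eq, True),
    ("age", operator.gt, 50),
]


def in_risk_group(patient, rules):
    """Return True if the patient satisfies every rule on the tree path."""
    return all(op(patient[var], value) for var, op, value in rules)


# Two made-up patients that differ only in age.
older = {"race": "Caucasian", "hypertension": True, "age": 62}
younger = {"race": "Caucasian", "hypertension": True, "age": 45}

print(in_risk_group(older, RISK_GROUP_1))
print(in_risk_group(younger, RISK_GROUP_1))
```

Writing the rules as data rather than nested if-statements keeps each risk group readable on its own, which fits the use of the tree as a communication tool.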
Taken together, the risk groups in the decision tree define a complete set of patient categories that we can sample from. The frequencies of these risk groups can then be estimated from actual patient data.
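Estimating the frequencies amounts to counting how many patients fall into each risk group and dividing by the total. The sketch below assumes each patient has already been assigned a risk group label; the labels in `toy_labels` are made up purely to show the calculation and are not real patient data.

```python
from collections import Counter


def estimate_frequencies(labels):
    """Convert a list of risk group labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Made-up labels standing in for risk group assignments of real patients.
toy_labels = ["RG1", "RG2", "RG2", "RG3", "RG2", "RG1", "RG3", "RG2"]
freqs = estimate_frequencies(toy_labels)
print(freqs)
```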
Once we have frequencies for each risk group, we can use these frequencies as a set of probabilities that we sample from to define our patient population.
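A minimal sketch of this sampling step, using only the Python standard library. The group labels and the frequencies in `RISK_GROUP_FREQS` are placeholder values chosen for illustration; in practice they would come from actual patient data.

```python
import random
from collections import Counter

# Placeholder frequencies (assumed values, not estimates from any study).
RISK_GROUP_FREQS = {"RG1": 0.10, "RG2": 0.40, "RG3": 0.30, "RG4": 0.20}


def sample_risk_groups(freqs, n, seed=None):
    """Draw n risk group labels, using the frequencies as probabilities."""
    rng = random.Random(seed)
    groups = list(freqs)
    weights = [freqs[g] for g in groups]
    return rng.choices(groups, weights=weights, k=n)


population = sample_risk_groups(RISK_GROUP_FREQS, n=10000, seed=1)
print(Counter(population))
```

With a large enough sample, the observed proportions should sit close to the input frequencies, which is a quick sanity check on the generated population.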
Each risk group defines limitations on the clinical covariates and genetic covariates. For example, Risk Group 6 has the restrictions of Age < 50, BMI > 25, and Type 2 Diabetes = TRUE, but is unrestricted in the other covariates (for example, Race can be Non-Caucasian/Caucasian, Hypertension = True/False, Smoking = Y/N, SNP1 = AA/GG/AG).
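The following sketch shows how a single patient in Risk Group 6 might be generated: restricted covariates are drawn only from their allowed range, and unrestricted covariates are drawn from all of their values. The function name, the field names, the age range of 18 to 49, the BMI ceiling of 45, and the uniform draws are all assumptions for illustration; a realistic script would use distributions informed by data.

```python
import random


def sample_risk_group_6(rng):
    """Generate one synthetic patient satisfying Risk Group 6's restrictions."""
    return {
        # Restricted covariates: Age < 50, BMI > 25, Type 2 Diabetes = TRUE.
        "age": rng.randint(18, 49),                   # assumed adult range
        "bmi": round(rng.uniform(25.1, 45.0), 1),     # assumed upper bound
        "t2d": True,
        # Unrestricted covariates: any allowed value (uniform draw assumed).
        "race": rng.choice(["Caucasian", "Non-Caucasian"]),
        "hypertension": rng.choice([True, False]),
        "smoking": rng.choice(["Y", "N"]),
        "snp1": rng.choice(["AA", "GG", "AG"]),
    }


rng = random.Random(42)
patients = [sample_risk_group_6(rng) for _ in range(200)]
print(patients[0])
```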
Genetic covariates are handled like the clinical covariates discussed above, but their genotype frequencies depend on race.
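As a rough sketch, a race-dependent genetic covariate can be drawn from a lookup table of genotype frequencies keyed by race. The table `SNP1_FREQS_BY_RACE` and the numbers in it are invented for illustration and do not describe any real SNP.

```python
import random

# Hypothetical genotype frequencies for SNP1, conditional on race.
SNP1_FREQS_BY_RACE = {
    "Caucasian": {"AA": 0.60, "AG": 0.30, "GG": 0.10},
    "Non-Caucasian": {"AA": 0.20, "AG": 0.40, "GG": 0.40},
}


def sample_snp1(race, rng):
    """Draw a SNP1 genotype using the frequencies for the patient's race."""
    freqs = SNP1_FREQS_BY_RACE[race]
    genotypes = list(freqs)
    weights = [freqs[g] for g in genotypes]
    return rng.choices(genotypes, weights=weights, k=1)[0]


rng = random.Random(7)
caucasian_draws = [sample_snp1("Caucasian", rng) for _ in range(5000)]
non_caucasian_draws = [sample_snp1("Non-Caucasian", rng) for _ in range(5000)]
print(caucasian_draws.count("AA") / 5000, non_caucasian_draws.count("AA") / 5000)
```

Drawing race first and the genotype second preserves the association between the two in the generated data.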
We welcome any input and thoughts on the following discussion questions:
Once the script is complete, it will be available on GitHub. Link to come soon.
We encourage feedback! laderast@ohsu.edu