Motivation and Goals

We want to develop a script that generates realistic synthetic datasets for hands-on learning in BD2K workshops.

Learning Objectives

In generating the data, we have the following learning objectives:


Reverse-engineering a Decision Tree

Our inspiration for this dataset is the LOOK-AHEAD study, which integrated both clinical and genetic covariates to predict low or high cardiovascular risk over 10 years.

The basic idea behind the data generation script is to use a decision tree to define cardiovascular risk groups.

Our first step is to define the importance of each variable, which determines the level at which that variable appears in the decision tree: the more important the variable, the closer it sits to the root.
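
As a minimal sketch of this step, the snippet below ranks covariates by importance and assigns each one a tree level. The covariate names and importance scores are hypothetical placeholders chosen for illustration; they are not values from the study or from the actual script.

```python
# Hypothetical importance scores; names and values are illustrative assumptions.
importance = {"hypertension": 0.3, "age": 0.2, "race": 0.5}


def tree_levels(importance):
    """Map each covariate to a tree level, where 1 is the root split."""
    # Sort covariate names from most to least important.
    ranked = sorted(importance, key=importance.get, reverse=True)
    return {name: level for level, name in enumerate(ranked, start=1)}


levels = tree_levels(importance)
print(levels)
```

Under these assumed scores, the most important covariate splits at the root and the remaining ones split at successively deeper levels.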

Defining the decision tree

We use the decision tree as a communication tool to define realistic risk groups. For example, risk group 1 consists of Caucasians with High Hypertension who are over 50 years of age.
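
The risk group 1 example can be sketched as a path through nested conditions. This is only an illustration under stated assumptions: the function name, argument names, and category labels are hypothetical, and group 0 is a placeholder for every path the text does not describe.

```python
def risk_group(race, hypertension, age):
    """Return a risk group by walking a hand-written decision tree.

    Only the path to risk group 1 comes from the text. Every other
    path returns 0, a placeholder meaning "not yet assigned".
    """
    if race == "Caucasian":
        if hypertension == "High":
            if age > 50:
                return 1
    return 0


# Hypothetical patients used only to exercise the function.
patients = [
    {"race": "Caucasian", "hypertension": "High", "age": 62},
    {"race": "Caucasian", "hypertension": "High", "age": 45},
    {"race": "Caucasian", "hypertension": "Low", "age": 70},
]

groups = [risk_group(**p) for p in patients]
print(groups)
```

Writing each risk group as an explicit path keeps the rules readable for workshop participants, who can then try to recover the same splits from the generated data.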