a Institute of Clinical Chemistry and Laboratory Medicine,
b Institute of Arteriosclerosis Research, University of Münster, Germany.
Correspondence: Dr Paul Cullen, Institut für Arterioskleroseforschung an der Universität Münster, Domagkstraße 3, 48149 Münster, Germany. E-mail: cullen@uni-muenster.de
Abstract
Methods We used a multi-layer perceptron (MLP) and probabilistic neural networks (PNN) to estimate the risk of myocardial infarction or acute coronary death (coronary events) during 10 years of follow-up among 5159 men aged 35–65 years at recruitment into PROCAM. In all, 325 coronary events occurred in this group. We assessed the performance of each procedure by measuring the area under the receiver-operating characteristics curve (AUROC).
Results The AUROC of the MLP was greater than that of the PNN (0.897 versus 0.872), and both exceeded the AUROC for logistic regression (LR) of 0.840. If high risk is defined as an event risk >20% in 10 years, LR classified 8.4% of men as high risk, 36.7% of whom suffered an event in 10 years (45.8% of all events). The MLP classified 7.9% as high risk, 64.0% of whom suffered an event (74.5% of all events), while with the PNN, only 3.9% were at high risk, 58.6% of whom suffered an event (33.5% of all events).
Conclusion Intervention trials indicate that about one in three coronary events can be prevented by 5 years of lipid-lowering treatment. Our analysis suggests that use of the MLP to identify high-risk individuals as candidates for drug treatment would allow prevention of 25% of coronary events in middle-aged men, compared to 15% and 11% with LR and the PNN, respectively.
Keywords Coronary heart disease, risk factors, neural networks, logistic regression
Accepted 31 July 2002
Introduction
Logistic regression assumes that a variable is related to risk in a particular continuous fashion. Because the number of coefficients present in logistic regression functions is limited, it is usually not difficult to detect when such a function is producing results that are biologically implausible and to identify the source of that implausibility. By contrast, neural networks are more complex, utilize larger numbers of coefficients, and take into account complex non-linear relationships that exist within the data. Moreover, as recently noted by Dayhoff and DeLeo, in approximating a multifactorial function, neural networks create the functional form and fit the function at the same time, a capability that gives them a decided advantage over traditional statistical multivariate regression techniques.6
As a result, neural networks may produce a model of greater discrimination and thus a more accurate estimation of risk than logistic regression. Several good introductions to the neural network approach have been published.7–9 Our aim here was to use neural networks to calculate the posterior probability of an individual suffering a fatal or non-fatal myocardial infarction if a particular constellation of risk factors is present, and to compare the results to those obtained with logistic regression analysis. Neural networks have been shown to provide good estimates of classical Bayesian probabilities.10,11 As reviewed by Tu, neural networks are also able to implicitly detect complex non-linear relationships between dependent and independent variables and to uncover all possible interactions between predictors, in our case risk factors for CHD.12 For these reasons, neural networks may provide better generalization than conventional regression techniques.12
The PROCAM study is a large prospective epidemiological investigation performed among the working population in Westphalia and the northern Ruhr regions of Germany. We developed a logistic regression model and two kinds of neural network to identify those individuals at risk of CHD in the cohort of middle-aged men with 10 years of follow-up.
Subjects and Methods
Follow-up
Follow-up was by questionnaire every 2 years, with a response rate of 96%. In each case in which evidence of morbidity or mortality was entered in the questionnaire, hospital records and records of the attending physician were obtained and, in the case of deceased study participants, an eyewitness account of death was sought. A Critical Event Committee reviewed these data, together with data from death certificates, in order to verify the diagnosis or the cause of death. A major coronary event was defined as the presence of sudden cardiac death or of a definite fatal or non-fatal myocardial infarction on the basis of ECG and/or cardiac enzyme changes.15
Cohort analysis
In the cohort of 5159 men aged 35–65 years recruited before the end of 1985, a sufficient number of major coronary events occurred within 10 years to allow statistically valid longitudinal analysis.4 Among women and younger men, numbers were insufficient to permit such analysis. In this group of middle-aged men, 325 major coronary events occurred. In 63 men who did not suffer an acute coronary event as defined in this study, CHD was diagnosed by angiography or other means during the period of follow-up. These 63 men were excluded from the analysis presented here. Fourteen suspected coronary deaths were observed, and these individuals were also excluded from the analysis. In all, 218 men died from other causes within the 10-year follow-up period. Non-fatal stroke occurred in 46 men. Thus, 4493 middle-aged men survived 10 years after examination without a major coronary event and without the exclusion factors listed above. For all analyses in this report, these 4493 men were compared to the 325 men (prevalence 6.7%) who suffered a major coronary event. The establishment of the cohort of middle-aged men studied in this report is shown in Table 1.
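For orientation, the cohort arithmetic can be checked directly from the counts quoted in this paragraph (a minimal sketch):

```python
# Cohort flow in middle-aged men; all numbers are taken from the text above.
recruited        = 5159   # men aged 35-65 recruited before the end of 1985
coronary_events  = 325    # major coronary events within 10 years
chd_other_means  = 63     # CHD diagnosed by angiography or other means (excluded)
suspected_deaths = 14     # suspected coronary deaths (excluded)
other_deaths     = 218    # deaths from other causes
nonfatal_stroke  = 46     # non-fatal strokes

event_free = (recruited - coronary_events - chd_other_means
              - suspected_deaths - other_deaths - nonfatal_stroke)
print(event_free)                                          # 4493 event-free survivors
print(round(100 * coronary_events
            / (coronary_events + event_free), 1))          # 6.7 % event prevalence
```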
Neural networks
Two types of neural network were used, a multi-layer perceptron network with one hidden layer (MLP) and a probabilistic neural network (PNN). Both are supervised networks that are used for prediction and classification problems. The networks were designed using Statistica Neural Networks (SNN) software release 4.1 (StatSoft Inc., Tulsa, Oklahoma, USA) running under Microsoft Windows 2000®. The architecture of these networks is shown in Figures 1 and 2, respectively, and is described in detail in the Appendix. In order to select the variables for use with the neural networks, we first used forward and backward stepwise feature selection and a genetic input selection algorithm,18 all of which are part of the SNN software, to identify the variables that contributed significantly to the performance of the networks and to remove those that did not. Algorithms for forward and backward selection work by adding or removing variables one at a time. Genetic algorithms, by contrast, generate optimal sets of input variables by constructing binary masks using artificial mutation, crossover and selection operators. Each mask is used to construct a new training set for testing with probabilistic neural networks. An increase in the network error indicates irrelevant input variables.18
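As an illustration of the mask-based genetic selection step, the toy sketch below evolves binary input masks and scores each mask with a simple nearest-class-mean classifier. It is a stand-in for the SNN routines; the synthetic data, the scoring classifier and the genetic-algorithm settings are all assumptions made for the example.

```python
# Toy sketch of genetic input selection over binary masks (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 8 candidate risk factors, only the first 3 carry signal.
n, p = 400, 8
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + rng.normal(scale=0.5, size=n)) > 0

def mask_error(mask):
    """Score a binary input mask by the error of a nearest-class-mean classifier."""
    if mask.sum() == 0:
        return 1.0
    Xm = X[:, mask.astype(bool)]
    mu1, mu0 = Xm[y].mean(axis=0), Xm[~y].mean(axis=0)
    pred = np.linalg.norm(Xm - mu1, axis=1) < np.linalg.norm(Xm - mu0, axis=1)
    return np.mean(pred != y)

pop = rng.integers(0, 2, size=(20, p))                # initial population of masks
for generation in range(30):
    scores = np.array([mask_error(m) for m in pop])
    parents = pop[np.argsort(scores)[:10]]            # selection: keep the best half
    children = parents[rng.integers(0, 10, size=10)].copy()
    for i, cut in enumerate(rng.integers(1, p, size=10)):
        children[i, cut:] = parents[rng.integers(0, 10), cut:]   # one-point crossover
    flip = rng.random(children.shape) < 0.05          # mutation
    children[flip] = 1 - children[flip]
    pop = np.vstack([parents, children])

best = pop[np.argmin([mask_error(m) for m in pop])]
print("selected inputs:", np.flatnonzero(best))       # ideally indices 0, 1 and 2
```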
Use of validation sets to adjust performance of logistic regression and the neural networks
In order to derive the logistic regression equations and the neural networks in this study, we divided our data into five equal and distinct sets. Four of these five sets were then combined and used for training. The remaining fifth was used for testing the performance of the logistic regression and neural network models on unknown data. This cross-validation procedure was repeated for every possible 4 + 1 combination. As training of an MLP is an iterative adaptive task, each of the 4/5 datasets described above was randomly partitioned into an internal training set comprising half of the cases, a verification set comprising a quarter of the cases, and a validation set comprising the remaining quarter of the cases. The MLP was then trained in iterative fashion through a number of epochs. In each epoch, the entire training set was presented to the network, case by case. Errors were calculated and used to adjust the weights in the network, whose performance was then measured against the verification set. We used this cross-verification as an early stopping condition to prevent over-fitting, a phenomenon in which the training error falls but the verification error rises because the network learns to produce the desired outputs for the training set but fails to generalize to new data. The validation set was held aside to check the performance of the trained network on unseen data.
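The partitioning and early-stopping logic can be illustrated with the following sketch. The miniature gradient-descent MLP (13 inputs, 4 hidden units, sigmoid output) is a stand-in for the Statistica network; the synthetic data, learning rate and epoch count are assumptions for the example.

```python
# Sketch of the 5-fold split, the internal 50/25/25 partition and early stopping.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 13))                      # 13 pre-scaled risk factors
y = (X @ rng.normal(size=13) + rng.normal(scale=0.5, size=1000) > 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

folds = np.array_split(rng.permutation(len(X)), 5)   # five equal, distinct sets
for k in range(5):                                   # every possible 4 + 1 combination
    test = folds[k]
    rest = np.concatenate([folds[i] for i in range(5) if i != k])
    n = len(rest)                                    # internal 50/25/25 partition
    train, verify, valid = rest[:n // 2], rest[n // 2:3 * n // 4], rest[3 * n // 4:]

    W1 = rng.normal(scale=0.1, size=(13, 4))         # one hidden layer with 4 units
    w2 = rng.normal(scale=0.1, size=4)
    best_err, best = np.inf, (W1.copy(), w2.copy())
    for epoch in range(200):
        h = sigmoid(X[train] @ W1)                   # present the training set
        out = sigmoid(h @ w2)
        g = out - y[train]                           # cross-entropy gradient at output
        w2 -= 0.1 * h.T @ g / len(train)
        W1 -= 0.1 * X[train].T @ (np.outer(g, w2) * h * (1 - h)) / len(train)

        p_ver = sigmoid(sigmoid(X[verify] @ W1) @ w2)
        err = np.mean((p_ver - y[verify]) ** 2)
        if err < best_err:                           # early stopping: keep the weights
            best_err, best = err, (W1.copy(), w2.copy())   # at minimum verification error

    W1, w2 = best                                    # check on held-out data
    for name, subset in (("validation", valid), ("test", test)):
        p = sigmoid(sigmoid(X[subset] @ W1) @ w2)
        print(f"fold {k} {name} RMS error: {np.sqrt(np.mean((p - y[subset]) ** 2)):.3f}")
```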
The PNN, by contrast, was trained in a single epoch (Appendix). The 4/5 training sets were passed in toto through the network and a set of weights was calculated for each class of outcome (coronary event, no coronary event). The performance of each PNN was optimized by varying the smoothing factor (i.e. the radial deviation of the Gaussian kernel functions) in the range from 0.1 to 0.3.
Neural networks and over-fitting
A major problem of neural networks is their tendency to over-fit the data, i.e. to generate networks that are too closely adapted to the training data (the so-called bias-variance problem [Appendix]).19
When applied to unseen data, such over-fitted networks often produce biologically implausible results. One way to deal with over-fitting is to partition the data as described above and to stop training the network at the point at which the error in the verification set is at a minimum (early stopping procedure).12 We tried this approach but found that errors still occurred when the networks were presented with unseen data. To address this problem, we generated datasets in which all variables but one were held constant at their average value in our population and the remaining variable was varied throughout the entire biologically plausible range. For example, risk profiles were generated in which all variables but systolic blood pressure were held constant at their average value and outcomes were computed for systolic blood pressures varying in steps of 1 mmHg from 70 mmHg to 250 mmHg. These simulations showed that MLP networks containing more than four nodes in the hidden layer and probabilistic networks using smoothing factors lower than 0.12 were liable to produce implausible results when presented with unseen constellations of risk factors. For these reasons, an MLP with four nodes in one hidden layer and a PNN with a smoothing factor of 0.14 were used.
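A sketch of this kind of plausibility sweep is given below. The two-variable stand-in population, the toy risk model and the monotonicity criterion are assumptions made for illustration, not the checks actually applied to the PROCAM networks.

```python
# Hold every input at its population mean and sweep one variable (here systolic
# blood pressure) over its biological range, then flag an implausible response.
import numpy as np

def sweep_risk(model, X_population, var_index, lo, hi, step=1.0):
    """Predicted risk as one variable sweeps [lo, hi] while the others stay at their mean."""
    grid = np.arange(lo, hi + step, step)
    profiles = np.tile(X_population.mean(axis=0), (len(grid), 1))
    profiles[:, var_index] = grid                    # vary only this risk factor
    return grid, model(profiles)

# Stand-in population (systolic BP, log triglycerides) and a toy logistic risk model.
rng = np.random.default_rng(2)
X_pop = rng.normal(loc=[130.0, 5.0], scale=[18.0, 0.5], size=(500, 2))
toy_model = lambda X: 1.0 / (1.0 + np.exp(-(0.03 * X[:, 0] - 6.0)))

sbp, risk = sweep_risk(toy_model, X_pop, var_index=0, lo=70, hi=250, step=1.0)
implausible = bool(np.any(np.diff(risk) < 0))        # risk should not fall as BP rises
print("implausible response to systolic blood pressure:", implausible)
```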
Statistics
The performances of logistic regression, the MLP and the PNN were compared using receiver-operating characteristics (ROC) analysis.20 In a ROC analysis, the sensitivity (true positive fraction of class 2 [presence of a major coronary event]) is plotted against 1 minus the specificity (false positive fraction of class 1 [absence of a major coronary event]) for each possible decision threshold. Performance in this case refers to the ability of the various procedures to discriminate between men who had developed a coronary event and men who had not. We compared the areas under the ROC curves according to the approach of Hanley and McNeil.21 Statistical analyses were performed using the Statistical Package for the Social Sciences (SPSS-X).22
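For reference, the area under a ROC curve can be computed by sweeping the decision threshold and applying the trapezoidal rule, as in this sketch on synthetic scores (ties between scores are ignored for simplicity). The Hanley and McNeil comparison of correlated curves additionally requires the correlation between the paired risk estimates and is not reproduced here.

```python
# Minimal AUROC computation by threshold sweep and trapezoidal integration.
import numpy as np

def auroc(risk, outcome):
    """risk: predicted probabilities; outcome: 1 = coronary event, 0 = no event."""
    order = np.argsort(-np.asarray(risk))            # sweep thresholds from high to low
    outcome = np.asarray(outcome)[order]
    tpr = np.r_[0.0, np.cumsum(outcome) / outcome.sum()]            # sensitivity
    fpr = np.r_[0.0, np.cumsum(1 - outcome) / (1 - outcome).sum()]  # 1 - specificity
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))   # trapezoidal rule

# Synthetic example: cases score higher than non-cases on average.
rng = np.random.default_rng(3)
scores = np.r_[rng.normal(1.0, 1.0, 325), rng.normal(0.0, 1.0, 4493)]
labels = np.r_[np.ones(325), np.zeros(4493)]
print(f"AUROC = {auroc(scores, labels):.3f}")        # about 0.76 for this toy example
```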
Results
Logistic regression analysis
probability of a major coronary event in 10 years = 1 / (1 + exp[−(β0 + Σi βixi)])
where the variables xi and the coefficients βi were as follows:
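A direct evaluation of this logistic model can be sketched as follows; the intercept, coefficients and risk-factor values below are hypothetical placeholders, not the fitted PROCAM estimates.

```python
# Evaluating p = 1 / (1 + exp(-(beta0 + sum(beta_i * x_i)))) for one individual.
import math

def event_probability(x, beta, beta0):
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

beta0 = -8.0                      # hypothetical intercept, not the PROCAM estimate
beta = [0.05, 0.02, 0.6]          # hypothetical coefficients for three risk factors
x = [55, 160, 1]                  # e.g. age (years), systolic BP (mmHg), smoker (0/1)
print(f"10-year risk of a major coronary event: {event_probability(x, beta, beta0):.1%}")
```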
Receiver-operating characteristics curve analysis of neural networks and logistic regression analysis
Analysis of ROC curves showed the superior performance of neural network analysis over logistic regression in predicting coronary events among middle-aged men in PROCAM (Figure 3, Table 3
). The area under the ROC curve of the MLP was greater than that of the PNN (0.897 versus 0.872, respectively), both of which values exceeded the area under the ROC curve for logistic regression of 0.840. As can be seen from Table 3
, the 95% CI of the three ROC curves showed no overlap, indicating that the differences between the ROC curves were unlikely to be due to chance. In prevention of CHD, the value of a predictive test lies in its ability to accurately identify those people who require intervention while excluding those who do not. For this reason, the left-hand sections of the ROC curves, at which there is a low false positive rate, are of particular relevance in our context (Figure 3
). Thus, even though the areas under the ROC curves were similar for both networks and only moderately greater than the area under the ROC curve for logistic regression, in this region the performance of the MLP greatly exceeded that of the other two models. The ROC curve of the MLP rose much more steeply than that of the PNN or of logistic regression, indicating that this network achieved a true positive rate of greater than 75% at a false positive rate of less than 5%.
Discussion
As noted in the introduction, neural networks may produce a model of greater discrimination and thus a more accurate estimation of risk than logistic regression. However, as described by Tu,12 neural networks may cause over-fitting, resulting in implausible results when the network is presented with unseen data. In this study, we used three methods to deal with this problem. First, we divided the data into five distinct sets and used cross-validation to improve the performance of logistic regression and both types of neural network. Second, we stopped training the MLP neural network when the errors in the verification sets were at a minimum (early stopping procedure). Third, we presented the networks with large sets of synthetic data that spanned in small steps the biologically meaningful range. We then progressively modified the networks until biologically implausible results were no longer obtained with these synthetic datasets. This reduced predictive power but produced networks that were robust predictors when presented with unseen data.
Because there is no a priori way of knowing which type of network will perform best in a given clinical situation,12 we tested two of the most commonly used network types on the PROCAM dataset. In our data and under the constraint of biological plausibility, the multilayer perceptron was clearly superior in identifying individuals at high risk of developing a coronary event.
In preventive medicine, the value of a test lies in its ability to identify those individuals who are at high risk of an illness and who therefore require intervention while excluding those who do not require such intervention. In our case, we took high to mean a risk of developing an acute coronary event that exceeds 20% in 10 years. Accuracy of risk classification is of particular relevance in the case of coronary artery disease. Because of the high prevalence of this condition, inaccurate risk prediction will lead to over-treatment of a large number of people and under-treatment of many more. Moreover, it has become clear in recent years that the risk of an acute coronary event in people who have not yet suffered a myocardial infarction but have a number of risk factors (the so-called pre-symptomatic patient) may equal or even exceed that of people who have already suffered a myocardial infarction. Thus the distinction between primary and secondary prevention has become blurred.
In recent years, several large scale intervention studies have shown that, in people at high risk of CHD, treatment of risk factors with diet and drugs may be expected to prevent about a third of all coronary events.25–28 It is instructive to apply this experience to our data. Thus, using logistic regression, 843 of every 10 000 middle-aged men are classified as high risk, 309 (incidence 36.7%, Table 3) of whom will suffer a myocardial infarction within 10 years. Treatment of all 843 high-risk men may thus be expected to prevent 103 events (one-third of 309), or to put it another way, 8.2 men need to be treated to prevent one event (843 divided by 103 equals 8.2). With the probabilistic neural network, only 386 men are classified as high risk, 226 (58.6%) of whom will suffer a myocardial infarction. Treatment may be expected to prevent 75 events (one-third of 226), and 5.1 men need to be treated to prevent one event. With the multilayer perceptron, 785 men are classified at high risk, 502 (64.0%) of whom will suffer myocardial infarctions, 167 of which can be prevented by treatment. Thus 4.7 men need to be treated to prevent one event. Since in the overall group of 10 000 men, 675 events may be expected to occur in 10 years (overall event prevalence in PROCAM), identification of high-risk individuals by means of the multilayer perceptron may allow us to prevent no fewer than 25% of all myocardial infarctions among middle-aged men (167 x 100/675 = 25). By contrast, logistic regression and the probabilistic network would allow prevention of only 15% and 11% of myocardial infarctions, respectively.
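The arithmetic in the preceding paragraph can be reproduced directly; this sketch assumes, as above, that treatment prevents one-third of events among those treated (figures are per 10 000 middle-aged men).

```python
# Events prevented and numbers needed to treat for each risk-classification method.
methods = {  # per 10,000 men: (number classified high risk, 10-year event rate)
    "logistic regression":   (843, 0.367),
    "probabilistic network": (386, 0.586),
    "multilayer perceptron": (785, 0.640),
}
expected_events = 675    # expected events per 10,000 men in 10 years (6.7% prevalence)

for name, (treated, event_rate) in methods.items():
    prevented = treated * event_rate / 3             # one-third of events in treated men
    print(f"{name}: {prevented:.0f} events prevented "
          f"({100 * prevented / expected_events:.0f}% of all events), "
          f"number needed to treat = {treated / prevented:.1f}")
```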
In medicine, neural networks have seen increasing use, with about 60 new papers on the topic appearing in the medical literature every year.6 To our knowledge, however, this is the first report where neural networks have been applied to calculating risk of a coronary event in a prospective epidemiological study. Moreover, as far as we can ascertain, none of the published studies directly addressed the issues of biological plausibility and over-fitting using the methods described in the present report. We therefore believe that the approach we have taken may be of use not only in improving risk assessment in CHD but also in other areas where outcomes are determined by a large number of variables that interact in a complex non-linear fashion.
Appendix
The multi-layer perceptron (MLP)
The final MLP we used in this paper was a three-layer network (Figure 1) with 13 input units (nodes) corresponding to our 13 risk factors (x1, ..., x13), four hidden units, and one output unit modelling the dichotomous risk outcome.1
Pre-scaling of input values for the MLP and the probabilistic network
The input values (except the values from nominal variables) are pre-scaled to a range between 0 and 1 using
xi′ = (xi − min(xi)) / (max(xi) − min(xi))    (1)
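Assuming Equation 1 is the usual min–max transformation implied by the 0 to 1 range, it can be applied as in this small sketch:

```python
# Min-max pre-scaling of one continuous risk factor onto [0, 1] (Equation 1).
import numpy as np

def prescale(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

print(prescale([70.0, 130.0, 250.0]))    # -> [0.0, ~0.33, 1.0]
```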
Optimizing MLP-weights, cross-entropy error and sigmoid activation
Typically, the weights in an MLP are adjusted using least squares fitting together with a suitable back propagation algorithm2,3 to minimize a root mean square (RMS) error function. In order to interpret the network outputs as probabilities and to make them comparable to the results of the logistic regression, however, we used a cross-entropy error function to adjust the weights (Table 1) and to minimize the network fit criterion. This cross-entropy function can be derived from the likelihood of the underlying Bernoulli distribution of the entire training set, where the cross-entropy error is

E = −Σn [ tn ln(yn) + (1 − tn) ln(1 − yn) ]    (2)

with tn the observed outcome of case n and yn the corresponding network output, obtained by applying the logistic (sigmoid) activation function to the output-unit activation an:

yn = 1 / (1 + exp(−an))    (3)
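A minimal numerical illustration of Equations 2 and 3 (a sketch, not the SNN internals):

```python
import numpy as np

def sigmoid(a):
    """Logistic (sigmoid) activation of the output unit (Equation 3)."""
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(y, t):
    """Cross-entropy error over the training set (Equation 2)."""
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

a = np.array([-2.0, 0.5, 3.0])          # output-unit activations for three cases
t = np.array([0.0, 1.0, 1.0])           # observed outcomes (1 = coronary event)
y = sigmoid(a)
print(y.round(3), round(float(cross_entropy(y, t)), 3))
```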
Probabilistic Neural Networks (PNN)
We used a probabilistic network5 with four layers: an input layer, a radial or pattern layer, a summation layer and an output layer (Figure 2). The radial units are copied directly from the normalized training data, one per case. Each models a Gaussian function, exp[−(x − w)²/(2σ²)], centered at its training case. The term x − w denotes the difference between the input vector and the weight matrices for class 1 and class 2, and the term σ is the radial deviation of the Gaussian kernel function and is used as an adjustable smoothing factor. There was one summation unit for each class (c1 and c2). Each is connected to all radial units belonging to its own class. The outputs were each proportional to the kernel-based estimates of the probability density function of the classes c1 and c2. The normalized outputs are estimates of the underlying class probabilities.
Preprocessing
The learning sets were randomly partitioned into a training and verification or test set (70/30%) and normalized to unity using Equation 1.
Training
Training was accomplished by adding sets of risk variables from individual study participants (pattern units) with the appropriate weight vectors in place.
Classification and post-processing
Summing the outputs of all the pattern units belonging to a single class (e.g. class 2) enabled us to estimate the probability density function of that class, because this sum equals (2π)^(d/2) σ^d nc2 p(x|c2) evaluated at the input point x, where nc2 is the number of cases in class 2 and d is the dimension of the input space. Using Bayes' theorem and the prior probabilities of classes 1 and 2, we were able to compute the probability of class membership for each individual.
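The classification step can be sketched as follows: class-conditional densities are estimated with Gaussian kernels centred on the stored patterns and combined with the class priors via Bayes' theorem. The data, smoothing factor and prior in this sketch are assumptions made for illustration.

```python
# Sketch of probabilistic-network classification with Gaussian kernel densities.
import numpy as np

def class_density(x, patterns, sigma):
    """Average Gaussian kernel centred on each stored training pattern."""
    d = patterns.shape[1]
    sq = np.sum((patterns - x) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * sigma ** 2))) / ((2 * np.pi) ** (d / 2) * sigma ** d)

def pnn_posterior(x, patterns_event, patterns_no_event, prior_event, sigma=0.14):
    p1 = class_density(x, patterns_event, sigma) * prior_event
    p0 = class_density(x, patterns_no_event, sigma) * (1 - prior_event)
    return p1 / (p1 + p0)                            # posterior probability of an event

rng = np.random.default_rng(4)
events    = rng.normal(0.7, 0.15, size=(30, 2))      # stored, pre-scaled patterns (class 2)
no_events = rng.normal(0.4, 0.15, size=(400, 2))     # class 1
x_new = np.array([0.65, 0.70])
print(f"posterior event probability: {pnn_posterior(x_new, events, no_events, 0.067):.2f}")
```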
Bias, variance, ANLL and over-fitting
The aim in epidemiology is not to fit the training data to any specified accuracy, but to predict the probability of a disease outcome for individual cases with as few false positive and false negative results as possible. Any model used, such as a neural network, should therefore be flexible enough to learn the topography described by the training data. However, the network should not contain too many nodes or hidden layers, and should not have smoothing parameters that are too small, because this may cause the model to fit even the noise that is inherent to every dataset. This phenomenon is known as over-fitting. An over-fitted network fails to generalize and gives poor results when predicting the probability of disease in previously unseen individuals. Bias,6 variance and the average negative log likelihood (ANLL) are statistical parameters that give more insight into this problem and can be computed both from the network outputs and from the outputs of the logistic regression.
Table 2 shows the root-mean-square (RMS) errors, the ANLL, the bias values, and the mean-square errors that we derived for the logistic regression model, the probabilistic neural network and the multilayer perceptron in our test and training datasets.
KEY MESSAGES
Acknowledgments
References
2 Wasson JH, Sox HC, Neff RK et al. Clinical prediction rules: applications and methodological standards. New Engl J Med 1985;313:793–99.
3 Anderson KM, Wilson PWF, Odell PM et al. An updated coronary risk profile: a statement for health professionals. Circulation 1991;83:356–62.
4 Assmann G, Schulte H, von Eckardstein A. Hypertriglyceridemia and elevated levels of lipoprotein (a) are risk factors for major coronary events in middle-aged men. Am J Cardiol 1996;77:1179–84.
5 Califf RM, Armstrong PW, Carver JR et al. Task force 5: stratification of patients into high, medium and low-risk subgroups for purposes of risk factor management. J Am Coll Cardiol 1996;27:1007–19.
6 Dayhoff JE, DeLeo JM. Artificial neural networks. Cancer 2001;91:1615–35.
7 Guerriere MJ, Detsky AS. Neural networks: what are they? Ann Intern Med 1991;115:906–07.
8 Hinton GE. How neural networks learn from experience. Scientific American 1992;267:145–51.
9 Bishop CM. Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
10 Richard MD, Lippmann RP. Neural network classifiers estimate Bayesian posterior probabilities. Neural Computation 1991;3:461–83.
11 Funahashi T. Multilayer neural networks and Bayes decision theory. Neural Networks 1998;11:209–13.
12 Tu JV. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J Clin Epidemiol 1996;49:1225–31.
13 Assmann G, Cullen P, Schulte H. The Münster Heart Study (PROCAM): results of follow-up at 8 years. Eur Heart J 1998;19:A2–A11.
14 Cullen P, Schulte H, Assmann G. The Münster Heart Study (PROCAM). Total mortality in middle-aged men is increased at low total and LDL cholesterol concentrations in smokers but not in nonsmokers. Circulation 1997;96:2128–36.
15 Assmann G, Schulte H. Relation of high density lipoprotein cholesterol and triglycerides to incidence of atherosclerotic coronary artery disease (the PROCAM experience). Am J Cardiol 1992;70:733–37.
16 Assmann G, Schulte H. Results and conclusions of the Prospective Cardiovascular Münster (PROCAM) Study. In: Assmann G (ed.). Lipid Metabolism Disorders and Coronary Heart Disease. München: MMV Medizin Verlag, 1993, pp. 19–68.
17 Hosmer DW, Lemeshow S. Applied Logistic Regression. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons, 1989, pp. 82–91.
18 Goldberg DE. Genetic Algorithms. Reading, MA: Addison Wesley, 1989.
19 Bienenstock E, Doursat R. Neural networks and the bias/variance dilemma. Neural Computation 1992;4:1–58.
20 Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561–77.
21 Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristics curves derived from the same cases. Radiology 1983;148:839–43.
22 Nie NH. SPSS-X User's Guide. New York: McGraw-Hill, 1983.
23 Wood D, De Backer G, Faergemann O et al. Prevention of coronary heart disease in clinical practice. Recommendations of the second joint Task Force of European and other Societies on Coronary Prevention. Eur Heart J 1998;19:1434–503.
24 Anderson KM, Wilson PWF, Odell PM et al. An updated coronary risk profile: a statement for health professionals. Circulation 1991;83:356–62.
25 Scandinavian Simvastatin Survival Study Group. Randomised trial of cholesterol lowering in 4444 patients with coronary heart disease: the Scandinavian Simvastatin Survival Study (4S). Lancet 1994;344:1383–89.
26 Shepherd J, Cobbe SM, Ford I et al. Prevention of coronary heart disease with pravastatin in men with hypercholesterolemia. New Engl J Med 1995;333:1301–07.
27 Sacks FM, Pfeffer MA, Moye LA et al. The effect of pravastatin on coronary events after myocardial infarction in patients with average cholesterol levels. New Engl J Med 1996;335:1001–09.
28 Downs JR, Clearfield M, Weis S et al. Primary prevention of acute coronary events with lovastatin in men and women with average cholesterol levels: results of AFCAPS/TexCAPS. JAMA 1998;279:1615–22.
References for Appendix
2 Patterson D. Artificial Neural Networks. Singapore: Prentice Hall, 1996.
3 Bishop CM. Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
4 Kullback S, Leibler RA. On information and sufficiency. Annals of Mathematical Statistics 1951;22:79–86.
5 Specht DF. Probabilistic neural networks. Neural Networks 1990;3:109–18.
6 Geman S, Bienenstock E, Doursat R. Neural networks and the bias/variance dilemma. Neural Computation 1992;4:1–58.