Diagnosis: Toxic! – Trying to Apply Approaches of Clinical Diagnostics and Prevalence in Toxicology Considerations

Sebastian Hoffmann and Thomas Hartung1

European Commission, JRC–Joint Research Centre, Institute for Health & Consumer Protection, ECVAM–European Centre for the Validation of Alternative Methods, 21020 Ispra (VA), Italy

1 To whom correspondence should be addressed at European Commission, JRC–Joint Research Centre, Institute for Health and Consumer Protection, ECVAM–European Centre for the Validation of Alternative Methods. Via E. Fermi 1, TP 580, 21020 Ispra (VA), Italy. Tel: +39 0332 785939. Fax: +39 0332 786297. E-mail: thomas.hartung@cec.eu.int.

Received November 30, 2004; accepted January 28, 2005


    ABSTRACT
The assessment of relevance of toxicological testing was compared with approaches of diagnostic medicine, a discipline that faces a comparable situation. Considering the work of a toxicologist as setting a diagnosis for compounds, assessment tools for diagnostic tests were transferred to toxicological tests. In clinical diagnostics, test uncertainty is well accepted and incorporated in this assessment. Furthermore, prevalence information is considered to evaluate the gain in information resulting from the application of a test. Several common toxicological scenarios in which test uncertainty and prevalence are combined are discussed, including the interdependence of test accuracy, prevalence, and predictive values, as well as the sequential application of a screening and a confirmatory test. In addition, real prevalences derived from prevalences determined by an imperfect test are presented. We conclude that information on prevalences of toxic health effects is required to allow a complete assessment of the relevance of toxicological tests. In this process, lessons can be learned from evidence-based approaches in clinical diagnostics.

Key Words: prevalence; validation; evidence-based medicine; biometry; reference standard; risk assessment.


    THOUGHT STARTER
Let's assume you are a doctor asked to carry out an HIV test on a healthy European who has no specific risk factors. You choose the best test available, which is 99.9% accurate. Unfortunately, the result is positive. Bad news for your patient? Not yet: the prevalence of HIV infection—i.e., the proportion of infected people in the general population—in Europe is about 1:10,000 inhabitants. This means that if you test 10,000 people, you will pick up one real positive, but your best available test will show 10 false-positives. A positive result will thus be correct in only one out of 11 cases; i.e., the probability that your patient really is HIV infected is about 9%. Similar reasoning can be found in Gigerenzer et al. (1998), who pointed out the need to communicate any such diagnosis carefully to patients.
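The arithmetic of this thought starter can be sketched with Bayes' rule. The snippet below is an illustration only; it assumes, as the example implies, that sensitivity and specificity both equal the stated 99.9% accuracy, which is not a claim about any real assay.

```python
# Positive predictive value (PPV) of a 99.9% "accurate" HIV test at a
# prevalence of 1 in 10,000, assuming sensitivity = specificity = 0.999.

def positive_predictive_value(prevalence, sensitivity, specificity):
    """Bayes' rule: fraction of positive results that are true positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(prevalence=1 / 10_000,
                                sensitivity=0.999,
                                specificity=0.999)
print(f"P(infected | positive) = {ppv:.1%}")  # roughly 9%
```

The low posterior probability arises because, at this prevalence, false-positives (about 10 per 10,000 tested) outnumber the single true positive.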

What does this teach us as toxicologists? Our problem is that, in most cases, we do not even know how accurate our test methods are (certainly less than 99.9%), and we have no indication of the prevalence—i.e., the proportion of chemicals toxic for a given health effect—in specific populations of chemicals.

This thought prompted us to elaborate on what diagnostic medicine can teach toxicology about handling the uncertainty in our test methods when setting the diagnosis that a substance exerts a given toxic effect.


    THE ACCURACY OF THE DIAGNOSIS—TRANSLATION TO TOXICOLOGY
Setting a diagnosis in a medical clinic is an art that involves three aspects: the patient, the physician, and the diagnostic measures (Haynes et al., 1996, 2002). Difficulties in setting a diagnosis arise from incompatibilities and limitations of each component. When identifying a toxic hazard of a chemical, similar to the role of the physician, the expert assessor has to overcome the many limitations of what is known of the nature of the phenomenon. In both cases a number of problems have to be considered (Table 1).


Table 1 Problems of Setting a Diagnosis in Clinical Medicine and Toxicology

 
It is impossible to judge the relative contribution of the many variables. The conclusion is simple: Our tools to assign a toxic health effect are imperfect. This is well accepted in the field of clinical diagnostics (Boyko et al., 1988; Sackett et al., 1991; Hunink et al., 2001; Knottnerus et al., 2002). However, in toxicology, we are not used to estimating and incorporating uncertainty, but we base our conclusions (i.e., labeling/classification as well as use or non-use of the substance for certain purposes) on this imperfect assessment.

It must be pointed out that this review omits any discussion of the relationship of a toxicity test to any adverse human health effect. The analysis is based only on the interplay of test quality and prevalence for a given test. This is principally applicable to any test, be it diagnostic in humans, in animals, or in vitro.


    THE QUALITY OF OUR "DIAGNOSTIC" TOOLS
In the field of carcinogenicity testing, the rodent bioassay's 50% positive rate triggered a detailed discussion of the restrictions and limitations of its predictive capacity for humans (Ames and Gold, 1990; Gold et al., 1998). However, few toxicity tests have been studied with this scrutiny. The area of validation of alternative methods has pioneered the assessment of the quality of methods employed in toxicology (Balls et al., 1990, 1995). The crucial achievement here was the concept of relevance, i.e., assessing not only the reliability/reproducibility but also the predictive capacity of a method. This implies, however, a point of reference (usually termed the "reference standard"). In clinical diagnostics this reference is often included in systematic studies (Walter et al., 1999; Knottnerus and Muris, 2003) and new assessment tools have even been developed for study evaluation (Whiting et al., 2003). In toxicology this optimal way of direct comparison most often is not applied for reasons of cost or animal welfare. If the reference is evaluated at all, a retrospective comparison (e.g., based on information from databases) is carried out (Fentem et al., 1998). In both disciplines, this reference standard is usually but not necessarily another test. In toxicology a consensus of experts on the toxicological properties or classification of a given substance could substitute for the standard. Likewise in clinical diagnostics: sometimes the reference standard for assessing the performance of a diagnostic measure is established by consensus, when an independent expert panel establishes a patient's diagnosis from established clinical criteria (Weller and Mann, 1997; Knottnerus and Muris, 2003).

It is of utmost importance to understand that validation assesses the reliability and relevance of methods. Figure 1 illustrates the validation process and which type of information constitutes the validity of a test method. Because the primary test result is often not expressed in the same way as the reference test, a prediction model is required to translate one into the other (Worth and Balls, 2001). For example, results of the test method might be continuous but must be classified into positive/negative or negative/mild/moderate/severe employing thresholds. In the case of alternatives to animal experiments, the prediction model would translate from the in vitro result, e.g., an IC50 value, to an in vivo end point, e.g., the LD50.



FIG. 1. The process of validation. The graphic presentation highlights the three aspects of test validity. The test to be validated consists of the test system and its analysis procedure. Via a prediction model, the results have to be converted into the results of the reference standard.

 
The quality and the adjustment of the prediction model are key determinants of the predictive capacity of a test. In clinical diagnostics, designs and sample sizes often allow threshold determination from the study data themselves (Sackett and Haynes, 2002). In contrast, the prediction model in toxicology is normally developed before the validation study, using a training set of substances (Bruner, 1996). The quality and properties of this set pre-determine the quality of results and the applicability of the test method. Similarly, the selection of patients to establish a diagnostic method determines its quality for its intended use, the so-called patient spectrum (Irwig et al., 2002). If the selection is not representative or is somehow flawed—e.g., if it includes only severe cases—the relevance of the test will be impaired or restricted. Consequently, it is instrumental that validation studies in toxicology include a sufficient number of weak toxicants. Optimally, although a dichotomous (i.e., positive vs. negative) test outcome is often chosen, the selection should cover the whole range of toxic potency. This allows a better test assessment by expressing probabilities (e.g., of being positive or negative) for each chemical: A highly toxic compound will more likely be classified as such than a moderately toxic compound. Another difference in prediction model development/threshold definition in clinical diagnostics is that the sample sizes are usually substantially larger. This eases biometrical assessment, but several advantages of setting a diagnosis of toxicity can compensate here:
Testing of substances can be synchronized.
Testing can be repeated.
Positive and negative controls are readily available.
Replication and related substance testing are feasible.
The number of toxic health effects is limited.


    THE IMPACT OF PREVALENCE
As demonstrated in our thought starter, the prevalence of a disease is a key determinant of the practical value of a diagnostic measure (Buck and Gart, 1966; Linnet, 1988; Grimes and Schulz, 2002). If you are looking for something rare, even the best test will produce too many false-positives to provide a reliable result. It is therefore crucial to use descriptors of test relevance that take the prevalence into account. In the most simple cases (two outcomes of the test as well as of the reference standard, which will mainly be considered here for reasons of simplicity), this means describing the relevance by the PPV (positive predictive value) and NPV (negative predictive value) instead of the sensitivity (i.e., the probability of a correct positive result) and the specificity (i.e., the probability of a correct negative result). It would be worthwhile but also demanding to expand this concept of including prevalence information to multiple-class outcomes, as we have recently demonstrated for the case of skin irritation (Hoffmann et al., 2005). The predictive values estimate the proportion of correct positive/negative test outcomes among all positives/negatives and are thus an indication of the reliability of a positive/negative test result. In toxicology, however, we often forget that our panel of test compounds does not reflect the real world but was designed to efficiently produce reliable estimates of sensitivity and specificity. For example, we choose 20 negatives and 20 positives, regardless of the toxic effect considered. Thus, the predictive values based on the artificial study prevalence only tell us the predictive capacity of the test if the same distribution of positives and negatives is found in the real world. This is usually not the case. Unfortunately, for most toxic health effects, we have no idea about their actual prevalence, for example in chemicals of general use.
Not only are we lacking complete information on basic toxic properties for a large number of high production volume chemicals on the market (EPA, 1998; Allanou et al., 1999), but even for the existing data sets no such analysis is available. Therefore, efforts should be made to retrieve reliable estimates of those prevalences for the most relevant areas of toxicology.

Taking the example of skin irritation, this problem of estimating prevalences was approached (Hoffmann et al., 2005). Although in that report a detailed distribution of skin irritating potential was presented and analyzed, here we restrict ourselves to the prevalence analysis of the dichotomous outcome, i.e., irritant vs. non-irritant. In the New Chemicals Database of the European Chemical Bureau, which includes 3121 chemicals mainly registered in the last 15 years, the prevalence of skin irritating substances (according to EU regulation) assessed by an animal experiment was 7.9%. The applicability domain with this prevalence would be the population of newly developed chemicals, whereas its use for other domains would have to be discussed. Because the database contains only results from one test in one laboratory for each chemical, the predictive capacity of the in vivo experiment could only be modeled for the outcome of a repetition of the same experiment. This resulted in a specificity of 99.7%, i.e., three out of 1000 nonirritating chemicals would be classified false-positive, and a sensitivity of 94.1%, i.e., 59 out of 1000 irritating chemicals would be classified false-negative. Thus an NPV of 99.5%—i.e., only one of 200 chemicals classified negative would in fact be an irritant—and a PPV of 96.8%—i.e., out of 1000 chemicals classified as irritating 32 would not be irritating—were calculated. It is evident that modeling further aspects of variability—e.g., the within- and between-laboratory reproducibility—would decrease the predictive capacity estimates, resulting in the respective decrease of the predictive values. The effect of prevalence and test accuracy—i.e., assuming that sensitivity equals specificity—on the predictive values is illustrated for some combinations (Table 2). The most important consequence of these considerations is that, for rare toxic events, we can rely on the negative test results but not on the positive ones.
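The predictive values above follow directly from prevalence, sensitivity, and specificity. The sketch below recomputes them from the rounded figures quoted in the text (prevalence 7.9%, sensitivity 94.1%, specificity 99.7%); because the inputs are themselves rounded, the PPV comes out a few tenths of a percent below the published 96.8%.

```python
# PPV and NPV from prevalence, sensitivity, and specificity,
# using the skin-irritation figures quoted in the text.

def predictive_values(prevalence, sensitivity, specificity):
    tp = sensitivity * prevalence              # true positive fraction
    fp = (1 - specificity) * (1 - prevalence)  # false positive fraction
    tn = specificity * (1 - prevalence)        # true negative fraction
    fn = (1 - sensitivity) * prevalence        # false negative fraction
    return tp / (tp + fp), tn / (tn + fn)      # (PPV, NPV)

ppv, npv = predictive_values(prevalence=0.079,
                             sensitivity=0.941,
                             specificity=0.997)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

Lowering the prevalence while holding the test constant drives the PPV down and the NPV up, which is the pattern tabulated in Table 2.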


Table 2 Predictive Values for Combinations of Prevalence and Test Accuracy

 
As the negative predictive value is always close to 100% in this setting, the idea of controlling negative test results by means of a second test—e.g., confirming negative in vitro results in vivo, as suggested in the field of skin corrosion and irritation (OECD, 2002)—makes no sense at all for rare toxicities. Because the exposure events are rare, in most cases both tests will be done anyway, although the negative predictive value is high. In contrast, positive test results have to be challenged; i.e., in regulatory toxicology it is necessary to avoid over-classification and unnecessary restrictions of substances. In this context, we are well aware that false-negative classifications are the most crucial safety aspect, but the calculation shows that a second test can hardly improve the NPV, which is impaired by the false-negative findings of the first test. For example, for skin irritation, even a test with only 70% accuracy will identify negatives correctly in more than 96% of cases.


    CONSEQUENCES OF INACCURATE TESTS FOR PREVALENCE DETERMINATIONS
An important question often overlooked is this: How reliable are prevalences of rare diseases/toxicities if they are being assessed with our imperfect tools? If we agree that an in vivo experiment is not 100% accurate, many false-positives will populate our databases in case of rare toxicities. This means that rare toxicities are rarer than we believe. For illustration, in Table 3 we present some combinations of prevalence determined by a test with a given accuracy. For example, applying a test with 90% accuracy and finding a prevalence of 20% means that the true prevalence is only 12.5%; i.e., only 5 of 8 positive test substances are truly positive. Similarly, if we assume that the rabbit skin irritation test is 95% accurate, more than half of the selected skin irritants (prevalence 7.9%) would be false-positives.


Table 3 Real Prevalences for Some Combinations of Test Accuracy and Determined Prevalences

 
An obvious consequence is that the usefulness of databases for selecting the proper reference standard data is limited. As long as confirmatory testing in vivo is not carried out, it might be favorable to rely on the fewer, but more extensively studied substances from the scientific literature.


    THE USE OF CONFIRMATORY TESTS
It is common practice to apply a second test to confirm or challenge the results of a first test. When retesting positives, the specificity of the test procedure can be improved; when retesting negatives, the sensitivity of the test procedure can be improved. As we have seen above, at relatively low prevalences this makes sense only for positives, the NPV being almost optimal anyway. However, sensitivity and specificity of a test are interdependent: by defining, for example, the threshold value for a classification as positive or negative, one can be increased at the expense of the other, usually demonstrated with receiver operating characteristic (ROC) curves as illustrated in Figure 2 (McNeil et al., 1975; van der Schouw et al., 1995). This offers the opportunity to render tests extremely sensitive while accepting impaired specificity, a situation typical for screening tests.



FIG. 2. Illustration of a receiver operating characteristic (ROC) curve. The interdependence of the sensitivity and specificity of a test as the classification threshold is moved can best be visualized by ROC graphs. The more steeply the curve ascends, the better the test—i.e., it combines high sensitivity with high specificity.
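The threshold trade-off behind an ROC curve can be made concrete with a toy calculation. The score values below are invented purely for illustration; only the qualitative behavior (lower threshold, higher sensitivity, lower specificity) carries over to real tests.

```python
# Hypothetical continuous test scores for known negatives and positives.
negatives = [0.1, 0.2, 0.3, 0.4, 0.5]
positives = [0.4, 0.6, 0.7, 0.8, 0.9]

def sens_spec(threshold):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    sens = sum(s >= threshold for s in positives) / len(positives)
    spec = sum(s < threshold for s in negatives) / len(negatives)
    return sens, spec

for t in (0.35, 0.55):
    sens, spec = sens_spec(t)
    print(f"threshold {t}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Sweeping the threshold over all score values and plotting sensitivity against (1 - specificity) traces out the ROC curve itself.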

 
A commonly applied and simple strategy in clinics as well as in toxicology is the combination of a screening test followed by a confirmatory test. With an oversensitive test a population is screened in order to detect as many positives as possible. Inevitably, this approach produces a lot of false-positive results in the first step. In the second step, all positively screened patients/substances are subjected to a confirmatory test, which should be able to discriminate positives from negatives. The advantage of this strategy is often a reduction of costs, as screening tests with their lower overall predictive capacity are often substantially cheaper than their associated confirmatory tests. Nevertheless, the usefulness of this approach again strongly depends on the prevalence of the health effect (Buck and Gart, 1966) and on the dependence between the tests (Marshall, 1989). Although the overall testing costs are significantly decreased, the positive predictive value does not change substantially when compared to the PPV of the confirmatory test. For example, let us assume a prevalence of 1%, an extremely sensitive screening assay with a sensitivity of 100% but a specificity of only 50%, and a good confirmatory assay with an accuracy of 95%. Testing 10,000 substances, of which according to the assumed prevalence 100 are positive, the initial screen reduces the number of substances subjected to the confirmatory test by 4950, all of them true negatives. Applying the confirmatory test subsequently reduces the overall NPV from 100% to 99.95%; i.e., 9652 of 9657 negatives are true negatives. But this results in a PPV of only 27.7% (Table 4); i.e., 95 of the 343 positives are true positives, compared to a PPV of 16.1% for the confirmatory test alone (Table 2). This means that, for rare health effects, a sufficient specificity of the screening test has to be maintained; even close-to-perfect confirmatory tests cannot compensate.
In the given example, an improved screening test specificity of 80% would result in a PPV of 49.0%, and a value of 90% would result in a PPV of 65.7% (Table 4).
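The worked example can be condensed into one formula: under the assumption of independent tests, a substance is finally called positive only if both tests call it positive. This sketch reproduces the PPV figures quoted for the different screening specificities:

```python
# PPV of a screen-then-confirm sequence, assuming the two tests err
# independently of each other (the idealized case from the text).

def combined_ppv(prev, sens1, spec1, sens2, spec2):
    # Final positive = positive in the screen AND in the confirmatory test.
    tp = prev * sens1 * sens2
    fp = (1 - prev) * (1 - spec1) * (1 - spec2)
    return tp / (tp + fp)

# Prevalence 1%; screen: sensitivity 100%, specificity 50%/80%/90%;
# confirmatory test: 95% accurate (sensitivity = specificity = 0.95).
for spec1 in (0.50, 0.80, 0.90):
    ppv = combined_ppv(0.01, 1.00, spec1, 0.95, 0.95)
    print(f"screen specificity {spec1:.0%}: PPV = {ppv:.1%}")
```

With correlated tests the false-positive term would shrink less than the product suggests, so these PPVs are an upper bound in practice.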


Table 4 PPV of Combination of a Screening Test and a Confirmatory Test for Prevalences of 1% and 10%

 
A solution to further increase the PPV is the application of a series of complementary tests in sequence. This solution carries with it the problems of reduced cost savings and of evaluating the dependencies between tests, as complementary screens might be difficult to find. Furthermore, the efficacy of combining tests for rare health effects is limited even under optimal conditions—i.e., assuming test independence (Table 5).
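The limit referred to in Table 5 can be sketched by generalizing the two-test formula to n identical, independent tests that must all come out positive. This is the idealized independence case only; correlated tests would improve the PPV more slowly.

```python
# PPV after n independent tests, each with accuracy a
# (sensitivity = specificity = a), all required to be positive.

def sequence_ppv(prev, accuracy, n):
    tp = prev * accuracy ** n
    fp = (1 - prev) * (1 - accuracy) ** n
    return tp / (tp + fp)

# 90% accurate tests at a prevalence of 1%.
for n in (1, 2, 3):
    print(f"n = {n}: PPV = {sequence_ppv(0.01, 0.9, n):.1%}")
```

Each additional test multiplies the prior odds by the same likelihood ratio (here 9), so at very low prevalence several rounds are needed before positives become trustworthy, while sensitivity erodes with every round.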


Table 5 PPV Resulting from Combining n 90% Accurate, Independent Tests in Sequence

 
When combining a screening test and a confirmatory test, the screening test's balance of sensitivity and specificity has to be carefully designed and adjusted, because false-negative results in this step will have a negative effect on patient health or consumer safety. A more detailed insight into the prevalence is needed here to estimate the consequences. One also has to consider the strength of a response and not only the dichotomized classification: it makes an enormous difference whether a large proportion of actual results are borderline to a given threshold or whether the negatives and positives are clearly distinct (Brenner and Gefeller, 1997; Bruner et al., 2002). Especially in the case of low prevalences, an extremely skewed distribution toward the negative end of the scale can be expected.


    PREVALENCE IN DISTINCT CHEMICAL CLASSES
So far we have handled the chemicals from the chemical universe only as single entities. But each one is related to others, e.g., by chemical structure, physicochemical properties, or mechanism of action. There is an increased probability that related chemicals will exhibit similar toxicological effects; thus if information about a related chemical is available, it should be considered. This can be compared to the clinical situation, where integrating the family anamnesis might change the probability of a diagnosis dramatically. Consider, for example, mutagenicity: although it is a relatively rare toxic effect among chemicals in general, it is a common effect within the chemical group of nitrosamines. Making use of this kind of a priori information, structural alerts or read-across approaches should help to assign chemicals to families with high or low prevalence of health effects. This is by no means new but reflects practices of priority setting by (Q)SAR and similar computational approaches, as well as the experience of the risk assessor. What we desperately need, however, are measures of how closely two substances are related. As long as such measures are lacking, groups of chemicals should be considered mainly as classes with different prevalences and therefore different certainty of test results. This approach allows us to apply tests with different sensitivities and specificities, and it calls for test strategies that take into account general prevalence, chemical classes with their individual prevalences, and the use of proper tests with suitable predictive capacities.


    CONCLUSIONS
Toxicological tests can be considered as diagnostic tools to assess the toxicological properties of substances. Taking this point of view, it is possible to draw parallels between diagnostic medicine and toxicology and to explore the possibility of adopting evidence-based medicine methodology in toxicological evaluations. To properly assess toxicological tests, their reliability and relevance need to be explored, a process in which the reference standard is of crucial importance. If no appropriate reference standard exists, expert consensus can be a valid alternative. Nevertheless, reference standards will always be imperfect. Accounting for this imperfection is crucial for a complete test evaluation. In the present study, we demonstrate the extent to which imperfect reference data might populate databases, especially with false-positives in low-prevalence cases.

Furthermore, a systematic assessment of prevalences of toxic health effects in the chemical universe as well as in defined classes of chemicals is required. Only when prevalence is taken into account can the "diagnostic" value of a test be estimated. For example, in low prevalence situations, negative predictive values are almost optimal, which challenges the approach of confirming negative results. In addition, we emphasized that the use of confirmatory tests strongly depends on prevalence and test accuracy. Additional information on substances, including physicochemical properties, chemical structure, or classes, might affect the prevalence. Therefore, for toxicological hazard identification and testing strategies, integration of prevalence information is crucial. In this process, lessons can be learned from the medical diagnosis setting, and especially from the evidence-based evaluation of diagnostic measures, as a step toward evidence-based toxicology.


    REFERENCES
Allanou, R., Hansen, B. G., and van der Bilt, Y. (1999). Public availability of data on EU High Production Volume Chemicals. EUR 18996 EN (http://ecb.jrc.it).

Ames, B. N., and Gold, L. S. (1990). Chemical carcinogenesis: Too many rodent carcinogens. Proc. Natl. Acad. Sci. U.S.A. 87, 7772–7776.

Balls, M., Blaauboer, B. J., Brusick, D., Frazier, J., Lamb, D., Pemberton, M., Reinhardt, C., Roberfroid, M., Rosenkranz, H., Schmid, B., Spielmann, H., Stammati, A. L., and Walum, E. (1990). Report and recommendations of the CAAT/ERGATT workshop on validation of toxicity test procedures. Alt. Lab. Anim. 18, 303–337.

Balls, M., Blaauboer, B. J., Fentem, J. H., Bruner, L., Combes, R. D., Ekwall, B., Fielder, R. J., Guillouzo, A., Lewis, R. W., Lovell, D. P., Reinhardt, C. A., Repetto, G., Sladowski, D., Spielmann, H., and Zucco, F. (1995). Practical aspects of the validation of toxicity test procedures. The report and recommendations of ECVAM workshop 5. Alt. Lab. Anim. 23, 129–147.

Boyko, E. J., Alderman, B. W., and Baron, A. E. (1988). Reference test errors bias the evaluation of diagnostic tests for ischemic heart disease. J. Gen. Intern. Med. 3, 476–481.

Brenner, H., and Gefeller, O. (1997). Variation of sensitivity, specificity, likelihood ratios and predictive values with disease prevalence. Stat. Med. 16, 981–991.

Bruner, L. H. (1996). No prediction model, no validation study. Alt. Lab. Anim. 24, 139–142.

Bruner, L. H., Carr, G. J., Harbell, J. W., and Curren, R. D. (2002). An investigation of new toxicity test method performance in validation studies: 3. Sensitivity and specificity are not independent of prevalence or distribution of toxicity. Hum. Exp. Toxicol. 21, 325–334.

Buck, A. A., and Gart, J. J. (1966). Comparison of a screening test and a reference test in epidemiologic studies. I. Indices of agreement and their relation to prevalence. Am. J. Epidemiol. 83, 586–592.

Environmental Protection Agency (EPA). (1998). Chemical Hazard Data Availability Study. What do we really know about the safety of high production volume chemicals? EPA's 1998 baseline of hazard information that is readily available to the public. EPA, Office of Pollution Prevention and Toxics, Washington, D.C. (available at: www.epa.gov/opptintr/chemtest/hazchem.htm).

Fentem, J. H., Archer, G. E. B., Balls, M., Botham, P. A., Curren, R. D., Earl, L. K., Esdaile, D. J., Holzhütter, H. G., and Liebsch, M. (1998). The ECVAM international study on in vitro tests for skin corrosivity. 2. Results and evaluation by the management team. Toxicol. In Vitro 12, 483–524.

Gigerenzer, G., Hoffrage, U., and Ebert, A. (1998). AIDS counselling for low-risk clients. AIDS Care 10, 197–211.

Gold, L. S., Slone, T. H., and Ames, B. N. (1998). What do animal cancer tests tell us about human cancer risk?: Overview of analyses of the carcinogenic potency database. Drug Metab. Rev. 30, 359–404.

Grimes, D. A., and Schulz, K. F. (2002). Uses and abuses of screening tests. Lancet 359, 881–884.

Haynes, R. B., Sackett, D. L., Gray, J. M., Cook, D. J., and Guyatt, G. H. (1996). Transferring evidence from research into practice: 1. The role of clinical care research evidence in clinical decisions. A.C.P. J. Club 125, A14–16.

Haynes, R. B., Devereaux, P. J., and Guyatt, G. H. (2002). Clinical expertise in the era of evidence-based medicine and patient choice. Evid. Based Med. 7, 36–38.

Hoffmann, S., Cole, T., and Hartung, T. (2005). Skin irritation: Prevalence, variability and regulatory classification of existing in vivo data from industrial chemicals. Regul. Toxicol. Pharmacol. In press.

Hunink, M., Glasziou, P., Siegel, J., Weeks, J., Pliskin, J., Elstein, A. S., and Weinstein, M. C. (2001). Decision making in health and medicine: Integrating evidence and values. Cambridge University Press, New York.

Irwig, L., Bossuyt, P., Glasziou, P., Gatsonis, C., and Lijmer, J. (2002). Designing studies to ensure that estimates of test accuracy are transferable. B.M.J. 324, 669–671.

Knottnerus, J. A., van Weel, C., and Muris, J. W. (2002). Evaluation of diagnostic procedures. B.M.J. 324, 477–480.

Knottnerus, J. A., and Muris, J. W. (2003). Assessment of the accuracy of diagnostic tests: the cross-sectional study. J. Clin. Epidemiol. 56, 1118–1128.

Linnet, K. (1988). A review on the methodology for assessing diagnostic tests. Clin. Chem. 34, 1379–1386.

Marshall, R. J. (1989). The predictive value of simple rules for combining two diagnostic tests. Biometrics 45, 1213–1222.

McNeil, B. J., Keller, E., and Adelstein, S. J. (1975). Primer on certain elements of medical decision making. N. Engl. J. Med. 293, 211–215.

OECD (Organisation for Economic Cooperation and Development). (2002). OECD guideline for testing of chemicals No. 404: Acute dermal irritation/corrosion, pp. 1–13. Organisation for Economic Cooperation and Development, Paris.

Sackett, D. L., Haynes, R. B., Guyatt, G. H., and Tugwell, P. (1991). Clinical epidemiology: A basic science for clinical medicine. 2nd ed., Little Brown, Boston.

Sackett, D. L., and Haynes, R. B. (2002). The architecture of diagnostic research. B.M.J. 324, 539–541.

van der Schouw, Y. T., Verbeek, A. L., and Ruijs, S. H. (1995). Guidelines for the assessment of new diagnostic tests. Invest. Radiol. 30, 334–340.

Walter, S. D., Irwig, L., and Glasziou, P. P. (1999). Meta-analysis of diagnostic tests with imperfect reference standards. J. Clin. Epidemiol. 52, 943–951.

Weller, S. C., and Mann, N. C. (1997). Assessing rater performance without a "gold standard" using consensus theory. Med. Decision Making 17, 71–79.

Whiting, P., Rutjes, A. W., Reitsma, J., Bossuyt, P. M., and Kleijnen, J. (2003). The development of QUADAS: A tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. B.M.C. Med. Res. Methodol. 3, 25.

Worth, A., and Balls, M. (2001). The importance of the prediction model in the validation of alternative tests. Alt. Lab. Anim. 29, 135–143.