National Institute of Public Health and the Environment (RIVM), P.O. Box 1, 3720 BA Bilthoven, The Netherlands
Received August 23, 2001; accepted October 31, 2001
ABSTRACT
A family of (nested) dose-response models is introduced herein that can be used for describing the change in any continuous endpoint as a function of dose. A member from this family of models may be selected using the likelihood ratio test as a criterion, to prevent overparameterization. The proposed methodology provides for a formal approach of model selection, and a transparent way of assessing the benchmark dose. Apart from a number of natural constraints, the model expressions follow from an obvious way of quantifying differences in sensitivity between populations. As a consequence, dose-response data that relate to both sexes can be efficiently analyzed by incorporating the data from both sexes in the same analysis, even if the sexes are not equally sensitive to the compound studied. The idea of differences in sensitivity is closely related to the assessment factors used in risk assessment. Thus, the models are directly applicable to estimating such factors, if data concerning populations to be compared are available. Such information is valuable for further validation or adjustment of default assessment factors, as well as for informing distributional assessment factors in a probabilistic risk assessment. The various applications of the proposed methodology are illustrated by real data sets.
Key Words: dose-response modeling; benchmark dose; continuous endpoints; critical effect size; probabilistic assessment factor.
In the field of risk assessment the common way of analyzing dose-response data from animal studies is by statistically testing each dose group against the controls, resulting in a no observed adverse effect level (NOAEL). Because of various serious objections against this approach, Crump (1984) introduced the "benchmark dose" as an alternative to the NOAEL. In the benchmark approach a dose-response model is fitted to the data, and this model is used for estimating the dose at a certain level of response. Although the benchmark approach is gaining attention from toxicologists and risk assessors, its implementation in practice is still limited. One of the main practical difficulties of the benchmark approach is the quantification of the level of response that can be considered as biologically insignificant, or as acceptable from a risk management perspective. A second difficulty concerns the choice of the model to be fitted to the dose-response data. As yet, virtually no biologically based dose-response models for noncancer effects are available. Therefore, the benchmark approach has to rely on purely descriptive dose-response models. Clearly, numerous descriptive dose-response models can be fabricated, and the goodness of fit is usually the only criterion for deciding that one model is better than another. However, various alternative models may be found that equally satisfy the goodness of fit criterion for a particular data set. This is a serious problem as different models fitted to the same data may result in different benchmark doses. Therefore, the benchmark approach would be much improved if a formal procedure were available to guide the choice of model, ensuring that a particular data set yields reproducible results, so that the estimated benchmark dose can be regarded as an objectively assessed value.
This article proposes a family of dose-response models, together with a formal method to decide which member of this family of models will be used for describing any particular data set. As will be shown, the formulation of the proposed models follows directly from a number of desirable properties that are imposed by both biological and risk assessment considerations. A number of examples will illustrate the various applications of these models. In all examples benchmark doses (CEDs) are given, but it should be noted that the methodology discussed here also applies to the analysis of dose-response data that serve to answer a particular scientific question without being directly related to quantitative risk assessment.
Critical Effect Dose and Continuous Data
Toxicological observations may be reported as quantal, as ordinal, or as continuous data. In the toxicological literature the benchmark approach is often applied to and discussed for quantal data (e.g., Allen et al., 1994; Faustman et al., 1994). Accordingly, the benchmark response is usually defined in terms of a response rate, in particular a given increase in incidence of a particular effect. It can be shown (Slob, in preparation) that this approach is problematic, due to the fact that experimental errors interfere with response rates that are observed in or estimated from experimental studies.
For continuous data this problem does not arise. Here, the dose-response relationship can be estimated separately from the experimental error (reflected by the variation in the observations within dose groups). A second fundamental difference is that a continuous dose-response relationship represents the change of the size of the effect with dose. It is thus natural to define the benchmark response for continuous data in terms of a particular effect size, but there are various ways of doing this. Some authors have proposed to define effect size relative to the "natural" variation in the controls (Crump, 1995; Gaylor and Slikker, 1990). However, in a typical toxicity study (using inbred animals) the variation in the observations mainly reflects the heterogeneity in the experimental conditions. Normally, we try to minimize the experimental variation in toxicity studies, and the "natural" variation that we observe in any such study quantifies the extent to which we did not succeed in that. Clearly, this is not natural variation in any biological sense, and therefore it cannot serve as a standard in defining the benchmark response. This issue is further addressed elsewhere (Slob, in preparation).
Instead of using the observed variation as a reference, the benchmark response can be defined in a biologically meaningful way as a particular change in the size of the effect that is considered acceptable or without adverse consequences for the subject. A toxicity study only allows for estimating the size of the effect (as a function of dose) as shown by the test animal under the average of all experimental conditions associated with the particular study. From a biological point of view the most natural way of measuring an effect size is in terms of a percent change relative to the background value of the particular endpoint.
To distinguish terminology from that currently used for quantal data, Slob and Pieters (1998) introduced the terms Critical Effect Size (CES) and Critical Effect Dose (CED) for continuous data. The CES reflects the quantitative change in a particular endpoint considered as nonadverse or acceptable, at the level of the individual organism. For instance, one might postulate that inhibition of cholinesterase of less than 20% is nonadverse for an individual, i.e.,
ChE(CED) / ChE(0) = 0.80, i.e., CES = 20%
Murrell et al. (1998) proposed another way of measuring the continuous benchmark response, viz. relative to the range between maximum and background response. The advantage of this approach would be that in this way a single numerical value, e.g., 10%, might be considered valid for different endpoints. Indeed, it does seem reasonable to take the maximum response level into account. For instance, a given percent change (CES) would most likely be less serious for a serum enzyme, which may increase 2- or 3-fold at high exposures, than it would be for relative liver weight, which typically has a much lower maximum response. However, it is better to let this aspect play a role in choosing a percent change (CES) for any particular endpoint, and maintain the freedom to take other considerations into account as well (e.g., normal fluctuations within individuals, biological meaning of the endpoint). A practical drawback of Murrell's definition is that it is often impossible to estimate the maximum response level due to insufficient data.
A General Family of Models
For most toxicological (continuous) endpoints no dose-response models are available that have been derived from the underlying biological mechanisms. Given this situation one has to resort to purely descriptive models. However, numerous descriptive models can be devised that are all suitable for fitting dose-response data. Yet, even though knowledge on the underlying biological mechanisms is usually poor, not all models may be considered equally plausible for mimicking the true dose-response relationship. Further, not all models are equally suitable in serving the purposes of risk assessment. The following considerations are relevant in developing a generic family of dose-response models to be used for analyzing dose-response data from toxicity tests.
(1) To begin with, toxicological (continuous) measurements are nonnegative, and models allowing negative responses are unrealistic. These models should not be used, even when the fitted model only reaches negative values outside the range of observation. For that reason the often used polynomial functions (y = a + bx + cx^2 + ...) will be excluded. (One of the reasons for the popularity of this family of models was ease of calculation, but this has become less compelling with modern computers.) The reader might note here that growth data could be negative, e.g., when for each animal initial body weight is subtracted from body weight after exposure. However, since small and large animals are likely to respond proportionally to a given treatment rather than by a similar absolute change, it makes more sense to measure growth by the ratio of body weight over initial body weight.
(2) Another fundamental choice is to exclude threshold models. The first reason is that the existence of dose-thresholds in a strict sense is hard to defend in dose-response relationships. In the case of continuous endpoints one necessary condition for a dose-response relationship with a threshold would be that all individual animals have exactly the same dose-threshold for that endpoint. This is highly unlikely, even in a highly controlled experiment with inbred animals. A second reason is that the estimation of the threshold as a parameter in a model is problematic: depending on the start values (see Appendix) of the model's parameters, different estimates may result from the estimation procedure. The issue of dose-thresholds was extensively discussed elsewhere (Slob, 1999). Instead of using threshold models in the strict sense, threshold-like dose-response relationships may be described by smooth functions that are sufficiently curved (e.g., sigmoidal). Such functions allow the assessment of a dose associated with any small effect. Therefore, a generic family of models must include functions that have the flexibility to take on a sufficiently curved shape for describing threshold-like dose-response relationships.
(3) An important issue in toxicology and risk assessment is that of relative sensitivity between different populations, e.g., between test animals and humans, between average individual and sensitive individual, or between sexes. Therefore it would be convenient if relative sensitivity is reflected by one of the model's parameters. Before defining the concept of relative sensitivity, we first need to make explicit what we mean by "equally sensitive." Clearly, we would demand that the response is equal for all doses. This is obvious for two populations that do not differ in the background value of a particular endpoint. However, if they do differ, this background value needs to be taken into account. For example, body weights are usually different between sexes. An equal response in both sexes will now be defined as:
y_males(x) / y_males(0) = y_females(x) / y_females(0)    for all doses x
Thus, if the relative change compared to the background value is the same for all doses, males and females are considered equally sensitive regarding the particular endpoint. To express this idea in a dose-response model, it needs to be of the form:
y = a f(x)    (1)
The next question to be answered is how to express differences in sensitivity. The most convenient way of doing this appears to be in terms of a dose factor (potency factor or toxicity equivalence factor). Such a factor is consistent with the concept of assessment factors (also termed safety, uncertainty, or extrapolation factors), and if suitable data are available these factors can thus be directly estimated. Therefore, the general dose-response model should obey
y = a f(x/b)    (2)
(4) Next, a decision must be made on the function f(x) in Equation 2. It should, of course, obey f(0) = 1, which can be achieved by exponential functions.
(5) A family of models is needed having the property that, for any particular dataset, an appropriate member can be chosen in a formal way. Therefore, it is convenient to work with a family of nested models, so that their fits can be compared using the likelihood ratio test (see section Further Statistical Methods).
(6) The family of models should include models that level off. Also it should contain models that change very little at low doses but that do change at higher doses (to mimic threshold-like responses).
Taken together, these conditions suggest the following family of models:
Model 1: y = a with a > 0
Model 2: y = a exp(x/b) with a > 0
Model 3: y = a exp(±(x/b)^d) with a > 0, b > 0, d ≥ 1
Model 4: y = a [c - (c - 1) exp(-x/b)] with a > 0, b > 0, c > 0
Model 5: y = a [c - (c - 1) exp(-(x/b)^d)] with a > 0, b > 0, c > 0, d ≥ 1
where y is any continuous endpoint, and x denotes the dose. In all models the parameter a represents the level of the endpoint at dose 0, and b can be considered as the parameter reflecting the efficacy of the chemical (or the sensitivity of the subject). At high doses Models 4 and 5 level off to the value ac, so the parameter c can be interpreted as the maximum relative change. Models 3 and 5 have the flexibility to mimic threshold-like responses. All these models are nested within each other, except Models 3 and 4, which both have three parameters. These two models cannot be compared to each other by a likelihood ratio test, and other considerations must be used to make a choice between them (see Illustrative Example 4).
In all models the parameter a is constrained to being positive for obvious reasons (it denotes the value of the endpoint at dose 0). The parameter d is constrained to values larger than (or equal to) 1, to prevent the slope of the function at dose 0 being infinite, which seems biologically implausible. The parameter b is constrained to be positive in all models. Parameter c in Models 4 and 5 determines whether the function increases or decreases, by being larger or smaller than unity, respectively. To make Model 3 a decreasing function a minus sign has to be inserted in the exponent.
Figure 1 illustrates the separate members of this family of models, and how they are related to each other. For example, Model 2 is nested within Model 3, i.e., Model 2 is a special case of Model 3, as can be seen by setting d = 1. Vice versa, Model 3 can be considered as an extension of Model 2, by incorporating the parameter d. As another example, Model 5 can be regarded as an extension of Model 2, by adding two more parameters (c and d).
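The five model expressions can be written down directly in code; the following is a minimal sketch in Python (function and parameter names are ours, not from the original), which also makes the nesting relations easy to verify:

```python
import math

# The five nested models (x = dose, y = continuous endpoint).
# Parameter meanings as in the text: a = background value at dose 0,
# b = sensitivity/efficacy parameter, c = maximum relative change,
# d = shape parameter (d >= 1).

def model1(x, a):
    # Model 1: no dose effect; y equals the background value a
    return a

def model2(x, a, b):
    return a * math.exp(x / b)

def model3(x, a, b, d, sign=+1):
    # sign = -1 inserts the minus in the exponent, giving a decreasing curve
    return a * math.exp(sign * (x / b) ** d)

def model4(x, a, b, c):
    return a * (c - (c - 1) * math.exp(-x / b))

def model5(x, a, b, c, d):
    return a * (c - (c - 1) * math.exp(-(x / b) ** d))
```

Setting d = 1 in model3 reproduces model2, and setting d = 1 in model5 reproduces model4, which is the nesting exploited by the likelihood ratio test; at very high doses model5 indeed levels off to a·c.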
When a dose-response model is fitted to two (or more) subpopulations simultaneously, the methodology presented here allows for testing the homogeneity of variances between the subpopulations, as described below.
Likelihood Ratio Test
The likelihood ratio test may be used to test if extension of a model by increasing the number of parameters results in a statistically significant improvement of the fit. It can be shown that twice the difference of the log-likelihoods associated with the two model fits (approximately) follows a chi-square distribution, with the number of degrees of freedom equal to the difference in the number of parameters in both models. Table 1 shows the critical (α = 0.05) difference in the two log-likelihood values, as a function of the number of degrees of freedom, i.e., the difference in the number of parameters between the two models.
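The critical values of Table 1 are simply half the 95th percentiles of the chi-square distribution. They can be reproduced with a short stdlib-only sketch (the incomplete-gamma series below is a textbook expansion, not code from the original):

```python
import math

def chi2_cdf(x, df):
    """CDF of the chi-square distribution: the regularized lower
    incomplete gamma function P(df/2, x/2), via its series expansion."""
    if x <= 0:
        return 0.0
    a, s = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-12:
        n += 1
        term *= s / (a + n)
        total += term
    return total * math.exp(a * math.log(s) - s - math.lgamma(a))

def critical_delta_loglik(df, alpha=0.05):
    """Half the (1 - alpha) chi-square quantile: the critical difference
    in log-likelihood between two nested models differing by df parameters."""
    lo, hi = 0.0, 100.0
    for _ in range(100):          # bisection for the quantile
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * lo
```

For example, `critical_delta_loglik(1)` is approximately 1.92 and `critical_delta_loglik(3)` approximately 3.91, matching the values used later in the text.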
No Individual Data Available
In reported studies individual observations are usually not given. Instead, means and SDs (or standard errors of the mean) for each dose group are commonly reported. Since the mean and SD are "sufficient" statistics for a sample from a normal distribution, a dose-response model can just as well be fitted based on these statistics without any loss of information (apart from the ability to detect possible outliers). However, the normality assumption is not plausible, and it is better to assume lognormality. In that case sufficient statistics are provided by the geometric mean and the geometric SD, or by the (arithmetic) mean and SD on log-scale. These can be estimated from the reported mean and SD as follows:
sd_log^2 = ln(1 + (SD/mean)^2)

mean_log = ln(mean) - sd_log^2 / 2

GM = exp(mean_log)

GSD = exp(sd_log)

where mean and SD denote the reported (arithmetic) group mean and standard deviation.
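Under the lognormality assumption these conversions follow from the standard lognormal moment relations; a minimal sketch (variable names are ours):

```python
import math

def log_scale_stats(mean, sd):
    """Convert a reported arithmetic mean and SD to the mean and SD on
    (natural) log-scale, assuming the observations are lognormal."""
    var_log = math.log(1.0 + (sd / mean) ** 2)
    mu_log = math.log(mean) - var_log / 2.0
    return mu_log, math.sqrt(var_log)

def geometric_stats(mean, sd):
    # Geometric mean and geometric SD follow by exponentiating
    mu_log, sd_log = log_scale_stats(mean, sd)
    return math.exp(mu_log), math.exp(sd_log)
```

A quick consistency check: for a lognormal with log-scale mean 1.0 and log-scale SD 0.5, the arithmetic mean is exp(1.0 + 0.5^2/2) and the arithmetic SD is mean·sqrt(exp(0.5^2) - 1); feeding these back into `log_scale_stats` recovers 1.0 and 0.5 exactly.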
Selecting a Member from the Model Family
The nested character of the proposed family of models makes it possible to formally choose a model for describing a particular data set. In general, when a model is extended by one or more parameters the resulting fit criterion will achieve a higher value (or at least remain equal), as compared to the model with fewer parameters. However, it is unfavorable to use a model with too many parameters, as this results in reduced precision of model predictions. Therefore, a formal criterion is needed to decide whether any extension in the number of parameters should be accepted or not. For nested models this can easily be done by the likelihood ratio test, and an obvious formal decision criterion is the 5% significance level using this test. Thus, the model can be formally selected by consecutively fitting the members of the model family and choosing the model that cannot be significantly improved by a model having more parameters, as determined by the likelihood ratio test.
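The selection rule itself can be written as a small procedure. Given the maximized log-likelihood and parameter count of each fitted model, an extension is accepted only when the gain in log-likelihood exceeds the critical value for the relevant degrees of freedom (the fitting step is not shown here, and the numbers in the usage example are hypothetical):

```python
# Critical log-likelihood differences at alpha = 0.05 (half the chi-square
# 95th percentile), indexed by the difference in number of parameters.
CRIT = {1: 1.92, 2: 3.00, 3: 3.91, 4: 4.74}

def accept_extension(loglik_small, k_small, loglik_big, k_big):
    """Likelihood ratio test: is the model with more parameters
    significantly better at the 5% level?"""
    df = k_big - k_small
    return loglik_big - loglik_small > CRIT[df]

def select_model(chain):
    """chain: list of (name, n_params, loglik), ordered from the simplest
    model to its nested extensions. Keep the current model unless a later
    extension improves the fit significantly."""
    current = chain[0]
    for candidate in chain[1:]:
        if accept_extension(current[2], current[1], candidate[2], candidate[1]):
            current = candidate
    return current[0]

# Hypothetical log-likelihoods: Model 2 clearly beats Model 1, but neither
# Model 3 nor Model 5 adds a significant improvement, so Model 2 is chosen.
fits = [("Model 1", 2, 100.0), ("Model 2", 3, 120.0),
        ("Model 3", 4, 120.5), ("Model 5", 6, 121.0)]
print(select_model(fits))  # prints Model 2
```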
Illustrative example 1.
For illustrative purposes an extraordinarily complete set of dose-response data will be considered first. The data concern fetal weights as a function of po exposure to butyl benzyl phthalate (BBP) in maternal animals during gestation (Piersma et al., 2000). These data are plotted in Figure 2, with three members of the model family consecutively fitted to them. The log-likelihoods associated with all 5 models are given in Table 2. Comparing Model 2 with Model 1 (no response) results in an enormous increase in the log-likelihood, not leaving any doubt regarding the existence of an effect of BBP on fetal weights. Since Model 2 has only one parameter more than Model 1, a difference of 1.92 would already be sufficient for significance at the 5% level (see Table 1). Further, it can be seen that Model 3 performs significantly better than Model 2, while Model 5 in turn performs better than Model 3. Model 4, however, does not lead to any improvement compared to Model 2 (indeed, parameter c was estimated to be 0 here).
Parenthetically, in this analysis it was assumed that the fetal weights are independent, which was in fact not the case (fetuses from the same mother were correlated). However, for point estimation of the parameters this has only minimal consequences, unless the design is highly unbalanced. In establishing confidence intervals, however, the intraclass correlation should not be ignored but be taken into account in the statistical analysis.
Illustrative example 2.
The sigmoid dose-response relationship as found in Figure 2 agrees with what toxicologists generally expect: at low doses the response is negligible, and at higher doses the response levels off. However, a 28-day study on Rhodorsil Silane resulted in close to linear dose-response relationships for all continuous endpoints considered (Woutersen et al., 2001). Figure 3 shows the dose-response data for one of the endpoints measured in this study, viz. red blood cell counts, illustrating that the proposed family of models, despite being nonlinear, is flexible enough to cover (nearly) linear dose-response relationships. One may note that this example further illustrates that assumptions on thresholds or threshold mechanisms for noncancer endpoints may not be valid in general. Further, this example illustrates that the NOAEL will decrease with increasing sample sizes, while the CED (point estimate) remains stable.
Suppose, for example, that both sexes respond by the same relative decrease in body weight at any given dose of a particular compound. In that case one may consider both sexes as equally responding to the compound (with respect to body weight). Because of the nature of the proposed models, one can describe the dose-response relationship in both sexes by the same member of the model family, by only assuming a different value for the parameter a in both sexes, e.g., for Model 5 we have
y = a_sex [c - (c - 1) exp(-(x/b)^d)]

where the parameter a_sex takes a different value for males and females.
Next, consider the situation that both sexes are not equally sensitive to the compound with respect to body weight. In that case we make parameter b sex dependent, in addition to parameter a. We then again have one more parameter to be estimated, and it can be tested if the associated log-likelihood has significantly increased, using the likelihood ratio test with one degree of freedom.
Of course, it could be that fitting the model to both sexes separately is even better. This can be tested by comparing the sum of the two log-likelihoods associated with the separate fits with the log-likelihood of the simultaneous fit, taking account of the difference in the total number of parameters estimated in both situations. For example, fitting Model 5 to both sexes separately involves a total of 10 parameters (4 regression parameters and one variance for each sex), while a simultaneous fit with both a and b sex dependent comprises 7 parameters. Therefore, the log-likelihood for the separate fits should be at least 3.91 larger than that of the simultaneous fit to be significantly better (see Table 1).
Illustrative example 3.
Figure 4 shows the individual (rat) body weights at the end of a chronic study as a function of the applied dose. Clearly, male and female rats differ in body weights, and there is no need to statistically test this difference. The first step of the analysis, that of choosing a member of the model family, can be performed by assuming parameter a to be sex dependent right away. While Model 2 results in a clearly significantly better log-likelihood compared to Model 1 (no response), Models 3 and 4 do not add anything to the fit (see Table 3). Therefore, Model 2 may be selected for this data set. Next, Model 2 is fitted to males and females separately. The sum of the associated log-likelihoods (375.57, 6 parameters, 3 for each sex) is compared to the simultaneous fit with parameter a sex-dependent (365.42, 4 parameters). The difference between these two values is clearly significant (Δloglik = 10.15, df = 2). Therefore, the model is fitted with b sex-dependent as well, resulting in an only marginally higher log-likelihood (365.99) compared to the situation where b is the same for both sexes. It may be concluded that males and females do not appear to differ in sensitivity. Finally, the model is fitted by assuming that, next to a, the residual variation also differs between sexes. This fit results in a log-likelihood of 375.07, only marginally smaller than the sum of the separate analyses, so this model is selected for describing this data set (see Fig. 4).
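The chain of comparisons in this example can be checked directly from the reported log-likelihoods (critical differences 1.92 for 1 df and 3.00 for 2 df, as in Table 1):

```python
separate  = 375.57   # males and females fitted separately, 6 parameters
sim_a     = 365.42   # simultaneous fit, only a sex-dependent, 4 parameters
sim_ab    = 365.99   # a and b sex-dependent, 5 parameters
sim_a_var = 375.07   # a and residual variance sex-dependent, 5 parameters

# The separate fits beat the fit with only a sex-dependent
# (difference 10.15 > 3.00 at 2 df) ...
assert round(separate - sim_a, 2) == 10.15
assert separate - sim_a > 3.00

# ... but making b sex-dependent does not explain the gap (0.57 < 1.92):
assert not (sim_ab - sim_a > 1.92)

# allowing a sex-dependent residual variance does (9.65 > 1.92):
assert sim_a_var - sim_a > 1.92

# and the remaining gap to the separate fits is not significant (0.50 < 1.92):
assert not (separate - sim_a_var > 1.92)
```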
It should be noted that in this example the sum of the separate fits for males and females results in a log-likelihood that, although not significantly so, is nonetheless considerably higher. Indeed, considering the fact that the geometric means in Figure 6 do not coincide with the fitted model very well, one might wonder if the model is close to the true dose-response relationship. So, the question remains here whether the deviations of the means from the model are "real" or due to experimental error (note the large variation and the small number of animals). By deciding the latter (based on the nonsignificant likelihood ratio test), the deviations of the means from the model are taken into account in the statistical analysis as additional uncertainty, resulting in wider confidence intervals for the CED.
Illustrative example 5.
When fitting a dose-response model to dose-response data from a published study one usually has to work with the reported (arithmetic) means and SDs (or standard errors of the mean) for each treatment group. These are "sufficient" statistics for a sample from a normal distribution, and one may fit a dose-response model to these statistics instead of to the individual data. (Of course, an analysis of outliers is no longer possible.) However, it is more plausible to assume lognormality, and in that case the geometric means and geometric standard deviations are the proper (and sufficient) statistics. The latter two statistics can be estimated from the former two (see section Further Statistical Methods), but this procedure will not give identical results to direct estimation from the individual data. Nonetheless, the resulting inaccuracies may be negligible. To illustrate this, the data of Illustrative Example 3 were reanalyzed based on the reported (arithmetic) means and SDs instead of on the individual data. Table 5 shows that the same model would be selected, with a similar fit result (compare Figs. 4 and 7).
Illustrative example 6.
Figure 8 shows the cholinesterase (ChE) activity measured in plasma after 1, 2, and 4 months of exposure to an OP-ester. The process of model selection resulted in Model 5 with parameters a and b both dependent on time. The fact that parameter a is significantly different between the three points in time implies that the (background) ChE activity changes with age of the animals. The dependence of b on time can be interpreted as an effect of the OP-ester: with increasing exposure the dose-response curve gets steeper, i.e., the ChE inhibition increases with exposure duration. Accordingly, the CED decreases with increasing exposure duration.
Instead of using ratios of NOAELs, assessment factor distributions could be estimated by ratios of CEDs. This appears a better approach for several reasons. First, a NOAEL is the dose at which effects could not be observed, and therefore depends to a large extent on the particular experimental conditions (including sample size, dose spacing, experimental noise). Thus, a NOAEL ratio can easily differ from one, even if the dose-response relationships related to both situations (associated with numerator and denominator) were in reality identical. A more fundamental problem of a NOAEL ratio is that it is not possible to define a (biological) entity that it intends to estimate. This makes it a scientifically unsound approach.
A CED ratio, on the other hand, does estimate a clearly defined entity: it estimates the dose factor needed in one situation (or population) to make its response, in terms of a particular CES, equal to the other situation (or population). Thus, the CED ratio may be regarded as an analogue to the toxicity equivalence factor, or potency factor. Since values for CES cannot be precisely defined, and are open to debate and future insights, the concept of a CED ratio has only practical meaning when its value does not depend on the particular CES, at least in the range of smaller values of CES. This assumption always needs to be checked before assessing the CED ratio. The methodology discussed in this article has the advantage that checking this assumption is inherent in it: when two dose-response data sets can be described by the same model, with parameters a and b possibly being different, the assumption is fulfilled. The estimation of a CED ratio is illustrated in the following example.
Illustrative example 7.
Figure 9 shows cholinesterase activity as a function of dose, observed in two studies using different species. Model 5 was fitted to both species simultaneously, allowing for parameters a and b to be different between species, while parameters c and d are assumed equal. The data for both species appear to be adequately described by this model. By accepting this model, the assumption that the ratio of CEDs for both species does not depend on the CES chosen is fulfilled.
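For Model 5 the CED at a given CES can be written in closed form, and the parameter b enters only as a dose-scaling factor; hence, when two curves share c and d, their CED ratio equals the ratio of their b values whatever the CES. A sketch (parameter values hypothetical, for a decreasing response such as ChE inhibition, so c < 1):

```python
import math

def ced_model5(ces, b, c, d):
    """Dose at which Model 5, y = a[c - (c - 1)exp(-(x/b)^d)], has declined
    by a fraction `ces` from its background value a (decreasing curve,
    0 < c < 1 and 0 < ces < 1 - c). Solving y/a = 1 - ces for x gives:"""
    inner = (1.0 - ces - c) / (1.0 - c)
    return b * (-math.log(inner)) ** (1.0 / d)

# With c and d shared between two species, the CED ratio is b1/b2
# for every choice of the CES:
b1, b2, c, d = 0.8, 2.4, 0.2, 2.0
for ces in (0.05, 0.10, 0.20):
    ratio = ced_model5(ces, b1, c, d) / ced_model5(ces, b2, c, d)
    assert abs(ratio - b1 / b2) < 1e-12
```

This is exactly the condition stated above: accepting a simultaneous fit in which only a and b differ between the data sets guarantees that the CED ratio does not depend on the chosen CES.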
As already indicated, collecting more of these factors (for other compounds/endpoints) will inform the distribution of the interspecies adjustment factor. This empirical distribution may be regarded as reflecting the interspecies sensitivity variation among compounds. However, it is inflated by estimation errors, and therefore may be unnecessarily wide. To gain insight into the extent to which this happens, it is helpful to quantify the uncertainty of each of these factors, or CED ratios, e.g., by the bootstrap method (Slob and Pieters, 1998). The resulting uncertainty distribution for the CED ratio in this example is shown in Figure 10. The associated GSD (which may be considered as a geometric standard error for the estimated ratio) is 1.24.
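The GSD of a bootstrap sample of CED ratios is simply the exponentiated standard deviation of the log ratios; a minimal sketch (the sample below is illustrative, not the one underlying Figure 10):

```python
import math

def geometric_sd(values):
    """Geometric standard deviation of a sample of positive ratios:
    exponentiate the sample SD of the natural logs."""
    logs = [math.log(v) for v in values]
    mean = sum(logs) / len(logs)
    var = sum((u - mean) ** 2 for u in logs) / (len(logs) - 1)
    return math.exp(math.sqrt(var))

# Example: a sample spread symmetrically (on log-scale) around 4
boot = [2.0, 4.0, 8.0]
print(round(geometric_sd(boot), 2))  # prints 2.0
```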
DISCUSSION
This article presents a generic family of dose-response models, together with a formal procedure of model selection based on the likelihood ratio test. Over the past 4 years this methodology has been extensively applied to a large number of (noncancer) toxicity studies, performed at RIVM and TNO in the Netherlands (Piersma et al., 2000; Woutersen et al., 2001), or obtained from the open literature (Appel et al., 2001; Janssen et al., in preparation). Thus far, the models of the proposed family were always applicable in analyzing the observed dose-response data. Apparently, the family of models proposed here is both flexible and comprehensive enough to describe virtually any continuous dose-response data as resulting from toxicity studies. Together with the formal procedure of model selection, this methodology may serve as a generic approach with the important advantage that the analysis and the estimated CEDs are much less dependent on the person who performed the analysis.
Nonetheless, the formal procedure of model selection should be considered as a guideline, and not as a dictating rule. The basic problem of analyzing dose-response data is to decide what part of the observed differences in response levels is due to experimental fluctuations, and what part to real effects of the applied dose. The formal methodology of statistically testing different models only accounts for the random variation between animals. However, it cannot account for the fact that dose groups may differ in response due to other factors than the dose. Unfortunately, toxicologists (and statisticians) often overlook the fact that in a typical toxicity study treatment groups are in fact not replicated. Since in practice it is virtually impossible to completely randomize a study with respect to all relevant aspects, systematic differences in treatments (other than the dose) or in circumstances between the dose groups can always occur. This may result in systematic errors (Cox, 1958), i.e., systematic deviations in response between dose groups that are not caused by the dose, but by any other experimental factor(s). Such systematic errors may result from obvious factors (e.g., all animals in a dose group share the same cage), or from less obvious ones (e.g., systematic order in section, or in feeding). Further, they may occur in some endpoints but less so in others, in the same study (Woutersen et al., 2001). The problem of unreplicated dose groups is especially prominent in the NOAEL approach: an observed difference between a dose group and the controls could be caused by systematic errors between the treatment groups, while a real dose effect could be blurred by it. This problem is attenuated somewhat when the data are analyzed by dose-response modeling. First, one will be more aware of it by considering the dose-response pattern as a whole: a single dose group deviating from the general pattern may alert one to this possibility. Second, compared to the 2-by-2 comparisons in the NOAEL approach, fitting a model to all dose groups simultaneously reduces the influence of a single deviating dose group, especially in multiple dose studies. But nonetheless, the problem of systematic errors between treatment groups could lead to spurious model fits. Therefore, the formal procedure of model selection should not be taken as a dictating principle in all circumstances: biological arguments may overrule it and lead to another model selection. But, of course, these arguments must be made explicit, so that other experts can judge their plausibility.
The problem of systematic errors in (dose) groups has an important bearing on the issue of nonmonotonic dose-response relationships. In a typical toxicity study with three dose levels and a control group, it is easy to see that systematic errors between treatment groups can lead to nonmonotonic dose-response data that are even supported by statistical significance. Therefore, this type of study can never provide evidence that the true dose-response relationship is really nonmonotonic. Real evidence for a nonmonotonic dose-response relationship can only come from multiple-dose studies: to exclude systematic errors between treatment groups as an explanation, a nonmonotonic dose-response relationship needs to be substantiated by several consecutive dose groups. And even then, one needs to make sure that the study protocol did not impose any systematic order on the consecutive dose groups.
Whatever the plausibility of nonmonotonic dose-response relationships, be it in the lower dose range (hormesis) or in the higher dose range (qualitative changes in physiology), we never found, in the hundreds of dose-response data sets analyzed, compelling reasons to extend the family of models with nonmonotonic members. As a matter of fact, in the multiple-dose studies (i.e., studies having the potential to detect hormesis) that we analyzed, the data gave evidence for monotonic rather than nonmonotonic relationships (see also several of the illustrative examples). Further, it should be noted that the number of regression parameters to be estimated would increase further if the model family were extended with nonmonotonic members. A sensible standpoint appears to be to stick to monotonic dose-response models, unless monotonicity is clearly inadequate a priori (e.g., micronutrients) or is shown to be inadequate by convincing data. But, as already indicated, a nonmonotonic dose-response relationship is more difficult to assess, and the standard OECD design is not adequate for that purpose.
One of the arguments raised against the benchmark approach is that it hinges strongly on the dose-response model that was fitted, while the nature of the "true" dose-response model remains unknown. Indeed, different models may result in different (point) estimates of the CED. The family of models and the associated methodology described in this article aim to solve this by providing a formal method such that different persons analyzing the same data set obtain the same result (unless they have reason to deviate from the formal method, as discussed above). However, if another family of models were to result in markedly different CED estimates, there would be little ground to regard the proposed methodology, formal as it may be, as a tool that deserves general acceptance. After all, the family of models presented here was based only on a number of general considerations, without firm scientific substantiation using, for instance, knowledge of toxicological mechanisms. Yet this is all that can be done: toxicological mechanisms, insofar as they are known at all, are too variable to be captured in a single model. Indeed, the nature of the "true" dose-response model remains unknown, and in such a situation the purpose of using a model can be no more than describing the data. Filling gaps between data points can only be done with a model that contains information (e.g., on the underlying mechanisms). As long as we have to work with models that basically do not contain any information, only the quality of the data determines the quality of the outcome. Therefore, fitting a model to dose-response data is solely intended to smooth the fluctuations in the data and to provide a formal method for obtaining a point estimate of the parameter of interest (the CED), together with a confidence interval. Such "black box" models should not add anything to the data; they should only follow them.
Therefore, a reliable estimate of the CED depends on the quality of the data rather than on the model chosen. When the data show gaps, the model is free to go its own way, which may be the wrong one, as illustrated by Figures 11 and 12.
When such indications of insufficient information in the data are absent, visual inspection of the data together with the fitted model should give some confidence that the model accurately describes the data, without leaving open the possibility that the true dose-response relationship could be substantially different.
To implement the proposed methodology in routine toxicity testing, it is important that easy-to-use software be available. The software (PROAST) currently used at RIVM will be made publicly available as soon as the user interface has been completed.
APPENDIX
Nonlinear Regression
The models proposed here are, in statistical terminology, nonlinear (strictly speaking, only Models 3, 4, and 5 are). A nonlinear model can only be fitted using an iterative algorithm. Such algorithms are now readily available in many software packages and can usually be applied fairly easily. However, the user should be aware of some idiosyncrasies involved in such iterative procedures. Some of these are discussed briefly below.
An iterative algorithm tries to find "better" parameter values by trial and error. It can only start such a procedure when the parameters have values to start with. The user must provide these start values, and they should not be too far off. An (interactive) graphical interface is therefore almost indispensable.
The algorithm keeps varying the parameter values until it decides to stop. There are two possible reasons for the algorithm to stop the trial-and-error process:
1. The algorithm has converged, i.e., it has found a clear maximum in the log-likelihood function. In this case the associated parameter values can be considered the maximum likelihood estimates. However, it can happen that the log-likelihood function has not one but several (local) maxima. This means that one may get different results when running the algorithm again with different start values. This can be understood by picturing the likelihood function as a mountainous area with several peaks. Since the algorithm can only "feel" the slope locally, it usually finds the peak closest to the starting point. If there are other (higher) peaks, the algorithm may not find them.
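The dependence on start values can be illustrated with a minimal sketch. The log-likelihood below is a hypothetical one-parameter function (not one of the models proposed here), chosen only because it has two peaks; a naive hill-climbing search, which can only "feel" the slope locally, ends up on whichever peak is closest to its starting point.

```python
def log_likelihood(theta):
    # Hypothetical bimodal log-likelihood: a local maximum near theta = -1
    # and a higher, global maximum near theta = 2.
    return -(theta + 1) ** 2 * (theta - 2) ** 2 + 0.5 * theta

def hill_climb(theta, step=0.01, max_iter=10000):
    # Move uphill in small steps until neither neighbour is better,
    # mimicking an iterative optimizer that only senses the local slope.
    for _ in range(max_iter):
        candidates = [theta - step, theta, theta + step]
        best = max(candidates, key=log_likelihood)
        if best == theta:
            break
        theta = best
    return theta

peak_a = hill_climb(-3.0)  # started near the lower peak: stops around -1
peak_b = hill_climb(3.0)   # started near the higher peak: stops around 2
print(peak_a, peak_b)
```

Both runs "converge", yet only the second finds the global maximum; in practice this is why rerunning the fit from several different start values (ideally chosen via a graphical interface) is advisable.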
2. The algorithm has not converged, i.e., it was not able to find a clear peak in the likelihood function, but it stops because the maximum number of iterations (trials) has been exceeded. This may occur when the information in the data is poor relative to the number of parameters to be estimated. For example, a dose-response model with five unknown parameters cannot be estimated from a four-dose-group study. As another example, the variation between the observations within dose groups may be large compared with the overall change in the dose-response relationship. Such situations are usually associated with high correlations between parameter estimates, i.e., a change in the value of one parameter can be compensated by a change in another, leaving the model prediction practically unchanged.
Because of the lognormality assumption, the model is fitted by maximizing the likelihood function based on the normal distribution (data and model on log-scale). This may be achieved by minimizing the sum of squares, or by maximizing the full likelihood function with the variance as a parameter to be estimated simultaneously. In our experience with S-Plus, the first approach (function nls) is faster, but the second (function nlminb) is more stable, i.e., with poor data sets the sum-of-squares approach is more likely to fail.
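The equivalence of the two approaches for the regression parameters can be sketched with a small example. The data below are made up, and the model is a simple two-parameter exponential, y = a·exp(bx); on log-scale this model is linear in ln a and b, so the sum-of-squares minimum (and hence the maximum of the normal likelihood on log-scale) can even be written in closed form, without any iterative algorithm.

```python
import math

# Hypothetical dose-response data (two animals per dose group):
dose = [0, 0, 1, 1, 3, 3, 10, 10]
resp = [10.2, 9.8, 9.1, 9.5, 8.0, 8.4, 5.9, 6.3]

# Data and model on log-scale: ln(y) = ln(a) + b * x + error.
z = [math.log(y) for y in resp]
n = len(dose)
xbar = sum(dose) / n
zbar = sum(z) / n

# Ordinary least squares on the log-responses = ML under lognormality.
b = sum((x - xbar) * (zi - zbar) for x, zi in zip(dose, z)) / \
    sum((x - xbar) ** 2 for x in dose)
ln_a = zbar - b * xbar
a = math.exp(ln_a)

# Residual variance on log-scale (its ML estimate, divisor n):
sigma2 = sum((zi - (ln_a + b * x)) ** 2 for zi, x in zip(z, dose)) / n
print(a, b, sigma2)
```

For the genuinely nonlinear members of the family no such closed form exists, and an iterative optimizer is needed; but the principle is the same: the fit is driven by the sum of squared deviations of the log-observations from the log-model, with the variance either profiled out (as above) or estimated simultaneously.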
NOTES
1 For correspondence via fax: 31 30 2744446. E-mail: wout.slob@rivm.nl
REFERENCES
Allen, B. C., Kavlock, R. J., Kimmel, C. A., and Faustman, E. M. (1994). Dose-response assessment for developmental toxicity. II. Comparison of generic benchmark dose estimates with no observed adverse effect levels. Fundam. Appl. Toxicol. 23, 487–495.
Appel, M. J., Bouman, H. G. M., Pieters, M. N., and Slob, W. (in press). Evaluation of the applicability of the benchmark approach to existing toxicological data. Framework: Chemical compounds in the working place. RIVM/TNO Report, Bilthoven, The Netherlands.
Baird, S. J. S., Cohen, J. T., Graham, J. D., Shlyakter, A. I., and Evans, J. S. (1996). Noncancer risk assessment: A probabilistic alternative to current practice. Hum. Ecol. Risk Assess. 2, 79–102.
Cox, D. R. (1958). Planning of Experiments. John Wiley & Sons, New York.
Crump, K. S. (1984). A new method for determining allowable daily intakes. Fundam. Appl. Toxicol. 4, 854–871.
Crump, K. S. (1995). Calculation of benchmark doses from continuous data. Risk Anal. 15, 79–89.
Faustman, E. M., Allen, B. C., Kavlock, R. J., and Kimmel, C. A. (1994). Dose-response assessment for developmental toxicity. I. Characterization of database and determination of no observed adverse effect levels. Fundam. Appl. Toxicol. 23, 478–486.
Gaylor, D. W., and Slikker, W., Jr. (1990). Risk assessment for neurotoxic effects. Neurotoxicology 11, 211–218.
Hattis, D., Banati, P., and Goble, R. (1999). Distributions of individual susceptibility among humans for toxic effects. How much protection does the traditional tenfold factor provide for what fraction of which kinds of chemicals and effects? Ann. N.Y. Acad. Sci. 895, 286–316.
Kramer, H. J., van den Ham, W. A., Slob, W., and Pieters, M. N. (1996). Conversion factors estimating indicative chronic no-observed-adverse-effect levels from short-term toxicity data. Regul. Toxicol. Pharmacol. 23, 249–255.
Murrel, J. A., Portier, C. J., and Morris, R. W. (1998). Characterizing dose-response I: Critical assessment of the benchmark dose concept. Risk Anal. 18, 13–26.
Piersma, A. H., Verhoef, A., te Biesebeek, J. D., Pieters, M. N., and Slob, W. (2000). Developmental toxicity of butyl benzyl phthalate in the rat using a multiple dose study design. Reprod. Toxicol. 14, 417–425.
Pieters, M. N., Kramer, H. J., and Slob, W. (1998). Evaluation of the uncertainty factor for subchronic-to-chronic extrapolation: Statistical analysis of toxicity data. Regul. Toxicol. Pharmacol. 27, 108–111.
Rulis, A. M., and Hattan, D. G. (1985). FDA's priority-based assessment of food additives. II. General toxicity parameters. Regul. Toxicol. Pharmacol. 5, 152–174.
Slob, W. (1987). Strategies in applying statistics in ecological research. Ph.D. Dissertation, Vrije Universiteit, Amsterdam.
Slob, W. (1999). Thresholds in toxicology and risk assessment. Int. J. Toxicol. 18, 259–268.
Slob, W., and Pieters, M. N. (1998). A probabilistic approach for deriving acceptable human intake limits and human health risks from toxicological studies: General framework. Risk Anal. 18, 787–798.
Swartout, J. C., Price, P. S., Dourson, M. L., Carlson-Lynch, H. L., and Keenan, R. E. (1998). A probabilistic framework for the reference dose. Risk Anal. 18, 271–282.
Vermeire, T., Stevenson, H., Pieters, M. N., Rennen, M., Slob, W., and Hakkert, B. C. (1999). Assessment factors for human health risk assessment: A discussion paper. Crit. Rev. Toxicol. 29, 439–490.
Weil, C. S., and McCollister, D. D. (1963). Relationship between short- and long-term feeding studies in designing an effective toxicity test. J. Agric. Food Chem. 11, 486–491.
Woutersen, R. A., Jonker, D., Stevenson, H., te Biesebeek, J. D., and Slob, W. (2001). The benchmark approach applied to a 28-day toxicity study with Rhodorsil Silane in rats. The impact of increasing the number of dose groups. Food Chem. Toxicol. 39, 697–707.