1 Institute for Medical Technology Assessment, Erasmus University, Rotterdam and 2 Division of Reproductive Endocrinology and Fertility, Institute for Endocrinology, Reproduction and Metabolism, Vrije Universiteit Medical Centre, Amsterdam, The Netherlands
3 To whom correspondence should be addressed at: Division of Reproductive Endocrinology and Fertility, Institute for Endocrinology, Reproduction and Metabolism, Vrije Universiteit Medical Centre, 1081 AV Amsterdam, The Netherlands. Email: j.mcdonnell{at}vumc.nl
Abstract |
---|
Key words: carryover effects/crossover design/infertility/maximum likelihood estimation/parallel design
Introduction |
---|
Since 1993, an extended and sometimes heated debate has been conducted in Fertility and Sterility on the place of the crossover design in infertility trials. Daya (1993) opened the debate by stating that, in his opinion, the crossover design has no place in infertility trials. His concerns about the crossover design include the fact that some women will become pregnant at the first attempt and will therefore not be exposed to the second treatment, leading to possibly misleading results and a loss of statistical efficiency. Khan et al. (1996) conducted a meta-analysis to examine the hypothesis that the estimates of treatment effect differ between parallel and crossover designs. After considering 34 overviews, they concluded that the crossover design may greatly overestimate the treatment effect. Other authors did not fully agree with this conclusion. Olive (1997) agreed that the crossover trial will overestimate the treatment effects but suggested that this may be due to inadequate statistical analysis. He also stated that a crossover design may be more acceptable to patients, leading to easier accrual and reduced drop out, a point echoed by Mol and Bossuyt (1997). Ananth and Rhoads (1997) criticized the approach used by Khan, citing the method of pooling and claiming that the statistical methods used were inappropriate. te Velde et al. (1998) also criticized the methods of analysis used by Khan and re-analysed the same data. They subsequently concluded that the observed differences were not statistically significant. Cohlen et al. (1998) conducted a series of simulations which indicated that the crossover design did slightly overestimate the treatment effects, but that this overestimation was insignificant in comparison with random variation. Finally, Norman and Daya (2000) constructed a simple but ingenious model which showed an underestimation of effectiveness in the parallel arm and an overestimation in the crossover arm.
In this study, we extend the simulation analysis of Cohlen using a maximum likelihood approach based on a parametric model. The model is similar in spirit to one derived from a study carried out in The Netherlands to compare the efficacy of intra-uterine insemination (IUI) and IVF. We use it to examine differences between the two designs and to examine the effect of censoring and carryover effects.
The question of interest is: do the design structure and/or the presence of carryover effects bias the statistical analysis, leading to possible under- or overestimation of treatment effects?
Materials and methods |
---|
We assumed that the carryover effects of a stimulated cycle (should they exist) will lead to a decreased chance of pregnancy in the following cycle. The existence of carryover effects (as postulated here) will have different consequences under the two designs. In a parallel study, IVF cycles following the first cycle will be subject to a decreased probability of success as compared with the stand alone per cycle probability. In contrast, in the crossover study, it is the probability of conception on IUI cycles (which, by definition, follow IVF cycles) which is decreased.
This model differs from the models of Cohlen and Norman in that it is a parametric model and estimates treatment effect by means of a regression model rather than simply counting the numbers of pregnancies achieved under the two designs. Moreover, it incorporates both carryover effects and censoring, both of which may affect the estimation of treatment efficacy.
Four baseline scenarios were examined, each involving a particular combination of censoring and carryover assumptions. The scenarios examined were: (i) no couples dropped out (no censoring) and the treatment had no effect on the following cycle (no carryover effect); (ii) no censoring, but there was a negative carryover effect; (iii) couples could be censored, but there was no carryover effect; and (iv) both censoring and carryover effects were present.
In 2000, we presented the results of a clinical trial carried out in The Netherlands which examined the efficacy and cost-effectiveness of IUI in a spontaneous cycle, IUI in a mildly stimulated cycle and IVF in a prospective, randomized study of 258 couples (181 with idiopathic subfertility and 77 with male subfertility) seeking treatment for infertility. After entry into the study, couples were randomized into one of the three treatment groups. Couples received a maximum of six treatment cycles. The design of the trial and patient characteristics are described elsewhere (Goverde et al., 2000). We analysed the results of this trial by directly examining the likelihood of the observed data (McDonnell et al., 2002). Our results indicated that there was no significant difference in the chance to conceive between the two IUI groups, and we therefore combined these groups. Subsequently, we developed a parametric model to examine the differences between two treatment groups (IUI and IVF). The approach used in this article is based on the methods and results of that trial. However, we extend this model to examine the differences between the parallel and crossover designs, including scenarios not encompassed by the trial.
In the clinical study, we estimated both the chance of achieving pregnancy and the chance of drop out using logistic functions of patient characteristics and treatment, and explicitly modelled the probability both of achieving pregnancy and of censoring. In the present study, we assumed that in the baseline scenarios, the (per cycle) probabilities of pregnancy and censoring were equal to those found in the clinical study, namely: (i) the probabilities of both pregnancy and drop out were logistic in form and depended on the clinical characteristics of the couple presenting for treatment; (ii) the (per cycle) probability of pregnancy was equal to that of patients undergoing that treatment in the clinical study; and (iii) the (per cycle) probability of drop out was equal to that of patients undergoing that treatment in the clinical study and did not depend on the treatment given in earlier cycles.
We also assumed that each couple was offered a maximum of six treatment cycles. Couples were censored if they left the trial before the maximum number of cycles was reached, unless pregnancy was achieved. In the no carryover scenarios, we assumed that treatment in one cycle had no effect on treatment in the following cycle. In the scenarios involving carryover effects, we assumed that in a cycle which followed a stimulated (IVF) cycle, the log-odds of pregnancy associated with that cycle was reduced by ln(1.15), where ln(.) is the natural log function.
More exactly, we assumed that
![]() |
![]() |
![]() |
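The carryover assumption can be illustrated with a minimal sketch. It shows how reducing the log-odds by ln(1.15), or by ln(1.30) as in the sensitivity analysis, changes a per cycle probability of pregnancy. The function name `apply_carryover` and the stand alone probability of 0.15 are assumptions made for illustration only; they are not taken from the fitted model.

```python
import math

def apply_carryover(p, odds_ratio=1.15):
    """Reduce the log-odds of pregnancy by ln(odds_ratio) and return the
    resulting per cycle probability (requires 0 < p < 1)."""
    logit = math.log(p / (1.0 - p)) - math.log(odds_ratio)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical stand alone probability, chosen only for illustration.
p_standalone = 0.15
p_after_ivf = apply_carryover(p_standalone)               # baseline carryover (1.15)
p_after_ivf_strong = apply_carryover(p_standalone, 1.30)  # stronger carryover (1.30)
print(round(p_after_ivf, 4), round(p_after_ivf_strong, 4))
```

Under these assumed values, the per cycle probability falls from 15% to roughly 13% with the baseline carryover and to roughly 12% with the stronger one.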
We also assumed that there was no period effect (Senn, 2002). This is an important assumption as carryover effects will be confounded with the interaction of treatment and the period effect. Under our assumptions, good prognosis patients are more likely to become pregnant and leave the study. This will be reflected in a decrease in pregnancy rates in later cycles (as is seen in practice) and is not a result of any carryover effect.
Sensitivity analysis
To examine the robustness of the results, the analyses were re-run under a number of other assumptions, namely: (i) the carryover effect was much stronger than that presumed in the baseline analyses; more precisely, the odds ratio associated with the stimulated cycle in models involving carryover was 1.30 instead of 1.15; (ii) no difference in treatment effect existed, i.e. the per cycle probability of pregnancy for IVF was equal to that of IUI; (iii) the difference in treatment effect was much stronger than in the baseline scenarios; more exactly, the coefficient associated with IVF was equal to twice the value used in the baseline scenarios; (iv) there was no difference in the probability of censoring, i.e. the per cycle probability of censoring following IVF was equal to that of IUI; (v) the carryover effect was positive in nature; and (vi) the probability that a couple would conceive was not in fact constant but was scaled by a random variable with a Beta(2,2) distribution.
Simulation and statistical analysis
Each cohort consisted of 100 couples, with each couple being simulated separately. The age of the female patient was randomly generated using the formula
![]() |
It is important to stress that in the statistical analyses, we pretend to be unaware of the possibility of the existence of a carryover effect which may lead to a bias in the results. We wish to observe that bias, if it exists. If we were aware of this possibility, we would adjust our analysis accordingly.
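The simulation of a cohort can be sketched as follows. This is a simplified illustration, not the procedure used in the study: patient covariates such as age are omitted, the per cycle probabilities in `p` and `c` are hypothetical constants, and drop out is assumed to occur only between cycles. The function names are also assumptions.

```python
import random

def reduce_odds(p, odds_ratio):
    """Divide the odds of pregnancy by odds_ratio (log-odds minus ln(odds_ratio))."""
    odds = p / (1.0 - p) / odds_ratio
    return odds / (1.0 + odds)

def simulate_couple(sequence, p, c, carryover_or, rng):
    """Follow one couple through the treatment sequence; return (outcome, cycles used)."""
    previous = None
    last = len(sequence) - 1
    for i, treatment in enumerate(sequence):
        prob = p[treatment]
        if previous == "IVF" and carryover_or != 1.0:
            prob = reduce_odds(prob, carryover_or)  # carryover after a stimulated cycle
        if rng.random() < prob:
            return "pregnant", i + 1
        # Assumption: drop out can only occur between cycles, not after the last one.
        if i < last and rng.random() < c[treatment]:
            return "censored", i + 1
        previous = treatment
    return "completed", len(sequence)

def simulate_cohort(sequence, p, c, carryover_or=1.0, n=100, seed=0):
    """Simulate n couples separately and count their outcomes."""
    rng = random.Random(seed)
    counts = {"pregnant": 0, "censored": 0, "completed": 0}
    for _ in range(n):
        outcome, _cycles = simulate_couple(sequence, p, c, carryover_or, rng)
        counts[outcome] += 1
    return counts

# Hypothetical per cycle probabilities of pregnancy (p) and drop out (c).
p = {"IUI": 0.10, "IVF": 0.15}
c = {"IUI": 0.05, "IVF": 0.10}
parallel_ivf = simulate_cohort(["IVF"] * 6, p, c, carryover_or=1.15)
crossover = simulate_cohort(["IVF", "IUI"] * 3, p, c, carryover_or=1.15)
print(parallel_ivf, crossover)
```

Passing the treatment sequence as an argument keeps the same code usable for the parallel arms, the alternating crossover arm, and other protocols.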
Results |
---|
Discussion |
---|
Based on our calculations, the estimated pregnancy rates and the statistical estimates of treatment effects obtained from a crossover trial (and the conclusions which follow from these estimates) are largely the same as those from the parallel design. The results of this study do not support the conclusion of Daya that the crossover design should be avoided as an inappropriate design.
Cohlen et al. (1998) and Norman and Daya (2000)
both used models to examine the role of trial design. Of the two models, that of Cohlen is closer to ours. Cohlen examined the progress of a cohort of couples, drawn from a heterogeneous population in which fecundity was assumed to follow a
distribution, under both designs. However, they did not explicitly examine the effect of study design on the estimation of treatment effect within a parametric model, which was our focus. Nor did they examine the effects of differences in treatment effect, censoring and carryover effect. Thus, our results are more general than theirs. However, our conclusions are essentially the same. We disagree somewhat with them when they state that 'crossover designs tend to overestimate the effect of the best treatment', although they mitigate this statement by stating that the overestimation is clinically irrelevant. In the Appendix, we indicate algebraically why the crossover design should produce more pregnancies, at least in the absence of censoring. A non-algebraic argument is the following: suppose treatment A is very effective (99% per cycle probability of pregnancy) while treatment B is very ineffective (1% chance of pregnancy), and there are just two treatment cycles with no censoring. In the parallel design, almost all women receiving treatment A become pregnant while few receiving treatment B do so. If the two groups are equal in size, about 50% of the women become pregnant. In the crossover design, almost all the women receiving treatment A in the first cycle become pregnant, while few women receiving treatment B do so. In the second cycle, those from the latter group will receive treatment A, with most becoming pregnant. Therefore, after just two cycles, almost all the women in the crossover design will be pregnant, compared with just 50% in the parallel design. Of course, these probabilities are extremely implausible but the conclusion remains unaltered if more realistic values are substituted. These conclusions apply even if it is not known which is the more effective treatment.
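The two cycle argument can be checked with a short calculation. The sketch below assumes two equally sized groups, no censoring and no carryover; the helper name `pregnant_after` is hypothetical, and the second pair of probabilities (15% and 10%) is an assumed example of more realistic values.

```python
def pregnant_after(sequence, p):
    """Probability of becoming pregnant at some point during the given
    sequence of treatments, with independent cycles and no censoring."""
    not_pregnant = 1.0
    for treatment in sequence:
        not_pregnant *= 1.0 - p[treatment]
    return 1.0 - not_pregnant

def compare_designs(p):
    """Return (parallel, crossover) proportions pregnant after two cycles."""
    parallel = 0.5 * pregnant_after("AA", p) + 0.5 * pregnant_after("BB", p)
    crossover = 0.5 * pregnant_after("AB", p) + 0.5 * pregnant_after("BA", p)
    return parallel, crossover

# Extreme values used in the text.
extreme = compare_designs({"A": 0.99, "B": 0.01})
# Assumed, more realistic values.
realistic = compare_designs({"A": 0.15, "B": 0.10})
print(extreme, realistic)
```

With the extreme values, about 51% of women become pregnant in the parallel design against about 99% in the crossover design. With the assumed realistic values the ordering is the same but the gap is a fraction of a percentage point.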
The model of Norman and Daya (2000) is rather simpler. They compare two groups, each consisting of two subgroups, one (constituting 80% of the group) with low fecundity (equal to 10% chance of pregnancy per cycle under a control treatment), while the other 20% have a much higher fecundity (40% chance per cycle under the same treatment). An experimental treatment is assumed to double the per cycle chance of pregnancy (20 and 80% chance per cycle, respectively). Both groups undergo treatment under both the parallel and crossover designs. Their conclusions, based on this model, are rather different from ours. They found that a constant sequence design (parallel) consistently underestimates the treatment effect in all but the first cycle, whereas an alternating sequence design (crossover) overestimates the treatment effect in even cycles, but correctly estimates the treatment effect in odd cycles. However, we take issue with their method of calculation. They assume that the experimental treatment has a relative risk (RR; in this context, the ratio of the probabilities of pregnancy) of 2 compared with the control treatment. The question is how this assumption should be interpreted. Our interpretation is that a couple undergoing the experimental treatment has twice the probability (per cycle) of achieving pregnancy of an identical couple undergoing the control treatment. Translating this to the aggregate level, a given group of couples would experience twice as many pregnancies (per cycle) under the experimental treatment as they would under the control treatment. The point of this statement is that the value of the RR is relevant only when the groups being compared are essentially identical. In Norman's model, this is the initial situation but this is not true for subsequent cycles in the parallel arm nor is it true in even cycles in the crossover arm.
For example, in the second cycle in the parallel arm, 14% of patients in the control group have high fecundity compared with just 6% in the experimental group. The calculated RR of 1.75 applies to a comparison of two groups who differ markedly in their average fecundity, and it is difficult to know how to interpret this value. In the crossover arm, the situation is reversed, with the experimental group containing 14% high fecundity patients compared with 6% in the control group. The comparison of the per cycle RR with the true RR is therefore not valid. This situation is not dissimilar to that of the well known Simpson's paradox: at the patient level, the treatment RR is still 2, but this is masked at the group level due to differences in group composition. We therefore disagree with the method of calculation used by Norman and argue that the concept of relative risk should not be applied on a per cycle basis.
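The change in group composition can be reproduced with a few lines of arithmetic. The sketch below uses only the subgroup sizes and per cycle probabilities quoted above and assumes no censoring; the function name `after_one_cycle` is hypothetical. It checks the composition figures only and does not attempt to reproduce the per cycle RR calculation of Norman and Daya.

```python
weights = {"low": 0.80, "high": 0.20}          # initial subgroup proportions
p_control = {"low": 0.10, "high": 0.40}        # per cycle probability, control
p_experimental = {"low": 0.20, "high": 0.80}   # per cycle probability, experimental

def after_one_cycle(weights, p):
    """Composition of the couples who are still not pregnant after one cycle."""
    remaining = {g: weights[g] * (1.0 - p[g]) for g in weights}
    total = sum(remaining.values())
    return {g: remaining[g] / total for g in remaining}

control_cycle2 = after_one_cycle(weights, p_control)
experimental_cycle2 = after_one_cycle(weights, p_experimental)

# Within each subgroup the RR is still 2, whatever the group composition.
rr_within = {g: p_experimental[g] / p_control[g] for g in weights}
print(control_cycle2["high"], experimental_cycle2["high"], rr_within)
```

The proportion of high fecundity couples entering the second cycle is about 14% after a control cycle and about 6% after an experimental cycle, while the RR within each subgroup remains 2.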
The existence and degree of both carryover and period effects in ART trials have not been investigated. The presence of both can lead to confounding between the carryover effect and a treatment by period interaction; indeed they are the same in the crossover trial of the form AB/BA (11). Whether this holds for ART trials is unclear. In pharmacological trials, treatment follows a timetable devised by physicians. In contrast, ART trials are shaped largely by patient decisions: patients may decide to delay a treatment cycle for any number of reasons and can choose the length of the delay. One such reason may well be the nature of the treatment itself. For example, IVF is a much more physically and psychologically demanding treatment than IUI. Data from the clinical study indicate that the time between successive IVF cycles is significantly longer than that between cycles involving IUI with or without ovarian stimulation (4 months as opposed to 1.5 months, unpublished observations). These differences may largely allow for washout, with a subsequent reduction in carryover effects.
In this model, we make fairly strong assumptions about the strength of the carryover effects. We assume a simple effect which persists only for the following cycle and which can (in principle) be observed. In practice, this may not be the case. The carryover effect may vary from cycle to cycle (e.g. due to the time difference between them) or may affect subsequent cycles (higher order carryover). Gauging the nature and extent of carryover (if it exists) is difficult and has consequences for the modelling of treatment effect. Ideally, a randomized trial comparing the crossover and parallel designs and investigating any carryover effect could be carried out but, for reasons both practical and ethical, such a trial will probably never be carried out.
Relatively few crossover trials have been carried out in the ART field. Few crossover trials with a binary outcome have been carried out in other areas. Taylor and Dominik (1999) report data from a study on condom failure using a crossover design. In this study, the outcome measure was condom failure. Unlike ART trials, participating couples did not leave the study.
In this article, we investigated possible bias introduced as a result of either study design or carryover effect. Study design had little effect on the estimation of treatment effect. On the other hand, carryover effects may well introduce bias in the estimation of treatment effects. However, is the bias associated with carryover effects actually a bias? The answer is yes and no. If the treatment given in one cycle affects the outcome of a given treatment in the following cycle, we are indeed observing some form of distortion in the estimate of the effect of that treatment in comparison with its stand alone form. However, we must realize that, should carryover effects exist, the estimates of treatment effect gathered relate to that type of treatment given the total treatment regime. The estimation of a particular treatment within an overall treatment may not give a true picture of its actual efficacy.
Opponents of the use of the crossover design in ART trials often argue that couples achieving pregnancy should be considered as having been censored. This strikes us as being bizarre since the main outcome measure of an ART trial is pregnancy! Indeed, it is interesting to speculate what they consider is the main outcome measure if it is not pregnancy. This apparent contradiction suggests that the consideration of a standard crossover trial (whatever that may be) is inappropriate. However, this does not imply that the design itself is inappropriate, just that another form of analysis is required.
In the Amsterdam study, censoring following IVF was greater than that following IUI. As several authors have pointed out, a crossover design may be more acceptable to couples, with possibly reduced censoring following IVF. We believe that a crossover design will lead to essentially the same conclusions as a parallel design, does not over- or underestimate treatment effects, and may be more attractive to couples, possibly leading to fewer drop outs and more pregnancies. We recommend consideration of the crossover design in future studies.
Appendix |
---|
We define $p_{IUI}$ = P[achieving pregnancy during an IUI cycle] and the corresponding censoring probability P[becoming censored following an IUI cycle], with similar definitions for the IVF probabilities. We assume these probabilities are constant over cycles.
Since the process is essentially discrete, each couple makes a contribution to the likelihood equal to the probability of their observed progress.
To see how the likelihood function is defined, first consider couples undergoing IUI treatment in a parallel design. There are three possible treatment progress scenarios.
All three possibilities are included in the following expression, as can be seen by substituting the appropriate values of (nIUI, IUI and
IUI):
![]() |
Similarly, couples undergoing IVF treatment in a parallel design have a likelihood contribution of the form
![]() |
In the crossover design, the likelihood contribution is constructed similarly. For each couple, there is an IUI and an IVF component. For example, consider a couple who achieve pregnancy as a result of an IVF cycle. Such a couple have undergone $n_{IUI}$ ($\geq 0$) IUI attempts without success and without censoring, and $n_{IVF}$ ($\geq 1$) IVF attempts, $n_{IVF} - 1$ without success and without censoring, followed by a successful IVF cycle. Their contribution is, therefore
![]() |
The contribution of other couples is constructed similarly.
For each couple, irrespective of the trial design, the likelihood contribution can be written as
![]() |
The likelihood itself is the product of the individual contributions from each couple. This (log-)likelihood is subsequently maximized to achieve the parameter estimates.
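The construction of a couple's likelihood contribution can be sketched in code. This is a simplified illustration with constant per cycle probabilities rather than the logistic model used in the study. The names `log_contribution`, `p` and `c`, the numerical values, and the assumption that no drop out is recorded after the final cycle are all ours.

```python
import math

def log_contribution(cycles, outcome, p, c):
    """Log-likelihood contribution of one couple.

    cycles  : treatments actually undergone, in order, e.g. ["IUI", "IVF"]
    outcome : "pregnant", "censored" or "completed" (status after the last cycle)
    p, c    : per cycle probabilities of pregnancy and of censoring, by treatment
    """
    ll = 0.0
    # Every cycle before the last ended without success and without censoring.
    for treatment in cycles[:-1]:
        ll += math.log(1.0 - p[treatment]) + math.log(1.0 - c[treatment])
    last = cycles[-1]
    if outcome == "pregnant":
        ll += math.log(p[last])
    elif outcome == "censored":
        ll += math.log(1.0 - p[last]) + math.log(c[last])
    elif outcome == "completed":
        # Assumption: no censoring term after the final scheduled cycle.
        ll += math.log(1.0 - p[last])
    else:
        raise ValueError("unknown outcome: " + outcome)
    return ll

def log_likelihood(couples, p, c):
    """Sum of the individual log contributions (log of the product)."""
    return sum(log_contribution(cycles, outcome, p, c) for cycles, outcome in couples)

# Hypothetical values, for illustration only.
p = {"IUI": 0.10, "IVF": 0.15}
c = {"IUI": 0.05, "IVF": 0.10}
one_couple = math.exp(log_contribution(["IUI", "IVF"], "pregnant", p, c))
print(one_couple)
```

In practice the log-likelihood would be passed to a numerical optimizer, with `p` and `c` expressed as functions of the model parameters.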
This likelihood function can also be used in situations not described above. For example, the protocol for the crossover arm might stipulate that the initial treatment is given for three cycles before a switch is made.
Estimating the difference in proportion of couples achieving pregnancy under the parallel and crossover designs
We can estimate the extra number of pregnancies due to the use of a crossover design, at least in homogeneous (with respect to fecundity) populations in which no drop out occurs. We define $q_{IUI} = 1 - p_{IUI}$ = P[no pregnancy in an IUI cycle] and $q_{IVF} = 1 - p_{IVF}$ = P[no pregnancy in an IVF cycle].
In a parallel trial, the expected proportion of couples in the IUI arm not pregnant at the end of the trial is $(1 - p_{IUI})^6 = q_{IUI}^6$, while in the IVF arm the expected proportion is $(1 - p_{IVF})^6 = q_{IVF}^6$. The proportion for the whole trial (assuming equal sample sizes in both arms) is therefore $(q_{IUI}^6 + q_{IVF}^6)/2$. In the crossover trial, the proportion of couples not pregnant at the end of the trial is $(1 - p_{IUI})^3 (1 - p_{IVF})^3 = q_{IUI}^3 q_{IVF}^3$. The difference $\Delta$ between the trials in the proportion of couples not pregnant is
$$\Delta = \frac{q_{IUI}^6 + q_{IVF}^6}{2} - q_{IUI}^3 q_{IVF}^3$$
Denoting $q_{IUI}^3$ by $A$ and $q_{IVF}^3$ by $B$,
$$\Delta = \frac{A^2 + B^2}{2} - AB = \tfrac{1}{2}(A - B)^2 = \tfrac{1}{2}\left(q_{IUI}^3 - q_{IVF}^3\right)^2$$
$\Delta \geq 0$ since it is proportional to a perfect square and equals 0 if and only if $q_{IUI} = q_{IVF}$, i.e. the two treatments are equally effective. Therefore, the overall proportion of couples not pregnant in the parallel trial is (in general) greater than the proportion in the crossover trial, irrespective of the values of $q_{IUI}$ and $q_{IVF}$ (and hence $p_{IUI}$ and $p_{IVF}$). Equivalently, the proportion of couples pregnant in the crossover trial exceeds the proportion of couples pregnant in the parallel trial. Whether this holds in practice depends on the difference in censoring between the different types of trial. However, we can estimate the difference in the proportions falling pregnant under the two designs for various values of $p_{IUI}$ and $p_{IVF}$ (see Table VI). If $p_{IUI} = 0.10$ and $p_{IVF} = 0.15$, the crossover trial has just 0.66% pregnant couples more than the parallel trial, i.e. <1 couple, if both arms contain 100 couples.
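The figure of 0.66% can be verified numerically. The sketch below evaluates the difference in the proportion not pregnant both directly and via the perfect square form, assuming a homogeneous population, six cycles and no drop out; the function name is hypothetical.

```python
def extra_pregnant_proportion(p_iui, p_ivf, cycles=6):
    """Proportion not pregnant in the parallel trial minus the proportion
    not pregnant in the crossover trial (half the cycles on each treatment)."""
    half = cycles // 2
    q_iui, q_ivf = 1.0 - p_iui, 1.0 - p_ivf
    parallel_fail = (q_iui ** cycles + q_ivf ** cycles) / 2.0
    crossover_fail = (q_iui ** half) * (q_ivf ** half)
    return parallel_fail - crossover_fail

delta = extra_pregnant_proportion(0.10, 0.15)
# The same quantity written as half of a perfect square.
square_form = 0.5 * ((1.0 - 0.10) ** 3 - (1.0 - 0.15) ** 3) ** 2
print(round(100 * delta, 2))  # difference in percentage points
```

Both expressions give a difference of about 0.0066, i.e. 0.66 percentage points, and the difference is zero when the two per cycle probabilities are equal.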
If the groups are heterogeneous with respect to fecundity, the situation is rather more complex. Assume fecundity (under IUI) can be described by a variable $s$. For an IUI patient, the probability of failing to become pregnant after six rounds is
$$q_{IUI}^6(s)$$
Similarly, for an IVF patient with fecundity $t$ (under IVF), the probability of failure after six cycles of IVF is
$$q_{IVF}^6(t)$$
In a parallel trial, the expected proportion of couples failing to achieve pregnancy in the IUI group is
![]() |
![]() |
The total proportion is therefore
![]() |
In the crossover trial, the proportion failing after six cycles (three IUI and three IVF and assuming no carryover effects) is
![]() |
Denote $q_{IUI}^3(s)$ by $A(s)$ and $q_{IVF}^3(t)$ by $B(t)$, then
![]() |
Again, the difference is non-negative since $[A(s) - B(t)]^2$ is non-negative, and the crossover design will (assuming no drop out) result in more pregnancies than the parallel design.
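The heterogeneous case can be illustrated numerically with a discrete population. The sketch below assumes two subgroups of couples, each with its own per cycle probability under IUI and under IVF. The subgroup sizes and probabilities are hypothetical values chosen for illustration, and the same population is assumed to enter every arm.

```python
# Hypothetical subgroups: (weight, per cycle p under IUI, per cycle p under IVF).
subgroups = [
    (0.80, 0.10, 0.20),
    (0.20, 0.40, 0.80),
]

def A(p_iui):
    """Probability of no pregnancy in three IUI cycles."""
    return (1.0 - p_iui) ** 3

def B(p_ivf):
    """Probability of no pregnancy in three IVF cycles."""
    return (1.0 - p_ivf) ** 3

# Parallel trial: six cycles of one treatment, arms of equal size.
parallel_fail = 0.5 * sum(w * (A(s) ** 2 + B(t) ** 2) for w, s, t in subgroups)
# Crossover trial: three cycles of each treatment, no carryover.
crossover_fail = sum(w * A(s) * B(t) for w, s, t in subgroups)
difference = parallel_fail - crossover_fail
# The same difference as half the mean squared difference of A and B.
half_mean_square = 0.5 * sum(w * (A(s) - B(t)) ** 2 for w, s, t in subgroups)
print(parallel_fail, crossover_fail, difference)
```

For these assumed values the two expressions for the difference agree and are positive, so the crossover design leaves fewer couples without a pregnancy.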
References |
---|
Cohlen BJ, te Velde ER, Looman CWN, Eijckemans R and Habbema JDF (1998) Crossover or parallel design in infertility trials? The discussion continues. Fertil Steril 70, 40-45.
Daya S (1993) Is there a place for the crossover design in infertility trials? Fertil Steril 59, 6-7.
Goverde AJ, McDonnell J, Vermeiden JP, Schats R, Rutten FF and Schoemaker J (2000) Intrauterine insemination or in-vitro fertilisation in idiopathic subfertility and male subfertility: a randomised trial and cost-effectiveness analysis. Lancet 355, 13-18.
Khan KS, Daya S, Collins JA and Walter SD (1996) Empirical evidence of bias in infertility research: overestimation of treatment effect in crossover trials using pregnancy as the outcome measure. Fertil Steril 65, 939-945.
McDonnell J, Goverde AJ, Vermeiden JP and Rutten FF (2002) Multivariate Markov chain analysis of the probability of pregnancy in infertile couples undergoing assisted reproduction. Hum Reprod 17, 103-106.
Mol BWJ and Bossuyt PMM (1997) Future letters - Workshops on Internet [letter]. Fertil Steril 67, 179.
Norman GR and Daya S (2000) The alternating-sequence design (or multiple-period crossover) trial for evaluating treatment efficacy in infertility. Fertil Steril 74, 319-324.
Olive DL (1997) Future letters - Workshops on Internet [letter]. Fertil Steril 67, 178-179.
Senn S (2002) Cross-over Trials in Clinical Research. 2nd edn. Wiley, New York.
Taylor JT and Dominik RC (1999) Noninferiority testing in crossover trials with correlated binary outcomes and small event proportions with applications to the analysis of condom failure data. J Biopharm Stat 9, 365-377.
te Velde ER, Cohlen BJ, Looman CWN and Habbema JDF (1998) Crossover designs versus parallel studies in infertility research [letter]. Fertil Steril 69, 357-358.
Submitted on December 17, 2002; resubmitted on April 16, 2004; accepted on July 23, 2004.