1 Department of Environmental Health, 2 Department of Epidemiology, Harvard School of Public Health, 3 Dana-Farber Cancer Institute, Harvard Medical School, and Department of Biostatistics, Harvard School of Public Health, 4 Massachusetts General Hospital, Harvard Medical School, Boston, 5 Institute of Toxicology and Environmental Health, University of California, Davis, CA, 6 Columbia University College of Physicians and Surgeons, New York, NY, 7 Epidemiology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC and 8 Channing Laboratory, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
Abstract |
---|
Key words: early pregnancy loss/human chorionic gonadotrophin/immunoradiometric assay/reliability
Introduction |
---|
The landmark study by Wilcox et al. of 221 healthy women who were trying to conceive found that the incidence of pregnancy loss was 32% of all conceptions, and that two-thirds of these were losses that would have been unrecognized without the use of HCG as a biomarker (Wilcox et al., 1988). That study used data from 28 women who had been sterilized by bilateral tubal ligation to obtain `baseline' HCG values. A pregnancy was defined as 3 consecutive days of HCG values >0.025 ng/ml as detected by the B101-R525 assay, which identified intact HCG as well as some of the free beta subunit.
After that study, the assay method underwent continuing development, partly because the rabbit polyclonal antibody used by Wilcox et al. had been depleted, and also because the amount of urine required by the assay was too large for application to full-scale epidemiologic studies (O'Connor et al., 1994). Subsequently, an immunoradiometric assay that uses a combination of capture antibodies for a beta subunit epitope (B204) and the intact heterodimer (B109) (the so-called `combo assay') has been widely applied to epidemiologic studies (Lasley and Shideler, 1994; Hakim et al., 1995; Ellish et al., 1996; Zinaman et al., 1996).
Separating early pregnancy loss (EPL) cycles from non-conception cycles requires not only a sensitive test but also criteria to distinguish noise from the true signal; the new assay method therefore requires that criteria be developed for defining an EPL. Since there is no `gold standard' for an EPL, expert judgement becomes an important component in developing any algorithm. As a further challenge, false positives are probably more of a threat than false negatives, simply because there are far more non-conception cycles than EPLs.
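The base-rate argument can be illustrated with a short calculation. The sketch below applies Bayes' rule to obtain the share of flagged cycles that are true EPLs. The prevalence, sensitivity and specificity values are hypothetical, chosen only to show the direction of the effect; they are not estimates from this study.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of cycles flagged as EPL that are true EPLs (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


# Hypothetical scenario: 10% of observed cycles are true EPLs.
# Option A favours sensitivity, option B favours specificity.
ppv_sensitive = positive_predictive_value(0.10, sensitivity=0.95, specificity=0.90)
ppv_specific = positive_predictive_value(0.10, sensitivity=0.90, specificity=0.95)

print(f"sensitivity-first PPV: {ppv_sensitive:.3f}")
print(f"specificity-first PPV: {ppv_specific:.3f}")
```

Under these assumed values, trading a little sensitivity for specificity raises the proportion of flagged cycles that are real losses, because the non-conception cycles that generate false positives are the large majority.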
To date, each study has used a different algorithm and, therefore, different definitions of pregnancy and subsequent pregnancy loss. For example, Lasley et al. considered a 2 day HCG rise >0.15 ng/ml within 3 consecutive days as an indicator of conception (Lasley et al., 1995). Hakim et al. used 2 consecutive days >0.25 ng/ml (Hakim et al., 1995) and Zinaman et al. 3 consecutive days ≥0.15 ng/ml as their cut-off (Zinaman et al., 1996). Because of the lack of a true gold standard, it is difficult to assess the performance of these criteria. Ellish et al. showed that the frequency of early pregnancy loss ranged from 11.0–26.9% depending on the definition used (Ellish et al., 1996). Furthermore, in addition to differences in defining a meaningful HCG rise and therefore an early pregnancy, determining a pregnancy loss often requires the investigator's consideration of the timing of the HCG rise within a cycle, the variability of baseline HCG measurements across cycles, and missing data.
As a first step in developing an EPL algorithm, five experts were invited to interpret assay results subjectively and these interpretations were compared with a crude preliminary algorithm. In this study, we examine the reliability of HCG data interpretation by comparing the assessments of the five experienced researchers who independently reviewed identical graphs containing daily plots of HCG values (Figure 1) to determine whether each menstrual cycle represented `no conception', a `continuing conception' or a `conception lost'.
Materials and methods |
---|
Women were excluded if they: (i) were already pregnant; (ii) had tried unsuccessfully to conceive for 1 year; (iii) were current or former smokers (defined as ever smoking at least one cigarette per day for 6 months or more); or (iv) planned to quit, change jobs, or move out of the city during the 1 year follow-up period. From 1996–1998, a total of 190 women were enrolled in the study. The participation rate among eligible women was ~90%.
The participants kept menstrual diaries and collected their first morning urine sample every day from enrolment to the end of follow-up, which ended when a pregnancy was clinically recognized or after 1 year, whichever occurred first. In total, daily urine specimens were collected and analysed for 763 menstrual cycles from the 190 women.
Women were given a health evaluation, educated about prenatal care and given a remuneration for the inconvenience of providing daily urine samples. The Human Subjects Committees at the Harvard School of Public Health and the China Medical Institutes approved all study procedures and informed consent was obtained from each participant.
Laboratory analysis
Urine samples were analysed for HCG by the immunoradiometric assay (IRMA) developed by O'Connor et al. using a combination of anti-fragment B204 and anti-intact B109 clones (O'Connor et al., 1988). The detailed characteristics and behaviour of this assay have been previously described (Ellish et al., 1996; O'Connor et al., 1998). A singleton measurement was performed on all samples. Urine creatinine levels were measured according to the method of Jaffe (Husdan and Rapoport, 1968). All HCG values were normalized to creatinine values to adjust for urine concentration.
Selection of cycles and urine samples
For quality control purposes, we selected a subset of the 763 cycles for duplicate HCG measurements. To increase efficiency, the selection was stratified so that cycles with consistently low or high HCG values in the singleton analysis were less frequently sampled. The selection proceeded as follows: (i) 47 cycles with consistently low HCG values (defined as no 2 consecutive days with HCG levels >0.6 ng/ml) were randomly sampled out of 501 such cycles (~10%); (ii) 18 cycles with high values were randomly selected out of 129 in this category (~15%). High values were defined as HCG levels >1.2 ng/ml for the last 4 days of the menstrual cycle, if there was a subsequent bleeding episode. For the cycles where high HCG values continued and no bleeding episode was observed, HCG was measured for the window −7 to +8 days around the start of the HCG rise, counting from the first of the 3 days with HCG levels >0.6 ng/ml; (iii) all of the remaining 133 cycles were included. Among the 198 cycles in the subset, duplicate assay results were not available for 45 cycles, either because of missing urine samples for >10 days of the cycle or insufficient quantity of urine. These were excluded, leaving 153 cycles from 78 subjects. For these selected cycles, HCG was measured in duplicate for the window of days around day 1 of bleeding (−10 to +5); 49 cycles had no missing values within the window, 71 had 1–3, and 33 had 4–9 values missing. The duplicate assays were measured 4 months after the original singleton assay. The final analysis included 1950 urine samples from the 153 cycles.
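The stratified selection and the resulting counts can be retraced with a minimal sketch. The stratum sizes, sample sizes and the 45 excluded cycles come from the text; the cycle identifiers, the function name `select_cycles` and the random seed are assumptions introduced for illustration, and the actual sampling procedure may have differed.

```python
import random

# (stratum size, number sampled) as reported in the text.
STRATA = {"low": (501, 47), "high": (129, 18), "remaining": (133, 133)}


def select_cycles(cycle_ids_by_stratum, seed=0):
    """Randomly sample the stated number of cycles from each stratum."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    selected = []
    for name, (size, n_sample) in STRATA.items():
        ids = cycle_ids_by_stratum[name]
        if len(ids) != size:
            raise ValueError(f"unexpected size for stratum {name!r}")
        selected.extend(rng.sample(ids, n_sample))
    return selected


# Hypothetical cycle identifiers, one list per stratum.
cycle_ids = {name: [f"{name}-{i}" for i in range(size)]
             for name, (size, _) in STRATA.items()}

selected = select_cycles(cycle_ids)
total_cycles = sum(size for size, _ in STRATA.values())
analysed = len(selected) - 45  # 45 cycles lacked duplicate assay results

print(total_cycles, len(selected), analysed)
```

The three strata add up to the 763 cycles collected, the selected subset to 198 cycles, and the analysed set to 153 cycles, in agreement with the counts given above.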
As a comparison, non-conceptive levels of HCG were determined from urine samples contributed by 46 women aged 20–34 years who had had a recent bilateral tubal ligation and who had no known fertility problems or chronic illnesses, and at least one successful pregnancy. These women were generally similar in their characteristics to the women trying to conceive. From these women, 2496 daily urine samples were collected over two menstrual cycles. Of those, 696 (27.9%) had detectable HCG levels; the lowest value observed was 0.0005 ng/ml.
Classification of cycles by algorithm
To compare with the expert assessment, we classified the cycles according to an algorithm adapted from Wilcox et al. (Wilcox et al., 1988). They found that out of 28 sterilized women, one had 2 consecutive days >0.025 ng/ml HCG, while no woman had 3 consecutive days above that level. We applied a similar logic and obtained 0.6 ng/ml for the maximum of all consecutive 2 day minima observed in our study. Comparison is made to 3 and 4 day minima (Table I). We defined an `HCG rise' as 3 consecutive days with HCG levels above this baseline, with each day's level calculated as the geometric mean of the three measurements (singleton and duplicates). Among the 55 cycles with an HCG rise, 15 were classified as a `continuing conception' because the elevated HCG values were sustained until a clinical pregnancy was confirmed. For the rest of the cycles, those with an HCG rise within the −10 to +5 day window around a bleeding episode were classified as a `conception lost'. Cycles with no HCG rise within the window around the bleeding episode were classified as `no conception'. In this algorithm, cycles with missing values may be classified as `no conception' because of insufficient information to define an HCG rise.
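The classification rule described above can be sketched as follows. This is a simplified reading of the text, not the authors' implementation: the function names are hypothetical, missing days are represented as `None`, the caller is assumed to pass only the days inside the relevant window, and a single flag stands in for the clinical follow-up that identified continuing conceptions.

```python
from math import exp, log

THRESHOLD_NG_ML = 0.6  # baseline cut-off reported in the text
RUN_LENGTH = 3         # consecutive days required for an "HCG rise"


def geometric_mean(values):
    """Geometric mean of positive replicate measurements for one day."""
    return exp(sum(log(v) for v in values) / len(values))


def max_consecutive_minimum(series, k):
    """Largest level that some k consecutive days all reach.

    Windows containing a missing day (None) are skipped. Returns None
    if no complete window exists.
    """
    best = None
    for i in range(len(series) - k + 1):
        window = series[i:i + k]
        if None in window:
            continue
        low = min(window)
        if best is None or low > best:
            best = low
    return best


def has_hcg_rise(daily, threshold=THRESHOLD_NG_ML, run=RUN_LENGTH):
    """True if `run` consecutive non-missing days all exceed `threshold`."""
    streak = 0
    for value in daily:
        streak = streak + 1 if value is not None and value > threshold else 0
        if streak >= run:
            return True
    return False


def classify_cycle(daily, clinical_pregnancy):
    """Classify one cycle from its daily HCG values (ng/ml)."""
    if not has_hcg_rise(daily):
        return "no conception"
    return "continuing conception" if clinical_pregnancy else "conception lost"


# Hypothetical replicate values (singleton plus two duplicates) per day.
replicates = [(0.1, 0.2, 0.1), (0.7, 0.9, 0.8), (1.0, 1.2, 0.9),
              (0.8, 0.7, 0.9), (0.2, 0.3, 0.2)]
daily = [geometric_mean(r) for r in replicates]
print(classify_cycle(daily, clinical_pregnancy=False))
print(classify_cycle([0.7, None, 0.8, 0.9], clinical_pregnancy=False))
```

In this sketch a missing day resets the run of elevated values, so the second example is labelled `no conception` even though every observed value exceeds the threshold. This mirrors the caveat in the text that cycles with missing values may be classified as `no conception' for lack of information. Applying `max_consecutive_minimum` with `k=2` to each sterilized woman's series and taking the largest result corresponds to the threshold derivation described above.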
Statistical analysis
In evaluating the judgements of the panel, we began by comparing the frequency of the three possible outcomes reported by the experts. Agreement between each pair of experts was then calculated. The number and percentage were calculated for the cycles that were classified into each of the three possible categories by the two experts being paired. The cycles that were rated as `undetermined' by either of the two experts were excluded. The assessments by the five experts were combined to form one criterion to classify all the cycles. The criterion was defined in three different ways: by the agreement of at least three, at least four, and all five experts. The outcomes were compared among the three definitions. The definition by at least three experts was compared with the definition by the algorithm developed from the sterilized women's samples.
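The pairwise comparison and the combined criterion can be sketched in a few lines. The rater labels, category strings and ratings below are hypothetical, and the helper names are introduced only for illustration; the sketch assumes that an `undetermined` rating is dropped from a pairwise comparison and does not count as a vote toward the combined criterion.

```python
from collections import Counter
from itertools import combinations

UNDETERMINED = "undetermined"


def pairwise_agreement(ratings_a, ratings_b):
    """Fraction of cycles on which two raters agree.

    Cycles rated undetermined by either rater are excluded.
    Returns None if no cycle remains.
    """
    pairs = [(a, b) for a, b in zip(ratings_a, ratings_b)
             if UNDETERMINED not in (a, b)]
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)


def consensus(cycle_ratings, min_votes):
    """Category chosen by at least `min_votes` raters, else None."""
    counts = Counter(r for r in cycle_ratings if r != UNDETERMINED)
    if not counts:
        return None
    category, votes = counts.most_common(1)[0]
    return category if votes >= min_votes else None


# Hypothetical ratings: one list per rater, one entry per cycle.
ratings = {
    "A": ["none", "lost", "lost", "continuing"],
    "B": ["none", "lost", "none", "continuing"],
    "C": ["none", UNDETERMINED, "lost", "continuing"],
    "D": ["none", "lost", "none", "continuing"],
    "E": ["none", "none", "lost", "continuing"],
}

agreement = {(a, b): pairwise_agreement(ratings[a], ratings[b])
             for a, b in combinations(ratings, 2)}
by_cycle = list(zip(*ratings.values()))
at_least_three = [consensus(c, 3) for c in by_cycle]
all_five = [consensus(c, 5) for c in by_cycle]
print(agreement[("A", "B")], at_least_three, all_five)
```

With five raters and a minimum of three votes, at most one category can meet the criterion, so the result is unambiguous. Cycles that fail a stricter criterion return `None`, which makes it easy to compare how many cycles remain classified under each of the three definitions.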
To identify the most frequent pattern of disagreement, the cycles were classified into all possible categories involving different outcomes. The number and percentage of the cycles were compared for the different patterns of disagreement.
The overall summary measure of agreement among the experts' assessments was obtained by calculating a multi-rater kappa (Fleiss, 1981).
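For reference, a minimal sketch of a multi-rater kappa in the standard Fleiss formulation is given below. The text does not state how `undetermined' ratings entered the calculation, so the sketch simply assumes that every cycle has the same number of ratings; the function name and the example table are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned subject i to
    category j. Every subject must have the same number of ratings.
    Undefined (division by zero) if all ratings fall in one category.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number of ratings")
    n_categories = len(counts[0])

    # Overall proportion of ratings that fall in each category.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(n_categories)]
    # Observed agreement among rater pairs, per subject.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_subjects   # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical table: 4 cycles, 5 raters, 3 categories
# (no conception, conception lost, continuing conception).
example = [[5, 0, 0], [0, 5, 0], [3, 2, 0], [0, 1, 4]]
kappa = fleiss_kappa(example)
print(round(kappa, 4))
```

For this example the mean observed agreement is 0.75 and the chance agreement is 0.36, giving a kappa of 0.39/0.64, or about 0.61.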
Results |
---|
Discussion |
---|
Some of the variation is due to differences in the number of cycles the experts excluded as having `insufficient information'. The problem is not simply to define EPL, but to decide what is sufficient evidence for a decision. There was significant variability among the raters in deciding whether a specific cycle contained sufficient information to determine the outcome, suggesting that an explicit criterion defining insufficient information may be needed to standardize the procedure. There seemed to be several different factors implicated in the assessment. These include availability of replicate assays, quality of the laboratory assay as observed within the cycle, the characteristics of the cycle such as length and bleeding duration, and the amount and location of missing data within a cycle. Most of these factors are difficult to quantify with individual raters using their subjective judgement to reach a conclusion.
We did not have information on the date of ovulation for each menstrual cycle to precisely identify the luteal phase, in which the conceptus starts to produce HCG. Instead, we used the −10 to +5 day window around the bleeding episode to detect a `conception lost'. Since the luteal phase is less variable than the follicular phase (Harlow and Ephross, 1995), the window we used may be expected to have reasonable accuracy in capturing the relevant period. Nevertheless, precise determination of the luteal phase will reduce false positives in cycles with an unusually short luteal phase, whenever it is practical to obtain the day of ovulation. Identifying cycles in which no ovulation or no intercourse occurs will also reduce false positives by excluding cycles with zero probability of conception. However, measurement validity needs to be assessed in order to utilise additional information on ovulation and intercourse.
The threshold in this study was an HCG level of 0.6 ng/ml, significantly higher than that in previous studies. The early study by Wilcox et al. used an assay method that detected intact HCG and a portion of the free beta subunit (Wilcox et al., 1988). The `combo' assay used in our study also measured the beta core fragment; therefore, we would expect to show higher levels than the Wilcox study. The higher threshold in our study, compared with later studies using similar assays, may reflect that our subjects are newly married and younger than the women in previous studies or may result from differences between laboratories. Whilst we adhered to a strict quality assurance protocol and the baseline levels detected were consistent for each woman, differences between assays and laboratories may warrant investigation through future collaborative work.
Within-cycle variability of the HCG values, whether from technical or biological sources, may also be an issue. In some cycles, the baseline values were very low, yet a distinct pattern of HCG rise was readily identifiable, although the peak was lower than the minimum definition of HCG rise, based on samples from the sterilized women. For such cycles, raters may differ in their assessment depending on their relative emphasis on the pattern within the cycle and the cut-off derived from the sterilized women. Further investigation is needed to elucidate how the between-cycle or between-woman variability of HCG baseline affects the overall results in epidemiologic studies.
Classification of cycles by an algorithm derived from samples from the sterilized women performed reasonably well if combined with the ultimate knowledge of clinical pregnancy status from continued follow-up. However, the algorithm tended to produce false positives for EPLs, particularly for the cycles in which HCG levels were more variable or generally high. Such variability and higher levels may be biological or technical. Some cycles in our data set showed obvious differences in assay variability between the initial measurements and the additional measurements performed 4 months later. Appropriate standardization of laboratory quality control procedures to minimize technical variability will help reduce false positives. Further studies are needed to better characterize the biological variability in baseline HCG levels.
The factors that cause variation in HCG urinary excretion are imperfectly understood. This is especially a problem when interpreting the subtle patterns of HCG rise and fall produced by a faltering blastocyst and measured by less-than-perfect assays. We found that experts were much more likely to agree with each other than with an `objective' algorithm. When no gold standard is available, expert human judgement may offer a surrogate standard for refining an objective algorithm. In addition, expert human judgement may provide a tool for extracting information from biological patterns beyond that which any explicit algorithm could accomplish. Biological information at the margins of interpretability offers a special challenge. While departures from strict objectivity must be treated with caution, a judicious combination of explicit rules and expert opinion may come closer to the truth than either alone. Similar issues have been discussed in the field of image analysis, where the semi-automatic interactive method was more accurate than the automatic method (Flygare et al., 1997).
In summary, we arrive at the following conclusions. The algorithm for defining EPL is as important as the assay, although much less work has been done on algorithms. Lacking a gold standard, the input of expert opinion is a necessary step toward developing an objective algorithm. In choosing an algorithm, specificity should have priority over sensitivity in order to minimize false positives. Finally, similar assays and criteria for EPL are necessary to allow comparisons among studies.
Acknowledgements |
---|
Notes |
---|
10 To whom correspondence should be addressed at: Occupational Health Program, Department of Environmental Health, Harvard School of Public Health, 665 Huntington Avenue, Boston, MA 02115, USA. E-mail: xu{at}hsph.harvard.edu
References |
---|
Ellish, N.J., Saboda, K., O'Connor, J.O., Nasca, P.C., Stanek, E.J. and Boyle, C. (1996) A prospective study of early pregnancy loss. Hum. Reprod., 11, 406–412.
Fleiss, J.L. (1981) Statistical Methods for Rates and Proportions. John Wiley and Sons Inc., New York.
Flygare, L., Hosoki, H., Rohlin, M. and Petersson, A. (1997) Bone histomorphometry using interactive image analysis. A methodological study with application on the human temporomandibular joint. Eur. J. Oral Sci., 105, 67–73.
Hakim, R.B., Gray, R.H. and Zazur, H. (1995) Infertility and early pregnancy loss. Am. J. Obstet. Gynecol., 172, 1510–1517.
Harlow, S.D. and Ephross, S.A. (1995) Epidemiology of menstruation and its relevance to women's health. Epidemiol. Rev., 17, 265–286.
Husdan, H. and Rapoport, A. (1968) Estimation of creatinine by the Jaffe reaction. A comparison of three methods. Clin. Chem., 14, 222–238.
Lasley, B.L. and Shideler, S.E. (1994) Methods for evaluating reproductive health of women. Occup. Med.: State Art Rev., 9, 423–433.
Lasley, B.L., Lohstroh, P., Kuo, A., Gold, E.B., Eskenazi, B., Samuels, S.J. and Overstreet, J.W. (1995) Laboratory methods for evaluating early pregnancy loss in an industry-based population. Am. J. Ind. Med., 28, 771–781.
O'Connor, J.F., Schlatterer, J.P., Birken, S., Krichevsky, A., Armstrong, E.G., McMahon, D. and Canfield, R.E. (1988) Development of highly sensitive immunoassays to measure human chorionic gonadotropin, its beta subunit and beta core fragment in the urine: application to malignancies. Can. Res., 48, 1361–1366.
O'Connor, J.F., Birken, S., Lustbader, J.W., Krichevsky, A., Chen, Y. and Canfield, R.E. (1994) Recent advances in the chemistry and immunochemistry of human chorionic gonadotropin: impact on clinical measurements. Endocr. Rev., 15, 650–682.
O'Connor, J.O., Ellish, N., Kakama, T., Schlatterer, J. and Kovalevskaya, G. (1998) Differential urinary gonadotrophin profiles in early pregnancy and early pregnancy loss. Prenatal Diag., 18, 1232–1240.
Ronnenberg, A.G., Goldman, M.B., Aitken, I.W. and Xu, X. (2000) Anemia and deficiencies of folate and vitamin B-6 are common and vary with season in Chinese women of childbearing age. J. Nutr., 130, 2703–2710.
Wilcox, A.J., Weinberg, C.R., Wehmann, R.E., Armstrong, E.G., Canfield, R.E. and Nisula, B.C. (1985) Measuring early pregnancy loss: laboratory and field methods. Fertil. Steril., 44, 366–374.
Wilcox, A.J., Weinberg, C.R., O'Connor, J.F., Baird, D.D., Schlatterer, J.P., Canfield, R.E., Armstrong, E.G. and Nisula, B.C. (1988) Incidence of early loss of pregnancy. N. Engl. J. Med., 319, 189–194.
Zinaman, M.J., O'Connor, J., Clegg, E.D., Selevan, S.G. and Brown, C.C. (1996) Estimates of human fertility and pregnancy loss. Fertil. Steril., 65, 503–509.
Submitted on May 21, 2001; accepted on November 12, 2001.