1 Early Pregnancy, Gynaecological Ultrasound and MAS Unit, St George's Hospital Medical School, London, UK 2 Department of Obstetrics and Gynaecology, University Hospital Gasthuisberg and 3 Department of Electrical Engineering (ESAT), KU Leuven, Belgium
4 To whom correspondence should be addressed at: Early Pregnancy, Gynaecological Ultrasound and MAS Unit, St George's Hospital Medical School, Cranmer Terrace, London UK. Email: gcondous{at}hotmail.com
Abstract
Key words: ectopic pregnancy/failing PUL/intrauterine pregnancy/logistic regression/pregnancy of unknown location (PUL)
Introduction
Ten percent of pregnancies of unknown location (PULs) are ectopic pregnancies (EPs) (Banerjee et al., 2001). EPs account for 80% of early pregnancy deaths (Why Mothers Die, 1997), so predicting whether a PUL is an EP is clinically important and remains a great challenge. Currently, the established hormonal criteria for the diagnosis of EP are derived from pregnancies associated with pain and abnormal bleeding and not from asymptomatic women, who will have a much lower pre-test probability of EP.
Previous studies have looked at the use of single-variable hormonal models for the prediction of PUL outcome. A serum progesterone of <20 nmol/l predicts failing PUL with a positive predictive value (PPV) of >95% (Banerjee et al., 2001), and a serum human chorionic gonadotrophin (HCG) increase of >66% over 48 h predicts an intrauterine pregnancy (IUP) with a PPV of 96.5% (Condous et al., 2002). Unfortunately, the discriminatory zone and a suboptimally rising serum HCG predict EP with a PPV of only 18.2 and 43.5%, respectively (Condous et al., 2002). To date, there is no hormonal index to predict the outcome of persisting PUL.
In this study, we concentrated on developing baseline multi-categorical logit models that could enable the clinician to distinguish between PULs that are failing PULs, IUPs and EPs based on two blood samples taken 48 h apart.
The aim of this study was to generate and evaluate new logistic regression models based on demographic and hormonal parameters to predict the outcome of PUL. The results are compared with those obtained from established diagnostic criteria for the prediction of failing PUL, IUP and EP in women with PUL.
Materials and methods
All scans were reviewed and followed up by the same primary investigator (G.C.). Exclusion criteria were: (i) visualization of any evidence of an intrauterine sac; (ii) identification of an adnexal mass thought to be an EP; (iii) the presence of heterogeneous, irregular tissue within the uterus thought to be an incomplete miscarriage; and (iv) clinical instability or the presence of a haemoperitoneum on ultrasound scan.
Indications for sonography included lower abdominal pain, with or without vaginal bleeding, poor obstetric history or to determine gestational age.
The study group consisted of 388 consecutive women with a PUL. The data from the first 189 women classified as a PUL collected between June 2001 and February 2002 were taken as the training set. Statistical analysis and building of the logistic regression models were based on this data set. The data from the next 199 PULs recruited between March 2002 and December 2002 were taken as the test set in order to evaluate the performance of the models prospectively.
Data collected included serum hormone levels (serum HCG and progesterone taken at presentation and at 48 h), demographics (age and gestation) and ultrasound features (endometrial thickness, the character of its midline echo and the presence or absence of free fluid in the pouch of Douglas). The women were followed up until an outcome diagnosis was established: failing PUL, IUP or EP. In four women in the training set and three in the test set, serum HCG levels plateaued and no pregnancy was seen at any time. These women were classified as having a persisting PUL and were treated with methotrexate. They were excluded from model development and validation because the final outcome was unknown in this subgroup and because the numbers were so few.
If the initial serum progesterone level was <20 nmol/l, the women were classified as having a failing PUL (Banerjee et al., 2001). Spontaneous resolution of the pregnancy was defined as a decrease in the serum HCG level to <5 IU/l with the disappearance of symptoms. The location of these failing PULs remained unknown. Serum HCG levels were repeated within 7 days to confirm the diagnosis. If the serum HCG rise over the 48 h period was >66% (Condous et al., 2002), the women were classified as having an IUP and were rescanned 2 weeks later to confirm the diagnosis. Women who did not fall into either category were reviewed every 48 h until a diagnosis was made by sonography. The diagnosis of EP was based upon the positive visualization of an adnexal mass. Ultrasonographic diagnosis of an EP was based on the following grey-scale appearances: (i) an inhomogeneous or inconglomerate mass adjacent to the ovary and moving separately from it, which we have called the 'blob sign'; (ii) a mass with a hyperechoic ring around the gestational sac, referred to as the 'bagel sign'; or (iii) a gestational sac with a fetal pole, with or without cardiac activity. The diagnosis was confirmed subsequently at laparoscopy with histological confirmation of chorionic villi in the fallopian tube. If an EP was not visualized, but there was a high index of suspicion based on symptomatology, clinical findings and suboptimal rises of serial serum HCG levels, a laparoscopy was performed with or without an evacuation of the uterus.
Data analysis
The data were pre-processed prior to further analysis. Several variables were created by transformation of the original variables. In particular, the HCG ratio refers to the ratio between the two HCG levels, i.e. serum HCG at 48 h/serum HCG at 0 h. It has been reported that during early normal gestation, the HCG level doubles every 48 h (Kadar et al., 1981); intuitively, therefore, the HCG ratio should be more informative than a single HCG level alone. The second transformed variable is the progesterone average, i.e. the mean of the two progesterone levels taken 48 h apart ([serum progesterone at 0 h + serum progesterone at 48 h]/2). It is accepted that during gestation, progesterone levels rise slightly with time or reach a plateau instead of falling dramatically. Hence the progesterone level at 0 h should be close to that at 48 h. The progesterone average was also observed to be very widely dispersed, so it was transformed further by taking the logarithm.
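The two derived variables described above can be sketched in a few lines of Python. This is only an illustration: the function name, the example hormone values and the choice of the natural logarithm are assumptions, since the text does not state the base of the logarithm.

```python
import math


def derived_variables(hcg_0h, hcg_48h, prog_0h, prog_48h):
    """Return (HCG ratio, log of the progesterone average) for one woman.

    hcg_* are serum HCG levels (IU/l) and prog_* are serum progesterone
    levels (nmol/l) taken at 0 h and 48 h. All values must be positive.
    """
    if min(hcg_0h, hcg_48h, prog_0h, prog_48h) <= 0:
        raise ValueError("hormone levels must be positive")
    hcg_ratio = hcg_48h / hcg_0h              # serum HCG at 48 h / at 0 h
    prog_average = (prog_0h + prog_48h) / 2.0
    # Assumption: natural logarithm; the base is not given in the text.
    log_prog_average = math.log(prog_average)
    return hcg_ratio, log_prog_average


# Hypothetical example values, not taken from the study data.
ratio, log_prog = derived_variables(500.0, 1000.0, 30.0, 50.0)
print(ratio, log_prog)
```

A ratio of 2 corresponds to the doubling over 48 h described for early normal gestation; ratios below 1 indicate a falling HCG level.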
Statistical analyses were conducted with SAS (version 8.2 for Windows). Univariate and multivariate analyses were performed retrospectively on the basic data (training data) in order to highlight the most significant variables for model development. To compare the group means for the continuous variables, non-parametric Wilcoxon rank sum tests were used, since most of the continuous variables were not normally distributed. For categorical variables, Fisher's exact tests were used to check their association with the groups. A P-value <0.05 was considered to indicate statistical significance.
Model building
Baseline multi-categorical logit models (Agresti, 1996) were constructed to investigate the relationship between the selected variables and the outcome of the PULs. In such a model, each outcome category is paired with a baseline category, here IUP, resulting in two logit equations that describe the contrasts of the EP versus IUP group and the failing pregnancy versus IUP group. Variables were selected by a stepwise procedure with entry and stay significance levels of P < 0.05.
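To make the structure of such a model concrete, the sketch below turns the two logit equations (EP versus IUP and failing PUL versus IUP) into three class probabilities. The coefficients are purely hypothetical placeholders chosen for illustration; they are not the fitted values of M1, M2 or M3.

```python
import math


def baseline_logit_probabilities(x, coef_ep, coef_failing):
    """Class probabilities from a baseline-category logit model (baseline: IUP).

    x is a list of predictor values; each coef_* is (intercept, [slopes]).
    The two linear predictors are log-odds relative to the IUP category.
    """
    def linear(coef):
        intercept, slopes = coef
        return intercept + sum(b * v for b, v in zip(slopes, x))

    odds_ep = math.exp(linear(coef_ep))            # odds of EP versus IUP
    odds_failing = math.exp(linear(coef_failing))  # odds of failing versus IUP
    denominator = 1.0 + odds_ep + odds_failing
    return {
        "EP": odds_ep / denominator,
        "failing": odds_failing / denominator,
        "IUP": 1.0 / denominator,
    }


# Hypothetical coefficients for a one-variable model (HCG ratio only).
COEF_EP = (1.0, [-1.5])
COEF_FAILING = (5.0, [-5.0])

probs = baseline_logit_probabilities([2.0], COEF_EP, COEF_FAILING)
print(probs)
```

Because the IUP category has its linear predictor fixed at zero, the three probabilities always sum to 1.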
Performance measure and classification rules
Predictions can be made for the three models by using thresholds (cut-offs) on the output probability of the model. However, the setting of the threshold will influence the accuracy of the prediction. The choice of threshold might vary from institution to institution, and depends on the trade-off between the sensitivity and false-positive rate. In order to see the potential predictive power of the three multi-categorical logit models for each individual category, we first considered three binary classification problems, i.e. we used the predicted probability for a certain class to distinguish that class of PULs from the other PULs. Receiver operating characteristic (ROC) analysis can be performed on the three binary classifications independently of class distributions and error costs. The ROC curve for a binary classifier is constructed by plotting the sensitivity (true-positive rate) versus 1 − specificity (false-positive rate) for varying cut-off values. The area under the ROC curve (AUC) can be interpreted statistically as the probability that the test ranks a randomly chosen abnormal patient above a randomly chosen normal one. An area of 1 represents a perfect test; an area of 0.5 represents a worthless test. In this study, the AUC was obtained by a non-parametric method based on the Wilcoxon statistic, using the trapezoidal rule, to approximate the area and its associated standard error (Hanley and McNeil, 1982). This also allowed the comparison of two ROC curves (Hanley and McNeil, 1983).
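As an illustration of the non-parametric AUC, the following sketch computes the Wilcoxon-type estimate by comparing every positive case with every negative case, together with the approximate standard error formula given by Hanley and McNeil (1982). The function names and the example scores are assumptions made for this sketch; the study itself used SAS.

```python
import math


def auc_wilcoxon(scores_positive, scores_negative):
    """AUC as the proportion of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking, which is equivalent to the
    trapezoidal area under the empirical ROC curve.
    """
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


def auc_standard_error(auc, n_positive, n_negative):
    """Approximate standard error of the AUC (Hanley and McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_positive - 1) * (q1 - auc * auc)
        + (n_negative - 1) * (q2 - auc * auc)
    ) / (n_positive * n_negative)
    return math.sqrt(max(variance, 0.0))


# Hypothetical predicted probabilities of EP for EPs and for non-EPs.
ep_scores = [0.8, 0.3]
non_ep_scores = [0.5, 0.1]
auc = auc_wilcoxon(ep_scores, non_ep_scores)
se = auc_standard_error(auc, len(ep_scores), len(non_ep_scores))
print(auc, se)
```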
The performance of the models was also evaluated in terms of sensitivity, specificity, PPV and negative predictive value (NPV).
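For reference, these four measures follow directly from a 2 × 2 confusion table. The sketch below uses hypothetical counts that are not taken from the study; the function name is likewise an assumption.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from 2 x 2 confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. A measure with a zero denominator is returned as None.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical counts for a rare outcome: few true cases, many non-cases.
measures = diagnostic_measures(tp=8, fp=20, fn=2, tn=170)
print(measures)
```

With a rare outcome, the PPV can be low even when sensitivity and specificity are both high, because false positives from the large non-case group outnumber the true positives.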
In order to classify a case into one of the three categories, we needed to set up some diagnostic rules. The rules can be proposed as follows: if the predicted probability for a PUL to be an EP was greater than a threshold, then it was classified as an EP, otherwise it was classified as non-EP. For PULs which were classified as non-EP, if the predicted probability for a PUL to be failing was greater than a threshold, then it was classified as a failing pregnancy; otherwise it was classified as an IUP.
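The sequential rule described above can be written as a small function. The cut-off values used here are hypothetical placeholders, not the cut-offs derived in the study.

```python
def classify_two_stage(p_ep, p_failing, cutoff_ep, cutoff_failing):
    """Apply the two-step rule: EP versus non-EP first, then failing versus IUP."""
    if p_ep > cutoff_ep:
        return "EP"
    if p_failing > cutoff_failing:
        return "failing PUL"
    return "IUP"


# Hypothetical cut-offs, for illustration only.
CUTOFF_EP = 0.10
CUTOFF_FAILING = 0.50

print(classify_two_stage(0.30, 0.60, CUTOFF_EP, CUTOFF_FAILING))
print(classify_two_stage(0.05, 0.80, CUTOFF_EP, CUTOFF_FAILING))
print(classify_two_stage(0.05, 0.10, CUTOFF_EP, CUTOFF_FAILING))
```

Note that the EP check takes priority: a case whose EP probability exceeds the first cut-off is labelled EP even if its probability of being a failing PUL is higher.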
As can be seen from the rules, two probability cut-offs need to be decided. Here we find the best cut-offs by minimizing $\sqrt{(1 - \text{sensitivity})^2 + (1 - \text{specificity})^2}$, i.e. the distance from the point on the ROC curve to the ideal point with sensitivity and specificity both equal to 1, with the aim that the sensitivity and specificity are both as high as possible. The cut-off for EP versus non-EP was based on the predicted probability for a PUL to be an EP given the observation. The second cut-off, for distinguishing failing PULs from IUPs (among non-EPs), was sought using the predicted probability of a failing PUL to discriminate between failing and non-failing PULs.
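A minimal sketch of this cut-off search is given below. It assumes that the candidate cut-offs are the observed predicted probabilities (plus zero) and that a case is called positive when its probability is strictly greater than the cut-off; the text does not specify how candidates were enumerated, and the example data are hypothetical.

```python
import math


def best_cutoff(probabilities, is_positive):
    """Cut-off minimising sqrt((1 - sensitivity)^2 + (1 - specificity)^2).

    probabilities: predicted probability of the target class for each case.
    is_positive:   True where the case truly belongs to the target class.
    Returns (cutoff, distance). On ties, the lowest cut-off is kept.
    """
    n_pos = sum(1 for flag in is_positive if flag)
    n_neg = len(is_positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    best = None
    for cutoff in [0.0] + sorted(set(probabilities)):
        tp = sum(1 for p, y in zip(probabilities, is_positive) if y and p > cutoff)
        fp = sum(1 for p, y in zip(probabilities, is_positive) if not y and p > cutoff)
        sensitivity = tp / n_pos
        specificity = (n_neg - fp) / n_neg
        distance = math.sqrt((1 - sensitivity) ** 2 + (1 - specificity) ** 2)
        if best is None or distance < best[1]:
            best = (cutoff, distance)
    return best


# Hypothetical predicted probabilities and true labels.
example_probs = [0.9, 0.5, 0.6, 0.3, 0.2, 0.1]
example_truth = [True, True, False, False, False, False]
cutoff, distance = best_cutoff(example_probs, example_truth)
print(cutoff, distance)
```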
One can form the classification rules based on the weighted predicted probability, which incorporates the probability output of the model with the misclassification cost for different classes. Given an observation, the multi-categorical logistic regression model can provide the predicted posterior probability ($P$) for each class, including $P_{\text{ectopic}}$, $P_{\text{failing}}$ and $P_{\text{IUP}}$. We assume that the costs ($C$) for misclassifying a failing PUL, an IUP or an EP are equal to $C_{\text{failing}}$, $C_{\text{IUP}}$ and $C_{\text{ectopic}}$, respectively. For simplicity, here the misclassification costs are assumed to be the same for a certain category of PULs, no matter which class a PUL is wrongly assigned to. The weighted predicted probability for each class can be computed as $C_{\text{failing}}P_{\text{failing}}$, $C_{\text{IUP}}P_{\text{IUP}}$ and $C_{\text{ectopic}}P_{\text{ectopic}}$. The predicted class is then the one with the highest weighted predicted probability among the three. The optimal costs for misclassification were chosen according to the training performance.
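The cost-weighted rule amounts to multiplying each class probability by its misclassification cost and choosing the largest product. In the sketch below, the relative costs 1, 1 and 4 are those reported in the Results of this paper; the probabilities and the function name are hypothetical.

```python
def classify_cost_weighted(probabilities, costs):
    """Return the class with the highest cost-weighted predicted probability.

    probabilities and costs are dicts keyed by the same class labels.
    """
    weighted = {label: costs[label] * p for label, p in probabilities.items()}
    return max(weighted, key=weighted.get)


# Relative misclassification costs for failing PUL, IUP and EP.
COSTS = {"failing": 1.0, "IUP": 1.0, "EP": 4.0}

# Hypothetical posterior probabilities for one woman.
example = {"failing": 0.5, "IUP": 0.3, "EP": 0.2}
print(classify_cost_weighted(example, COSTS))
```

In this example the unweighted rule would choose the failing PUL class, whereas the weighted rule chooses EP because 4 × 0.2 exceeds 1 × 0.5. This reflects the design choice of accepting more false-positive EP predictions in exchange for fewer missed EPs.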
Model validation
The models were first validated on the training set by use of ROC analysis for three binary classification problems and the confusion tables for the three-category classification problem. We also utilized the bootstrap technique in order to obtain nearly unbiased estimates of the predictive ability of the models (Efron and Tibshirani, 1993). A total of 100 random samples of the same size as the initial data set were drawn with replacement from the initial data set. The logistic models were then fitted on each bootstrap sample, and the performance was measured both on the bootstrap sample and on the original sample. The average difference between the two performance measures forms an estimate of the optimism. The bias-corrected performance measure was then calculated by subtracting the optimism from the measure of the model built on the original data.
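The optimism correction can be sketched generically as follows. To keep the example self-contained, a toy threshold classifier and accuracy stand in for the logistic models and the performance measures used in the study; `fit_threshold`, `accuracy` and the data are assumptions made only for illustration.

```python
import random


def optimism_corrected(data, fit, score, n_boot=100, seed=0):
    """Bootstrap optimism correction of an apparent performance measure.

    fit(rows) returns a model; score(model, rows) returns its performance.
    Returns (corrected performance, estimated optimism).
    """
    rng = random.Random(seed)
    apparent = score(fit(data), data)
    total = 0.0
    for _ in range(n_boot):
        # Resample with replacement, same size as the original data set.
        sample = [rng.choice(data) for _ in data]
        model = fit(sample)
        # Performance on the bootstrap sample minus performance on the original.
        total += score(model, sample) - score(model, data)
    optimism = total / n_boot
    return apparent - optimism, optimism


def fit_threshold(rows):
    """Toy model: threshold halfway between the two class means of x."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    if not pos or not neg:  # degenerate resample containing one class only
        return sum(x for x, _ in rows) / len(rows)
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def accuracy(threshold, rows):
    """Proportion of rows where (x > threshold) matches the label."""
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)


# Hypothetical (x, label) pairs.
toy_data = [(0.2, 0), (0.4, 0), (0.5, 1), (0.6, 0), (0.7, 1),
            (0.9, 1), (1.1, 1), (0.3, 0), (0.8, 0), (1.0, 1)]
corrected, optimism = optimism_corrected(toy_data, fit_threshold, accuracy)
print(corrected, optimism)
```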
Then the models were validated further on an independent data set with 196 PULs after excluding the persisting PULs. The predicted probabilities of the three classes were calculated with the model developed on the training data, based on which the AUCs and confusion tables were obtained for the test data. Additional bootstrap validation was also performed on the test set.
Results
In the test set of 199 PULs, there were 109 (54.8%) failing PULs, 75 (37.7%) IUPs, 12 (6.0%) EPs and three (1.5%) persisting PULs. Among the 196 test-set women included in the analysis, 136 (69.4%) presented with lower abdominal pain and 60 (30.6%) without; 66 (33.7%) presented without any vaginal bleeding, 68 (34.7%) had vaginal bleeding without clots and 62 (31.6%) had vaginal bleeding with clots.
Table I presents the demographic, hormonal and ultrasonographic characteristics of women with a PUL. Also reported are the P-values for statistical significance of the variables for distinction between groups. These results indicate that almost all the variables are significant in discriminating between an IUP and a non-IUP (either a failing PUL or an EP). In contrast, none of the variables appeared to be significant for distinguishing an EP from a non-EP (including failing PUL and IUP).
M1 included the HCG ratio alone. [Equations of the fitted logit models: not recoverable from the source.]
All the parameters in the three multi-category logit models are significant, with P-values <0.01, except for the variable age in M3. In model M3, age has a P-value of 0.34 in the equation for the contrast of EP versus IUP group, and a P-value of 0.04 for failing PUL versus IUP group. Table II presents the odds ratios for the HCG ratio, the log progesterone average and age, according to the outcome of the pregnancy.
Classification rules for the three-category classification problem
From the best performing model (on the test set), M1, we computed the posterior probability for a woman having a failing PUL, an IUP or an EP.
Since model M1 has only one explanatory variable, the HCG ratio, we can visualize the relationship between the posterior probability and HCG ratio, as shown in Figures 5 and 6. The dotted line indicates the predicted probability for an observation being an ectopic PUL versus its HCG ratio, the solid line for an observation being a failing PUL, and the dashed line for an observation being an IUP. Also shown in the figure is the observed probability of a PUL being a failing PUL, EP or IUP given the HCG ratio. The variable HCG ratio was first divided into 12 evenly spaced intervals between 0 and 4, then the observed probability was estimated by the proportion of an outcome category within each interval using the data from the training and test set, respectively. The predicted and observed probabilities seem to match quite well in both the training and test data. There is only one exceptional extreme case in the test set when the HCG ratio is close to 4 (see Figure 6). However, the observed probability is not reliable for this last interval with the HCG ratio >3.64, since there is only one PUL (a failing PUL from the test set) in this interval.
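The observed probabilities described above are simple within-bin proportions. The sketch below shows one way to compute them, assuming equal-width bins between 0 and 4; the exact bin edges used in the figures are not stated in the text, and the data shown are hypothetical.

```python
def observed_proportions(hcg_ratios, outcomes, low=0.0, high=4.0, n_bins=12):
    """Proportion of each outcome within equal-width HCG ratio bins.

    Returns a list with one dict per bin mapping outcome -> proportion.
    Bins without observations yield an empty dict. Values outside
    [low, high] are ignored.
    """
    counts = [{} for _ in range(n_bins)]
    for ratio, outcome in zip(hcg_ratios, outcomes):
        if not (low <= ratio <= high):
            continue
        # Values equal to `high` are placed in the last bin.
        index = min(int((ratio - low) * n_bins / (high - low)), n_bins - 1)
        counts[index][outcome] = counts[index].get(outcome, 0) + 1

    proportions = []
    for bin_counts in counts:
        total = sum(bin_counts.values())
        proportions.append(
            {label: n / total for label, n in bin_counts.items()} if total else {}
        )
    return proportions


# Hypothetical observations.
ratios = [0.1, 0.2, 0.25, 2.1, 2.2, 3.9]
labels = ["failing", "failing", "EP", "IUP", "IUP", "failing"]
props = observed_proportions(ratios, labels)
print(props[0], props[6], props[11])
```

As the text notes for the last interval, a proportion based on a single observation (as in the final bin of this example) is not a reliable estimate.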
We also tried to derive the classification rules using weighted predicted probabilities, by which we explicitly incorporated the misclassification costs into our decision making. By varying the costs, we obtained different results. The optimal (relative) cost values for misclassifying a failing PUL, an IUP and an EP were 1, 1 and 4, respectively, which were selected based on the performance on the training set. The corresponding results of these rules on the training and test set are shown in Table V(a).
We also note that the PPVs for EP are quite low in Tables IV and V, which is due mainly to the low prevalence of EPs in our study group. In contrast, the likelihood ratios (LRs) are mathematically independent of the prevalence and are considered more informative for clinical practice. Therefore, in Table VI, we present the diagnostic results as LRs for different intervals of the HCG ratio, together with the occurrence and corresponding ranges of predicted probabilities from M1 for different types of PULs. Since the number of EPs is very small in both the training and the test set, the data from all 381 women have been used in order to obtain more reliable LR estimates.
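For an interval of a test result, the LR for a given outcome is the proportion of women with that outcome whose result falls in the interval, divided by the corresponding proportion among women without that outcome. The sketch below uses hypothetical counts, not the values of Table VI.

```python
import math


def interval_likelihood_ratio(target_in, target_total, other_in, other_total):
    """Likelihood ratio of a test-result interval for a target outcome.

    target_in / target_total: women with the outcome whose result is in the
    interval, and all women with the outcome. other_* are the same counts
    for women without the outcome.
    """
    p_target = target_in / target_total
    p_other = other_in / other_total
    if p_other == 0.0:
        return math.inf if p_target > 0.0 else float("nan")
    return p_target / p_other


# Hypothetical counts: 10 of 20 EPs and 75 of 300 non-EPs in the interval.
lr = interval_likelihood_ratio(10, 20, 75, 300)
print(lr)
```

An LR above 1 means the interval raises the probability of the outcome relative to the pre-test probability, and an LR below 1 lowers it; because both proportions are conditioned on the true outcome, the LR does not depend on prevalence.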
Discussion
In developing the models, gestational age and endometrial thickness were not selected by the stepwise logistic regression procedure, though they appeared to be significant in the univariate logistic regression models. This is due to the strong correlations between these variables and the HCG ratio, which are both 0.4, and their correlations with the average progesterone levels, which are both 0.3. In previous studies, the use of these parameters as an alternative to serum HCG for the diagnosis of EP in women with PUL has not proved diagnostic (Mol et al., 1999b).
The models do not have to be used at the same gestational period in early pregnancy provided the serum HCG levels are <10 000 IU/l. We know that when the serum HCG concentrations are <10 000 IU/l, the rate of serum HCG change does not increase significantly, i.e. it is linear (Kadar et al., 1990). All the women in this study with an early IUP had an initial serum HCG <10 000 IU/l; therefore, the rate of change of the log HCG was linear. Thus the models do not require the same gestational age in PULs.
The performance results of the training sets of all three multi-categorical logistic regression models, M1, M2 and M3, were very encouraging. All three models outperformed current diagnostic criteria for EP and were as good as current diagnostic criteria for predicting viability.
When the AUC in each model (for the prediction of EP on the training set) is compared with single parameters such as the discriminatory zone, we see that their performance is significantly better.
As these results were obtained retrospectively, we needed to cross-validate the results prospectively in order to assess how robustly each model performed. Each model when tested in this way gave equally encouraging results.
One limitation of this study is that the sample size is rather small with regard to the number of events per variable (EPV). This will influence the stability of the stepwise logistic regression; for example, the selected variables might change when a small amount of data is deleted or added. Focusing on the prediction of EPs, the EPV values for models M1 and M2 are both 20, which are probably large enough to obtain stable parameter estimates for the logit models. However, the EPV value is only 6.7 for M3. Moreover, the incidence of the outcome differs between the training set and test set, and the effects of the predictors may also differ. This might partially explain why the validation results from the test set did not agree well with those from the training set. As a post hoc analysis, we combined the training and test sets into one data set for model development. Again we started the stepwise selection from three sets of candidate variables. Both the HCG ratio and the log progesterone average retained their important roles in the models. Age, however, was dropped from the models, whereas the disrupted midline echo appeared to be significant in all three final models. Therefore, a larger data set is still needed in order to develop a more stable model. Based on the currently available data, M1 is the most impressive model among the three: it is simple, while its validation performance is comparable with or even better than that of the other two models.
Our optimal logistic regression model M1 represents a significant improvement on current diagnostic criteria for the detection of EP. We believe that multi-centre trials are needed to test its reproducibility and validity before it is adopted in the clinical setting. In the future, we hope that the incorporation of historical factors, such as previous history of pelvic inflammatory disease or EP, and the presence or absence of site-specific tenderness at the time of TVS will result in a more detailed modelling of the probability of an EP. In turn, this could result in a better diagnostic performance.
This logistic regression model can predict which PULs become failing PULs, IUPs and, most importantly, EPs based on the patient's HCG ratio alone. It significantly outperforms current diagnostic criteria for the prediction of EPs.
References
Banerjee S, Aslam N, Woelfer B, Lawrence A, Elson J and Jurkovic D (2001) Expectant management of early pregnancies of unknown location: a prospective evaluation of methods to predict spontaneous resolution of pregnancy. Br J Obstet Gynaecol 108, 158–163.
Cacciatore B, Stenman UH and Ylostalo P (1990) Diagnosis of ectopic pregnancy by vaginal ultrasonography in combination with a discriminatory serum hCG level of 1000 IU/L (IRP). Br J Obstet Gynaecol 10, 904–908.
Condous GS (2004) The management of early pregnancy complications. In Bourne T and Valentin L (eds), Best Practice and Research Clinical Obstetrics and Gynaecology Special Issue: Volume 18 Issue 1. Ultrasound in Gynaecology. Elsevier, Amsterdam, pp. 37–57.
Condous G, Okaro E, Khalid A, Zhou Y, Lu C, Van Huffel S, Timmerman D and Bourne T (2002) Role of biochemical and ultrasonographic indices in the management of pregnancies of unknown location. Ultrasound Obstet Gynaecol 20 Suppl 1, 36–37.
Dart RG, Mitterando J and Dart LM (1999) Rate of change of serial beta-human chorionic gonadotropin values as a predictor of ectopic pregnancy in patients with indeterminate transvaginal ultrasound findings. Ann Emerg Med 34, 703–710.
Efron B and Tibshirani RJ (1993) An Introduction to the Bootstrap. Chapman & Hall, New York.
Hanley JA and McNeil B (1982) The meaning and use of the area under a receiver operating characteristic curve. Diagn Radiol 143, 29–36.
Hanley JA and McNeil B (1983) A method of comparing the areas under the receiver operating characteristics curves derived from the same cases. Radiology 148, 839–843.
Kadar N, Caldwell BV and Romero R (1981) A method of screening for ectopic pregnancy and its indications. Obstet Gynecol 58, 162–166.
Kadar N, Freedman M and Zacher M (1990) Further observations on the doubling time of human chorionic gonadotropin in early asymptomatic pregnancies. Fertil Steril 54, 783–787.
Mol BW, van Der Veen F and Bossuyt PM (1999a) Implementation of probabilistic decision rules improves the predictive values of algorithms in the diagnostic management of ectopic pregnancy. Hum Reprod 14, 2855–2862.
Mol BW, Hajenius PJ, Engelsbel S, Ankum WM, van der Veen F, Hemrika DJ and Bossuyt PM (1999b) Are gestational age and endometrial thickness alternatives for serum human chorionic gonadotropin as criteria for the diagnosis of ectopic pregnancy? Fertil Steril 72, 643–645.
Rosello N, Condous G, Okaro E, Khalid A, Alkatib M, Rao S and Bourne T (2003) Does transvaginal ultrasonography accurately diagnose ectopic pregnancy? Hum Reprod 18 Suppl 1, 160.
Shalev E, Yarom I, Bustan M, Weiner E and Ben-Shlomo I (1998) Transvaginal sonography as the ultimate diagnostic tool for the management of ectopic pregnancy: experience with 840 cases. Fertil Steril 69, 62–65.
Why Mothers Die Triennial Report 1997-1999. Confidential Enquiry into Maternal Deaths, UK.
Submitted on October 31, 2003; accepted on May 6, 2004.