1 Department of Pediatrics, School of Medicine, University of Washington, Seattle, WA.
2 Department of Epidemiology, School of Public Health and Community Medicine, University of Washington, Seattle, WA.
3 Children's Craniofacial Center, Children's Hospital and Regional Medical Center, Seattle, WA.
4 Department of Biostatistics, School of Public Health and Community Medicine, University of Washington, Seattle, WA.
5 Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA.
Received for publication October 16, 2003; accepted for publication August 30, 2004.
ABSTRACT
epidemiologic methods; log-linear model; operating characteristic; polymorphism (genetics); risk
INTRODUCTION
In designing studies of maternal factors, however, either the rarity of a given disease, long induction periods, or both hamper the prospective investigation of maternal pregnancy biomarkers in relation to subsequent offspring disease risk. An alternative approach employs maternal genotypes at polymorphic loci in the relevant pathway as surrogates of genetically determined maternal in-utero biomarker levels in a retrospective epidemiologic study (1, 11, 12). Such methods have been applied in studies of maternal cigarette smoking and low birth weight (13), maternal thrombophilia and intrauterine growth restriction (14), and maternal folate metabolism and neural tube defects (15, 16).
The log-linear approach to case-parent triad data (LCPT) has been proposed as a method for estimating associations between maternal genotypes and disease risk in offspring, independent of disease associations with the offspring's own genotypes (1, 11). The LCPT performs well with 100 triads when the average high-risk variant allele frequency (f) is approximately 0.14 (1, 11), but, to our knowledge, no one has explored this approach with other sample sizes (n = number of triads) or values of f. Both n and f, however, greatly affect the expected distribution of triad genotypes and therefore should influence the operating characteristics of the LCPT, which might suffer when the expected number of triads in analytically informative categories is low. Thus, a more comprehensive investigation of the LCPT will aid researchers in planning studies and evaluating their results.
We performed a computer simulation study to evaluate the performance of the LCPT under various values of n and f. We assessed 1) the rate of false-positive detection of maternal genetic associations under the null hypothesis of no maternal genetic associations with disease (type I error); 2) statistical power to detect moderate maternal genetic associations; 3) minimum detectable maternal relative risks (MDRRs) with 80 percent power when n = 200; and 4) bias in maternal relative risk (RR) estimates.
METHODS
We simulated sample sizes of 100, 200, 500, 1,000, 2,000, or 5,000 triads. For each value of n and f, we simulated maternal hetero- and homozygote relative risks of 1.4 and 2 (log-additive), 2 and 2 (dominant), and 1 and 2 (recessive), respectively. We assessed whether the offspring genotype (C) affected maternal relative risk estimates and tests by varying whether offspring relative risks were null or non-null (non-null being the same as maternal relative risks in a given scenario).
We randomly generated 1,000 data sets for each scenario, performing all simulations in Stata 8.2 (17).
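The data-generating step can be sketched as follows. This is an illustrative Python reconstruction, not the authors' Stata code: it assumes Hardy-Weinberg genotype proportions in parents, random mating, Mendelian transmission, and multiplicative maternal and offspring effects, none of which is spelled out in the text above. The function names (`triad_distribution`, `simulate_triads`) are hypothetical.

```python
import random
from itertools import product

def genotype_probs(f):
    """Assumed Hardy-Weinberg probabilities of carrying 0, 1, or 2 variant copies."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]

def child_probs(m, fa):
    """P(C = c | M = m, F = fa) under Mendelian transmission."""
    pm, pf = m / 2, fa / 2  # chance that each parent transmits the variant
    return [(1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf]

def triad_distribution(f, maternal_rr=(1.0, 1.0, 1.0), offspring_rr=(1.0, 1.0, 1.0)):
    """Expected proportions of the 15 possible (M, F, C) cells among case triads."""
    g = genotype_probs(f)
    weights = {}
    for m, fa in product(range(3), repeat=2):
        for c, pc in enumerate(child_probs(m, fa)):
            if pc > 0:
                # Population frequency of the triad times the disease risk multipliers.
                weights[(m, fa, c)] = g[m] * g[fa] * pc * maternal_rr[m] * offspring_rr[c]
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}

def simulate_triads(n, dist, rng):
    """Draw n case triads and return the count in each (M, F, C) cell."""
    cells = list(dist)
    counts = {cell: 0 for cell in cells}
    for cell in rng.choices(cells, weights=[dist[c] for c in cells], k=n):
        counts[cell] += 1
    return counts

# Log-additive maternal scenario from the text (RR1 = 1.4, RR2 = 2), null offspring effects.
dist = triad_distribution(0.3, maternal_rr=(1.0, 1.4, 2.0))
counts = simulate_triads(200, dist, random.Random(1))
```

Evaluating `triad_distribution` at values of f near 0 or 1 shows directly how many cells have expected counts near zero for a given n.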
Using the LCPT to estimate relative risks and perform hypothesis tests
The LCPT is one of a class of family-based association tests for estimating offspring relative risks conditional on parental genotypes (18). Additionally, it offers a means to estimate the relative risks associated with maternal genotypes independent of offspring relative risks by using a log-linear regression model stratified on mating type (the combination of parental genotypes). There are six such combinations if one assumes that there is no sex bias in the distribution of parental genotypes in the general population; indeed, the null hypothesis that maternal genotypes are unassociated with offspring disease risk exploits this assumption.
We specified the unrestricted regression model as follows (1):
$$
\ln[E(n_{M,F,C})] = \gamma_j + \alpha_1 I_{\{M=1\}} + \alpha_2 I_{\{M=2\}} + \beta_1 I_{\{C=1\}} + \beta_2 I_{\{C=2\}} + \ln(2)\, I_{\{M=F=C=1\}},
$$

where each of M, F, and C (the number of copies of the variant carried by the mother, father, and child, respectively) equals 0, 1, or 2; j indexes the mating types; $I_{\{M=1\}}$ and $I_{\{M=2\}}$ represent dummy variables indicating whether or not the mother has one (M = 1) or two (M = 2) copies of the variant; $I_{\{C=1\}}$ and $I_{\{C=2\}}$ represent analogous terms for offspring genotypes; and the final term in the model represents a correction factor for the one triad (M = F = C = 1) having twice the probability of occurring relative to other triads of the same mating type. Thus, $\exp(\alpha_1)$ (RR1) and $\exp(\alpha_2)$ (RR2) represent the estimated relative risks associated with the mother's having one or two copies of the variant, respectively. Elevated maternal relative risks indicate that among case-parents with discordant genotypes, mothers tend to have more copies of the variant than fathers. Mating types with concordant parental genotypes are uninformative regarding maternal relative risks.
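To make the structure of the unrestricted model concrete, the sketch below enumerates the 15 possible triad cells and builds, for each, the mating-type stratum, the maternal and offspring indicator variables, and the ln(2) offset. It is an illustration only: the names (`MATING_TYPES`, `design_rows`) are hypothetical, and the mating-type numbering is an assumption chosen so that the concordant types are numbered 1, 4, and 6.

```python
import math
from itertools import product

# Assumed numbering of the six mating types as unordered parental genotype pairs.
MATING_TYPES = [(2, 2), (2, 1), (2, 0), (1, 1), (1, 0), (0, 0)]

def possible_children(m, fa):
    """Offspring genotypes that parents with m and fa variant copies can produce."""
    lo = (m == 2) + (fa == 2)   # a homozygous-variant parent always transmits the variant
    hi = (m >= 1) + (fa >= 1)   # any carrier parent may transmit the variant
    return range(lo, hi + 1)

def design_rows():
    """One row per (M, F, C) cell: stratum, indicator terms, and offset."""
    rows = []
    for m, fa in product(range(3), repeat=2):
        j = MATING_TYPES.index((max(m, fa), min(m, fa))) + 1
        for c in possible_children(m, fa):
            rows.append({
                "cell": (m, fa, c),
                "mating_type": j,                 # stratum intercept index
                "M1": int(m == 1), "M2": int(m == 2),
                "C1": int(c == 1), "C2": int(c == 2),
                # Only the M = F = C = 1 cell carries the ln(2) correction.
                "offset": math.log(2) if (m, fa, c) == (1, 1, 1) else 0.0,
            })
    return rows

rows = design_rows()
concordant = sorted({r["mating_type"] for r in rows if r["cell"][0] == r["cell"][1]})
```

These rows, paired with observed cell counts, are what a Poisson regression routine with an offset term would take as input. Within a concordant mating type the maternal indicators do not vary, which is why those strata carry no information about the maternal coefficients.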
Because of uncertainty about the true mode of inheritance at a locus, primary analyses will often employ the "unrestricted model" (1), in which one estimates separately the relative risks (and 95 percent confidence intervals thereof) associated with the mother's hetero- or homozygosity. One could also incorporate prior information about the mode of inheritance (11, 19). For example, one could parameterize the model log-additively, imposing the assumption that risk multiplies with each additional allele (we restricted offspring parameters similarly in respective restricted models):

$$
\ln[E(n_{M,F,C})] = \gamma_j + \tfrac{1}{2}\alpha_L I_{\{M=1\}} + \alpha_L I_{\{M=2\}} + \tfrac{1}{2}\beta_L I_{\{C=1\}} + \beta_L I_{\{C=2\}} + \ln(2)\, I_{\{M=F=C=1\}}.
$$

To enforce a dominant or recessive genetic model, one could restrict the relative risk associated with maternal heterozygosity (RR1) to be equal to RR2 or 1, respectively. We subsequently refer to $\exp(\alpha_L)$ as RRL and to the dominant and recessive relative risks as RRD and RRR, respectively. We subtracted power in the unrestricted model (1) from power in each corresponding restricted model to compare their performances.
In all regression analyses, we tested maternal relative risk parameters statistically by performing likelihood ratio tests. For a given locus, twice the difference between the log-likelihoods of the model estimating maternal relative risks and a model in which maternal relative risks were fixed at 1 (that is, maternal coefficients of zero) was compared with a chi-squared distribution with 2 df (model 1) or 1 df (restricted models). We rejected the null hypothesis on the basis of a two-sided test with a significance level of 0.05.
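The likelihood ratio test just described reduces to a short calculation once the two maximized log-likelihoods are available. The sketch below uses only closed-form chi-squared tail probabilities for 1 and 2 df; the function names and the example log-likelihood values are hypothetical and are not taken from the study.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with 1 or 2 df."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2))  # equals 2 * (1 - Phi(sqrt(x)))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("only 1 or 2 df are needed for these models")

def likelihood_ratio_test(loglik_full, loglik_null, df, alpha=0.05):
    """Return the test statistic, its p value, and whether the null is rejected."""
    stat = 2 * (loglik_full - loglik_null)
    p_value = chi2_sf(stat, df)
    return stat, p_value, p_value < alpha

# Made-up log-likelihoods for an unrestricted fit (2 df) versus the maternal-null fit.
stat, p_value, reject = likelihood_ratio_test(-41.2, -45.0, df=2)
```

For a restricted (log-additive, dominant, or recessive) model, the same call would be made with `df=1`.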
In calculating bias, type I error, and power, we excluded data sets in which any maternal relative risk coefficient could not be estimated or its estimated standard error was 0 or >50.
Estimating type I error
To assess type I error in each scenario, we calculated the proportion of data sets in which we rejected the null hypothesis when, indeed, maternal genotypes were unassociated with offspring disease risk. We calculated exact binomial 95 percent confidence intervals for the estimated error rates.
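An exact binomial interval for an estimated error rate can be computed without statistical libraries. The sketch below assumes the Clopper-Pearson construction, which is the usual meaning of "exact binomial" but is not named in the text, and solves for the limits by bisection. The function names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_pmf)
    return total

def bisect_root(fn, lo=0.0, hi=1.0, iters=60):
    """Root of a function that increases in p on (0, 1)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if fn(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_binomial_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k rejections out of n data sets."""
    # Lower limit: P(X >= k | p) = alpha / 2; upper limit: P(X <= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else bisect_root(lambda p: 1 - binom_cdf(k - 1, n, p) - alpha / 2)
    upper = 1.0 if k == n else bisect_root(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

# Example: 50 rejections among 1,000 simulated null data sets.
lower, upper = exact_binomial_ci(50, 1000)
```

A scenario would be flagged as anticonservative when `lower` exceeds 0.05 and as conservative when `upper` falls below 0.05.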
Estimating power
To estimate statistical power in each scenario, we calculated the proportion of data sets in which we rejected the null hypothesis when maternal genotypes were associated with offspring disease risk.
MDRR estimates with 80 percent power
With n = 200 triads, we estimated MDRRs calculated as the relative risk at which the null hypothesis of no association with maternal genotypes was rejected in approximately 80 percent of the 1,000 repeated simulations. For each scenario, we set an initial relative risk and generated 50 data sets. We performed the LCPT analysis in each data set and calculated power. If the 95 percent confidence interval for this proportion excluded 80 percent, we modified the relative risk and generated 50 new data sets. We iterated this process until the 95 percent confidence interval included 80 percent and then raised the number of simulated data sets to 200. We then iterated similarly. When the 95 percent confidence interval based on 200 simulated data sets included 80 percent, we raised the number of data sets to 1,000, modifying the relative risk and repeating sets of simulations until power estimates were between 77.5 percent and 82.5 percent. Across scenarios, we varied f from 0.05 to 0.95 in the log-additive and recessive scenarios but only from 0.05 to 0.75 in the dominant model, because of prohibitive increases in the MDRR and simulation time.
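The staged search can be sketched as a loop over 50, 200, and 1,000 data sets. Several details below are assumptions rather than the authors' procedure: the text does not say how the relative risk was modified between iterations (here, a step on the log scale proportional to the shortfall in power) or which interval was used for the power estimate (here, a Wilson interval). `estimate_power` stands for any routine that simulates data sets and returns the proportion of rejections, and `toy_power` is a made-up deterministic stand-in used only to exercise the loop.

```python
import math

def wilson_ci(p_hat, n, z=1.96):
    """Approximate 95 percent Wilson interval for a proportion (an assumed choice)."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def find_mdrr(estimate_power, rr_start=2.0, target=0.80, stages=(50, 200, 1000),
              final_band=(0.775, 0.825), step=1.0, max_iter=200):
    """Search for the relative risk detected with roughly 80 percent power."""
    log_rr = math.log(rr_start)
    for n_sets in stages:
        final_stage = n_sets == stages[-1]
        for _ in range(max_iter):
            power = estimate_power(math.exp(log_rr), n_sets)
            if final_stage:
                done = final_band[0] <= power <= final_band[1]
            else:
                lo, hi = wilson_ci(power, n_sets)
                done = lo <= target <= hi
            if done:
                break
            # Assumed update rule: raise the RR when power is too low, lower it otherwise.
            log_rr += step * (target - power)
        else:
            raise RuntimeError("search did not converge; try a smaller step")
    return math.exp(log_rr)

def toy_power(rr, n_sets):
    """Hypothetical smooth power curve; a real run would simulate n_sets data sets."""
    return 1 / (1 + math.exp(-6 * (math.log(rr) - math.log(1.8))))

mdrr = find_mdrr(toy_power, rr_start=2.0)
```

The cheap early stages locate the neighborhood of the answer so that the expensive 1,000-data-set runs are spent only near the final value.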
Estimating bias
We evaluated bias graphically by plotting the exponentiated average of the maternal parameter regression coefficients against the high-risk variant allele frequency. We also calculated the ratio of this average to the true relative risk.
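The bias summary is the exponentiated mean of the log relative risk estimates, that is, the geometric mean of the estimated relative risks, divided by the true value. The sketch below also applies the exclusion rule stated earlier (coefficients that could not be estimated, or standard errors of 0 or above 50). The function names and the example inputs are hypothetical.

```python
import math

def usable(coef, se):
    """Keep a fit only if the coefficient exists and 0 < SE <= 50."""
    return coef is not None and se is not None and 0 < se <= 50

def bias_summary(fits, true_rr):
    """fits: (coefficient, standard error) pairs on the log scale, one per data set."""
    coefs = [coef for coef, se in fits if usable(coef, se)]
    mean_coef = sum(coefs) / len(coefs)
    average_rr = math.exp(mean_coef)        # exponentiated average coefficient
    return average_rr, average_rr / true_rr, len(coefs)

# Made-up fits: two usable, one with SE = 0, one not estimable, one with SE > 50.
fits = [(math.log(2.2), 0.31), (math.log(1.8), 0.28), (0.9, 0.0), (None, None), (3.0, 60.0)]
average_rr, ratio, n_used = bias_summary(fits, true_rr=2.0)
```

A ratio near 1 indicates little bias; values above or below 1 indicate upward or downward bias, respectively.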
RESULTS
Type I error
The LCPT was generally valid for most values of n and f (figure 1). The 95 percent confidence intervals for the type I error rates usually included 0.05.
With the assumed-dominant models (figure 1), the 95 percent confidence intervals lay above 0.05 in 12 of 228 scenarios. When f was very high, the 95 percent confidence interval often lay below 0.05 (in 22 of 228 scenarios).
This pattern was reversed with respect to f when the analytic model was assumed to be recessive (figure 1), with similar numbers of scenarios having 95 percent confidence intervals excluding 0.05. The test of maternal relative risks tended to be conservative at very low values of f with smaller sample sizes.
When the analytic model was assumed to be log-additive (figure 1), the 95 percent confidence intervals excluded 0.05 in only 13 scenarios. In 10 of the 13 scenarios, the test was conservative, with little pattern in relation to f.
Statistical power for detecting maternal associations
The statistical power of the LCPT depended greatly on f (figure 2). Power was maximized around f = 0.4, 0.25, and 0.7 for underlying log-additive, dominant, and recessive phenotypes, respectively. Power decreased asymmetrically around these values, and this asymmetry was much more pronounced for maternally dominant and recessive phenotypes.
Under recessive inheritance, the unrestricted model with n = 200 allowed for power of 80 percent to detect maternal homozygote relative risks of 2 when f was between 0.45 and 0.8 (figure 2); power tended to be much lower at very low values of f.
Analyzing the data with an enforced log-additive model when the true model was also log-additive increased power, by an average of 10 percent when n = 200 (figures 2 and 3), and never by more than 18 percent for all sample sizes studied. Making this assumption generally decreased power when the true genetic model was dominant or recessive, however, sometimes by as much as 60 percent. Similarly, correctly specifying the model as dominant or recessive increased power for most values of f, by as much as an absolute increase of 20 percent (figures 2 and 3). Mistakenly assuming a dominant or recessive model when the opposite mode of inheritance was true, however, reduced power greatly for most values of f (figures 2 and 3), by as much as an absolute difference of 97 percent.
As compared with null offspring relative risks, offspring non-null associations slightly increased power to detect maternal relative risks when f was lower and slightly decreased power when f was higher (figure 4). The differences, though small, were consistent across all scenarios considered.
Non-null offspring relative risks reduced MDRRs (by 23 percent) at lower values of f and increased them at higher values of f. The increases depended on the underlying genetic model and were as high as 100 percent under dominant inheritance (data not shown).
Bias in maternal relative risk estimates
At f = 0.95, even at sample sizes of up to 1,000 triads, maternal RR1, RR2, and RRD estimates were sometimes as much as 50 to 160 percent higher or 85 percent lower than their true values, depending on the underlying inheritance (figure 6). At the lowest values of f, maternal RR1 and RRD estimates were unbiased, but the maternal RR2 and RRR estimates tended to exhibit large downward biases (figure 6). Analyzing the data with a dominant or recessive model, even when the assumption was correct, did not greatly reduce these biases (figure 6). The sample size needed to minimize the observed bias depended on the value of f and was sometimes as high as 5,000. Estimates of RRL were not generally biased when the true model was log-additive (figure 6).
DISCUSSION
In this study, we found that for moderate values of f, the LCPT approach with 500 triads allowed for approximately 80 percent power to detect unbiased maternal RR1s of 1.4, 2, and 2 when inheritance was log-additive, dominant, and recessive, respectively. Operating characteristics depended greatly on the mode of inheritance, f, and n. Large decreases in power and increases in bias may make the LCPT method impractical for studying putative high-risk maternal alleles that are relatively common when the underlying mode of inheritance is log-additive or, particularly, when it is dominant. Conversely, with underlying recessive inheritance, power and estimation would both suffer if the high-risk allele were very rare, though to a lesser extent than with very common high-risk alleles under dominant inheritance.
We illustrated the cause of these dramatic shifts in the performance of the LCPT by calculating the expected proportion of triads in each triad genotype category under specific alternative hypotheses. Mating type categories consisting of concordant maternal and paternal genotypes (mating types 1, 4, and 6) are uninformative regarding maternal relative risks. When the true mode of inheritance is dominant or recessive, however, mating type 2 or mating type 5, respectively, also offers little information regarding the magnitude of the maternal relative risk. The expected frequency of triads in some informative triad categories drops sharply when f nears 0 or 1, and the number of categories with near-zero counts is higher for values of f approaching 1 than for those approaching 0. This asymmetry reflects the mechanics of the LCPT analysis, in which one measures skewing away from the expected distribution of triads in the absence of maternal associations with the case's disease. Even under the null-hypothesized distribution, extremely rare or common alleles may lead to many zero cell-counts. Yet, among triads randomly selected from the population, the expected distribution of empty cells when f = 0.05 is exactly the reverse of that when f = 0.95. However, under the alternative hypothesis (that is, when the maternal relative risk is greater than 1), the distribution of empty cells among triads ascertained through cases will not be symmetric around f = 0.5, regardless of the mode of inheritance. Thus, even under log-additive inheritance, the LCPT had poorer performance when f = 0.95 than when f = 0.05.
The poor performance of the LCPT for studying very common high-risk alleles raises the question of whether this is a practical limitation or merely a theoretical one. The LCPT is potentially most applicable to the study of complex diseases, that is, those with multiple genetic and environmental factors, each of which typically would be only slightly associated with increased disease risk. A causal maternal genetic variant that occurs in 95 percent of the population and increases offspring disease risk fivefold, for example, would contribute to the occurrence of approximately 80 percent of cases, which would be inconsistent with this paradigm of complex disease etiology (20, 21). However, this proportion increases or decreases with the magnitude of the relative risk (22), such that a causal maternal variant with a prevalence of 95 percent and a relative risk of only 1.5 would contribute to only about 30 percent of cases. Furthermore, if the very common allele acts as a modifier of risk increases caused by other genetic or environmental factors, its associated relative risk and population attributable risk percentage would be even lower in analyses that did not account for this heterogeneity (indeed, the situation just described drives much of the current, widespread research emphasis on "gene-environment" interactions). Thus, the high prevalence of a putative high-risk allele is not sufficient to exclude it as a candidate of potential etiologic or public health importance. Despite the potential scientific interest of such variants, however, their very high prevalence might impede their investigation via the LCPT approach unless the true mode of inheritance were recessive.
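The proportion of cases to which a variant contributes can be approximated with the standard population attributable fraction. The sketch below assumes Levin's formula, p(RR - 1) / [1 + p(RR - 1)], where p is the prevalence of the variant; the text cites a reference for this relation without stating the formula, so the choice is an assumption, and the function name is hypothetical.

```python
def attributable_fraction(prevalence, relative_risk):
    """Levin's population attributable fraction (assumed formula)."""
    excess = prevalence * (relative_risk - 1)
    return excess / (1 + excess)

# The two scenarios discussed in the text, both with a prevalence of 95 percent.
paf_strong = attributable_fraction(0.95, 5.0)
paf_weak = attributable_fraction(0.95, 1.5)
print(f"RR = 5.0: {paf_strong:.2f}")
print(f"RR = 1.5: {paf_weak:.2f}")
```

With a relative risk of 1.5 the function returns roughly 0.32, in line with the "about 30 percent" quoted above, and the fraction shrinks toward zero as the relative risk approaches 1.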
In addition to f and n, any factor influencing the expected triad frequencies could affect the performance of the LCPT, particularly at small sample sizes. Although the tests of maternal and offspring relative risks are orthogonal (11), we showed that with smaller sample sizes, associations between the offspring's own genotypes and disease can affect the power to detect maternal genetic associations with the offspring's disease. Despite the weakness of these effects, they were consistent across analytic models and modes of inheritance, and the effects were stronger when the magnitude of the offspring relative risks was larger (data not shown). Power did not depend on offspring relative risks when n was 2,000.
Some case-parent triad studies may include a mix of complete and incomplete triads, for example, if only one parent is available for genotyping (11, 23). Although we did not examine this issue in our simulations, we would expect that missing parental genotypes would reduce the sample size and could thereby worsen LCPT performance. Even if a parent's unavailability were related to his/her genotype of interest, missingness should not bias the maternal relative risk unless genotypes were differentially related to mothers' probability of being unavailable as compared with fathers'.
In contrast to LCPT-derived offspring relative risk estimates, the maternal relative risk estimates may be confounded by associations between maternal ethnicity and maternal alleles (24). We suspect that the fact that mating partners are often chosen on the basis of ethnicity may serve to decrease potential confounding from population stratification, since the LCPT implicitly matches case mothers to case fathers rather than to control mothers. Even within homogeneous groups, however, any underlying genetic basis for sex-specific choices of mates could confound observed associations with maternal variants of interest, because the null hypothesis for the test of maternal associations depends on the assumption of mating type symmetry. For nonrandom mating to cause inflated error rates, the nonrandomness would have to correlate with variation in the genetic region of interest and also be sex-specific. For example, if it is more important to women than to men to choose a mate who is tall relative to the general population, LCPT analysis of variants in genes related to height, such as the human growth hormone (GH1) gene (25), may yield spurious maternal associations with offspring disease risk. While it is difficult to assess how commonly such spurious results might arise, differences between mothers' and fathers' genotypes could be explored among randomly selected triads.
It has been suggested that the power of the LCPT design would increase if one incorporated into the analysis prior information about the mode of inheritance (11). We demonstrated that the consequent gain in power that occurs when the model is correctly specified may be outweighed by the potential loss of even greater power when model assumptions are false, particularly in assumed-dominant or assumed-recessive models. One might be tempted to fit a restricted model (e.g., a dominant one) if a primary, unrestricted analysis did not yield results of "statistical significance," perhaps in the hopes of consolidating sparse analytic categories. However, basing the choice of analysis on the results of the first model fit would probably still inflate the overall error rate due to making multiple comparisons, even though we showed that falsely assuming either a dominant or a recessive model in a single model generally proved to be valid.
Analyzing data by using a log-additive model led to only moderate loss of efficiency with smaller sample sizes, even when the true model was dominant or recessive. This is consistent with what has been shown regarding the performance of the LCPT for assessing offspring relative risks (19). However, the loss of efficiency was much greater when n was greater, particularly at extreme values of f (data not shown). Also in its favor, the assumed-log-additive model allowed for identifiable parameter estimates more often than did the other models considered, even when inheritance was dominant or recessive.
We have presented results of a simulation study in which we investigated the performance of the LCPT over a wide range of epidemiologic scenarios. On the basis of these findings, epidemiologists may consider the LCPT a useful approach for assessing maternal genetic associations, unless they expect a very rare or fairly common maternal allele to increase disease risk.
ACKNOWLEDGMENTS
The authors thank Dr. David M. Umbach for his thoughtful suggestions on this article.