a Department of Public Health and Primary Care, Institute of Public Health, University of Cambridge, Cambridge CB2 2SR, UK.
b Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong.
Dr Nicholas J Wareham, Department of Public Health and Primary Care, Institute of Public Health, University Forvie Site, Robinson Way, Cambridge CB2 2SR, UK. E-mail: njw1004{at}medschl.cam.ac.uk
Abstract |
---|
Methods The underlying model considered in this paper is a simple linear regression relating a continuous outcome to a continuously distributed exposure variable.
Results The slope of the regression line is taken to be dependent on genotype, and the ratio of the slopes for each genotype is considered as the interaction parameter. Sample size is affected by the allele frequency and whether the genetic model is dominant or recessive. It is also critically dependent upon the size of the association between exposure and outcome, and the strength of the interaction term. The link between these determinants is graphically displayed to allow sample size and power to be estimated. An example of the analysis of the association between physical activity and glucose intolerance demonstrates how information from previous studies can be used to determine the sample size required to examine gene-environment interactions.
Conclusions The formulae allowing the computation of the sample size required to study the interaction between a continuous environmental exposure and a genetic factor on a continuous outcome variable should have a practical utility in assisting the design of studies of appropriate power.
Keywords Genotype, environmental exposure, gene-environment interaction, sample size, quantitative trait
Accepted 7 February 2001
Introduction |
---|
The alternative situation where the outcome variable is continuously distributed has received less attention but is likely to become important as researchers investigate the genetic basis of quantitative traits such as blood pressure and obesity. A method for calculating power in this situation was recently described but was limited to a number of specific situations in which some main and interaction effects were fixed to zero.7 In the approach presented here, we consider the situation of an effect of a categorical genetic factor on the association between a continuous environmental exposure and a continuously distributed outcome. We illustrate the utility of this approach with an example of the investigation of the interaction between genes and physical activity in the determination of glucose tolerance.
Methods |
---|
The model for an individual in genetic group i (i = 1, 2) is

$$y = \alpha_i + \beta_i E + \varepsilon$$
The regression parameters $\alpha_i$ and $\beta_i$ are weights reflecting the contribution of the genetic factor and the environmental exposure to the continuously distributed outcome y. If there is no gene-environment interaction, then the regression parameters $\beta_1$ and $\beta_2$ are equal. Here $\varepsilon$ is a stochastic error term and is assumed to be normally distributed with mean zero and variance $\sigma_y^2$. We assume the distributions of the residual of y in each group are the same, and the variances of exposure E in each group are $\sigma^2$. In order to give the $\beta$ parameters a clear interpretation, we have standardized both the outcome and the environmental exposure by making $\sigma_y^2 = \sigma^2 = 1$. Note that $\sigma_y^2$ is the residual variance of y after adjusting for E. In most situations E would account for 20% or less of the total variation in y and therefore the residual standard deviation $\sigma_y$ would be within about 10% of the population standard deviation. Thus the $\beta$ coefficients are interpretable as the approximate proportion of a standard deviation change in y for a standard deviation change in E.
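As a quick check of the 10% figure, suppose E explains a fraction $R^2 \le 0.2$ of the total variance of y. The symbols $R^2$ and $\mathrm{SD}(y)$ are introduced here for illustration only and do not appear in the original derivation:

$$\sigma_y = \sqrt{1 - R^2}\,\mathrm{SD}(y) \ge \sqrt{0.8}\,\mathrm{SD}(y) \approx 0.894\,\mathrm{SD}(y)$$

so the residual standard deviation falls short of the population standard deviation by at most about 11%.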
We consider a general situation for a polymorphism where p is the frequency of the rare allele. Assuming that the polymorphism is in Hardy-Weinberg equilibrium, the genotype frequencies of aa, aA and AA are $p^2$, $2p(1 - p)$ and $(1 - p)^2$, respectively. Accordingly, the proportions of individuals in the two genetic groups are $p^2$ and $1 - p^2$ for a recessive model, and $p(2 - p)$ and $(1 - p)^2$ for a dominant model, respectively. To study the effect of the environmental exposure on the association of the outcome variable with this genetic factor, we test the null hypothesis that the regression slopes in the genetic sub-groups are equal. If n individuals are studied, then the test statistic (Appendix) follows an F-distribution with degrees of freedom 1 and $n - 4$ under the null hypothesis, and a non-central F-distribution with degrees of freedom 1 and $n - 4$ under the alternative hypothesis.8 The non-centrality parameter is
![]() |
In this paper we adopt the definition of the non-centrality parameter as given by Rencher,9 S-Plus10 and SAS.11 However, in some papers,12,13 it is defined as $\phi = \sqrt{\lambda/(k + 1)}$, where $\lambda$ is the non-centrality parameter defined above, and k is the numerator degrees of freedom of the test statistic.
If the two slopes are equal, we can study the association of the outcome variable with the genetic factor where E is included as a confounding factor, i.e. test whether the two intercepts are equal. If the slopes are not equal, then testing the equality of the intercepts is misleading. The test statistic (Appendix) follows an F-distribution with degrees of freedom 1 and $n - 3$ under the null hypothesis, and a non-central F-distribution with degrees of freedom 1 and $n - 3$ under the alternative hypothesis. The non-centrality parameter is
![]() |
Using the distribution and the non-centrality parameter, we are then able to calculate power to detect an interaction effect or alternatively the sample size necessary to detect a given interaction with fixed power and significance. We have not adopted any specific parametric model for describing the interaction. Instead, in the results and figures we present power calculations over a range of values for $\beta_1$ and $\beta_2$.
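The calculation described above can be sketched in Python with SciPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the non-centrality parameter is an assumed large-sample form, $\lambda \approx n\,q_1 q_2 (\beta_1 - \beta_2)^2$ with $q_1$ and $q_2$ the genetic group proportions and both variances standardized to 1. The article's own expression for the non-centrality parameter is not reproduced in the text, so results from this sketch may differ from the published figures.

```python
from scipy.stats import f, ncf


def group_proportions(p, model):
    """Proportions of the two genetic groups under Hardy-Weinberg equilibrium."""
    if model == "recessive":
        q1 = p ** 2          # aa only
    elif model == "dominant":
        q1 = p * (2 - p)     # aa or aA
    else:
        raise ValueError("model must be 'recessive' or 'dominant'")
    return q1, 1 - q1


def noncentrality(n, p, beta1, beta2, model):
    """ASSUMED form: lambda = n * q1 * q2 * (beta1 - beta2)^2 (unit variances)."""
    q1, q2 = group_proportions(p, model)
    return n * q1 * q2 * (beta1 - beta2) ** 2


def interaction_power(n, p, beta1, beta2, model="dominant", alpha=0.05):
    """Power of the F-test of equal slopes with 1 and n - 4 degrees of freedom."""
    df1, df2 = 1, n - 4
    critical = f.ppf(1 - alpha, df1, df2)
    lam = noncentrality(n, p, beta1, beta2, model)
    return ncf.sf(critical, df1, df2, lam)


def required_sample_size(p, beta1, beta2, model="dominant",
                         power=0.8, alpha=0.05, n_max=10 ** 7):
    """Smallest n reaching the target power, assuming power increases with n."""
    lo, hi = 4, 8  # n = 4 leaves no residual degrees of freedom
    while interaction_power(hi, p, beta1, beta2, model, alpha) < power:
        lo, hi = hi, hi * 2
        if hi > n_max:
            raise ValueError("target power not reached below n_max")
    while hi - lo > 1:  # bisection between an insufficient and a sufficient n
        mid = (lo + hi) // 2
        if interaction_power(mid, p, beta1, beta2, model, alpha) >= power:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    print(interaction_power(1000, p=0.5, beta1=0.1, beta2=0.3))
    print(required_sample_size(p=0.5, beta1=0.1, beta2=0.3))
```

Because the assumed $\lambda$ depends on the slopes only through their difference, the sketch also makes visible why rare alleles under a recessive model (small $q_1$) demand much larger samples.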
The range of possible values for $\beta_1$ and $\beta_2$ is derived from the study of the relationship between physical activity and glucose intolerance. This association is typical of quantitative traits that may be influenced by genetic factors, as evidence from ecological and migration studies suggests the possibility of strong gene-environment interactions.14 In a study by Wareham et al.,15 the relationship between physical activity and a continuous measure of glucose intolerance was quantified using an objective measure of energy expenditure and a multivariate approach to correction for measurement error. The corrected regression coefficient relating habitual energy expenditure to the 2-h plasma glucose was 0.72 mmol/l per standard deviation of the physical activity level, the ratio of the total energy expenditure to basal metabolic rate. The 95% CI for this coefficient was 0.35 to 1.15 mmol/l per standard deviation. As the population standard deviation for the 2-h plasma glucose was 2.2 mmol/l, we may then express this coefficient standardized for the dependent variable too, resulting in a central estimate of 0.33 with 95% CI of 0.16 to 0.52. In the analysis of plausible values for $\beta_2$, we have therefore taken 0.1 to 0.5 as the range of overall effect that would be of interest in the study of gene-environment interactions. We have simplified the reporting of associations by only considering positive associations, as the results would be symmetrical for associations that were in the opposite direction. This range of $\beta_2$ values is plausible and would include the central estimates from other studies that have examined the association between continuous outcomes and continuous exposures. For example, in the Intersalt study16 the pooled regression coefficient relating 24-h sodium excretion to systolic blood pressure was 0.0354 mm Hg/mmol sodium per day. As the standard deviation of the systolic blood pressure in the UK centres was approximately 15 mm Hg and the standard deviation of the sodium excretion was 50 mmol per day, this can be converted to a standardized $\beta_2$ value of 0.12, which is within the range we have selected to examine. Although it is possible that stronger effects would be of interest, there are at present few examples of such strong associations and we have limited our attention to those that are less than 0.5.
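The standardization arithmetic in the two examples above can be reproduced in a few lines of Python. The helper name `standardize_slope` is hypothetical; the numbers are those quoted in the text.

```python
def standardize_slope(slope, sd_outcome, sd_exposure=1.0):
    """Convert a raw regression slope to SD of outcome per SD of exposure."""
    return slope * sd_exposure / sd_outcome


# Physical activity and 2-h glucose: the slope is already per SD of exposure,
# so only the outcome SD (2.2 mmol/l) is needed.
glucose = [round(standardize_slope(b, sd_outcome=2.2), 2)
           for b in (0.72, 0.35, 1.15)]

# Intersalt: mm Hg per mmol sodium per day, outcome SD 15 mm Hg,
# exposure SD 50 mmol per day.
intersalt = round(standardize_slope(0.0354, sd_outcome=15, sd_exposure=50), 2)

print(glucose)    # [0.33, 0.16, 0.52]
print(intersalt)  # 0.12
```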
Results |
---|
Discussion |
---|
The fact that power critically depends upon the magnitude of the association between the environmental exposure and the outcome is an argument for utilizing exposure measurement instruments that have small degrees of error, because less precise instruments will result in attenuated regression coefficients, making it harder to detect gene-environment interactions. Given that the cost of epidemiological studies is determined not only by the total sample size but also by the cost of measuring the main exposures, the balance between investing in large studies with imprecise but inexpensive exposure measurement compared to smaller studies with expensive but more precisely measured exposures becomes critical in planning future studies to detect possible gene-environment interactions.
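The attenuation argument can be illustrated with the classical measurement error model, which is an assumption made here for the sketch rather than a model stated in the text: the observed exposure is the true exposure plus independent random error, and the reliability is the ratio of true to observed exposure variance, taken to be the same in both genetic groups. Under that model the expected observed slope is the true slope multiplied by the reliability. The function name is hypothetical.

```python
def attenuated_slopes(beta1, beta2, reliability):
    """Expected observed slopes (per unit of the error-prone exposure)
    under classical measurement error with a common reliability."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must lie in (0, 1]")
    return beta1 * reliability, beta2 * reliability


if __name__ == "__main__":
    true_b1, true_b2 = 0.5, 0.1
    obs_b1, obs_b2 = attenuated_slopes(true_b1, true_b2, reliability=0.6)
    # The ratio of slopes is preserved, but the absolute difference shrinks.
    print(obs_b1 / obs_b2, obs_b1 - obs_b2)
```

The ratio of the slopes is unchanged, but their absolute difference shrinks by the reliability factor, and a smaller difference between slopes is what makes the interaction harder to detect for a given sample size.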
Appendix |
---|
![]() |
![]() |
$X_\beta$ is the design matrix when $\beta_1 = \beta_2$
![]() |
![]() |
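The test of equal slopes can be illustrated as a comparison of a full and a reduced linear model. The sketch below is one standard way to compute such an F statistic with NumPy. The design matrices are an assumed parametrization (separate intercepts in both models, separate or common slopes), not necessarily the ones used in this Appendix, and the function name is hypothetical.

```python
import numpy as np


def slope_equality_f(y, e, g):
    """F statistic (1 and n - 4 df) for H0: equal slopes in two genetic groups.

    y: outcome, e: exposure, g: 0/1 indicator of genetic group.
    """
    y = np.asarray(y, dtype=float)
    e = np.asarray(e, dtype=float)
    g = np.asarray(g, dtype=float)
    n = y.size

    # Full model: separate intercept and slope in each group (4 parameters).
    x_full = np.column_stack([1 - g, g, (1 - g) * e, g * e])
    # Reduced model: separate intercepts, common slope (3 parameters).
    x_reduced = np.column_stack([1 - g, g, e])

    def rss(x):
        coef, *_ = np.linalg.lstsq(x, y, rcond=None)
        resid = y - x @ coef
        return float(resid @ resid)

    rss_full, rss_reduced = rss(x_full), rss(x_reduced)
    return (rss_reduced - rss_full) / (rss_full / (n - 4))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 400
    g = (rng.random(n) < 0.5).astype(float)
    e = rng.standard_normal(n)
    y = 0.1 * (1 - g) * e + 0.4 * g * e + rng.standard_normal(n)
    print(slope_equality_f(y, e, g))
```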
The power at the 5% significance level for fixed values of n, p, $\beta_1$ and $\beta_2$ can be obtained easily using standard statistical software. In SAS, for example, the command for the power calculation is
![]() |
![]() |
![]() |
Acknowledgments |
---|
References |
---|
2 Hwang SJ, Beaty TH, Liang KY, Coresh J, Khoury MJ. Minimum sample-size estimation to detect gene environment interaction in case-control designs. Am J Epidemiol 1994;140:1029–37.
3 Foppa I, Spiegelman D. Power and sample size calculations for case-control studies of gene-environment interactions with a polytomous exposure variable. Am J Epidemiol 1997;146:596–604.
4 Garcia-Closas M, Lubin JH. Power and sample size calculations in case-control studies of gene-environment interactions: Comments on different approaches. Am J Epidemiol 1999;149:689–92.
5 Sturmer T, Brenner H. Potential gain in efficiency and power to detect gene-environment interactions by matching in case-control studies. Genet Epidemiol 2000;18:63–80.
6 Lubin JH, Gail MH. On power and sample-size for studying features of the relative odds of disease. Am J Epidemiol 1990;131:552–66.
7 van den Oord E. Method to detect genotype-environment interactions for quantitative trait loci in association studies. Am J Epidemiol 1999;150:1179–87.
8 Mood AM, Graybill FA, Boes DC. Introduction to the Theory of Statistics. Third Edn. New York: McGraw-Hill Book Company, 1974.
9 Rencher AC. Linear Models in Statistics. New York: Wiley, 2000.
10 MathSoft Inc. S-Plus 5 for Unix Guide to Statistics. Seattle, Washington: MathSoft Inc., 1998.
11 SAS Institute Inc. SAS/IML(R) Software: Usage and Reference, Version 6. Cary, NC, USA: SAS Institute Inc., 1990.
12 Pearson ES, Hartley HO. Charts of the power function for all analysis of variance tests, derived from the non-central F-distribution. Biometrika 1951;38:112–30.
13 Odeh RE, Fox M. Sample Size Choice: Chart for Experiments with Linear Models. Second Edn. New York: Marcel Dekker, 1991.
14 Hamman RF. Genetic and environmental determinants of non-insulin-dependent diabetes mellitus (NIDDM). Diabetes Metab Rev 1992;8:287–338.
15 Wareham NJ, Wong MY, Day NE. Glucose intolerance and physical inactivity: the relative importance of low habitual energy expenditure and cardiorespiratory fitness. Am J Epidemiol 2000;152:132–39.
16 Intersalt Cooperative Research Group. Intersalt: an international study of electrolyte excretion and blood-pressure results for 24 hour urinary sodium and potassium excretion. Br Med J 1988;297:319–28.
17 Myers RH. Classical and Modern Regression with Application. Second Edn. Boston: PWS-KENT, 1990.