1 Department of Epidemiology, University of North Carolina School of Public Health, Chapel Hill, NC.
2 Carolina Population Center, University of North Carolina at Chapel Hill, Chapel Hill, NC.
3 Department of Preventive Medicine and Epidemiology, Loyola University Stritch School of Medicine, Maywood, IL.
ABSTRACT
Numerous authors have critiqued the use of race as an etiologic quantity in medical research. Despite this criticism, the use of variables encoding racial/ethnic categorization has increased in epidemiology, and most researchers agree that important variation in disease risk is captured by this classification system. Previous discussions have generally neglected to articulate guidelines for appropriate use of racial/ethnic information in etiologic research. The authors summarize the logical, conceptual, and practical problems associated with the "ethnic paradigm" as currently applied in biomedical sciences and offer a set of methodological recommendations toward more valid use of racial/ethnic classification in etiologic studies. These suggested guidelines address issues of variable definition, study design, and covariate control, providing a consistent foundation for etiologic research programs that neither ignore racial/ethnic disease disparities nor obfuscate the nature of these disparities through inappropriate analytical approaches. This methodological analysis of racial/ethnic classification as an epidemiologic quantity provides a formal basis for a focus on racism (i.e., social relations) rather than race (i.e., innate biologic predisposition) in the interpretation of racial/ethnic "effects."
causality; confounding factors (epidemiology); epidemiologic factors; epidemiologic methods; epidemiologic research design; ethnic groups; racial stocks
Abbreviations: SES, socioeconomic status
Over the last 15 years, numerous authors have critiqued the use of race as an etiologic quantity in medical research (1–8). Much of the emphasis has been on the traditional notion that human races define "[p]ersons who are relatively homogeneous with respect to biological inheritance" (9, p. 139). Despite criticism of a biologic concept of race, use of variables encoding racial/ethnic categorization has increased during this period (10, 11). Even for those who embrace the view that the biologic content of racial/ethnic categories is limited, a rationale for the continued focus on these quantities is that they encode important variations in environment because of the central role they play in social stratification (6, 12–14). Under either set of assumptions, racial or ethnic designation is a remarkably strong predictor of health status (15). It is understandable, therefore, that researchers would seize upon this observed variability between racial/ethnic groups as an important natural resource for etiologic research.
The application of this approach, which we refer to as the "ethnic paradigm," is fraught with difficulty, however. Several previous treatments have emphasized various deficiencies but have generally failed to provide practical guidelines for utilizing racial or ethnic status in etiologic research. Although some journals have recently suggested guidelines, these are generally vague, often merely requiring authors to justify their use of race or ethnicity in subject matter terms and to explain how the variables are defined (16, 17). The topic also requires fresh examination in light of the revolution in molecular biology that allows genetic polymorphisms and their products to be assessed directly, obviating the need to rely on social categorizations as rough surrogates for unspecified biologic attributes. We summarize here the logical, conceptual, and practical problems associated with the ethnic paradigm as currently realized in observational research. We then offer a set of practical methodological guidelines, in light of this critique, to facilitate more consistent use of racial and ethnic classification in etiologic studies.
PRELIMINARY CONSIDERATIONS
Distinction between "race" and "ethnicity"
The prevailing notion of race in biomedical research has long been understood to imply that phenotypic traits like skin color and facial features can be used to categorize people into meaningful genetic subgroups (18). The concept of ethnicity has been suggested as an alternative to race because it is thought to carry less of a strictly biologic connotation, implying that groups may differ by cultural as well as biologic heritage (19–21). In practice, the distinction between these constructs is often blurred, leading many researchers to collapse them into a single dimension as "race/ethnicity" (14, 22) or "ethnorace" (23). Collapsing the terms is also justified because data are generally gathered by self-report, and many respondents consider the terms to be synonymous (24).
How do you know what race or ethnicity a person is/has?
The "gold standard" for racial/ethnic assessment is self-report (25). Although there are measurable biologic correlates of ancestry (26), there is no objective physiologic or anatomic verification of race/ethnicity because this is a descriptor of identity and therefore part of the subjective consciousness of the individual. The circularity of existing definitions for race in terms of ancestry reveals this necessary subjectivity. For example, the US Office of Management and Budget directive 15 definition for Black race is "A person having origins in any of the black racial groups of Africa" (27, p. 29835). The restriction to Black African ancestry, without guidance for how to operationalize this criterion, is necessary on prima facie grounds to exclude African groups that are not historically recognized as Black (e.g., Afrikaners, Arabs), but it is also vague enough to ensure that the formal definition conforms to any assertion of self-identity. In the broadest interpretation, all of humanity meets this definition (28).
Although recent innovations in molecular biology facilitate racial/ethnic identification from fragmentary biologic material (26, 29), this does not supersede self-identification as the ultimate basis for categorization. Although these techniques might seem to suggest a more objective basis for dividing humans into subspecies, this interpretation is faulty because it confuses prediction with validation. Investigators began with groups of people who were identified by self-report or social consensus as Black, White, and so forth and sought measures of physical or molecular traits that would predict the recorded racial classification. The standard against which such predictions are judged for accuracy, therefore, remains subjective identity. Rather than validating a biologic entity called race, these methods simply indicate that we can often predict how people are likely to define themselves or be recognized, even if no such category exists in nature as an objective entity (30).
The biologic content of racial/ethnic self-classification
Racial classifications that attempt validation rather than prediction presume a priori meaningful clustering of biologic traits in genetically isolated subpopulations. However, the degree and nature of that clustering remain undetermined. While many researchers are quick to cite the possible significance of various biologic differences among regional populations (e.g., sickle cell, Tay-Sachs), these sorts of traits have little relevance for etiologic research on the complex diseases that are the focus of most observational epidemiology. Although molecular analyses may reveal that regional populations vary in the frequency of some genetic markers, defining a race in genetic terms would require determining whether the human genome aggregates naturally into subunits. It is well appreciated, for example, that as few as five tandem repeat or microsatellite markers can unambiguously identify most individuals; DNA fingerprinting, the most dramatic application of this principle, has precipitated a near collapse of the American system of assigning guilt for homicide (31). This observation follows from the fundamental uniqueness of all living organisms, whereas the challenge of classification is the opposite: finding meaningful categories amidst naturally occurring variation by assuming an inherent structure in the distribution of biologic traits.
The genetic basis of the difference between men and women, for example, appears clear at the level of the genome: the X and Y chromosomes throw large "switches" that influence every cell. Nonetheless, the essential genetic difference between the sexes remains obscure in a functional sense because we do not know what the genes are or what they do. Moving to a higher level of complexity, biologists have long speculated about what differentiates two closely related species. We share 99 percent DNA sequence identity with the chimpanzee, yet we are obviously very different, and we possess no meaningful biologic metric with which to quantify that genetic difference.
These considerations reinforce the complexity of defining human races at the level of the genome. There will be no "master switches," like the X/Y chromosomes, but at most an abundance of minor variants leading to subtle differences. Classification based on many minor variants requires a method of summarizing this information, which we currently lack. One popular conceptualization of a genome is a sequence of base pairs, but DNA has no objective dimensionality, only functionality. It follows that looking at sequence differences across groups is wholly insufficient as a basis for classification and comparison when the functional meaning of the variation remains unknown. Even if it were possible to sensibly quantify genomic variation, we would still need to determine how much variation is enough to make separate races. These problems must be resolved before we can make much use of the general concept of race as a coherent biologic entity, and until then we argue that creating categories within our species more often obscures than reveals meaning. Conceivably, genetics may someday advance to the point that variation across the genome can be measured, understood, quantified, and aggregated, but our first glimpse of that task reveals how difficult this will be and how hopelessly inadequate existing racial classification schemes have been for this purpose.
LOGICAL FOUNDATIONS
What is a cause?
Studies of disease etiology are concerned with causation. The counterfactual model of causality, which dominates modern quantitative inference in biomedicine, defines a "cause" in relation to an "effect" as a contrast between (hypothetical) intervention scenarios (32–34). A consequence of this definition appears to be that factors under consideration as potential causes must be plausibly manipulable; they cannot include fixed attributes such as race (35). In the social sciences, where manipulability is rarely possible, there has been greater resistance to counterfactual definitions of causality because of this implicit restriction, although no consistent alternative definition has been proposed (36, pp. 135–8; 37, pp. 40–5). When causal definitions are tied to human action, by analogy with experimental manipulation, there is no ambiguity about the meaning of a causal attribution; the effect is a contrast between the outcome distributions under various manipulative regimens (36, pp. 70–2). When such manipulation is not tenable, even hypothetically, then effects can only correspond to contrasts between conditional distributions such as Pr(Y = y|X = x1) and Pr(Y = y|X = x2), where x1 and x2 are observed levels of X. These contrasts provide no distinction between association through causation or through a common antecedent cause, a long-standing philosophic objection to nonmanipulative approaches to defining causation (38).
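The difference between a conditional contrast and an interventional contrast can be made concrete with a small numerical sketch. The three-variable system below is entirely hypothetical: a binary common cause Z influences both X and Y, X has no effect on Y, and all probabilities are assumed values chosen for illustration. Under these assumptions the conditional contrast Pr(Y = 1|X = 1) − Pr(Y = 1|X = 0) is nonzero, whereas the contrast under hypothetical manipulation of X is exactly zero.

```python
# Hypothetical system: Z -> X and Z -> Y, with no arrow from X to Y.
# All probabilities are assumed values for illustration only.
P_Z = {0: 0.5, 1: 0.5}            # Pr(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}   # Pr(X = 1 | Z = z)


def p_y1(x, z):
    """Pr(Y = 1 | X = x, Z = z); it ignores x because X has no effect on Y."""
    return 0.6 if z == 1 else 0.2


def p_xz(x, z):
    """Joint probability Pr(X = x, Z = z)."""
    p_x1 = P_X1_GIVEN_Z[z]
    return P_Z[z] * (p_x1 if x == 1 else 1.0 - p_x1)


def conditional_risk(x):
    """Pr(Y = 1 | X = x), the quantity available without manipulation."""
    numerator = sum(p_xz(x, z) * p_y1(x, z) for z in (0, 1))
    denominator = sum(p_xz(x, z) for z in (0, 1))
    return numerator / denominator


def interventional_risk(x):
    """Pr(Y = 1 | SET[X = x]): Z keeps its own distribution when X is set."""
    return sum(P_Z[z] * p_y1(x, z) for z in (0, 1))


associational_contrast = conditional_risk(1) - conditional_risk(0)
causal_contrast = interventional_risk(1) - interventional_risk(0)
print(round(associational_contrast, 2), round(causal_contrast, 2))
```

In this sketch the association arises wholly from the common antecedent Z, which is the situation that a purely conditional definition of effect cannot distinguish from causation.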
Race as a cause
The causal effect of race in an etiologic model is presumably the contrast between outcome distributions for subjects manipulated to various racial/ethnic states, for example, Pr(Y = y|SET[Race = Black]) versus Pr(Y = y|SET[Race = White]) (36, p. 70). To estimate such quantities we must accept the existence of counterfactual distributions, such as the outcome distribution for Whites, had they been Black (39). Both the untenability of the intervention and the absurdity of the counterfactual distribution have led several authors to reject race as a valid cause in this sense (40, 41).
Positing a counterfactual racial/ethnic state may be judged plausible if racial identity is not considered to be a fundamental or unalterable characteristic of an individual (7). Individuals in racialized societies are not free to adopt any identity they wish, however, but rather must generally adhere to an identity consistent with social expectation based on phenotype and behavior. Moreover, identity is not generally a product of individual volition. If we consider a hypothetical manipulation in which a fetus in a White mother is treated to induce dark skin at birth and then endowed with an African-American self-identity by being raised culturally as Black, we would achieve something approaching the counterfactual contrast necessary for viewing race as a well-defined cause. The probability of a given outcome among treated individuals is contrasted with the probability that would have pertained if, counter to fact, the intervention had not occurred. Of course, for any given individual, only one of the two states is directly observable.
The imaginary intervention above reveals the extent to which race/ethnicity is ill-suited to be considered a cause, even if such an intervention were feasible. The hypothetical contrast is considered in the "closest possible world," in which only exposure is manipulated and all other variables are unperturbed (32; 36, pp. 238–40), because factors affected by exposure are part of its total effect (42). For race/ethnicity as an exposure, this contrast is difficult to articulate because the exposure is a state of lifelong identity. Virtually all other relevant variables in a study (e.g., diet, socioeconomic status, neighborhood characteristics) will, as consequences of exposure, be differentially distributed in the two contrasting states because few covariates are plausibly unaffected by race/ethnicity. Because this hypothetical manipulation is so global in its total effect, some have referred to social factors such as race/ethnicity as fundamental or ultimate causes (43).
IMPLICATIONS OF STUDY DESIGN
Study designs that permit race as a well-defined cause
When race/ethnicity is a trait of an individual and we wish to infer disease causation within that individual, it may be difficult to posit an alternate status. When the etiologic process under study is not internal to the individual whose race/ethnicity is assessed, however, valid causal contrasts are more readily defined. An example of this paradigm is the use of "testers" in discrimination investigations: actors who attempt activities such as renting an apartment or securing a bank loan with identical presentations (using fixed scripts) except for their racial/ethnic status. The difference in the experiences of the testers is attributable to racial discrimination, because the study design ensures that other relevant details of the encounter are held constant. Because the causal effect of race is directly estimated by contrasting the outcome distributions under each treatment in this experimental design, the use of race as an exposure is valid and interpretable. Similar conclusions have been stated regarding the causal effect of gender (36, pp. 128–30; 44).
This general approach has proliferated recently in medical research as well and constitutes a useful paradigm for understanding one aspect of racial/ethnic variation in health status (45–49). Designs may involve scripted case presentations from actors of various racial/ethnic backgrounds (50) or diagnostic decisions from duplicate medical records on which racial/ethnic status is systematically varied (51). Even for observational studies in which race/ethnicity is not directly manipulated, this genre of study still allows for valid causal inference in principle, under the assumption that other factors predictive of the outcome are included as covariates (52). Although such investigations may not be considered strictly etiologic because they address differential diagnosis and access to health services as opposed to a purely biologic hypothesis, most etiologic research on racial/ethnic differentials must address the contributions of systematic diagnostic and treatment differentials to measured status (53).
Characteristics of race as a cause in standard study designs
When a racial/ethnic contrast is estimated in standard designs and interpreted as an effect internal to study participants, inference is complicated because variables that are intrinsic are causally antecedent to nearly all measurable covariates. That is, a person's race/ethnicity is fixed prior to his/her measured social, physiologic, and psychological status; all of these measurable factors are downstream of the exposure in a racially stratified society (43). A consequence of this temporal primacy is that virtually all potential covariates in analyses of racial/ethnic disparities are causal intermediates. It has been suggested that adjustment for covariates, such as social class in racial/ethnic comparisons, risks overcontrol because social class is itself affected by race (2, 39). Indeed, conditioning on almost any other covariate will bias estimates of total effect because adjustment for causal intermediates using standard methods is not generally valid (42).
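The overcontrol problem can be illustrated with a minimal numerical sketch. The model below is hypothetical and deliberately generic: a binary exposure A affects a binary intermediate M, and both affect the outcome, with all probabilities assumed for illustration and no confounding anywhere in the system. Even in this idealized setting, the contrast within strata of the intermediate differs from the total effect of exposure.

```python
# Hypothetical model: exposure A -> intermediate M -> outcome Y, plus A -> Y.
# All probabilities are assumed values; there is no confounding by construction.
P_M1_GIVEN_A = {0: 0.3, 1: 0.7}   # Pr(M = 1 | A = a)


def risk(a, m):
    """Pr(Y = 1 | A = a, M = m) under an assumed additive risk model."""
    return 0.1 + 0.1 * a + 0.3 * m


def total_risk(a):
    """Pr(Y = 1 | A = a), averaging over the intermediate as shifted by A."""
    p_m1 = P_M1_GIVEN_A[a]
    return (1.0 - p_m1) * risk(a, 0) + p_m1 * risk(a, 1)


# Total effect: includes the pathway that runs through the intermediate.
total_effect = total_risk(1) - total_risk(0)

# Contrast within each stratum of M, as obtained by adjusting for M.
stratum_effects = {m: risk(1, m) - risk(0, m) for m in (0, 1)}

print(round(total_effect, 2), {m: round(e, 2) for m, e in stratum_effects.items()})
```

Under these assumed values the total risk difference is 0.22, whereas the difference within either stratum of the intermediate is 0.10, so the adjusted estimate does not recover the total effect.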
ANALYTICAL ISSUES
Effect decomposition by adjustment for consequences of race/ethnicity
Some authors acknowledge that adjustment for consequences of racial/ethnic status yields a biased estimate of total effect, but they contend that adjustment for causal intermediates decomposes the total effect into indirect effects (transmitted through the intermediate) and direct effects (transmitted through unspecified pathways) (figure 1a). This is the strategy underlying the common practice of adjusting for socioeconomic variables to see if the race effect "goes away" (e.g., testing the null hypothesis that the partial correlation between race/ethnicity and outcome equals zero). This is presently the most common analytical framework for studying race/ethnicity in epidemiologic research (54–58). Despite the popularity of this approach, it is highly prone to providing misleading inference. Not only is this method likely to suggest spurious direct effects of race due to misspecification of the intermediates (22, 59, 60), but it is also prone to bias because adjustment for intermediates does not generally provide valid estimates of direct effects (36, pp. 163–5; 61). The exchangeability conditions that provide for valid causal inference for a given exposure are not sufficient to provide for separate identification of direct and indirect effects of that exposure (36, pp. 127–8).
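One well-known mechanism by which adjustment for an intermediate fails to identify a direct effect is an unmeasured common cause of the intermediate and the outcome. The sketch below assumes such a structure for illustration; it is not a reconstruction of figure 1. Exposure A is independent of the unmeasured factor U, both A and U affect the intermediate M, and the outcome depends only on M and U, so the true direct effect of A is zero by construction. All probabilities are assumed values.

```python
# Hypothetical structure: A -> M <- U and U -> Y, M -> Y; no arrow from A to Y.
# U is unmeasured and independent of A. All probabilities are assumed values.
P_U1 = 0.5


def p_m1(a, u):
    """Pr(M = 1 | A = a, U = u)."""
    return 0.1 + 0.4 * a + 0.4 * u


def p_y1(m, u):
    """Pr(Y = 1 | M = m, U = u); A does not appear, so its direct effect is null."""
    return 0.1 + 0.2 * m + 0.5 * u


def risk_given_a_m(a, m):
    """Pr(Y = 1 | A = a, M = m), averaging over U within the (a, m) stratum."""
    weights = {}
    for u in (0, 1):
        p_u = P_U1 if u == 1 else 1.0 - P_U1
        p_m = p_m1(a, u) if m == 1 else 1.0 - p_m1(a, u)
        weights[u] = p_u * p_m            # proportional to Pr(U = u | a, m)
    norm = sum(weights.values())
    return sum(weights[u] / norm * p_y1(m, u) for u in (0, 1))


# Apparent "direct effect" of A within each stratum of the intermediate.
adjusted_contrast = {m: risk_given_a_m(1, m) - risk_given_a_m(0, m) for m in (0, 1)}
print({m: round(c, 4) for m, c in adjusted_contrast.items()})
```

Within each stratum of M the exposure is associated with the outcome (a risk difference of about -0.095 under these assumed values), although the exposure has no direct effect. Conditioning on the intermediate induces an association between A and U, which is then transmitted to the outcome.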
Race/ethnicity as a covariate
A common use of racial/ethnic categorization in observational research is as a covariate when another quantity is the primary exposure of interest. Adjustment in this context is equivalent to standardizing the distribution of study subjects to some alternate set of racial/ethnic proportions. Adjustments of this sort have been criticized because they provide no insight into the role or the meaning of the race/ethnicity quantity (63, 64). Despite this problem, adjustment does not invoke the logical and technical dilemmas described above. Whatever the unspecified myriad factors for which racial/ethnic status is a surrogate, these may be partially controlled when analyses are stratified or standardized by this variable. Because this is analogous to simply stratifying or sampling the populations with weighted probabilities (Pr), it does not lead to the same problems of interpretation created by focusing on race/ethnicity as the causal factor of interest.
For example, table 1 shows a hypothetical population (n = 18,750) with two race groups (r0, r1), two SES levels (i0, i1), and binary outcome D (d0, d1). The crude association between SES and disease (relative risk = 5.14) is confounded by an unbalanced representation of levels of SES within racial/ethnic groups in the observed population, so that the association measure does not equal the true causal contrast that would result from intervention on SES. Reweighting each cell by [Pr(SES)/Pr(SES|race)], we obtain a pseudopopulation with the same number of SES = i1 (n = 3,750) and SES = i0 (n = 15,000) but that is standardized to a new joint distribution so that race/ethnicity no longer confounds the relation between SES and outcome D (65). In the reweighted data in table 2, Pr(SES = i1|race = r0) = Pr(SES = i1|race = r1) = 0.2, so that no confounding is expected.
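The reweighting step can be sketched in a few lines of code. The cell counts below are invented for illustration and are not the values of table 1; they are chosen only so that the margins agree with the totals quoted above (n = 18,750, with 3,750 at SES = i1 and 15,000 at SES = i0) and so that the relative risk for SES is 2.0 within each race group. Each cell is multiplied by Pr(SES)/Pr(SES|race), and the relative risk is recomputed in the resulting pseudopopulation.

```python
# Hypothetical cell counts: (race, ses) -> (persons, cases).
# These are assumed values, NOT the counts of table 1.
cells = {
    ("r0", "i1"): (1250, 50),
    ("r0", "i0"): (11250, 225),
    ("r1", "i1"): (2500, 500),
    ("r1", "i0"): (3750, 375),
}


def relative_risk(table):
    """Risk at SES = i1 divided by risk at SES = i0, collapsing over race."""
    risks = {}
    for level in ("i1", "i0"):
        persons = sum(n for (race, ses), (n, d) in table.items() if ses == level)
        cases = sum(d for (race, ses), (n, d) in table.items() if ses == level)
        risks[level] = cases / persons
    return risks["i1"] / risks["i0"]


total = sum(n for n, d in cells.values())
# Marginal Pr(SES) and the size of each race group in the observed data.
p_ses = {
    level: sum(n for (race, ses), (n, d) in cells.items() if ses == level) / total
    for level in ("i1", "i0")
}
n_race = {
    group: sum(n for (race, ses), (n, d) in cells.items() if race == group)
    for group in ("r0", "r1")
}

# Reweight each cell by Pr(SES) / Pr(SES | race).
reweighted = {}
for (race, ses), (n, d) in cells.items():
    weight = p_ses[ses] / (n / n_race[race])
    reweighted[(race, ses)] = (n * weight, d * weight)

crude_rr = relative_risk(cells)
standardized_rr = relative_risk(reweighted)
print(round(crude_rr, 2), round(standardized_rr, 2))
```

With these assumed counts the crude relative risk is about 3.67, whereas the reweighted relative risk equals the within-race value of 2.0. The SES margins are unchanged by the reweighting, and Pr(SES = i1|race) is 0.2 in both race groups of the pseudopopulation.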
SUMMARY RECOMMENDATIONS
When total effect of race/ethnicity is the quantity of interest
Surveillance. In the description of crude population incidence or prevalence, results stratified by race/ethnicity may be crucial for documenting existing inequalities and monitoring disparities over time (67) (table 3). This is an important activity for assessing the population burden of disease, allocating public health and medical resources, and motivating etiologic research. A notable potential drawback is the inadvertent reification of race as a biomedical quantity (68). Nonetheless, because of dramatic racial/ethnic disparities for many conditions, this is generally considered a significant and consequential research program.
Etiologic research: the ethnic paradigm. There is no unambiguous causal interpretation to total race effect estimates in the context of etiologic research. It is questionable, therefore, whether this should ever be a quantity of interest for biomedical researchers, except in cases when race is a marker for a process external to individual physiology, as in the investigation of health services disparities or other sociologic questions. In the event that an investigator is convinced that a pathophysiologic racial/ethnic effect is meaningful, however, covariate sets for adjustment must be chosen cautiously. Factors that may confound the estimated effect measures for race are other invariant attributes of the individuals, including sex, age, and genetic factors. Estimates for the total effect of race should not generally be adjusted for or stratified by other covariates.
When direct and indirect effects of race/ethnicity are the quantities of interest
Although a common analytical strategy is to adjust race/ethnicity for social factors in order to identify a direct biologic effect that is not mediated by measured covariates, this approach is highly problematic. Even if an interpretation could be granted to a racial/ethnic effect, the rigid assumptions necessary in order to render this decomposition strategy reliably valid are so far from plausible that it is difficult to imagine any useful and cogent inference resulting from this practice.
In studies of health care epidemiology, potentially valid covariates for adjustment depend on the particular design but are considerably less limited by consideration of causal order than in studies of individual pathophysiology. For example, in the study of the etiology of heart disease, comorbid diabetic status is affected by race/ethnicity and therefore not a candidate for adjustment when race/ethnicity is the factor of interest. On the other hand, when race/ethnicity is the factor of interest in a study of heart disease diagnosis or management (50), the causal process is external to the study participant. In this case, a comorbid condition, such as diabetes, is not causally subsequent to the exposure of interest and would often be an important and valid candidate for statistical adjustment in the effort to estimate an unbiased effect of race/ethnicity.
When the effect of a variable confounded by race/ethnicity is the quantity of interest
Adjustment for race/ethnicity may be reasonable when attempting to estimate the causal effect of another factor of interest (i.e., when race/ethnicity is merely a nuisance in the data). This use of racial/ethnic information has been criticized because researchers often fail to describe what they believe race/ethnicity represents in such a model. Although an understanding of the observed relation between race/ethnicity and the outcome is not furthered by this usage, this does not detract from the utility of improving the effect estimation of interest. Nonetheless, although conditioning on racial/ethnic status may reduce bias, other more specific measures, for which racial/ethnic status is acting as a rough surrogate, may reduce bias more effectively.
ACKNOWLEDGMENTS
Supported in part by grant R01-HD-39746 from the National Institute of Child Health and Human Development.
The authors thank Chandra Ford and Dr. Sol Kaufman for their insightful critiques of early drafts of the manuscript.
NOTES
Correspondence to Dr. Jay S. Kaufman, Department of Epidemiology (CB#7435), University of North Carolina School of Public Health, McGavran-Greenberg Hall, Pittsboro Road, Chapel Hill, NC 27599-7435 (e-mail: Jay_Kaufman{at}unc.edu).
Editor's note: An invited commentary on this article appears on page 299, and the authors' response appears on page 305.
REFERENCES