A Method for Distinguishing Consanguinity and Population Substructure Using Multilocus Genotype Data

A. D. J. Overall and R. A. Nichols

School of Biological Sciences, Queen Mary, University of London, London, England


    Abstract
 
We use the patterns of homozygosity at multiple loci to distinguish between excess homozygosity caused by consanguineous mating and that due to undetected population subdivision (the Wahlund effect). Clarification of the underlying causes of excess homozygosity is of practical importance in explaining the occurrence of recessive genetic disorders and in forensic match probability calculations. We calculated a likelihood surface for two parameters: C, the proportion of the population practicing consanguinity, and θ, the genetic correlation due to population subdivision. To illustrate the method, we applied it to multilocus genotypic data from two U.K. Asian populations, one practicing a high frequency of cousin marriage and another in which caste endogamy was suspected. The method successfully distinguished the different patterns of relatedness and returned accurate estimates of C and θ from simulated data sets. We show how the method can be extended to allow for degrees of inbreeding closer than cousin unions, including selfing. With closer inbreeding, the relatedness of recent ancestors beyond the parents becomes an issue.


    Introduction
 
Traditional methods for estimating the magnitude of inbreeding are generally functions of excess homozygosity (Wright 1921; Nei 1987, p. 159). However, the excess might be attributed to consanguineous mating or population substructure, or it might be an artifact due to factors like null alleles (Brookfield 1996). Because the consequences for health and forensic science are different, useful insights could be gained from an approach that can distinguish between them. Here, we describe a simple likelihood-based method that can differentiate between consanguinity and substructure through analysis of multilocus genotypes. We illustrate it using microsatellite data from two U.K. Asian populations displaying similar homozygote excesses yet practicing quite different patterns of marriage.

These data illustrate the principle that undetected substructure can be confused with consanguinity, because both produce excess homozygosity over Hardy-Weinberg (HW) expectations, under which genotype proportions are given by the product of allele proportions. Population substructure is often viewed as the consequence of geographic subdivision, but there are several alternatives. There may be incomplete admixture between components of the population because of assortative mating, niche specialization, or, particularly in the case of humans, cultural differences.

Our method detects current population subdivision. Past episodes of admixture can be detected by other methods because components of linkage disequilibrium persist for several generations, even in the face of random mating (Hartl and Clark 1997, p. 105). Our method differs in making use of the patterns of excess homozygosity and linkage disequilibrium, which do not persist beyond one generation of admixture.

If we had perfect knowledge of the population subdivision, it would be relatively simple to distinguish between the two causes of excess homozygosity. This can be achieved using a hierarchical approach incorporating F statistics (Wright 1921; Weir and Cockerham 1984). These measures deal with the apportionment of allelic identity within individuals and populations, which are sometimes further divided into subpopulations.

Consanguinity produces excess homozygosity over that expected from the subpopulation allele frequencies. This is quantified by the statistic FIS, which is the correlation measuring the increased probability of a match between a pair of alleles from the same individual (I) compared with pairs drawn from the subpopulation (S) (Cockerham 1973).

If the population is divided into partially isolated subpopulations, individuals from the same subpopulation have an increased probability of sharing a common ancestor and hence an increased probability of homozygosity. It is possible to use F statistics to measure this increase at one hierarchical level relative to one more inclusive (Hartl and Clark 1997, p. 117). For example, if a population is divided into regional populations (R) comprising distinct subpopulations (S), the differentiation between subpopulations is measured as FSR. However, if we are unaware of the finer subdivisions and pool a sample of individuals from different subpopulations, we will observe an apparent excess of homozygosity that might be falsely attributed to consanguinity. This amounts to a confusion of FSR with FIS.

In many instances, investigators might not detect the finer population subdivision. In the case of human populations, they could be unaware of restrictions on marriage that are subtle or unreported. Our method is a likelihood-based approach that can detect such situations by making use of the patterns found in multilocus data.


    Materials and Methods
 
To illustrate the method of estimation, consider a hypothetical population composed of a number of smaller subpopulations. Differentiation between these subpopulations' allele frequency distributions is quantified by the correlation θ. This is equivalent to the correlation coefficient FST, which could be estimated directly from the genetic data if the subpopulations could be identified (Cockerham 1973). This situation is usually modeled by a set of subpopulations of equal size, with differentiation from each other being characterized by FST = θ. In practice, θ will be a weighted average across heterogeneous subpopulations; the implications are discussed by Nagylaki (1998). In the present case, we use θ to quantify the effect of population subdivision on homozygosity, so it is a correlation between uniting gametes drawn from the same subpopulation. The contribution of each subpopulation to this correlation will depend on its representation in our sample. In our simulations, we assume that individuals in our sample are drawn from subpopulations in proportion to each subpopulation size.

The probability of drawing two alleles of type A from one of these subpopulations is given by the standard equation

\[
P(AA) = p_A\,[\theta + (1 - \theta)\,p_A] \tag{1}
\]

where $p_A$ is the frequency of allele A in the total population.

For two different alleles, A and B, the probability is

\[
P(AB) = 2\,p_A\,p_B\,(1 - \theta)
\]

In this situation, where consanguinity is assumed to be zero, the genealogies of genes at different loci are essentially independent (Balding and Nichols 1994). Nevertheless, the drift between subpopulations will result in random associations within these subpopulations, between alleles at different loci. Although there are correlations between loci, the probability of a multilocus genotype can be calculated by taking the simple product of the single-locus probabilities given in equations (1) with an appropriate value of θ. Because this calculation takes into account the effects of drift on the allele frequencies at each locus, it allows for the linkage disequilibrium that drift has generated between unlinked loci (Balding and Nichols 1994).
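
The single-locus and multilocus calculations described above can be sketched in a few lines of Python. This is an illustration only: it assumes that equations (1) take the standard form P(AA) = p_A[θ + (1 - θ)p_A] and P(AB) = 2 p_A p_B (1 - θ), which agrees with the homozygosity expression used later in this section. The function names and the dictionary layout for allele frequencies are hypothetical and not taken from the original analysis.

```python
def genotype_prob_substructure(a, b, freqs, theta):
    """Single-locus genotype probability under the assumed form of equations (1).

    a, b  : allele labels making up the genotype
    freqs : dict mapping allele label -> frequency in the total population
    theta : correlation between uniting gametes caused by subdivision
    """
    pa, pb = freqs[a], freqs[b]
    if a == b:
        # Homozygote: p_A * [theta + (1 - theta) * p_A]
        return pa * (theta + (1.0 - theta) * pa)
    # Heterozygote: 2 * p_A * p_B * (1 - theta)
    return 2.0 * pa * pb * (1.0 - theta)


def multilocus_prob_substructure(genotype, freqs_by_locus, theta):
    """Multilocus probability as the simple product over loci.

    genotype       : list of (allele, allele) tuples, one per locus
    freqs_by_locus : list of allele-frequency dicts, one per locus
    """
    prob = 1.0
    for (a, b), freqs in zip(genotype, freqs_by_locus):
        prob *= genotype_prob_substructure(a, b, freqs, theta)
    return prob
```

With θ = 0 the functions reduce to ordinary HW proportions, which gives a convenient sanity check.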

Next, we need to consider the unions of the individuals within the subpopulations. Within populations that traditionally practice consanguinity, such as the U.K. Pakistani communities (Darr and Modell 1988; Overall 1998), consanguinity is essentially due to unions between first cousins. The probability that the offspring of first cousins will inherit two identical-by-descent (IBD) genes at any one locus is approximately 1/16. We will call this probability R. The proportion of the population that practice consanguinity will be represented by the symbol C.

In the offspring of consanguineous individuals, the probability of observing two copies of an allele A which trace their ancestry to an allele in the grandparental generation is $p_A R$. With probability (1 - R), the offspring of first cousins will not inherit alleles IBD. However, if the sample population is also substructured, then the two ancestral lineages are drawn from the same subpopulation (ignoring migration in the last two generations) so that the probabilities of identity given by equations (1) apply. Putting these two expressions together gives

\[
\begin{aligned}
P(AA) &= p_A R + (1 - R)\,p_A\,[\theta + (1 - \theta)\,p_A], \\
P(AB) &= (1 - R)\,2\,p_A\,p_B\,(1 - \theta)
\end{aligned}
\tag{2}
\]

for homozygotes and heterozygotes, respectively.

Again, this single-locus approach can be extended to the evaluation of multilocus genotype probabilities by taking the product of single-locus probabilities. Here, R, the magnitude of excess homozygosity through alleles IBD, is the measure of association between loci caused by consanguinity.
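
As a minimal sketch of the genotype probabilities for the offspring of first cousins, the function below combines the IBD term p_A R with the non-IBD term weighted by (1 - R). It assumes equations (2) have the form P(AA) = p_A R + (1 - R) p_A[θ + (1 - θ)p_A] and P(AB) = (1 - R) 2 p_A p_B (1 - θ), as the verbal description suggests. The function name and argument layout are hypothetical.

```python
R_FIRST_COUSIN = 1.0 / 16.0  # approximate IBD probability for offspring of first cousins


def genotype_prob_consanguineous(a, b, freqs, theta, R=R_FIRST_COUSIN):
    """Single-locus genotype probability for offspring of consanguineous unions.

    freqs : dict mapping allele label -> frequency in the total population
    theta : correlation due to population subdivision
    R     : probability that the two alleles are identical by descent
    """
    pa, pb = freqs[a], freqs[b]
    if a == b:
        # IBD with probability R; otherwise the substructure homozygote term applies
        return R * pa + (1.0 - R) * pa * (theta + (1.0 - theta) * pa)
    # Heterozygotes can only arise when the alleles are not IBD
    return (1.0 - R) * 2.0 * pa * pb * (1.0 - theta)
```

For an allele at frequency 0.2 with θ = 0, the homozygote probability is 0.2/16 + (15/16)(0.04) = 0.05, compared with 0.04 under random mating.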

We wish to estimate the parameters θ and C using l, the likelihood of the sample. This likelihood function incorporates equations (1) and (2) to calculate the probability of the observed data given the parameter values θ and C. The product is over loci and individuals:

\[
l(C, \theta) = \prod_{i} \left[ (1 - C) \prod_{j} G_{1ij} + C \prod_{j} G_{2ij} \right] \tag{3}
\]

where G1 and G2 are the genotypic probabilities given in equations (1) and (2), respectively (written $G_{1ij}$ and $G_{2ij}$ for individual i at locus j). By considering separately the association of genes between loci resulting from substructure (G1) and consanguinity (G2), equation (3) is able to identify the most likely parameter combination for explaining the excess homozygosity observed within a multilocus data set.
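
The likelihood can be sketched as a mixture over individuals: with probability (1 - C) an individual's multilocus genotype is governed by G1, and with probability C by G2. The code below is one possible reading of equation (3), not the authors' implementation. It works with log-likelihoods to avoid numerical underflow, and the grid search, function names, and default grid limits are assumptions made for illustration.

```python
import math


def _g1(a, b, freqs, theta):
    """Assumed form of equations (1): substructure only."""
    pa, pb = freqs[a], freqs[b]
    if a == b:
        return pa * (theta + (1.0 - theta) * pa)
    return 2.0 * pa * pb * (1.0 - theta)


def log_likelihood(sample, freqs_by_locus, C, theta, R=1.0 / 16.0):
    """Log of the mixture likelihood over individuals.

    sample         : list of individuals, each a list of (allele, allele) per locus
    freqs_by_locus : list of allele-frequency dicts, one per locus
    """
    total = 0.0
    for genotype in sample:
        p1 = 1.0  # product of G1 over loci
        p2 = 1.0  # product of G2 over loci
        for (a, b), freqs in zip(genotype, freqs_by_locus):
            g1 = _g1(a, b, freqs, theta)
            ibd = R * freqs[a] if a == b else 0.0
            p1 *= g1
            p2 *= ibd + (1.0 - R) * g1
        total += math.log((1.0 - C) * p1 + C * p2)
    return total


def grid_search(sample, freqs_by_locus, steps=50, theta_max=0.2, R=1.0 / 16.0):
    """Evaluate the log-likelihood on a regular (C, theta) grid; return the best cell."""
    best = None
    for i in range(steps + 1):
        C = i / steps
        for j in range(steps + 1):
            theta = theta_max * j / steps
            ll = log_likelihood(sample, freqs_by_locus, C, theta, R)
            if best is None or ll > best[2]:
                best = (C, theta, ll)
    return best
```

A grid is used here because the text later describes evaluating the likelihood over a grid of parameter combinations; a numerical optimizer would be an alternative if only the maximum were needed.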

In the present situation, we are concerned specifically with cousin marriages. In other cases, the proportion of the population not mating at random will have a variety of different degrees of relatedness, which cannot be represented by a single scalar, R. Instead, we require a vector with elements Ri corresponding to the relatedness produced by each of n categories of union (i = 1, ..., n). In that case, the frequency of parents in each class can be represented by the corresponding element of another vector D. A specific example of this type of calculation is described below (Discussion) for a selfing population, in which the various values of Di can be calculated from a single parameter C: Di = C^k(1 - C). In other cases, it is possible to jointly estimate all the values of D using Markov chain Monte Carlo methods such as Gibbs sampling or the Metropolis-Hastings algorithm (Gelman et al. 1995).

Equation (3) does ignore any weak correlation that may occur between individuals, particularly the likelihood that individuals from the same subpopulation will be homozygous at the same loci. This issue appears to be important only when there are very few subpopulations of unequal size, and it is addressed by the simulation study below.

The ability of this method to identify the joint likelihood of C and θ relies on the observation that consanguinity and population subdivision can generate distinctive patterns of homozygosity in multilocus data. For example, in the case of a population practicing consanguineous unions, a sample may contain a mixture of offspring: those from consanguineous unions and those not from consanguineous unions. The former are expected to have excess homozygosity due to IBD alleles at each of their loci, whereas the latter, the offspring of random mating, are expected to have none.

On the other hand, a sample drawn from a subdivided population is expected to contain a mixture of individuals from different subpopulations. Here, we consider the case in which we know the average allele frequencies for only the whole population, because we are unaware of the subpopulation origin of each individual. If different subpopulations have different allele frequencies at their loci, then individuals drawn from the different subpopulations are expected to be homozygous at different loci.

An example to illustrate the different patterns is given in figure 1. Figure 1a shows 10-locus genotypes for a randomly mating population. The black circles represent the homozygote loci, and the open circles represent the heterozygote loci, each in HW proportions. Figure 1b represents the case in which a population is subdivided into four subpopulations. The global allele frequencies are the same as those in the panmictic population (fig. 1a). Each subpopulation is in HW equilibrium, but there is an excess of homozygosity due to variation in allele frequencies between the subpopulations. The excess is spread across loci and across subpopulations. Figure 1c shows a population with high consanguinity, of which 50% are the offspring of first-cousin unions. Again, the global allele frequencies are the same as those in figure 1a, and the excess homozygosity is of the same magnitude as that in figure 1b. The excess in this case, however, is confined to that portion of the population that is consanguineous. Each consanguineous individual is therefore more likely to be homozygous at every locus than are individuals from figure 1b, so these individuals tend to be homozygous at a larger number of loci, whereas the nonconsanguineous individuals are in the HW proportions.



 
Fig. 1. Simple population scenarios in which each of the four individuals' 10-locus genotypes are represented by open circles for heterozygous loci and by closed circles for homozygous loci. a, This random-mating population has no excess homozygosity. b, This population is subdivided into four subpopulations, each in Hardy-Weinberg equilibrium. The global allele frequencies are the same as those in a, but variations in frequencies between subpopulations result in a global excess of homozygosity. c, The global allele frequencies are again the same as those in a and b, and the excess homozygosity is of the same magnitude as that in b. This population is divided into two equal-sized groups. One group (top) comprises consanguines, and the other (bottom) comprises progeny from unrelated parents. The distribution of homozygous loci among individuals differs between groups b and c.

 
Our method of analysis makes use of this pattern. Not only does the method identify whether excess homozygosity within a sample is due to consanguinity or substructure (or both), it also estimates the proportion of individuals born to consanguines and the magnitude of substructure.

To further illustrate the underlying differences between the pattern generated by substructure and that generated by consanguinity, we contrast the two cases by plotting the probability of s homozygous loci per individual, where s = 0, 1, ..., 10. The number of loci at which an individual is expected to be homozygous follows a binomial distribution (Sokal and Rohlf 1995, p. 71) in which p, the probability of homozygosity, is given by equation (1) or (2), depending on the scenario considered. In the case of substructure, the number of homozygous loci per individual is distributed as B(10, p), where, for t alleles, $p = \sum_{i=1}^{t} p_i[\theta + (1 - \theta)p_i]$, with θ = 1/32. In the case of 50% consanguinity, half the population is distributed with $p = \sum_{i=1}^{t} p_i[R + (1 - R)p_i]$, where R = 1/16, and the other half is distributed with $p = \sum_{i=1}^{t} p_i^2$. The homozygosity through IBD is therefore equivalent in both scenarios (50% × 1/16 = 1/32). The probability of homozygosity is clearly a function of the allele frequency, so the distributions are given for alleles at global frequencies of p_i = 0.25 (fig. 2a) and p_i = 0.05 (fig. 2b). The distributions are noticeably different, particularly at lower frequencies, demonstrating the benefits of using highly polymorphic markers. The figure also gives some idea of the sample size needed to detect the difference between the two cases. In figure 2b, for example, the largest differences between the two distributions occur in the categories of 0 and 1 homozygous loci per individual. Consequently, the sample size would need to be large enough to detect frequencies that differ, here, by less than 0.05. The discrepancy between the distributions depends on the number of alleles considered and also becomes more marked with increasing numbers of loci.
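
The two distributions contrasted above can be computed directly. The sketch below builds the binomial distribution for the substructured case and the equal mixture of two binomials for the 50% consanguinity case, using the parameter values quoted in the text (θ = 1/32, C = 0.5, R = 1/16, 10 loci). The function names are hypothetical, and all loci are assumed to share the same allele frequencies, as in the figure.

```python
from math import comb


def binom_pmf(s, n, p):
    """Probability of s successes in n independent trials with success probability p."""
    return comb(n, s) * p**s * (1.0 - p) ** (n - s)


def homozygosity(freqs, f):
    """Sum over alleles of p_i * [f + (1 - f) * p_i]; f is theta or R (0 gives HW)."""
    return sum(p * (f + (1.0 - f) * p) for p in freqs)


def substructure_dist(freqs, theta=1.0 / 32.0, n_loci=10):
    """Distribution of the number of homozygous loci under substructure alone."""
    p = homozygosity(freqs, theta)
    return [binom_pmf(s, n_loci, p) for s in range(n_loci + 1)]


def consanguinity_dist(freqs, C=0.5, R=1.0 / 16.0, n_loci=10):
    """Mixture: a fraction C are offspring of cousins, the rest are in HW proportions."""
    p_in = homozygosity(freqs, R)
    p_out = homozygosity(freqs, 0.0)
    return [
        C * binom_pmf(s, n_loci, p_in) + (1.0 - C) * binom_pmf(s, n_loci, p_out)
        for s in range(n_loci + 1)
    ]


if __name__ == "__main__":
    freqs = [0.05] * 20  # 20 equally frequent alleles, as in figure 2b
    for s, (q_sub, q_con) in enumerate(zip(substructure_dist(freqs), consanguinity_dist(freqs))):
        print(s, round(q_sub, 4), round(q_con, 4))
```

Both distributions have the same mean number of homozygous loci, because the average excess homozygosity is 1/32 in each case; they differ only in how that excess is spread among individuals, which is the signal the method exploits.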



 
Fig. 2. The probability (y-axis) for the number of homozygous loci per individual (x-axis) within a substructured population (θ = 1/32) and a consanguineous population (C = 0.5, R = 1/16). a, All four alleles with frequencies of p = 0.25. b, All 20 alleles with frequencies of p = 0.05.

 
To examine the performance of equation (3), we tested its precision and accuracy using simulated data.

Simulation Study
As a starting point for our simulations, we used the U.S. Caucasian frequency distributions for the 10 SGM Plus loci (PE Biosystems 1999), which have an average of 11 alleles per locus. For the first study, samples from populations with 50% first-cousin offspring and no substructure were simulated. The U.S. Caucasian allele frequency distributions were used to represent the distributions of a single homogeneous population. From these frequency distributions, arrays of all possible single-locus genotypes were generated in proportions according to equations (1) and (2) for random mating (RM) and consanguineous populations, respectively. For example, the probability of drawing a homozygote for an allele with frequency 0.2 would be 0.04 under the RM model, but it would be 0.05 under the consanguineous model.

When the results for a consanguineously mating population were generated, the first 50% of the sample represented individuals from nonconsanguineous parents. For each of these individuals, a random draw was made from the RM genotypic arrays for each of the 10 loci. The remaining 50% of the sample were consanguineous individuals, and for each of these, a random draw was made from the consanguineous genotypic arrays for each locus. Each sample consisted of 1,000 individuals with genotypes at 10 loci. This sampling procedure was repeated 10 times. Each resulting likelihood surface peaked at the desired parameter combination, but often with broad confidence intervals. The 10 replicate data sets were pooled to produce the narrow likelihood surface shown in figure 3.
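
The sampling scheme for the first study might be sketched as follows. This is an illustration under stated assumptions, not the original simulation code: there is no substructure (θ = 0), the genotype arrays are built from the assumed forms of equations (1) and (2), and the function names are hypothetical. Any allele frequencies passed in would be placeholders unless the published U.S. Caucasian SGM Plus frequencies are supplied.

```python
import random


def genotype_array(freqs, R):
    """All unordered single-locus genotypes with their probabilities.

    R = 0 gives random-mating (HW) proportions; R = 1/16 gives the
    proportions for offspring of first cousins (no substructure assumed).
    """
    alleles = sorted(freqs)
    genotypes, weights = [], []
    for i, a in enumerate(alleles):
        for b in alleles[i:]:
            pa, pb = freqs[a], freqs[b]
            if a == b:
                w = R * pa + (1.0 - R) * pa * pa
            else:
                w = (1.0 - R) * 2.0 * pa * pb
            genotypes.append((a, b))
            weights.append(w)
    return genotypes, weights


def simulate_sample(freqs_by_locus, n=1000, C=0.5, R=1.0 / 16.0, rng=None):
    """Draw n multilocus genotypes; the first (1 - C) fraction are from random mating."""
    rng = rng or random.Random()
    rm_arrays = [genotype_array(f, 0.0) for f in freqs_by_locus]
    con_arrays = [genotype_array(f, R) for f in freqs_by_locus]
    n_rm = round(n * (1.0 - C))
    sample = []
    for k in range(n):
        arrays = rm_arrays if k < n_rm else con_arrays
        individual = [rng.choices(g, weights=w, k=1)[0] for g, w in arrays]
        sample.append(individual)
    return sample
```

Passing a seeded `random.Random` instance as `rng` makes a replicate reproducible, which is useful when comparing likelihood surfaces across repeated samples.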



 
Fig. 3. The consensus likelihood surface for 10 simulations of samples from populations composed of 50% first-cousin offspring, with no substructure. Each population is composed of 1,000 individuals with 10 loci (an average of 11 alleles). The axes in this and subsequent likelihood surfaces represent C, the proportion of the sample that is consanguineous, and θ, the magnitude of subdivision. The shaded envelopes enclose 5%, 10%, and 50% of the most likely values.

 
The second study simulated subpopulations that were differentiated by θ = 0.05, but with each in HW equilibrium (C = 0). Ten equally divergent subpopulations were generated by simulating drift between them for 4N generations, where N = 1,000. Each of the subpopulations started at generation 0 with the U.S. Caucasian allele frequencies. One of the subpopulations maintained these allele frequencies throughout the 4N generations to prevent fixation. The migrants were drawn randomly each generation with probabilities proportional to their average frequencies in the nine "other" subpopulations. The number of haploid migrants in each generation was Nm = [(1/θ) - 1]/2. This procedure resulted in 10 allele frequency distributions for each of 10 subpopulations. From each subpopulation's allele frequency distributions, 100 ten-locus genotypes were generated following the same procedure outlined for the first study. Because each subpopulation was required to be in HW equilibrium, only equations (1) were used to generate these genotypes. The pooled data sets of 1,000 individuals were treated as a sample from a substructured population and analyzed using equation (3). This sampling procedure was repeated 10 times. Again, each analysis of each 1,000-individual sample produced a likelihood peak at the desired parameter combinations, but pooling all 10 realizations generated the tighter likelihood surface of figure 4.
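
As a quick arithmetic check of the migration formula quoted above, substituting the target differentiation θ = 0.05 gives the number of haploid migrants per generation implied for this study:

```latex
\[
Nm = \frac{(1/\theta) - 1}{2} = \frac{(1/0.05) - 1}{2} = \frac{19}{2} = 9.5
\]
```

That is, on the order of nine or ten migrant gene copies per generation are needed to hold differentiation near θ = 0.05 under this model.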



 
Fig. 4. The consensus likelihood surface for 10 simulations of samples from populations comprising 10 subpopulations simulated to be differentiated by θ = 0.05, with no consanguinity.

 
Because each analysis generates likelihood values for each simulated sample, the output matrices from replicate samples can be multiplied together, element by element, to give the consensus likelihood surfaces shown in figures 3 and 4. This property of the method is put to use below, where two independent data sets have been obtained for each of two U.K. Asian populations. It was necessary, however, to ensure that the data sets used did not share replicate individuals.
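
Multiplying likelihood matrices element by element is equivalent to adding the corresponding log-likelihood matrices, which is numerically safer when the likelihoods are very small. A minimal sketch, assuming each surface is stored as a list of rows of log-likelihood values evaluated on the same (C, θ) grid (the storage layout and function name are hypothetical):

```python
def combine_surfaces(log_surfaces):
    """Sum log-likelihood surfaces cell by cell to form a consensus surface.

    log_surfaces : list of 2-D grids (lists of rows) with identical shapes,
                   one per independent data set.
    """
    n_rows = len(log_surfaces[0])
    n_cols = len(log_surfaces[0][0])
    return [
        [sum(surface[i][j] for surface in log_surfaces) for j in range(n_cols)]
        for i in range(n_rows)
    ]
```

The sum is only a valid joint log-likelihood when the data sets are independent, which is why replicate individuals must not be shared between them.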

The third study simulated samples from a population that was both substructured (θ = 0.03) and consanguineous (C = 50%), combining the procedures outlined in both previous simulation studies. The likelihood surface is shown in figure 5.



 
Fig. 5. The consensus likelihood surface for 10 simulations of samples from populations comprising 10 subpopulations differentiated by θ = 0.03 and 50% consanguinity.

 
The overall likelihood for each of these three consensus data sets was calculated for a grid of 10,000 combinations of C and θ values. These values were sorted into rank order to identify the limits of cumulative frequencies of 50%, 90%, and 95%. The black regions are above the 95% limit and therefore contain the 5% most likely values. These intervals can be considered "credible regions" assuming uniform prior distributions on C and θ. These figures suggest that the use of equation (3) can effectively distinguish between the two cases of consanguinity and population differentiation, in addition to recapturing the true parameter values.
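
One way to turn a gridded likelihood surface into shaded regions of this kind is to normalize the grid, rank the cells from most to least likely, and accumulate their mass until a chosen fraction is enclosed. The sketch below is one reading of the ranking procedure described above, assuming a uniform prior so that normalized likelihood can be treated as posterior mass; the function name and the exact tie and boundary handling are assumptions.

```python
import math


def highest_density_mask(log_surface, mass):
    """Mark the most likely grid cells that together enclose a fraction `mass`.

    log_surface : 2-D grid (list of rows) of log-likelihood values
    mass        : fraction of the total normalized likelihood to enclose (0-1)
    Returns a grid of booleans with the same shape as log_surface.
    """
    cells = [(v, i, j) for i, row in enumerate(log_surface) for j, v in enumerate(row)]
    peak = max(v for v, _, _ in cells)
    # Subtract the peak before exponentiating to avoid underflow
    weights = [(math.exp(v - peak), i, j) for v, i, j in cells]
    total = sum(w for w, _, _ in weights)
    weights.sort(reverse=True)  # most likely cells first

    mask = [[False] * len(row) for row in log_surface]
    running = 0.0
    for w, i, j in weights:
        if running >= mass * total:
            break
        mask[i][j] = True
        running += w
    return mask
```

Calling the function with masses of 0.05, 0.10, and 0.50 would give nested regions analogous to the shaded envelopes in the likelihood surface figures.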

In addition to the previous simulations, our method performed well on simulated data in which the population consisted of numerous (>2) differentiated subpopulations of unequal sizes (it is assumed throughout this section that these subpopulations were represented in the sample in their naturally occurring proportions). The method performed less well when the population was subdivided into just two unequal-sized subpopulations (e.g., 0.9:0.1), for the simple reason that the average of the allele frequencies was very similar to that of the larger of the two sampled subpopulations. In this case, a true θ = 0.03 was incorrectly estimated as 0.003. Because θ was underestimated, individuals from the smaller subpopulation appeared to have additional excess homozygosity, which the procedure interpreted as the result of consanguinity. In this example, C was incorrectly estimated to be 5%, rather than 0.

Population Study
The use of equation (3) can be demonstrated by its application to microsatellite data collected from two Asian communities (Overall 1998). The U.K. Asian population presents a situation in which high levels of consanguineous unions have been correlated with high rates of morbidity and mortality, particularly within the Pakistani communities (Terry et al. 1985; Darr and Modell 1988; Chitty and Winter 1989; Bundey et al. 1991). It is not unusual to observe a proportion of first-cousin marriages of around 50% (Darr and Modell 1988). High rates of consanguinity are therefore common, and because of the implications for inherited disorders, genetic counseling in such situations has been recommended (Modell 1991).

A recent study of two U.K. Asian communities (Overall 1998; Ayres and Overall 1999) observed evidence for similar levels of excess homozygosity. One of these populations, the Mirpuri, comprised Moslems of Pakistani descent, of whom 50% were born to first cousins. Excess homozygosity was expected within this population as a result. The other population, the Jullunduri, were of Indian Sikh origin with no tradition of consanguinity. Interestingly, excess homozygosity was also observed within this sample. There are a number of explanations for this result. It is possible that there has been an increase in consanguinity resulting from disruptions to traditional marriage practices during migration. In addition, restrictions imposed by U.K. immigration law may have played a part in forcing unorthodox unions. Such an event would increase the incidence of recessive disorders within this community and warrant a comprehensive study of this Asian subpopulation. Alternatively, the Jullunduri population could be substructured, in which case the implications for deleterious recessive traits would depend on the history of population size and isolation. Our method was used to investigate the relative plausibility of these two explanations.

Samples were collected from each of the Nottingham Mirpuri and Jullunduri communities on two separate occasions. The earlier collections of Mirpuri and Jullunduri samples (N = 45 and 45, respectively) were typed for six short tandem repeat (STR) microsatellite loci using the SGM multilocus primer set (Kimpton et al. 1996). The later collections (N = 48 and 32, respectively) were typed for 10 loci using the SGM Plus primer set (PE Biosystems 1999). In total, 10% of the samples were rerun to check for typing error. There were no typing discrepancies. Both SGM and SGM Plus amplification kits are optimized to minimize the preferential amplification of smaller alleles and loci. Consequently, the potential for incorrectly assigning homozygosity, which could augment estimates of both consanguinity and substructure, is minimal. Each individual was typed with only one of the two kits. The primers were fluorescently tagged, which facilitated automated scoring using GeneScan software.


    Results
 
The cause of the homozygosity in the two Asian populations could be consanguinity and/or substructure. The possibility of null alleles for the SGM and SGM Plus loci was not considered in the analysis because no significant deviation from HW equilibrium had previously been observed for U.K. Caucasian populations (Evett et al. 1997) or in the Madeira Archipelago (Corte-Real et al. 1999). In addition, excess homozygosity was not observed consistently at any one locus in the present study.

The results of applying equation (3) to each of the two data sets are shown in the likelihood surfaces of figures 6 and 7, which support contrasting causes for the excess homozygosity observed within each sample. For the Mirpuri data (fig. 6), the maximum-likelihood value implies that 50% of the subjects were born to consanguines. This is in accordance with sociological data about U.K. Pakistani Moslem populations. There is less support for a high value of θ. The Jullunduri surface (fig. 7) differs from the Mirpuri surface, with two maxima giving quite different interpretations of the data. Two maxima arise for the Jullunduri data because the two scenarios, θ = 1/16 and C = 100% with R = 1/16, have the same probability distribution when represented as in figure 2, and both explain the excess homozygosity of around 1/16. With alternative values of R, the maximum-likelihood value favors the substructure scenario (not shown). Clearly, then, this method works more effectively when there is variation in the degree to which the individuals of the population sampled are inbred (i.e., C < 100%). Alternatively, the nongenetic data may help rule out one option. Knowledge of the Jullunduri population makes the interpretation of complete consanguinity implausible (R. Ballard, personal communication). The substructure alternative is quite plausible given that caste endogamy is still practiced in many Sikh communities in India (Mukherjee et al. 1999) as well as in the United Kingdom (Ballard 1994).



 
Fig. 6. The likelihood surface for the Mirpuri sample.

 


 
Fig. 7. The likelihood surface for the Jullunduri sample.

 
The detection of consanguinity entails making use of the multilocus data to detect the difference between the patterns illustrated in figure 2. This can be demonstrated by randomizing the genotypes to disrupt the associations among loci. Reallocating the observed Mirpuri genotypes, one locus at a time, to individuals chosen at random retains the homozygote excess but removes the correlations between loci. As we would predict, the maximum-likelihood estimate of the proportion of cousin marriages drops to 0, and the excess homozygosity is assigned to θ = 0.04 (not shown).

Randomizing the alleles among individuals disrupts the correlation between alleles at the same locus and hence removes the homozygote excess. The effect on our estimates when this was done with the Jullunduri data was, as would be predicted, that the maximum-likelihood value went to 0 for both parameters (not shown).
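
The two randomization checks described above can be sketched as permutations applied locus by locus. The first shuffles whole genotypes among individuals, which preserves each locus's homozygote excess but breaks associations between loci. The second shuffles individual alleles, which also breaks the association between the two alleles at a locus. The function names and the sample layout (a list of individuals, each a list of allele pairs) are hypothetical.

```python
import random


def shuffle_genotypes_by_locus(sample, rng=None):
    """Reallocate observed genotypes to random individuals, one locus at a time."""
    rng = rng or random.Random()
    n_loci = len(sample[0])
    columns = []
    for j in range(n_loci):
        column = [individual[j] for individual in sample]
        rng.shuffle(column)  # genotypes stay intact; only their owners change
        columns.append(column)
    return [[columns[j][i] for j in range(n_loci)] for i in range(len(sample))]


def shuffle_alleles_by_locus(sample, rng=None):
    """Pool the alleles at each locus and re-pair them at random."""
    rng = rng or random.Random()
    n_loci = len(sample[0])
    columns = []
    for j in range(n_loci):
        pool = [allele for individual in sample for allele in individual[j]]
        rng.shuffle(pool)
        columns.append([(pool[2 * i], pool[2 * i + 1]) for i in range(len(sample))])
    return [[columns[j][i] for j in range(n_loci)] for i in range(len(sample))]
```

Re-running the likelihood analysis on the shuffled samples then shows which parameter absorbs the remaining excess homozygosity, if any.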


    Discussion
 
Because consanguinity has deleterious consequences for health, particularly with regard to recessive alleles, it is important when devising health care policy to correctly identify communities for which this is an issue. In most situations, consanguinity will exist as an important and integral part of a community's tradition. There are, however, possible situations in which consanguinity may arise through restrictions placed on marriage practices. These restrictions may be imposed by immigration law or simply through the founding of a migrant population. The Jullunduri community is an example of a community for which these issues are of relevance, particularly as this community was founded during the 1950s and was subject to various changes in the law governing the immigration process. Our likelihood method is well suited to addressing such problems. It can provide useful insights into the marriage patterns and population substructure, particularly since immigration law may be having an unpredictable influence on the genetic structure of a community, which is not always reflected in oral records (Darr and Modell 1988).

Another area in which the distinction between consanguinity and subdivision is of practical importance is forensic science. The Forensic Science Service's (FSS) Asian database consists of individuals of mixed Asian origin, each typed for six STR loci (Gill and Evett 1995). Within our Jullunduri sample, there are 32 individuals with full profiles at these loci. When our Jullunduri and FSS Asian allele frequency distributions are compared, differentiation appears low (θ = 0.01; Ayres and Overall 1999), implying that the FSS database allele frequencies are representative of the Jullunduri subpopulation. However, our analysis suggests that the excess homozygosity observed within the Jullunduri sample, estimated at around 0.06 (Ayres and Overall 1999), actually reflects substructure within this population at a lower level.

The implications for forensic interpretation can be illustrated by calculating match probabilities for the 32 individuals using the FSS database. Match probabilities quantify how likely it is to observe a matching STR profile in an alternative suspect, given that the defendant matches the crime scene profile. If we interpret the 6% observed homozygosity excess as the consequence of consanguinity, then we can build this into our match probability calculation using the method of Ayres and Overall (1999), setting FIS = 0.06. On the other hand, if it is due to population subdivision, then the probability of matching an alternative suspect from the same subpopulation is elevated. This probability can be calculated using the method of Balding and Nichols (1994) and setting FST = 0.06.
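The effect of the subdivision correction can be sketched in a few lines of Python. The sketch below uses the single-locus match probability formulas commonly attributed to Balding and Nichols (1994), with θ playing the role of FST. The six-locus profile and its allele frequencies are hypothetical (they are not taken from the FSS database), the function names are illustrative, and the consanguinity correction of Ayres and Overall (1999) is not reproduced here.

```python
def locus_match_probability(p_i, p_j=None, theta=0.0):
    """Single-locus match probability with a subdivision correction.

    p_i, p_j : allele frequencies; pass p_j=None for a homozygote.
    theta    : correction parameter (FST); theta = 0 gives the
               uncorrected Hardy-Weinberg genotype frequency.
    """
    denom = (1.0 + theta) * (1.0 + 2.0 * theta)
    if p_j is None:  # homozygote A_i A_i
        return ((2.0 * theta + (1.0 - theta) * p_i)
                * (3.0 * theta + (1.0 - theta) * p_i) / denom)
    # heterozygote A_i A_j
    return (2.0 * (theta + (1.0 - theta) * p_i)
            * (theta + (1.0 - theta) * p_j) / denom)


def profile_match_probability(profile, theta=0.0):
    """Multiply single-locus match probabilities across loci."""
    prob = 1.0
    for p_i, p_j in profile:
        prob *= locus_match_probability(p_i, p_j, theta)
    return prob


# Hypothetical six-locus profile: (p_i, p_j) per locus, None = homozygote.
profile = [(0.10, 0.20), (0.05, None), (0.15, 0.15),
           (0.08, 0.12), (0.20, None), (0.10, 0.05)]

uncorrected = profile_match_probability(profile, theta=0.0)
corrected = profile_match_probability(profile, theta=0.06)
fold_increase = corrected / uncorrected
print(f"fold increase with theta = 0.06: {fold_increase:.1f}")
```

Because the correction is applied at every locus, its effect compounds across the profile and is largest for rare alleles, which is why the fold increase can vary widely between individuals.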

This issue of the nonindependence of suspect and offender genotypes is also addressed in detail by Evett and Weir (1998, p. 211). In particular, their calculation of the probability of a crime scene profile, given that some person other than the suspect left the DNA evidence, incorporates allelic dependencies, quantified as FST. The authors arrive at their equations (Evett and Weir 1998, eqs. 8.1) by a route different from that of Balding and Nichols (1994, eqs. 1 and 2). Essentially, both treatments avoid the need to assume allelic independence, as well as the need to specify individual subpopulations.

The outcomes of treating excess homozygosity as inbreeding or as substructure in match probability calculations are very different, however. With a correction of FIS = 0.06, the match probabilities for each individual are elevated only 1.1 times on average relative to the uncorrected values, with the most extreme result being 3 times the uncorrected value. However, if we use a correction of FST = 0.06, the match probabilities are 8–2,000 times as high, with a mean of 234 times the uncorrected value. Our conclusion that the homozygous excess does indeed reflect subdivision is therefore of clear importance.

A specific case has been presented here in which consanguines are first cousins. Using R = 1/16, however, ignores the possible consanguineous history of the grandparents. To see that this effect is very small, consider the case in which all preceding generations showed the same marriage patterns. We wish to calculate the total probability of IBD, that is, the probability of IBD in the common grandparents plus the additional probability due to consanguinity in those grandparents. It can be shown that this total equals $C/(2^n - C)$, where n is the number of ancestors in each line of descent through each common ancestor (by extension of eqs. 5.1 and 5.15 in Falconer [1989, pp. 87–97]). If we consider C = 50%, ignoring consanguinity in the grandparental generation lowers the probability of identity by only about 1/1,000. The use of R = 1/16 in the above calculations is therefore a reasonable approximation. The increase in identity over generations becomes less trivial, however, when unions are between closer relatives. To incorporate closer degrees of relatedness into the estimation procedure, we need to modify R.

With regular systems of inbreeding (e.g., see Falconer 1989, p. 91), the inbreeding coefficient increases in magnitude with each generation of successive inbreeding. We consider a slightly different situation, in which the inbreeding individuals form only a portion of the population, with this proportion remaining constant from one generation to the next. The proportion of individuals born to consanguines (C) then is partitioned into individuals that have experienced 0, 1, 2, ... , n successive generations of inbreeding. The overall inbreeding coefficient of a population is calculated assuming that a proportion $C(1 - C)$ have experienced one generation of inbreeding with coefficient r, a proportion $C^2(1 - C)$ have experienced two successive generations of inbreeding and have, therefore, an inbreeding coefficient of $r + r^2$, and so forth, giving the series

$$\sum_{k=1}^{n} C^{k}(1-C)\left(r + r^{2} + \cdots + r^{k}\right) = C(1-C)\,r + C^{2}(1-C)\,(r + r^{2}) + \cdots + C^{n}(1-C)\,(r + r^{2} + \cdots + r^{n}).$$

In the limit of n → ∞, this converges to $C/[(1/r) - C]$, which was previously arrived at by the extension of Falconer's equations, detailed above, and which is a general treatment of equations given in Li (1976, p. 243). For practical purposes, it may be noted that for most systems of close inbreeding, the calculation need not be extended beyond 20 generations (Falconer 1989, p. 93).
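The convergence of the series to C/[(1/r) − C] can be checked numerically. The following Python sketch sums the series as described in the text and compares it with the closed form; the function names and the number of generations summed are illustrative choices, not part of the original method.

```python
def overall_inbreeding_series(C, r, n_generations=200):
    """Sum C^k (1 - C) (r + r^2 + ... + r^k) for k = 1..n_generations."""
    total = 0.0
    r_k = 0.0  # running value of r + r^2 + ... + r^k
    for k in range(1, n_generations + 1):
        r_k += r ** k
        total += C ** k * (1.0 - C) * r_k
    return total


def overall_inbreeding_limit(C, r):
    """Closed-form limit of the series, C / [(1/r) - C]."""
    return C / (1.0 / r - C)


# First-cousin unions (r = 1/16) practiced by half the population.
series_value = overall_inbreeding_series(0.5, 1.0 / 16.0)
limit_value = overall_inbreeding_limit(0.5, 1.0 / 16.0)

# Extra identity gained by allowing for consanguinity in earlier
# generations, relative to the single-generation value C * r.
extra_identity = limit_value - 0.5 / 16.0
print(series_value, limit_value, extra_identity)
```

For first cousins the extra identity is on the order of 1/1,000, whereas for selfing (r = 1/2) the same functions give a limit that is noticeably larger than C × r.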

This overall coefficient can be substituted into equation (3) to give the probability for single-locus data. For multilocus data and close inbreeding, this is not appropriate because equation (3) takes the product over loci. This implies that the loci are independent, which they are not. Some individuals have been produced by one generation of inbreeding, and the appropriate inbreeding coefficient will apply to all of their loci. Other individuals will be the products of two generations of inbreeding, and this will apply to all of their loci, and so forth. In general, a proportion of individuals $C^k(1 - C)$ will have an inbreeding coefficient, $R_k = r + r^2 + \cdots + r^k$ (with $R_0 = 0$), that will apply to all loci. For M loci, therefore, the probability of a multilocus genotype $(1_i1_j, 2_i2_j, \ldots, M_iM_j)$ becomes the sum over generations

$$\Pr(1_i1_j, 2_i2_j, \ldots, M_iM_j) = \sum_{k=0}^{\infty} C^{k}(1-C)\prod_{m=1}^{M}\Pr(m_im_j \mid R_k), \qquad (4)$$

where $\Pr(m_im_j \mid R_k)$ denotes the single-locus genotype probability for an individual with inbreeding coefficient $R_k$.
For close degrees of inbreeding, this equation gives appreciably different results from the naïve approach of substituting the overall inbreeding coefficient into equations (2) and (3). The importance of this discrepancy can be illustrated by an example in which we score eight unlinked loci in a species that can self-fertilize. Consider observing one individual homozygous at six loci ($p_m$ = 0.1). Under equation (3), a selfing rate of C = 20% is essentially ruled out, with a likelihood more than 2,000 times lower than that at the maximum-likelihood value of C, around 70%, whereas under equation (4), C = 20% is plausible, its likelihood being only 3.2 times lower.
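The qualitative contrast between the two approaches can be sketched in Python. This is a minimal illustration under stated assumptions, not the exact calculation behind the figures quoted above: it assumes no population subdivision, uses the textbook single-locus genotype probabilities under an inbreeding coefficient F (homozygote F·p + (1 − F)·p², heterozygote (1 − F)·2·p_i·p_j) in place of equations (2) and (3), takes "naïve" to mean applying the population-average coefficient C/[(1/r) − C] to every locus, and assigns hypothetical allele frequencies of 0.1 to the two heterozygous loci. Its likelihood ratios should therefore not be expected to match the values reported in the text.

```python
def locus_probability(p_i, p_j, F):
    """Genotype probability at one locus for inbreeding coefficient F."""
    if p_j is None:  # homozygote
        return F * p_i + (1.0 - F) * p_i ** 2
    return (1.0 - F) * 2.0 * p_i * p_j  # heterozygote


def naive_likelihood(genotype, C, r):
    """Apply the population-average coefficient independently to each locus."""
    F_bar = C / (1.0 / r - C)
    like = 1.0
    for p_i, p_j in genotype:
        like *= locus_probability(p_i, p_j, F_bar)
    return like


def mixture_likelihood(genotype, C, r, max_generations=60):
    """Sum over inbreeding classes; R_k applies to all loci of an individual."""
    total = 0.0
    R_k = 0.0  # R_0 = 0 for individuals with no inbreeding
    for k in range(max_generations + 1):
        if k > 0:
            R_k += r ** k  # R_k = r + r^2 + ... + r^k
        per_class = 1.0
        for p_i, p_j in genotype:
            per_class *= locus_probability(p_i, p_j, R_k)
        total += C ** k * (1.0 - C) * per_class
    return total


# Eight loci: six homozygous (p = 0.1), two heterozygous (assumed p = 0.1 each).
genotype = [(0.1, None)] * 6 + [(0.1, 0.1)] * 2
r_selfing = 0.5

naive_ratio = (naive_likelihood(genotype, 0.7, r_selfing)
               / naive_likelihood(genotype, 0.2, r_selfing))
mixture_ratio = (mixture_likelihood(genotype, 0.7, r_selfing)
                 / mixture_likelihood(genotype, 0.2, r_selfing))
print(naive_ratio, mixture_ratio)
```

Under these assumptions the naïve product penalizes C = 20% far more heavily than the sum over inbreeding classes does, which is the direction of the discrepancy described above.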

Although we have presented our method in the context of human medical genetics and forensic science, excess homozygosity has been observed in a wide range of plant and animal studies (Shapcott 1994; Premoli 1996; Freeland, Noble, and Okamura 2000). Population substructure has frequently been suspected as a contributory factor but has not been clearly identified. These examples include species that breed with relatives closer than cousins or that self-fertilize. For these situations, equation (4) should be preferred. In either case, it is necessary to have some understanding of the underlying pattern of mating that is being compared with substructure. In particular, a value of R must be specified, which requires knowledge of the species' reproductive biology and ecology.

Our approach could be extended to include parameters that specify the subpopulation from which each individual is drawn, following the approach of Pritchard, Stephens, and Donnelly (2000). Although they did not model consanguineous populations, they did show that it is possible to use Markov chain Monte Carlo methods to probabilistically assign individuals to subpopulations. In many studies, however, the objective is to distinguish consanguinity from population substructure or to quantify the magnitude of substructure among inbred individuals (Shapcott 1994; Premoli 1996; Freeland, Noble, and Okamura 2000). In such cases, the computational costs of increasing the number of parameters may not be justified. Given the abundance of multilocus data currently being produced, our method could find application in a variety of both human and nonhuman population studies.


    Acknowledgements
 
We are grateful to Tamsin Burland and Karen Ayres for valuable comments, to Murrium Ahmad for Asian sample collection, and to the FSS and Mark Thomas for sample processing. We also thank two anonymous reviewers whose comments greatly improved the clarity of this manuscript. This research was supported by Sir Jules Thorn Charitable Trust grant 98/28A (to A.D.J.O.) and U.K. NERC grant GR9/04474 (to R.A.N.).


    Footnotes
 
Jeffrey Long, Reviewing Editor

1 Present address: Institute of Cell, Animal and Population Biology, University of Edinburgh, Edinburgh, Scotland.

2 Keywords: consanguinity; inbreeding; population substructure; short tandem repeat loci; U.K. Asian population; forensic match probabilities

3 Address for correspondence and reprints: A. D. J. Overall, Institute of Cell, Animal and Population Biology, University of Edinburgh, Ashworth Laboratories, King's Buildings, Edinburgh EH9 3JT, United Kingdom. andy.overall@ed.ac.uk.


    References
 

    Ayres K. L., A. D. J. Overall, 1999 Allowing for within-subpopulation inbreeding in forensic match probabilities Forensic Sci. Int 103:207-216

    Balding D. J., R. A. Nichols, 1994 DNA profile match probability calculations: how to allow for population stratification, relatedness, database selection and single bands Forensic Sci. Int 64:125-140

    Ballard R., 1994 Desh Pardesh: the south Asian presence in Britain Oxford University Press, Oxford

    Brookfield J. F. Y., 1996 A simple new method for estimating null allele frequency from heterozygote deficiency Mol. Ecol 5:453-455

    Bundey S., H. Alam, A. Kaur, S. Mir, R. Lancashire, 1991 Why do UK-born Pakistani babies have high perinatal and neonatal mortality rates? Paediatr. Per. Epidemiol 5:101-114

    Chitty L. S., R. M. Winter, 1989 Perinatal mortality in different ethnic groups Arch. Dis. Child 64:1036-1041

    Cockerham C. C., 1973 Analysis of gene frequencies Genetics 74:679-700

    Corte-Real F., L. Souto, M. J. Anjos, M. Carvalho, D. N. Vieira, A. Carracedo, M. C. Vide, 1999 Population distribution of six PCR-amplified loci in Madeira Archipelago (Portugal) Forensic Sci. Int 100:93-99

    Darr A., B. Modell, 1988 The frequency of consanguineous marriage among British Pakistanis J. Med. Genet 25:186-190

    Evett I. W., P. D. Gill, J. A. Lambert, N. Oldroyd, R. Frazier, S. Watson, S. Panchal, A. Connolly, C. Kimpton, 1997 Statistical analysis of data for three British ethnic groups from a new STR multiplex Int. J. Legal Med 110:5-9

    Evett I. W., B. S. Weir, 1998 Interpreting DNA evidence Sinauer, Sunderland, Mass

    Falconer D. S., 1989 Introduction to quantitative genetics. 3rd edition Longman Scientific and Technical, Harlow, United Kingdom

    Freeland J. R., L. R. Noble, B. Okamura, 2000 Genetic consequences of the metapopulation biology of a facultatively sexual freshwater invertebrate J. Evol. Biol 13:383-395

    Gelman A., J. B. Carlin, H. S. Stern, D. B. Rubin, 1995 Bayesian data analysis Chapman and Hall, London

    Gill P., I. Evett, 1995 Population genetics of short tandem repeat (STR) loci Genetica 96:69-87

    Hartl D. L., A. G. Clark, 1997 Principles of population genetics. 3rd edition Sinauer, Sunderland, Mass

    Kimpton C. P., N. C. Oldroyd, S. K. Watson, R. R. E. Frazier, P. E. Johnson, E. S. Millican, A. Urquhart, R. L. Sparkes, P. Gill, 1996 Validation of highly discriminating multiplex short tandem repeat amplification systems for individual identification Electrophoresis 17:1283-1293

    Li C. C., 1976 First course in population genetics Boxwood, Pacific Grove, Calif

    Modell B., 1991 Social and genetic implications of customary consanguineous marriage among British Pakistanis J. Med. Genet 28:720-723

    Mukherjee N., P. P. Majumder, B. Roy, M. Roy, B. Dey, M. Chakraborty, S. Banerjee, 1999 Variation at 4 short tandem repeat loci in 8 population groups in India Hum. Biol 71:439-446

    Nagylaki T., 1998 Fixation indices in subdivided populations Genetics 148:1325-1332

    Nei M., 1987 Molecular evolutionary genetics Columbia University Press, New York

    Overall A. D. J., 1998 The geographic scale of human genetic differentiation at short tandem repeat loci Ph.D. thesis, University of London, London

    PE Biosystems. 1999 AmpFSTR SGM Plus user manual Perkin-Elmer, Foster City, Calif

    Premoli A. C., 1996 Allozyme polymorphisms, outcrossing rates, and hybridization of South American Nothofagus Genetica 97:55-64

    Pritchard J. K., M. Stephens, P. Donnelly, 2000 Inference of population structure using multilocus genotype data Genetics 155:945-959

    Shapcott A., 1994 Genetic and ecological variation in Atherosperma moschatum and the implications for conservation and biodiversity Aust. J. Bot 42:663-686

    Sokal R. R., F. J. Rohlf, 1995 Biometry. 3rd edition W. H. Freeman, San Francisco, Calif

    Terry P. B., J. G. Bissenden, R. G. Condie, P. M. Mathew, 1985 Ethnic differences in congenital malformations Arch. Dis. Child 62:866-879

    Weir B. S., C. C. Cockerham, 1984 Estimating F statistics for the analysis of population structure Evolution 38:1358-1370

    Wright S., 1921 Systems of mating Genetics 6:111-178

Accepted for publication July 13, 2001.




