Determination of amino acid pairs sensitive to variants in human β-glucocerebrosidase by means of a random approach

Guang Wu1,2 and Shaomin Yan3

1 Laboratoire de Toxicocinétique et Pharmacocinétique, Faculté de Pharmacie, Université de la Méditerranée Aix-Marseille II, Marseille, France and 3 Cattedra di Anatomia Patologica, Dipartimento di Ricerche Mediche e Morfologiche, Facoltà di Medicina e Chirurgia, Università degli Studi di Udine, Udine, Italy

2 To whom correspondence should be addressed. E-mail: hongguanglishibahao{at}yahoo.com


    Abstract
 
In this data-based theoretical analysis, we use the random approach to analyse the amino acid pairs in human β-glucocerebrosidase in order to determine which amino acid pairs are more sensitive to 109 variants from missense mutant human glucocerebrosidase. The rationale of this study is based on our hypothesis and findings that harmful variants are more likely to occur at randomly unpredictable amino acid pairs and harmless variants are more likely to occur at randomly predictable amino acid pairs. We argue that the randomly predictable amino acid pairs should not be deliberately evolved, whereas the randomly unpredictable amino acid pairs should be deliberately evolved in connection with protein function. The results show, for example, that 93.58% of 109 variants occur at randomly unpredictable amino acid pairs, which account for 71.40% of amino acid pairs in glucocerebrosidase, and the chance of occurrence of a variant is about 4.4 times higher in randomly unpredictable amino acid pairs than in predictable pairs. Hence the randomly unpredictable amino acid pairs are more sensitive to variants in human glucocerebrosidase. The results also suggest that human glucocerebrosidase has a natural tendency towards variants.

Keywords: Gaucher disease/glucocerebrosidase/probability/randomness/variant


    Introduction
 
β-Glucocerebrosidase (EC 3.2.1.45) is a lysosomal enzyme that catalyses the hydrolysis of D-glucosyl-N-acylsphingosine. Deficiency of glucocerebrosidase results in Gaucher disease, which is the most prevalent glycolipid storage disease (Balicki and Beutler, 1995; Beutler, 1997). It is characterized by an abnormal accumulation of glucocerebrosides in macrophages, involving the reticuloendothelial system of the liver, spleen, lung, kidney, bone marrow or central nervous system (Brady et al., 1965; Petrides, 1998). The prevalence of Gaucher disease ranges between 1:30 000 and 1:50 000 in most countries (Niederau and Haussinger, 2000). Clinical manifestations of Gaucher disease span an exceptionally broad spectrum, ranging from hydrops fetalis (Ginsburg and Groll, 1973; Daneman et al., 1983; Sun et al., 1984) to incidental diagnoses in patients older than 70 years (Chang-Lo and Yam, 1967; Berrebi et al., 1984). A major part of this variability is explained by different mutations of the glucocerebrosidase gene, but even within genotypes variability is marked (Beutler, 2001).

The gene for human glucocerebrosidase is located on chromosome 1q21. To date, ~110 different mutations are known to occur in the glucocerebrosidase gene (Incerti, 1995; Grabowski and Horowitz, 1997; Beutler and Gelbart, 1998), including point mutations, splice junction mutations, deletions, fusion alleles and recombinant alleles (Stone et al., 2000). However, with so many variants in the enzyme, little is known about which amino acid sub-sequences in glucocerebrosidase are more sensitive to variants. It is still difficult to draw a general rule on which amino acid sub-sequences are more sensitive to variants and which are less sensitive. If such a general rule could be drawn, we would not only gain more insight into the relationship between glucocerebrosidase and Gaucher disease but, more importantly, we could also give attention to these sensitive sub-sequences in order to protect them from variants. Moreover, we could even in principle predict the sub-sequences likely to be sensitive to currently unknown variants.

This problem can be approached in different ways, such as empirical (regression analysis), experimental (artificial and natural mutations) and computational (multiple sequence comparisons and alignments) approaches. A pseudogene for glucocerebrosidase is located 16 kb downstream from the functional gene, sharing 97% exonic sequence homology (Horowitz et al., 1989; Winfield et al., 1997). Many mutations are complex alleles due to recombination events between the gene and pseudogene, such as gene conversion or unequal crossing over (Cormand et al., 2000). However, these explanations still do not answer why some amino acid sub-sequences are sensitive to variants.

A probabilistic approach may contribute to the understanding of this problem. In the past we have used two probabilistic approaches to analyse the primary structures of different proteins, with the hope that these approaches might throw light on glucosylceramidase constructions and the related Gaucher disease. In general, our first approach can predict the present and absent amino acid sub-sequences in a protein primary structure. We argue that the randomly predictable present and absent sub-sequences should not be deliberately evolved, whereas the randomly unpredictable present and absent sub-sequences should be deliberately evolved. Accordingly, our first approach can classify the present amino acid sub-sequences as randomly predictable or unpredictable sub-sequences. We suggest that the randomly unpredictable amino acid sub-sequences are more closely related to protein function and that variants in these sub-sequences may lead to dysfunction of the protein. More recently, we found that a mutation which leads to the dysfunction of rat monoamine oxidase B is located in a randomly unpredictable amino acid pair. In contrast, another mutation, which does not affect rat monoamine oxidase B function, is located in randomly predictable amino acid pairs (Wu and Yan, 2001).

In this study, we attempted to use our first random approach to analyse amino acid pairs in human β-glucocerebrosidase with its 109 variants in order to determine which amino acid pairs are more sensitive to variants.


    Materials and methods
 
The amino acid sequence of human β-glucocerebrosidase and its 109 variants with missense mutations were obtained from the Swiss-Prot data bank (Bairoch and Apweiler, 2000) (accession number P04062) (owing to limitations of space, we will not cite the numerous references related to the enzyme and its variants). The detailed calculations and their rationales have already been published in a number of previous papers [for details, see our recent review (Wu and Yan, 2002)]. Briefly, the calculation procedure, with examples, is as follows.

Amino acid pairs in human glucocerebrosidase

The human glucocerebrosidase precursor is composed of 536 amino acids. We count the first and second amino acids as an amino acid pair, the second and third as another amino acid pair, the third and fourth, and so on, until the 535th and 536th, hence there is a total of 535 amino acid pairs. As there are 20 types of amino acids, any amino acid pair can be composed from any of 20 types of amino acids, so theoretically there are 400 (20²) kinds of amino acid pairs. As the 535 amino acid pairs in human glucocerebrosidase outnumber the 400 kinds of theoretical amino acid pairs, clearly some of the 400 kinds of theoretical amino acid pairs should appear more than once. Further, we may expect that some of the 400 kinds of theoretical amino acid pairs are absent from human glucocerebrosidase.
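The counting of overlapping (i, i + 1) pairs can be illustrated with a minimal Python sketch. The function names and the toy sequence are assumptions made for illustration; the toy sequence is not the glucocerebrosidase sequence, which would have to be retrieved from Swiss-Prot (P04062).

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def adjacent_pairs(sequence):
    """Return the overlapping (i, i + 1) pairs of a sequence, in order."""
    return [sequence[i:i + 2] for i in range(len(sequence) - 1)]


def pair_counts(sequence):
    """Count how many times each kind of adjacent pair occurs."""
    return Counter(adjacent_pairs(sequence))


# Hypothetical toy sequence used only to show the bookkeeping.
toy = "MAASLLAS"
counts = pair_counts(toy)

# All 20 x 20 = 400 theoretical kinds, and those absent from the toy sequence.
all_kinds = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
absent = [kind for kind in all_kinds if kind not in counts]

print(len(adjacent_pairs(toy)), counts["AS"], len(absent))
```

A sequence of n residues always yields n - 1 adjacent pairs, which is why the 536-residue precursor gives 535 pairs.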

Randomly predicted frequency and actual frequency

The randomly predicted frequency is calculated according to the simple permutation principle (Feller, 1968). For example, there are 42 alanines (A) in human glucocerebrosidase, and the predicted frequency of amino acid pair ‘AA’ would be 3 (42/536 × 41/535 × 535 = 3.213). Here 42/536 is the chance that the first position of a pair is ‘A’, 41/535 is the chance that the second position is one of the remaining alanines and 535 is the number of amino acid pairs. Actually we can find three ‘AA’s in human glucocerebrosidase, so the actual frequency of ‘AA’ is 3. In general, there are three possible relationships between actual and predicted frequencies, i.e. the actual frequency can be smaller than, equal to or larger than the predicted frequency.
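The predicted-frequency calculation can be written as a short Python sketch. The function names are illustrative assumptions, and rounding halves upwards is also an assumption, since the text only shows values rounded to the nearest integer.

```python
import math


def predicted_frequency(count_first, count_second, length, same_residue=False):
    """Expected count of one ordered pair among the length - 1 adjacent pairs,
    assuming the residues are arranged at random."""
    # When both positions hold the same residue type, one copy is already used.
    remaining = count_second - 1 if same_residue else count_second
    return count_first / length * remaining / (length - 1) * (length - 1)


def round_half_up(value):
    """Round to the nearest integer (treating .5 as rounding up is an assumption)."""
    return math.floor(value + 0.5)


# 'AA' in human glucocerebrosidase: 42 alanines among 536 residues.
aa = predicted_frequency(42, 42, 536, same_residue=True)
print(round(aa, 3), round_half_up(aa))  # 3.213 3
```

The factor (length - 1) cancels algebraically, so the prediction reduces to the product of the two residue counts divided by the sequence length; the sketch keeps the three factors only to mirror the calculation given in the text.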

Randomly predictable present amino acid pairs

As described in the last section, the frequency of random presence of amino acid pair ‘AA’ would be 3 and ‘AA’ really appears three times in human glucocerebrosidase, so the presence of ‘AA’ is randomly predictable.

Randomly unpredictable present amino acid pairs

There are 44 serines (S) in human glucocerebrosidase, and the frequency of random presence of amino acid pair ‘AS’ would be 3 (42/536 × 44/535 × 535 = 3.448), i.e. there would be three ‘AS’s in human glucocerebrosidase. However, ‘AS’ actually appears five times, so the presence of ‘AS’ is randomly unpredictable. This is a case where the actual frequency is larger than the predicted frequency. In the other case, the actual frequency is smaller than the predicted frequency. For example, there are 60 leucines (L) and 37 prolines (P) in human glucocerebrosidase and the predicted frequency of ‘LP’ is 4 (60/536 × 37/535 × 535 = 4.142), whereas the actual frequency of ‘LP’ is 2.

Randomly predictable absent amino acid pairs

There are 23 arginines (R) and eight cysteines (C) in human glucocerebrosidase, and the frequency of random presence of ‘RC’ would be 0 (23/536 × 8/535 × 535 = 0.343), i.e. the amino acid pair ‘RC’ would not appear in human glucocerebrosidase, which is true in the real situation. Hence the absence of ‘RC’ is randomly predictable.

Randomly unpredictable absent amino acid pairs

There are 27 phenylalanines (F) in human glucocerebrosidase, and the frequency of random presence of ‘AF’ would be 2 (42/536 × 27/535 × 535 = 2.116), i.e. there would be two ‘AF’s in human glucocerebrosidase. However, no ‘AF’ appears in the enzyme, therefore the absence of ‘AF’ from human glucocerebrosidase is randomly unpredictable.
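The four categories described in the preceding subsections can be reproduced with a small Python sketch that uses only the residue counts and actual frequencies quoted in the text. The function names, the dictionary layout and the round-half-up rule are assumptions made for illustration.

```python
import math

SEQUENCE_LENGTH = 536  # residues in the precursor, as stated in the text

# Residue counts and actual pair frequencies quoted in the text.
RESIDUE_COUNTS = {"A": 42, "S": 44, "L": 60, "P": 37, "R": 23, "C": 8, "F": 27}
ACTUAL = {"AA": 3, "AS": 5, "LP": 2, "RC": 0, "AF": 0}


def predicted(pair):
    """Randomly predicted (unrounded) frequency of an ordered pair."""
    first, second = pair
    # For a pair such as 'AA', one copy of the residue is already used.
    remaining = RESIDUE_COUNTS[second] - (1 if first == second else 0)
    pairs_total = SEQUENCE_LENGTH - 1
    return RESIDUE_COUNTS[first] / SEQUENCE_LENGTH * remaining / pairs_total * pairs_total


def classify(pair):
    """Label a pair as randomly predictable or unpredictable, present or absent."""
    expected = math.floor(predicted(pair) + 0.5)  # assumed rounding rule
    actual = ACTUAL[pair]
    status = "predictable" if actual == expected else "unpredictable"
    presence = "present" if actual > 0 else "absent"
    return f"randomly {status} {presence}"


for pair in ACTUAL:
    print(pair, round(predicted(pair), 3), ACTUAL[pair], classify(pair))
```

With these inputs the sketch assigns ‘AA’, ‘AS’, ‘LP’, ‘RC’ and ‘AF’ to the same categories as the text.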

Variants in randomly predictable and unpredictable amino acid pairs

Our rationale for the determination of variants in randomly predictable and unpredictable present amino acid pairs is based on the finding of our previous study (Wu and Yan, 2001), which is described as follows. There are two mutations in rat monoamine oxidase B. The first mutation occurs at position 139, changing leucine (L) to histidine (H). The amino acids at positions 138 and 140 are proline (P) and alanine (A), hence this mutation changes two amino acid pairs into two others, i.e. ‘PL’ → ‘PH’ and ‘LA’ → ‘HA’. As ‘PL’ and ‘LA’ are randomly predictable amino acid pairs according to our random analysis, we would not expect the first mutation to lead to a substantial change in enzymatic activity, which is true in the real situation. The second mutation occurs at position 199, changing ‘I’ to ‘F’ and leading to the changes in amino acid pairs ‘II’ → ‘IF’ and ‘IS’ → ‘FS’. As ‘IS’ belongs to the randomly unpredictable amino acid pairs, we would expect the second mutation to bring about a substantial change in enzymatic activity, and this expectation is also true in the real situation. In this manner we hope to determine whether a variant occurs at randomly predictable or unpredictable amino acid pairs in human glucocerebrosidase in order to gain more insight into the relationship between variants and sensitivity of amino acid pairs.
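The way a point mutation maps onto amino acid pairs can be sketched in Python as follows. The function name `affected_pairs` and the three-residue context strings are illustrative assumptions; only the residues quoted in the text for rat monoamine oxidase B are used, not the full sequence.

```python
def affected_pairs(sequence, position, new_residue):
    """Return (substituted, substituting) pairs for a point mutation.

    `position` is 1-based. A mutation at either end of the sequence
    affects only one pair; any other position affects two.
    """
    i = position - 1
    mutant = sequence[:i] + new_residue + sequence[i + 1:]
    substituted, substituting = [], []
    if i > 0:  # pair formed with the preceding residue
        substituted.append(sequence[i - 1:i + 1])
        substituting.append(mutant[i - 1:i + 1])
    if i < len(sequence) - 1:  # pair formed with the following residue
        substituted.append(sequence[i:i + 2])
        substituting.append(mutant[i:i + 2])
    return substituted, substituting


# Local contexts taken from the text: P-L-A around position 139 (L -> H)
# and I-I-S around position 199 (I -> F); position 2 is the middle residue.
print(affected_pairs("PLA", 2, "H"))  # (['PL', 'LA'], ['PH', 'HA'])
print(affected_pairs("IIS", 2, "F"))  # (['II', 'IS'], ['IF', 'FS'])
```

Each substituted pair can then be looked up in the predictable or unpredictable class to decide in which kind of pair the variant occurs.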

Difference between actual and randomly predicted frequencies

For the numerical analysis, we calculate the difference between the actual frequency (AF) and predicted frequency (PF) of affected amino acid pairs, i.e. Σ(AF − PF). For instance, a variant at position 215 replaces ‘A’ with ‘D’, which results in two amino acid pairs, ‘LA’ and ‘AS’, changing to ‘LD’ and ‘DS’, because the amino acid is ‘L’ at position 214 and ‘S’ at position 216. The actual frequency and predicted frequency are 6 and 5 for ‘LA’, 5 and 3 for ‘AS’, 4 and 3 for ‘LD’ and 2 and 2 for ‘DS’, respectively. Hence the difference between actual and predicted frequencies is 3 with respect to the substituted amino acid pairs, i.e. (6 − 5) + (5 − 3), and 1 with respect to the substituting amino acid pairs, i.e. (4 − 3) + (2 − 2). In this way, we can compare the frequency differences in the amino acid pairs affected by variants.
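The worked example for the variant at position 215 can be checked with a few lines of Python. The dictionaries hold only the actual and predicted frequencies quoted in the text; the function name is an illustrative assumption.

```python
# Actual (AF) and predicted (PF) frequencies quoted in the text.
ACTUAL = {"LA": 6, "AS": 5, "LD": 4, "DS": 2}
PREDICTED = {"LA": 5, "AS": 3, "LD": 3, "DS": 2}


def frequency_difference(pairs):
    """Sum of (AF - PF) over the given amino acid pairs."""
    return sum(ACTUAL[pair] - PREDICTED[pair] for pair in pairs)


substituted = ["LA", "AS"]   # pairs present before the variant
substituting = ["LD", "DS"]  # pairs present after the variant

print(frequency_difference(substituted))   # 3
print(frequency_difference(substituting))  # 1
```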


    Results
 
General information on amino acid pairs in human glucocerebrosidase

Of 400 kinds of theoretical amino acid pairs, 134 kinds are absent from human glucocerebrosidase including 37 randomly predictable and 97 randomly unpredictable. Consequently, 535 amino acid pairs in human glucocerebrosidase include only 266 kinds of theoretical amino acid pairs (400 - 134 = 266), i.e. some amino acid pairs should appear more than once. Actually, of 535 amino acid pairs in human glucocerebrosidase, 119 kinds of theoretical amino acid pairs appear once, 82 kinds twice, 37 kinds three times, 17 kinds four times, one kind five times, eight kinds six times, one kind seven times and one kind 13 times.
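The counts in this paragraph are mutually consistent, as a short Python check shows. The dictionary simply restates the reported numbers of kinds per number of occurrences; the variable names are illustrative.

```python
# Number of kinds of amino acid pairs (value) appearing a given number of
# times (key) in human glucocerebrosidase, as reported in the text.
KINDS_BY_OCCURRENCE = {1: 119, 2: 82, 3: 37, 4: 17, 5: 1, 6: 8, 7: 1, 13: 1}

present_kinds = sum(KINDS_BY_OCCURRENCE.values())
total_pairs = sum(times * kinds for times, kinds in KINDS_BY_OCCURRENCE.items())
absent_kinds = 400 - present_kinds

print(present_kinds, total_pairs, absent_kinds)  # 266 535 134
```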

Of 266 kinds of theoretical amino acid pairs in human glucocerebrosidase, 107 kinds are randomly predictable and 159 kinds are randomly unpredictable. As mentioned above, some kinds of amino acid pairs appear more than once, thus of 535 amino acid pairs in human glucocerebrosidase, 153 pairs are randomly predictable and 382 pairs are randomly unpredictable. We can therefore find how many variants occur with respect to these present amino acid pairs in human glucocerebrosidase (Table I).


 
Table I. Occurrences of variants with respect to randomly predictable and unpredictable amino acid pairs in human glucocerebrosidase (GBA)
 
Variants of human glucocerebrosidase in randomly predictable and unpredictable present amino acid pairs

As mentioned in the Materials and methods section, a missense mutation leads to two amino acid pairs being substituted by another two, and their actual frequency can be smaller than, equal to or larger than the randomly predicted frequency. Tables II and III detail the situations related to substituted and substituting amino acid pairs, respectively, and the relationship between their actual and randomly predicted frequencies.


 
Table II. Classification of substituted amino acid pairs induced by variants in human glucocerebrosidase
 

 
Table III. Classification of substituting amino acid pairs induced by variants in human glucocerebrosidase
 
Table II can be read as follows. The first column classifies the amino acid pairs into randomly predictable and unpredictable. The second and third columns show in which type of amino acid pairs the variant occurs; for example, the first two cells in columns 2 and 3 indicate that the actual frequencies are equal to the predicted frequencies in amino acid pairs I and II. The fourth and fifth columns indicate how many variants occur in amino acid pairs I and II; for instance, seven of 109 variants (6.42%) occur at amino acid pairs whose actual frequencies are equal to the predicted frequencies. The sixth column indicates the percentage of variants occurring at predictable and unpredictable amino acid pairs.

Tables I and II indicate that 93.58% of variants occur at randomly unpredictable present amino acid pairs and 6.42% of variants occur in randomly predictable amino acid pairs. These results imply that 159 kinds of randomly unpredictable present amino acid pairs account for 93.58% of variants in human glucocerebrosidase, whereas 107 kinds of randomly predictable present amino acid pairs account for only 6.42% of variants. Moreover, we can see from the ratio in Table I that the chance of occurrence of variants in unpredictable amino acid pairs is far larger than in predictable amino acid pairs. For example, the chance of occurrence of a variant is almost 8-fold higher in the unpredictable kinds than in the predictable kinds (0.64 vs 0.07). These results strongly support our rationale that the harmful variants are more likely to occur at randomly unpredictable present amino acid pairs, which therefore are more sensitive to the variants.

When looking at the unpredictable pairs in Table II, we find that the majority of these pairs are characterized by one or both substituted pairs whose actual frequency is larger than the predicted frequency (the first three rows in unpredictable pairs). Comparing each variant, we find that the impact of variants is to diminish the difference between actual and predicted frequencies by reducing the actual frequency, which indicates that the variants make the construction of amino acid pairs more randomly predictable. In other words, the variants result in a construction of amino acid pairs that occurs more easily in nature. It is interesting that only five variants occur in amino acid pairs whose actual frequency is smaller than the predicted frequency in both pairs. This phenomenon suggests that it is difficult for variants to narrow the difference between actual and predicted frequencies by increasing the actual frequency, as this would lead to a construction of amino acid pairs opposite to the natural direction.

Table III can be read as follows. The first and second columns indicate the actual and predicted situations in amino acid pairs I and II, the third and fourth columns indicate the number of variants occurring at amino acid pairs I and II and their percentages, and the fifth column is the total percentage of our classifications.

Table III shows that 44.04% of variants bring about one or both substituting amino acid pairs which are absent in normal human glucocerebrosidase (AF = 0). Also, 57.80% of variants target one or both substituting amino acid pairs with their actual frequency smaller than the predicted frequency (†). These phenomena indicate that the amino acid pairs in mutant proteins are more randomly constructed.

Frequency difference of amino acid pairs affected by variants

The difference between actual and predicted frequencies represents a measure of randomness of construction of amino acid pairs, i.e. the smaller the difference, the more random is the construction of amino acid pairs. In particular, (i) the larger the positive difference, the more randomly unpredictable is the presence of amino acid pairs; and (ii) the larger the negative difference, the more randomly unpredictable is the absence of amino acid pairs.

Considering all 109 variants, the difference between actual and predicted frequencies is 1.68 ± 0.17 (mean ± SE, ranging from -2 to 6) for substituted amino acid pairs. This means that the variants occur in the amino acid pairs which appear more often than their predicted frequency. Meanwhile, the difference between actual and predicted frequencies is -0.11 ± 0.18 (mean ± SE, ranging from -4 to 5) for substituting amino acid pairs, which implies that the substituting amino acid pairs are randomly constructed in the mutant glucocerebrosidase, as their actual and predicted frequencies are about the same. A striking statistical difference is found between the substituted and substituting amino acid pairs (P < 0.0001). Figure 1 shows the distribution of the difference between actual and predicted frequencies.



 
Fig. 1. Frequency difference between substituted (shaded) and substituting (unshaded) amino acid pairs induced by variants from human glucocerebrosidase.

 

    Discussion
 
In this study we used the random approach to analyse the amino acid pairs in human glucocerebrosidase to determine which amino acid pairs are more sensitive to variants. The results confirm our hypothesis that the randomly unpredictable amino acid pairs are more sensitive to the variants. This data-based theoretical analysis may provide a clue for protecting human glucocerebrosidase against variants and shed some light on the nature of the variants.

Based on our previous studies [see our review article (Wu and Yan, 2002), for details of references], our argument is that the functional amino acid pairs should be deliberately evolved and hence the actual frequency should be different from the predicted frequency. As the predicted frequency represents the highest chance for construction of amino acid pairs, it is important to find out whether the variants lead the actual frequency to approach the predicted frequency. If so, we can understand that the protein has a natural trend towards variants; if not, the protein does not have a natural trend towards variants. The present study demonstrates that human glucocerebrosidase has a natural trend towards variants.

In this study, the unpredictable amino acid pairs account for 71.40% of 535 amino acid pairs in glucocerebrosidase, and the unpredictable amino acid pairs account for 69.32 ± 4.48% of amino acid pairs in 13 different proteins (Wu and Yan, 2002). If we consider that the proteins chosen in our studies were randomly sampled from the Swiss-Prot data bank, we could estimate that all the proteins might have about 70% randomly unpredictable amino acid pairs in their primary structure.

With respect to randomly unpredictable absent and present amino acid pairs, we are interested in the difference between actual and predicted frequencies, because the predictable absence and presence represent the naturally easiest occurring events, i.e. the construction of amino acid pairs should be the least energy and time consuming. Hence the difference between actual and predicted frequencies should be engineered by the evolutionary process: the larger the difference, the greater the impact of the evolutionary process. A diminishing difference between actual and predicted frequencies has been shown in this study, hence the variants in fact represent a degeneration process inducing Gaucher disease related to the glucocerebrosidase variants.

In this study, we focused our efforts on the linear primary structure, i.e. the one-dimensional structure. Therefore, we used (i, i + 1) amino acid pairs, which are constructed by means of the peptide bond, rather than (i, i + 2), (i, i + 3) and (i, i + k) amino acid pairs. On the other hand, we would use the (i, i + 2), (i, i + 3) and (i, i + k) amino acid pairs when considering higher-level protein structures, where these amino acid pairs are constructed by means of S–S bonds, for instance.

In this study, we dealt with amino acid pairs rather than triplets, quadruplets or longer multiplets, because a point mutation is directly related to two amino acid pairs. The first pair is composed of the amino acid preceding the mutated amino acid and the mutated amino acid itself, and the second pair is composed of the mutated amino acid and the amino acid following it, except when the point mutation occurs at the beginning or the end of the amino acid sequence. This means that the neighboring amino acids have direct effects on the stability of the mutated amino acid. On the other hand, the amino acids located beyond the preceding and the following amino acids would have less direct effects on the mutated amino acid, although they still have some effects. These amino acids belong to triplets, quadruplets and longer amino acid sub-sequences. Although two amino acids in a triplet have direct effects on the stability of the mutated amino acid when the mutated amino acid is located in the middle of the triplet, this is still the case of amino acid pairs, and we would have to consider the indirect effect when the mutated amino acid is not located in the middle of the triplet. Therefore, we consider that we should focus our efforts on the amino acid pairs at the first stage, because the amino acid pairs have a direct effect on the point mutation, and this direct effect can be easily quantified.


    References
 
Bairoch,A. and Apweiler,R. (2000) Nucleic Acids Res., 28, 45–48.

Balicki,D. and Beutler,E. (1995) Medicine (Baltimore), 74, 305–323.

Berrebi,A., Wishnitzer,R. and Von der Walde,U. (1984) Nouv. Rev. Fr. Hematol., 26, 201–203.

Beutler,E. (1997) Curr. Opin. Hematol., 4, 19–29.

Beutler,E. (2001) Blood, 98, 2597–2602.

Beutler,E. and Gelbart,T. (1998) Blood Cells Mol. Dis., 24, 2–8.

Brady,R.O., Kanfer,J. and Shapiro,D. (1965) Biochem. Biophys. Res. Commun., 18, 221–225.

Chang-Lo,M. and Yam,L.T. (1967) Am. J. Med. Sci., 254, 303–315.

Cormand,B., Diaz,A., Grinberg,D., Chabas,A. and Vilageliu,L. (2000) Blood Cells Mol. Dis., 26, 409–416.

Daneman,A., Stringer,D. and Reilly,B.J. (1983) Radiology, 149, 463–467.

Feller,W. (ed.) (1968) An Introduction to Probability Theory and Its Applications, Vol. I. 3rd edn. Wiley, New York.

Ginsburg,S.J. and Groll,M. (1973) J. Pediatr., 82, 1046–1048.

Grabowski,G.A. and Horowitz,M. (1997) In Zimran,A. (ed.), Gaucher’s Disease: Molecular, Genetic and Enzymological Aspects. Baillière’s Clinical Haematology, London, pp. 635–656.

Horowitz,M., Wilder,S., Horowitz,Z., Reiner,O., Gelbart,T. and Beutler,E. (1989) Genomics, 4, 87–96.

Incerti,C. (1995) Semin. Hematol., 3(suppl 32), 3–9.

Niederau,C. and Haussinger,D. (2000) Hepatogastroenterology, 47, 984–997.

Petrides,P.E. (1998) Arzneimitteltherapie, 16, 49–51.

Stone,D.L., Tayebi,N., Orvisky,E., Stubblefield,B., Madike,V. and Sidransky,E. (2000) Hum. Mutat., 15, 181–188.

Sun,C.C., Panny,S., Combs,J. and Gutberlett,R. (1984) Pathol. Res. Pract., 179, 101–104.

Winfield,S.L., Tayebi,N., Martin,B.M., Ginns,E.I. and Sidransky,E. (1997) Genome Res., 7, 1020–1026.

Wu,G. and Yan,S.-M. (2001) Biomol. Eng., 18, 23–27.

Wu,G. and Yan,S.-M. (2002) Mol. Biol. Today, 3, 55–69.

Received June 25, 2002; revised January 2, 2003; accepted January 23, 2003.




