Protein Design Group, CNBCSIC, Cantoblanco, E-28049 Madrid, Spain
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: bioinformatics/co-evolution/phylogenetic tree/protein interaction/proteomics
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
In parallel with these developments, a number of computational techniques have been designed for predicting protein interactions from the information contained in sequences and genomes (Dandekar et al., 1998; Enright et al., 1999
; Marcotte et al., 1999a
,b
; Pellegrini et al., 1999
). These computational techniques still have a limited range of applicability; for example, Enright et al. predicted a total of 64 interactions in three bacterial genomes (Enright et al., 1999
). The accuracy and coverage of these techniques were recently compared (Huynen et al., 2000
) (see Discussion).
It has been observed previously that phylogenetic trees of ligands and receptors, e.g. insulin and insulin receptors (Fryxell, 1996), were more similar to what could be expected from a general divergence between the corresponding species under the standard molecular clock hypothesis (Zuckerkandl, 1987
). The similarity between the phylogenetic trees of interacting proteins was interpreted as an indication of their coordinated evolution and a direct consequence of the similar evolutionary pressure applied to all the members of a given cellular complex.
An extreme of co-evolution of two interacting proteins would be those cases in which both proteins are simultaneously lost in the same species, probably because one of them cannot perform its function without the other. One such example could be His5 (His synthesis) and TrpC (Trp synthesis). This observation is the base of the `phylogenetic profiles' method (Pellegrini et al., 1999).
In this work, we went one step beyond the binary information (presence/absence of the genes in different species) using the information contained in the full structure of the phylogenetic tree. We measured the similarity between trees as the correlation between the distance matrices used to build the trees, with a methodology similar to that recently published by Goh et al. (Goh et al., 2000). In that work, they assessed the similarity of the trees in two examples, the two domains of phosphoglycerate kinase (PGK) and the chemokinereceptor system, quantifying the degree of symmetry between the corresponding trees. Here we extend the idea to a search for interaction partners in a large collection of possibilities. The results indicate that it is indeed possible to distinguish statistically a few true interactions among many possible alternatives, opening up the possibility of searching for interaction partners in large collections of proteins and complete genomes.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Structural domains. The first data set was composed of 13 proteins of known structure for which two structural domains in close interaction are clearly visible (Pazos et al., 1997). These proteins were used to produce a collection of domains. The multiple sequence alignments were taken from the HSSP database (Sander and Schneider, 1993
), March 2000 version. The calculation of the similarity of phylogenetic trees was carried out for those pairs of domains with at least 11 sequences from the same species (see below). The final set contained 133 pairs of domains including 13 pairs of truly interacting domains, that is, pairs in which the two domains belong to the same original protein of known structure.
Proteins. A second set was build with 53 Escherichia coli proteins extracted from a set previously analyzed by Dandekar et al. (Dandekar et al., 1998). The multiple sequence alignments for those proteins were made searching with BLAST (Altschul et al., 1990
) using a cut-off value of P(N) < 1x105 and aligning with ClustalW (Higgins et al., 1992
) the homologous sequences in 14 fully sequenced microbial genomes (M.tuberculosis, Rhizobium sp., E.coli, H.pylori, Synechocystis sp.,M.thermoautotrophicum, A.aeolicus, B.burgdorferi,P.horikoshii, T.pallidum, B.subtilis, M.jannaschii, H.influenzae, A.fulgidus). As in the previous case, only pairs of proteins for which it was possible to collect more than 11 sequences from the same species were analyzed, leading to a final number of 244 pairs, that included eight pairs of known interactions and eight pairs of possible interactions (see below).
Genomes. A whole genome experiment was performed by collecting alignments for 4300 E.coli proteins and combining them in 67209 pairs of matrices for the analysis. The alignments were collected as in the set above, from 14 complete genomes by searching with BLAST and aligning the homologous sequences with ClustalW. The results of the searches are available at http://www.pdg.cnb.uam.es/mirrortree.
Methodology
For each pair of proteins, the two initial multiple sequence alignments were refined by selecting the sequences that correspond to common species, producing two trees with the same number of leaves (Figure 1). In the cases in which one species contained more than one homologous sequences of a given protein (paralogous sequences), only one of them was selected. We used a simple criterion for the selection, choosing the sequence more similar to the E.coli protein or to the HSSP master sequence in the cases of the E.coli and structural domains test sets.
|
The multiple sequence alignments were used for building matrices containing the distances between all possible protein pairs. Distances were calculated as the average value of the residue similarities taken from the McLachlan amino acid homology matrix (McLachlan, 1971) (Figure 1
). Finally, the linear correlation coefficient (r) between the data of these two matrices was calculated according to the standard equation (Press et al., 1992
):
|
It is important to note that the method does not require the construction of the phylogenetic trees and only the underlying distance matrices are analyzed, which makes this approach independent of any given tree-construction method.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Table I contains the full list of correlation values for the 133 pairs of domains analyzed. The positions of the 13 real interactions are highlighted. It can be seen that most of them correspond to high correlation values (nine out of 13 have correlation values better than 0.77).
|
|
This experiment with the structural domains could be considered an `easy' test, since the interaction partners are domains of the same protein and therefore likely to be subject to stronger evolutionary pressure and co-adaptation. However, it is still an interesting test since it provides an upper threshold for the correlation value of true interactions. The average value of the true interactions (Table I) is 0.78, very similar to the value obtained by Goh et al. in the two-domain protein assessed by them (Goh et al., 2000
) (i.e. r = 0.79 for the two domains of PGK).
Interactions between proteins
The second test was carried out on the 244 pairs of proteins derived from the Dandekar et al.'s set (Dandekar et al., 1998) (see Materials and methods). This set contains eight true interactions between well-known proteins and a small number of other possible interactions, e.g. different ribosomal proteins which form part of the same macromolecular complex even though they may not interact directly.
As in the previous test, most of these pairs of truly interacting proteins have high correlation values (Figure 3) and there is a clear correlation between interaction index and true interactions with eight out of eight true interations and seven out of eight possible interactions with correlation values better than 0.8. The pair with the highest correlation value corresponds to the known interaction between the
and ß subunits of the membrane ATP synthase and the first `false positive' corresponds to the pair formed by the chaperonin GroEl and the ribosomal protein S7.
|
We carried out a fully automatic prediction of protein interactions at the genomic scale with the aim of obtaining a significant number of predictions. We generated alignments for 4300 E.coli proteins which allowed the study of 67209 possible interaction pairs. This number is still far from the total of 9.2x106 possible pairs between E.coli proteins, of which about 20 000 true interactions are expected if we consider the average of interactions per protein detected in H.pylori by Yeast Two-Hybrid screening (Rain et al., 2001). In our case, the main limitation for building a larger data set was the use of a relatively small set of 14 genomes for constructing the alignments.
The analysis of the possible interactions leads to the proposal of more than 2700 pairs of proteins that were found to have scores better than 0.8 (Figure 4). Well-known interactions are included among the stronger predictions, including proteins such as ATP synthase
and ß, elongation factors Tu and G, and ribosomal proteins S2S10 and S2S11. Among the pairs with interaction predictions better than 0.8, there are 460 proteins labeled as hypothetical. For example, the protein of unknown function YHBZ_ECOLI is predicted to interact with the ribosomal protein S4 and YFGK_ECOLI is predicted to interact with the polynucleotide phosphorylase PNP_ECOLI. For these proteins these predictions are the first clue about their possible function.
|
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
We based the study presented here on the common previous observation of the similarity between the trees of interacting proteins. Examples of systems in which this relation was previously observed are insulin and dockerin and their corresponding receptors (Fryxell, 1996; Pages et al., 1997
). Recently, the relation between the corresponding phylogenetic trees was carefully quantified for multiple sequence alignments corresponding to the two domains of phosphoglycerate kinase (Goh et al., 2000
).
Here we present a systematic extension of these ideas about the similarity in the evolutionary history of complementary proteins (`mirror trees') by applying it to the large-scale detection of interacting protein partners. The results obtained in different systems demonstrate that the similarity between phylogenetic trees can be used as a predictor of protein interaction, with >66% of true positives detected at correlation values better than 0.8. This value of 0.8 seems to be a good empirical cut-off to discriminate between true and false interactions (Figures 2 and 3) in accordance with Goh et al. (Goh et al., 2000
). This cut-off is probably a stringent one, since it has been derived from domains of the same protein that are likely to be in permanent (not transient) interaction. It is possible that for free independent proteins the pressure for interaction would be smaller, the signals left by their interaction in the corresponding trees probably weaker and correspondingly the correlation values smaller.
One of the more interesting properties of the current approach is its capacity to cover a significant number the potential interactions. For example, in the set of proteins derived from the study of Dandekar et al. (Dandekar et al., 1998), it was possible to study the interaction of 244 pairs, accounting for 18% of the total number of 1378 possible pairs. In the genome-based experiment the number of possible interactions explored was as large as 67209, a substantial number even if it is still a small fraction of the possible 9.2x106 pairs. In this case the number of predictions of interaction was of 2742, which is clearly higher than the 64 interactions predicted from the information about domain arrangements by Enright et al. for three genomes (Enright et al., 1999
) or the 749 predicted by Marcotte et al. for E.coli (Marcotte et al., 1999). The coverage of these techniques was compared using the genome of M.genitaliun for benchmarking (Huynen et al., 2000
) and it ranges from 6% for the techniques based on gene fusion events to 37% for those based on the conservation of gene order. A separate issue is how accurate these predictions would be.
Despite the promising results obtained in the different tests carried out, a number of problems are still present in the current approach. First, the number of possible interactions could have been increased by collecting sequences from a larger number of genomes or by improving the process of selection of the corresponding sequences from the same species in the corresponding pairs of protein families. It is to be expected that the continuous stream of new sequences and genomes would alleviate this problem, allowing us to increase the number of predictions easily. Second, the quality of the underlying alignments is a key factor and a number of false positives are introduced in the case such as the Cys-rich protein discussed in Results. Third, it is possible that some inaccuracies are introduced by comparing distance matrices instead of the real phylogenetic trees, since the distance matrices are not a perfect representation of the corresponding phylogenetic trees. Given that the comparison of phylogenetic trees is a difficult and not fully solved problem, we decided to short-cut the problem by comparing their underlying distance matrices. Finally, it is really difficult to assess definitively the accuracy of any of the protein interaction prediction methods in the absence of a well-accepted and large enough collection of annotated protein interactions.
Among the positive features of the mirror tree approach, it is interesting to mention that it does not require the presence of fully sequenced genomes, as other methods do, e.g. the `phylogenetic profiles' method (Pellegrini et al., 1999), since the mirror tree approach is based only on information about protein families whether they are coming from complete genomes or not.
This approach and the others commented upon here have different ranges of reliability and applicability. A prospect for the future is to combine them to obtain an in silico prediction of the interaction network of an organism.
![]() |
Notes |
---|
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Dandekar,T., Snel,B., Huynen,M. and Bork,P. (1998) Trends Biochem. Sci., 23, 324328.[ISI][Medline]
Enright,A.J., Iliopoulos,I., Kyrpides,N.C. and Ouzounis,C.A. (1999) Nature, 402, 8690.[ISI][Medline]
Fryxell,K.J. (1996) Trends Genet., 12, 364369.[ISI][Medline]
Gaasterland,T. and Ragan,M.A. (1998) Microb. Comp. Genomics, 3, 199217.[Medline]
Goh,C.S., Bogan,A.A., Joachimiak,M., Walther,D. and Cohen,F.E. (2000) J. Mol. Biol., 299, 283293.[ISI][Medline]
Higgins,D.G., Bleasby,A.J. and Fuchs,R. (1992) Comput. Appl. Biosci., 8, 189191.[Abstract]
Huynen,M., Snel,B., Lathe,W. and Bork,P. (2000) Genome Res., 10, 12041210.
Ito,T., Tashiro,K., Muta,S., Ozawa,R., Chiba,T., Nishizawa,M., Yamamoto,K., Kuhara,S. and Sakaki,Y. (2000) Proc. Natl Acad. Sci. USA, 97, 11431147.
Marcotte,E.M., Pellegrini,M., Ho-Leung,N., Rice,D.W., Yeates,T.O. and Eisenberg,D. (1999a) Science, 285, 751753.
Marcotte,E.M., Pellegrini,M., Thompson,M.J., Yeates,T.O. and Eisenberg,D. (1999b) Nature, 402, 8386.[ISI][Medline]
McLachlan,A.D. (1971) J. Mol. Biol., 61, 409424.[ISI][Medline]
Mendelsohn,A.R. and Brent,R. (1999) Science, 284, 19481950.
Pages,S., Belaich,A., Belaich,J.P., Morag,E., Lamed,R., Shohan,Y. and Bayer,E.A. (1997) Proteins, 29, 517527.[ISI][Medline]
Pazos,F., Helmer-Citterich,M., Ausiello,G. and Valencia,A. (1997) J. Mol. Biol., 271, 511523.[ISI][Medline]
Pellegrini,M., Marcotte,E.M., Thompson,M.J., Eisenberg,D. and Yeates,T.O. (1999) Proc. Natl Acad. Sci. USA, 96, 42854288.
Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. (1992) Numerical Recipes in C: the Art of Scientific Computing. 2nd edn. Cambridge University Press, Cambridge.
Rain,J.C. et al. (2001) Nature, 409, 211215.[ISI][Medline]
Sander,C. and Schneider,R. (1993) Nucleic Acids Res., 21, 31053109.[ISI][Medline]
Uetz,P. et al. (2000) Nature, 403, 623631.[ISI][Medline]
Wootton,J.C. and Federhen,S. (1996) Methods Enzymol., 266, 554571.[ISI][Medline]
Zuckerkandl,E. (1987) J. Mol. Evol., 26, 3446.[ISI][Medline]
Received January 1, 2001; revised May 25, 2001; accepted June 18, 2001.