From the Howard Hughes Medical Institute, Molecular Biology Institute, UCLA-Department of Energy Laboratory of Structural Biology and Molecular Medicine, University of California, Los Angeles, California 90095-1570
ABSTRACT
In this paper we explore the usefulness of the Database of Interacting Proteins (DIP)¹ for assessing the reliability of measurements of protein interaction. Until two years ago, when high throughput screens of protein interaction were developed, the information within interaction databases was collected from small scale screens reported in hundreds of individual research papers. The biological relevance of each interaction had often been investigated thoroughly, sometimes with a repertoire of experimental techniques and often with multiple controls (4, 5). These independent, often repeated observations, coupled with controls and curation in the peer-review process, enhanced the reliability of the published data. In the past two years, high throughput, genome-wide detection of protein interactions by yeast two hybrid (Y2H) and mass spectrometric analysis of protein complexes has tremendously increased the experimental coverage. The new methods can rapidly generate more information than was collected by traditional means in more than a decade (6–10). However, the large size of such datasets makes it impractical to verify individual interactions by the same methods used previously in small scale experiments (11, 12). The question then arises: do these new, high throughput methods of detecting interactions provide information as reliable as the small scale experiments? Verifying the interactions from these high throughput methods is vital (11–15), because only then can the large and small scale data be combined into one self-consistent interaction network useful for further studies.
To address these issues we have analyzed the complete set of 8063 protein-protein interactions identified in yeast, Saccharomyces cerevisiae, that are described in DIP as of November 2001. We demonstrate that the subset of interactions obtained through the high throughput Y2H screens differs in several respects from the subset based only on the small scale or multiple, redundant experiments. Most notably, analysis of the coexpression profiles of the interacting partners leads to the conclusion that, overall, only about 30% of the high throughput dataset possesses the same characteristic mRNA expression features as the dataset based on the small scale experiments. To further pinpoint the interactions within the dataset that are likely to be correct, interactions were analyzed between protein pairs that are paralogs of the tested proteins. This resulted in the identification of about 1400 interactions likely to be correct. A reliable, self-consistent set of about 3000 interactions is extracted when these 1400 are combined with the small experiment datasets and with interactions verified by more than one experiment.
EXPERIMENTAL PROCEDURES
Functional Correlation
Proteins have been assigned to 44 "cellular role," 58 "functional," and 29 "compartment" categories in the Yeast Protein Database (YPD) (19, 20). Cellular role is defined as the major biological process involving the protein and function as the principal structural, regulatory, or enzymatic function of the protein. The YPD categories are broad, and a large percentage of proteins are associated with more than one cellular role, function, or compartment (subcellular location).
The functional annotation, cellular role, and compartment, if one exists, were collected for all the S. cerevisiae open reading frames from the YPD database. We counted a correlation if the two interacting proteins shared one or more annotated functions, in a manner analogous to Schwikowski et al. (15). The background probability that two proteins could be expected to share a common function was calculated using all possible pairs of proteins annotated in a given category.
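The counting described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary layout, and the toy protein identifiers are all assumptions, and the annotations would in practice come from YPD.

```python
from itertools import combinations


def shares_category(annotations, a, b):
    """Return True if proteins a and b share at least one annotated category."""
    return bool(annotations.get(a, set()) & annotations.get(b, set()))


def observed_agreement(annotations, pairs):
    """Fraction of interacting pairs, with both partners annotated, that agree."""
    scored = [(a, b) for a, b in pairs
              if annotations.get(a) and annotations.get(b)]
    if not scored:
        return 0.0
    return sum(shares_category(annotations, a, b) for a, b in scored) / len(scored)


def background_agreement(annotations):
    """Same fraction computed over all possible pairs of annotated proteins."""
    annotated = sorted(p for p, cats in annotations.items() if cats)
    all_pairs = list(combinations(annotated, 2))
    if not all_pairs:
        return 0.0
    return sum(shares_category(annotations, a, b) for a, b in all_pairs) / len(all_pairs)


# Hypothetical toy annotations (a protein may carry several categories).
toy_annotations = {
    "P1": {"protein folding"},
    "P2": {"protein folding", "protein translocation"},
    "P3": {"protein translocation"},
    "P4": {"RNA splicing"},
}
toy_interactions = [("P1", "P2"), ("P2", "P3")]

print(observed_agreement(toy_annotations, toy_interactions))
print(background_agreement(toy_annotations))
```

Comparing the two fractions gives the excess agreement of interacting pairs over random pairs; because proteins may belong to several categories, the background is well above zero even for unrelated proteins.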
Expression Profile Reliability Index
The expression profile reliability index (EPR) was extracted from the interaction datasets by solving the equality-constrained linear least squares problem defined by Equation 2 (see "Results") using the LAPACK implementation of the GRQ factorization method (21) and a discrete representation of the $\rho(d^2)$ distributions (up to 30 bins, 1.25 units wide; only bins with at least five counts were included in the calculations). $\chi^2$ was calculated assuming a binomial distribution of the error for the individual bins in each of the histograms. The accuracy of the fitted parameter was estimated using a bootstrapping approach with 5,000 synthetic datasets as described (22).
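A bootstrap error estimate of this kind can be sketched generically. The version below uses plain resampling with replacement, which may differ in detail from the procedure of reference 22; the function name, the seed, and the toy values are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_std(values, estimator, n_resamples=5000, seed=0):
    """Estimate the spread of `estimator` by resampling `values` with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_resamples):
        synthetic = rng.choices(values, k=len(values))  # one synthetic dataset
        estimates.append(estimator(synthetic))
    return statistics.stdev(estimates)


# Hypothetical toy values; in the paper the estimator would be the EPR fit.
toy_values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(bootstrap_std(toy_values, statistics.mean, n_resamples=500))
```

The standard deviation of the estimates over the synthetic datasets serves as the uncertainty of the fitted parameter.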
The Euclidean expression distance between proteins A and B, $d_{AB}$, was calculated according to Equation 1,
$$d_{AB} = \sqrt{\sum_{i} \left(e_{iA} - e_{iB}\right)^{2}} \qquad \text{(Eq. 1)}$$
where $e_{iN}$ is the log ratio of the expression level of protein N under the ith condition, as customarily reported by Brown and co-workers (23). The sum runs over a set of 12 distinct shock conditions using the data provided by Gasch et al. (23).
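As a small illustration of the distance and of the binning described for the EPR calculation (bins 1.25 units wide, up to 30 bins), the sketch below may help. The function names and the toy profiles are assumptions; real profiles would hold 12 log ratios per protein.

```python
import math


def expression_distance(profile_a, profile_b):
    """Euclidean distance between two expression profiles (lists of log ratios)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same conditions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(profile_a, profile_b)))


def histogram_d2(d2_values, bin_width=1.25, n_bins=30):
    """Count squared distances per bin; values beyond the last bin are dropped."""
    counts = [0] * n_bins
    for d2 in d2_values:
        index = int(d2 // bin_width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical two-condition profiles, for illustration only.
d = expression_distance([0.0, 0.0], [3.0, 4.0])
print(d, d ** 2)
print(histogram_d2([0.0, 1.0, 1.25, 40.0], n_bins=3))
```

Dividing each count by the total number of pairs gives the normalized distribution of squared distances used in the fit.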
Paralogous Verification Method
The paralogous verification method (PVM) validates interacting pairs using the existence of paralogous interactions. Paralogs were collected by performing intraproteome comparisons using PSI-BLAST (24). Each predicted open reading frame product of S. cerevisiae served as a query sequence against the entire database of S. cerevisiae. The PSI-BLAST comparisons were performed using the BLOSUM62 substitution matrix and the seg filter to mask compositionally biased regions in the query sequence. To arrive at the optimal definition of family, different PSI-BLAST conditions were examined, and the coverage and sensitivity were measured.
RESULTS
All these factors can lead to the observation of either false negatives (interactions that cannot be detected under the conditions used) or false positives (physical interactions without biological meaning). Here we concentrate on the following two problems: 1) identifying the fraction of false positives within the high throughput datasets (using EPR) and 2) identifying true positives (using PVM). We do this by relating the global properties of these datasets to those of the reference set of biologically relevant interactions extracted from the DIP database. The underlying assumption of this approach is that, by virtue of its size and diversity, this reference dataset (INT) captures the most prominent features of biologically relevant protein-protein interactions and therefore can be used to judge the quality of other interaction datasets.
Functional Correlation
We began by asking what level of functional resemblance we can find between two interacting S. cerevisiae proteins in DIP. For this study, we divided the interacting pairs into four datasets: DIP-YEAST includes all pairs; EC3 and EC2 are the datasets with at least three and at least two observations supporting the interaction, respectively; and INT is the set of interactions observed in at least one small scale experiment. A full description of the subsets is given under "Experimental Procedures."
Fig. 2 shows the percentage agreement of function, cellular role, and compartment as defined by the YPD (19, 20) for the pairs. The horizontal black line gives the background percentage agreement. It shows that if we pick two proteins at random from the set with known functions, the members of 18% of pairs agree in function. The difference between the observed agreement and this background is large in all cases.
Notice in Fig. 2 that the INT set and the EC2 and EC3 sets show substantially higher correlation than the DIP-YEAST set. The relative lack of agreement of compartment within the DIP-YEAST data (63%) could be, in part, because of the large number of interactions between nuclear and cytoplasmic proteins (15); these are expected as there are many reports of proteins shuttled between these compartments through the nuclear pore (31). The INT dataset may show higher correlation because of a better relationship between functional annotation and protein interactions described in the small scale studies. However, if we select random pairs of proteins from INT, as opposed to the entire set, a similar level of random correlation is observed. This points to a similar level of multiple annotation and possible cross-talk in both cases.
It should also be remembered that the annotations in these categories may have been transferred from homologous proteins without experimental confirmation and as such are subject to error. However, when we calculate the percentage correlations for the set of experimentally annotated proteins only, they are similar to the results described above.
Function (the principal structural, regulatory, or enzymatic function) is the least conserved of the three properties. This is not surprising, as an interaction between two proteins does not demand that they share an identical function; rather it demands that they are linked in a functional network. Thus, the linkages observed between functional groups could well be biologically meaningful. For example, Schwikowski et al. (15) found that there are a large number of interactions between the categories of protein folding and protein translocation. Therefore, in the assessment of an individual interaction, identical assignments of function or cellular role should not always be expected; rather consideration should be given to the relationships between the functions of the proteins.
The poorer conservation of function, compartment, and cellular role within the DIP-YEAST dataset than within the INT, EC2, and EC3 datasets suggests that small scale studies yield more reliable results than high throughput studies; this calls for methodologies that determine the reliability of a dataset and the reliability of any given interaction. Here we introduce two computational methods that use mRNA expression data and sequence analysis, respectively, to assess the reliability of the high throughput datasets and to identify protein-protein interactions that are likely to be correct. An overview of the two methods is offered in Fig. 3.
Fig. 4A shows the normalized distribution of expression level distances, $\rho(d^2)$, for several sets of protein interaction data. The curve RND1 gives the distribution for randomly generated sets of protein pairs. Notice that it is the broadest distribution shown, with the lowest peak. The curve INT is for the small scale dataset and is seen to have the highest peak and sharpest distribution. These differences are statistically significant (confidence level p = $10^{-140}$), as inferred from a Kolmogorov-Smirnov test. We take the INT set to be a reference set of interacting proteins and the RND1 set to be representative of non-interacting proteins.
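For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical cumulative distributions of two samples. The sketch below computes only that statistic, not the p value; the function name and the toy samples are assumptions and are unrelated to the actual INT and RND1 data.

```python
def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical cumulative distributions of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


# Hypothetical squared distances for two small sets of protein pairs.
print(ks_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
print(ks_statistic([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

A statistic near 1 means the two distributions barely overlap, and a statistic near 0 means they are nearly indistinguishable.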
The distribution of expression distances for an experimental dataset, $\rho_{exp}(d^2)$, can be modeled as a weighted sum of two reference distributions,
$$\rho_{exp}(d^2) = \alpha_{EPR}\,\rho_i(d^2) + (1 - \alpha_{EPR})\,\rho_n(d^2) \qquad \text{(Eq. 2)}$$
where $\rho_i$ and $\rho_n$ are the expression distance probability distributions for the interacting and non-interacting protein pairs, and the expression profile reliability index, $\alpha_{EPR}$, corresponds to the fraction of the true positives in the experimental dataset.
The $\rho_n$ distribution can be obtained as the distribution of expression distances for all protein pairs within a genome, because the full genome distribution is of vast size (approximately $9 \cdot 10^6$ pairs for S. cerevisiae) and must be dominated by the non-interacting pairs. The $\rho_i$ distribution can be approximated by the distribution of the expression distances for all the reliable interactions present in DIP-YEAST (for example INT). The latter assumption seems to be valid, as the set of interactions described in DIP-YEAST was in the majority of cases obtained in a manner that did not rely on the expression levels of the interacting partners. Therefore, it can be treated as a representative sample of the entire protein-protein interaction set, random with respect to the expression levels of the interacting proteins.
A linear least-squares fit of the GY2H dataset to the model described by Equation 2 allows us to evaluate the EPR parameter. The EPR index is calculated as 31 ± 3% (Table II) for the GY2H data, suggesting that about 70% of the reported pairs in this set are, in fact, false positives. To verify that the EPR index indeed reflects the expected accuracy of the experimental results, subsets of the GY2H dataset corresponding to varying stringency of selection were constructed as reported by Ito et al. (7), who created these sets by identifying those interactions with at least 1, 2, ... 8 ISTs, labeled here as ITO1 to ITO8, respectively. As expected, the accuracy of the resulting subsets, as evaluated by the EPR index, increases with increased selection stringency (Fig. 4B). This indicates that the EPR index can be used to characterize the accuracy of experimental, large scale protein-protein interaction datasets and corresponds crudely to the fraction of pairs that is biologically meaningful. However, the error on the EPR index increases rapidly with decreasing dataset size, limiting the applicability of EPR in general to large (>500 interactions) datasets.
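The idea of the fit can be illustrated with a simplified sketch. When an experimental histogram is modeled as a weighted sum of an interacting and a non-interacting reference histogram with weights that add up to one, the single free weight has a closed-form least-squares solution. This stands in for, and is not, the LAPACK-based constrained solve described under "Experimental Procedures"; the function name, the optional per-bin weights, and the toy histograms are assumptions.

```python
def fit_epr(rho_exp, rho_i, rho_n, weights=None):
    """Least-squares weight alpha in rho_exp ~ alpha*rho_i + (1 - alpha)*rho_n."""
    if not (len(rho_exp) == len(rho_i) == len(rho_n)):
        raise ValueError("histograms must share the same bins")
    if weights is None:
        weights = [1.0] * len(rho_exp)  # unweighted fit by default
    numerator = 0.0
    denominator = 0.0
    for e, i, n, w in zip(rho_exp, rho_i, rho_n, weights):
        diff = i - n
        numerator += w * (e - n) * diff
        denominator += w * diff * diff
    if denominator == 0.0:
        raise ValueError("reference distributions are identical")
    return numerator / denominator


# Hypothetical three-bin reference histograms and a synthetic 31:69 mixture.
toy_rho_i = [0.5, 0.3, 0.2]
toy_rho_n = [0.2, 0.3, 0.5]
toy_mix = [0.31 * i + 0.69 * n for i, n in zip(toy_rho_i, toy_rho_n)]
print(round(fit_epr(toy_mix, toy_rho_i, toy_rho_n), 3))
```

On noise-free synthetic input the fit recovers the mixing fraction used to build it; with real histograms the per-bin weights would typically reflect the counting error in each bin.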
Using Paralogous Interactions to Verify Protein-Protein Interactions: the PVM Method
The reliability of a given protein interaction can be evaluated by the presence of paralogous interactions. The basis for this is that if two proteins are paralogs then the proteins that they are observed to interact with are often also paralogs. This observation is related to the notion of interologs proposed by Vidal and co-workers (9).
To validate a given interaction between a pair of proteins, P1 and P2, all the paralogs of P1 and P2 are collected, and the number of interactions observed in DIP between these two families, excluding the interaction between P1 and P2 itself, is counted (Fig. 3). This count is the PVM score.
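The scoring rule can be sketched in a few lines. This is an illustrative reading of the description, not the authors' implementation: it assumes that each family consists of the query protein together with its paralogs, that interactions are undirected, and that self-interactions are not counted. The function name and the toy identifiers are hypothetical.

```python
def pvm_score(p1, p2, paralogs, interactions):
    """Count observed interactions between the families of p1 and p2,
    excluding the queried p1-p2 interaction itself."""
    family1 = {p1} | paralogs.get(p1, set())
    family2 = {p2} | paralogs.get(p2, set())
    observed = {frozenset(pair) for pair in interactions}  # undirected pairs
    query = frozenset((p1, p2))
    supporting = set()
    for a in family1:
        for b in family2:
            pair = frozenset((a, b))
            if a != b and pair != query and pair in observed:
                supporting.add(pair)
    return len(supporting)


# Hypothetical toy data: A2 is a paralog of A, B2 is a paralog of B.
toy_paralogs = {"A": {"A2"}, "B": {"B2"}}
toy_interactions = [("A", "B"), ("A2", "B2"), ("A", "B2"), ("C", "D")]

print(pvm_score("A", "B", toy_paralogs, toy_interactions))
print(pvm_score("C", "D", toy_paralogs, toy_interactions))
```

A pair whose partners have no paralogs, such as C and D in the toy data, can only score zero, which is why the method applies only to proteins with paralogs.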
To ascertain the ability of this method (PVM) to identify true interactions and ignore false interactions, the behavior on datasets of interacting proteins must be compared with the behavior on datasets of non-interacting proteins. We generated the datasets of non-interacting proteins computationally because of the difficulty in crafting such a set from reports within the literature. The three random sets of protein interactions (RND1, RND2, RND3) described under "Experimental Procedures" were used as the non-interacting sets; although these sets will not be entirely free of interactions, the percentage should be very small (see "Experimental Procedures").
Three sets of protein interactions were used as true interaction sets, the INT, EC2, and EC3 sets (see "Experimental Procedures"). The EC2 and EC3 sets are smaller than the INT set (Table I) and can be used by PVM but are not suitable as reference datasets for EPR, because the uncertainty in EPR is large for such small datasets (Table II).
The efficacy of the PVM method can be illustrated by a selectivity-sensitivity curve (also known as a receiver-operator characteristic curve) shown in Fig. 5. It shows that a score that selects few (1%) false positives is sensitive to about 40% of the true interactions. That is, the method shows high specificity but a lower sensitivity. This lack of sensitivity in part reflects the lack of paralogs of some proteins; interactions between such proteins cannot score >0. Thus, if the INT, EC3, and EC2 sets are modified to consider only those pairs where at least one of each of the pairs has more than one paralog (Fig. 5), an improvement in sensitivity of about 10% is observed. The low sensitivity is therefore not caused solely by the lack of paralogs but is perhaps because of both the lack of experimental data and, in a number of cases, a lack of any paralogous interactions.
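One point on such a curve can be computed from the scores of a trusted set and of a random set. The sketch below is a minimal illustration; the function name and the toy score lists are assumptions and do not reproduce the values of Fig. 5.

```python
def rates_at_threshold(true_scores, random_scores, threshold=0):
    """Return (sensitivity, false positive rate) for the rule score > threshold."""
    if not true_scores or not random_scores:
        raise ValueError("both score lists must be non-empty")
    sensitivity = sum(s > threshold for s in true_scores) / len(true_scores)
    false_positive_rate = sum(s > threshold for s in random_scores) / len(random_scores)
    return sensitivity, false_positive_rate


# Hypothetical PVM scores for a trusted set and for a random set of pairs.
toy_true_scores = [0, 2, 1, 0, 3]
toy_random_scores = [0] * 99 + [1]

print(rates_at_threshold(toy_true_scores, toy_random_scores))
```

Sweeping the threshold over the observed scores and plotting sensitivity against the false positive rate traces out the full curve.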
The receiver-operator characteristic curve also demonstrates that the magnitude of the score is unimportant, merely that a score greater than zero indicates a high probability that an interaction exists. Thus, if a given low reliability interaction (such as Y2H (32)) has paralogs but a score of zero, it can be validated either directly or by testing for a paralogous interaction.
It is clear that PVM can only be used in cases where the proteins involved in the interaction have paralogs. In S. cerevisiae 3130 of the 6356 proteins have paralogs (about 50%). This level of paralogs appears to be typical. Koonin et al. (33) found that 46% of the Escherichia coli genome has paralogs, and
of the proteins within the COG database (34) are found to have paralogs.
DISCUSSION
PVM, on the other hand, is able to assess the quality of individual protein-protein interactions. However, it can also estimate the total number of biologically relevant interactions within a dataset. This estimation is based on the observation that in the subsets of EC2, EC3, and INT with paralogs, 50% of the interactions are identified by PVM (Table I). Thus, PVM should identify about 50% of the biologically relevant interactions within any given dataset. The number of true interactions within a set can, therefore, be estimated as twice the number given by PVM. In the DIP-YEAST set only 1428 of the 6083 interactions that could score did. Thus the expected number of true interactions is around 2800 of the subset with paralogs. This suggests that about 2800 of about 6000 interactions are valid, giving an error rate for the overall DIP-YEAST of around 50%. This compares well with the EPR estimation of about 47% given in Table II.
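The arithmetic behind this estimate can be written out explicitly. It only restates the numbers quoted in the text, under the stated assumption that PVM recovers about half of the true interactions; the symbols $N_{PVM}$, $N_{scorable}$, and $N_{true}$ are labels introduced here for illustration.
$$N_{true} \approx 2\,N_{PVM} = 2 \times 1428 = 2856, \qquad 1 - \frac{N_{true}}{N_{scorable}} \approx 1 - \frac{2856}{6083} \approx 0.53$$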
The ability of PVM to identify roughly half the true interactions within a given dataset means that it can also be used to indicate the quality of a dataset, by means of the percentage of identified interactions. The different Ito et al. (7) subsets described under "Results" and "Experimental Procedures" were examined separately using PVM, and it was found that as the number of independent observations of the interactions increased from 1 to 8 the percentage of the dataset identified as correct by PVM increased (Table I) much as the EPR index improves (Fig. 4B). The efficacy of PVM can also be demonstrated by examining the EPR of the subset of DIP-YEAST selected by PVM. It demonstrates that this dataset behaves within experimental error like the INT set (Table II).
DIP Yeast Interactions Estimated to be Correct
There are about 5600 interactions within the DIP-YEAST dataset identified solely in the genome-wide Y2H screens. These include roughly 3000 interactions that were reported by Ito et al. (7) on the basis of only a single IST. Although these interactions are expected to contain many false positives (26), the results in Tables I and II demonstrate that they still contain a significant proportion of true positives, and a method such as PVM is ideally suited to identify at least some of them.
A subset of the DIP-YEAST interactions believed to be correct can be identified by merging the PVM (1428), INT (2246), and EC2 (1179) sets (Table I); because the sets overlap, this gives a total of 3003 interactions. This set is denoted as the CORE and is available on the DIP website (dip.doe-mbi.ucla.edu). Four hundred fifty-four of the CORE interactions are identified by PVM alone and as such could not be validated by any other method. Fig. 2 shows that this CORE set of interactions has a correlation of function pattern that is similar to the sets believed to be correct (INT, EC2, and EC3). The gross number of interactions predicted to be correct based on the EPR index of DIP-YEAST is 4000. Thus, although PVM is able to identify putatively correct interactions with very high selectivity, it is unable, even with the inclusion of INT and EC2, to extract from DIP-YEAST all the interactions that are estimated to be correct by EPR.
FOOTNOTES
Published, MCP Papers in Press, April 4, 2002, DOI 10.1074/mcp.M100037-MCP200
¹ The abbreviations used are: DIP, Database of Interacting Proteins; EPR, expression profile reliability; IST, interaction sequence tag; PVM, paralogous verification method; Y2H, yeast two hybrid; GY2H, genome-wide Y2H; YPD, Yeast Protein Database.
* This work was supported in part by National Institutes of Health and the Department of Energy. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Contributed equally to this work.
Supported by a Wellcome Trust fellowship.
¶ To whom correspondence should be addressed: Howard Hughes Medical Inst., Molecular Biology Inst., UCLA-DOE Laboratory of Structural Biology and Molecular Medicine, University of California, Los Angeles, P.O. Box 951570, Los Angeles, CA 90095-1570. Tel.: 310-825-3754; Fax: 310-206-3914; E-mail: david{at}mbi.ucla.edu.