MRC Laboratory of Molecular Biology, Cambridge CB2 2QH, UK
* Author for correspondence (e-mail: cvogel{at}mrc-lmb.cam.ac.uk)
Accepted 3 September 2003
SUMMARY
Key words: Protein evolution, Cell-cell recognition, Comparative evolution, Reverse genetics
Introduction
In this paper, we describe the determination of the immunoglobulin superfamily (IgSF) repertoire in the fly Drosophila melanogaster and compare it with that found in the nematode Caenorhabditis elegans. IgSF proteins are well known for their roles in cell-cell recognition and communication, both crucial processes during embryonic development. A comparison of the functions and the size of this superfamily in the two organisms should give some idea of the nature of the changes in protein repertoires that underlie the increases in physiological complexity in the fly, for example, a more elaborate nervous system.
The IgSF repertoire in C. elegans was initially investigated by Hutter et al. (Hutter et al., 2000) and by Teichmann and Chothia (Teichmann and Chothia, 2000). As we show below, refinements of the genome sequence and protein predictions carried out since then have revealed additional members of the IgSF. Another smaller superfamily whose members are involved in cell adhesion processes, the cadherins, has been described previously for both the worm and fly (Hill et al., 2001).
We first describe the determination of the IgSF repertoire in Drosophila and of the new IgSF sequences in C. elegans. We then analyse the IgSF proteins common to both organisms and specific to each, in terms of their homologies and functions. In the conclusion, we discuss the implications of our results for an understanding of the role of this superfamily during metazoan evolution and as a framework for further experimental investigation.
Materials and methods
The names used here for the predicted proteins are the identifiers given in FlyBase and WormBase except for those proteins with names given by experimentalists who previously determined their sequences and, in most cases, their function. These specific names start with a capital letter to denote that they refer to proteins; small letters refer to genes.
A schematic overview of the procedures used to analyse these sequences is shown in Fig. 1 and described in detail below.
Prior to the work described here, the SUPERFAMILY HMMs were matched to the protein sequences predicted from the available genome sequences, including those of Drosophila and C. elegans. The results of these matches are available from the public SUPERFAMILY database (Gough et al., 2001; Gough and Chothia, 2002). We extracted from SUPERFAMILY all Drosophila and C. elegans sequences that are matched by HMMs for IgSF domains with an expectation value (E-value) of less than 0.01. The E-value is a theoretical estimate of the expected error rate. Large-scale tests show that these theoretical expectations are very close to the observed error rates. In our case, an E-value threshold of 0.01 corresponds to a 1% error rate in the structural assignment (Gough et al., 2001).
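The selection step described above can be sketched as follows. This is a minimal illustration only: the `Hit` record, its field names, the superfamily label and the example values are hypothetical and do not reflect the actual SUPERFAMILY data format. Only the 0.01 threshold is taken from the text.

```python
from dataclasses import dataclass

E_VALUE_THRESHOLD = 0.01  # threshold stated in the text


@dataclass(frozen=True)
class Hit:
    """One HMM-to-protein match (hypothetical record layout)."""
    protein_id: str
    superfamily: str
    e_value: float


def select_igsf_proteins(hits, threshold=E_VALUE_THRESHOLD):
    """Return sorted ids of proteins with at least one IgSF match below the threshold."""
    return sorted({
        h.protein_id
        for h in hits
        if h.superfamily == "immunoglobulin" and h.e_value < threshold
    })


# Invented example data, for illustration only.
hits = [
    Hit("protA", "immunoglobulin", 1e-12),
    Hit("protB", "immunoglobulin", 0.5),          # above threshold: rejected
    Hit("protC", "fibronectin type III", 1e-20),  # not an IgSF match
    Hit("protD", "immunoglobulin", 0.009),        # marginal: inspect by eye
]
print(select_igsf_proteins(hits))
```

Matches that pass only narrowly, such as the last record in the example, are the ones that would be passed on for manual inspection.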
HMM matches close to the E-value threshold were inspected by eye and judged
for their correctness. In some cases they were also checked by using SMART
(Schultz et al., 2000) to make
domain assignments. As a result, three sequences matched with only marginally
significant scores by SUPERFAMILY were rejected.
Unassigned regions roughly 100 residues in length, with IgSF domains on both sides, were inspected for the pattern of key residues that is characteristic of the immunoglobulin superfamily (Chothia et al., 1988; Harpaz and Chothia, 1994). Several additional IgSF domains were detected by this procedure.
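A simple way to list candidate regions for this kind of inspection is sketched below. The 80 to 120 residue window is an assumption standing in for "roughly 100 residues", and the tuple layout of the domain assignments is hypothetical. The key residue inspection itself was done by eye and is not modelled here.

```python
def candidate_gaps(domains, min_len=80, max_len=120):
    """Find unassigned regions flanked on both sides by IgSF domains.

    domains: list of (start, end, is_igsf) tuples, 1-based inclusive residue
             ranges of assigned domains in one protein (hypothetical layout).
    Returns a list of (gap_start, gap_end) tuples.
    """
    ordered = sorted(domains)
    gaps = []
    for (s1, e1, ig1), (s2, e2, ig2) in zip(ordered, ordered[1:]):
        gap_len = s2 - e1 - 1
        # Both neighbours must be IgSF domains and the gap about one domain long.
        if ig1 and ig2 and min_len <= gap_len <= max_len:
            gaps.append((e1 + 1, s2 - 1))
    return gaps


# Invented example: two IgSF domains, one non-IgSF domain, one more IgSF domain.
example = [(30, 125, True), (230, 320, True), (330, 420, False), (520, 610, True)]
print(candidate_gaps(example))
```

In the example only the region between the first two domains qualifies; the later 99-residue gap is skipped because one of its neighbours is not an IgSF domain.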
Identification of non-IgSF domains, signal sequences, transmembrane helices and GPI anchors
The proteins identified as containing one or more IgSF domains were
examined for other features and domains, using six servers.
These predictions were edited manually and compared with information from the literature (see below).
The IgSF proteins are either soluble or attached to the membrane by a transmembrane helix or a GPI anchor. For ten proteins, the GPI Predictor (Eisenhaber et al., 1999) found sites for attachment of GPI anchors. For proteins with a transmembrane helix, the IgSF domains are always in the extracellular region. After the immunoglobulin superfamily itself, the next most abundant domains in IgSF proteins are fibronectin type III domains, followed by the ligand-binding domain of the LDL receptor, BPTI-like domains and protein kinase-like domains. Domains from 21 superfamilies are found in both organisms; six domain superfamilies are specific to the fly and ten to the worm.
Revision of gene predictions
In the analyses of metazoan genome sequences, a significant fraction of the predictions made for large proteins are incomplete, particularly at their N and/or C termini (Teichmann and Chothia, 2000; Hill et al., 2001). Some of these errors can be detected if there are already experimental determinations of the predicted sequences, or of close homologues, and corrected by matching the experimental sequences to the genome using the GENEWISE procedure (see below).
To detect whether predicted protein sequences are incomplete, they were matched against three sets of experimental sequences.
In addition to these improvements in the sequences of the current FlyBase release number 3 (http://www.fruitfly.org/sequence/dlMfasta.shtml), there are 13 cases of genes predicted by the previous release, number 2, that are shorter or absent in the current release. These sequences are indicated in Tables 1 to 3.
Revision of the C. elegans IgSF repertoire
IgSF proteins in C. elegans were described previously (Hutter et al., 2000; Teichmann and Chothia, 2000). In Teichmann and Chothia (Teichmann and Chothia, 2000), 64 proteins were identified. Since then, new predictions based on revised genome sequences have been released (http://www.wormbase.org/downloads.html). These were analysed using procedures similar to those described above for Drosophila proteins. This resulted in a new total of 80 IgSF proteins in C. elegans. Of these 80, 53 are identical or nearly identical to those found in the previous work, eight are revised versions of old predictions and 19 are new (Tables 2 and 3). For the revised versions, the respective homologue in C. briggsae was examined and taken in one case (SSSD1.1) to improve the gene prediction using GENEWISE (Birney and Durbin, 2000).
Classification of IgSF proteins
In discussing the IgSF proteins we find that it is useful to divide them
into six classes. These classes are based on broad functional similarities,
although within each class the proteins also have common features in terms of
domain architecture. Proteins that share a particular domain architecture
belong largely, but not always, to the same cluster of closely related IgSF
proteins. Details of these relationships are described in Tables 1 to 3 and the text below.
Cell surface I (see Fig. 2)
These are proteins that span the cell membrane via a transmembrane helix or
are attached to the cell surface by a GPI anchor. They have an extracellular
region that is exclusively, or almost exclusively, composed of IgSF and
fibronectin type III (FnIII) domains, and cytoplasmic domains that are not
kinases or phosphatases. Experimentally characterised proteins in this class
are mainly cell-adhesion molecules that play important roles in
development.
Cell surface III (see Fig. 2)
These are proteins that span the cell membrane via a transmembrane helix or
are attached to the cell surface by a GPI anchor. They have an extracellular
region that is composed of IgSF domains and a variety of different domains.
Experimentally characterised proteins in this class act as signalling
molecules during neural development.
Secreted proteins (see Fig. 3)
These proteins have a variety of different domain architectures that can consist of just IgSF domains but can also include other domains, some of which are unusual. They act as intercellular messengers: secreted by one cell and interacting with cell surface receptors on other cells. Three different groups of proteins fall into this class: (1) proteins that have been shown experimentally to be secreted; (2) proteins that have a signal sequence but no predicted transmembrane helix or GPI anchor; and (3) proteins that have no predicted signal sequence, transmembrane helix or GPI anchor but show sequence similarity to a protein from (1) or (2) according to the E-value threshold described below.
Muscle proteins (see Fig. 3)
These proteins are usually rather long with more than ten IgSF domains in a
row, sometimes in combination with FnIII domains in a characteristic pattern.
Some muscle proteins also have kinase domains. Experimentally characterised
proteins in this class are all involved in muscle function.
Proteins were assigned to one of these six classes if (1) experimental work demonstrated functions characteristic of that class, (2) features of the domain architecture clearly pointed towards affiliation with that class, and/or (3) the protein showed sequence similarity to a member of that class according to the E-value threshold described below. The few proteins for which none of the criteria (1), (2) or (3) applies were grouped into a `bin' class called `proteins of unknown cellular localisation'.
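The three criteria can be read as a fallback chain, sketched below. The text combines the criteria with "and/or" and does not state a priority, so the ordering used here (experiment first, then architecture, then homology) is an assumption, as are the function name and the class labels passed in as strings.

```python
from typing import Optional

UNKNOWN = "proteins of unknown cellular localisation"


def assign_class(experimental: Optional[str],
                 architecture: Optional[str],
                 homology: Optional[str]) -> str:
    """Return the functional class suggested by the first available line of evidence.

    Each argument is the class indicated by that kind of evidence, or None if
    that evidence is missing. The priority order is an assumption.
    """
    for evidence in (experimental, architecture, homology):
        if evidence is not None:
            return evidence
    return UNKNOWN  # the `bin' class for proteins meeting none of the criteria


print(assign_class(None, None, "secreted proteins"))
print(assign_class(None, None, None))
```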
The final set of IgSF protein sequences in the two organisms has a variety of domain architectures. Figures 2 and 3 illustrate this variety in the IgSF repertoire of fly and worm in terms of the number and kind of domains observed in the proteins. The number of domains per protein varies from one in small signalling proteins to 68 in fly Titin. A few very long proteins occur in the muscle and extracellular matrix protein classes.
Detection of relationships between IgSF proteins in Drosophila and C. elegans by sequence comparisons
In the following sections we describe and compare the IgSF proteins. To discover the relationships described below for IgSF proteins in C. elegans and Drosophila, we considered a combination of E-values for the matching sequence pairs or, for closely related proteins, sequence identities, match lengths and domain architectures. For proteins that are closely related to known structures or are very short, we also used key residue analysis (Chothia et al., 1988; Harpaz and Chothia, 1994). But before presenting this it is useful to discuss the different levels of sequence similarities that exist in these proteins and their relation to function.
By definition, all the proteins considered here contain at least one IgSF
domain and are therefore homologous in at least that region. However,
relationships at this basic level are not very informative. What is of more
use are relationships that imply some functional annotation. We tried,
therefore, to identify by sequence comparisons clusters of closely related
IgSF proteins whose members are likely to have been produced by relatively
recent gene duplication events and to have similar functions. To do this we
first determined the extent to which indications of affiliation to one of the
six functional classes can be detected from comparison of sequences. We took
the 58 Drosophila IgSF proteins whose function has been
experimentally characterised and allocated them to one of the six functional
classes described in the last section. The 58 proteins were then matched to
each other using the Smith-Waterman algorithm
(Smith and Waterman, 1981).
The scores in terms of E-value and sequence identity made by each of the
matched pairs were examined.
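For readers unfamiliar with the Smith-Waterman algorithm, the core local-alignment recurrence can be sketched as below. This is a toy version under stated assumptions: it uses a flat match/mismatch score and a linear gap penalty chosen for illustration, whereas protein comparisons normally use a substitution matrix and affine gap penalties, and it returns only the raw score, not the E-value or sequence identity reported in the text.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    The scoring parameters are illustrative assumptions, not those of the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

The example shows the defining property of a local alignment: the shared core contributes to the score while the unrelated flanking residues are ignored.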
For protein pairs whose sequence identities are greater than 40%, their close relationship is obvious. But for those where it is smaller than 40%, a statistical measure such as the E-value is much more reliable for inference of homology than sequence identity (Brenner et al., 1998). For those pairs that have E-values lower than 10^-20, the results are plotted in Fig. 4. Matches that occur between proteins in the same functional class and those that occur between proteins in different classes are distinguished. The plot shows clearly that most matches with an E-value lower than 10^-35 are between proteins within the same functional class. The exceptions, where proteins of different functional classes match with E-values lower than 10^-35, arise from two clusters. The Beat cluster has 14 members, of which four are cell surface class I proteins and ten are secreted proteins. Lachesin and Amalgam are two closely related proteins, the first of which is a cell surface class I protein and the second a member of the secreted proteins class.
Thus, the matches made between the 58 Drosophila proteins suggest that sequences with identities of 40% or greater or E-values below 10^-35 belong to the same functional class. Note that in these matches the match region covered more than 50% of the length of both proteins. (It should also be noted that not all proteins within a functional class match each other with a score lower than 10^-35. This means that only positive results are significant; a negative one just means that a function cannot be inferred by sequence comparisons.)
All the IgSF proteins meeting these conditions were then grouped into clusters of closely related, homologous proteins using a single linkage algorithm: a protein qualifies as a member of a cluster if it matches at least one of the other cluster members within the above mentioned thresholds. All clusters were inspected by eye to ensure accuracy, and a few clusters were split into separate clusters based on domain architectures and inter-domain connections of subgroups of proteins within the cluster, as described below. We used these clusters to assign uncharacterised proteins that were homologous to characterised proteins to the six functional classes.
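Single-linkage grouping of this kind can be sketched with a union-find structure, as below. This is an illustration, not the procedure actually used. The cutoffs in `is_close` (identity of at least 40%, or E-value below 1e-35, with more than 50% coverage of both proteins) are our reading of the conditions stated above, and the tuple layout of the match records is hypothetical. The subsequent manual inspection and splitting of clusters is not modelled.

```python
def is_close(identity, e_value, cov_a, cov_b):
    """Apply the assumed thresholds: coverage > 50% of both proteins and
    either identity >= 40% or E-value < 1e-35."""
    return cov_a > 0.5 and cov_b > 0.5 and (identity >= 40.0 or e_value < 1e-35)


def single_linkage(proteins, matches):
    """Cluster proteins so that each member matches at least one other member.

    matches: iterable of (a, b, identity, e_value, cov_a, cov_b) tuples
             (hypothetical record layout).
    """
    parent = {p: p for p in proteins}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, identity, e_value, cov_a, cov_b in matches:
        if is_close(identity, e_value, cov_a, cov_b):
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


# Invented example: P1-P2 link by identity, P2-P3 by E-value, P3-P4 fails coverage.
proteins = ["P1", "P2", "P3", "P4"]
matches = [
    ("P1", "P2", 45.0, 1e-10, 0.8, 0.9),
    ("P2", "P3", 30.0, 1e-40, 0.7, 0.6),
    ("P3", "P4", 55.0, 1e-50, 0.3, 0.9),
]
print(single_linkage(proteins, matches))
```

Note that P1 and P3 end up in the same cluster without matching each other directly. This chaining is inherent to single linkage and is one reason the clusters were checked by eye.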
Results and discussion
Drosophila and Anopheles gambiae (mosquito) diverged from their common ancestor some 250 million years ago. Of the 142 Drosophila proteins, 128 have a clear orthologue in Anopheles, i.e. the Drosophila and Anopheles homologues match each other with scores better than those they make to any other protein. A similar situation applies to C. elegans and C. briggsae, which diverged some 40 million years ago: here, eight IgSF proteins in C. elegans lack an orthologue in C. briggsae. The existence of clear orthologues is good evidence that the matching proteins are not pseudo-genes. The absence of a match, however, does not necessarily mean that the sequence is a pseudo-gene. It may arise from incomplete predictions, the loss of the protein in Anopheles or C. briggsae, or its recent formation in Drosophila or C. elegans.
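The orthologue criterion used here, that two proteins match each other better than either matches any other protein, is commonly implemented as a reciprocal best hit test. The sketch below shows that reading; the dictionary layout, the identifiers and the scores are invented for illustration, and higher scores are assumed to be better.

```python
def best_hits(scores):
    """scores: dict mapping (query, target) -> score, higher is better.
    Returns a dict mapping each query to its best-scoring target."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented scores for two fly and two mosquito proteins.
fly_vs_mosquito = {("fly1", "mos1"): 900, ("fly1", "mos2"): 300,
                   ("fly2", "mos1"): 500, ("fly2", "mos2"): 100}
mosquito_vs_fly = {("mos1", "fly1"): 900, ("mos1", "fly2"): 500,
                   ("mos2", "fly1"): 300, ("mos2", "fly2"): 100}
print(reciprocal_best_hits(fly_vs_mosquito, mosquito_vs_fly))
```

In the example, fly2 has a best hit (mos1) but is not the best hit of mos1, so it is left without a clear orthologue, which is the situation the text cautions against over-interpreting.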
Prior to this work, 58 Drosophila and 22 C. elegans proteins had been identified by experimental work and assigned a function. All but 25 of the other 84 Drosophila and the 58 C. elegans IgSF proteins have been assigned to one of the six functional classes defined above. Those not classified, 12 in Drosophila and 13 in C. elegans, are placed in a class termed `proteins of unknown cellular localisation' (see Tables 1 and 2).
The assignments to these functional classes have been made on the basis of sequence homology and/or the presence or absence of signal sequences and transmembrane helices. The problem with using the latter features is that the prediction of long protein sequences often misses out N-terminal and C-terminal regions (Teichmann and Chothia, 2000; Hill et al., 2001). Thus, we might expect that, in some cases, proteins currently placed in the secreted proteins class, because they have a signal sequence but no transmembrane helix or GPI anchor site, will be transferred to a cell surface class by subsequent discovery of a C-terminal region with one of these features. Similar revisions could well transfer proteins currently in the unknown class to the secreted or cell surface classes.
Table 4 summarises the distribution of the proteins, and clusters of closely related proteins, between the different functional classes. In both organisms, the two largest functional classes are the cell surface class I proteins (82 and 30 in fly and worm, respectively) and the secreted proteins class (22 and 12 proteins) many of whose members have important roles during development. These proteins form three-quarters of the Drosophila IgSF repertoire and half of that in C. elegans. The average size of the two clusters in Drosophila is larger than in C. elegans. The other four functional classes have similar numbers of fly and worm proteins. As mentioned above, these numbers are likely to be modified when more accurate data become available, but any such changes are unlikely to change the general result.
Many members of the large clusters have been previously identified: 20 proteins in the DPR cluster (Nakamura et al., 2002), all 14 Beat proteins (Fambrough and Goodman, 1996), Sidestep on its own (Sink et al., 2001), three Kekkons (Musacchio and Perrimon, 1996), and Wrapper and Klingon (Butler et al., 1997; Noordermeer et al., 1998). Except for the cluster of Wrapper/Klingon, all these larger clusters are in the set of Drosophila-specific proteins that do not have C. elegans orthologues. This is an example of the lineage-specific expansions of protein families described by Aravind et al. (Aravind et al., 2000).
Comments on individual proteins and protein clusters
Beat and Dpr clusters
These two clusters had been identified and their functions determined prior to this work (Fambrough and Goodman, 1996; Nakamura et al., 2002; Pipes et al., 2001). Although some of the Beat proteins have only marginal or no sequence matches, key residue analysis shows they are all related to each other. Note that some Beat proteins are attached to the cell membrane whilst others are secreted.
It proved difficult to reconstruct all the relationships between Dpr1 to Dpr20 described by Nakamura et al. (Nakamura et al., 2002). In some cases, the relationships are very remote and could only be shown by key residue analysis. For some of the sequences, the gene predictions were improved using the GENEWISE procedure with the Dpr-1 homologue as the query sequence (see above and Table 1). Dpr-12 is mentioned in the work by Nakamura et al., but it could not be found in the set of predicted proteins. Owing to its small size (56 amino acids: the size of half an Ig domain), it has been disregarded in this analysis. CG31114-PA, CG14469-PA, CG15380-PA and CG15183-PA are predicted proteins that also belong to the same cluster, but were not mentioned previously.
Dscam cluster
We were able to identify three novel Dscam-like proteins (CG18630-PA in
proposed fusion with CG7060-PA, CG32387-PA and CG31190-PA). Dscam is the
Drosophila homologue of the human Down's syndrome cell adhesion
molecule (DSCAM), which is required for axon guidance
(Schmucker et al., 2000). The
Dscam-like proteins hence represent interesting experimental targets.
CG1084-PA
This protein has been described recently as the Drosophila homologue of human Contactin (Falk et al., 2002). In fact, it makes a somewhat better match to Axonin, as was also found previously for its worm orthologue C33F10.5A (Teichmann and Chothia, 2000). The differences between Axonin and Contactin are subtle, but can be important when looking at the detailed functions of the proteins. For example, Contactin is known to display heterophilic but no homophilic binding activities (Falk et al., 2002), while both were observed for Axonin (Kunz et al., 2002). Both proteins interact with members of the L1 family, e.g. NrCAM, and are involved in axon guidance.
CG15354-PA and CG15355-PA
These two proteins match the N-terminal and C-terminal halves of
CG31970-PA. They are also adjacent on the chromosome. We propose a fusion of
the two predictions to give one protein.
C. elegans IgSF proteins
The IgSF repertoire in C. elegans comprises 80 proteins. Of these, 25 belong to one of seven clusters of two or more homologous worm proteins. If each cluster descends from a single ancestral gene, this means that 25 - 7 = 18 proteins have been produced by gene duplication. This is only about one quarter of the C. elegans repertoire; as we have just seen, the proportion in Drosophila is one-half. The two largest clusters are the Zig proteins (eight members) and PVR-like kinases (five members). The other four have only two members (see Tables 2 and 3). Only 22 of the 80 C. elegans proteins have been identified by experiments.
Comments on individual proteins and protein clusters
Zig proteins
Only Zig-2, Zig-3 and Zig-4 have sequence matches with E-values smaller than 10^-35. The membership of the other sequences in this family is based on their similar domain architecture, functional roles and manual inspection of the sequence alignments (see Aurelio et al., 2003).
SSSD1.1
The SSSD1.1 sequence in WormBase has 623 amino acid residues. Using the homologous C. briggsae sequence and the GENEWISE procedure, we were able to identify additional exons, which increase the length of the predicted protein to 744 residues. SSSD1.1 is probably the C. elegans orthologue of Turtle (see Table 3).
Proteins common and specific to Drosophila and C. elegans
Table 3 lists the proteins
in the 26 clusters of closely related IgSF proteins that this work indicates
as having homologues in Drosophila and C. elegans. These
contain in all 36 proteins from Drosophila and 35 from C.
elegans, i.e. a quarter of those in the first organism and just under
half of those in the second.
Previous work had proposed putative orthologues for the Drosophila proteins DPTP9 (K04D7.4), Lar (C09D8.1), PTP6 (F56D1.4), ImpL2 (C14F5.2, F42F12.2, Y48A6A.1), Kirre (K02E10.8, now SYG-1), Neuroglian (C18F3.2/3) and Klingon/Wrapper (F41D9.3b). Details of these, and the relationships found in this work are described in Table 3.
The cell surface class I has been mentioned above as the largest class in
both organisms and as one of the two classes with large expansions in the fly.
This is also true for the subset of those proteins common to both organisms:
Drosophila has 21 while C. elegans has 12 proteins in the 11
clusters of the cell surface class I. There is only one cluster in this
functional class, Neuroglian, where there are more members in the worm than in
the fly (two and one, respectively). The clusters in the other functional
classes have similar contributions from the two organisms with one exception.
The exception is the PVR cluster of kinases, which has one member from
Drosophila but five from C. elegans. An expansion of the
cluster of kinases in C. elegans has been reported before
(Rubin et al., 2000).
In both organisms, the number of proteins in the two largest functional classes, the cell surface class I and secreted proteins classes, is higher for the organism-specific proteins than for the shared set described above. In the worm, 13 proteins in these two functional classes have a Drosophila homologue, while 25 are worm-specific. In the fly, this difference is even stronger: 25 cell surface class I and secreted proteins have homologues in C. elegans, whereas more than three times as many (82 proteins) are fly-specific. This means that, in addition to the expansion of fly proteins that have homologues in the worm, both organisms also developed a large set of organism-specific proteins, again with a larger expansion in the fly. Proteins of these classes play major roles in cell adhesion processes, and are most likely to contribute to the formation of fly-specific characteristics.
Supplementary database
We have deposited information on each of the IgSF proteins described in
this analysis in an interactive, supplementary database that can be found at
http://www.mrc-lmb.cam.ac.uk/genomes/FlyGee/.
The information includes: alternative protein identifiers or experimental
names, sequence homologies, structural annotation in terms of domains,
transmembrane helices and signal sequences, the amino acid sequence and
extensions of the gene predictions using NRDB90 or cDNA data, or references to
literature. The database can be queried using keywords or protein identifiers.
Each hit can include several sequences that all represent or point to the same
protein: the predicted protein, other sequences such as a matching cDNA
sequence, or the sequence found using GENEWISE, an experimentally determined
sequence and/or the gene prediction from the previous release of the fly
genome.
Conclusions
We have identified 142 IgSF proteins in Drosophila, described
their domain architecture, and obtained an indication of the type of function
that many of the novel proteins are involved in. We have also extended the
work that was previously carried out on IgSF proteins in C. elegans.
These results should be of use in the experimental characterisation of these
proteins. Experiments, in turn, will refine or correct results reported
here.
Some 26 clusters of closely related IgSF proteins are common to the two organisms, and members of these clusters were present prior to the divergence of worm and fly. However, three-quarters of the Drosophila repertoire and half the C. elegans repertoire have emerged since their divergence. This means that a significant fraction of pathways involving the IgSF proteins in the much simpler organism, C. elegans, are not a subset of those in Drosophila but different. We also pointed to the particular expansion of two functional classes, many of whose members are involved in cell adhesion processes that play important roles during development. Relative to C. elegans, the greater size of the Drosophila IgSF repertoire, and the particular nature of many of its proteins, must be one of the factors contributing to, for example, the formation of a more complex cellular structure in Drosophila.
The larger number of IgSF proteins in Drosophila contrasts with a smaller total number of genes: the current counts are 13,639 genes in Drosophila and 19,537 genes in C. elegans (Clamp et al., 2003). Some superfamilies in an organism expanded to improve its adaptation to its environment but without a substantial increase in physiological complexity. Such changes in the protein repertoire could be called `conservative protein family expansions'. One example is the large expansion of two chemoreceptor families in the worm as compared with the fly (Robertson, 1998). Expansion of other superfamilies can lead to the evolution of organisms of higher complexity. This process could be called `progressive protein family expansion'. One example is the expansion of signal transduction domain superfamilies in the metazoan worm as compared with the unicellular baker's yeast (Chervitz et al., 1998). Another example, described here, is the expansion of the IgSF in Drosophila compared with that of C. elegans.
The general validation of this simple distinction between conservative and progressive protein family expansions will require a fuller investigation of the relationship between the size and function of protein superfamilies in organisms of different complexity.
ACKNOWLEDGMENTS
REFERENCES
Aravind, L., Watanabe, H., Lipman, D. J. and Koonin, E. V. (2000). Lineage-specific loss and divergence of functionally linked genes in eukaryotes. Proc. Natl. Acad. Sci. USA 97, 11319-11324.
Aurelio, O., Boulin, T. and Hobert, O. (2003). Identification of spatial and temporal cues that regulate postembryonic expression of axon maintenance factors in the C. elegans ventral nerve cord. Development 130, 599-610.
Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S. R., Griffiths-Jones, S., Howe, K. L., Marshall, M. and Sonnhammer, E. L. L. (2002). The Pfam protein families database. Nucleic Acids Res. 30, 276-280.
Birney, E. and Durbin, R. (2000). Using GeneWise in the Drosophila annotation experiment. Genome Res. 10, 547-548.
Brenner, S. E., Chothia, C. and Hubbard, T. J. P. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA 95, 6073-6078.
Brenner, S. E., Koehl, P. and Levitt, M. (2000). The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res. 28, 254-256.
Butler, S. J., Ray, S. and Hiromi, Y. (1997). klingon, a novel member of the Drosophila immunoglobulin superfamily, is required for the development of the R7 photoreceptor neuron. Development 124, 781-792.
C. elegans Sequencing Consortium (1998). Genome sequence of the nematode Caenorhabditis elegans: a platform for investigating biology. Science 282, 2012-2018.
Chandonia, J. M., Walker, N. S., Lo Conte, L., Koehl, P., Levitt, M. and Brenner, S. E. (2002). ASTRAL compendium enhancements. Nucleic Acids Res. 30, 260-263.
Chervitz, S. A., Aravind, L., Sherlock, G., Ball, C. A., Koonin, E. V., Dwight, S. S., Harris, M. A., Dolinski, K., Mohr, S., Smith, T. et al. (1998). Comparison of the complete protein sets of worm and yeast: Orthology and divergence. Science 282, 2022-2028.
Chothia, C., Boswell, D. R. and Lesk, A. M. (1988). The outline structure of the T-cell Alpha-Beta-receptor. EMBO J. 7, 3745-3755.
Clamp, M., Andrews, D., Barker, D., Bevan, P., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V. et al. (2003). Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res. 31, 38-42.
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics 14, 755-763.
Eisenhaber, B., Bork, P. and Eisenhaber, F. (1999). Prediction of potential GPI-modification sites in proprotein sequences. J. Mol. Biol. 292, 741-758.
Falk, J., Bonnon, C., Girault, J. A. and Faivre-Sarrailh, C. (2002). F3/contactin, a neuronal cell adhesion molecule implicated in axogenesis and myelination. Biol. Cell 94, 327-334.
Fambrough, D. and Goodman, C. S. (1996). The Drosophila beaten path gene encodes a novel secreted protein that regulates defasciculation at motor axon choice points. Cell 87, 1049-1058.
Gough, J. and Chothia, C. (2002). SUPERFAMILY:
HMMs representing all proteins of known structure. SCOP sequence searches,
alignments and genome assignments. Nucleic Acids Res.
30,268
-272.
Gough, J., Karplus, K., Hughey, R. and Chothia, C. (2001). Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 313,903 -919.[CrossRef][Medline]
Harpaz, Y. and Chothia, C. (1994). Many of the immunoglobulin superfamily domains in cell-adhesion molecules and surface-receptors belong to a new structural set which is close to that containing variable domains. J. Mol. Biol. 238,528 -539.[CrossRef][Medline]
Hill, E., Broadbent, I. D., Chothia, C. and Pettitt, J. (2001). Cadherin superfamily proteins in Caenorhabditis elegans and Drosophila melanogaster. J. Mol. Biol. 305,1011 -1024.[CrossRef][Medline]
Holm, L. and Sander, C. (1998). Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 14,423 -429.[Abstract]
Hutter, H., Vogel, B. E., Plenefisch, J. D., Norris, C. R.,
Proenca, R. B., Spieth, J., Guo, C. B., Mastwal, S., Zhu, X. P., Scheel, J. et
al. (2000). Cell biology: Conservation and novelty in the
evolution of cell adhesion and extracellular matrix genes.
Science 287,989
-994.
Karplus, K., Barrett, C. and Hughey, R. (1998). Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846-856.
Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E. L. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305, 567-580.
Krogh, A., Mian, I. S. and Haussler, D. (1994). A hidden Markov model that finds genes in Escherichia-coli DNA. Nucleic Acids Res. 22, 4768-4778.
Kunz, B., Lierheimer, R., Rader, C., Spirig, M., Ziegler, U. and Sonderegger, P. (2002). Axonin-1/TAG-1 mediates cell-cell adhesion by a cis-assisted trans-interaction. J. Biol. Chem. 277, 4551-4557.
Lo Conte, L., Brenner, S. E., Hubbard, T. J. P., Chothia, C. and Murzin, A. G. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res. 30, 264-267.
Madera, M. and Gough, J. (2002). A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res. 19, 30.
Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995). SCOP: a Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J. Mol. Biol. 247, 536-540.
Musacchio, M. and Perrimon, N. (1996). The Drosophila kekkon genes: Novel members of both the leucine-rich repeat and immunoglobulin superfamilies expressed in the CNS. Dev. Biol. 178, 63-76.
Nakamura, M., Baldwin, D., Hannaford, S., Palka, J. and Montell, C. (2002). Defective proboscis extension response (DPR), a member of the Ig superfamily required for the gustatory response to salt. J. Neurosci. 22, 3463-3472.
Nielsen, H., Brunak, S. and von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng. 12, 3-9.
Noordermeer, J. N., Kopczynski, C. C., Fetter, R. D., Bland, K. S., Chen, W. Y. and Goodman, C. S. (1998). Wrapper, a novel member of the Ig superfamily, is expressed by midline glia and is required for them to ensheath commissural axons in Drosophila. Neuron 21, 991-1001.
Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T. and Chothia, C. (1998). Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. 284, 1201-1210.
Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444-2448.
Pipes, G. C., Lin, Q., Riley, S. E. and Goodman, C. S. (2001). The Beat generation: a multigene family encoding IgSF proteins related to the Beat axon guidance molecule in Drosophila. Development 128, 4545-4552.
Robertson, H. M. (1998). Two large families of chemoreceptor genes in the nematodes Caenorhabditis elegans and Caenorhabditis briggsae reveal extensive gene duplication, diversification, movement, and intron loss. Genome Res. 8, 449-463.
Rubin, G. M., Yandell, M. D., Wortman, J. R., Miklos, G. L. G., Nelson, C. R., Hariharan, I. K., Fortini, M. E., Li, P. W., Apweiler, R., Fleischmann, W. et al. (2000). Comparative genomics of the eukaryotes. Science 287, 2204-2215.
Schmucker, D., Clemens, J. C., Shu, H., Worby, C. A., Xiao, J., Muda, M., Dixon, J. E. and Zipursky, S. L. (2000). Drosophila Dscam is an axon guidance receptor exhibiting extraordinary molecular diversity. Cell 101, 671-684.
Schultz, J., Copley, R. R., Doerks, T., Ponting, C. P. and Bork, P. (2000). SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 28, 231-234.
Sink, H., Rehm, E. J., Richstone, L., Bulls, Y. M. and Goodman, C. S. (2001). sidestep encodes a target-derived attractant essential for motor axon guidance in Drosophila. Cell 105, 57-67.
Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.
Stein, L., Mangone, M., Schwarz, E., Durbin, R., Thierry-Mieg, J., Spieth, J. and Sternberg, P. (2001). WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 29, 1012-1012.
Teichmann, S. A. and Chothia, C. (2000). Immunoglobulin superfamily proteins in Caenorhabditis elegans. J. Mol. Biol. 296, 1367-1383.
The Berkeley Drosophila Genome Project, Sequencing Consortium (2000). The genome of Drosophila melanogaster. Science 287, 2185.