MATERIALS AND METHODS
Retrieve the A. gambiae and D. melanogaster sequences
All insect sequences, including those from Drosophila and Anopheles, were retrieved from Swiss-Prot and TrEMBL using the Sequence Retrieval System (SRS) (Zdobnov, Lopez et al. 2002) at the European Bioinformatics Institute (EBI, http://www.ebi.ac.uk). The sequences were downloaded as a FASTA file, excluding all proteins annotated with a known domain in the INTERPRO database (Mulder, Apweiler et al. 2002; Mulder, Apweiler et al. 2003).
In addition, we extracted the full proteomes of A. gambiae and D. melanogaster from the ENSEMBL database (Hubbard, Barker et al. 2002; Birney, Andrews et al. 2004) via MySQL. The regions of the sequences annotated in INTERPRO were masked with a Perl script, leaving only the residues not yet annotated as part of a domain. A further masking step was applied with SEG (Birney, Andrews et al. 2004) to mask the low-complexity regions in the sequence database. Each masked sequence was then split into fragments by removing the masked regions, so that one original sequence could yield several individual fragments. Every fragment was assigned a new accession number derived from the accession number of its parent sequence, to allow later identification of the fragments.
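The masking and splitting steps can be illustrated with a minimal Python sketch. It is not the original Perl script. The mask character "X", the 1-based inclusive region coordinates, the function names and the "parent_N" accession scheme are all assumptions made for illustration, since the text specifies none of them.

```python
import re

def mask_regions(sequence, regions, mask_char="X"):
    """Replace annotated regions with mask_char.

    regions: iterable of (start, end) pairs, 1-based and inclusive
    (an assumed coordinate convention).
    """
    residues = list(sequence)
    for start, end in regions:
        # Clamp the region to the sequence boundaries before masking.
        for i in range(max(start, 1) - 1, min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)

def split_masked(accession, sequence, mask_char="X", min_len=1):
    """Split a masked sequence into its unmasked fragments.

    Returns a list of (fragment_accession, start, end, fragment) tuples,
    with 1-based inclusive coordinates on the parent sequence.
    The "<parent>_<n>" naming is a hypothetical scheme.
    """
    unmasked = re.compile(f"[^{re.escape(mask_char)}]+")
    fragments = []
    for match in unmasked.finditer(sequence):
        fragment = match.group()
        if len(fragment) < min_len:
            continue  # drop fragments shorter than the requested minimum
        index = len(fragments) + 1
        fragments.append(
            (f"{accession}_{index}", match.start() + 1, match.end(), fragment)
        )
    return fragments

# Toy example with a made-up accession and sequence.
masked = mask_regions("MKTAVLSSGHILDEPQ", [(5, 8), (13, 14)])
fragments = split_masked("PARENT1", masked)
```

Keeping the parent coordinates alongside each derived accession lets a fragment be mapped back to its position in the original protein. Note that a genuine "X" residue in the input would be treated as masked in this sketch.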
Cluster the proteins against themselves
To group the proteins we used the MKDOM2 tool, which by default excludes sequences shorter than 20 amino acids because they are considered too short to correspond to genuine structured domains. After a few trial runs, however, this limit proved too low, and all clusters with a mean length below 50 residues were skipped in the analysis.
The results are given in two files: an input file for XDOM (a graphical tool for visualizing clusters) and a multiple alignment file. We kept the latter as the starting point for selecting the potential domains.
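The mean-length cut-off described above can be sketched as follows. This is an illustrative sketch only: the in-memory representation of clusters (a dictionary mapping a cluster identifier to its aligned sequences) and the function name are assumptions, and gap characters are assumed to be "-" or ".".

```python
from statistics import mean

def filter_by_mean_length(clusters, min_mean_len=50):
    """Keep clusters whose mean ungapped sequence length is >= min_mean_len.

    clusters: dict mapping a cluster id to a list of aligned sequences
    (an assumed representation of the multiple alignment file).
    """
    kept = {}
    for cluster_id, sequences in clusters.items():
        # Count residues only, ignoring alignment gap characters.
        lengths = [len(s.replace("-", "").replace(".", "")) for s in sequences]
        if lengths and mean(lengths) >= min_mean_len:
            kept[cluster_id] = sequences
    return kept

# Toy example with made-up cluster ids.
toy = {
    "c1": ["A" * 60, "A" * 40 + "-" * 20],  # mean length 50, kept
    "c2": ["A" * 30, "A" * 45],             # mean length 37.5, skipped
}
kept = filter_by_mean_length(toy)
```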
Select the putative clusters: the potential domains
We arbitrarily restricted the analysis to clusters containing at least 15 sequences originating from a minimum of two different species, to avoid clusters formed only by paralogous gene products and possible redundancy in our dataset: a nearly identical sequence might appear several times in UniProt or in the same genome, leading to uninformative clusters.
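A minimal sketch of this selection rule is given below. It assumes that each cluster is available as a list of (accession, species) pairs; that representation, the function name and the toy identifiers are hypothetical.

```python
def select_clusters(clusters, min_seqs=15, min_species=2):
    """Keep clusters with enough sequences from enough distinct species.

    clusters: dict mapping a cluster id to a list of (accession, species)
    pairs (an assumed representation).
    """
    selected = {}
    for cluster_id, members in clusters.items():
        species = {sp for _, sp in members}
        if len(members) >= min_seqs and len(species) >= min_species:
            selected[cluster_id] = members
    return selected

# Toy example: "mixed" passes both thresholds, "single" comes from one
# species only, and "small" has too few sequences.
toy = {
    "mixed": [(f"a{i}", "A. gambiae") for i in range(10)]
             + [(f"d{i}", "D. melanogaster") for i in range(5)],
    "single": [(f"a{i}", "A. gambiae") for i in range(20)],
    "small": [("a0", "A. gambiae"), ("d0", "D. melanogaster")],
}
selected = select_clusters(toy)
```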
We eliminated the clusters that were suspected to contain a signal peptide, a transmembrane region or a coiled-coil region, using SignalP (Nielsen, Engelbrecht et al. 1997), TMPred (Hofmann and Stoffel 1993) and COILS (Lupas, Van Dyke et al. 1991), respectively. These filtering steps were performed to exclude compositionally biased sequences.
Convert the multiple alignments into profiles
The multiple alignments obtained with MKDOM2 were converted to profiles, weighted, calibrated and compared against the Swiss-Prot and TrEMBL databases (Bairoch and Boeckmann 1991; Watanabe and Harayama 2001; Boeckmann, Bairoch et al. 2003) using the Pftools package (Bucher et al. 1996).
Several rounds were conducted, rebuilding the profile at each round to include the newly retrieved significant sequences (P < 0.01), until the number of retrieved sequences no longer changed or until the profile lost its identity and began retrieving unrelated sequences.
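The iteration logic can be outlined with the following sketch. It does not call the Pftools programs: `build_profile` and `search` are hypothetical callables standing in for profile construction and database search, and `max_rounds` is an added safeguard not mentioned in the text. Whether a profile has drifted to unrelated sequences is a judgement this sketch does not model.

```python
def iterate_profile(seed, build_profile, search, p_cutoff=0.01, max_rounds=10):
    """Iteratively rebuild a profile until the set of members is stable.

    seed: initial accessions (for example, the members of one cluster).
    build_profile(members) -> profile object (hypothetical stand-in).
    search(profile) -> iterable of (accession, p_value) (hypothetical stand-in).
    Returns (final_members, number_of_rounds_run).
    """
    members = set(seed)
    for round_no in range(1, max_rounds + 1):
        profile = build_profile(members)
        # Keep only hits below the significance threshold.
        hits = {acc for acc, p_value in search(profile) if p_value < p_cutoff}
        updated = members | hits
        if updated == members:
            return members, round_no  # converged: no new sequences retrieved
        members = updated
    return members, max_rounds

# Toy stand-ins: the "profile" is just the member set, and the "database"
# is a small dictionary of made-up hits for each member.
toy_hits = {"a": [("b", 0.001)], "b": [("c", 0.005), ("d", 0.5)], "c": []}

def toy_search(profile):
    results = []
    for member in profile:
        results.extend(toy_hits.get(member, []))
    return results

final_members, rounds = iterate_profile({"a"}, frozenset, toy_search)
```

In the toy run, "d" is never added because its P value exceeds the cut-off, and the loop stops in the first round that retrieves nothing new.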
Downloadable Files
The files that were produced during the process are listed in the following lines, together with a short description.
REFERENCES
Bairoch, A. and B. Boeckmann (1991). "The SWISS-PROT protein sequence data bank." Nucleic Acids Res 19 Suppl: 2247-9.
Birney, E., D. Andrews, et al. (2004). "Ensembl 2004." Nucleic Acids Res 32 Database issue: D468-70.
Boeckmann, B., A. Bairoch, et al. (2003). "The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003." Nucleic Acids Res 31(1): 365-70.
Hofmann, K. and W. Stoffel (1993). "TMbase - A database of membrane spanning protein segments." Biol. Chem. Hoppe-Seyler 374: 166.
Hubbard, T., D. Barker, et al. (2002). "The Ensembl genome database project." Nucleic Acids Res 30(1): 38-41.
Lupas, A., M. Van Dyke, et al. (1991). "Predicting coiled coils from protein sequences." Science 252(5010): 1162-4.
Mulder, N. J., R. Apweiler, et al. (2003). "The InterPro Database, 2003 brings increased coverage and new features." Nucleic Acids Res 31(1): 315-8.
Mulder, N. J., R. Apweiler, et al. (2002). "InterPro: an integrated documentation resource for protein families, domains and functional sites." Brief Bioinform 3(3): 225-35.
Nielsen, H., J. Engelbrecht, et al. (1997). "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites." Protein Engineering 10: 1-6.
Watanabe, K. and S. Harayama (2001). "[SWISS-PROT: the curated protein sequence database on Internet]." Tanpakushitsu Kakusan Koso 46(1): 80-6.
Zdobnov, E. M., R. Lopez, et al. (2002). "The EBI SRS server-new features." Bioinformatics 18(8): 1149-50.