MATERIALS AND METHODS


Retrieve the A. gambiae and D. melanogaster sequences
All insect sequences, including those from Drosophila and Anopheles, were retrieved from Swiss-Prot and TrEMBL using the Sequence Retrieval System (SRS) (Zdobnov, Lopez et al. 2002) at the European Bioinformatics Institute (EBI, http://www.ebi.ac.uk). The sequences were downloaded to a FASTA file, excluding all proteins containing a domain already annotated in the INTERPRO database (Mulder, Apweiler et al. 2002; Mulder, Apweiler et al. 2003).
In addition, we extracted the full proteomes of A. gambiae and D. melanogaster from the ENSEMBL database (Hubbard, Barker et al. 2002; Birney, Andrews et al. 2004) via MySQL. The regions of the sequences annotated in INTERPRO were masked with a Perl script, leaving only the residues not yet annotated as part of a domain. A further masking step with SEG (Birney, Andrews et al. 2004) masked the low-complexity regions in the sequence database. The masked sequences were then split at the masked regions, so that one original sequence could yield several fragments. Each fragment was assigned a new accession number derived from that of its parent sequence, to allow later identification of the fragments.
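The masking and splitting procedure can be illustrated with a minimal Python sketch. It is not the Perl script used in this work: the mask character 'X', the 1-based inclusive region coordinates and the fragment naming scheme (parent accession plus "_n") are assumptions made only for illustration.

```python
from typing import List, Tuple

MASK_CHAR = "X"  # assumption: masked residues are written as 'X'


def mask_regions(seq: str, regions: List[Tuple[int, int]]) -> str:
    """Replace residues inside 1-based inclusive (start, end) regions with MASK_CHAR."""
    residues = list(seq)
    for start, end in regions:
        # Clamp the region to the sequence bounds before masking.
        for i in range(max(start, 1) - 1, min(end, len(seq))):
            residues[i] = MASK_CHAR
    return "".join(residues)


def split_fragments(accession: str, masked_seq: str) -> List[Tuple[str, str]]:
    """Split a masked sequence at masked runs and name each fragment.

    Unmasked sequences keep their original accession; fragments receive
    the hypothetical suffix scheme <accession>_<n>.
    """
    if MASK_CHAR not in masked_seq:
        return [(accession, masked_seq)]
    fragments = []
    current = []
    # The trailing sentinel flushes the last unmasked run.
    for ch in masked_seq + MASK_CHAR:
        if ch == MASK_CHAR:
            if current:
                fragments.append("".join(current))
                current = []
        else:
            current.append(ch)
    return [(f"{accession}_{n}", frag) for n, frag in enumerate(fragments, start=1)]
```

For example, masking positions 3 to 5 of a ten-residue sequence and splitting it yields two fragments whose names point back to the parent entry. A real implementation would also need to distinguish masked residues from genuine 'X' (unknown) residues in the input.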

Cluster the proteins against themselves
To group the proteins we used the MKDOM2 tool, which by default excludes sequences shorter than 20 amino acids because they are considered too short to correspond to genuine structured domains. After a few trial runs, however, this limit proved too low, and all clusters with a mean length below 50 residues were skipped in the analysis.
The results are given in two files: an input file for XDOM (a graphical tool for visualizing clusters) and a multiple alignment file. We kept the latter as the starting point for selecting the potential domains.

Select the putative clusters: the potential domains
We arbitrarily restricted the analysis to clusters containing at least 15 sequences from a minimum of two different species, to avoid clusters formed only by paralogous gene products and to limit possible redundancy in our dataset. A nearly identical sequence may appear several times in UniProt or in the same genome, leading to uninformative clusters.
We eliminated the clusters suspected to contain a signal peptide, a transmembrane region or a coiled-coil region, using SignalP (Nielsen, Engelbrecht et al. 1997), TMPred (Hofmann and Stoffel 1993) and COILS (Lupas, Van Dyke et al. 1991), respectively. These filtering steps were performed to exclude compositionally biased sequences.
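The selection criteria (at least 15 sequences, at least two species, and the 50-residue mean-length cutoff from the clustering step) can be summarized in a short Python sketch. The Member record and the keep_cluster function are hypothetical names introduced here for illustration; they do not reproduce the Perl scripts used in this work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Member:
    """One sequence segment assigned to a cluster (hypothetical record)."""
    accession: str
    species: str
    length: int  # length of the segment within the cluster


def keep_cluster(
    members: List[Member],
    min_sequences: int = 15,
    min_species: int = 2,
    min_mean_length: float = 50.0,
) -> bool:
    """Return True if the cluster passes the size, species and length criteria."""
    if not members or len(members) < min_sequences:
        return False
    # Require sequences from several species to avoid paralog-only clusters.
    if len({m.species for m in members}) < min_species:
        return False
    mean_length = sum(m.length for m in members) / len(members)
    return mean_length >= min_mean_length
```

The thresholds are exposed as parameters so that other cutoffs can be tried without changing the function.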

Convert the multiple alignments into profiles
The multiple alignments obtained with MKDOM2 were converted to profiles, weighted, calibrated and compared against the Swiss-Prot and TrEMBL databases (Bairoch and Boeckmann 1991; Watanabe and Harayama 2001; Boeckmann, Bairoch et al. 2003) using the Pftools package (Bucher et al. 1996).
Several rounds were conducted. At each round the profile was redefined to include the newly retrieved significant sequences (P<0.01), until the number of retrieved sequences no longer changed or the profile lost its identity and began retrieving unrelated sequences.
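The iterative procedure can be outlined as the following Python sketch. It does not call the Pftools programs: the search callable stands in for building a profile from the current members and returning the accessions of significant hits (P<0.01), and is_unrelated stands in for the judgment that a hit is unrelated to the family. Both callables, and the max_rounds safety limit, are hypothetical and introduced only for illustration.

```python
from typing import Callable, FrozenSet, Set, Tuple


def iterate_profile(
    seed: Set[str],
    search: Callable[[FrozenSet[str]], Set[str]],
    is_unrelated: Callable[[str], bool],
    max_rounds: int = 20,
) -> Tuple[FrozenSet[str], str]:
    """Grow a set of member sequences until the search result stabilizes.

    Returns the final member set and the reason for stopping:
    'converged', 'drifted' or 'max_rounds'.
    """
    members = frozenset(seed)
    for _ in range(max_rounds):
        # Hypothetical step: rebuild the profile from `members` and search.
        hits = search(members)
        if any(is_unrelated(h) for h in hits):
            # The profile lost its identity; keep the last clean member set.
            return members, "drifted"
        updated = members | hits
        if updated == members:
            # No new sequences were retrieved in this round.
            return members, "converged"
        members = updated
    return members, "max_rounds"
```

Returning the stop reason alongside the member set makes it possible to separate profiles that converged from those that were abandoned because of drift.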

Downloadable Files

The files produced during the process are listed below, each with a short description.

tablasAnopheles.gz The INTERPRO domain information associated with the Anopheles entries, in table format (original format from ENSEMBL).
tablasDrosophila.gz The INTERPRO domain information associated with the Drosophila entries, in table format (original format from ENSEMBL).
programa1copiaAno.pl A Perl script that merges the ENSEMBL Anopheles sequences with their associated INTERPRO domain information into one file.
programa1copiaDro.pl A Perl script that merges the ENSEMBL Drosophila sequences with their associated INTERPRO domain information into one file.
modifiedtablasAnopheles.gz A file with all the Anopheles gambiae ENSEMBL entries, unmasked but with the INTERPRO domain information.
modifiedtablasDrosophila.gz A file with all the Drosophila melanogaster ENSEMBL entries, unmasked but with the INTERPRO domain information.
AnophelesMasked.gz A file with all the Anopheles gambiae ENSEMBL entries, masked in the regions with associated INTERPRO domain information.
DrosophilaMasked.gz A file with all the Drosophila melanogaster ENSEMBL entries, masked in the regions with associated INTERPRO domain information.
InsectaMasked The database obtained with the SRS tool, after processing with SEG.
Mask Interpro domains A Perl script that masks the already known domain regions of each entry. This script was used to mask the files modifiedtablasDrosophila and modifiedtablasAnopheles. It is a modification of a script by Lorenzo Cerutti.
databaseENSEMBL.gz The complete Drosophila and Anopheles proteomes, masked in the regions annotated in INTERPRO for each entry. This database also includes the sequences recovered with SRS. After these regions were masked, each entry was split into shorter ones with the script splitter.pl to remove the masked regions.
filteredfileForMkdom2.gz The database used in this study, in FASTA format. For each sequence it records the original accession number, whether the sequence was split (split sequences have an additional number at the end of the original ID to indicate that they are fragments) and which regions were masked in the original entry.
clusterLengthFile.gz The MKDOM2 output that includes all the domains, with the number of sequences per group, the maximum length and the mean length of each potential domain.
dbRESULTS.gz An MKDOM2 output file reporting the cluster number (automatically assigned by MKDOM2), the number of sequences in the group and the consensus sequence of the group.
dbRESULTS.result.gz An MKDOM2 output file that lists, for each sequence assigned to a new potential cluster, the sequence name, the length of the potential domain (with its start and end positions), the domain ID and the number of sequences in the cluster.
dbRESULTS.mul.gz Probably the most important of the MKDOM2 output files. It includes all the potential domains, each with its corresponding ID, and the sequences included in each cluster. The script ClusterLength.pl can be used to convert this file into a FASTA file.
dbRESULTS.xdom.gz Input for XDOM, a graphical alternative for analysing the MKDOM2 clusters. We did not use this option, but it remains open for exploration.
ClusterLength.pl.gz A Perl script that converts the original MKDOM2 .mul output file into a FASTA file.
Selectdomains.pl A Perl script that takes the headers of all the groups clustered by MKDOM2 and selects those that have more than 10 sequences and a mean length above 50 residues.
Headers.filtered5_20.gz This file includes all the clusters with more than five sequences and a mean length above 50 residues.
Headers.filteres10-20.gz This file includes all the clusters with more than ten sequences and a mean length above 50 residues.
MKDOM2 A link to the MKDOM2 program used in this work.

References


Bairoch, A. and B. Boeckmann (1991). "The SWISS-PROT protein sequence data bank." Nucleic Acids Res 19 Suppl: 2247-9.
Birney, E., D. Andrews, et al. (2004). "Ensembl 2004." Nucleic Acids Res 32 Database issue: D468-70.
Boeckmann, B., A. Bairoch, et al. (2003). "The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003." Nucleic Acids Res 31(1): 365-70.
Hofmann, K. and W. Stoffel (1993). "TMbase - A database of membrane spanning protein segments." Biol. Chem. Hoppe-Seyler 374: 166.
Hubbard, T., D. Barker, et al. (2002). "The Ensembl genome database project." Nucleic Acids Res 30(1): 38-41.
Lupas, A., M. Van Dyke, et al. (1991). "Predicting coiled coils from protein sequences." Science 252(5010): 1162-4.
Mulder, N. J., R. Apweiler, et al. (2003). "The InterPro Database, 2003 brings increased coverage and new features." Nucleic Acids Res 31(1): 315-8.
Mulder, N. J., R. Apweiler, et al. (2002). "InterPro: an integrated documentation resource for protein families, domains and functional sites." Brief Bioinform 3(3): 225-35.
Nielsen, H., J. Engelbrecht, et al. (1997). "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites." Protein Engineering 10: 1-6.
Watanabe, K. and S. Harayama (2001). "[SWISS-PROT: the curated protein sequence database on Internet]." Tanpakushitsu Kakusan Koso 46(1): 80-6.
Zdobnov, E. M., R. Lopez, et al. (2002). "The EBI SRS server-new features." Bioinformatics 18(8): 1149-50.