In our search for insect-specific protein domains, we combined masked specific databases with an automated domain discovery tool (MKDOM2). This yielded a very large number of potential domains, and we applied rules to restrict the manual analysis to about 200 domains, of which 18 are either completely new or usefully extend an existing domain. Clusters containing sequences from only one species were arbitrarily skipped (~50%) to avoid redundancy in the dataset. The number of clusters to analyze could change dramatically as more insect genomes become available, because the sister groups of the studied clades (mainly D. melanogaster and A. gambiae) will probably possess some of these sequences. In fact, these single-species clusters could be an important data source, especially for finding domains related to parasite-vector relationships (e.g., Plasmodium falciparum and Anopheles gambiae, or Trypanosoma cruzi and Rhodnius prolixus). Nevertheless, it is important to remember that this analysis produced more than 36,000 putative domain clusters, most of which have not yet been analyzed. The complete database, the scripts used, and the results, including the full list of putative domains, are available on the Methods page.
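The single-species rule can be illustrated with a minimal sketch. This is not the script used in this work. The cluster layout (a dictionary mapping a cluster identifier to a list of sequence identifier and species pairs), the function name `split_by_species_count`, and all identifiers in the toy data are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each cluster is a list of (sequence_id, species) pairs.
Cluster = List[Tuple[str, str]]


def split_by_species_count(
    clusters: Dict[str, Cluster],
) -> Tuple[Dict[str, Cluster], Dict[str, Cluster]]:
    """Separate clusters spanning several species from single-species ones.

    Returns (multi_species, single_species). Single-species clusters are
    kept rather than discarded, so they can be revisited later.
    """
    multi: Dict[str, Cluster] = {}
    single: Dict[str, Cluster] = {}
    for cluster_id, members in clusters.items():
        species = {sp for _, sp in members}
        if len(species) > 1:
            multi[cluster_id] = members
        else:
            single[cluster_id] = members
    return multi, single


# Toy data with invented identifiers.
clusters = {
    "c1": [("seqA", "D. melanogaster"), ("seqB", "A. gambiae")],
    "c2": [("seqC", "A. gambiae"), ("seqD", "A. gambiae")],
    "c3": [("seqE", "D. melanogaster")],
}

multi, single = split_by_species_count(clusters)
```

Returning the single-species clusters as a separate set, instead of dropping them, reflects the point made above that they may become informative once further genomes are available.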
These results are encouraging and confirm that it is possible to detect new domains that were not found by previous domain discovery methods. Preparing the specific database with filtering and masking steps helps to find new domains; however, it would be worthwhile to explore further filtering to remove biased sequences such as signal peptides, transmembrane regions, and coiled coils, among others. As shown in Table I, about 20% of the clusters were tagged as biased regions.
The appearance of newly sequenced complete genomes (e.g., Apis mellifera), with many others to come, will probably lead us to cluster a new protein collection in order to assess the robustness of the method. This should reveal more insect-specific domains.