This is the practical session for the chapter Biomolecular databases of the course Introduction to bioinformatics. The slides of the lecture are available in various formats.
During the tutorial part of this practical session, we will define a few biological problems, and see how databases can be used to obtain the answer.
It is important to obtain the answers to the problems solved in the tutorial, because the results of some database queries will be used as input for the subsequent practicals (sequence alignment, phylogeny).
A series of exercises will give you the opportunity to use the concepts seen in the tutorials to answer some concrete biological questions.
Students should of course feel free to add their own questions to this list, which can be treated afterwards, if there is some time left.
This tutorial will be based on the following Web resources.
|EMBL||Nucleic sequences||The EMBL Nucleic Sequence Database (EBI - UK)
|Genbank||Nucleic sequences||Genbank (NCBI - USA)
|DDBJ||Nucleic sequences||DDBJ - DNA Data Bank of Japan
|UniProt||Protein sequences||UniProt - the Universal Protein Resource
|PDB||3D structure of macromolecules||PDB - The Protein Data Bank
|EnsEMBL||Genome browser||EnsEMBL Genome Browser (Sanger Institute + EBI)
|UCSC||Genome browser||UCSC Genome Browser (University of California, Santa Cruz - USA)
|ECR||Genome browser||ECR Browser
|Integr8||Comparative genomics||Integr8 - access to complete genomes and proteomes
|Prosite||Protein domains||Prosite - protein domains, families and functional sites
|Pfam||Protein domains||PFAM - Protein families represented by multiple sequence alignments and hidden Markov models (HMMs) (Sanger Institute - UK)
|CATH||Protein domains||CATH - Protein Structure Classification
|InterPro||Protein domains||InterPro (EBI - UK)
|GO||Gene ontology||Gene Ontology Database
|Entrez||Multi-database||A collection of biomolecular databases maintained at the NCBI (USA), accessible via an interface called Entrez.
|SRS||Data warehouse||A collection of biomolecular databases maintained at the European Bioinformatics Institute (EBI, UK), accessible via an interface called SRS.
The number of biomolecular databases is growing so fast that it is impossible to give a balanced survey of all the existing resources. We selected here a few databases on the basis of various criteria (popularity, ease of access, ...) to illustrate the type of information that can be retrieved from them.
As an exercise, we propose to browse some databases in order to collect information about one particular protein. Each student can do the same analysis with a protein of his or her own interest. If you lack inspiration, you can for example run the exercise with the Drosophila protein Ubx.
Choose a protein for which you have some prior knowledge (e.g. the protein Ubx from Drosophila melanogaster), and try to extract all the information relevant to this protein from the databases listed in the table of biomolecular databases above.
In the exercise above, we saw that each database can provide us with a piece of information about some aspect of our protein of interest.
Note that this is just a very small sample of the information that can be obtained via the hundreds of biomolecular databases distributed around the world.
We will now consult two Web servers (NCBI Entrez and EBI SRS) that provide integrated access to multiple databases, making it easier to explore several aspects of a protein of interest.
During this tutorial, we will learn to use the interface of NCBI Entrez to retrieve a protein of interest. As will be seen, a simple formulation of the query generally returns too many hits, and the desired answer may be lost in hundreds or thousands of other records. We will see how to use advanced search options in order to refine the query.
We will try to retrieve from Entrez the information about the Gal4 protein from the budding yeast Saccharomyces cerevisiae.
How many results do you obtain? How many of them correspond to your needs? How could you try to improve the result?
The simple query Gal4 returned 20,513 proteins (Oct 2011). Needless to say, this is far more than what we are looking for: the genome of the budding yeast Saccharomyces cerevisiae contains ~6,000 coding genes, and only one of them codes for the Gal4 protein.
A first reason is that we did not impose any constraint on the organism.
A second reason is that, by typing Gal4 in the query box, we asked Entrez to return all the proteins which contain this string in any field (name, description, ...). Thus, our answer includes some proteins related to Gal4, for example because they interact with this protein, or because a Gal4 fragment was used to construct hybrid proteins (e.g. for enhancer trap experiments).
A first improvement can be obtained by adding words to the query. For instance, we could require the words "Saccharomyces" and "cerevisiae" to be present, in addition to "Gal4".
For this, you can use the logical operators 'AND', 'OR', and 'NOT' within the query. Beware! These operators are case-sensitive: if you type them in lowercase, they will be treated as search words rather than operators.
In the query box, type
Gal4 AND Saccharomyces AND cerevisiae
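The case rule for operators can be pictured with a small Python sketch. It is only a toy model of the behaviour described above, not the parser used by Entrez; the function names are invented for this illustration.

```python
# Toy model of the rule stated in the tutorial: only uppercase AND, OR and NOT
# act as operators. This is an illustration, NOT Entrez's actual query parser.
OPERATORS = {"AND", "OR", "NOT"}


def classify_tokens(query):
    """Label each whitespace-separated token as 'operator' or 'term'."""
    return [(token, "operator" if token in OPERATORS else "term")
            for token in query.split()]


def search_terms(query):
    """Return the tokens that would be treated as words to search for."""
    return [token for token, kind in classify_tokens(query) if kind == "term"]


# Uppercase operators are recognised, so three search words remain.
print(search_terms("Gal4 AND Saccharomyces AND cerevisiae"))
# Lowercase "and" is treated as one more word to search for.
print(search_terms("Gal4 and Saccharomyces and cerevisiae"))
```

In the second call, the word "and" ends up in the list of search words, which is the pitfall the warning above refers to.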
An even more precise way to select Saccharomyces cerevisiae is to quote the pair of words.
Gal4 AND "Saccharomyces cerevisiae"
This will only retain the records where these two words are written consecutively.
What about the result? Did we obtain an improvement? How do you explain the incorrect results?
By combining Gal4 and "Saccharomyces cerevisiae" in the query, we already obtained some improvement: the number of results has been reduced. However, we still obtain a series of proteins which do not directly correspond to Gal4, but are returned because the words of our query were found in some field (name, description, organism, ...).
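The difference between requiring two words and requiring a quoted phrase can be sketched in Python as follows. This is a simplified model of text matching, not the Entrez indexing scheme, and the two description strings are invented for the example.

```python
import re


def words(text):
    """Split a text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_all_words(text, query_words):
    """True if every query word occurs somewhere in the text."""
    present = set(words(text))
    return all(word.lower() in present for word in query_words)


def contains_phrase(text, phrase):
    """True if the words of the phrase occur consecutively in the text."""
    tokens = words(text)
    target = words(phrase)
    n = len(target)
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


# Invented record descriptions, for illustration only.
record_a = "GAL4 from Saccharomyces cerevisiae"
record_b = "GAL4 homolog, Saccharomyces paradoxus; compared with S. cerevisiae"

for record in (record_a, record_b):
    print(contains_all_words(record, ["Saccharomyces", "cerevisiae"]),
          contains_phrase(record, "Saccharomyces cerevisiae"))
```

Both records contain the two words, but only the first contains them side by side, so only the first would survive the quoted query.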
You can refine the selection by specifying the field in which your query text has to be found.
How many results do we obtain now? Do they all fit our needs? How could we refine the query?
We obtained an improvement over our first query (Gal4 alone) by imposing that the value GAL4 be found in the field Gene name. However, we still did not achieve the desired precision (in Oct 2011, the query returned 45 records). There are two reasons:
We would thus like to formulate a query with constraints on multiple fields: GAL4 as gene name and Saccharomyces cerevisiae as organism.
We will further use the Advanced query form to impose constraints simultaneously on gene name and on organism.
GAL4[Gene Name]
Do not click on the Search button yet! We still need to add some constraints.
(GAL4[Gene Name]) AND Saccharomyces cerevisiae[Organism]
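The same kind of fielded query can also be assembled programmatically. The sketch below only builds the query string and a URL for the NCBI E-utilities esearch service; it does not send any request. The helper names are invented for this example, and quoting the organism name is a choice made here, not something the query form requires.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL of the NCBI E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_term(value, field):
    """Attach a field tag to a value, quoting multi-word values."""
    if " " in value:
        value = f'"{value}"'
    return f"{value}[{field}]"


def build_query(constraints):
    """Join (value, field) constraints with the uppercase AND operator."""
    return " AND ".join(fielded_term(value, field) for value, field in constraints)


def esearch_url(term, db="protein", retmax=20):
    """Build (but do not send) an esearch request URL for the given term."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


def decode_term(url):
    """Recover the query term from a URL, to check the encoding round-trips."""
    return parse_qs(urlsplit(url).query)["term"][0]


term = build_query([("GAL4", "Gene Name"), ("Saccharomyces cerevisiae", "Organism")])
url = esearch_url(term)
print(term)
print(url)
print(decode_term(url) == term)
```

Keeping the constraints as a list of (value, field) pairs makes it easy to add or remove one restriction at a time, which mirrors the step-by-step refinement done in the Advanced query form.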
Now that we have selected a reasonably low number of proteins, we can identify the one we were searching for: Gal4p from Saccharomyces cerevisiae.
>gi|1169823|sp|P04386.2|GAL4_YEAST RecName: Full=Regulatory protein GAL4
MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERLEQLFLL
IFPREDLDMILKMDSLQDIKALLTGLFVQDNVNKDAVTDRLASVETDMPLTLRQHRISATSSSEESSNKG
QRQLTVSIDSAAHHDNSTIPLDFMPRDALHGFDWSEEDDMSDGLPFLKTDPNNNGFFGDGSLLCILRSIG
FKPENYTNSNVNRLPTMITDRYTLASRSTTSRLLQSYLNNFHPYCPIVHSPTLMMLYNNQIEIASKDQWQ
ILFNCILAIGAWCIEGESTDIDVFYYQNAKSHLTSKVFESGSIILVTALHLLSRYTQWRQKTNTSYNFHS
FSIRMAISLGLNRDLPSSFSDSSILEQRRRIWWSVYSWEIQLSLLYGRSIQLSQNTISFPSSVDDVQRTT
TGPTIYHGIIETARLLQVFTKIYELDKTVTAEKSPICAKKCLMICNEIEEVSRQAPKFLQMDISTTALTN
LLKEHPWLSFTRFELKWKQLSLIIYVLRDFFTNFTQKKSQLEQDQNDHQSYEVKRCSIMLSDAAQRTVMS
VSSYMDNHNVTPYFAWNCSYYLFNAVLVPIKTLLSNSKSNAENNETAQLLQQINTVLMLLKKLATFKIQT
CEKYIQVLEEVCAPFLLSQCAIPLPHISYNNSNGSAIKNIVGSATIAQYPTLPEENVNNISVKYVSPGSV
GPSPVPLKSGASFSDLVKLLSNRPPSRNSPVTIPRSTPSHRSVTPFLGQQQQLQSLVPLTPSALFGGANF
NQSGNIADSSLSFTFTNSSNGPNLITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPLSPGW
TDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDTPPNPKKE
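A record in this FASTA format is easy to read with a few lines of Python. The sketch below is a minimal parser for a single record, assuming the pipe-separated identifier layout shown above; other databases format their headers differently. To keep the example short, only the first two sequence lines of the record are used.

```python
def parse_fasta_record(text):
    """Parse one FASTA record into its identifier, description and sequence."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' header line")
    # The identifier is the first word of the header; the rest is free text.
    identifier, _, description = lines[0][1:].partition(" ")
    return {
        "identifier": identifier,
        "fields": identifier.split("|"),   # pipe-separated parts of the identifier
        "description": description,
        "sequence": "".join(lines[1:]),    # sequence lines joined into one string
    }


# Truncated excerpt of the Gal4p record (header plus the first two sequence lines).
EXCERPT = """>gi|1169823|sp|P04386.2|GAL4_YEAST RecName: Full=Regulatory protein GAL4
MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERLEQLFLL
IFPREDLDMILKMDSLQDIKALLTGLFVQDNVNKDAVTDRLASVETDMPLTLRQHRISATSSSEESSNKG
"""

record = parse_fasta_record(EXCERPT)
print(record["fields"])
print(record["description"])
print(len(record["sequence"]))
```

Because the sequence lines are simply concatenated, the same function works on the complete record as well as on this excerpt.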
An interesting feature of Entrez is the history. If you click on the link Advanced below the query box, you will see a section entitled Search history below the Search builder. You can select any of these previous queries in order to come back to its results, edit it, or combine it with others to refine the selection.
We will now illustrate how the same query can be performed with SRS, the EMBL/EBI information retrieval system.
SRS is a multi-database retrieval system, and the first step is to select one or several databases on which the query will be performed. For our first exercise, it will be sufficient to select a single database: UniProt, the non-redundant database of protein sequences. As we saw during the course, this database contains two sections: Swiss-Prot (proteins with experimental characterization, annotated with many references to the literature), and TrEMBL (translation of all the coding sequences from the EMBL nucleotide sequence database).
You will be prompted for a user ID. Type the login name you would like to use in the future, and click OK. From now on, the server will store your queries, and the next time you connect to SRS with the same login name, you will be able to get back any previous result.
We will now refine the query by imposing constraints on specific fields. For this, we will use the standard query form.
To do so, click on the tab Library page. This page displays the list of databases supported by SRS at EBI. Select Uniprot-KB by clicking on the check box beside it.
Click Search. How many results do you obtain? Check the gene names associated with these proteins. Do they all correspond to your query (GAL4)? In which organisms did you find a protein?
Gal4p belongs to a family of proteins containing a domain commonly called the Zinc cluster. This domain contains 6 cysteines, which interact with 2 zinc atoms. To date, this domain has only been found in fungi.
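The six cysteines can be located in the N-terminal part of the Gal4p sequence shown earlier with a regular expression. The spacing pattern used below is a simplified assumption made for this illustration; it is not the curated signature stored in Prosite, and it should not be used to screen a proteome.

```python
import re

# Simplified cysteine spacing pattern (an assumption for illustration only,
# NOT the curated Prosite signature): six cysteines, each captured in a group.
ZINC_CLUSTER = re.compile(r"(C).{2}(C).{6}(C).{5,12}(C).{2}(C).{6,8}(C)")

# First 70 residues of Gal4p, copied from the FASTA record above.
GAL4_NTERM = ("MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRY"
              "SPKTKRSPLTRAHLTEVESRLERLEQLFLL")


def cysteine_positions(sequence):
    """Return the 1-based positions of the six cysteines, or [] if no match."""
    match = ZINC_CLUSTER.search(sequence)
    if match is None:
        return []
    return [match.start(group) + 1 for group in range(1, 7)]


print(cysteine_positions(GAL4_NTERM))
```

On this fragment the pattern picks out the cysteines at positions 11, 14, 21, 28, 31 and 38 of Gal4p.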
We will try two different approaches to select all the Zinc cluster proteins from UniProt.
We will use the substring Zn(2)-C6 which seems to characterize this domain.
You can come back to the Query form and impose, as a first restriction, the Organism name to be Saccharomyces cerevisiae. For the second restriction, select the field Comment: comment in the pop-up menu, and type Zn(2)-C6 in the text box.
If you run the query now, you will obtain a syntax error. This error comes from the presence of parentheses in the query text. Parentheses have a specific meaning in SRS queries: they are used to group logical operations (AND, OR, NOT, ...).
We thus need to indicate that the string Zn(2)-C6 must be treated as a whole. For this, we have to quote the string. Type "Zn(2)-C6" (with the quotes) in the query box and run the query. We now obtain 50 proteins (in May 2006), which corresponds to the number of Zn cluster proteins identified in the genome of Saccharomyces cerevisiae.
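When query strings are prepared in a script rather than typed by hand, the quoting step can be automated. The helper below is a hypothetical convenience function written for this illustration; it is not part of SRS, and it only covers the case discussed here (parentheses or spaces inside a search string).

```python
def quote_term(term):
    """Wrap a search string in double quotes if it contains parentheses or spaces."""
    already_quoted = len(term) >= 2 and term.startswith('"') and term.endswith('"')
    needs_quotes = any(char in term for char in "() ")
    if needs_quotes and not already_quoted:
        return f'"{term}"'
    return term


print(quote_term("Zn(2)-C6"))      # contains parentheses, so it gets quoted
print(quote_term("GAL4"))          # plain word, left unchanged
print(quote_term('"Zn(2)-C6"'))    # already quoted, left unchanged
```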
Compare the number of entries selected with the query on "description", on "comments", and on either of those fields. Do you feel confident about your retrieval? How would you refine the query to obtain a more complete list of transcription factors?
Selecting multiple output fields
There are various ways to customize the fields to be returned. The simplest way is to select them from the list on the query form.
Select all transcription factors from the yeast
Saccharomyces cerevisiae. Retrieve for each of them the
The next exercise illustrates how to link several databases.
Selecting custom fields across multiple databases
The previous query was interesting, but at each step we could only display the result of the target of the link (e.g. LENZYME), and we lost the information about the origin (e.g. UniProt).
SRS allows us to go further, by defining custom views on linked databases.
This tutorial only illustrates the basic features of Entrez and SRS. You will find more information on the following sites: