Practicals - Biomolecular databases

Introduction
Resources
A quick tour of selected databases
Retrieving information from the NCBI with Entrez
Retrieving information from various databases with SRS
Additional exercises
More info

Introduction

This is the practical session for the chapter Biomolecular databases of the course Introduction to bioinformatics. The slides of the lecture are available in various formats.

Portable document format (pdf)

During the tutorial part of this practical session, we will define a few biological problems, and see how databases can be used to obtain the answer.

It is important to obtain the answers to the problems solved in the tutorial, because the results of some database queries will be used as input for the subsequent practicals (sequence alignment, phylogeny).

A series of exercises will give you the opportunity to use the concepts seen in the tutorials to answer some concrete biological questions.

Students should of course feel free to add their own questions to this list, which can be treated afterwards, if there is some time left.

Resources

This tutorial will be based on the following Web resources.

Acronym	Type	Description+URL
EMBL	Nucleic sequences	The EMBL Nucleic Sequence Database (EBI - UK) http://www.ebi.ac.uk/embl/
Genbank	Nucleic sequences	Genbank (NCBI - USA) http://www.ncbi.nlm.nih.gov/Genbank/
DDBJ	Nucleic sequences	DDBJ - DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/
UniProt	Protein sequences	UniProt - the Universal Protein Resource http://www.uniprot.org/
PDB	3D structure of macromolecules	PDB - The Protein Data Bank http://www.rcsb.org/pdb/
EnsEMBL	Genome browser	EnsEMBL Genome Browser (Sanger Institute + EBI) http://www.ensembl.org/
UCSC	Genome browser	UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/
ECR	Genome browser	ECR Browser http://ecrbrowser.dcode.org/
Integr8	Comparative genomics	Integr8 - access to complete genomes and proteomes http://www.ebi.ac.uk/integr8/
Prosite	Protein domains	Prosite - protein domains, families and functional sites http://www.expasy.ch/prosite/
Pfam	Protein domains	PFAM - Protein families represented by multiple sequence alignments and hidden Markov models (HMMs) (Sanger Institute - UK) http://pfam.sanger.ac.uk/
CATH	Protein domains	CATH - Protein Structure Classification http://www.cathdb.info/
InterPro	Protein domains	InterPro (EBI - UK) http://www.ebi.ac.uk/interpro/
GO	Gene ontology	Gene Ontology Database http://www.geneontology.org/
Entrez	Multi-database	A collection of biomolecular databases maintained at the NCBI (USA), accessible via an interface called Entrez. http://www.ncbi.nlm.nih.gov/Entrez/
SRS	Data warehouse	A collection of biomolecular databases maintained at the European Bioinformatics Institute (EBI, UK), accessible via an interface called SRS http://srs.ebi.ac.uk/

A quick tour of selected databases

The number of biomolecular databases is growing so fast that it is impossible to give a balanced survey of all the existing resources. We selected here a few databases on the basis of various criteria (popularity, ease of access, ...) to illustrate the type of information that can be retrieved from them.

As an exercise, we propose to browse some databases in order to collect information about one particular protein. Each student can do the same analysis with a protein of his or her own interest. If you lack inspiration, you can for example run the exercise with the Drosophila protein Ubx.

Exercise

Choose a protein for which you have some prior knowledge (e.g. the protein Ubx from Drosophila melanogaster), and try to extract all the information relevant to this protein from the databases listed in the table of biomolecular databases above.

Next steps

In the exercise above, we saw that each database can provide us with a piece of information about some aspect of our protein of interest:

gene sequence (GenBank, EMBL, DDBJ),
genomic context + cross-genome conservation (EnsEMBL, UCSC, ECR),
orthologous and paralogous genes (Integr8),
protein sequence enriched with annotations about functional features (UniProt),
3D structure (PDB),
structural domains (CATH),
sequential motifs (PROSITE),
...

Note that this is just a very small sample of the information that can be obtained via the hundreds of biomolecular databases distributed around the world.

We will now consult two Web servers (NCBI Entrez and EBI SRS) that provide integrated access to multiple databases, thereby making it easier to explore multiple aspects of a protein of interest.

Retrieving information from the NCBI with Entrez

Entrez is a retrieval system for searching several linked databases stored at the NCBI (National Center for Biotechnology Information, USA).

Goal

During this tutorial, we will learn to use the interface of NCBI Entrez to retrieve a protein of interest. As will be seen, a simple formulation of the query generally returns too many hits, and the desired answer may be lost in hundreds or thousands of other records. We will see how to use advanced search options in order to refine the query.

http://www.ncbi.nlm.nih.gov/Entrez/

An example of simple query

We will try to retrieve from Entrez the information about the Gal4 protein from the budding yeast Saccharomyces cerevisiae.

Tutorial

Open the Entrez home page
You can see a list of the databases supported at Entrez. Click on the link Protein: sequence database.
In the query box, type
```
Gal4
```

Questions

How many results do you obtain? How many of them correspond to your needs? How could you try to improve the result?

Answers

The simple query Gal4 returned 20,513 proteins (Oct 2011). Needless to say, this is far more than we are looking for: the genome of the budding yeast Saccharomyces cerevisiae contains ~6,000 coding genes, and only one of them codes for the Gal4 protein.

A first reason is that we did not impose any constraint on the organism.

A second reason is that, by typing Gal4 in the query box, we asked Entrez to return all the proteins which contain this string in any field (name, description, ...). Thus, our answer includes some proteins related to Gal4, for example because they interact with this protein, or because a Gal4 fragment was used to construct hybrid proteins (e.g. for enhancer trap experiments).

Logical operators

A first improvement can be obtained by requiring some additional words in the query. For instance, we could require the words "Saccharomyces" and "cerevisiae", in addition to "Gal4".

For this, you can use the logical operators 'AND', 'OR', and 'NOT' within the query. Beware! These operators are case-sensitive: if you type them in lowercase, they will be treated as required words rather than as operators.

In the query box, type

Gal4 AND Saccharomyces AND cerevisiae

An even more precise way to select Saccharomyces cerevisiae is to quote the pair of words.

Gal4 AND "Saccharomyces cerevisiae"

This will only retain the records where these two words are written consecutively.
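The way such queries are assembled can be sketched in a few lines of Python. This is only an illustration: the helper name `entrez_and` is hypothetical and is not part of any NCBI tool. It joins terms with the uppercase AND operator and quotes multi-word terms so that they are searched as a phrase.

```python
def entrez_and(*terms):
    """Join search terms with the uppercase AND operator.

    Multi-word terms are wrapped in double quotes so that they are
    searched as a phrase rather than as independent words.
    (Hypothetical helper, written for illustration only.)
    """
    prepared = []
    for term in terms:
        term = term.strip()
        # Quote terms that contain a space, unless already quoted.
        if " " in term and not term.startswith('"'):
            term = f'"{term}"'
        prepared.append(term)
    return " AND ".join(prepared)


# Three independent words, each of which must be found somewhere.
loose_query = entrez_and("Gal4", "Saccharomyces", "cerevisiae")
# The species name is required as two consecutive words.
phrase_query = entrez_and("Gal4", "Saccharomyces cerevisiae")

print(loose_query)
print(phrase_query)
```

The two printed strings correspond to the two queries typed in the query box above.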

Questions

What about the result? Did we obtain an improvement? How do you explain the incorrect results?

Answers

By combining Gal4 and 'Saccharomyces cerevisiae' in the query, we already obtained some improvement, and the number of results has been reduced. However, we still obtain a series of proteins which do not directly correspond to Gal4, but are returned because the three words of our query were found in some field (name, description, organism, ...).

Imposing constraints on a specific field

You can refine the selection by specifying the field in which your query text has to be found.

Click on the link Advanced below the query box.
In the Search builder, select the field Gene name and enter GAL4. By pressing the Enter key, you obtain a list of matches for the gene name GAL4.
Click the button Add to Search box. This will add a structured text in the query box
```
GAL4[Gene Name]
```
You can now click the Search link below the query box.

Questions

How many results do we obtain now? Do they all fit our needs? How could we refine the query?

Answers

We obtained an improvement over our first query (Gal4 alone) by requiring that the value GAL4 be found in the field Gene name. However, we still did not achieve the desired precision (in Oct 2011, the query returned 45 records). There are two reasons:

We did not impose any constraint on the organism.
We matched any gene whose name contains "gal", for example "galectin-4 [Homo sapiens]".

We would thus like to formulate a query with constraints on multiple fields: GAL4 as gene name and Saccharomyces cerevisiae as organism.

Specifying constraints on multiple fields

We will further use the Advanced query form to impose constraints simultaneously on gene name and on organism.

In the query box, type the structured query obtained in the previous section.
Do not click on the Search button yet ! We still need to add some constraints.

In the Search builder, select Organism and type Saccharomyces cerevisiae. Click Add to Search Box. This should display a query similar to the following.

GAL4[Gene Name] AND Saccharomyces cerevisiae[Organism]
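The same multi-field query can also be assembled outside the Web form. The sketch below builds the query string and the corresponding URL for the NCBI E-utilities `esearch` service, without sending any request. The helper names `fielded_query` and `esearch_url` are hypothetical, and the exact text produced by the Search builder may differ slightly (for instance by adding parentheses).

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_query(constraints):
    """Combine (value, field) pairs into a single Entrez query string.

    Hypothetical helper: each constraint becomes value[field], and
    the constraints are joined with the uppercase AND operator.
    """
    parts = [f"{value}[{field}]" for value, field in constraints]
    return " AND ".join(parts)


def esearch_url(query, database="protein", retmax=20):
    """Build (but do not fetch) an esearch URL for the given query."""
    params = {"db": database, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = fielded_query([
    ("GAL4", "Gene Name"),
    ("Saccharomyces cerevisiae", "Organism"),
])
url = esearch_url(query)

print(query)
print(url)
```

Opening the printed URL in a browser should return an XML document with the number of matching records and their identifiers. The count obtained will depend on the current content of the database.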

Questions

How many results do you obtain now? What is the difference between these entries?

Browsing a protein entry

Now that we have selected a reasonably low number of proteins, we can identify the one we were searching for: Gal4p from Saccharomyces cerevisiae.

The result list should include a record with the accession number P04386.2. Click on the link to display the entire record.

Browse the resulting page to get an idea about the annotation content.

Saving the protein sequence in FASTA format

At the top of the window, the option Display allows you to choose among different formats. Select the format FASTA. This will display the amino-acid sequence of the Gal4p protein.

	    >gi|1169823|sp|P04386.2|GAL4_YEAST RecName: Full=Regulatory protein GAL4
	    MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERLEQLFLL
	    IFPREDLDMILKMDSLQDIKALLTGLFVQDNVNKDAVTDRLASVETDMPLTLRQHRISATSSSEESSNKG
	    QRQLTVSIDSAAHHDNSTIPLDFMPRDALHGFDWSEEDDMSDGLPFLKTDPNNNGFFGDGSLLCILRSIG
	    FKPENYTNSNVNRLPTMITDRYTLASRSTTSRLLQSYLNNFHPYCPIVHSPTLMMLYNNQIEIASKDQWQ
	    ILFNCILAIGAWCIEGESTDIDVFYYQNAKSHLTSKVFESGSIILVTALHLLSRYTQWRQKTNTSYNFHS
	    FSIRMAISLGLNRDLPSSFSDSSILEQRRRIWWSVYSWEIQLSLLYGRSIQLSQNTISFPSSVDDVQRTT
	    TGPTIYHGIIETARLLQVFTKIYELDKTVTAEKSPICAKKCLMICNEIEEVSRQAPKFLQMDISTTALTN
	    LLKEHPWLSFTRFELKWKQLSLIIYVLRDFFTNFTQKKSQLEQDQNDHQSYEVKRCSIMLSDAAQRTVMS
	    VSSYMDNHNVTPYFAWNCSYYLFNAVLVPIKTLLSNSKSNAENNETAQLLQQINTVLMLLKKLATFKIQT
	    CEKYIQVLEEVCAPFLLSQCAIPLPHISYNNSNGSAIKNIVGSATIAQYPTLPEENVNNISVKYVSPGSV
	    GPSPVPLKSGASFSDLVKLLSNRPPSRNSPVTIPRSTPSHRSVTPFLGQQQQLQSLVPLTPSALFGGANF
	    NQSGNIADSSLSFTFTNSSNGPNLITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPLSPGW
	    TDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDTPPNPKKE

You can store this result in some file on your computer, in order to use it for further analyses.
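Once the sequence is stored in a file, it can be read back by a short script. The following Python sketch parses FASTA-formatted text into (header, sequence) pairs. The function name `read_fasta` is hypothetical, and the example input is deliberately truncated to the first 20 residues of the record shown above.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated example: real header, first 20 residues only.
example = """>gi|1169823|sp|P04386.2|GAL4_YEAST RecName: Full=Regulatory protein GAL4
MKLLSSIEQA
CDICRLKKLK
"""

records = read_fasta(example)
header, sequence = records[0]
# In this header style, the accession is the fourth "|"-separated field.
accession = header.split("|")[3]

print(accession, len(sequence))
```

With a file saved from Entrez, the same function could be applied to the file content, for example `read_fasta(open("gal4.fasta").read())`, where the file name is an assumption.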

Getting the query history

An interesting feature of Entrez is the history. After clicking on the link Advanced below the query box, you will see a section entitled Search history below the Search builder. You can select any of the previous queries in order to come back to its results, edit it, or combine it with other queries to refine the selection.

Retrieving information from various databases with SRS

We will now illustrate how the same query can be performed with SRS, the EMBL/EBI information retrieval system.

http://srs.ebi.ac.uk

http://downloads.lionbio.co.uk/publicsrs.html

Selecting a database

The SRS interface is more powerful but also more complex than what we saw with Entrez, because SRS allows more complex queries. To use it, you must first become familiar with its basic concepts.

SRS is a multi-database retrieval system, and the first step is to select one or several databases on which the query will be performed. For our first exercise, it will be sufficient to select a single database: UniProt, the non-redundant database of protein sequences. As we saw during the course, this database contains two sections: Swiss-Prot (proteins with experimental characterization, annotated with many references to the literature), and TrEMBL (translation of all the coding sequences from the EMBL nucleotide sequence database).

Go to SRS home page, and click Start a permanent project.
You will be prompted for a user ID, type the login name you would like to use in the future, and click OK. From now on, the server will store your queries, and the next time you connect to SRS with the same login name, you will be able to get back any previous result.
On the top of the page, there is a series of "tabs", which allow you to select different tasks: Quick query, Library page, Query form, ....
Try a Quick Search: select the database Proteins in the pop-up menu, and type the protein name Gal4p as query. Do you obtain the right result?
Try the same query with the gene name (GAL4) instead of the protein name (Gal4p). How many results do you obtain?

Imposing constraints on field contents

We will now refine the query by imposing constraints on specific fields. For this, we will use the standard query form.

Before using a query form, you need to specify which databases you want to include in your search.
For this, click on the tab Library page. This page displays the list of databases supported by SRS at EBI. Select UniProt-KB by clicking on the check box beside it.
You can now open the query form by selecting the tab Query at the top of the page.
You should see a form with 4 text boxes, which will allow you to select proteins on the basis of 4 criteria. Each text box is preceded by a pop-up menu, permitting to select a field on which the constraint will be imposed.
For the first criterion, select Gene Name in the pop-up menu, and type GAL4 in the text box.
Click Search. How many results do you obtain? Check the gene names associated with these proteins. Do they all correspond to your query (GAL4)? In which organisms did you find a protein?
We will now add a second selection criterion, the organism. Come back to the query form. For the first criterion, select GAL4 as Gene name as previously. For the second query field, select Organism and enter Saccharomyces cerevisiae. Click Search and compare the results with the previous search.

Selecting multiple sequences

Saccharomyces cerevisiae, together with their description.

Gal4p belongs to a family of proteins containing a domain commonly called the Zinc cluster. This domain contains 6 cysteines, which interact with 2 zinc atoms. To date, this domain has only been found in fungi.

Exercise

Retrieve the peptide sequences of all proteins from the yeast Saccharomyces cerevisiae which contain a binuclear Zn cluster domain, and save them in a file. Beware: this sequence file will be used for subsequent tutorials.
Perform the same query as above with the yeast Schizosaccharomyces pombe.

We will try two different approaches to select all the Zinc cluster proteins from UniProt.

A simple but not very accurate method: search for the keyword characteristic of the Zinc cluster in the UniProt entries themselves.
A more complex but more accurate method: identify the Zinc cluster family in the PFAM database (a database specialized in the annotation of protein families), and find the links between this family and UniProt proteins.

Search by keywords

Read the Swiss-Prot entry for GAL4 carefully, and try to find a way to select all the yeast proteins having the Zinc cluster domain. For this, you need to identify the part of the entry where this domain is mentioned, and think about the best way to select all the proteins having the same keywords in the same field of their UniProt record.
Come back to the Query form, and select all proteins from Saccharomyces cerevisiae which contain a Zinc cluster domain. Beware, this exercise is difficult. Try to find the solution by yourself, and, if you don't succeed, read the following.
In the GAL4_YEAST record, the zinc finger domain is indicated in the comment field, with the following sentence.
We will use the substring Zn(2)-C6 which seems to characterize this domain.
You can come back to the Query form and impose, as a first restriction, the Organism name to be Saccharomyces cerevisiae. For the second restriction, select the field Comment: comment in the pop-up menu, and type Zn(2)-C6 in the text box.
If you run the query now, you will obtain a syntax error. This error comes from the presence of parentheses in the query text. Parentheses have a specific meaning in SRS queries: they are used to group logical operations (AND, OR, NOT, ...).
We thus need to indicate that the string Zn(2)-C6 is to be treated as a whole. For this, we have to quote the string. Type "Zn(2)-C6" (with the quotes) in the query box and run the query. We now obtain 50 proteins (in May 2006), which corresponds to the number of Zn cluster proteins identified in the genome of Saccharomyces cerevisiae.
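The quoting rule described above can be illustrated with a small Python helper. This is a minimal sketch, not part of SRS: the function name `quote_term` is hypothetical, and the set of characters that trigger quoting (parentheses and spaces) is an assumption chosen for this example.

```python
# Characters assumed, for this illustration, to require quoting.
SPECIAL_CHARACTERS = set("() ")


def quote_term(term):
    """Wrap a query term in double quotes if it contains special characters.

    Terms that are already quoted, or that contain no special
    character, are returned unchanged. (Hypothetical helper.)
    """
    already_quoted = (
        len(term) >= 2 and term.startswith('"') and term.endswith('"')
    )
    if already_quoted:
        return term
    if any(character in SPECIAL_CHARACTERS for character in term):
        return f'"{term}"'
    return term


print(quote_term("Zn(2)-C6"))
print(quote_term("GAL4"))
```

The first term is quoted because of its parentheses, whereas the second is left as it is.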

Search with a link between PFAM and Uniprot

TO BE WRITTEN

Saving the sequences in a text file

In the left panel, you can see a button View. Below this button, there is a pop-up menu proposing different viewing options for the database you selected. Select FastaSeqs and click View.
Click on the Save button in the left panel. This leads you to another form, displaying the saving options. Select the option Text file, and check that the output format is set to FastaSeq. Now click Save. You will be prompted to indicate the folder and file name. Let us save the result in a folder Desktop/bioinfo/Zinc_cluster, in a file Saccharomyces_Zinc_cluster.fasta.
Perform the same query for Schizosaccharomyces pombe and save the result in a separate file. We will use these files for the following tutorials.
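For readers who later need to produce such a file from a script, the sketch below writes a list of (header, sequence) pairs to a multi-record FASTA file, wrapping the sequences at 70 residues per line. The function name `write_fasta` and the toy records are hypothetical, and the example writes into a temporary directory rather than into the Desktop folder.

```python
import tempfile
from pathlib import Path


def write_fasta(records, path, line_width=70):
    """Write (header, sequence) pairs to a FASTA file and return its path."""
    path = Path(path)
    # Create the destination folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        for header, sequence in records:
            handle.write(f">{header}\n")
            # Wrap the sequence over lines of at most line_width residues.
            for start in range(0, len(sequence), line_width):
                handle.write(sequence[start:start + line_width] + "\n")
    return path


# Toy records, invented for the example (not real database entries).
toy_records = [
    ("protein_A toy record", "MKLLSSIEQA" * 8),  # 80 residues
    ("protein_B toy record", "ACDEFGHIKL"),      # 10 residues
]

output_dir = Path(tempfile.mkdtemp()) / "bioinfo" / "Zinc_cluster"
saved = write_fasta(toy_records, output_dir / "Saccharomyces_Zinc_cluster.fasta")
lines = saved.read_text().splitlines()

print(saved.name, len(lines))
```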

Selecting multiple output fields

The fields required for this exercise are:

Identifier
Accession number
Description
Organism

Protocol

Open the Query form
Fill the form to retrieve all proteins from Saccharomyces cerevisiae whose comment contains the substring transcription factor. Do not submit yet.
At the bottom of the form, you can see Create your own view, with a list of fields. Select the required fields (see above). Note: for selecting multiple fields, you need to maintain the Ctrl key pressed.
You can now submit the query.
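The four fields listed above can also be recovered from sequences saved in FASTA format, since UniProt-style header lines carry the accession, the identifier, the description and the organism (after the tag OS=). The Python sketch below is only an illustration: the function name `parse_uniprot_header` is hypothetical, and the header used as example is written by hand in the UniProt style rather than copied from a query result.

```python
import re

# Pattern for UniProt-style FASTA headers, e.g.
# >sp|ACCESSION|IDENTIFIER Description OS=Organism OX=... GN=...
HEADER_PATTERN = re.compile(
    r"^>?(?P<section>sp|tr)\|(?P<accession>[^|]+)\|(?P<identifier>\S+)\s+"
    r"(?P<description>.*?)\s+OS=(?P<organism>.*?)(?=\s+[A-Z]{2}=|$)"
)


def parse_uniprot_header(header):
    """Return a dict of fields from a UniProt-style FASTA header, or None."""
    match = HEADER_PATTERN.match(header.strip())
    if match is None:
        return None
    return match.groupdict()


# Illustrative header (assumed format, written for this example).
example_header = (
    ">sp|P04386|GAL4_YEAST Regulatory protein GAL4 "
    "OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) "
    "OX=559292 GN=GAL4 PE=1 SV=2"
)

fields = parse_uniprot_header(example_header)
for name in ("identifier", "accession", "description", "organism"):
    print(f"{name}: {fields[name]}")
```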

Linking databases

The next exercise illustrates how to link several databases.

Exercise

Find all the yeast genes coding for a metabolic enzyme (tip: start from the LENZYME database).
Retrieve all Saccharomyces cerevisiae enzymes, together with their substrates and products.

Protocol

Select in UniProt all proteins from Saccharomyces cerevisiae.
When you have the result, click on the Link button in the left panel. The list of libraries is now displayed.
In Metabolic databases, select LENZYME (this is the enzyme section of LIGAND). To avoid massive data transfer, select the view *Names only* rather than the default view. Submit the link.
Link the new result to the database LCOMPOUND (select the view *Names only*).

Selecting custom fields across multiple databases

The previous query was interesting, but at each step we were able to display the result of the target of the link (e.g. LENZYME), and we lost the information about the origin (e.g. UniProt).

SRS allows us to go further, by defining custom views on linked databases.

In the tabs at the top of the form, click the View tab.
In the left panel, fill the View name box with "UniProt_substrates_products".
You can see two lists of databases. The left list is the origin database, the right one is the target database. Select UniProt-KB as origin and LENZYME as target. Click Create new view.
Select the output fields of your choice, within UniProt-KB and LENZYME (do not forget to include LENZYME substrates and products).
Click Save view (top of the form).
Open the Results form. Check the previous query where you linked UniProt to LENZYME (when writing this tutorial, it selected 457 proteins).
Come back to the Query form. Select all Saccharomyces cerevisiae enzymes having the string "EC 2.4" in their description. Select your new view before submitting the query.

Additional exercises

In UniProt, select all the proteins belonging to the species Saccharomyces cerevisiae.
Using the PATHWAY database (a mirror of KEGG), get all the yeast genes involved in galactose metabolism.
Find all the proteins of Escherichia coli for which there is a structure in PDB.
In UniProt, find all the enzymes with an aspartokinase catalytic domain.
Calculate, year by year, the number of entries submitted to UniProt during the last 10 years.
Calculate the frequency distribution of polypeptide lengths in Swiss-Prot, with a class interval of 100.
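For the last exercise, once the polypeptide lengths have been extracted from the database, the frequency distribution itself is simple to compute. The Python sketch below groups lengths into classes of width 100. The function name `length_distribution` is hypothetical and the lengths are toy values, not results from Swiss-Prot.

```python
from collections import Counter


def length_distribution(lengths, class_interval=100):
    """Count sequence lengths per class of the given width.

    Each class is identified by its lower bound: with an interval of
    100, lengths 0-99 fall in class 0, lengths 100-199 in class 100, etc.
    """
    counts = Counter(
        (length // class_interval) * class_interval for length in lengths
    )
    return dict(sorted(counts.items()))


# Toy values, invented for the example.
toy_lengths = [41, 99, 100, 250, 299, 881]
distribution = length_distribution(toy_lengths)

for lower_bound, count in distribution.items():
    print(f"{lower_bound}-{lower_bound + 99}: {count}")
```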

More info

This tutorial only illustrates the basic features of Entrez and SRS. You will find more information on the following sites:

SRS user guide, which is available on SRS web site.
SRS exercises by Thure Etzold http://downloads.lionbio.co.uk/pharm/ex.html

Jacques van Helden (van-helden.j@univmed.fr)
