EST clustering

Slide presentation

The slide presentation about EST clustering in PDF format (2 slides per page).


Today we will use a file containing EST sequences, 'dataset.fasta'. The sequences in this dataset have already been masked for contaminants and repeats using the procedure described yesterday.

Copy the dataset into your directory:

Before starting the exercise, index the sequences in the dataset so that you can retrieve them later:
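
The indexing command itself depends on the tools installed for the course. As a rough illustration of what indexing means, the Python sketch below records the byte offset of each FASTA record, so that a single sequence can be fetched later without rereading the whole file. The function names `build_index` and `fetch` are hypothetical and are not part of any course tool.

```python
def build_index(path):
    """Map each sequence ID to the byte offset of its '>' header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # The ID is the first word after '>'
                    index[fields[0].decode()] = offset
    return index


def fetch(path, index, seq_id):
    """Return the sequence stored under seq_id as a single string."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(index[seq_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip().decode())
    return "".join(parts)
```

With the index built once, `fetch('dataset.fasta', index, some_id)` jumps straight to the requested record, which is what makes retrieving the members of each cluster cheap later on.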


In this exercise we will cluster the sequences of the dataset. First we compare every sequence in the set against all the others (using BLAST), then we retrieve the sequences of each cluster.

The first step is to build a database for the BLAST search:

Now we can run our BLAST search:
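
An all-against-all search reports every sequence as a perfect match to itself, and it also reports many short or weak matches. The sketch below shows one way to keep only the useful hits. It assumes the search was written in BLAST's tabular format, where the first four columns are query ID, subject ID, percent identity and alignment length. The function name `read_hits` and both thresholds are illustrative assumptions, not values used by the course.

```python
def read_hits(lines, min_identity=95.0, min_length=50):
    """Yield (query, subject) pairs from tabular BLAST lines that pass the filters.

    min_identity and min_length are hypothetical cut-offs chosen for illustration.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        length = int(fields[3])
        if query == subject:
            continue  # every sequence trivially matches itself
        if identity >= min_identity and length >= min_length:
            yield query, subject
```

Because it accepts any iterable of lines, the function can be fed an open file handle directly.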

Clustering (grouping overlapping sequences) is based on the BLAST results and is performed by a homemade script.
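
The course script is not reproduced here. As a rough sketch of the general idea, the Python below groups sequence IDs by single linkage: two ESTs end up in the same cluster whenever a chain of pairwise hits connects them. It uses a union-find structure, and the name `cluster_pairs` is hypothetical. The actual script may use different criteria.

```python
def cluster_pairs(pairs):
    """Group IDs into clusters; IDs linked by a chain of hits share a cluster."""
    parent = {}

    def find(item):
        # Walk up to the representative of the set, shortening the path on the way
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two sets

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())
```

Sequences that never appear in a hit pair are not returned by this sketch; they would be singletons and have to be added separately if needed.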


This step creates contigs (consensus sequences) from the clusters. We will use two programs, Phrap and CAP3, for the assembly and compare their results.
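
One simple way to compare two assemblies is to summarise the contig files each program writes. The sketch below assumes the contigs are available in FASTA format and reports the number of contigs, the total length, the longest contig and the N50. The function name `assembly_stats` is hypothetical, and these statistics are only one possible basis for comparison.

```python
def assembly_stats(fasta_lines):
    """Summarise contig lengths from an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)

    lengths.sort(reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the running sum reaches half the total
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    return {
        "contigs": len(lengths),
        "total": total,
        "longest": lengths[0] if lengths else 0,
        "n50": n50,
    }
```

Calling the function once on the Phrap contigs and once on the CAP3 contigs gives two dictionaries that can be compared side by side.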

Viewing the results

You can see the results of the assembly using Consed (if available).

What is my sequence?

Use the contigs produced in the previous step to find a homologous gene.

BLAST the contigs against the human genome on the NCBI web page:

Cool!! Lunch!!