EST clustering

Slide presentation

The slide presentation on EST clustering is available in PDF format (2 slides per page).

Exercise

Today we will use a file of EST sequences, 'dataset.fasta'. The sequences in this dataset have already been masked for contaminants and repeats using the procedure described yesterday.

Copy the dataset into your directory:

cp /home/lcerutti/dataset.fasta .

Before starting the exercise, you have to index the sequences in the dataset so that you can retrieve them later:

Index the dataset:
gsi_index.pl dataset.fasta > dataset.gsi
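
Before indexing, it can help to check how many sequences the file holds and whether any identifier occurs twice, since retrieval by name only makes sense when names are unique. The following is a minimal sketch using standard Unix tools; the function names count_seqs and duplicate_ids are made up for this illustration and are not part of the course scripts.

```shell
# Hypothetical helpers (not part of gsi_index.pl): basic checks on a FASTA file.

# Print the number of sequences, i.e. the number of header lines starting with '>'.
count_seqs() {
    grep -c '^>' "$1"
}

# Print every identifier (first word of a header line) that occurs more than once.
duplicate_ids() {
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1" | sort | uniq -d
}
```

For example, 'count_seqs dataset.fasta' prints the number of ESTs, and 'duplicate_ids dataset.fasta' should print nothing if all names are unique.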

Clustering

In this exercise we will cluster the sequences of the dataset. First we have to compare the sequences of the set against each other (we will use BLAST), then retrieve the sequences of each cluster.

The first step is to build a database for the blast search:

To build a database for BLAST we use the formatdb program ('-p F' indicates that the input is nucleotide, not protein):
formatdb -p F -i dataset.fasta

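Now we can run the BLAST search, comparing every sequence against the whole dataset ('-e' sets the E-value threshold, '-b' the maximum number of database sequences for which alignments are reported):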

blastall -p blastn -i dataset.fasta -d dataset.fasta -e 0.001 -b 1000 > dataset.blast
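
A quick way to check that the search ran to completion is to count the queries in the report and compare that number with the number of sequences in the dataset. This sketch assumes the default pairwise report format of blastall, in which each query section starts with a line beginning with 'Query='; the function name count_queries is hypothetical.

```shell
# Hypothetical helper: count the query sections in a default-format BLAST report.
# Assumes each query is introduced by a line starting with 'Query='.
count_queries() {
    grep -c '^Query=' "$1"
}
```

Running 'count_queries dataset.blast' should then give the same number as 'grep -c "^>" dataset.fasta'.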

The clustering (grouping overlapping sequences together) is based on the BLAST results and is performed by the in-house script cluster.pl.

Create a new directory 'clusters':
mkdir clusters
Enter the directory:
cd clusters
Run the clustering script:
cluster.pl -o 80 -i 96 ../dataset.blast
Questions:

How many clusters are produced?
What are the differences between the clusters? What does it mean?
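
To answer these questions, a listing of the cluster sizes is useful. The sketch below rests on an assumption that is not confirmed by the script's documentation: that cluster.pl writes one file per cluster, named cluster1, cluster2, and so on, with one sequence identifier per line. The function name cluster_sizes is hypothetical.

```shell
# Hypothetical helper: list cluster files with their number of members, largest first.
# ASSUMPTION: each file named cluster<N> holds one sequence identifier per line.
cluster_sizes() {
    dir="${1:-.}"                                   # directory to scan (default: current)
    for f in "$dir"/cluster*; do
        [ -f "$f" ] || continue                     # skip if nothing matches
        case "$f" in *.fasta) continue ;; esac      # ignore retrieved FASTA files
        # grep -c . counts the non-empty lines, i.e. the identifiers
        printf '%s\t%s\n' "$(grep -c . "$f")" "$(basename "$f")"
    done | sort -k1,1nr
}
```

Inside the 'clusters' directory, 'cluster_sizes | wc -l' gives the number of clusters and 'cluster_sizes | head' shows the largest ones.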

Retrieve the sequences of the two largest clusters using gsi_fetch.pl:

gsi_fetch.pl -i ../dataset.gsi cluster1 > cluster1.fasta
gsi_fetch.pl -i ../dataset.gsi cluster2 > cluster2.fasta
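
It is worth confirming that every identifier in a cluster file was actually retrieved. The following sketch compares the number of identifiers with the number of FASTA records; it assumes the cluster file contains one identifier per line, and the function name check_fetch is hypothetical.

```shell
# Hypothetical helper: compare the number of identifiers in a cluster file
# with the number of sequences in the FASTA file retrieved from it.
# ASSUMPTION: the cluster file lists one identifier per line.
check_fetch() {
    ids=$(grep -c . "$1" || true)       # non-empty lines in the identifier list
    seqs=$(grep -c '^>' "$2" || true)   # header lines in the FASTA file
    echo "$1: $ids identifiers, $2: $seqs sequences"
    [ "$ids" -eq "$seqs" ]              # exit status 0 only if the counts agree
}
```

For example, 'check_fetch cluster1 cluster1.fasta' prints both counts and returns a non-zero exit status if they differ.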

Assembly

This step will create contigs (consensus sequences) from the clusters. We will use two programs, Phrap and CAP3, for the assembly and compare the results.

Phrap (see Phrap and cross_match documentation):

Move up one directory: cd ..
Create a new directory 'phrap_est_out':
mkdir phrap_est_out
Copy the file 'cluster1.fasta' into the 'phrap_est_out' directory:
cp clusters/cluster1.fasta phrap_est_out/
Enter the 'phrap_est_out' directory:
cd phrap_est_out
Execute phrap:
phrap cluster1.fasta -minmatch 20 -new_ace > phrap.out
Spend some time looking at the files produced and try to understand them.

CAP3 (see CAP3 documentation):

Move up one directory: cd ..
Create a new directory 'cap3_est_out':
mkdir cap3_est_out
Copy the file 'cluster1.fasta' into the 'cap3_est_out' directory:
cp clusters/cluster1.fasta cap3_est_out/
Enter the 'cap3_est_out' directory:
cd cap3_est_out
Run CAP3:
cap3 cluster1.fasta > cap3.est_log
Have a look at the files produced by the program.

Questions:

How many contigs are produced by Phrap? How many by CAP3?
Compare the resulting contigs using a pairwise alignment program (e.g. cross_match with the '-alignments' option). Are there any differences?
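
Before aligning the contigs, a simple listing of contig names and lengths already shows how the two assemblies differ. The sketch below prints one line per FASTA record; the function name fasta_lengths is hypothetical. It relies on the usual output names of the two programs: Phrap writes its contigs to 'cluster1.fasta.contigs' and CAP3 to 'cluster1.fasta.cap.contigs'.

```shell
# Hypothetical helper: print "name<TAB>length" for every record of a FASTA file.
fasta_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len    # flush the previous record
               name = substr($1, 2); len = 0; next }  # new record: name without ">"
        { gsub(/[ \t\r]/, ""); len += length($0) }    # add up the residues, ignoring whitespace
        END  { if (name != "") print name "\t" len }  # flush the last record
    ' "$1"
}
```

From the top-level directory, 'fasta_lengths phrap_est_out/cluster1.fasta.contigs' and 'fasta_lengths cap3_est_out/cluster1.fasta.cap.contigs' list the contigs of each assembly; piping either through 'wc -l' gives the number of contigs.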

Viewing the results

You can see the results of the assembly using Consed (if available).

What is my sequence?

Use the contigs produced by the previous step to find a homologous gene.

BLAST the contigs against the human genome on the NCBI web site:

Go to the NCBI web page.
Go to the human genomic blast page.
Search using your contigs as queries.
Question: What is the sequence corresponding to your contig?
From the BLAST result, try to retrieve the genomic sequence and compare it to one of your contigs using sim4 (you can find the genomic sequences for the two clusters in /home/lcerutti/sequenceA and /home/lcerutti/sequenceB).

Save one of your contigs to a file in fasta format.
Execute sim4:
sim4 CONTIG_SEQ GENOMIC_SEQ A=1
See also the sim4 documentation.
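
For the step of saving one contig to a file in FASTA format, one possible approach is to extract a single record by name from the assembler's contig file. This is only a sketch; the function name extract_record is hypothetical, and the contig name must be taken from the header lines of your own output.

```shell
# Hypothetical helper: print the FASTA record whose name (first word after ">")
# equals the given identifier.
extract_record() {
    awk -v id="$1" '
        /^>/ { keep = (substr($1, 2) == id) }  # switch printing on or off at each header
        keep                                   # print lines while inside the wanted record
    ' "$2"
}
```

For example, assuming the CAP3 output contains a contig named 'Contig1', 'extract_record Contig1 cap3_est_out/cluster1.fasta.cap.contigs > contig1.fasta' writes it to 'contig1.fasta', which can then be used as CONTIG_SEQ in the sim4 command.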

Comment on your results.