The slide presentation about EST clustering is available in PDF format (2 slides per page).
Today we will use a file containing EST sequences, 'dataset.fasta'. The sequences in this dataset have already been masked for contaminants and repeats using the procedure described yesterday.
Copy the dataset into your directory:
Before starting the exercise, you have to index the sequences in the dataset so that you can retrieve them later:
gsi_index.pl dataset.fasta > dataset.gsi
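To make the idea of indexing concrete, here is a minimal Python sketch of how a FASTA index can work: it records the byte offset of each header line, so that a single record can be fetched later without rereading the whole file. This only illustrates the concept and does not reproduce the file format written by gsi_index.pl; the function names build_offset_index and fetch_record and the demo sequences are hypothetical.

```python
import io


def build_offset_index(handle):
    """Map each sequence ID to the byte offset of its '>' header line.

    `handle` must be opened in binary mode so that offsets are exact.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            parts = line[1:].split()
            if parts:  # the ID is the first word after '>'
                index[parts[0].decode()] = offset
    return index


def fetch_record(handle, index, seq_id):
    """Jump to the stored offset and return one complete FASTA record."""
    handle.seek(index[seq_id])
    lines = [handle.readline()]  # header line
    for line in handle:
        if line.startswith(b">"):  # next record starts here
            break
        lines.append(line)
    return b"".join(lines).decode()


# Hypothetical in-memory data standing in for dataset.fasta.
demo = io.BytesIO(b">est1 clone A\nACGT\nTTGA\n>est2\nGGCC\n")
demo_index = build_offset_index(demo)
demo_record = fetch_record(demo, demo_index, "est2")
```

With a real file, the same functions would be called on open("dataset.fasta", "rb") instead of the in-memory buffer.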
In this exercise we will cluster the sequences of the dataset. First we have to compare the sequences of the set with themselves (we will use BLAST), then retrieve the sequences of each cluster.
The first step is to build a database for the BLAST search:
formatdb -p F -i dataset.fasta
Now we can run our BLAST search:
The clustering (putting together overlapping sequences) is based on the BLAST results and is performed by the homemade script cluster.pl.
mkdir clusters
cd clusters
cluster.pl -o 80 -i 96 ../dataset.blast
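The internals of cluster.pl are not shown here, but the general idea of putting together overlapping sequences can be sketched as single-linkage clustering: any two sequences joined by a good enough hit end up in the same cluster, directly or through intermediates. The Python sketch below assumes that -o is a minimum overlap length and -i a minimum percent identity, and that the hits have already been parsed into (query, subject, identity, overlap) tuples. Both are assumptions for illustration, and the function name cluster_hits and the demo hits are hypothetical.

```python
def cluster_hits(hits, min_overlap=80, min_identity=96.0):
    """Single-linkage clustering of sequence IDs with a union-find structure.

    hits: iterable of (query, subject, percent_identity, overlap_length).
    Returns a sorted list of clusters, each a sorted list of IDs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for query, subject, identity, overlap in hits:
        # Register both sequences so unlinked ones become singleton clusters.
        root_q, root_s = find(query), find(subject)
        good_hit = overlap >= min_overlap and identity >= min_identity
        if good_hit and root_q != root_s:
            parent[root_s] = root_q  # merge the two clusters

    clusters = {}
    for seq in parent:
        clusters.setdefault(find(seq), []).append(seq)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical hits: est3/est4 overlap too little, est4/est5 are too divergent.
demo_hits = [
    ("est1", "est2", 98.0, 120),
    ("est2", "est3", 97.5, 85),
    ("est3", "est4", 99.0, 40),
    ("est4", "est5", 90.0, 200),
]
demo_clusters = cluster_hits(demo_hits)
```

Note that est1 and est3 land in the same cluster even though they share no direct hit, because both are linked to est2.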
This step will create contigs (consensus sequences) from the clusters. We will use two programs, Phrap and CAP3, for the assembly and compare the results.
cd ..
mkdir phrap_est_out
cp clusters/cluster1.fasta phrap_est_out/
cd phrap_est_out
phrap cluster1.fasta -minmatch 20 -new_ace > phrap.out
cd ..
mkdir cap3_est_out
cp clusters/cluster1.fasta cap3_est_out
cd cap3_est_out
cap3 cluster1.fasta > cap3.est_log
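One simple way to compare the two assemblies is to count the contigs each program produced and look at their lengths. The Python sketch below computes such a summary from FASTA text. By default Phrap writes its contigs to cluster1.fasta.contigs and CAP3 to cluster1.fasta.cap.contigs; the function names contig_lengths and summarize and the demo sequences are hypothetical.

```python
def contig_lengths(fasta_text):
    """Return {contig_name: length} for every record in a FASTA string."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


def summarize(fasta_text):
    """Number of contigs, total bases, and longest contig in a FASTA string."""
    lengths = contig_lengths(fasta_text)
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths.values()),
        "longest": max(lengths.values(), default=0),
    }


# Hypothetical contigs standing in for a real assembly output.
demo_summary = summarize(">Contig1\nACGTAC\nGT\n>Contig2\nACG\n")
```

To use it on the real results, read each contigs file into a string (for example with pathlib.Path("phrap_est_out/cluster1.fasta.contigs").read_text()) and compare the two summaries side by side.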
You can see the results of the assembly using Consed (if available).
Use the contigs produced by the previous step to find a homologous gene.
BLAST the contigs against the human genome at the NCBI web page:
sim4 CONTIG_SEQ GENOMIC_SEQ A=1