Data cleaning

Slide presentation

The slide presentation on sequencing, data cleaning, and assembly is available in PDF format (2 slides per page).


Recover EST chromatograms

The first exercise shows you how to recover chromatograms from NCBI by homology, using the NCBI Traces BLAST server.

Alternatively, you can use the Ensembl trace server and search for your sequence with the SSAHA algorithm.

Translate chromatograms

This exercise will show you how to translate a chromatogram into a sequence.

The program Phred translates the chromatograms into sequences (FASTA format). It also produces the quality file associated with your sequences (see course).
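To make the relationship between the two output files concrete, here is a minimal Python sketch that pairs each sequence with its quality scores. It assumes the quality file uses the common FASTA-like layout (a '>' header line followed by space-separated integer scores, one per base); the function names and the toy records are hypothetical and not part of the course material.

```python
from io import StringIO


def read_records(handle):
    """Yield (name, body_lines) for each '>' record in a FASTA-like file."""
    name, lines = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, lines
            # The read name is the first word after '>'.
            name, lines = line[1:].split()[0], []
        else:
            lines.append(line)
    if name is not None:
        yield name, lines


def pair_sequences_with_quality(fasta_handle, qual_handle):
    """Return {read_name: (sequence, [quality scores])}."""
    seqs = {name: "".join(body) for name, body in read_records(fasta_handle)}
    quals = {
        name: [int(q) for q in " ".join(body).split()]
        for name, body in read_records(qual_handle)
    }
    paired = {}
    for name, seq in seqs.items():
        scores = quals[name]
        # Each base should have exactly one quality value.
        if len(scores) != len(seq):
            raise ValueError(f"{name}: {len(seq)} bases but {len(scores)} scores")
        paired[name] = (seq, scores)
    return paired


# Hypothetical toy data standing in for the real sequence and quality files.
example_fasta = ">read1 toy chromatogram\nACGTAC\nGT\n"
example_qual = ">read1 toy chromatogram\n10 12 30 35 40 38\n20 9\n"

paired = pair_sequences_with_quality(StringIO(example_fasta), StringIO(example_qual))
```

With real data you would pass open file handles instead of the StringIO objects.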

Filter vector sequences

To check for possible vector contamination, we align the sequences against a vector database. Contaminating regions are filtered (deleted!) from the sequences. During this exercise we will not filter mitochondrial, ribosomal, and other possible contaminant sequences. But remember to mask them in the real world!

A good vector database can be downloaded from the FTP server at NCBI: vector.Z.

Locally, we will use cross_match (for documentation see Phrap and cross_match documentation) to check for possible contamination. BLAST can also detect vector contaminants, but avoid it for this step; otherwise you have to mask your sequences manually.
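When run with its -screen option, cross_match writes a '.screen' file in which bases matching the vector are replaced by X. The sketch below illustrates one possible way to turn such masked sequences into cleaned ones: keep the longest unmasked stretch of each read and drop reads that end up too short. This is an assumption for illustration, not necessarily the procedure used in the course; the function names, the min_length threshold, and the toy reads are hypothetical.

```python
import re


def longest_unmasked_segment(seq, mask_char="X"):
    """Return the longest stretch of seq that contains no mask character."""
    segments = re.split(f"{re.escape(mask_char)}+", seq)
    return max(segments, key=len)


def clean_screened(records, min_length=50):
    """Trim masked vector from each read; discard reads shorter than min_length.

    records maps read names to screened sequences (vector masked as X).
    min_length is a hypothetical cutoff, to be tuned for your own data.
    """
    cleaned = {}
    for name, seq in records.items():
        kept = longest_unmasked_segment(seq.upper())
        if len(kept) >= min_length:
            cleaned[name] = kept
    return cleaned


# Hypothetical toy reads: est1 has vector at its 5' end, est2 is almost all vector.
screened = {"est1": "XXXXXACGTACGTAC", "est2": "XXXXXXXXACG"}
cleaned = clean_screened(screened, min_length=5)
```

Here est1 keeps its insert and est2 is discarded because only three unmasked bases remain.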

Mask repeats

Repeats, low-complexity regions, and similar features can distort the clustering and assembly of sequences. Masking these regions is essential to avoid false results.
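To illustrate what masking means in practice, the following toy Python sketch replaces long single-base runs (for example poly-A tails) with N, so that downstream clustering no longer matches reads on those stretches alone. It is a deliberately simplified stand-in for a real repeat-masking tool; the function name and the min_run threshold are hypothetical.

```python
import re


def mask_homopolymers(seq, min_run=10, mask_char="N"):
    """Replace every run of at least min_run identical bases with mask_char."""
    # ([ACGT]) captures one base; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq.upper())


# Hypothetical toy read with a 12-base poly-A stretch in the middle.
masked = mask_homopolymers("ACGT" + "A" * 12 + "CGTA", min_run=10)
```

The masked read has the same length as the input, so coordinates and quality values stay aligned with the sequence.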

Before starting, create a new directory 'mask' and copy the file 'set.fasta.screen.clean' into it.
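If you prefer to script this preparation step, a minimal Python sketch could look like the following. The helper name stage_for_masking is hypothetical; the directory and file names are the ones given above.

```python
from pathlib import Path
import shutil


def stage_for_masking(source_file, work_dir="mask"):
    """Create work_dir if needed and copy source_file into it.

    Returns the path of the copy inside work_dir.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(exist_ok=True)
    return Path(shutil.copy(source_file, work_dir))


# Usage, run from the directory that contains the cleaned file:
# stage_for_masking("set.fasta.screen.clean")
```

The original file is left untouched, so you can repeat the masking exercise from a fresh copy at any time.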

Again, you can choose to work on your local workstation or use a web service.


Here we will use Phrap for the clustering/assembling step.

... time for a beer!