Data cleaning

Slide presentation

The slide presentation about sequencing, data cleaning, and assembling in pdf format (2 slides per page).

Exercises

Recover EST chromatograms

The first exercise shows you how to recover chromatograms from NCBI by homology, using the NCBI Traces BLAST server (http://www.ncbi.nlm.nih.gov/Traces).

Alternatively you can use the trace server of Ensembl. Use the SSAHA algorithm to search for your sequence.

Take one of the human sequences proposed here and blast it against human chromatograms.
- sequence1
- sequence2
Have a look at the chromatograms.
Question: What do you think about the quality of these chromatograms?

Translate chromatograms

This exercise will show you how to translate a chromatogram into a sequence.

The program Phred translates chromatograms into sequences (FASTA format). It also produces the quality file associated with each sequence (see course).

For details, see the Phred documentation.
A simple way to produce sequence and quality files using Phred:

Copy the human genomic clone chromatograms from '/home/lcerutti/traces' into a directory called 'traces' (the final dot in the command is important!):

cp -rf /home/lcerutti/traces .

Make a new directory 'phred_out':
mkdir phred_out
Run Phred:
phred -id traces -sd phred_out -qd phred_out
The '.seq' and '.qual' files are created in 'phred_out' in FASTA format.
Have a look at the resulting files.
Concatenate the sequences into a single file:
cat phred_out/*.seq > set.fasta
Concatenate quality files:
cat phred_out/*.qual > set.fasta.qual
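
After concatenating, it can be worth checking that every sequence still has exactly one quality value per base. The following is a minimal sketch in Python, not part of the course material: the helper names read_records and length_mismatches and the example reads are made up for illustration, and the parser assumes the usual FASTA layout (a '>' header line followed by sequence lines, or by space-separated integer scores in the quality file).

```python
def read_records(text, is_qual=False):
    """Parse FASTA-style text into {name: sequence string or list of int scores}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'.
            name = line[1:].split()[0]
            records[name] = [] if is_qual else ""
        elif name is not None:
            if is_qual:
                records[name].extend(int(value) for value in line.split())
            else:
                records[name] += line
    return records


def length_mismatches(seqs, quals):
    """Names of sequences whose base count differs from their quality count."""
    return sorted(
        name for name in seqs if len(seqs[name]) != len(quals.get(name, []))
    )


# Made-up data standing in for set.fasta and set.fasta.qual.
# read2 deliberately lacks one quality value.
fasta_text = """>read1 example
ACGTAC
GT
>read2 example
TTGA
"""
qual_text = """>read1 example
10 12 30 40 40 38
20 9
>read2 example
8 25 33
"""

seqs = read_records(fasta_text)
quals = read_records(qual_text, is_qual=True)
mismatched = length_mismatches(seqs, quals)
print("records with mismatched lengths:", mismatched)
```

On the real files, the two example strings would be replaced by the contents of 'set.fasta' and 'set.fasta.qual', for instance open("set.fasta").read().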

Question: What does the quality of the sequences look like? What does a score of 10 mean? And a score of 40?
Try other Phred options, for example '-trim', which gives a quality value of 0 to low-scoring bases at the start and end of the sequences. You can find the options in the Phred documentation.
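
As a reminder of how to read the scores: a Phred quality value Q is defined from the estimated probability P that the base call is wrong, as Q = -10 * log10(P). The short Python sketch below (the function names are chosen here for illustration) converts between the two.

```python
import math


def error_probability(q):
    """Estimated probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_score(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q{q}: error probability {p:g} (about 1 wrong base in {round(1 / p)})")
```

Each increase of 10 in the score therefore corresponds to a tenfold drop in the expected error rate.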

Filter vector sequences

To check for possible vector contamination, we align the sequences against a vector database. Contaminating regions will be filtered (deleted!) from the sequences. In this exercise we will not filter mitochondrial, ribosomal, or other possible contaminant sequences. But remember to mask them in real-world projects!

A good vector database can be downloaded from the ftp server at NCBI: vector.Z.

Locally, we will use cross_match (see the Phrap and cross_match documentation) to check for possible contamination. BLAST can also detect vector contaminants, but avoid it for this step; otherwise you will have to mask your sequences manually.

Using cross_match locally:

Execute cross_match:
cross_match set.fasta /home/lcerutti/vector -minmatch 10 -minscore 20 -screen > cross_match.log
The '-screen' option creates a new file called 'set.fasta.screen'. This file contains the sequences with vector contamination masked by 'X'. The log is written to 'cross_match.log'.
Remember: cross_match also uses the quality file for the analysis, if present (in our case 'set.fasta.qual').
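
To see where a screened sequence was masked, you can look for runs of 'X'. Here is a small Python sketch; the function name masked_runs and the example sequence are invented for illustration and do not come from cross_match itself.

```python
import re


def masked_runs(sequence):
    """Return (start, end) pairs, 1-based and inclusive, for each run of 'X'."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"X+", sequence)]


# Made-up screened read: vector masked at the 5' end and once internally.
example = "XXXXXXACGTTGCAACGTXXXXACGT"
runs = masked_runs(example)
masked = sum(end - start + 1 for start, end in runs)
print(runs, f"{masked}/{len(example)} bases masked")
```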

Using NCBI blast service:

Use megablast at NCBI.
For vector contaminations you can also use the VecScreen service.

Questions:
- Does the search find any vector contamination? If yes, how are the contaminated sequences masked? (Look at the output file 'set.fasta.screen'.)
To filter the masked regions, use the mask.pl script:

mask.pl set.fasta.screen set.fasta.qual

Have a look at the results ('set.fasta.screen.clean' and 'set.fasta.qual.clean'):

more set.fasta.screen.clean
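
What mask.pl does internally is not described here. As an illustration only, the Python sketch below shows one plausible cleaning strategy: keep the longest stretch free of 'X' and cut the quality values at the same positions, so that sequence and quality stay in step. The helper name longest_clean_segment and the example data are hypothetical; the real script may behave differently.

```python
import re


def longest_clean_segment(sequence, qualities):
    """Keep the longest run without 'X', slicing the qualities in parallel.

    This is an assumed strategy for illustration, not the documented
    behaviour of mask.pl.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality lengths differ")
    best = max(
        re.finditer(r"[^X]+", sequence),
        key=lambda m: m.end() - m.start(),
        default=None,
    )
    if best is None:
        # The whole read was masked.
        return "", []
    return sequence[best.start():best.end()], qualities[best.start():best.end()]


# Made-up screened read with its quality values.
seq = "XXXXACGTACXXGT"
qual = [0, 0, 0, 0, 30, 32, 35, 40, 40, 38, 0, 0, 20, 15]
clean_seq, clean_qual = longest_clean_segment(seq, qual)
print(clean_seq, clean_qual)
```

Cutting both files with the same coordinates is the important point: a cleaned sequence whose quality list has a different length would mislead the later assembly step.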

Mask repeats

Repeats, low-complexity regions, and similar sequences can affect the clustering and assembly of sequences. It is very important to mask these regions to avoid false results.

Before starting, create a new directory 'mask' and copy the file 'set.fasta.screen.clean' into it:

mkdir mask
cp set.fasta.screen.clean mask/set2.fasta

Again, you can choose to work on your local workstation or use a web service:

Local:

Enter the 'mask' directory:
cd mask
Execute RepeatMasker:
RepeatMasker set2.fasta
Watch what happens while the program runs.
The program generates a new file, 'set2.fasta.masked', and a number of other files.

On the web:

Go to the RepeatMasker web server (alternative sites: Germany, UK).
Upload your sequence file.
Download the result file into the 'mask' directory (save it as 'set2.fasta.masked') and have a look at the other files produced.

Questions:

Is there any difference between the original file 'set2.fasta' and the masked one, 'set2.fasta.masked'? If yes, what are the differences?
How many repeats are found? Which kind?
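
One way to answer the first question is to compare each original sequence with its masked version position by position. The Python sketch below does this for a single pair of sequences; the function name changed_intervals and the example are made up, and it assumes that masking replaces bases in place (RepeatMasker writes 'N' by default) so that both sequences have the same length.

```python
def changed_intervals(original, masked):
    """Return (start, end) pairs, 1-based inclusive, where the sequences differ."""
    if len(original) != len(masked):
        raise ValueError("sequences must have the same length")
    intervals = []
    start = None
    for position, (a, b) in enumerate(zip(original, masked), start=1):
        if a != b:
            if start is None:
                start = position
        elif start is not None:
            intervals.append((start, position - 1))
            start = None
    if start is not None:
        # The difference runs to the end of the sequence.
        intervals.append((start, len(original)))
    return intervals


# Made-up example: a simple dinucleotide repeat replaced by N.
original = "ACGT" + "ACACACACAC" + "GGTT"
masked = "ACGT" + "N" * 10 + "GGTT"
intervals = changed_intervals(original, masked)
print(intervals)
```

Note that a position that was already 'N' in the original would not be reported, since only differences are detected.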

Assembly

Here we will use Phrap for the clustering/assembling step.

Phrap (see Phrap and cross_match documentation):

Return to your home directory:
cd
Create a new directory 'phrap_out':
mkdir phrap_out
Copy the files 'mask/set2.fasta.masked' and 'set.fasta.qual.clean' into the 'phrap_out' directory, renaming them 'dataset.fasta' and 'dataset.fasta.qual':

cp mask/set2.fasta.masked phrap_out/dataset.fasta
cp set.fasta.qual.clean phrap_out/dataset.fasta.qual

Enter the 'phrap_out' directory:
cd phrap_out
Execute phrap:
phrap dataset.fasta -minmatch 20 -new_ace > phrap.out
Spend some time having a look at the produced files and try to understand them.
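
To get a quick overview of the assembly, you can summarise the ACE file that the '-new_ace' option produces. The Python sketch below is an illustration under stated assumptions: it assumes the ACE layout in which each contig starts with a line of the form 'CO name number_of_bases number_of_reads ...', and the function name summarise_ace and the abbreviated example text are invented here. Check the Phrap documentation for the exact format and output file names.

```python
def summarise_ace(text):
    """List (contig name, number of bases, number of reads) from ACE 'CO' lines."""
    contigs = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: CO <name> <bases> <reads> <segments> <U or C>
        if len(fields) >= 4 and fields[0] == "CO":
            contigs.append((fields[1], int(fields[2]), int(fields[3])))
    return contigs


# Made-up, heavily abbreviated ACE content (header lines only).
ace_text = """AS 2 5

CO Contig1 1200 3 12 U

CO Contig2 850 2 7 U
"""

contigs = summarise_ace(ace_text)
for name, bases, reads in contigs:
    print(f"{name}: {bases} bases from {reads} reads")
```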