Data cleaning
Slide presentation
The slide presentation on sequencing, data cleaning, and assembly is available in PDF format (2 slides per page).
Exercises
Recover EST chromatograms
The first exercise shows how to retrieve chromatograms from NCBI by homology, using the NCBI Traces BLAST server (http://www.ncbi.nlm.nih.gov/Traces).
Alternatively, you can use the trace server of Ensembl. Use the SSAHA algorithm to search for your sequence.
- Take one of the human sequences proposed here and blast it against human chromatograms.
- Have a look at the chromatograms.
- Question: What do you think about the quality of these chromatograms?
Translate chromatograms
This exercise shows how to translate a chromatogram into a sequence.
The program Phred translates the chromatograms into sequences (FASTA format). It also produces the quality file associated with each sequence (see course).
- The Phred documentation can be found here: Phred documentation
- A simple way to produce sequence and quality files using Phred:
- Copy the human genomic clone chromatograms from '/home/lcerutti/traces' into a directory called 'traces':
- cp -rf /home/lcerutti/traces . (the trailing dot is important!)
- Make a new directory 'phred_out':
mkdir phred_out
- Run Phred:
phred -id traces -sd phred_out -qd phred_out
- The files '.seq' and '.qual' are created in 'phred_out' in FASTA format.
- Have a look at the resulting files.
- Concatenate the sequences into a single file:
cat phred_out/*.seq > set.fasta
- Concatenate quality files:
cat phred_out/*.qual > set.fasta.qual
- Question: What does the quality of the sequences look like? What does a score of 10 mean? And a score of 40?
- Try other Phred options, for example '-trim', which gives a quality value of 0 to low-scoring bases at the start and end of the sequences. The options are described in the Phred documentation.
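The scores in the '.qual' files follow the Phred definition Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The following is a minimal sketch, assuming a POSIX shell with awk. The helper names 'phred_to_error' and 'mean_quality' and the small sample file are made up for illustration and are not part of Phred.

```shell
#!/bin/sh
# Hypothetical helper: convert a Phred quality score Q into the
# estimated base-calling error probability P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Hypothetical helper: print the mean quality of each record in a
# FASTA-style quality file (header lines start with ">", all other
# lines hold whitespace-separated integer scores).
mean_quality() {
    awk '
        function report() { if (n > 0) printf "%s\t%.1f\n", name, sum / n }
        /^>/ { report(); name = substr($1, 2); sum = 0; n = 0; next }
        { for (i = 1; i <= NF; i++) { sum += $i; n++ } }
        END { report() }
    ' "$1"
}

# Demo on the two scores asked about in the question above.
phred_to_error 10
phred_to_error 40

# Demo on a tiny made-up quality file (not real Phred output).
sample=$(mktemp)
printf '>read1\n10 20 30\n40\n>read2\n8 8\n' > "$sample"
mean_quality "$sample"
rm -f "$sample"
```

On the exercise data, the same helper could be applied to the concatenated file, for example 'mean_quality set.fasta.qual', to spot reads with a low average quality.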
Filter vector sequences
To check for possible vector contamination, we align the sequences against a vector database. Contaminating regions will be filtered (deleted!) from the sequences. In this exercise we will not filter mitochondrial, ribosomal, or other possible contaminant sequences. But remember to mask them in real-world projects!
A good vector database can be downloaded from the FTP server at NCBI: vector.Z.
Locally, we will use cross_match (for documentation see Phrap and cross_match documentation) to check for possible contamination. BLAST can also be used to check for vector sequence contaminants (avoid it for this step; otherwise you will have to mask your sequences manually).
- Using cross_match locally:
- Using NCBI blast service:
- Use MegaBLAST at NCBI.
- For vector contaminations you can also use the VecScreen service.
- Questions:
- Does the search find any vector contamination? If so, how are the contaminated sequences masked? (Look at the output file 'set.fasta.screen'.)
- To filter the masked regions use the mask.pl script:
- mask.pl set.fasta.screen set.fasta.qual
- Have a look at the results ('set.fasta.screen.clean' and 'set.fasta.qual.clean'):
- more set.fasta.screen.clean
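To compare the files before and after filtering, it can help to count the masked bases per sequence. The following is a minimal sketch, assuming a POSIX shell with awk and assuming that masked vector bases appear as the letter "X" in the screened file, as cross_match does in screening mode. The helper name 'count_masked' and the sample records are made up for illustration; this is not a reimplementation of mask.pl.

```shell
#!/bin/sh
# Hypothetical helper: for every record of a FASTA file, print the
# record name, its total length, and the number of masked bases ("X").
count_masked() {
    awk '
        function report() {
            if (name != "") printf "%s\t%d\t%d\n", name, len, masked
        }
        /^>/ { report(); name = substr($1, 2); len = 0; masked = 0; next }
        {
            len += length($0)          # count the bases on this line first
            masked += gsub(/X/, "")    # gsub returns the number of X removed
        }
        END { report() }
    ' "$1"
}

# Demo on a tiny made-up FASTA file (not real cross_match output).
sample=$(mktemp)
printf '>read1\nACGTXXXXAC\nGTXX\n>read2\nACGT\n' > "$sample"
count_masked "$sample"
rm -f "$sample"
```

Running 'count_masked set.fasta.screen' and then 'count_masked set.fasta.screen.clean' would show, per sequence, how many bases were masked and whether any masked bases remain after filtering.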
Mask repeats
Repeats, low-complexity regions, and similar sequences can affect the clustering and assembly of sequences. It is very important to mask these regions to avoid false results.
Before starting, create a new directory 'mask' and copy the file 'set.fasta.screen.clean' into it:
- mkdir mask
- cp set.fasta.screen.clean mask/set2.fasta
Again, you can choose to work on your local workstation or use a web service:
- Local:
- Enter the 'mask' directory:
cd mask
- Execute RepeatMasker:
RepeatMasker set2.fasta
- Watch what happens during the execution of the program.
- The program generates a new file 'set2.fasta.masked', and a number of other files.
- On the web:
- Go to the RepeatMasker web server (alternative sites: Germany, UK)
- Upload your sequence file.
- Download the result file into the 'mask' directory (save the file as 'set2.fasta.masked') and have a look at the other files produced.
- Questions:
- Is there any difference between the original file 'set2.fasta' and the masked file 'set2.fasta.masked'? If so, what are the differences?
- How many repeats are found? Which kind?
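One way to approach the last question is to tabulate the annotation table that RepeatMasker writes next to the masked file. The following is a minimal sketch, assuming a POSIX shell with awk and sort, and assuming the usual '.out' layout: three header lines followed by one hit per line, with the repeat class/family in column 11. The helper name 'repeats_by_class' and the sample rows are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count annotated repeats per class/family in a
# RepeatMasker ".out" table (assumed layout: 3 header lines, then one
# hit per line with the class/family in column 11).
repeats_by_class() {
    awk '
        NR > 3 && NF >= 11 { count[$11]++ }
        END { for (c in count) printf "%s\t%d\n", c, count[c] }
    ' "$1" | sort
}

# Demo on a tiny made-up table (values are invented, not real results).
sample=$(mktemp)
cat > "$sample" <<'EOF'
   SW  perc perc perc  query     position in query     matching  repeat        position in repeat
score  div. del. ins.  sequence  begin  end  (left)    repeat    class/family  begin  end  (left)  ID

 1306  15.6  6.2  0.0  read1       10   228   (500) +  AluSx     SINE/Alu          1   219   (93)   1
  830  10.1  0.0  1.2  read1      300   420   (308) C  L1PA4     LINE/L1        (20)  6100  5980   2
  912  12.0  1.5  0.0  read2       15   240   (120) +  AluY      SINE/Alu          1   226   (85)   3
EOF
repeats_by_class "$sample"
rm -f "$sample"
```

On the exercise data, the table would presumably be named 'set2.fasta.out' (an assumption based on the input file name), so the call would be 'repeats_by_class set2.fasta.out'.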
Assembly
Here we will use Phrap for the clustering/assembly step.
... time for a beer!