EST clustering
Exercise 1: Cleaning sequences
A little cluster
Here are 4 mouse sequences that you will analyze and clean. Copy them into a local file (you will need to edit them manually).
Using LALIGN, check manually whether the 4 sequences could belong to the same cluster.
Do a similar but automated clustering with CAP.
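To make the idea of "being in the same cluster" concrete, here is a minimal Python sketch of single-linkage clustering: two sequences end up in the same cluster if a chain of significant pairwise matches connects them. This is only an illustration and not how CAP works internally (CAP performs an assembly). The sequence names and the list of linked pairs are hypothetical; replace them with the pairs you judged significant in your own LALIGN comparisons.

```python
def single_linkage_clusters(names, linked_pairs):
    """Group names into clusters: two sequences share a cluster if they
    are connected by a chain of linked pairs (single linkage)."""
    parent = {name: name for name in names}

    def find(name):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in linked_pairs:
        # Merge the two clusters that contain a and b.
        parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical outcome of the manual pairwise comparisons.
names = ["seq1", "seq2", "seq3", "seq4"]
linked_pairs = [("seq1", "seq2"), ("seq2", "seq3")]
clusters = single_linkage_clusters(names, linked_pairs)
print(clusters)
```

Note that seq1 and seq3 are clustered together here even though they were never directly linked: single linkage only needs a chain of matches, which is also why uncleaned vector or repeat sequence can join unrelated ESTs.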
Cleaning
Perform vector clipping:
- Analyze each of your sequences for vector contamination using VecScreen at NCBI.
- Or, better, analyze all sequences together with EVEC from EBI (use an Exp threshold of 0.0001).
- Remove the vector contamination from the sequences using your text editor or word processor.
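If you prefer not to edit by hand, the clipping step can be sketched in a few lines of Python. The sketch assumes you have read the vector match coordinates off the screening report as 1-based, inclusive positions, and it uses one simple policy: cut out the match and keep the longer of the two flanking pieces. The function name, the toy sequence, and the coordinates are hypothetical.

```python
def clip_vector(sequence, vector_start, vector_end):
    """Remove a vector match given as 1-based, inclusive coordinates and
    return the longer flanking piece (assumed to be the insert)."""
    if not 1 <= vector_start <= vector_end <= len(sequence):
        raise ValueError("vector coordinates fall outside the sequence")
    left = sequence[:vector_start - 1]   # bases before the match
    right = sequence[vector_end:]        # bases after the match
    return left if len(left) >= len(right) else right


# Toy example: a made-up 14-base contaminating prefix followed by the insert.
est = "GAATTCGGCACGAG" + "ATGGCTAGCTTGACCGATCG"
clipped = clip_vector(est, 1, 14)
print(clipped)
```

Keeping only the longer flank is a deliberate simplification. A match in the middle of a read may indicate a chimeric clone, and such sequences are better inspected by eye.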
Repeat masking
- Run RepeatMasker to mask repeats with "X".
- Replace the sequences containing repeats with the masked ones.
- Which kind of repeats did you find?
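To see at a glance how much of a sequence was masked, you can locate the runs of "X" in the masked output. The following Python sketch assumes the masked sequence is available as a plain string without the FASTA header; the function names and the toy sequence are hypothetical.

```python
import re


def masked_regions(masked_sequence):
    """Return 1-based, inclusive (start, end) pairs for each run of X."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"X+", masked_sequence)]


def masked_fraction(masked_sequence):
    """Return the fraction of positions that are masked with X."""
    if not masked_sequence:
        return 0.0
    return masked_sequence.count("X") / len(masked_sequence)


# Toy masked sequence with two masked stretches.
masked = "ACGT" + "XXXXXX" + "ACGTAC" + "XXXX" + "AC"
regions = masked_regions(masked)
print(regions, round(masked_fraction(masked), 3))
```

The coordinates can then be compared with the repeat annotation table produced by the masking run to see which repeat family each stretch belongs to.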
Now try to "re-cluster" the cleaned sequences using LALIGN or CAP (choose the fastest option).
What can you say?
Genomic Mapping
Cluster 7 rat sequences from Unigene Rn.43270 using CAP and map the contig onto the genomic sequence (AC094146) with Spidey or Sim4. What can you say? What about the mouse ortholog (U20225)?
Exercise 2: Gene Indices
Unigene
Go to the Unigene web page.
- Go to the Homo sapiens gene index. How many clusters have been built for human? How many sequences have been used to build the clusters?
- Why are mRNA sequences present in the clusters?
You can search the Gene Index by keywords.
- Try the keyword 'myelin'. How many clusters do you get?
Have a detailed look at cluster Hs.69547 (don't hesitate to click around; there is a lot of information available!)
- On which chromosome does this cluster lie? Look at the map graphically.
- How many mRNAs are integrated in the cluster? How many ESTs?
You can also browse the different EST libraries by clicking Library Browser on the Human Gene Index home page.
- How many sequences and clusters does the brain library 'Lib.86' contain?
Digital Differential Display (DDD) allows you to perform in silico differential gene expression analysis.
- Follow the DDD link from the Human Gene Index home page.
- Choose all breast libraries derived from 'tumor' and 'carcinoma' tissues.
- Compare them with sequences from normal breast tissue.
- How many genes are over-expressed in the disease tissues?
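The principle behind this kind of in silico comparison can be sketched as follows: for each gene, count its ESTs in the two library pools and ask whether the difference is larger than expected by chance. The Python sketch below assumes a two-sided Fisher's exact test on a 2x2 table, which is one common choice for such count data; the statistics actually applied by the web tool may differ in detail. The function name and all EST counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table
        pool A: a ESTs from the gene, b other ESTs
        pool B: c ESTs from the gene, d other ESTs
    Returns the p-value."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x gene ESTs in pool A.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Hypothetical counts: 12 of 5000 ESTs in the tumor pool and
# 2 of 6000 ESTs in the normal pool come from the gene of interest.
p_value = fisher_exact_two_sided(12, 5000 - 12, 2, 6000 - 2)
print(f"p = {p_value:.4g}")
```

A small p-value suggests that the gene is represented differently in the two pools, but keep in mind that EST counts depend strongly on how each library was constructed (for example, normalized versus non-normalized libraries).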
TIGR Gene Indices
Enter the TIGR Gene Indices web page.
- Select the Human Gene Index.
- How many Tentative Human Consensus (THC) sequences are present in the latest database release?
- Why are HTs (Human expressed transcripts, or Human ETs) present?
Follow the Gene Product Name link to search the gene index by keyword. Use the keyword 'myelin' for the search. Look in more detail at the report for THC1093817. Look at the Functional Classification based on the Gene Ontology Assignments. All TIGR Gene Indices are classified using the Gene Ontology vocabulary.