Gene prediction
Slide presentation
The slide presentation on HMMs and profiles, in PDF format (2 slides per page).
Exercise
Today we will have an overview of gene prediction algorithms available on the web.
These two 200 kb chromosomal sequences come from mouse and human, respectively: AC002397 and U47924. The regions contain 16 functional genes.
Eight subregions of 20,000 bp have been extracted from the mouse contig:
AC002397 (12001 - 32000)
AC002397 (32001 - 52000)
AC002397 (52001 - 72000)
AC002397 (72001 - 92000)
AC002397 (120001 - 140000)
AC002397 (140001 - 160000)
AC002397 (160001 - 180000)
AC002397 (180001 - 200000)
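If you prefer to cut a subregion yourself from the full contig, the following minimal sketch may help. It assumes the coordinates above are 1-based and inclusive (each range then spans exactly 20,000 bp) and that the contig has been saved locally as a single-record FASTA file. The function names `read_fasta_sequence` and `extract_subregion` are illustrative, not part of any tool used in this exercise.

```python
def read_fasta_sequence(lines):
    """Concatenate the sequence lines of a single-record FASTA file.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Header lines (starting with '>') are skipped.
    """
    return "".join(
        line.strip() for line in lines if not line.startswith(">")
    ).upper()


def extract_subregion(sequence, start, end):
    """Return the subsequence from `start` to `end`.

    Coordinates are assumed to be 1-based and inclusive, as in
    'AC002397 (12001 - 32000)'.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"invalid range {start}-{end} for length {len(sequence)}")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]
```

For example, with the contig stored in a hypothetical file `AC002397.fasta`, `extract_subregion(read_fasta_sequence(open("AC002397.fasta")), 12001, 32000)` would return the first subregion in the list.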
Choose one of these sequences. The goal is to predict all complete genes contained in that sequence using gene prediction programs, EST searches, and species comparisons.
Rules of the game:
Proceed in four phases, using increasing amounts of information not necessarily available for all genes:
- Phase 1: Predict coding regions and gene structures. Use only gene prediction programs and WWW servers that do not use sequence homology information. Pick two or three predictors from the list and try them. Compare the results.
- GRAIL Oak Ridge Nat. Laboratory (US)
- FGENEH Sanger Center (UK)
- MZEF Cold Spring Harbor Labs (US)
- HMMgene Center for Biological Sequence Analysis (Denmark)
- GENSCAN MIT (US)
- GENEMARK Georgia Institute of Technology (US)
- Genie Lawrence Berkeley Nat. Laboratory (US)
- GENVIEW Instituto Tecnologie Biomediche Avanzate (Italy)
- Phase 2: Extract the predicted coding regions and/or proteins using the tools available (see following list). BLAST the predicted genes/proteins to find homologs (ESTs, proteins, cDNAs) that confirm the gene structure.
- Phase 3: Homologs can be used to build an improved gene structure. You can analyze fragments of the sequence to avoid long waiting times.
- Procrustes Builds a gene structure by comparison with homologous proteins.
- Wise2 Builds a gene structure using a protein or HMM profile as template. Maximum DNA size is 6 kb in interactive mode.
- Phase 4: Compare your results to the original sequence and its annotation (AC002397), and against human sequences. Use pairwise alignments and dot matrices to compare the sequences.
- Dotlet
- LALIGN
- Ensembl Human genome. Compare your predicted genes to the human ones in the Ensembl database.
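To see what a dot matrix comparison in Phase 4 computes, here is a minimal sketch. It is a simplification: it marks a dot wherever two sequences share an identical word of fixed length, whereas servers such as Dotlet score sliding windows and apply a threshold. The function names and the default word length are assumptions chosen for illustration.

```python
from collections import defaultdict


def dot_matrix(seq_a, seq_b, word=3):
    """Return (i, j) pairs (0-based) where seq_a[i:i+word] == seq_b[j:j+word]."""
    # Index every word of seq_b by its start position.
    index = defaultdict(list)
    for j in range(len(seq_b) - word + 1):
        index[seq_b[j:j + word]].append(j)

    # Look up every word of seq_a in that index.
    hits = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], ()):
            hits.append((i, j))
    return hits


def render_dots(hits, n_rows, n_cols):
    """Draw the hits as text: rows follow seq_a, columns follow seq_b."""
    grid = [["." for _ in range(n_cols)] for _ in range(n_rows)]
    for i, j in hits:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)
```

Conserved regions, such as exons shared between the mouse and human sequences, show up as diagonal runs of dots. A longer word length reduces background noise at the cost of missing weakly conserved stretches.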
Questions:
- How accurate are gene prediction algorithms?
- Which gene prediction tool performed the best on your sequence?
- Which gene prediction tool can deal with multiple genes in one sequence?
- How useful are EST/protein searches for gene prediction?
- How useful are cross-species comparisons of genomic sequences for gene prediction?
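One way to put a number on the first two questions is to compare predicted and annotated exons nucleotide by nucleotide. The sketch below assumes exons are given as 1-based, inclusive (start, end) pairs on the same strand and in the same coordinate system. It follows the convention common in gene prediction evaluations, where sensitivity is the fraction of annotated coding nucleotides that were predicted, and specificity is the fraction of predicted coding nucleotides that are annotated (what other fields call precision). The function names are illustrative.

```python
def coding_positions(exons):
    """Expand 1-based, inclusive (start, end) exon intervals into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(predicted_exons, annotated_exons):
    """Return (sensitivity, specificity) at the nucleotide level."""
    predicted = coding_positions(predicted_exons)
    annotated = coding_positions(annotated_exons)

    tp = len(predicted & annotated)  # coding in both
    fp = len(predicted - annotated)  # predicted only
    fn = len(annotated - predicted)  # annotated only

    sensitivity = tp / (tp + fn) if annotated else 0.0
    specificity = tp / (tp + fp) if predicted else 0.0
    return sensitivity, specificity
```

With made-up coordinates, a prediction of `[(101, 200), (301, 350)]` against an annotation of `[(101, 200), (301, 400)]` finds 150 of the 200 annotated coding nucleotides and predicts nothing outside them, so the sensitivity is 0.75 and the specificity is 1.0. Nucleotide-level scores are forgiving: a prediction can score well here and still get exon boundaries wrong, so it is worth checking exact exon matches as well.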
... relax now. Anyone for a drink?