HMMs and Profiles

Slide presentation

The slide presentation about HMMs and Profiles in pdf format (2 slides per page).

Exercise

Today we will build a protein profile (HMM and generalized profile) starting from scratch.

The first thing to do is to recover the sequence we want to analyze. We will use the MLTD_ECOLI protein from E. coli.

Copy the sequence in your directory:

cp /home/lcerutti/MLTD_ECOLI.fasta .

A first look at the protein

Use Dotlet to look at the protein. Compare your protein against itself.

Questions:

Change the parameters of the dotplot and look at the changes.
What is the structure of the protein?
There are any repeats? If yes how many?
In how many parts will you split the sequence before running a PSI-BLAST?

PSI-BLAST search

Now that you have an idea about the regions of the protein you can use each one of the regions for a PSI-BLAST search.

To cut your protein you can use the "extractseq" tool from EMBOSS, by hand (copy-paste philosophy), or more simply fetch directly the sub-sequences:
fetch -f sw:MLTD_ECOLI[start_pos..end_pos] > seq_file.fasta
For each sub-sequence run a PSI-BLAST against SWISS-PROT:
blastpgp -d swiss -i seq_file.fasta -j 10 -h 0.0001 > out.psi
Questions:

Have a look at the PSI-BLAST output files.
Notice that the search converged. What does it mean?
What happens if you use an inclusion value of 0.01 (-h 0.01)?

Recover the PSI-BLAST results

We now recover all sequences from the PSI-BLAST results with an e-value lower than a specified threshold.

To avoid doing this by hand (... quite annoying!) use this little script extract_psi_seq.pl.

Run the script against your PSI-BLAST results:
extract_psi_seq.pl -e 1e-4 out.psi > out.psi.fasta

Build a multiple alignment

To build the multiple alignment we will use clustalw (some documentation here).

Build multiple alignments:
clustalw -infile=out.psi.fasta
Have a look at the results (the alignments are found in the .aln files).
To have a nice view of the alignment use jalview:
jalview out.psi.ali CLUSTAL
Questions:

What do you think about these alignments?
Are they good to build a profile?

Build the profile

Open the alignment corresponding to the N-term region of your protein with jalview and redefine the N-term and C-term regions of the alignment (use the "Remove sequences" left and right in the Edit menu).

Save the new alignment in MSF format as 'profile.msf'.

Now we will build the profile using HMMER2 package

Run hmmbuild:
hmmbuild -f profile.hmm profile.msf

Observe the output produced by the program. You can read a lot of information about the building of the HMM profile, as for example that it uses Dirichlet priors.

The given scores are also very useful:
```
Constructed a profile HMM (length 149)
Average score:      227.25 bits
Minimum score:      199.93 bits
Maximum score:      249.33 bits
Std. deviation:      13.78 bits
```
If the minimum score is very low (~10), then the HMM is not modeling some of the sequences well (not in the same family or mis-aligned).
Before using the profile-HMM against a database, we must calibrate it to have good estimates of statistical significance for database matches.
Run hmmcalibrate:
hmmcalibrate --num 1000 profile.hmm

The --num 1000 parameter means that 1000 sequences are randomly generated. The default which should be used in preference is 5000.

The mu and lambda parameters produced by hmmcalibrate represent the extreme value distribution of the noise.
```
HMM    : profile
mu     :    -8.995543
lambda :     0.763559
max    :    -0.051000
```
Question:

How does it look the file 'profile.hmm'? Try to understand it. (HMM files produce by HMMER2 are well explained in the HMMER User's Guide.

Search a database with the HMM profile:
hmmsearch -Z 102504 profile.hmm small_db.seq > hmmsearch.output

Here we use a small database (small_db.seq) for our search (hmmsearch is very CPU expensive!!). To avoid scoring problems we artificially the number of sequences of the database as the number of sequences found in Swissprot (-Z 102504).

Have a look at the hmmsearch output (more hmmsearch.output). The output is divided in three part. Try to understand them.

Rebuild an alignment

We can use the sequences found by hmmsearch to rebuild a profile.

Retrieve the sequences from the file hmmsearch.output using the script hmmsearch_get_seq.pl:

hmmsearch_get_seq.pl hmmsearch.output > profile2.fasta

Align the sequences to the existing profile:

hmmalign -q profile.hmm profile2.fasta | selex2f > profile2.aln

Questions:

Are you satisfied of the alignment?
How could you improve the alignment
Try to improve it and rebuild a new profile. Do a new search against the protein database with the new profile.

Ouff! You deserve some freedom ... yeah, LUNCH TIME!