The slide presentation about HMMs and Profiles in pdf format (2 slides per page).
Today we will build a protein profile (HMM and generalized profile) starting from scratch.
The first thing to do is to recover the sequence we want to analyze. We will use the MLTD_ECOLI protein from E. coli.
Copy the sequence in your directory:
Use Dotlet to look at the protein. Compare your protein against itself.
Now that you have an idea about the regions of the protein you can use each one of the regions for a PSI-BLAST search.
fetch -f sw:MLTD_ECOLI[start_pos..end_pos] > seq_file.fasta
blastpgp -d swiss -i seq_file.fasta -j 10 -h 0.0001 > out.psi
We now recover all sequences from the PSI-BLAST results with an e-value lower than a specified threshold.
To avoid doing this by hand (... quite annoying!) use this little script extract_psi_seq.pl.
extract_psi_seq.pl -e 1e-4 out.psi > out.psi.fasta
To build the multiple alignment we will use clustalw (some documentation here).
jalview out.psi.ali CLUSTAL
Open the alignment corresponding to the N-term region of your protein with jalview and redefine the N-term and C-term regions of the alignment (use the "Remove sequences" left and right in the Edit menu).
Save the new alignment in MSF format as 'profile.msf'.
Now we will build the profile using HMMER2 package
hmmbuild -f profile.hmm profile.msf
Observe the output produced by the program. You can read a lot of information about the building of the HMM profile, as for example that it uses Dirichlet priors.
The given scores are also very useful:
Constructed a profile HMM (length 149) Average score: 227.25 bits Minimum score: 199.93 bits Maximum score: 249.33 bits Std. deviation: 13.78 bits
If the minimum score is very low (~10), then the HMM is not modeling some of the sequences well (not in the same family or mis-aligned).
hmmcalibrate --num 1000 profile.hmm
The --num 1000 parameter means that 1000 sequences are randomly generated. The default which should be used in preference is 5000.
The mu and lambda parameters produced by hmmcalibrate represent the extreme value distribution of the noise.
HMM : profile mu : -8.995543 lambda : 0.763559 max : -0.051000
hmmsearch -Z 102504 profile.hmm small_db.seq > hmmsearch.output
Here we use a small database (small_db.seq) for our search (hmmsearch is very CPU expensive!!). To avoid scoring problems we artificially the number of sequences of the database as the number of sequences found in Swissprot (-Z 102504).
Have a look at the hmmsearch output (more hmmsearch.output). The output is divided in three part. Try to understand them.
We can use the sequences found by hmmsearch to rebuild a profile.
Retrieve the sequences from the file hmmsearch.output using the script hmmsearch_get_seq.pl:
hmmsearch_get_seq.pl hmmsearch.output > profile2.fasta
Align the sequences to the existing profile:
hmmalign -q profile.hmm profile2.fasta | selex2f > profile2.aln