Protein Domains and PSI-blast

Identification of Known Domains in a Protein Sequence

What can you predict about this sequence
```
MGIQGLAKLI ADVAPSAIRE NDIKSYFGRK VAIDASMSIY QFLIAVRQGG DVLQNEEGET
TSHLMGMFYR TIRMMENGIK PVYVFDGKPP QLKSGELAKR SERRAEAEKQ LQQAQAAGAE
QEVEKFTKRL VKVTKQHNDE CKHLLSLMGI PYLDAPSEAE ASCAALVKAG KVYAAATEDM
DCLTFGSPVL MRHLTASEAK KLPIQEFHLS RILQELGLNQ EQFVDLCILL GSDYCESIRG
IGPKRAVDLI QKHKSIEEIV RRLDPNKYPV PENWLHKEAH QLFLEPEVLD PESVELKWSE
PNEEELIKFM CGEKQFSEER IRSGVKRLSK SRQGSTQGRL DDFFKVTGSL SSAKRKEPEP
KGSTKKKAKT GAAGKFKRGK
```
using the following Motif-Scan servers?
- Pfam
- Prosite
- SMART
- InterPro, which is an attempt to federate Pfam, Prosite, SMART and a few others
- CD-Search
- ProDom
How much do these predictions differ from each other? Which server do you prefer? Why?
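
One way to make "how much do the predictions differ" concrete is to compare the residue ranges that two servers report for the same domain. The sketch below is only an illustration: the function name `interval_jaccard` and the example coordinates are hypothetical placeholders, not real server output.

```python
def interval_jaccard(a, b):
    """Jaccard overlap of two 1-based inclusive residue ranges (start, end)."""
    a_start, a_end = a
    b_start, b_end = b
    # Number of residues covered by both ranges (0 if they do not overlap).
    shared = max(0, min(a_end, b_end) - max(a_start, b_start) + 1)
    # Number of residues covered by at least one of the two ranges.
    union = (a_end - a_start + 1) + (b_end - b_start + 1) - shared
    return shared / union


# Hypothetical placeholder coordinates; replace them with the ranges
# you actually read from two server reports.
server_a = (1, 100)
server_b = (11, 110)
print(f"overlap = {interval_jaccard(server_a, server_b):.2f}")
```

A value close to 1 means the two servers agree on the domain boundaries; a value of 0 means the predicted ranges do not overlap at all.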

Retrieve the SwissProt entry that corresponds to the above sequence and observe how the predictions of the different Motif-Scan servers are incorporated into the annotations of the SwissProt entry.

Protein Classification based on Domain Architecture

Using the Hits protein workbench, retrieve all proteins in SwissProt that contain a match to both Prosite profiles 53EXO_N_DOMAIN and 53EXO_I_DOMAIN:
- Go to the "At least" query form
- Type prf:53EXO_N_DOMAIN prf:53EXO_I_DOMAIN in the text area, set minimal count to 2, and search SwissProt.
- Send the list of about fifty proteins into the protein hub using the "more about these proteins" button.
Try to regroup these proteins into a few families by looking at their domain architecture. The most useful tool for this purpose is probably the sequence element viewer SEView. Two criteria are particularly useful in establishing the classification:
- The length in amino acids of the intervening sequence between the motifs 53EXO_N_DOMAIN and 53EXO_I_DOMAIN.
- The identity of the associated motifs.
Does your classification reflect the IDs given by the SwissProt annotators? As this task is quite time-consuming, you can get a pre-established classification here, but note that some domain definitions have been altered. Try to form your own idea before looking at that page.
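
The two criteria above can also be applied mechanically once the motif matches have been written down. The following is a minimal sketch, not the method used by the Hits workbench: the function name `classify`, the bin size of 50 residues, the protein IDs, the motif name `OTHER_MOTIF` and all coordinates are hypothetical placeholders. It also assumes that each motif occurs at most once per protein.

```python
from collections import defaultdict


def classify(proteins, bin_size=50):
    """Group proteins by (binned spacer length, associated motifs).

    `proteins` maps a protein ID to a list of (motif, start, end) matches
    with 1-based inclusive coordinates. Assumes one match per motif.
    """
    core = {"53EXO_N_DOMAIN", "53EXO_I_DOMAIN"}
    families = defaultdict(list)
    for prot_id, matches in proteins.items():
        by_name = {name: (start, end) for name, start, end in matches}
        n_end = by_name["53EXO_N_DOMAIN"][1]
        i_start = by_name["53EXO_I_DOMAIN"][0]
        # Residues strictly between the end of N and the start of I.
        spacer = max(0, i_start - n_end - 1)
        # Any motif other than the two 5'-3' exonuclease profiles.
        others = tuple(sorted(set(by_name) - core))
        families[(spacer // bin_size, others)].append(prot_id)
    return dict(families)


# Hypothetical placeholder matches, not real SwissProt data.
proteins = {
    "PROT_A": [("53EXO_N_DOMAIN", 1, 100), ("53EXO_I_DOMAIN", 120, 200)],
    "PROT_B": [("53EXO_N_DOMAIN", 1, 95), ("53EXO_I_DOMAIN", 130, 210)],
    "PROT_C": [("53EXO_N_DOMAIN", 1, 100), ("53EXO_I_DOMAIN", 700, 780),
               ("OTHER_MOTIF", 300, 400)],
}
families = classify(proteins)
for key, members in families.items():
    print(key, members)
```

Binning the spacer length is a design choice made here for illustration; a different bin size, or clustering on the raw lengths, may separate the families more cleanly.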

These two other links might help you sort out the different types of domain architecture found in these proteins. You must, however, supply one of the sequences as a query to launch them:
- The NCBI's Dart tool
- Architecture analysis at SMART

PSI-blast Iteration

The previous exercise gave you some knowledge of the names and domain architectures of a group of related proteins. We will now exploit this knowledge to observe the behaviour of PSI-blast. You can use either the NCBI web interface or our still experimental web interface to begin the exercise.
- Execute three cycles of PSI-blast using the sub-sequence below as query.
```
>sw:XPG_XENLA/27-95
LAVDISIWLNQAVKGARDRQGNAIQNAHLLTLFHRLCKLLFFRIRPIFVFDGEAPLLKRQTLAKRRQRT
```
  Limit your search to the SwissProt database. At each cycle, record the E-values obtained for the proteins XPG_XENLA, FEN1_HUMAN and DIN7_YEAST. Present these E-values in a table (3 cycles vs 3 proteins) and try to explain what you observe.
- Redo the above exercise using
```
>sw:DPO1_ECOLI/10-81
ILVDGSSYLYRAYHAFPPLTNSAGEPTGAMYGVLNMLRSLIMQYKPTHAAVVFDAKGKTFRDELFEHYKSHR
```
  as query and look at the E-values obtained for DPO1_ECOLI and EX9_ECOLI.
- Our still experimental web interface offers the possibility of using a multiple sequence alignment as query. Align the two preceding queries using Clustal-W, T-Coffee or lalign (in global alignment mode). Then execute three cycles of PSI-blast using the resulting alignment as the initial query. Record the E-values of XPG_XENLA, FEN1_HUMAN, DIN7_YEAST, DPO1_ECOLI and EX9_ECOLI.
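
The E-value tables asked for in the steps above can be laid out with a few lines of code. This is only a convenience sketch: the function name `evalue_table` and the layout are assumptions, and no E-values are filled in, since they must come from your own PSI-blast reports.

```python
def evalue_table(records, proteins, n_cycles=3):
    """Format E-values as plain text: one row per protein, one column per cycle.

    `records` maps (protein, cycle) to an E-value. A missing entry
    (protein not reported at that cycle) is shown as '-'.
    """
    cycles = range(1, n_cycles + 1)
    header = f"{'protein':<12}" + "".join(f"{'cycle ' + str(c):>12}" for c in cycles)
    lines = [header]
    for prot in proteins:
        cells = []
        for cycle in cycles:
            value = records.get((prot, cycle))
            cells.append(f"{'-':>12}" if value is None else f"{value:>12.1e}")
        lines.append(f"{prot:<12}" + "".join(cells))
    return "\n".join(lines)


# Left empty on purpose: add one entry per observation, keyed by
# (protein ID, cycle number), with the E-value read from the report.
records = {}
proteins = ["XPG_XENLA", "FEN1_HUMAN", "DIN7_YEAST"]
print(evalue_table(records, proteins))
```

Showing a protein that is absent from a cycle as '-' rather than leaving it out keeps the rows comparable, which makes it easier to see at which cycle a distant relative first appears.
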

Which Multiple Sequence Alignment?

The alignment of the two sequences below was deduced from the structures of two glutathione S-transferases. Only the spatial coordinates of the alpha-carbon atoms of both crystal structures were taken into account to produce the alignment; the identity of the amino acids was not considered.
```
>1gul/1-217
RPKLHYPNGRGRMESVRWVLAAAGVEFDEEFLET-KEQLYKLQDGNHLLFQQVPMVEIDGMKLVQTRSILHYIADKH----NLFGKNLKERTLIDMYVEGT----LDLLELLIMHPF----LKPDDQQKEVVNMAQKAIIRYFPVFEKILRGHGQSFLVGNQLSLADVILLQTILALEEKIPNILSAFPFLQEYTVKLSNIPTIKRFLEPGSKKKPPPDEIYVRTVYNIF
>1ljr/1-244
GLELFLDLVSQPSRAVYIFAKKNGIPLELRTVDLVKGQHKSKEFLQINSLGKLPTLKDGDFILTESSAILIYLSCKYQTPDHWYPSDLQARARVHEYLGWHADCIRGTFGIPLWVQVLGPLIGVQVPEEKVERNRTAMDQ-ALQWLEDKFLG-DRPFLAGQQVTLADLMALEELMQPVALGYELFEGRPRLAAWRGRVEAFLGAELCQEAHSIILSILEQAAKKTLPTPS
```
Paste this alignment into the query form of our still experimental web interface and launch a search against SwissProt, restricting the taxonomic range to Bacteria.

In a second browser window, paste the alignment into the text area of the MSA hub. Re-align the sequences using a sequence-based method such as Clustal-W. Then upload the re-aligned sequences into the PSI-blast query form and launch the search against SwissProt, again restricting the taxonomic range to Bacteria.
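
To see how far the structure-based and the sequence-based alignments differ, it can help to compute the percent identity over the aligned columns of each. The sketch below assumes the alignment is in FASTA format with the header and the gapped sequence on separate lines; the function names are hypothetical, and the toy alignment is for illustration only (it is not the 1gul/1ljr alignment).

```python
def parse_aligned_fasta(text):
    """Parse aligned FASTA text into a dict {header: gapped sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequence lines are concatenated until the next header.
            records[name] += line
    return records


def percent_identity(seq_a, seq_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Toy alignment for illustration only.
toy = """>seq1
ACD-EF
>seq2
ACEWEF
"""
aln = parse_aligned_fasta(toy)
print(percent_identity(aln["seq1"], aln["seq2"]))
```

Running the same function on both versions of the alignment gives one number per alignment, which can then be set against the differences seen in the two PSI-blast searches.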

Compare the two PSI-blast outputs. Use the "more about the selected proteins" button to verify with SEView in the Protein Hub that all matched proteins are actually related to glutathione S-transferases. For this purpose, you can be quite confident in the predictions of Pfam and Prosite.
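
If the two hit lists are long, the comparison can be done on the protein IDs alone. This is a minimal sketch under the assumption that you have copied the IDs of the hits from each report into a list; the function name `compare_hits` and the IDs shown are hypothetical placeholders.

```python
def compare_hits(hits_a, hits_b):
    """Split two collections of hit IDs into shared and search-specific lists."""
    a, b = set(hits_a), set(hits_b)
    return {
        "shared": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


# Hypothetical placeholder IDs; replace them with the hits from your
# structure-based and sequence-based searches.
structure_based = ["HIT_1", "HIT_2", "HIT_3"]
sequence_based = ["HIT_2", "HIT_3", "HIT_4"]
result = compare_hits(structure_based, sequence_based)
for label, ids in result.items():
    print(label, ids)
```

The proteins found by only one of the two searches are the ones most worth checking in SEView.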

Do one more cycle of PSI-blast in each window. Does this confirm your preliminary observations?

Marco Pagni

Last modified: Fri Aug 31 14:05:44 CEST 2001