About the FISH server and structure-anchored hidden Markov models
FISH, which stands for Family Identification with Structure anchored HMMs, is a server for identifying remote sequence homologues on the basis of protein domains. The FISH server uses a collection of structure-anchored hidden Markov models (saHMMs) to detect homologous relationships. Establishing the collection of saHMMs requires the following main steps for each domain family: i) selection of representatives for the "midnight ASTRAL set", ii) structural superposition of the representatives, iii) extraction of a multiple sequence alignment from the structural superposition, and iv) creation of the associated structure-anchored hidden Markov model.
To define homologous structural domains we use the SCOP classification at the family level, and the coordinates of the individual domains are taken from PDB-style files in the ASTRAL compendium. To avoid bias toward very common sequences and to obtain the maximum evolutionary spread among the representatives, we use only sequences with a mutual sequence identity below a certain limiting curve, the so-called twilight zone curve.
In the twilight zone one can no longer determine from the percentage sequence identity alone whether two aligned protein sequences are related. From a plot of alignment length versus sequence identity a twilight zone curve can be defined [5-7], such that most protein pairs which appear above the curve are homologues. Around the curve, unrelated pairs start to appear, and their number increases rapidly as one descends below the curve into the "midnight zone". Pairs of proteins falling below this curve are most likely not related. However, some are, and it is those we want to keep as representative domains in the "midnight ASTRAL set", the base of the saHMMs. The procedure to construct this set selects, for each family, only those domains which have a mutual sequence identity below the twilight zone curve, are determined to a resolution better than 3.6 Å, and match our requirements for good quality crystal structures.
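The selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the curve follows the general form reported by Rost [7], but the constants, the greedy selection order, and the names rost_identity_curve and select_representatives are hypothetical and are not taken from the FISH pipeline. The resolution and structure quality filters are omitted.

```python
import math
from typing import Callable, Dict, FrozenSet, List, Tuple


def rost_identity_curve(length: int, n: float = 0.0) -> float:
    """Illustrative twilight zone curve: percent identity as a function of
    alignment length. Assumption: constants follow the form in Rost (1999);
    the curve actually used for the midnight ASTRAL set may differ."""
    if length <= 11:
        return 100.0
    if length > 450:
        return n + 19.5
    return n + 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))


def select_representatives(
    domains: List[str],
    pairwise: Dict[FrozenSet[str], Tuple[float, int]],
    curve: Callable[[int], float] = rost_identity_curve,
) -> List[str]:
    """Greedy sketch (an assumption, not the published procedure): keep a
    domain only if its identity to every domain already kept lies below
    the curve at the corresponding alignment length."""
    selected: List[str] = []
    for domain in domains:
        keep = True
        for other in selected:
            identity_pct, aln_len = pairwise[frozenset((domain, other))]
            if identity_pct >= curve(aln_len):
                keep = False
                break
        if keep:
            selected.append(domain)
    return selected


# Hypothetical family with three domains: (percent identity, alignment length)
pairs = {
    frozenset(("a", "b")): (40.0, 100),  # above the curve, so b is dropped
    frozenset(("a", "c")): (15.0, 100),  # below the curve, so c is kept
    frozenset(("b", "c")): (18.0, 100),
}
representatives = select_representatives(["a", "b", "c"], pairs)
```

With a greedy scheme like this the result depends on the order in which domains are visited, so a real procedure would need a rule for that order (for example, by structure quality).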
For each SCOP family with at least two members in the "midnight ASTRAL set", a multiple structure alignment of the representative structures is produced using MUSTANG, resulting in a structure-anchored multiple sequence alignment (saMSA).
To construct the structure-anchored hidden Markov models (saHMMs), HMMs are generated from the saMSAs using HMMER, with default parameters for hmmbuild. All saHMMs are then calibrated using hmmcalibrate, with default settings, to obtain fitted E-values. In this way we create one saHMM for each SCOP protein domain family that has at least two representatives.
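As a rough sketch of this step, the two HMMER calls could be driven from Python as shown below. It assumes a HMMER 2 installation with hmmbuild and hmmcalibrate on the PATH (hmmcalibrate does not exist in HMMER 3); the function names and the file names are hypothetical, and no options are passed because the text states that defaults are used.

```python
import subprocess
from typing import List


def hmmer2_commands(samsa_path: str, hmm_path: str) -> List[List[str]]:
    """Return the two command lines, in order: build the HMM from the
    saMSA, then calibrate it so that E-values are fitted."""
    return [
        ["hmmbuild", hmm_path, samsa_path],  # HMMER 2 order: output HMM, then alignment
        ["hmmcalibrate", hmm_path],          # calibrates the HMM file in place
    ]


def build_sahmm(samsa_path: str, hmm_path: str) -> None:
    """Run both steps; raises CalledProcessError if either tool fails."""
    for command in hmmer2_commands(samsa_path, hmm_path):
        subprocess.run(command, check=True)


# Hypothetical file names for one family
commands = hmmer2_commands("family.samsa", "family.hmm")
```

Calling build_sahmm once per saMSA would then yield one calibrated saHMM per family.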
A search of SCOP sequences against the SCOP 1.69 version of our saHMMs shows that 95% of the matches are to the correct family, using an E-value cutoff of 0.01. Of the few hits outside the family, almost all fall within the correct superfamily. A comparison with PSI-BLAST shows that the saHMMs have consistently lower errors per query at a given coverage. Evaluating the ability of the saHMMs to correctly identify new family members by searching with sequences from SCOP version 1.71 resulted in 99% accuracy and 85% coverage. In a similar evaluation we find that the saHMMs performed about 6% better than the corresponding Pfam_ls HMMs.
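To make the terms used above concrete, the following sketch computes accuracy, coverage and errors per query from a list of hits at a given E-value cutoff. The definitions are assumptions chosen for illustration (accuracy as the fraction of accepted hits that match the query's true family, coverage as the fraction of queries with at least one correct accepted hit); the exact definitions behind the reported figures are not given here, and the Hit type and evaluate function are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    query: str     # identifier of the query sequence
    family: str    # family of the saHMM that matched
    evalue: float  # E-value reported for the match


def evaluate(hits: List[Hit], true_family: Dict[str, str],
             cutoff: float = 0.01) -> Dict[str, float]:
    """Summarise search results under the assumed definitions above."""
    accepted = [h for h in hits if h.evalue <= cutoff]
    correct = [h for h in accepted if true_family.get(h.query) == h.family]
    covered_queries = {h.query for h in correct}
    n_queries = len(true_family)
    return {
        "accuracy": len(correct) / len(accepted) if accepted else 0.0,
        "coverage": len(covered_queries) / n_queries if n_queries else 0.0,
        "errors_per_query": (len(accepted) - len(correct)) / n_queries
        if n_queries else 0.0,
    }


# Hypothetical example with four queries
truth = {"q1": "A", "q2": "A", "q3": "C", "q4": "D"}
hits = [
    Hit("q1", "A", 1e-5),   # correct, accepted
    Hit("q2", "B", 1e-3),   # wrong family, accepted
    Hit("q3", "C", 0.5),    # correct family but above the cutoff
    Hit("q4", "D", 1e-10),  # correct, accepted
]
summary = evaluate(hits, truth)
```

Sweeping the cutoff and recording errors per query against coverage gives the kind of curve on which two methods can be compared.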
1. Tångrot, J., et al., FISH - Family identification of sequence homologues using structure anchored hidden Markov models. Nucleic Acids Research, 2006. 34: pp. W10-W14.
2. Tångrot, J., et al., Design, Construction and use of the FISH server. PARA2006, Lecture Notes in Computer Science, 2007. LNCS4699, pp. 647-657.
3. Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): pp. 536-540.
4. Brenner, S.E., P. Koehl, and M. Levitt, The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Research, 2000. 28(1): pp. 254-256.
5. Sander, C. and R. Schneider, Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 1991. 9(1): pp. 56-68.
6. Abagyan, R.A. and S. Batalov, Do aligned sequences share the same fold? J Mol Biol, 1997. 273(1): pp. 355-368.
7. Rost, B., Twilight zone of protein sequence alignments. Protein Eng, 1999. 12(2): pp. 85-94.
8. Konagurthu, A.S., et al., MUSTANG: A multiple structural alignment algorithm. Proteins, 2006. 64: pp. 559-574.
9. Eddy, S.R., http://hmmer.janelia.org
10. Tångrot, J., et al., Accurate Domain Identification with Structure-Anchored Hidden Markov Models, saHMMs. Submitted.