William R. Pearson
Professor of Biochemistry & Molecular Genetics
Ph.D., California Institute of Technology
Computational Approaches to Protein Evolution



We have a long-standing interest in exploiting protein sequence information, both to better understand how new protein sequences arise and to understand the relationship between protein sequence and protein structure. Since the description of the FASTP program in 1985, our group has been developing more effective methods for identifying distantly related protein sequences. Over the past 20 years, state-of-the-art methods have improved to the point where proteins that diverged from a common ancestor within the past billion years are likely to be detected by sequence similarity searching. We hope to push that threshold back beyond 2 billion years (near the time when prokaryotes and eukaryotes diverged), but it is already possible to identify novel proteins that are likely to have emerged in the last 500-800 million years. If we can identify proteins that emerged in the last 100-250 million years, it may be possible to identify the mechanisms by which new proteins are formed.

In addition, our ability to reliably identify very distant protein homologs allows us to begin to consider the question, "how often do different protein families catalyze the same function?" The best-known example of nature "reinventing the wheel" is trypsin/subtilisin, two proteins with very different structures that have very similar chemical functions. We have also identified at least four, and perhaps five, evolutionarily distinct families of glutathione transferases (Pearson, 2005). All of these families are very old; they are found in bacteria, fungi, plants, and animals, and thus may be examples of functionally redundant proteins that are more than 2.5 billion years old. There are dozens, and perhaps hundreds, of examples of functional redundancy in non-homologous (structurally dissimilar) protein families, which suggests that the emergence of new proteins is more common than is widely accepted.

Selected References

Sierk ML, Smoot ME, Bass EJ, Pearson WR. (2010) "Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments." BMC Bioinformatics. 11:146. [PubMed]

Gonzalez MW, Pearson WR. (2010) "Homologous over-extension: a challenge for iterative similarity searches." Nucleic Acids Res. 38:2177-89. Epub 2010 Jan 11. [PubMed]

Lavelle DT, Pearson WR. (2010) "Globally, unrelated protein sequences appear random." Bioinformatics. 26:310-8. Epub 2009 Nov 30. [PubMed]

Cantarel BL, Morrison HG, Pearson W. (2006) "Exploring the relationship between sequence similarity and accurate phylogenetic trees." Mol Biol Evol. 23(11):2090-100. Epub 2006 Aug 4. [PubMed]

Pearson WR, Sierk ML. (2005) "The limits of protein sequence comparison?" Curr Opin Struct Biol. 15:254-60. [PubMed]

Pearson WR. (2005) "Phylogenies of glutathione transferase families." Methods Enzymol. 401:186-204. [PubMed]

Sierk ML, Pearson WR. (2004) "Sensitivity and selectivity in protein structure comparison." Protein Sci. 13:773-85. [PubMed]

Xu P, Widmer G, Wang Y, Ozaki LS, Alves JM, Serrano MG, Puiu D, Manque P, Akiyoshi D, Mackey AJ, Pearson WR, Dear PH, Bankier AT, Peterson DL, Abrahamsen MS, Kapur V, Tzipori S, Buck GA. (2004) "The genome of Cryptosporidium hominis." Nature. 431(7012):1107-12. [PubMed]

Smoot ME, Guerlain SA, Pearson WR. (2004) "Visualization of near-optimal sequence alignments." Bioinformatics. 20:953-8. Epub 2004 Jan 29. [PubMed]

Wood TC, Pearson WR. (1999) "Evolution of protein sequences and structures." J Mol Biol. 291:977-95. [PubMed]