Title: Protein Evolution and Sequence Analysis
1. Protein Evolution and Sequence Analysis
2. Central Premise
Significant sequence similarity allows one to
assign function to an unknown protein(s) based on
properties of known proteins and is a direct
consequence of evolutionary relationships.
Speciation- Divergence of one lineage into
separate species, after which the copies of an
ancestral gene/protein in each species evolve
independently of one another.
Homolog- A gene/protein related to a second
gene/protein by descent from a common ancestral
gene, whether by speciation or by gene
duplication.
Ortholog- Genes/proteins in different species
that evolved from a common ancestral gene by
speciation and that retain the same function.
Paralog- Genes/proteins related by duplication of
a common ancestral gene within a genome. The
copies often evolve new functions, even if these
remain related to that of the ancestor.
Convergent evolution- Evolution of similar
features or properties in genes/proteins of
different genetic lineages.
3. Divergent and Convergent Evolution Among the
Serine Proteases
[Figure: structures of chymotrypsin (1ACB),
trypsin (3NKK), and subtilisin (1SBT), with an
overlay panel.]
4. Mechanisms Involved in Molecular Evolution of
Genes/Proteins
Mutation- Stochastic single point changes in the
genetic material due to errors in DNA replication
during cell division, radiation exposure,
chemical or environmental stressors, or viruses
and transposable elements. Occurs at a slow but
roughly constant rate (molecular clock) of 10^-9
to 10^-8 mutations per base per generation. In
eukaryotes, splicing errors that retain introns
are a further source of change.
Recombination- Exchange of genes or portions of
genes between different chromosomes to create new
combinations of elements.
Gene duplication- Duplication of a gene or
portions of a gene, one of which continues the
original function and the other is free to evolve
and acquire new functions.
Retrotransposition- Incorporation of mRNA
sequences back into DNA, frequently inserting
into new locations with different expression
patterns.
The mechanisms by which new genes/proteins arise
make it possible to use sequence analysis to
infer functional and structural relationships
among different sequences.
5. Sequence alignments are methods of arranging
DNA, RNA, or protein sequences to identify
regions of similarity or identity with the goal
of inferring structure, function, or both.
Sequence searches and alignments using DNA/RNA
are usually not as informative as searches and
alignments using protein sequences. However,
DNA/RNA searches are intuitively easier to
understand:
AGGCTTAGCAAA........TCAGGGCCTAATGCG
AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG
The above pairwise alignment could be scored by
giving +1 for each identical nucleotide, 0 for a
mismatch, -4 for opening a gap, and -1 for each
extension of the gap. The alignment has 25
identities and one 8-position gap (-4 for the
opening plus 7 extensions = -11), so
score = 25 - 11 = 14
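The scoring rule above can be checked with a short
Python sketch. The function name and its
parameters are illustrative choices, not part of
the slides; it assumes the opening penalty covers
the first gap position and every further position
of the same gap counts as an extension.

```python
def score_alignment(top, bottom, match=1, mismatch=0,
                    gap_open=-4, gap_extend=-1, gap_chars=".-"):
    """Score two equal-length aligned strings column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a in gap_chars or b in gap_chars:
            # First column of a gap pays the opening penalty,
            # each following gap column pays the extension penalty.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


top = "AGGCTTAGCAAA........TCAGGGCCTAATGCG"
bottom = "AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG"
print(score_alignment(top, bottom))  # 25 identities - 11 gap cost = 14
```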
6. Protein sequence alignments are much more
complicated but are more informative because
each position can take any of 20 states (total
possible amino acids) rather than 4 (total
possible bases).
ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH
AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH
Unlike nucleotide sequence alignments, which are
either identical or not identical at a given
position, protein sequence alignments include
shades of grey where one might acknowledge that
a T is sort of equivalent to an S. But how
equivalent? What number would you assign to an
S-T mismatch? And what about gaps? Since alanine
is a common amino acid, couldn't the A-A match be
by chance? Since Trp and Cys are uncommon, should
those matches be given higher scores?
Therefore, accurately aligning sequences and
accurately finding related sequences are
approximately the same problem: both depend on
how matches, mismatches, and gaps are scored.
7. Multiple Sequence Alignments
Sequence comparisons fall into two categories:
local alignments, in which regions of large
sequences are compared to identify regions of
similarity, such as domains; and global
alignments, in which sequences of similar length
are compared to analyze overall similarity.
Various methods are available depending on the
assumptions of the algorithm and the types of
sequences to be analyzed. All require a scoring
matrix and gap penalties for dealing with
similarities, insertions, and deletions.
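The difference between the two categories can be
sketched with a small dynamic-programming example.
This is a minimal illustration assuming a linear
gap penalty and made-up match/mismatch values; the
global mode follows the Needleman-Wunsch scheme and
the local mode the Smith-Waterman scheme. The
function name is hypothetical.

```python
def align_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b with a linear gap penalty.

    local=False scores the sequences end to end (global);
    local=True scores the best-matching pair of subsequences (local).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                       H[i - 1][j] + gap,     # gap in b
                       H[i][j - 1] + gap)     # gap in a
            if local:
                cell = max(cell, 0)           # a local alignment may restart
                best = max(best, cell)
            H[i][j] = cell
    return best if local else H[-1][-1]


# A short "domain" embedded in a longer sequence.
print(align_score("ACGT", "GGACGTGG", local=True))  # 4
print(align_score("ACGT", "GGACGTGG"))              # -4
```

The local score rewards the embedded match alone,
while the global score is dragged down by the four
unavoidable gap positions.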
Clustal is a commonly used global alignment
algorithm for performing multiple sequence
alignments. The algorithm is executed in three
stages: (1) pairwise sequence comparisons are
performed across all sequences; (2) the pairwise
information is used to create a guide tree; (3)
the guide tree is used to build the final
alignment, starting from the most similar
sequences.
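The first two stages can be sketched as follows.
This is not Clustal's own procedure: as simplifying
assumptions, the pairwise distance is the fraction
of differing positions between equal-length
sequences, and the tree is built by plain
average-linkage clustering. The final progressive
alignment stage is not implemented, and all names
and sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Return a nested-tuple guide tree for a {name: sequence} mapping."""
    # Stage 1: all-against-all pairwise distances.
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    # Each cluster is (subtree, names of the sequences it contains).
    clusters = [(name, (name,)) for name in seqs]

    def cluster_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    # Stage 2: repeatedly join the two closest clusters.
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


seqs = {"A": "ACGTACGT", "B": "ACGTACGA",
        "C": "TCGTTCGA", "D": "TGGTTCCA"}
print(guide_tree(seqs))  # (('A', 'B'), ('C', 'D'))
```

The innermost pairs of the tree are the most
similar sequences, so a progressive aligner would
align those first and merge the results outward.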
8. PAM (Point Accepted Mutation) matrices
- Are derived from studying global alignments of
well-characterized protein families.
- PAM1: only 1% of residues have changed (i.e.,
short evolutionary distance).
- Raise this to the 250th power to get 250% change
between two sequences (greater evolutionary
distance), or about 20% sequence identity.
- Therefore,
  - a PAM 30 would be used to analyze more
closely related proteins,
  - a PAM 400 is used for finding and analyzing
distantly related proteins.
- PAMx = (PAM1)^x
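The relation PAMx = (PAM1)^x is ordinary matrix
exponentiation of a mutation probability matrix.
The sketch below uses a made-up three-letter
alphabet in which each letter is conserved with
probability 0.99 per PAM unit. The real PAM1 matrix
is 20 x 20 with residue-specific values, so the
numbers here are purely illustrative.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, M)
    return result


# Hypothetical PAM1-like matrix: 1% chance of change per position.
pam1 = [[0.990, 0.005, 0.005],
        [0.005, 0.990, 0.005],
        [0.005, 0.005, 0.990]]

pam250 = mat_pow(pam1, 250)
for row in pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1, but the diagonal
(probability that a position shows the same letter)
has dropped well below 0.99 because repeated and
reverse changes accumulate over the longer
distance.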
9. Block substitution matrices (BLOSUM)
- Are derived from studying local alignments
(blocks) of sequences from related proteins that
share no more than X% identity.
- In other words, one might use the portions of
aligned sequences from related proteins that have
no more than 62% identity (in the portions or
blocks) to derive the BLOSUM 62 scoring matrix.
- One might use only the blocks that have <80%
identity to derive the BLOSUM 80 matrix.
- BLOSUM and PAM substitution matrices are
numbered in opposite directions:
  - The higher the number of the BLOSUM matrix
(BLOSUM X), the more closely related proteins you
are looking for.
  - The higher the number of the PAM matrix (PAM X),
the more distantly related proteins you are
looking for.
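Substitution scores of the BLOSUM type are
log-odds values: the observed frequency of a
residue pair in the blocks is compared with the
frequency expected by chance, and the ratio is
expressed in half-bit units. The sketch below
assumes a toy set of pair counts for three
residues (S, T, W); the counts are invented for
illustration and do not come from real blocks.

```python
import math
from collections import Counter


def log_odds_scores(pair_counts):
    """Half-bit log-odds scores from counts of unordered residue pairs."""
    total = sum(pair_counts.values())
    # Observed frequency q of each unordered pair.
    q = Counter()
    for pair, n in pair_counts.items():
        q[tuple(sorted(pair))] += n / total
    # Background frequency p of each residue implied by the pairs.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        # Expected chance frequency of the unordered pair.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Hypothetical pair counts taken from aligned blocks.
counts = {("S", "S"): 40, ("T", "T"): 30, ("S", "T"): 20,
          ("W", "W"): 8, ("S", "W"): 1, ("T", "W"): 1}
scores = log_odds_scores(counts)
for pair, value in sorted(scores.items()):
    print(pair, value)
```

With these counts a W-W match scores higher than an
S-S match because W is rare, so a chance W-W pairing
is less likely. This is one way the questions about
common and uncommon residues get a numerical
answer.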
10. Gap penalties: Intuitively one recognizes
that there should be a penalty for introducing
(requiring) a gap during identification/alignment
of a given sequence. But if two sequences are
related, the gaps may well be located in loop
regions, which are more tolerant of mutational
events and probably have little impact on
structure. Therefore, a new gap should be
penalized, but extending an existing gap should
be penalized very little.
Filtering: Many protein and nucleotide sequences
contain simple repeats or regions of low sequence
complexity. These should be excluded (masked)
from searches and alignments because they
produce spurious matches.
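One simple way to picture such a filter is a
sliding window that masks stretches with low
Shannon entropy. The sketch below is only an
illustration of the idea: the window size,
threshold, and function names are arbitrary
assumptions, and the filters used by real search
tools (for example SEG for proteins and DUST for
nucleotides) work differently in detail.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of a sequence segment."""
    n = len(segment)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(segment).values())


def mask_low_complexity(seq, window=12, threshold=1.5, mask_char="X"):
    """Replace every window whose entropy is below threshold with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


# Hypothetical protein with a poly-Q repeat in the middle.
protein = "MKTAYIAKQR" + "Q" * 15 + "DLSWCHEVNG"
print(mask_low_complexity(protein))
```

Windows that only partly overlap the repeat can
also fall below the threshold, so a few flanking
residues are masked along with the repeat itself.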
Significance of a hit during a search: More
important than an arbitrary score is an estimate
of how many hits with at least that score would
be expected through pure chance (the lower the
value, the more certain the match). Hence the
Expectation value, or E-value. E-values can be as
low as 10^-70.
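For a normalized (bit) score S', the expected
number of chance hits is commonly estimated as
E = m * n * 2^(-S'), where m is the query length
and n the total length of the database. The sketch
below applies that relation; the lengths and
scores are hypothetical, and real search programs
additionally apply effective-length corrections.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: 300-residue query against 10^8 database residues.
for bits in (30, 50, 100, 300):
    print(bits, f"{expect_value(bits, 300, 1e8):.2e}")
```

Every extra bit of score halves the E-value, which
is why strong hits reach such extremely small
values.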
11. Useful Bioinformatics Sites
National Center for Biotechnology Information
(NCBI)- National Institutes of Health sponsored
sites with a rich array of resources and
databases. http://www.ncbi.nlm.nih.gov/pubmed
ExPASy (Swiss Institute of Bioinformatics)-
Large number of different tools for sequence and
function analysis. http://www.expasy.org/tools/
RCSB Protein Data Bank- Largest curated database
of protein structures.
http://www.rcsb.org/pdb/home/home.do
BioGRID- Large database of curated protein
interaction datasets. http://thebiogrid.org/
Osprey- Software and interactome analysis tools
for visualizing interaction datasets.
http://en.bio-soft.net/protein/Osprey.html
Tree of Life website- Database of information on
phylogenetic relationships among organisms with
useful link outs. http://tolweb.org/tree/