Title: Sequence Alignments and Database Searching
1Sequence Alignments and Database
Searching 08/20/07
2Why compare protein sequences?
Significant sequence similarities allow
associations based upon known functions.
3Homology vs. similarity
Possible for proteins to possess high sequence
identity between segments and not be homologous
In this example, cytochrome c4, has reasonably
high sequence similarity with trypsins, yet does
not have common ancestor, nor common fold.
Also, subtilisin has same spatial arrangement of
active site residues, but is not related to
trypsins
Extracted from ISMB2000 tutorial, WR Pearson, U.
of Virginia
4Homology vs. similarity
- Homologous proteins always share a common three-
- dimensional fold, often with common active or
binding site. - Proteins that share a common ancestor are
homologous. - Proteins that possess gt25 identity across entire
length generally will be homologous. - Proteins with lt20 identity are not necessarily
homologous
5Extracted from ISMB2000 tutorial, WR Pearson, U.
of Virginia
Orthologous cyctochrome c isozymes
Homologous sequences are either 1) orthologous,
or 2) paralogous
Orthologs - sequence differences arises from
divergence in different species (i.e. cyctochrome
c) Paralogs - sequence differences arise after
gene duplication within a given species (i.e.
GPCRs, hemoglobins)
- For orthologs - sequence divergence and
evolutionary relationships will agree. - For paralogs - no necessary linkage between
sequence divergence and speciation.
Hemoglobins contain both orthologs and paralogs
6Weve all seen and/or used sequence alignments,
but how are they accomplished?
Sequence searches and alignments using DNA/RNA
are usually not as informative as searches and
alignments using protein sequences.
However. DNA/RNA searches are intuitively easier
to understand
AGGCTTAGCAAA........TCAGGGCCTAATGCG
AGGCTTAGGAAACTTCCTAGTCAGGGCC
TAAAGCG
The above alignment could be scored giving a 1
for each identical nucleotide, A zero for a
mismatch, and a -4 for opening a gap and a -1
for each extension of the gap. So score 25
11 14
7Protein sequence alignments are much more
complicated. How would this alignment be scored?
ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH
AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH
Unlike nucleotide sequence alignments, which are
either identical or not identical at a given
position, protein sequence alignments
include shades of grey where one might
acknowledge that a T is sort of equivalent to an
S etc. But how equivalent? What number would
you assign to an S-T mismatch? And what about
gaps? Since alanine is a common amino acid,
couldnt the A-A match be by chance? Since Trp
and Cys are uncommon, should those matches be
given higher scores?
Do you see that accurately aligning sequences and
accurately finding related sequences are ? the
same problem?
8Databases
Nucleotide GenBank (NCBI), EMBL, DDBJ
Protein SwissProt, TrEMBL, GenPept(GenBank)
Huge databases share much information. Many
entries linked to other databases (e.g. PDB).
SwissProt small but well curated. NCBI
non-redundant (nr) protein sequence database is
very large but sometimes confusing.
These databases can be searched in a number of
ways. Can search only human or metazoan
sequences. Can eliminate entries made before a
given Date. Etc.
9What do all those numbers mean?
Type of Record Sample accession format
GenBank/EMBL/DDBJ Nucleotide SequenceRecords One letter followed by five digits, e.g.U12345Two letters followed by six digits, e.g.AY123456, AF123456
GenPept Sequence Records(which contain the amino acid translations from GenBank/EMBL/DDBJ records that have a coding region feature annotated on them) Three letters and five digits, e.g.AAA12345
Protein Sequence Records from SWISS-PROT and PIR All are six charactersCharacter/Format1 O,P,Q2 0-93 A-Z,0-94 A-Z,0-95 A-Z,0-96 0-9e.g.P12345 and Q9JJS7
NCBI
Protein Sequence Records from PRF A series of digits (often six or seven)followed by a letter, e.g.1901178A
RefSeq Nucleotide Sequence Records Two letters, an underscore bar, and six digits, e.g.mRNA records (NM_)NM_000492genomic DNA contigs (NT_)NT_000347 complete genome or chromosome (NC_)NT_000907 genomic region (NG_)NG000019
10Continued.
RefSeq Protein Sequence Records Two letters (NP), an underscore bar, and six digits, e.g.NP_000483
RefSeq Model (predicted) Sequence Records from the Human Genome annotation process Two letters (XM, XP, or XT), an underscore bar, and six digits, e.g.XM_000583
Protein Structure Records PDB accessions generally contain one digit followed by three letters, e.g.1TUPMMDB ID numbers generally contain four digits, e.g.3973The record for the Tumor Suppressor P53 Complexed With DNA can be retrieved by either number above
NCBI
GI numbersa series of digits that are assigned
consecutively by NCBI to each sequence it
processes. Version numbersconsist of the
accession number followed by a dot and a version
number.
Nucleotide sequence GI 6995995VERSION
NM_000492.2
Protein translation GI 6995996VERSION
NP_000483.2
gtgi897557gbAAA98443.1 TIAM1 protein
http//www.ornl.gov/sci/techresources/Human_Genome
/posters/chromosome/geneguide.shtml
11Weve got the data, now how do we
score/search? First, we need a way to assign
numbers to shades of grey matches.
Genetic code scoring system This assumes that
changes in protein sequence arise from mutations.
If only one point mutation is needed to change a
given AA to another (at a specific position in
alignment), the two amino-acids are more closely
related than if two point mutations were
required.
Physicochemical scoring system a Thr is like a
Ser, a Trp is not like an Ala
These systems are seldom used because they have
problems. Why try to second guess Nature? Since
there are many related sequences out there, we
can look at some (trusted) alignments to SEE
which sub- stitutions have occurred and the
frequency with which they occur.
12Observed substitution scoring schemes
PAM (percent acceptable mutation) matrices are
derived from studying global alignments of
well-characterized protein families. Use 1
residue change (short evolutionary distance) to
get PAM1 matrix. Raise this to 250 power to get
250 change (greater evolutionary distance).
Therefore a PAM 30 would be used to analyze more
closely related proteins, PAM 400 is used for
finding and analyzing distantly related proteins.
Block substitution matrices (BLOSUM) are derived
from studying local alignments (blocks) of
sequences from related proteins. In other
words, one might use the portions of aligned
sequences from related proteins that have gt62
identity (in the portions or blocks) to derive
the BLOSUM 62 scoring matrix. One might use only
the blocks that have gt80 identity to derive the
BLOSUM 80 matrix.
13Amino acid substitution matrices
- Negative scores - unlikely substitutions
Note that for identical matches, scores vary
depending upon observed frequencies. That is,
rare amino acid (i.e. Trp) that are not
substituted have high scores frequently occuring
amino acids (i.e. Ala) are down-weighted because
of the high probability of aligning by chance.
PAM250 matrix
Extracted from ISMB2000 tutorial, WR Pearson, U.
of Virginia
14Amino acid substitution matrices
In general, Blosum62 matrix is more accurate than
PAM. However, should be aware that search
performance will depend on underlying matrix
Q. Which are more divergent PAM120 or PAM250
Blosum45 or Blosum62?
PNAS 89, 10915 (1992).
15Gap penalties Intuitively one recognizes that
there should be a penalty for introducing
(requiring) a gap during identification/alignment
of a given sequence. But if two sequences are
related, the gaps may well be located In loop
regions which are more tolerant of mutational
events and probably have little impact on
structure. Therefore, a new gap should be
penalized, but extending an existing gap should
be penalized very little.
Filtering many proteins and nucleotides contain
simple repeats or regions of low sequence
complexity. These must be excluded from searches
and alignments. Why?
Significance of a hit during a search - More
important than an arbitrary score is an
estimation of the likelihood of finding a hit
through pure chance. Ergo the Expectation
value or E-value. E-values can be as low as
10-70.
16Statistics
- Similarity score distribution expected based
upon random chance using given searched
database. - distribution of normalized
similarity scores (observed) for a search using a
proton ATPase against the same database.
17E-value
So, for sufficiently large databases (so can
apply statistics)
E Kmne-?S
m- query length n - database length E -
expectation value K - scale factor for search
space (database) ? - scale factor for scoring
system S - score, dependent on substitution
matrix, gap-penalties, etc.
Doubling either sequence string doubles number of
sequences with a given expectation value
similarly, double the score and expectation value
decreases exponentially Expectation value -
probability that given score will occur by chance
given the query AND database strings
18Statistics
Must account for increases in similarity score
due to increase in sequence length searched.
Extracted from ISMB2000 tutorial, WR Pearson, U.
of Virginia
19Basic local alignment search tool (BLAST)
- Break query up into words e.g. ASTGHKDLLV
- AST
- WORDS
STG -
TGH - 2) Generate expanded list of words that would
match with (i.e. PAM250) - a score of at least T Youre acknowledging
that you may not have any - exact matches with original list of words.
- 3) Use expanded list of words to search database
for exact matches. - 4) Extend alignments from where word(s) found
exact match.
Heuristic algorithm Uses guesses. Increases
speed without a great loss of accuracy (BLASTP,
FASTA (local Hueristic), S-W local
rigorous, Needleman-Wunsch global, rigorous)
20Global versus local alignments
Global scores require alignment of entire
sequence length. Cannot be used to detect
relationships between domains in mosaic proteins.
Local alignments are necessary to detect domains
within mosaic proteins, internal duplications.
Extracted from ISMB2000 tutorial, WR Pearson, U.
of Virginia
21Pictorial representation of BLAST algorithm.
Query sequence
Words (they overlap)
Expand list of words
Search database, find exact hits, extend
alignments
Report sorted list of hits
22BLAST
ATCGCCATGCTTAATTGGGCTT CATGCTTAATT
exact word match
one hit
Nucleotide BLAST looks for exact matches Protein
BLAST requires two hits
GTQITVEDLFYNI SEI YYN
neighborhood words
NCBI
two hits
23FASTA
Instead of breaking up query into words (and then
generating a list of similar words), find all
sequences in the database that contain short
sequences that are exact or nearly exact matches
for sequences within the query. Score these and
sort. Sort of reverse methodology to BLAST
Database sequence
Query sequence
24(No Transcript)
25Protein database
26(No Transcript)
27mouse over
28sorted by e values
5 X 10-98
link to entrez
LocusLink
29Low complexity filter
30Identifying distant homologies (use several
different query sequences)
Also remember - If A is homologous to B, and B to
C, then A should be homologous to C
Examine output carefully. A lack of statistical
significance doesnt necessarily mean a lack of
homology!
Extracted from ISMB2000 tutorial, WR Pearson, U.
of Virginia
31PSI-BLAST
Very sensitive, but must not include a non-member
sequence!
- Regular BLAST search
- Sequences above a certain threshold (lt specified
E-value) are - included. Assumed to be related proteins.
This group of sequences - is used to define a profile that contains
the essence of the family. - Now with the important sequence positions
highlighted, can look - for more distantly related sequences that
should still have the essence - of the protein family.
- Inclusion of more distantly related sequences
modifies the profile - further (further defines the essence) and
allows for identification of - even more distantly related sequences. Etc.
Note PSI-BLAST may find and then subsequently
lose a homologous sequence during the iteration
process! Drifting of the program, would be the
gradual loss of close homologs during the
iteration process.
32Position specific scoring matrix (PSSM) (learning
from your hits)
NCBI
33Position specific scoring matrix (PSSM)
A R N D C Q E G H I L K M
F P S T W Y V 206 D 0 -2 0 2 -4 2 4
-4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1 207 G
-2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2
-1 0 -6 -5 208 V -1 1 -3 -3 -5 -1 -2 6 -1
-4 -5 1 -5 -6 -4 0 -2 -6 -4 -2 209 I -3 3
-3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1
-4 0 210 S -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6
-4 -6 -7 -5 1 -3 -7 -5 -6 211 S 4 -4 -4 -4
-4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3
212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0
-7 -4 -4 -5 0 -4 213 N -2 0 2 -1 -6 7 0
-2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3 214 G
-2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3
-5 -6 -6 -6 215 D -5 -5 -2 9 -7 -4 -1 -5 -5
-7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7 216 S -2 -4
-2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6
-5 -5 217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7
-5 -6 -7 -6 -4 -5 -6 -7 -7 218 G -3 -6 -4 -5
-6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7
9 -4 -4 -7 -7 -6 220 L -4 -6 -7 -7 -5 -5 -6
-7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0 221 N
-1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2
-1 -6 -1 6 222 C 0 -4 -5 -5 10 -2 -5 -5 1
-1 -1 -5 0 -1 -4 -1 0 -5 0 0 223 Q 0 1
4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3
-4 224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1
-2 -2 -3 0 -2 -2 -2 -3
Serine scored differently in these two positions
Active site nucleophile
NCBI
34PSI-BLAST initial run
NCBI
e value cutoff for PSSM
35PSI-BLAST initial run
NCBI
36PSI-BLAST first PSSM search
Other purine nucleotide metabolizing enzymes not
found by ordinary BLAST
NCBI
37PSI-BLAST importance of original query
(remember, if A is like B.)
iteration 1
iteration 2
PSI-Blast of human Tiam1
38PSI-BLAST importance of original query
iteration 1
iteration 2
PSI-Blast of mouse Tiam2 (90 identity with
human Tiam1)
Ras-binding domains
iteration 3
39Three-dimensional Position Specific Scoring
Matrix (3D-PSSM)
Extremely sensitive, but the structure of a
homolog must exist! Uses a Library of structures
that represent all the known folds and a
non-redundant sequence database.
Preparing the 3D-PSSM database
- 1D-PSSM generation. For every entry in the
Library of structures, - perform 20 iteration of PSI-BLAST against the NR
database. Use - E-value cutoff of 0.0005. Keep intermediate
results from 1st through - 20th iteration. Recombine these intermediates at
the end. Generate - a PSSM (1D-PSSM) from the results.
- (A 1D-PSSM for a protein of length L will have
dimensions L X 20 )
2) For each Library entry, assign 2ndry
structure (Helix, Strand, Coil)
40- Perform 3D structural superposition between each
entry in the Library - and all other members of its fold
superfamily. Use cutoff criteria. Use - the residue equivalencies from the
superpositions to augment the 1D- - PSSMs for Library members in a given
superfamily. (Key here is that - structural alignment reduces possibility of
miss-alignment of sequences).
- Use the structural info from the whole Library to
assign solvation - potentials for each residue type. e.g.
Alanines with only 5 - solvent exposure are seen 122 times. The total
number of residues - In the Library with 5 exposure is 3246. So the
solvation potential - would be 122/32460.038 for an alanine with 5
exposure. Do this - for Ala at 10, 15, 95, 100. Do for all 20
AAs.
Enter query sequence
5) Use Psi-BLAST to generate 1D-PSSM for query
(nr database)
6) Perform 2ndry structure prediction for query
417) Align the query sequence against each member
of the Library using a 3 pass
approach I) query is aligned against Library
member using the 1D-PSSM of the Library
entry 2) query is aligned to the 3D-PSSM of the
Library entry 3) Library entry is aligned to the
querys 1D-PSSM During these procedures the
2ndry structure matching and solvation
potentials are being used but are constant. The
highest scoring of the 3 passes is taken as the
final result.
So, how good is 3D-PSSM?
42Three papers report the initial characterization
of PLC-e (what, no PH domain???)
JBC, 276, 2758 (2001) EMBO J 20, 743 (2001) JBC
276, 2752 (2001)
43A fourth paper quickly follows. (PLC-es share
architecture of PLC-b isozymes. Howd they do
that???)
Wing M. et al., JBC 2002
443D-PSSM
PLCe sequence entered as query
453D-PSSM
PDB entry (for existing structure)
Expectation value
3-D model (with sidechains!)
Fold
Sequence alignment (between query and existing
structure)
46A very simple HMM for a protein with 4 amino acids
The square boxes are called match states
these will emit a amino acid with a set
probability for each AA. Diamond boxes are for
insertions between match states, and the circles
are for deletions.
Not only are there emission probabilities for the
set and insert states, there are probabilities
for the transitions between states. There
are many possible paths through the Model!
Random transitions through the Model and
emissions from the states are guided by
probabilities. All you see at the end is the
generated sequence. The model that generated the
sequence is hidden. But the resulting sequence
is related to those sequences used to construct
the model. ?IT IS POSSIBLE TO CALCULATE THE
PROBABILITY THAT A GIVEN SEQUENCE WAS GENERATED
BY THE MODEL!
47Multiple sequence alignments (MSAs)
In this example, an MSA is used to identify
regions of high sequence conservation presumably
reflecting structural and functional constraints.
Useful for delimiting known domains and
potential new functional regions (e.g. the
Ras-binding domain in yellow and the blue box of
currently unknown function).
48Fun with MSA...
MSA used to locate functional residues and domain
boundries in homologs of Dbl-proteins with known
structure (Dbs and Tiam1).
Red amino acids directly interact with GTPases.
Blue residues directly interact with
phosphoinositides.
49What you should know
The general approaches to finding related
sequences i.e. the methodology the
terminology, how they differ. Some of the
definitions (e.g. what factors affect the
E-value?, whats paralogous?)