Title: Sequence Alignment
1Sequence Alignment
- K-tuple methodsStatistics of alignments
- Phylogenetics
2Database searches
- What is the problem?
- Large number of sequences to search your query
sequence against. - Various indexing schemes and heuristics are used,
one of which is BLAST. - heuristic is a technique to solve a problem that
ignores whether the solution can be proven to be
correct, but usually produces a good solution,
are intended to gain computational performance or
conceptual simplicity potentially at the cost of
accuracy or precision.
http//en.wikipedia.org/wiki/HeuristicsComputer_s
cience
3Concepts of Sequence Similarity Searching
- The premise
- The sequence itself is not informative it must
be analyzed by comparative methods against
existing databases to develop hypothesis
concerning relatives and function.
4Important Terms for Sequence Similarity Searching
with very different meanings
- Similarity
- The extent to which nucleotide or protein
sequences are related. In BLAST similarity refers
to a positive matrix score. - Identity
- The extent to which two (nucleotide or amino
acid) sequences are invariant. - Homology
- Similarity attributed to descent from a common
ancestor.
5Sequence Similarity Searching The Approach
- Sequence similarity searching involves the use of
a set of algorithms (such as the BLAST programs)
to compare a query sequence to all the sequences
in a specified database. - Comparisons are made in a pairwise fashion. Each
comparison is given a score reflecting the degree
of similarity between the query and the sequence
being compared.
6Blast
QUERY sequence(s)
BLAST results
BLAST program
BLAST database
7Topics
BLAST program
- There are different blast programs
- Understanding the BLAST algorithm
- Word size
- HSPs (High Scoring Pairs)
- Understanding BLAST statistics
- The alignment score (S)
- Scoring Matrices
- Dealing with gaps in an alignment
- The expectation value (E)
8The BLAST algorithm
- The BLAST programs (Basic Local Alignment Search
Tools) are a set of sequence comparison
algorithms introduced in 1990 for optimal local
alignments to a query. - Altschul SF, Gish W, Miller W, Myers EW, Lipman
DJ (1990) Basic local alignment search tool. J.
Mol. Biol. 215403-410. - Altschul SF, Madden TL, Schaeffer AA, Zhang J,
Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST
and PSI-BLAST a new generation of protein
database search programs. NAR 253389-3402.
9http//www.ncbi.nlm.nih.gov/BLAST
blastn
10Other BLAST programs
- BLAST 2 Sequences (bl2seq)
- Aligns two sequences of your choice
- Gives dot-plot like output
11More BLAST programs
- BLAST against genomes
- Many available
- BLAST parameters pre-optimized
- Handy for mapping query to genome
- Search for short exact matches
- BLAST parameters pre-optimized
- Great for checking probes and primers
12How Does BLAST Work?
- The BLAST programs improved the overall speed of
searches while retaining good sensitivity
(important as databases continue to grow) by
breaking the query and database sequences into
fragments ("words"), and initially seeking
matches between fragments. - Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of T".
13Picture used with permission from Chapter 11 of
Bioinformatics A Practical Guide to the
Analysis of Genes and Proteins
14Each BLAST hit generates an alignment that can
contain one or more high scoring pairs (HSPs)
15Each BLAST hit generates an alignment that can
contain one or more high scoring pairs (HSPs)
16Where does the score (S) come from?
- The quality of each pair-wise alignment is
represented as a score and the scores are ranked.
- Scoring matrices are used to calculate the score
of the alignment base by base (DNA) or amino acid
by amino acid (protein). - The alignment score will be the sum of the scores
for each position.
17Whats a scoring matrix?
- Substitution matrices are used for amino acid
alignments. These are matrices in which each
possible residue substitution is given a score
reflecting the probability that it is related to
the corresponding residue in the query.
18PAM vs. BLOSUM scoring matrices
- BLOSUM 62 is the default matrix in BLAST 2.0.
Though it is tailored for comparisons of
moderately distant proteins, it performs well in
detecting closer relationships. A search for
distant relatives may be more sensitive with a
different matrix.
19PAM vs BLOSUM scoring matrices
- The PAM Family
- PAM matrices are based on global alignments of
closely related proteins. - The PAM1 is the matrix calculated from
comparisons of sequences with no more than 1
divergence. - Other PAM matrices are extrapolated from PAM1.
- The BLOSUM family
- BLOSUM matrices are based on local alignments.
- BLOSUM 62 is a matrix calculated from comparisons
of sequences with no less than 62 divergence. - All BLOSUM matrices are based on observed
alignments they are not extrapolated from
comparisons of closely related proteins.
20What happens if you have a gap in the alignment?
- A gap is a position in the alignment at which a
letter is paired with a null - Gap scores are negative. Since a single
mutational event may cause the insertion or
deletion of more than one residue, the presence
of a gap is frequently ascribed more significance
than the length of the gap. - Hence the gap is penalized heavily, whereas a
lesser penalty is assigned to each subsequent
residue in the gap.
21Percent Sequence Identity
- The extent to which two nucleotide or amino acid
sequences are invariant
A C C T G A G A G A C G T G G C
A G
mismatch
indel
70 identical
22BLAST algorithm
- Keyword search of all words of length w in the
query of default length n in database of length m
with score above threshold - w 11 for nucleotide queries, 3 for proteins
- Do local alignment extension for each hit of
keyword search - Extend result until longest match above threshold
is achieved and output
23BLAST algorithm (contd)
keyword
Query KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVL
KIFLENVIRD
GVK 18 GAK 16 GIK 16 GGK 14 GLK 13 GNK 12 GRK
11 GEK 11 GDK 11
Neighborhood words
neighborhood score threshold (T 13)
extension
Query 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK
60 DN G IR L GK I L E
RGK Sbjct 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EK
HRGIIK 263
High-scoring Pair (HSP)
24Original BLAST
- Dictionary
- All words of length w
- Alignment
- Ungapped extensions until score falls below
statistical threshold T - Output
- All local alignments with score gt statistical
threshold
25Original BLAST Example
A C G A A G T A A G G T C
C A G T
- w 4, T 4
- Exact keyword match of GGTC
- Extend diagonals with mismatches until score
falls below a threshold - Output result
- GTAAGGTCC
- GTTAGGTCC
C T G A T C C T G G A T T
G C G A
From lectures by Serafim Batzoglou (Stanford)
26Gapped BLAST Example
A C G A A G T A A G G T C
C A G T
- Original BLAST exact keyword search, THEN
- Extend with gaps in a zone around ends of exact
match - Output result
- GTAAGGTCCAGT
- GTTAGGTC-AGT
C T G A T C C T G G A T T
G C G A
From lectures by Serafim Batzoglou (Stanford)
27Gapped BLAST Example (contd)
A C G A A G T A A G G T C
C A G T
- Original BLAST exact keyword search, THEN
- Extend with gaps around ends of exact match until
score ltT, then merge nearby alignments - Output result
- GTAAGGTCCAGT
- GTTAGGTC-AGT
C T G A T C C T G G A T T
G C G A
From lectures by Serafim Batzoglou (Stanford)
28Topics
BLAST databases
- The different blast databases provided by the
NCBI - Protein databases
- Nucleotide databases
- Genomic databases
- Considerations for choosing a BLAST database
- Custom databases for BLAST
29BLAST protein databases available at through
blastp web interface _at_ NCBI
30Considerations for choosing a BLAST database
- First consider your research question
- Are you looking for an ortholog in a particular
species? - BLAST against the genome of that species.
- Are you looking for additional members of a
protein family across all species? - BLAST against nr, if you cant find hits check
wgs, htgs, and the trace archives. - Are you looking to annotate genes in your species
of interest? - BLAST against known genes (RefSeq) and/or ESTs
from a closely related species.
31When choosing a database for BLAST
- It is important to know your reagents.
- Changing your choice of database is changing your
search space completely - Database size affects the BLAST statistics
- record BLAST parameters, database choice,
database size in your bioinformatics lab book,
just as you would for your wet-bench experiments. - Databases change rapidly and are updated
frequently - It may be necessary to repeat your analyses
32Topics
BLAST results
- Choosing the right BLAST program
- Running a blastp search
- BLAST parameters and options to consider
- Viewing BLAST results
- Look at your alignments
- Using the BLAST taxonomy report
33BLAST parameters and options to consider
conserved domains
Entrez query
E-value cutoff
Word size
34More BLAST parameters and options to consider
filtering
gap penalities
matrix
35Run your BLAST search
BLAST
36The BLAST Queue
click for more info
Note your RID
37Formatting and Retrieving your BLAST results
Results
options
38A graphical view of your BLAST results
39The BLAST hit list
Score
E-Value
GenBank
alignment
EntrezGene
40The BLAST pairwise alignments
Identity
Similarity
41Sample BLAST output
- Blast of human beta globin protein against zebra
fish
- Score E
- Sequences producing significant alignments
(bits) Value - gi18858329refNP_571095.1 ba1 globin Danio
rerio gtgi147757... 171 3e-44 - gi18858331refNP_571096.1 ba2 globin
SIdZ118J2.3 Danio rer... 170 7e-44 - gi37606100embCAE48992.1 SIbY187G17.6 (novel
beta globin) D... 170 7e-44 - gi31419195gbAAH53176.1 Ba1 protein Danio
rerio 168 3e-43 - ALIGNMENTS
- gtgi18858329refNP_571095.1 ba1 globin Danio
rerio - Length 148
- Score 171 bits (434), Expect 3e-44
- Identities 76/148 (51), Positives 106/148
(71), Gaps 1/148 (0) - Query 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWT
QRFFESFGDLSTPDAVMGNPK 60 - MV T EA LWGKNDEG AL R
LVYPWTQRF FGLSP AMGNPK - Sbjct 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWT
QRYFATFGNLSSPAAIMGNPK 60
42Sample BLAST output (contd)
- Blast of human beta globin DNA against human DNA
- Score E
- Sequences producing significant alignments
(bits) Value - gi19849266gbAF487523.1 Homo sapiens gamma A
hemoglobin (HBG1... 289 1e-75 - gi183868gbM11427.1HUMHBG3E Human gamma-globin
mRNA, 3' end 289 1e-75 - gi44887617gbAY534688.1 Homo sapiens A-gamma
globin (HBG1) ge... 280 1e-72 - gi31726embV00512.1HSGGL1 Human messenger RNA
for gamma-globin 260 1e-66 - gi38683401refNR_001589.1 Homo sapiens
hemoglobin, beta pseud... 151 7e-34 - gi18462073gbAF339400.1 Homo sapiens haplotype
PB26 beta-glob... 149 3e-33 - ALIGNMENTS
- gtgi28380636refNG_000007.3 Homo sapiens beta
globin region (HBB_at_) on chromosome 11 - Length 81706
- Score 149 bits (75), Expect 3e-33
- Identities 183/219 (83)
- Strand Plus / Plus
-
- Query 267 ttgggagatgccacaaagcacctggatgatctcaagg
gcacctttgcccagctgagtgaa 326 -
43What do the Score and the e-value really mean?
- The quality of the alignment is represented by
the Score. - Score (S)
- The score of an alignment is calculated as the
sum of substitution and gap scores. Substitution
scores are given by a look-up table (PAM, BLOSUM)
whereas gap scores are assigned empirically . - The significance of each alignment is computed as
an E value. - E value (E)
- Expectation value. The number of different
alignments with scores equivalent to or better
than S that are expected to occur in a database
search by chance. The lower the E value, the more
significant the score.
44E value
- E value (E)
- Expectation value. The number of different
alignments with scores equivalent to or better
than S expected to occur in a database search by
chance. The lower the E value, the more
significant the score.
45Assessing sequence homology
- Need to know how strong an alignment can be
expected from chance alone - Chance is the comparison of
- Real but non-homologous sequences
- Real sequences that are shuffled to preserve
compositional properties - Sequences that are generated randomly based upon
a DNA or protein sequence model (favored)
46High Scoring Pairs (HSPs)
- All segment pairs whose scores can not be
improved by extension or trimming - Need to model a random sequence to analyze how
high the score is in relation to chance
47Expected number of HSPs
- Expected number of HSPs with score gt S
- E-value E for the score S
- E Kmne-lS
- Given
- Two sequences, length n and m
- The statistics of HSP scores are characterized by
two parameters K and ? - K scale for the search space size
- ? scale for the scoring system
48BLAST statistics to record in your bioinformatics
labbook
Record the statistics that are found at bottom of
your BLAST results page
49Scoring matrices
- Amino acid substitution matrices
- PAM
- BLOSUM
50Bit Scores
- Normalized score to be able to compare sequences
- Bit score
- S lS ln(K) ln(2)
- E-value of bit score
- E mn2-S
51Assessing the significance of an alignment
- How to assess the significance of an alignment
between the comparison of a protein of length m
to a database containing many different proteins,
of varying lengths? - Calculate a "database search" E-value. Multiply
the pairwise-comparison E-value by the number of
sequences in the database N divided by the length
of the sequence in the database n -
52Homology Some Guidelines
- Similarity can be indicative of homology
- Generally, if two sequences are significantly
similar over entire length they are likely
homologous - Low complexity regions can be highly similar
without being homologous - Homologous sequences not always highly similar
53Homology Some Guidelines
- Suggested BLAST Cutoffs
- (source Chapter 11 Bioinformatics A Practical
Guide to the Analysis of Genes and Proteins) - For nucleotide based searches, one should look
for hits with E-values of 10-6 or less and
sequence identity of 70 or more - For protein based searches, one should look for
hits with E-values of 10-3 or less and sequence
identity of 25 or more
54Contributors
http//creativecommons.org/licenses/by-sa/2.0/
55Odds score in sequence alignment
- The chance of an aligned amino acid pair being
found in alignments of related sequences compared
to the chance of that pair being found in random
alignments of unrelated sequences.
56Statistical significance of an alignment
- The probability that random or unrelated
sequences could be aligned to produce the same
score. - Smaller the probability is the better.
57Alignment Statistics
- For two sequences of length n and m, n times m
comparisons are being made thus the longest
length of the predicted match would be log1/p(mn).
58Alignment Statistics
- Expectation value or the mean longest match would
be - E(M) log1/p(Kmn), where K is a constant that
depends on amino acid or base composition and p
is the probability of a match. - This is only true for ungapped local alignments.
59Distribution of alignment scores
- resembles Gumbel extreme value distribution.
60Extreme Value Distribution
61Alignment Statistics
- E(M)log1/p(Kmn) means that match length gets
bigger as the log of the product of sequence
lengths. Amino acid substitution matrices will
turn match lengths into alignment scores (S). - More commonly ? ln(1/p) is used.
- Number of longest run HSP will be estimated
- E Kmne-?S
- How good a sequence score is evaluated based on
how many HSPs (i.e. E value) one would expect for
that score.
62Alignment Statistics
- Two ways to get K and ?
- For 10000 random amino acid sequences with
various gap penalties, K and lambda parameters
have been tabulated. - Calculation of the distribution for two sequences
being aligned by keeping one of them fixed and
scrambling the other, thus preserving both the
sequence length and amino acid composition.
63Alignment Statistics
64Alignment Statistics
65Alignment Statistics
66Alignment Statistics
67Gene Structure
68Mutation Rates
69Functional Constraint
70Synonymous vs nonsynonymous substitutions
71Synonymous vs nonsynonymous substitutions
72Synonymous vs nonsynonymous substitutions
73Mutation vs substitution
74Estimating substitutions
75Jukes-Cantor model
76Transitions vs transversions
77Kimuras 2-parameter model
78Kimuras 2-parameter model
79Kimuras 2-parameter model
80Functional Constraints
81Molecular Clocks
82Relative Rate
83Distance based phylogenetics
84Distance based phylogenetics
85Distance based phylogenetics
86Distance based phylogenetics
87Distance based phylogenetics
88Distance based phylogenetics
89Phylogenetics Programs