Title: The Genome Access Course Pairwise Sequence Comparisons
1The Genome Access Course: Pairwise Sequence Comparisons
The Artist in His Museum, Charles Willson Peale
2- Identity
- Similarity
- Homology
- Paralogy
- Orthology
3Types of Homology
4Methods of Alignment
- Dot Plot Analysis
- Dynamic Programming
- k-tuple Methods
5Dot Plot Analysis
- Plot one sequence against another, or against itself
- Can specify word size
- Good for finding repeats and inverts
- Identity runs along main diagonal
- Common subsequences (possible translocations) are shorter off-diagonal lines
- Inversions are perpendicular to main diagonal
- Deletions are interruptions in the lines
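The idea of a dot plot can be sketched in a few lines of Python. This is a minimal illustration, not how Dotter or the EMBOSS programs are implemented; the function names `dot_plot` and `render` are made up for this example. A dot is placed wherever a word of the chosen size occurs in both sequences.

```python
def dot_plot(seq_a, seq_b, word_size=1):
    """Return the set of (i, j) cells where words of length word_size match."""
    dots = set()
    for i in range(len(seq_a) - word_size + 1):
        for j in range(len(seq_b) - word_size + 1):
            if seq_a[i:i + word_size] == seq_b[j:j + word_size]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, ch in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    # Self-comparison: identity runs along the main diagonal.
    print(render("STOPS", "STOPS", dot_plot("STOPS", "STOPS")))
    print()
    # Comparison with the reversed word: the line runs perpendicular to it.
    print(render("STOPS", "SPOTS", dot_plot("STOPS", "SPOTS")))
```

Raising `word_size` removes isolated single-letter matches, which is why a larger word size gives a cleaner plot on long sequences.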
6Dot Plot Analysis
- EMBOSS has several programs (available via PISE)
- Dotter
- Dotlet (web-based)
7(Figure: dot plot of the word STOPS against itself)
8(Figure: dot plot of STOPS against its reverse, SPOTS)
9Dotplot of the Complete Works of Shakespeare
10Self-Alignment of Human LDL Receptor
11Dynamic Programming
- Provides an optimal alignment
- Results depend on scoring system and gap penalties
- Takes gaps into account, but limits the number of comparisons
12Global vs. Local
SPQ-RTGKCCWIAGPGILHRMSL
SGALRCSWND-IAGPCAQH-MSA
- Global (Needleman-Wunsch): starts at one end and adds gaps until the other end is reached, so the sequences are aligned over their full length
- Local (Smith-Waterman): finds the region(s) of highest similarity and builds outward
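The difference between the two modes can be sketched with one small dynamic-programming routine. This is a simplified illustration: it uses a linear gap cost and a plain match/mismatch score (both assumptions made for brevity, not the scoring used by any particular program), and the function name `align_score` is hypothetical. It returns only the best score, not the alignment itself.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming with a linear gap cost.

    local=False: global (Needleman-Wunsch style), end to end.
    local=True:  local (Smith-Waterman style), best-scoring region only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


if __name__ == "__main__":
    a, b = "GGGACGTCCC", "TTTACGTAAA"
    print("global:", align_score(a, b))
    print("local: ", align_score(a, b, local=True))
```

For the two example strings, only the shared ACGT core is similar, so the local score rewards that region while the global score is pulled down by the mismatched flanks.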
13Smith-Waterman (Local)
- Dynamic Programming
- Much slower than either BLAST or FASTA
- More sensitive
14Scoring Matrices
- Mutation Data (PAM, BLOSUM)
- Identity
- Genetic Code
- Physical Properties
15(No Transcript)
16PAM(Point/Percent Accepted Mutation)
- One PAM unit is an average change of 1% across all amino acid positions
- PAM1 was generated from 71 protein sequence groups of at least 85% similarity, to avoid errors from multiple mutations at the same site, as well as insertions and deletions
- Mutation probabilities are converted to a (log odds) scoring matrix
17PAM Matrices
- PAMn supplies scoring for sequences that have diverged n PAM units
- Raising PAM1 to the nth power generates the other matrices, based on a Markov model
- PAM250 represents 250% change in 2,500 million years
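The Markov-model step, multiplying the PAM1 mutation probability matrix by itself n times, can be sketched as follows. The three-letter alphabet and the numbers in `TOY_PAM1` are invented for illustration and are not Dayhoff's data; `mat_mul` and `mat_pow` are hypothetical helper names.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Multiply m by itself n times (n = 0 gives the identity matrix)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy mutation probabilities for a made-up alphabet of three letters.
# Each row sums to 1 and each letter has a 1% chance of changing.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

if __name__ == "__main__":
    toy_pam250 = mat_pow(TOY_PAM1, 250)
    for row in toy_pam250:
        print(["%.3f" % p for p in row])
```

After 250 steps the diagonal entries are far below 0.99, yet each row still sums to 1: many positions have changed, some more than once, which is why 250 PAM units does not mean that every position differs.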
18BLOSUM(BLOck SUbstitution Matrix)
- BLOCKS is a database of 3,000 blocks: short, continuous multiple alignments
- Generates substitution frequencies, which can be converted into log odds scores
- BLOSUM62
- Blocks have 62% similarity
- Close to PAM250
- More Reliable
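Converting substitution frequencies into log odds scores can be sketched as below. The pair frequencies are invented for a two-letter toy alphabet, the function name `log_odds_scores` is hypothetical, and the scaling to half-bit units (factor 2, rounded to integers) is an assumption chosen to mirror how BLOSUM-style matrices are usually presented.

```python
import math


def log_odds_scores(pair_freq, scale=2.0):
    """Turn observed pair frequencies into rounded log odds scores.

    pair_freq maps an unordered pair (a, b), written with a <= b, to the
    fraction of aligned pairs of that type; the fractions sum to 1.
    """
    # Background frequency of each letter implied by the pair frequencies.
    background = {}
    for (a, b), q in pair_freq.items():
        if a == b:
            background[a] = background.get(a, 0.0) + q
        else:
            background[a] = background.get(a, 0.0) + q / 2
            background[b] = background.get(b, 0.0) + q / 2
    scores = {}
    for (a, b), q in pair_freq.items():
        # Frequency expected if the two letters were paired purely by chance.
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(scale * math.log2(q / expected))
    return scores


if __name__ == "__main__":
    toy = {("A", "A"): 0.4, ("A", "B"): 0.2, ("B", "B"): 0.4}
    print(log_odds_scores(toy))
```

Pairs seen more often than chance predicts get positive scores, and pairs seen less often get negative scores.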
19Gap Penalties
- Optimal alignment has a smaller number of gaps (i.e., the alignment stays close to the diagonal of a comparison matrix)
- Charge for initial and terminal gaps
- Opening a gap (5 for nucleotides, 11 for proteins)
- Continuing a gap (2 for nucleotides, 1 for proteins)
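A separate charge for opening and for continuing a gap is usually called an affine gap penalty. The sketch below assumes the convention that a gap of length k costs the opening charge plus k times the continuation charge; some programs instead charge the continuation only from the second position, so check the documentation of the tool in use. The function name is hypothetical.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of the given length: open + length * extend (assumed convention)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


if __name__ == "__main__":
    # Penalty values taken from the slide above.
    print("protein, one gap of 3:    ", affine_gap_cost(3, 11, 1))
    print("protein, three gaps of 1: ", 3 * affine_gap_cost(1, 11, 1))
    print("nucleotide, one gap of 3: ", affine_gap_cost(3, 5, 2))
```

With these values one long gap is much cheaper than several short ones, which matches the observation that a single insertion or deletion event often spans several residues.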
20Software for Alignment
- Fasta
- BLAST
- WU-BLAST
- EMBOSS (Needle, Water)
- BLAT
21FASTA
- Find hot-spots (pairs of words of length k; these are called hits in BLAST)
- Locate sequences of consecutive hot spots on a diagonal
- Combine sub-alignments into a longer alignment
- Find alternative local alignments using dynamic programming restricted to a ribbon along the diagonal containing the best run
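The first two steps, finding hot spots and grouping them by diagonal, can be sketched with a word lookup table. This is only an illustration of the k-tuple idea, not the FASTA source code; the function names and the default `k=2` are assumptions for the example.

```python
from collections import defaultdict


def hot_spots(query, subject, k=2):
    """Return {diagonal: [(i, j), ...]} for every shared word of length k.

    The diagonal of a hot spot at query position i and subject position j
    is i - j; hot spots from the same ungapped alignment share a diagonal.
    """
    # Lookup table: word -> positions where it occurs in the query.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    diagonals = defaultdict(list)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[i - j].append((i, j))
    return dict(diagonals)


def best_diagonal(query, subject, k=2):
    """Diagonal with the most hot spots, or None if there are none."""
    spots = hot_spots(query, subject, k)
    if not spots:
        return None
    return max(spots, key=lambda d: len(spots[d]))


if __name__ == "__main__":
    print(best_diagonal("HELLOWORLD", "XXHELLOWORLD"))
```

The later dynamic-programming step would then be restricted to a band around the diagonal reported here, rather than filling the whole comparison matrix.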
22FASTA Algorithm
23FASTA Algorithm
24FASTA Programs
- FastA3: compares a protein sequence to a protein database, or a DNA sequence to a DNA database
- Ssearch3: compares a protein sequence to a protein database, or a DNA sequence to a DNA database, using the Smith-Waterman algorithm. It is very slow but much more sensitive for full-length protein comparisons.
- Fastx3: compares a DNA sequence to a protein database by translating the DNA sequence in three frames, allowing gaps and frameshifts.
25BLAST(Basic Local Alignment Search Tool)
- Word: substring of a sequence
- Word pair: pair of words of the same length
- Score: numerical value of the gapless alignment of the words in a word pair
- Hit: a short, high-scoring word pair, presumably from homologous sequences
26BLAST Parameters
- Word Size (W)
- Minimum hit score (T)
- For BLOSUM62, W = 3 and T = 13
27BLAST Steps
- Compile the list of possible high-scoring words
- Generate hits by matching the possible words from the query against the database
- Extend hits until the score drops by more than X; return the highest-scoring segment pair (HSP)
- Evaluate the significance of extended hits
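The extension step can be sketched as follows: starting from a word hit, walk outward in both directions and stop once the running score has fallen more than X below the best score seen so far. The match and mismatch values, the default `x_drop`, and all function names are assumptions for this example; real BLAST scores with a substitution matrix and has further refinements.

```python
def pair_score(a, b, match=5, mismatch=-4):
    """Toy substitution score (illustrative values, not a real matrix)."""
    return match if a == b else mismatch


def extend_one_way(query, subject, qi, si, step, x_drop):
    """Walk from (qi, si) in direction step; return (best gain, length kept)."""
    running = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += pair_score(query[qi], subject[si])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score dropped more than X below the best seen
        qi += step
        si += step
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, word_size, x_drop=10):
    """Ungapped extension of a hit; returns (score, q_start, q_end, s_start)."""
    seed = sum(pair_score(query[q_pos + k], subject[s_pos + k])
               for k in range(word_size))
    right, right_len = extend_one_way(query, subject, q_pos + word_size,
                                      s_pos + word_size, 1, x_drop)
    left, left_len = extend_one_way(query, subject, q_pos - 1, s_pos - 1,
                                    -1, x_drop)
    return (seed + left + right, q_pos - left_len,
            q_pos + word_size + right_len, s_pos - left_len)


if __name__ == "__main__":
    q, s = "CCCGATTACATTT", "GGGGATTACAAAA"
    score, q_start, q_end, s_start = extend_hit(q, s, 5, 5, 3)
    print(score, q[q_start:q_end])
```

Here the hit is the shared word TTA at position 5 of both strings, and the extension grows it to the full shared segment before the mismatching flanks trigger the drop-off rule.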
28Query word (W = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words for the query word PQG:
PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13, PQA 12, PQN 12
Neighborhood score threshold (T = 13)
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
          LA  L    TP G R    W   P  D     ER     A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
High-scoring Segment Pair (HSP)
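The neighborhood list for the query word PQG can be reproduced with a short script. Only the three BLOSUM62 rows needed for this word are hard-coded below; they were typed in by hand for this example and should be checked against a reference copy of the matrix before reuse. The function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows of BLOSUM62 for P, Q and G; columns follow the order of AMINO_ACIDS.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3,
          -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3,
          -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4,
          -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def word_score(query_word, candidate):
    """Sum of substitution scores for the gapless alignment of two words."""
    return sum(BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
               for q, c in zip(query_word, candidate))


def neighborhood(query_word, threshold):
    """All words scoring at least threshold against query_word, best first."""
    hits = []
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits.append((candidate, score))
    return sorted(hits, key=lambda h: (-h[1], h[0]))


if __name__ == "__main__":
    for word, score in neighborhood("PQG", 13):
        print(word, score)
```

With T = 13 this keeps the nine words from PQG (18) down to the words scoring 13; PQA and PQN score 12 and fall below the threshold.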
29HSPs are Aligned Regions
- The results of the word matching and attempts to extend the alignment are segments called HSPs (High-scoring Segment Pairs)
- BLAST often produces several short HSPs rather than a single aligned region
30BLAST Scoring
SPQR
SPIN
Score (BLOSUM62): 4 + 7 - 3 + 0 = 8
The expected score of a random alignment must be negative
31Gapped BLAST
- Find two non-overlapping hits of score at least T and at a distance of at most d from one another
- Ungapped extension
- If the HSP generated has a high enough normalized score, start gapped extension
- Report the resulting alignment if its E-value is sufficiently small
32E-values
- Lower values give more stringent results
- Default is 10; fractional values are permissible
- E = 10 means 10 matches are expected by chance
- E < 0.05 is significant
- Short regions of high identity are less
significant than long regions of moderate
similarity
33Equations
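- $E = K m n e^{-\lambda S}$ ($K$ is a scale for the search space; $\lambda$ is a scale for the scoring system)
- $S' = (\lambda S - \ln K) / \ln 2$ ($S'$ is the bit score)
- $E = m n 2^{-S'}$
- $P = 1 - e^{-E}$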
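These relationships are easy to check numerically. In the sketch below, m and n stand for the query length and the database length, and the values used for lambda and K are illustrative assumptions only (real values depend on the scoring matrix and gap penalties and are printed in the BLAST output). The function names are hypothetical.

```python
import math


def e_value(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S), computed from the raw score S."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S'), computed from the bit score S'."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """P = 1 - exp(-E): probability of at least one such match by chance."""
    return -math.expm1(-e)


if __name__ == "__main__":
    lam, k = 0.318, 0.134        # illustrative values only
    m, n = 250, 50_000_000       # query length, database length
    raw = 80
    bits = bit_score(raw, lam, k)
    print("bit score:", round(bits, 1))
    print("E (raw):  ", e_value(raw, m, n, lam, k))
    print("E (bits): ", e_value_from_bits(bits, m, n))
    print("P:        ", p_value(e_value(raw, m, n, lam, k)))
```

Both routes to E give the same number, which is the point of the bit score: it folds lambda and K into the score so that results from different scoring systems can be compared directly. For small E, P is almost equal to E.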
34BLAST Programs
35Other BLAST Programs
- MEGABLAST
- Genomic BLAST
- Fugu rubripes
- A/G BLAST
36BLAST Databases
37Benefits of Protein Searches
- 20 amino acids reduce chance of random matches
- Protein DBs are smaller
- Degrees of similarity
- Chemical
- Observed mutation frequencies
- Steps for mutation
38BLAST and FASTA
- BLAST for proteins and speed
- FASTA for DNA and frameshifts
- FASTA for accurate statistics (protein and coding DNA)
- SSEARCH for optimal alignments
39Blast vs. Fasta
- Database format is the same
- BLAST is generally faster
- FASTA may produce better alignments
- blastp will show several alignments between
domains in the same sequences. Fasta3 shows one
alignment for each sequence pair.
40Blast vs. Fasta II
- Different default matrices and gap penalties
- BLAST cannot search very short sequences (can use Fasta, but EMBOSS fuzznuc or fuzzpro are better)
- Fasta searches one strand of DNA
- Fasta does not filter low complexity by default
41Anatomy of a BLAST Search
42BLAST Input
- Program
- Filters
- Expect
- Word Size
- Show
- Number of Descriptions/Alignments
43BLAST Filters
- Low complexity
- Human repeats
- Mask for lookup table only
- Mask lower case
- Limit by Entrez query
44Low Complexity Regions
$p_i$ = probability of symbol $i \in \{T, G, A, C\}$
Information content of $i$ = $-\log_2 p_i$
$-\log_2 p_i$ increases without bound as $p_i$ approaches 0
$-\log_2 p_i = 0$ when $p_i = 1$
Thus, sequences such as AAAAAAAAAAAAAAAAAA have no information content
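The information content of a sequence can be estimated from its own symbol frequencies. The sketch below computes the average information per symbol (the Shannon entropy, in bits); it illustrates the idea only and is not the algorithm used by the BLAST low-complexity filters. The function names are hypothetical.

```python
import math
from collections import Counter


def information_content(p):
    """Information content, in bits, of a symbol with probability p."""
    return -math.log2(p)


def entropy_bits(seq):
    """Average information per symbol, using frequencies observed in seq."""
    counts = Counter(seq)
    total = len(seq)
    return sum((c / total) * information_content(c / total)
               for c in counts.values())


if __name__ == "__main__":
    for seq in ("AAAAAAAAAAAAAAAAAA", "ACACACACACACACACAC", "ACGTACGTACGTACGTAC"):
        print(seq, round(entropy_bits(seq), 3))
```

A run of a single letter scores 0 bits per symbol, a two-letter repeat scores 1 bit, and a sequence that uses all four bases about equally approaches the maximum of 2 bits.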
45ENTREZ Queries
- nucleotide all[Filter] NOT specimen-voucher[All Fields]
- You can eliminate most of the BAC-type records from the default nucleotide database with the following query: nucleotide all[Filter] NOT htg[Keyword]
46BLAST Output
- Reference
- Query
- Database
- TaxBLAST link
- Graphical Overview: aligned matches
47BLAST Output Significant Alignments
- SeqID: Genbank ID and Accession
- Description (links to entry)
- Score (links down)
- E-value
- LocusLink
- Unigene
48BLAST Output - Alignments
- Length
- Score
- Expect
- Identities/Gaps
- Strand (e.g. Plus/Plus)
- Alignment shows identity
49BLAST Output - Proteins
- Conserved Domains: Pfam, CD-Search
- Alignments
- Matching letters are shown
- Mismatches are blank
- Conservative substitutions are shown by +
- Gaps are shown by dashes
50BLAST Output End Summary
- Database: number of letters, number of sequences
- Lambda, K, H
- Gapped Lambda, K, H
- Matrix
- Gap Penalties
- Hit summary: number of hits, extensions, sequences better than 10.0
- HSPs
- length of query
- T, A, X1, X2, X3, S1, S2
51Profile and Pattern Searches
- PSI-BLAST
- PHI-BLAST
- PROBE
52Position Specific Iterated BLAST (PSI-BLAST)
- Using the query sequence as a template, align all sequences that match with an E-value below threshold
- Assign weights to the sequences
- Construct a position-specific scoring matrix
- Iterate
- E-values unreliable
53PSI-BLAST
- Position-specific score matrix (PSSM) is the same length as the query
- Multiple alignment constructed from BLAST output
- Sequences weighted on a column-by-column basis
- Effective number of independent observations estimated
- Target frequencies derived using data-dependent pseudo-counts
- Log-odds weight matrix scores calculated to scale
- BLAST applied to PSSM
- Statistical evaluation of results
- Iteration
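The core of the procedure, turning a multiple alignment into position-specific log-odds scores, can be sketched in a much simplified form. This example is an assumption-laden toy: it uses unweighted sequences, a uniform background, and a fixed pseudocount of 1 per residue, whereas PSI-BLAST weights the sequences and uses data-dependent pseudo-counts as listed above. The function names are hypothetical.

```python
import math

PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, alphabet=PROTEIN_ALPHABET, pseudocount=1.0):
    """One dict of log-odds scores (bits) per column of an ungapped alignment."""
    background = 1.0 / len(alphabet)  # uniform background (a simplification)
    pssm = []
    for col in range(len(aligned_seqs[0])):
        column = [seq[col] for seq in aligned_seqs]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2((column.count(a) + pseudocount) / total / background)
            for a in alphabet
        })
    return pssm


def score_with_pssm(pssm, seq):
    """Score a sequence of the same length as the PSSM, column by column."""
    return sum(col[ch] for col, ch in zip(pssm, seq))


if __name__ == "__main__":
    alignment = ["MKV", "MKL", "MRV"]
    pssm = build_pssm(alignment)
    for candidate in ("MKV", "MRL", "AAA"):
        print(candidate, round(score_with_pssm(pssm, candidate), 2))
```

Unlike a fixed matrix such as BLOSUM62, the score for a residue here depends on the column it falls in, so conserved positions in the alignment are rewarded more strongly than variable ones.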
54Pattern-Hit-Initiated BLAST (PHI-BLAST)
- Combines regular expression matching with local alignments
- Finds proteins containing the pattern and similarity in the region of the pattern
- Integrated with PSI-BLAST
- E-values are computed differently
- Under development
55BLASTing Custom Databases
- Assemble Database (e.g., by using Batch ENTREZ)
- Run formatdb
- Schedule Job
56Sites using WU-BLAST
- European Bioinformatics Institute BLAST server
- EMBL Advanced BLAST2 Search
- Institut Pasteur
- Berkeley Drosophila Genome Project (BDGP)
- European Drosophila Genome Project at the EBI
- Mendel Bioinformatics Group at the John Innes Centre
- Mouse Genome Database at the Jackson Laboratory
- PlasmoDB (Plasmodium Genome Resource) at the University of Pennsylvania
- Saccharomyces cerevisiae search at Stanford University
- TAP (Transcript Assembly Program) by Zhengyan ("George") Kan
- TIGR Gene Indices
- TIGR Microbial Genomes Database
- WU Genome Sequencing Center