The Genome Access Course: Pairwise Sequence Comparisons

Provided by: james857


1
The Genome Access Course: Pairwise Sequence Comparisons
The Artist in His Museum, Charles Willson Peale
2

Identity
Similarity
Homology
Paralogy
Orthology

3
Types of Homology
4
Methods of Alignment

Dot Plot Analysis
Dynamic Programming
k-tuple Methods

5
Dot Plot Analysis

Plot one sequence against another, or against
itself
Can specify word size
Good for finding repeats and inverts
Identity runs along main diagonal
Common subsequences (possible translocations) are
shorter off-diagonal lines
Inversions are perpendicular to main diagonal
Deletions are interruptions in the lines
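
The bullet points above can be tried out on a toy example. The following is a minimal sketch, not the method used by any of the dot plot programs named in this course; the function names `dot_plot` and `render` are made up for illustration. It places a dot wherever a window of `word` letters matches exactly, so a self-comparison shows the main diagonal and a comparison against the reversed sequence shows a line perpendicular to it.

```python
def dot_plot(seq_a, seq_b, word=1):
    """Return the set of (i, j) positions where windows of length `word` match exactly."""
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in range(len(seq_b) - word + 1):
            if seq_a[i:i + word] == seq_b[j:j + word]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a runs down the side, seq_b across the top."""
    lines = ["  " + " ".join(seq_b)]
    for i, letter in enumerate(seq_a):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(letter + " " + row)
    return "\n".join(lines)


self_dots = dot_plot("STOPS", "STOPS")      # identity: main diagonal
reverse_dots = dot_plot("STOPS", "SPOTS")   # inversion: anti-diagonal
print(render("STOPS", "STOPS", self_dots))
print()
print(render("STOPS", "SPOTS", reverse_dots))
```

Raising `word` above 1 removes isolated single-letter matches, which is the usual reason for specifying a word size.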

6
Dot Plot Analysis

EMBOSS has several programs (available via PISE)
Dotter
Dotlet (web-based)

7
Dot plot of STOPS against STOPS
8
Dot plot of STOPS against SPOTS (the same word reversed)
9
Dotplot of the Complete Works of Shakespeare
10
Self-Alignment of Human LDL Receptor
11
Dynamic Programming

Provides an optimal alignment
Results depend on scoring system and gap
penalties
Takes gaps into account, but limits the number of
comparisons
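
As a rough illustration of how dynamic programming yields an optimal alignment, here is a minimal sketch of global (Needleman-Wunsch style) alignment. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example, not values from the slides, and a real tool would use a substitution matrix and affine gaps.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (illustrative scoring only)."""
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps are charged along the first row and column.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell is the best of: align two letters, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(row_a)
print(row_b)
```

Changing `gap` or the match and mismatch values changes which alignment is optimal, which is the point made above about results depending on the scoring system.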

12
Global vs. Local
SPQ-RTGKCCWIAGPGILHRMSL
SGALRCSWND-IAGPCAQH-MSA
Global (Needleman-Wunsch): starts at the end of the
sequences and adds gaps until the other end is reached
Local (Smith-Waterman): finds the region(s) of
highest similarity and builds outward
13
Smith-Waterman (Local)

Dynamic Programming
Much slower than either BLAST or FASTA
More sensitive
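
A minimal sketch of the local variant follows. It differs from global alignment in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values (match +2, mismatch -1, gap -2) are assumptions for the example only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty (illustrative scoring only)."""
    def sub(x, y):
        return match if x == y else mismatch

    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # The floor at 0 lets an alignment restart anywhere.
            h[i][j] = max(
                0,
                h[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                h[i - 1][j] + gap,
                h[i][j - 1] + gap,
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


local_score, local_a, local_b = smith_waterman("XXXACGTACGTXXX", "YYACGTACGTYY")
print(local_score, local_a, local_b)
```

The full table has len(a) x len(b) cells, which is why this approach is much slower than the word-based heuristics when run against a whole database.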

14
Scoring Matrices

Mutation Data (PAM, BLOSUM)
Identity
Genetic Code
Physical Properties

15
(No Transcript)
16
PAM (Point/Percent Accepted Mutation)
One PAM unit is an average of 1% change in all
amino acid positions
PAM1 was generated from 71 protein sequence groups
of at least 85% similarity, to avoid errors from
multiple mutations at the same site, as well as
insertions and deletions
Convert mutation probabilities to a (log odds)
scoring matrix
17
PAM Matrices

PAMn supplies scoring for sequences that have
diverged n PAM units
PAM1^n generates the other matrices, based on a
Markov model
PAM250 represents 250% change in 2500 my
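
The Markov extrapolation amounts to raising the one-step mutation probability matrix to the n-th power. The sketch below shows this on a made-up three-letter alphabet; the numbers in `toy_pam1` are invented for illustration and are not Dayhoff's values, and the helper names are hypothetical.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy one-step matrix: each row sums to 1 and each letter has a 1% chance
# of changing, mirroring the "1% change" definition of one PAM unit.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the probability that a letter is still itself is far below 0.99, yet it is not zero, because positions can change more than once and can change back.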

18
BLOSUM(BLOck SUbstitution Matrix)

BLOCKS is a database of 3,000 blocks: short,
continuous multiple alignments
Generates substitution frequencies, which can be
converted into log-odds scores
BLOSUM62
Blocks have 62% similarity
Close to PAM250
More Reliable
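
The conversion from substitution frequency to log-odds score can be sketched as follows. The function name `log_odds` and the input frequencies are made up for illustration; the scale factor of 2 (half-bit units) is an assumption about the scaling, and real matrices are built from counts over the whole BLOCKS database.

```python
import math


def log_odds(pair_freq, freq_a, freq_b, scale=2):
    """Rounded log-odds score: observed pair frequency vs. chance expectation."""
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_freq / expected))


# Toy numbers: two residues that each make up 10% of the data.
print(log_odds(0.04, 0.1, 0.1))    # pair seen 4x more often than chance
print(log_odds(0.01, 0.1, 0.1))    # pair seen exactly as often as chance
print(log_odds(0.0025, 0.1, 0.1))  # pair seen 4x less often than chance
```

Positive scores mark substitutions observed more often than chance, negative scores mark those observed less often.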

19
Gap Penalties

Optimal alignment has smaller number of gaps
(i.e., the alignment stays close to the diagonal
of a comparison matrix)
Charge for initial and terminal gaps
Opening a gap (5 for nucleotides, 11 for proteins)
Continuing a gap (2 for nucleotides, 1 for proteins)
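
With separate charges for opening and continuing a gap, one long gap costs much less than several short ones. The sketch below assumes the convention in which a gap of length k costs open + k x extend; other programs charge open + (k - 1) x extend, so treat the exact totals as dependent on that assumption.

```python
def gap_cost(length, gap_open, gap_extend):
    """Affine gap cost, assuming cost = open + length * extend."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


# Protein values from the slide: open 11, extend 1.
one_long_gap = gap_cost(4, 11, 1)
four_short_gaps = 4 * gap_cost(1, 11, 1)
print(one_long_gap, four_short_gaps)

# Nucleotide values from the slide: open 5, extend 2.
print(gap_cost(3, 5, 2))
```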

20
Software for Alignment

Fasta
BLAST
WU-BLAST
EMBOSS (Needle, Water)
BLAT

21
FASTA

Find hot-spots (pairs of words of length k; these
are called hits in BLAST)
Locate sequences of consecutive hot spots on a
diagonal
Combine sub-alignments into a longer alignment
Find alternative local alignments using dynamic
programming restricted to a ribbon along the
diagonal containing best run
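
The first two steps above can be sketched as a word lookup followed by a count per diagonal. This is a simplified illustration, not the FASTA implementation; the function name `hot_spots` is hypothetical, and the later rescoring and banded dynamic programming steps are left out.

```python
from collections import Counter, defaultdict


def hot_spots(query, subject, k=2):
    """Count exact k-letter word matches on each diagonal (subject pos - query pos)."""
    # Index every word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Look up each word of the subject and tally the diagonal it falls on.
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            diagonals[j - i] += 1
    return diagonals


counts = hot_spots("ACDEFGHIK", "WWACDEFGHIKWW", k=2)
best_diagonal, n_spots = counts.most_common(1)[0]
print(best_diagonal, n_spots)
```

A diagonal with many hot-spots is a candidate region, and the ribbon for the final dynamic programming step would be centred on it.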

22
FASTA Algorithm
23
FASTA Algorithm
24
FASTA Programs

FastA3 - Compares a protein sequence to a protein
database, or a DNA sequence to a DNA database
Ssearch3 - Compares a protein sequence to a
protein database, or a DNA sequence to a DNA
database, using the Smith-Waterman algorithm. It is
very slow but much more sensitive for full-length
protein comparison.
Fastx3 - Compares a DNA sequence to a protein
database by translating the DNA sequence in three
frames, allowing gaps and frameshifts.

25
BLAST(Basic Local Alignment Search Tool)

Word: substring of a sequence
Word pair: pair of words of the same length
Score: numerical value of the gapless alignment
of the words in a word pair
Hit: a short, high-scoring word pair, presumably
from homologous sequences

26
BLAST Parameters

Word Size (W)
Minimum hit score (T)
For BLOSUM62, W = 3 and T = 13

27
BLAST Steps

Compile the list of possible high-scoring words
Generate hits by matching the possible words
against the database
Extend hits until the score drops by more than X;
return the highest-scoring segment pair (HSP)
Evaluate significance of extended hits
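
The extension step can be sketched as follows: walk along the diagonal from a hit, keep a running score, remember the best score seen, and stop once the running score has fallen more than X below that best. This is a one-directional, ungapped sketch with assumed match and mismatch values (+5 and -4); BLAST itself extends in both directions and scores with a substitution matrix.

```python
def extend_right(query, subject, q_start, s_start, x_drop=10, match=5, mismatch=-4):
    """Ungapped rightward extension with an X-drop stopping rule.

    Returns (best score, number of aligned positions at that best score).
    """
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # dropped more than X below the best: give up
    return best, best_len


print(extend_right("ACGTACGTTTTTTTTT", "ACGTACGTAAAAAAAA", 0, 0))
print(extend_right("ACGTAACGT", "ACGTCACGT", 0, 0))
```

In the second call the single mismatch costs less than X, so the extension continues through it and the reported segment covers all nine positions.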

28
Query word (W = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13, PQA 12, PQN 12
Neighborhood score threshold (T = 13)
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
          LA  L    TP G R    W   P  D     ER     A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
High-scoring Segment Pair (HSP)
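
The neighborhood word list for the query word PQG can be reproduced by scoring every possible three-letter word against it and keeping those at or above T. The sketch below hard-codes only the three BLOSUM62 rows needed for P, Q and G, typed in by hand for this example; the function name `neighborhood` is hypothetical.

```python
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for the residues of the query word only, in AMINO order.
ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def neighborhood(word, threshold):
    """Return {candidate word: score} for all words scoring >= threshold."""
    hits = {}
    for candidate in product(AMINO, repeat=len(word)):
        score = sum(ROWS[q][AMINO.index(c)] for q, c in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


words = neighborhood("PQG", 13)
for w, s in sorted(words.items(), key=lambda item: -item[1]):
    print(w, s)
```

With T = 13 this keeps the nine words from PQG (18) down to those scoring 13; PQA and PQN score 12 and fall just below the threshold.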
29
HSPs are Aligned Regions

The results of the word matching and attempts to
extend the alignment are segments called HSPs
(High-scoring Segment Pairs)
BLAST often produces several short HSPs rather
than a single aligned region

30
BLAST Scoring
SPQR
SPIN
Score (BLOSUM62): 4 + 7 - 3 + 0 = 8
Random alignments must give a negative score
31
Gapped BLAST

Find two non-overlapping hits of score at least T
and distance at most d one from another
Ungapped extension
If the HSP generated has high enough normalized
score, start gapped extension
Report resulting alignment if it has a sufficiently
small E-value

32
E-values

Lower values give more stringent results
Default is 10; fractional values are permissible
E = 10 means 10 matches are expected by chance
E < 0.05 is significant
Short regions of high identity are less
significant than long regions of moderate
similarity

33
Equations

$E = K m n \, e^{-\lambda S}$ (K is a scale for the search space; $\lambda$ is a scale for the scoring system)
$S' = (\lambda S - \ln K) / \ln 2$
$E = m n \, 2^{-S'}$
$P = 1 - e^{-E}$
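
The four equations can be checked numerically. In the sketch below, m and n are taken to be the query length and the database length, and the values of lambda and K are illustrative assumptions for the example rather than numbers from the slides; a real search reports its own Lambda and K in the output summary.

```python
import math

LAMBDA = 0.318  # assumed, illustrative value
K = 0.134       # assumed, illustrative value


def e_value(raw_score, m, n, lam=LAMBDA, k=K):
    """E = K m n exp(-lambda S)"""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=LAMBDA, k=K):
    """S' = (lambda S - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_from_bits(bits, m, n):
    """E = m n 2^(-S')"""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """P = 1 - exp(-E): chance of at least one match this good."""
    return 1.0 - math.exp(-e)


m, n = 250, 10**9
raw = 100
print(e_value(raw, m, n))
print(bit_score(raw))
print(e_from_bits(bit_score(raw), m, n))
print(p_value(0.05))
```

Both routes to E agree, which is the point of the bit score: it folds lambda and K into the score so that E depends only on the bit score and the search space. For small E, P and E are nearly equal.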

34
BLAST Programs
35
Other BLAST Programs

MEGABLAST
Genomic BLAST
Fugu rubripes
A/G BLAST

36
BLAST Databases
37
Benefits of Protein Searches

20 amino acids reduce chance of random matches
Protein DBs are smaller
Degrees of similarity
Chemical
Observed mutation frequencies
Steps for mutation

38
BLAST and FASTA

BLAST for proteins and speed
FASTA for DNA and frameshifts
FASTA for accurate statistics (protein and coding
DNA)
SSEARCH for optimal

39
Blast vs. Fasta

Database format is the same
BLAST is generally faster
FASTA may produce better alignments
blastp will show several alignments between
domains in the same sequences. Fasta3 shows one
alignment for each sequence pair.

40
Blast vs. Fasta II

Different default matrices and gap penalties
BLAST cannot search very short sequences (can use
Fasta, but EMBOSS fuzznuc or fuzzpro are better)
Fasta searches one strand of DNA
Fasta does not filter low complexity by default

41
Anatomy of a BLAST Search
42
BLAST Input

Program
Filters
Expect
Word Size
Show
Number of Descriptions/Alignments

43
BLAST Filters

Low complexity
Human repeats
Mask for lookup table only
Mask lower case
Limit by Entrez query

44
Low Complexity Regions
$p_i$ = probability of symbol $i \in \{T, G, A, C\}$
Information content of $i$ = $-\log_2 p_i$
$-\log_2 p_i$ increases without bound as $p_i$
approaches 0
$-\log_2 p_i = 0$ when $p_i = 1$
Thus, sequences such as AAAAAAAAAAAAAAAAAA have
no information content
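
A minimal sketch of this calculation is given below. It estimates each symbol's probability from its frequency in the sequence and averages the information content per position (Shannon entropy, in bits). This only illustrates the idea; it is not the algorithm of the low-complexity filters used by BLAST, and the function names are made up for the example.

```python
import math
from collections import Counter


def information(p):
    """Information content of a symbol with probability p, in bits."""
    return -math.log2(p)


def entropy(seq):
    """Average information per position, using observed symbol frequencies."""
    n = len(seq)
    return sum((count / n) * information(count / n) for count in Counter(seq).values())


print(entropy("AAAAAAAAAAAAAAAAAA"))
print(entropy("ACGTACGTACGTACGTACGT"))
print(entropy("AAAAAAAAAAAAAAAAAC"))
```

A run of a single letter scores 0 bits per position and an even mix of four letters scores 2 bits, so a low value over a window flags a candidate low-complexity region.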
45
ENTREZ Queries

nucleotide all[Filter] NOT specimen-voucher[All
Fields]
You can eliminate most of the BAC-type records
from the default nucleotide database with the
following query: nucleotide all[Filter] NOT
htg[Keyword]

46
BLAST Output

Reference
Query
Database
TaxBLAST link
Graphical Overview: aligned matches

47
BLAST Output - Significant Alignments

SeqID: GenBank ID and Accession
Description (links to entry)
Score (links down)
E-value
LocusLink
Unigene

48
BLAST Output - Alignments

Length
Score
Expect
Identities/Gaps
Strand (e.g. Plus/Plus)
Alignment shows identity

49
BLAST Output - Proteins

Conserved Domains Pfam CD-Search
Alignments
Matching letters shown
Mismatches blank
Conservative substitutions shown by +
Gaps are shown by dashes

50
BLAST Output - End Summary

Database: number of letters and sequences
Lambda, K, H
Gapped Lambda, K, H
Matrix
Gap Penalties
Hit summary: number of hits, number of extensions, number > 10.0
Number of HSPs
Length of query
T, A, X1, X2, X3, S1, S2

51
Profile and Pattern Searches

PSI-BLAST
PHI-BLAST
PROBE

52
Position Specific Iterated BLAST (PSI-BLAST)

Using the query sequence as a template, align all
sequences that match with an E-value below
threshold
Assign weights to the sequences
Construct a position-specific scoring matrix
Iterate
E-values unreliable

53
PSI-BLAST

Position-specific score matrix (PSSM) same length
as query
Multiple alignment constructed from BLAST output
Sequences weighted on column-by-column basis
Effective number of independent observations
estimated
Target frequencies derived using data-dependent
pseudo-counts
Log-odds weight matrix scores calculated to scale
BLAST applied to PSSM
Statistical evaluation of results
Iteration
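
The core of the matrix construction can be sketched on a toy alignment. This is a heavily simplified illustration: it uses a fixed add-one pseudocount and a uniform background, and it skips the sequence weighting and data-dependent pseudo-counts listed above. The function names and the four-sequence alignment are made up for the example.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, alphabet=AMINO, pseudocount=1.0):
    """One {letter: log-odds score} dict per column of an ungapped alignment."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for letter in alphabet:
            freq = (column.count(letter) + pseudocount) / total
            scores[letter] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, seq):
    """Score a sequence of the same length as the matrix."""
    return sum(column[letter] for column, letter in zip(pssm, seq))


toy_alignment = ["HKDE", "HRDE", "HKDQ", "YKDE"]
pssm = build_pssm(toy_alignment)
print(round(score(pssm, "HKDE"), 2))
print(round(score(pssm, "WWWW"), 2))
```

Unlike a general substitution matrix, the score for a letter depends on the column: D scores well in the third position here and poorly elsewhere.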

54
Pattern-Hit-Initiated BLAST (PHI-BLAST)

Combines regular expression matching with local
alignments
Finds proteins containing the pattern and
similarity in the region of the pattern
Integrated with PSI-BLAST
E-values are computed differently
Under development
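
The pattern-matching half of this idea can be sketched by translating a pattern into a regular expression. The sketch assumes PROSITE-style pattern syntax (elements joined by hyphens, x for any residue, square brackets for allowed residues, curly braces for excluded residues, and counts in parentheses); the function name, the pattern and the test sequence are made up for illustration, and the local alignment step around each match is not shown.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(5,11)" into its core and repeat counts.
        core, low, high = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # excluded residues
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R")
print(regex)

hit = re.search(regex, "AAALGEKGLAAAAAARTT")
print(hit.start(), hit.group())
```

Each match position would then seed an alignment of the surrounding region, which is how the pattern and similarity requirements are combined.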

55
BLASTing Custom Databases