The Genome Access Course: Pairwise Sequence Comparisons

Provided by: james857
1
The Genome Access Course: Pairwise Sequence Comparisons
The Artist in His Museum, Charles Wilson Peale
2
  • Identity
  • Similarity
  • Homology
  • Paralogy
  • Orthology

3
Types of Homology
4
Methods of Alignment
  • Dot Plot Analysis
  • Dynamic Programming
  • k-tuple Methods

5
Dot Plot Analysis
  • Plot one sequence against another, or against
    itself
  • Can specify word size
  • Good for finding repeats and inverts
  • Identity runs along main diagonal
  • Common subsequences (possible translocations) are
    shorter off-diagonal lines
  • Inversions are perpendicular to main diagonal
  • Deletions are interruptions in the lines
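
The points above can be tried out with a minimal sketch of a dot plot. The function names `dot_plot` and `render` are illustrative only and are not taken from EMBOSS, Dotter, or Dotlet; the sketch marks a dot wherever two words of the chosen size are identical.

```python
def dot_plot(seq_a, seq_b, word_size=1):
    """Return the set of (i, j) where the word of length word_size starting
    at seq_a[i] is identical to the word starting at seq_b[j]."""
    dots = set()
    for i in range(len(seq_a) - word_size + 1):
        for j in range(len(seq_b) - word_size + 1):
            if seq_a[i:i + word_size] == seq_b[j:j + word_size]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + " ".join(seq_a)]
    for j, ch in enumerate(seq_b):
        row = " ".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    # A word against itself: identity runs along the main diagonal.
    print(render("STOPS", "STOPS", dot_plot("STOPS", "STOPS")))
    print()
    # A word against its reverse: the line runs perpendicular to the main diagonal.
    print(render("STOPS", "SPOTS", dot_plot("STOPS", "SPOTS")))
```

Raising `word_size` removes isolated single-letter matches (such as the two S's in STOPS matching each other) and leaves only longer runs.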

6
Dot Plot Analysis
  • EMBOSS has several programs (available via PISE)
  • Dotter
  • Dotlet (web-based)

7
[Figure: dot plot of the word STOPS against itself]
8
[Figure: dot plot of the word STOPS against its reverse, SPOTS]
9
Dotplot of the Complete Works of Shakespeare
10
Self-Alignment of Human LDL Receptor
11
Dynamic Programming
  • Provides an optimal alignment
  • Results depend on scoring system and gap
    penalties
  • Takes gaps into account, but limits the number of
    comparisons

12
Global vs. Local
SPQ-RTGKCCWIAGPGILHRMSL
SGALRCSWND-IAGPCAQH-MSA
Global (Needleman-Wunsch): start at one end and
add gaps as needed until the other end is reached
Local (Smith-Waterman): find the region(s) of
highest similarity and build outward
13
Smith-Waterman (Local)
  • Dynamic Programming
  • Much slower than either BLAST or FASTA
  • More sensitive
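
The difference between global and local dynamic programming can be illustrated with a small score-only sketch. It assumes a simple match/mismatch score and a linear gap penalty (the values below are arbitrary choices, not those of any particular program), and the function name `align_score` is hypothetical.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming with a linear gap penalty.

    local=False: Needleman-Wunsch style (global, end to end).
    local=True:  Smith-Waterman style (best-scoring pair of substrings).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            H[i][j] = cell
    return best if local else H[-1][-1]


if __name__ == "__main__":
    print(align_score("XXXGATTACAYYY", "GATTACA"))              # global
    print(align_score("XXXGATTACAYYY", "GATTACA", local=True))  # local
```

The table has one cell per pair of positions, which is why this approach is much slower than the word-based heuristics used by BLAST and FASTA.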

14
Scoring Matrices
  • Mutation Data (PAM, BLOSUM)
  • Identity
  • Genetic Code
  • Physical Properties

15
(No Transcript)
16
PAM (Point/Percent Accepted Mutation)
  • One PAM unit is an average of 1% change across
    all amino acid positions
  • PAM1 was generated from 71 protein sequence
    groups of at least 85% similarity, to avoid
    errors from multiple mutations at the same site,
    as well as insertions and deletions
  • Mutation probabilities are converted to a
    log-odds scoring matrix
17
PAM Matrices
  • PAMn supplies scoring for sequences that have
    diverged n PAM units
  • PAM1^n (PAM1 multiplied by itself n times)
    generates the other matrices, based on a Markov
    model
  • PAM250 represents 250% change in 2,500 my
    (million years)
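
The Markov extrapolation can be sketched with a toy matrix. Everything below is an assumption made for illustration: a hypothetical 3-letter alphabet, a 1% change per PAM unit split evenly between the other two letters, and uniform background frequencies. It is not the real 20x20 Dayhoff matrix.

```python
import math

# Toy PAM1 for a hypothetical 3-letter alphabet (NOT the Dayhoff data):
# each letter stays the same with probability 0.99 per PAM unit.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
BACKGROUND = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform letter frequencies


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam(n, pam1=TOY_PAM1):
    """PAMn as PAM1 multiplied by itself n times (Markov model)."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


def log_odds(matrix, background):
    """Convert substitution probabilities to 10*log10 odds scores."""
    size = len(matrix)
    return [[10 * math.log10(matrix[i][j] / background[j]) for j in range(size)]
            for i in range(size)]


if __name__ == "__main__":
    for row in log_odds(pam(250), BACKGROUND):
        print([round(score, 2) for score in row])
```

In this toy case identities still score above zero at 250 PAM units and substitutions score below zero, but the gap between them is far smaller than at 1 PAM unit.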

18
BLOSUM (BLOck SUbstitution Matrix)
  • BLOCKS is a database of about 3,000 blocks:
    short, continuous multiple alignments
  • Generates substitution frequencies, which can be
    converted into log-odds scores
  • BLOSUM62
  • Blocks have 62% similarity
  • Close to PAM250
  • More reliable

19
Gap Penalties
  • Optimal alignment has smaller number of gaps
    (i.e., the alignment stays close to the diagonal
    of a comparison matrix)
  • Charge for initial and terminal gaps
  • Opening a gap (5 for nucleotides, 11 for
    proteins)
  • Continuing a gap (2 for nucleotides, 1 for
    proteins)
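
A small worked example shows why separate open and continue charges favor a few long gaps over many short ones. It assumes the convention in which a gap of length L costs the opening charge plus L times the continuation charge; other programs count the first position differently, so treat the exact totals as illustrative. The function name `gap_cost` is hypothetical.

```python
def gap_cost(length, gap_open, gap_extend):
    """Affine gap cost, assuming: cost = gap_open + gap_extend * length."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


if __name__ == "__main__":
    # Protein values from the slide: open 11, continue 1.
    one_long_gap = gap_cost(3, 11, 1)
    three_short_gaps = 3 * gap_cost(1, 11, 1)
    print(one_long_gap, three_short_gaps)
```

Under this convention one gap of three residues is charged 14, while three separate one-residue gaps are charged 36.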

20
Software for Alignment
  • Fasta
  • BLAST
  • WU-BLAST
  • EMBOSS (Needle, Water)
  • BLAT

21
FASTA
  • Find hot spots (pairs of matching words of
    length k; these are called hits in BLAST)
  • Locate sequences of consecutive hot spots on a
    diagonal
  • Combine sub-alignments into a longer alignment
  • Find alternative local alignments using dynamic
    programming restricted to a ribbon along the
    diagonal containing best run
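
The first two steps can be sketched as follows. This is only an illustration of the k-tuple idea, not the FASTA program's actual implementation: it indexes the words of the query, then counts shared words per diagonal, where a diagonal is identified by the subject offset minus the query offset. The function name `hot_spot_diagonals` is hypothetical.

```python
from collections import Counter, defaultdict


def hot_spot_diagonals(query, subject, k=2):
    """Count identical k-letter words (hot spots) on each diagonal."""
    # Look-up table: word -> list of positions in the query.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1  # hot spots on the same diagonal share j - i
    return diagonals


if __name__ == "__main__":
    counts = hot_spot_diagonals("ACDEFG", "XXACDEFG", k=2)
    print(counts.most_common(1))
```

The diagonal with the most hot spots is the natural place to centre the ribbon for the restricted dynamic programming step.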

22
FASTA Algorithm
23
FASTA Algorithm
24
FASTA Programs
  • FastA3 - Compare a protein sequence to a protein
    database, or a DNA sequence to a DNA database
  • Ssearch3 - Compare a protein sequence to a
    protein database, or a DNA sequence to a DNA
    database, using the Smith-Waterman algorithm. It
    is very slow but much more sensitive for
    full-length protein comparisons.
  • Fastx3 - Compare a DNA sequence to a protein
    database, by comparing the translated DNA
    sequence in three frames and allowing gaps and
    frameshifts.

25
BLAST (Basic Local Alignment Search Tool)
  • Word: a substring of a sequence
  • Word pair: a pair of words of the same length
  • Score: numerical value of the gapless alignment
    of the words in a word pair
  • Hit: a short, high-scoring word pair, presumably
    from homologous sequences

26
BLAST Parameters
  • Word Size (W)
  • Minimum hit score (T)
  • For BLOSUM62, W = 3 and T = 13

27
BLAST Steps
  • Compile the list of possible high-scoring words
  • Generate hits by matching the possible words from
    the query against the database
  • Extend hits until the score drops by more than X
    below its best value; return the highest-scoring
    segment pair (HSP)
  • Evaluate significance of extended hits
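
The hit-and-extend steps can be sketched in a simplified form. The sketch below makes two loud simplifications: it seeds on exact word matches rather than on neighborhood words scoring at least T under BLOSUM62, and it scores with a plain match/mismatch scheme. The names `find_hsps` and `_extend` are hypothetical and do not correspond to any BLAST source code.

```python
def _extend(query, subject, i, j, step, x_drop, score):
    """Walk from (i, j) in direction step (+1 or -1) without gaps.
    Stop once the running score falls more than x_drop below its best.
    Return (best score gained, number of positions kept)."""
    best = run = kept = length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        run += score(query[i], subject[j])
        length += 1
        if run > best:
            best, kept = run, length
        elif best - run > x_drop:
            break
        i += step
        j += step
    return best, kept


def find_hsps(query, subject, word_size=3, x_drop=2, match=1, mismatch=-1):
    """Return HSPs as (score, query_start, subject_start, length), best first."""
    def score(a, b):
        return match if a == b else mismatch

    # Step 1: list the query words (exact words only in this sketch).
    words = {}
    for i in range(len(query) - word_size + 1):
        words.setdefault(query[i:i + word_size], []).append(i)

    hsps = set()
    # Step 2: scan the subject for hits; step 3: extend each hit both ways.
    for j in range(len(subject) - word_size + 1):
        for i in words.get(subject[j:j + word_size], ()):
            right, r_len = _extend(query, subject, i + word_size,
                                   j + word_size, 1, x_drop, score)
            left, l_len = _extend(query, subject, i - 1, j - 1, -1,
                                  x_drop, score)
            total = word_size * match + left + right
            hsps.add((total, i - l_len, j - l_len, l_len + word_size + r_len))
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    for hsp in find_hsps("CCCGATTACACCC", "TTTGATTACATTT"):
        print(hsp)
```

Several hits on the same diagonal extend to the same segment, which is why the results are collected in a set before sorting.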

28
Query word (W = 3): PQG
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words (word, score): PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13
Neighborhood score threshold (T = 13); words below the threshold: PQA 12, PQN 12
High-scoring Segment Pair (HSP):
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
          LA  L    TP G R    W   P  D     ER     A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
29
HSPs are Aligned Regions
  • The results of the word matching and attempts to
    extend the alignment are segments called HSPs
    (High-scoring Segment Pairs)
  • BLAST often produces several short HSPs rather
    than a single aligned region

30
BLAST Scoring
SPQR
SPIN
Score (BLOSUM62): 4 + 7 - 3 + 0 = 8
Random alignments must, on average, give a negative score
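
The sum above can be reproduced with a few lines of code. The dictionary holds only the four BLOSUM62 entries this example needs, not the full matrix, and the names `BLOSUM62_SUBSET` and `ungapped_score` are hypothetical.

```python
# Only the four BLOSUM62 entries used on this slide (not the full matrix).
BLOSUM62_SUBSET = {
    ("S", "S"): 4,
    ("P", "P"): 7,
    ("Q", "I"): -3,
    ("R", "N"): 0,
}


def ungapped_score(a, b, matrix):
    """Sum the substitution scores of an ungapped, equal-length alignment."""
    if len(a) != len(b):
        raise ValueError("ungapped alignment needs equal-length segments")
    return sum(matrix[(x, y)] for x, y in zip(a, b))


if __name__ == "__main__":
    print(ungapped_score("SPQR", "SPIN", BLOSUM62_SUBSET))
```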
31
Gapped BLAST
  • Find two non-overlapping hits of score at least T
    and distance at most d one from another
  • Ungapped extension
  • If the HSP generated has high enough normalized
    score, start gapped extension
  • Report resulting alignment if it has sufficiently
    small E-value

32
E-values
  • Lower values give more stringent results
  • Default is 10; fractional values are permissible
  • E = 10 means 10 matches are expected by chance
  • E < 0.05 is significant
  • Short regions of high identity are less
    significant than long regions of moderate
    similarity

33
Equations
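  • $E = K m n e^{-\lambda S}$ ($K$ is a scale for the
    search space, $\lambda$ for the scoring system;
    $m$ and $n$ are the query and database lengths)
  • $S' = (\lambda S - \ln K) / \ln 2$ (bit score)
  • $E = m n 2^{-S'}$
  • $P = 1 - e^{-E}$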
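
These relationships can be checked numerically with a short sketch. The function names are hypothetical, and the parameter values in the example (lambda = 0.267, K = 0.041, a 250-residue query against a 50-million-residue database) are illustrative assumptions, not values taken from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized (bit) score: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_raw(raw_score, m, n, lam, k):
    """Expected number of chance hits: K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """The same E-value written with the bit score: m * n * 2^(-bits)."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance hit: 1 - exp(-E)."""
    return -math.expm1(-e)


if __name__ == "__main__":
    lam, k = 0.267, 0.041        # assumed, illustrative statistical parameters
    m, n = 250, 50_000_000       # assumed query and database lengths
    bits = bit_score(100, lam, k)
    e = e_value_from_bits(bits, m, n)
    print(round(bits, 1), e, p_value(e))
```

For small E the P-value is almost identical to E, which is why the two are often used interchangeably for strong hits.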

34
BLAST Programs
35
Other BLAST Programs
  • MEGABLAST
  • Genomic BLAST
  • Fugu rubripes
  • A/G BLAST

36
BLAST Databases
37
Benefits of Protein Searches
  • A 20-letter amino acid alphabet reduces the
    chance of random matches
  • Protein DBs are smaller
  • Degrees of similarity can be scored, based on:
  • Chemical properties
  • Observed mutation frequencies
  • Steps for mutation

38
BLAST and FASTA
  • BLAST for proteins and speed
  • FASTA for DNA and frameshifts
  • FASTA for accurate statistics (protein and coding
    DNA)
  • SSEARCH for optimal

39
Blast vs. Fasta
  • Database format is the same
  • BLAST is generally faster
  • FASTA may produce better alignments
  • blastp will show several alignments between
    domains in the same sequences. Fasta3 shows one
    alignment for each sequence pair.

40
Blast vs. Fasta II
  • Different default matrices and gap penalties
  • BLAST cannot search very short sequences (can use
    Fasta, but EMBOSS fuzznuc or fuzzpro are better)
  • Fasta searches one strand of DNA
  • Fasta does not filter low complexity by default

41
Anatomy of a BLAST Search
42
BLAST Input
  • Program
  • Filters
  • Expect
  • Word Size
  • Show
  • Number of Descriptions/Alignments

43
BLAST Filters
  • Low complexity
  • Human repeats
  • Mask for lookup table only
  • Mask lower case
  • Limit by Entrez query

44
Low Complexity Regions
$p_i$ = probability of symbol $i \in \{T, G, A, C\}$
Information content of $i$ = $-\log_2 p_i$
$-\log_2 p_i$ increases without bound as $p_i$
approaches 0
$-\log_2 p_i = 0$ when $p_i = 1$
Thus, sequences such as AAAAAAAAAAAAAAAAAA have
no information content
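
The idea can be sketched by averaging the information content over a window, which gives the Shannon entropy in bits per symbol. This is an illustration of the principle only: the low-complexity filters that BLAST actually applies are more elaborate, and the window size and threshold below are arbitrary assumptions. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Average information content in bits per symbol: sum of -p_i * log2(p_i)."""
    total = len(seq)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(seq).values())


def low_complexity_windows(seq, window=12, threshold=1.0):
    """Start positions of windows whose entropy is below the threshold."""
    return [i for i in range(len(seq) - window + 1)
            if shannon_entropy(seq[i:i + window]) < threshold]


if __name__ == "__main__":
    print(shannon_entropy("AAAAAAAAAAAAAAAAAA"))
    print(shannon_entropy("ACGT"))
    print(low_complexity_windows("ACGTACGTACGT" + "A" * 12))
```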
45
ENTREZ Queries
  • nucleotide all[Filter] NOT specimen-voucher[All
    Fields]
  • You can eliminate most of the BAC-type records
    from the default nucleotide database with the
    following query: nucleotide all[Filter] NOT
    htg[Keyword]

46
BLAST Output
  • Reference
  • Query
  • Database
  • TaxBLAST link
  • Graphical Overview: aligned matches

47
BLAST Output Significant Alignments
  • SeqID: GenBank ID and accession
  • Description (links to entry)
  • Score (links down)
  • E-value
  • LocusLink
  • Unigene

48
BLAST Output - Alignments
  • Length
  • Score
  • Expect
  • Identities/Gaps
  • Strand (e.g. Plus/Plus)
  • Alignment shows identity

49
BLAST Output - Proteins
  • Conserved Domains Pfam CD-Search
  • Alignments
  • Matching letters shown
  • Mismatches blank
  • Conservative substitutions shown by +
  • Gaps are shown by dashes

50
BLAST Output End Summary
  • Database: number of letters and sequences
  • Lambda, K, H
  • Gapped Lambda, K, H
  • Matrix
  • Gap Penalties
  • Hit summary: number of hits, number of
    extensions, number of sequences better than 10.0
  • HSPs
  • Length of query
  • T, A, X1, X2, X3, S1, S2

51
Profile and Pattern Searches
  • PSI-BLAST
  • PHI-BLAST
  • PROBE

52
Position Specific Iterated BLAST (PSI-BLAST)
  • Using the query sequence as a template, align all
    sequences that match with an E-value below
    threshold
  • Assign weights to the sequences
  • Construct a position-specific scoring matrix
  • Iterate
  • E-values unreliable

53
PSI-BLAST
  • Position-specific score matrix (PSSM) same length
    as query
  • Multiple alignment constructed from BLAST output
  • Sequences weighted on column-by-column basis
  • Effective number of independent observations
    estimated
  • Target frequencies derived using data-dependent
    pseudo-counts
  • Log-odds weight matrix scores calculated to scale
  • BLAST applied to PSSM
  • Statistical evaluation of results
  • Iteration
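
The core of these steps, turning alignment columns into log-odds scores, can be sketched on a toy alignment. The sketch is deliberately simplified and is not PSI-BLAST's procedure: it uses a fixed uniform pseudocount rather than data-dependent pseudocounts, applies no sequence weighting, and assumes a 4-letter alphabet with uniform background frequencies. The names `build_pssm` and `score_sequence` are hypothetical.

```python
import math


def build_pssm(alignment, alphabet, background, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column (same length as the query)."""
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for letter in alphabet:
            # Pseudocounts keep unseen letters from getting a score of -infinity.
            freq = (column.count(letter) + pseudocount) / total
            scores[letter] = math.log2(freq / background[letter])
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Ungapped score of a sequence of the same length against the PSSM."""
    return sum(column[letter] for column, letter in zip(pssm, seq))


if __name__ == "__main__":
    alphabet = "ACGT"
    background = {letter: 0.25 for letter in alphabet}  # assumed uniform
    pssm = build_pssm(["ACG", "ACG", "ATG"], alphabet, background)
    print(round(score_sequence(pssm, "ACG"), 2))
    print(round(score_sequence(pssm, "TTT"), 2))
```

Positions that are conserved in the alignment reward the conserved letter strongly, while variable positions score more evenly, which is what lets a profile detect more distant relatives than a single query sequence.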

54
Pattern-Hit-Initiated BLAST (PHI-BLAST)
  • Combines regular expression matching with local
    alignments
  • Finds proteins containing the pattern and
    similarity in the region of the pattern
  • Integrated with PSI-BLAST
  • E-values are computed differently
  • Under development

55
BLASTing Custom Databases
  • Assemble Database (e.g., by using Batch ENTREZ)
  • Run formatdb
  • Schedule Job

56
Sites using WU-BLAST
  • European Bioinformatics Institute BLAST server
  • EMBL Advanced BLAST2 Search
  • Institut Pasteur
  • Berkeley Drosophila Genome Project (BDGP)
  • European Drosophila Genome Project at the EBI
  • Mendel Bioinformatics Group at the John Innes
    Centre
  • Mouse Genome Database at the Jackson Laboratory
  • PlasmoDB (Plasmodium Genome Resource) at the
    University of Pennsylvania
  • Saccharomyces cerevisiae search at Stanford
    University
  • TAP (Transcript Assembly Program) by Zhengyan
    ("George") Kan
  • TIGR Gene Indices
  • TIGR Microbial Genomes Database
  • WU Genome Sequencing Center