1
From Pairwise To Multiple Sequence Alignments
Workshop, January 2008 Maya Schushan
2
Outline
1. Introduction
2. Applications
3. General Alignment Methodology
4. Pairwise Alignment
  • Smith-Waterman
  • Needleman-Wunsch
5. Multiple Sequence Alignment
  • ClustalW
  • MUSCLE
  • T-Coffee

3
What Is An Alignment?
Introduction
  • Comparing 2 (pairwise) or more (multiple)
    sequences.
  • Searching for a series of identical or similar
    characters in the sequences.

4
Introduction
What Is An Alignment?
A process of lining-up 2 or more sequences to
achieve maximum level of identity, in order to
find homologies.
(Figure: alternative ways of lining up the sequences; residues shown: T C A T G C A T T G)
5
Introduction
Basic Terms
  • Homology: relation between sequences that is a
    result of divergence from a common ancestor.
  • Identity: sequences or sub-sequences that are
    invariant.
  • Similarity: sequences or sub-sequences that are
    related.

6
Introduction
Homologues: Orthology vs. Paralogy



Reproduced from NCBI education website
7
Introduction
The Limits of Sequence Similarity
9
Applications
Why Sequence Alignment?
  • 1. Predict characteristics of a protein
  • Use the structure or function information on a
    homologous protein to predict the
    structure/function of an unknown protein.
  • Similar sequences → similar proteins (?)
  • Conserved vs. variable regions → functional sites

10
Applications
Why Sequence Alignment?
  • 1. Predict characteristics of a protein

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG
11
Applications
Why Sequence Alignment?
A model is generated according to a template
structure of a homologous protein
12
Applications
Why Sequence Alignment?
  • 2. Learn about evolutionary relationships
  • Two sequences from different organisms are
    similar → they may have a common ancestor.
  • Needed for construction of phylogenetic trees

13
Applications
Why Sequence Alignment?
  • 3. Disease research
  • Comparison of sequences between individuals can
    detect changes that are related to diseases
  • Analysis of residue substitutions: mutation or
    polymorphism?

14
Applications
Why Sequence Alignment?
  • Examples of specific applications
  • Evolutionary conservation analysis
    (ConSeq/ConSurf)
  • Motif and domain prediction
    (Prosite/InterPro/Pfam)
  • Phylogenetic trees

ConSurf analysis of PDB entry 1hyt (hydrolase)
16
Sequence Modifications
General Alignment Methodology
  • 1. Insertion - inserting one or more letters
    into the sequence.
    AAGA → AAGTA
  • 2. Deletion - deleting one or more letters from
    the sequence.
    AAGA → AGA
  • 3. Substitution - replacing a sequence letter by
    another.
    AAGA → AACA

InDel - insertions and deletions
17
General Alignment Methodology
Measuring An Alignment
Sequences:
S  ACTG
T  AGT
Possible alignments:
S  AC_TG     S  ACTG     S  ACTG
T  A_GT_     T  AGT_     T  _AGT
Good: identical characters (match). Bad: different
characters (mismatch) or a gap (InDel).
  • Each pair of characters gets a value, depending
    on its identity.
  • The similarity score of the alignment is the sum
    of pair values.
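
The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the pair values (match +1, mismatch -1, gap -2) and the function name `alignment_score` are assumptions, not values given in the slides. The three candidate alignments of S and T are the ones shown on this slide, with "_" as the gap character.

```python
def alignment_score(s, t, match=1, mismatch=-1, gap=-2, gap_char="_"):
    """Sum the pair values over the columns of a pairwise alignment.

    The default pair values are illustrative assumptions.
    """
    if len(s) != len(t):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(s, t):
        if x == gap_char or y == gap_char:
            total += gap          # gap (InDel) column
        elif x == y:
            total += match        # identical characters
        else:
            total += mismatch     # different characters
    return total


# The three candidate alignments of S = ACTG and T = AGT from the slide.
candidates = [("AC_TG", "A_GT_"), ("ACTG", "AGT_"), ("ACTG", "_AGT")]
scores = {pair: alignment_score(*pair) for pair in candidates}
best = max(scores, key=scores.get)
print(scores, best)
```

Under these assumed values the second alignment scores highest, because it has two matches and only one gap.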

18
Example Aligning Two Globins
General Alignment Methodology
  • Human Hemoglobin (HH)
  • VLSPADKTNVKAAWGKVGAHAGYEG
  • Sperm Whale Myoglobin (SWM)
  • VLSEGEWQLVLHVWAKVEADVAGHG

19
Example Aligning Two Globins
General Alignment Methodology
  • No gaps
  • Percent identity: 36%
  • Percent similarity: 40%
  • (HH)  VLSPADKTNVKAAWGKVGAHAGYEG
  • (SWM) VLSEGEWQLVLHVWAKVEADVAGHG

20
Example Aligning Two Globins
General Alignment Methodology
  • With gaps
  • Gaps: 2
  • Percent identity: 45.833% (instead of 36%
    without gaps)
  • Percent similarity: 54.167% (instead of 40%
    without gaps)
  • (HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
  • (SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
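
The identity figures for the two globin alignments can be reproduced with a short Python sketch. One assumption is needed: percent identity is taken over the columns that contain no gap, which is the convention that gives 45.833% for the gapped alignment. The function name `percent_identity` is illustrative. Percent similarity is not computed here, because it depends on which residue pairs are counted as similar.

```python
def percent_identity(a, b, gap="-"):
    """Percent of identical residue pairs among the gap-free columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


ungapped = percent_identity("VLSPADKTNVKAAWGKVGAHAGYEG",
                            "VLSEGEWQLVLHVWAKVEADVAGHG")
gapped = percent_identity("VLSPADKTNVKAAWGKVGAH-AGYEG",
                          "VLSEGEWQLVLHVWAKVEADVAGH-G")
print(round(ungapped, 3), round(gapped, 3))
```

The ungapped alignment has 9 identical pairs in 25 columns, and the gapped one has 11 identical pairs in 24 gap-free columns.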

21
General Alignment Methodology
Types of Gap Penalties
(insertions or deletions)
  • InDels are rare in evolution, but once created
    they are easy to extend
  • Gap open penalty: for the first residue in a
    gap
  • Gap extension penalty: for each additional
    residue in a gap.
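
A minimal sketch of this two-part (affine) gap penalty, assuming a gap open penalty of 10 and a gap extension penalty of 0.5. Both numbers and the function name are hypothetical and chosen only for illustration.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of the given length.

    The first residue in the gap pays gap_open; every additional
    residue pays gap_extend. Default values are assumptions.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


one_long_gap = affine_gap_penalty(3)          # one gap of three residues
three_short_gaps = 3 * affine_gap_penalty(1)  # three separate one-residue gaps
print(one_long_gap, three_short_gaps)
```

With these values a single gap of three residues costs far less than three separate gaps, which reflects the idea that InDels are rare to create but easy to extend.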

22
General Alignment Methodology
Types of Gap Penalties
Motivation: aligning cDNAs to genomic DNA
(Figure: a cDNA query aligned to genomic DNA)
Conclusion: gap opening and extension should be
penalized differently to align the sequences
properly
23
Alignment Scoring
General Alignment Methodology
  • 1. Assume independent mutation model
  • 2. Score at each position
  • Positive if the same/similar
  • Negative if different or gap
  • 3. The score of an alignment is the sum of the
    position scores

24
Scoring Matrix
General Alignment Methodology
  • An n × n matrix: n = 4 for DNA, n = 20 for
    proteins
  • Each matrix entry defines the score for
    observing the two letters aligned to each other
  • Positive if the substitution is likely
  • Negative otherwise

25
Alignment Scoring
General Alignment Methodology
  • Different scoring → different best alignments
  • Scoring systems implicitly represent a particular
    theory of evolution
  • Some mismatches are more plausible than others
  • Transition vs. transversion
  • Lys→Arg ≠ Lys→Cys
  • Gap extension vs. gap opening

26
DNA scoring matrices
General Alignment Methodology
27
DNA scoring matrices
General Alignment Methodology
  • Transitions: purine to purine or pyrimidine to
    pyrimidine (4 possibilities)
  • Transversions: purine to pyrimidine or
    pyrimidine to purine (8 possibilities)
  • By chance alone, transversions should occur twice
    as often as transitions.
  • In practice, transitions are more frequent than
    transversions.
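
The counts of 4 transitions and 8 transversions can be checked by enumerating all ordered nucleotide pairs. The sketch below also builds a small DNA scoring matrix from the three classes. The scores (match +1, transition -1, transversion -2) and the function names are assumptions for illustration, not values taken from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify an ordered nucleotide pair."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"      # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"        # purine<->pyrimidine


def dna_scoring_matrix(match=1, transition=-1, transversion=-2):
    """4 x 4 matrix as a dict keyed by (x, y); default scores are assumptions."""
    values = {"match": match, "transition": transition,
              "transversion": transversion}
    return {(x, y): values[substitution_type(x, y)]
            for x in "ACGT" for y in "ACGT"}


counts = {"match": 0, "transition": 0, "transversion": 0}
for x in "ACGT":
    for y in "ACGT":
        counts[substitution_type(x, y)] += 1

matrix = dna_scoring_matrix()
print(counts)
```

Penalizing transversions more heavily than transitions is one way to encode the observation that transitions are more frequent in practice.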

28
General Alignment Methodology
DNA scoring matrices
(Figure: DNA scoring matrix with match, transition and transversion entries labeled)
29
General Alignment Methodology
Proteins scoring matrices
30
General Alignment Methodology
Proteins scoring matrices
  • Observation: some substitutions are more
    frequent than others, e.g., between chemically
    similar amino acids
  • As for DNA, protein matrices define the
    probabilities of change between the different
    amino acids
  • Popular matrices are based on empirical data:
    PAM
    BLOSUM

T L Y D K
T L Y D K
T L Y E K
T L Y D K
T L Y Q K
T L Y D K
In the fourth column, E and D are found in 7 / 8
31
PAM Matrices
General Alignment Methodology
  • PAM matrices are based on sequences with 85%
    identity.
  • The changes are those accepted by natural
    selection
  • 1 PAM unit: 1 accepted point mutation per 100
    residues.
  • Multiplying PAM1 by itself gives higher PAM
    matrices that are suitable for larger
    evolutionary distances.
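
The idea of multiplying PAM1 by itself can be sketched with a toy example. The 3-letter matrix below is invented for illustration (each letter stays unchanged with probability 0.99); it is not the real 20 × 20 PAM1 matrix, and the helper names are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": 1% chance of change per position (hypothetical numbers).
toy_pam1 = [[0.990, 0.005, 0.005],
            [0.005, 0.990, 0.005],
            [0.005, 0.005, 0.990]]

toy_pam250 = mat_power(toy_pam1, 250)
print([round(p, 4) for p in toy_pam250[0]])
```

After 250 steps the probability of a letter staying unchanged has dropped far below 0.99, which is why higher PAM numbers suit larger evolutionary distances. Real PAM scoring matrices are then derived from such probabilities as log-odds scores.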

32
General Alignment Methodology
PAM250
33
General Alignment Methodology
PAM Matrices
34
General Alignment Methodology
BLOSUM Matrices
  • Based on the BLOCKS database
  • 2000 blocks from 500 families of related
    proteins
  • Blocks: short conserved patterns of 3-60 aa
    without gaps
  • Different BLOSUMn matrices are calculated
    independently from BLOCKS
  • BLOSUMn: sequences sharing at least n percent
    identity are clustered and counted as one, so
    the matrix reflects substitutions between
    sequences less than n percent identical

35
General Alignment Methodology
BLOSUM Matrices
  • Low BLOSUM numbers for distant sequences
  • High BLOSUM numbers for similar sequences
  • Generally:
  • BLOSUM62 for general use
  • BLOSUM80 for close relations
  • BLOSUM45 for distant relations

36
General Alignment Methodology
Proteins scoring matrices
Closer sequences
  • PAM100 BLOSUM90
  • PAM120 BLOSUM80
  • PAM160 BLOSUM60
  • PAM200 BLOSUM52
  • PAM250 BLOSUM45

Distant sequences
37
General Alignment Methodology
Proteins scoring matrices
  • BLOSUM matrices are based on the replacement
    patterns found in more highly conserved regions
    of the sequences without gaps
  • PAM matrices are based on mutations observed
    throughout a global alignment, which includes
    both highly conserved and highly mutable regions

39
Local vs. Global
Pairwise Alignment
  • Global alignment: finds the best alignment
    across the whole of the two sequences.

ADLGAVFALCDRYFQ
ADLGRTQN-CDRYYQ

  • Local alignment: finds regions of similarity in
    parts of the sequences.

ADLG CDRYFQ
ADLG CDRYYQ
40
Pairwise Alignment
Global: Needleman-Wunsch (1970)
  • Involves an iterative matrix method of
    calculation
  • All possible pairs of residues are represented
    in a two-dimensional array
  • All possible alignments are represented as
    pathways through this array
Needleman, S. B. and Wunsch, C. D., 1970
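
The matrix method can be illustrated with a short dynamic-programming sketch in Python. It is a simplified variant, not the exact 1970 formulation: it assumes a match/mismatch score instead of a scoring matrix and a linear gap penalty (the same cost for every gap position). The default values and the function name are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming (linear gap penalty)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,        # gap in b
                              score[i][j - 1] + gap)        # gap in a
    # Trace one optimal pathway back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, nw_a, nw_b = needleman_wunsch("ACTG", "AGT")
print(nw_score)
print(nw_a)
print(nw_b)
```

Each cell of the array corresponds to a pair of residues, and the traceback follows one of the highest-scoring pathways through it.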
41
Pairwise Alignment
Local: Smith-Waterman (1981)
  • Makes an optimal alignment of the best segment
    of similarity between two sequences
  • Suited to sequences that differ overall but
    contain regions that are highly similar
  • Use when one sequence is short and the other is
    very long
  • Can return a number of highly aligned segments
Smith, T.F. and Waterman, M.S., 1981
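
A minimal sketch of local alignment in the same style. Compared with a global alignment, cell scores are never allowed to fall below zero and the traceback starts at the highest-scoring cell. The scores (match +2, mismatch -1, linear gap -2), the function name and the test sequences are assumptions for illustration, and only the single best segment is returned.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b (linear gap penalty)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                            # start a new segment
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


sw_score, sw_a, sw_b = smith_waterman("XXXACGTYYY", "ZZACGTWW")
print(sw_score, sw_a, sw_b)
```

Only the shared segment is reported; the dissimilar flanking regions are left out of the alignment.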
42
Alignment in reality
Pairwise Alignment
  • Main use of pairwise alignment: finding a
    sequence in a large database
  • Databases are huge → heuristic procedures are
    used in the search
  • Exact methods take a lot of time
  • Exact methods take a lot of memory
  • Heuristics find local alignments
  • In use in BLAST, FASTA

43
User Input
Pairwise Alignment
  • Pair of sequences
  • Local or global alignment
  • Scoring:
    gap penalties (opening/extension)
    scoring matrix

44
Scoring
Pairwise Alignment
  • The final score of the alignment is the sum of
    the positive scores and the penalty scores:

  + Number of identities       (scoring matrix)
  + Number of similarities     (scoring matrix)
  - Number of gap insertions   (gap penalties)
  - Number of gap extensions   (gap penalties)
  = Alignment score
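
As a rough sketch, the final score of a given alignment can be computed by walking over its columns and distinguishing gap insertions from gap extensions. A simple match/mismatch score stands in for the scoring matrix here, and all numeric values, the function name and the example alignment are assumptions.

```python
def score_with_affine_gaps(a, b, match=1, mismatch=-1,
                           gap_open=-5, gap_extend=-1, gap="-"):
    """Score an existing pairwise alignment with separate gap costs.

    Assumes no column has a gap in both sequences.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    previous = None  # sequence that had a gap in the previous column
    for x, y in zip(a, b):
        if x == gap or y == gap:
            current = "a" if x == gap else "b"
            # Continuing a gap in the same sequence is an extension;
            # anything else is a new gap insertion.
            total += gap_extend if current == previous else gap_open
            previous = current
        else:
            total += match if x == y else mismatch
            previous = None
    return total


final_score = score_with_affine_gaps("ACGT--AC", "AC-TTTAC")
print(final_score)
```

In this example there are five identities, two gap insertions and one gap extension, so the score is 5 - 10 - 1 = -6.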
46
Multiple Sequence Alignment
Pairwise Vs. Multiple Sequence Alignment
Alignments help to analyze sequence data: organize
and visualize it.
Pairwise: for 2 sequences
F G K ? G K G
F G K F G K G
MSA: for more than 2 sequences
F G K ? G K G
F G K F G K G
- G K Q G K G
- - K F G K G
47
Multiple Sequence Alignment
Rules For Choosing Sequences
  • Very similar sequences carry little information
  • Very different sequences cause trouble: <30%
    identical with more than half of the other
    sequences in the set
  • Choose sequences as distantly related as possible
  • Sequences 30-80% identical with more than half
    of the sequences in the set
  • The more sequences the better

48
Multiple Sequence Alignment
Similarity Score of MSA
  • Each position gets a value, depending on its
    identity.
  • The similarity score of the alignment is the sum
    of all position values.
  • A popular way to compute position values:
    SP - Sum of Pairs - each pair gets the score
    from the similarity matrix (PAM, BLOSUM).

Goal: find the MSA with the maximum similarity score.
Bad news: this problem is NP-hard.
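
The Sum of Pairs score can be sketched as follows. For brevity a match/mismatch score replaces the PAM or BLOSUM lookup, a residue paired with a gap gets a fixed penalty, and a gap paired with a gap scores zero. These conventions, the numeric values and the example rows are assumptions.

```python
from itertools import combinations


def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2, gap_char="-"):
    """Sum-of-pairs score of an MSA given as equal-length strings."""
    total = 0
    for column in zip(*msa):                   # one alignment position at a time
        for x, y in combinations(column, 2):   # every pair of rows
            if x == gap_char and y == gap_char:
                continue                       # gap-gap pair: scored as 0
            if x == gap_char or y == gap_char:
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


sp = sum_of_pairs(["FGKFGKG",
                   "-GKQGKG",
                   "--KFGKG"])
print(sp)
```

Scoring a given MSA is cheap; the hard part is searching over all possible MSAs for the one with the maximum score.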
49
Multiple Sequence Alignment
ClustalW- Introduction
  • This heuristic approach works because it uses
    the biological meaning of MSA
  • Based on the idea that the sequences we usually
    want to align are phylogenetically related
  • The first program to implement progressive MSA
  • Was introduced in 1994 and is still used today.

Thompson, J.D. et al, 1994
50
Multiple Sequence Alignment
ClustalW- Progressive Alignment
1. Quick pairwise alignment: calculate a distance
   matrix
   (Figure: distance matrix for Hbb_Human, Hbb_Horse,
   Hba_Human, Hba_Horse and Myg_Whale)
2. Build a guide tree using the NJ phylogenetic
   method
   (Figure: guide tree for the same five sequences)
3. Progressive alignment following the guide tree
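
Step 1 can be illustrated with a small sketch that turns pairwise comparisons into a distance matrix. To stay short it skips the pairwise alignment itself and assumes the sequences are already the same length, using distance = 1 - fraction of identical positions. The toy sequences and function names are hypothetical, and this is not ClustalW's actual distance calculation.

```python
from itertools import combinations


def identity_distance(a, b):
    """1 - fraction of identical positions (equal-length sequences assumed)."""
    if len(a) != len(b):
        raise ValueError("this sketch expects equal-length sequences")
    same = sum(x == y for x, y in zip(a, b))
    return 1.0 - same / len(a)


def distance_matrix(seqs):
    """Symmetric distance matrix as a dict of dicts keyed by sequence name."""
    names = list(seqs)
    dist = {name: {name: 0.0} for name in names}
    for p, q in combinations(names, 2):
        dist[p][q] = dist[q][p] = identity_distance(seqs[p], seqs[q])
    return dist


toy = {"seq1": "ACGT", "seq2": "ACGA", "seq3": "TTTT"}
matrix = distance_matrix(toy)
# The closest pair is the one a guide tree would join first.
closest = min(combinations(toy, 2), key=lambda pq: matrix[pq[0]][pq[1]])
print(matrix, closest)
```

The guide tree in step 2 is built from such a matrix, and the closest sequences are aligned first in step 3.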
51
Multiple Sequence Alignment
ClustalW- Progressive Alignment
(Figure: progressive alignment of sequences A, B, C and D)
52
Multiple Sequence Alignment
ClustalW- Additional Features
  • Sequence weighting
  • Each sequence gets a weight derived from the
    guide tree
  • Close sequences are down-weighted
  • Distant sequences receive high weights
  • The weights are normalized so that the highest
    is 1

W(Hbb_Human) = 0.081 + ½ × 0.226 + ¼ × 0.061 + 1/5 × 0.015 + 1/6 × 0.062 = 0.221
(Figure: guide tree with labels w1 to w7)
53
Multiple Sequence Alignment
ClustalW- Additional Features
Without weights:
Score = [M(t,v) + M(t,i) + M(l,v) + M(l,i) + M(k,v) + M(k,i) + M(k,v) + M(k,i)] / 8
With weights:
Score = [M(t,v)·W1·W5 + M(t,i)·W1·W6 + M(l,v)·W2·W5 + M(l,i)·W2·W6 + M(k,v)·W3·W5 + M(k,i)·W3·W6 + M(k,v)·W4·W5 + M(k,i)·W4·W6] / 8
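
The weighted column score can be written as a short function: every residue in one column is paired with every residue in the other, each pair score is multiplied by the two sequence weights, and the sum is divided by the number of pairs. The scoring function `toy_score`, the weights and the example columns below are hypothetical stand-ins for a real matrix M and for tree-derived weights.

```python
def profile_column_score(col_a, col_b, weights_a, weights_b, pair_score):
    """Average weighted pair score between two alignment columns."""
    total = 0.0
    for x, wx in zip(col_a, weights_a):
        for y, wy in zip(col_b, weights_b):
            total += pair_score(x, y) * wx * wy
    return total / (len(col_a) * len(col_b))


def toy_score(x, y):
    """Hypothetical stand-in for a scoring matrix lookup M(x, y)."""
    return 1 if x == y else -1


# A column from four aligned sequences scored against a column from two.
unweighted = profile_column_score("AAGG", "AG", [1] * 4, [1] * 2, toy_score)
weighted = profile_column_score("AAGG", "AG",
                                [1.0, 0.5, 0.5, 0.5], [1.0, 0.5], toy_score)
print(unweighted, weighted)
```

Setting all weights to 1 gives the unweighted score, so the same function covers both formulas.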
54
Multiple Sequence Alignment
ClustalW- Additional Features
  • Chooses different scoring matrices as the
    alignment proceeds, depending on the estimated
    divergence of the sequences at each stage.
  • Position-specific gap penalties: the penalty
    depends on the position in the alignment
    - Small penalty:
      1. In free loops (stretches of hydrophilic
         residues)
      2. At existing gaps
    - Big penalty: near existing gaps
  • Residue-specific gap penalties

55
Multiple Sequence Alignment
ClustalW- Problems
  • Sequences that are similar only in some smaller
    regions
    → ClustalW tries to find global alignments, not
    local ones.
  • A sequence that contains a large insertion
    compared to the rest
    → global, not local
  • A sequence that contains a repetitive element,
    while another sequence contains only one copy.
56
Multiple Sequence Alignment
MUSCLE- Introduction
  • The most recent popular MSA software
  • Considered to be the most accurate MSA software
    available today
  • The basic idea: progressive alignment
Edgar, R.C., 2004
57
Multiple Sequence Alignment
MUSCLE Innovations
  • Faster distance estimation between the input
    sequences
  • Faster construction of an evolutionary tree
    (UPGMA instead of NJ as in ClustalW)
  → faster
  • Applying a new score function to the profile
    alignments
  • Refinement of the initial results
  → more accurate
Edgar R.C., 2004
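
UPGMA, the tree-building method named above, can be sketched in a few lines: repeatedly join the two closest clusters and set the distance from the new cluster to every other cluster to the size-weighted average of the old distances. This is a plain textbook version that returns only the tree topology as nested tuples. It is not MUSCLE's implementation, and the toy distances are invented.

```python
def upgma(names, d):
    """Build a UPGMA tree from a symmetric distance matrix (list of lists)."""
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (names[i], 1) for i in range(len(names))}
    dist = {(i, j): d[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        (a, b), _ = min(dist.items(), key=lambda kv: kv[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster.
        merged = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


toy_names = ["A", "B", "C", "D"]
toy_d = [[0, 2, 6, 10],
         [2, 0, 6, 10],
         [6, 6, 0, 10],
         [10, 10, 10, 0]]
tree = upgma(toy_names, toy_d)
print(tree)
```

Each merge only needs averages of existing distances, which is part of why UPGMA is cheaper to compute than NJ.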
58
Multiple Sequence Alignment
MUSCLE Innovations- Refinement Step
  • An edge is chosen from the progressive alignment
    tree.
  • The tree is divided into two subtrees by deleting
    this edge.
  • The MSA from each subtree is computed by
    progressive alignment.
  • The two MSAs are aligned, generating an entire
    new MSA
  • If the new MSA achieves a higher score than the
    previous one → keep it

(Figure: the old MSA is split into MSA1 and MSA2, which are realigned to give the new MSA)
59
Multiple Sequence Alignment
MUSCLE- It's Even More Complicated
60
Multiple Sequence Alignment
T-Coffee- Introduction
  • Heuristic methods based on progressive alignment
    use a phylogenetic guide tree to gradually build
    the alignment.
  • → Errors made in the first alignments cannot be
    corrected later on as more sequences are added
  • T-Coffee attempts to minimize this effect by
    using data from all the sequences while
    building the alignment.
  • T-Coffee combines the best properties of global
    and local multiple alignments

Notredame et al., 2000
61
Multiple Sequence Alignment
T-Coffee-Alignment Stages
  • Create pairwise library
  • Primary library- global pairwise (ClustalW)
  • Extended library- 10 top scoring non intersecting
    local alignments
  • Assign weights to each residue pair using the
    extended library
  • Use weights to produce pairwise alignments
    instead of regular matrices
  • Producing a distance matrix and tree using the NJ
    method
  • MSA using the guide tree

62
Multiple Sequence Alignment
T-Coffee- Applications
  • Disadvantages
  • Except for the distance matrix, there is no
    possibility to choose parameters (like gap
    opening, gap extension).
  • Advantages
  • - Combination of local and global alignments
  • - Data from all the sequences is used while
    building the alignment
  • This makes the results more accurate and is
    especially good for alignments with low identity
    and big insertions → identification of motifs
    and domains

63
Multiple Sequence Alignment
All Against All- SH2 domains
T-coffee
MUSCLE
Edgar, R.C., 2004
64
Multiple Sequence Alignment
All Against All- BaliBase 2005
MUSCLE is superior!
Edgar, R.C., 2004
65
Thanks
This presentation was partly based on
presentations of Dr. Meytal Landau, Dr. Metsada
Pasmanik-Chor and the Pupko lab