Alignment basics - PowerPoint PPT Presentation

Provided by: ianho9
1
Alignment basics
2
Why do we need alignment?
  • To predict function of proteins or RNAs
  • Complication: function evolves!
  • To predict structure of proteins or RNAs
  • a.k.a. homology modelling
  • General ('X and Y have the same fold')
  • Specific (comparative modeling)
  • To identify conserved elements
  • critical residues in proteins (active sites,
    binding pockets)
  • functional domains in proteins
  • protein-coding genes in genomes (Comparative
    genomics)
  • To study molecular evolution

In essence, alignment is the basic operation of
comparing sequences to see if and how they are
related.

3
Function Prediction
  • Function prediction by homology
  • a gene or protein is compared against other genes
    or proteins in a database
  • if a sequence can be detected whose similarity is
    statistically significant, the function of the
    unknown gene or protein is inferred.
  • first-order approximation of the molecular
    function of the proteins encoded in a genome
  • prioritize experimental investigation

4
Inference of function
Example: p53 tumor suppressor. Inactivated
(sequestered) by mdm2; activated by DNA damage.
Many experiments in mice, fruit flies, yeast (e.g.
response to DNA damage first demonstrated using
mouse homologue TP53; cell cycle mostly figured
out in yeast).
Many other examples of homology models (e.g.
fruit fly limb development).
5
Complications
  • Many proteins belong to large families.
  • Composed of subfamilies created by gene
    duplication events
  • Gene duplication allows one copy to assume a new
    biological role through mutation
  • Hence, subfamilies often differ in their
    biological functionality yet still exhibit a high
    degree of sequence similarity.
  • Other complications
  • Ignoring the multi-domain organization of
    proteins.
  • Error propagation
  • Insufficient masking of low complexity regions
  • Alternative splicing
  • Recombination, gene conversion

6
Why Do We Need Homology Modeling?
  • Ab Initio protein folding (random sampling)
  • 100 aa, 3 conf./residue gives approximately 10^48
    different overall conformations!
  • Random sampling is NOT feasible, even if
    conformations can be sampled at picosecond
    (10^-12 sec) rates.
  • Levinthal's paradox
  • Do homology modelling instead.
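
The arithmetic behind the bullets above can be checked directly. This is a minimal sketch using only the slide's figures (100 residues, 3 conformations per residue, one conformation sampled per picosecond); the variable names and the year length of 365.25 days are illustrative assumptions.

```python
# Back-of-the-envelope check of Levinthal's paradox with the slide's numbers.
RESIDUES = 100               # chain length (from the slide)
CONFS_PER_RESIDUE = 3        # conformations per residue (from the slide)
SAMPLES_PER_SECOND = 1e12    # one conformation per picosecond (from the slide)
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year

conformations = CONFS_PER_RESIDUE ** RESIDUES   # exact integer, 3^100
seconds = conformations / SAMPLES_PER_SECOND    # time for exhaustive sampling
years = seconds / SECONDS_PER_YEAR

print(f"conformations: about {conformations:.1e}")
print(f"exhaustive sampling: about {years:.1e} years")
```

3^100 is roughly 5 x 10^47, which the slide rounds to 10^48. Even at picosecond rates, exhaustive sampling would take on the order of 10^28 years.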

7
How Is It Possible?
  • The structure of a protein is uniquely determined
    by its amino acid sequence (but sequence is
    sometimes not enough)
  • prions
  • pH, ions, cofactors, chaperones
  • Structure is conserved much longer than sequence
    in evolution.
  • Structure > Function >> Sequence

8
How Is It Done?
  • Identify template(s); initial alignment
  • Improve alignment
  • Backbone generation
  • Loop modelling
  • Side chains
  • Refinement
  • Validation

9
Inference of structure by comparative (homology)
modeling
10
CASP competition
  • The main goal of CASP is to obtain an in-depth
    and objective assessment of our current abilities
    and inabilities in the area of protein structure
    prediction. To this end, participants will
    predict as much as possible about a set of soon
    to be known structures. These will be true
    predictions, not post-dictions made on already
    known structures.

11
Critical residue prediction
Example aquaporins EMBO Journal (2000)
12
Domain identification
13
What is Comparative Gene Finding?
Problem: Predict genes in a target genome S based
on the contents of S and also based on the
contents of one or more informant genomes I(1)...
I(n)

S   AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC
I1  AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC
I2  AAGGGAAGACATTTACGAGTCAAGCCACAGAAAGAGCCCCTGAG-----------GTGCC
I3  AAAGGAGGACATGTGAGGGCCAAACTACTGAAGGTTCAACCAGG-----------ATGCT
I4  AAGGGGAGACAGGGGAGGGTCACACCATGGCAGAGG--CCAAG------------ACAGC
I5  AAAGGAAACAATGGGAAGGTTA-TCAACTCCAAGTATGCCCAAGATCAAGGGAACCCCTT
I6  AAAGGAAACCACTGGGAGGTTA-GAAATCACAGGTGCACCCAAGATCAAGGAA--CCCCT

Rationale: Natural selection should operate more
strongly on protein-coding DNA than on the
non-functional junk DNA between genes.
Intervals of strongly conserved DNA should
therefore be more likely to contain functional
elements.
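
One simple way to make "strongly conserved" concrete is a per-column conservation score over the target and informant rows. The sketch below is an illustration only, not the method of any particular gene finder. The scoring rule (fraction of rows sharing the most common non-gap base) and the names `rows` and `column_conservation` are assumptions; the sequences are the S and I1 to I6 rows from the slide.

```python
from collections import Counter

# Target (S) and informant (I1..I6) rows from the slide; long gap runs are
# written as "-" * n so the column count is easy to verify.
rows = {
    "S":  "AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG" + "-" * 12 + "ACACC",
    "I1": "AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG" + "-" * 12 + "ACACC",
    "I2": "AAGGGAAGACATTTACGAGTCAAGCCACAGAAAGAGCCCCTGAG" + "-" * 11 + "GTGCC",
    "I3": "AAAGGAGGACATGTGAGGGCCAAACTACTGAAGGTTCAACCAGG" + "-" * 11 + "ATGCT",
    "I4": "AAGGGGAGACAGGGGAGGGTCACACCATGGCAGAGG--CCAAG" + "-" * 12 + "ACAGC",
    "I5": "AAAGGAAACAATGGGAAGGTTA-TCAACTCCAAGTATGCCCAAGATCAAGGGAACCCCTT",
    "I6": "AAAGGAAACCACTGGGAGGTTA-GAAATCACAGGTGCACCCAAGATCAAGGAA--CCCCT",
}

def column_conservation(sequences):
    """Per column: fraction of rows that carry the most common non-gap base."""
    n_rows = len(sequences)
    scores = []
    for column in zip(*sequences):
        bases = [base for base in column if base != "-"]
        top = Counter(bases).most_common(1)[0][1] if bases else 0
        scores.append(top / n_rows)
    return scores

scores = column_conservation(list(rows.values()))
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(f"{len(fully_conserved)} of {len(scores)} columns are identical in all rows")
```

A comparative gene finder would combine a signal like this with the target sequence itself; the score alone does not say whether a conserved interval is coding.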
14
How Does Conservation Help?
[Figure: gene models running from start codon (ATG) to stop codon (TGA) in A. fumigatus and A. nidulans, each carrying the labels 1, 2, 4, 3]
15
Molecular evolution
  • Questions like
  • Where did sequence X originate?
  • What is the phylogeny of X, Y and Z?
  • Does genome G contain any horizontally
    transferred sequences?
  • Are there any duplicated genes? What are their
    orthology/paralogy relationships?
  • What are the relative rates of various kinds of
    mutations (synonymous, nonsynonymous,
    frame-preserving, etc.)?

16
How to make alignments?
  • Visual inspection
  • dotplots
  • Manual editing
  • alignment editors
  • Automated methods
  • scoring schemes
  • dynamic programming algorithms

17
Dotplots
18
Dotplot vs self repetitive sequence
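
A dotplot can be sketched in a few lines. The version below is a minimal illustration, assuming an exact-match word of length `window` (a hypothetical parameter) and a made-up repetitive sequence. It marks every pair of positions where the two sequences share the same word. Plotted against itself, a repetitive sequence shows the main diagonal plus parallel off-diagonal lines spaced one repeat unit apart.

```python
def dotplot(seq_a, seq_b, window=3):
    """Return text rows; '*' marks positions i (row) and j (column) where the
    next `window` characters of seq_a and seq_b are identical."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows

repeat = "ACGTACGTACGT"  # made-up sequence with a 4-base repeat unit
for row in dotplot(repeat, repeat):
    print(row)
```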
19
Pairwise alignments terminology
  • Substitutions, insertions/deletions (indels)
  • Collinear alignment
  • Place gap characters (- or .) in the sequence
    so that homologous residues are aligned

1433E_SHEEP  REDLVYQAKLAEQ---YDEMVESMKKVAGMDV
BMH1_YEAST   REDSVYLAKLAEQAERYEEMVENMK--ASSGQ
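
The terminology above maps directly onto the columns of a gapped alignment. As a small sketch, assuming the two rows are the 1433E_SHEEP and BMH1_YEAST lines from the slide, the hypothetical helper `summarise_alignment` classifies each column as a match, a substitution, or an indel.

```python
def summarise_alignment(row_a, row_b, gap_chars="-."):
    """Count match, substitution and indel columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    counts = {"match": 0, "substitution": 0, "indel": 0}
    for x, y in zip(row_a, row_b):
        if x in gap_chars or y in gap_chars:
            counts["indel"] += 1          # a gap character in either row
        elif x == y:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts

sheep = "REDLVYQAKLAEQ---YDEMVESMKKVAGMDV"   # 1433E_SHEEP row from the slide
yeast = "REDSVYLAKLAEQAERYEEMVENMK--ASSGQ"   # BMH1_YEAST row from the slide
print(summarise_alignment(sheep, yeast))
```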
20
Pairwise alignment (DNA)
21
Pairwise alignment (DNA-protein)
     1  MetAlaSerLeuLeuProLeuLeuCysLeuCysValValAlaAlaHisLeuAlaGlyAlaAr    21
        MetAlaSerLeuLeuProLeuLeuCysLeuCysValValAlaAlaHisLeuAlaGlyAlaAr
  1045  ATGGCGTCGCTGCTGCCACTGCTCTGTCTCTGTGTCGTCGCTGCGCACCTGGCGGGGGCCCG  1105

    22  gA  >>>> Target Intron 1 >>>>  spAlaThrProThrGluGluProMetA    31
                     1265 bp
        gA                             spAlaThrProThrGluGluProMetA
  1106  AGgt.........................agACGCCACCCCCACCGAGGAGCCAATGG  2400

    32  laThrAlaLeuGlyLeuGluArgArgSerValTyrThrGlyGlnProSerProAlaLeuGlu    51
        laThrAlaLeuGlyLeuGluArgArgSerValTyrThrGlyGlnProSerProAlaLeuGlu
  2401  CGACTGCACTGGGCCTGGAAAGACGGTCCGTGTACACCGGCCAGCCCTCACCAGCCCTGGAG  2460

    52  AspTrpGluG  >>>> Target Intron 2 >>>>  luAlaSerGluTrpThrSe    61
                              426 bp
        AspTrpGluG                             luAlaSerGluTrpThrSe
  2461  GACTGGGAAGgt.........................agAGGCCAGCGAGTGGACGTC  2916

    62  rTrpPheAsnValAspHisProGlyGlyAspGlyAspPheGluSerLeuAlaAlaIleArgP    82
        rTrpPheAsnValAspHisProGlyGlyAspGlyAspPheGluSerLeuAlaAlaIleArgP
  2917  CTGGTTCAACGTGGACCACCCCGGAGGCGACGGCGACTTCGAGAGCCTGGCTGCCATCCGCT  2979

    83  heTyrTyrGlyProAlaArgValCysProArgProLeuAlaLeuGluAlaArgThrThrAsp   102
        heTyrTyrGlyProAlaArgValCysProArgProLeuAlaLeuGluAlaArgThrThrAsp
  2980  TCTACTACGGGCCAGCGCGCGTGTGCCCGCGACCGCTGGCGCTGGAAGCGCGCACCACGGAC  3039
22
Alignment path graph
23
Global vs local alignment
  • Global alignment
  • Entirety of sequences must match
  • Local alignment
  • Best match between subsequences
  • Semi-local etc.
  • Global w.r.t. query, local w.r.t. target
  • Alignment path graph views

24
Scoring schemes
  • Substitution scores
  • Simple, e.g. 5 for match, -4 for mismatch
  • More differentiated
  • e.g. different scores for
    transitions/transversions
  • Most flexible: substitution matrix
  • Gap penalties
  • Linear: fixed penalty per gap column
  • Affine: scores for opening & extending gaps
  • Local: gaps at the ends are 'free'

chimp  CT---AGGCAATCTGGAGTTTGGGTCATTCGGGGGGGTG
human  CTGACAGGCAATCT-GAGTTTGGG-TATTCGGGGGGGTG
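
To see how linear and affine gap penalties differ, the chimp/human alignment above can be scored both ways. This is a sketch under stated assumptions: the match and mismatch scores (5 and -4) come from the slide, while the gap values (-6 per column for linear, -10 to open and -1 to extend for affine) and the convention that the first gap column pays the opening score are illustrative choices. The function name `score_alignment` is hypothetical.

```python
def score_alignment(row_a, row_b, match=5, mismatch=-4,
                    gap_open=-10, gap_extend=-1):
    """Score a given pairwise alignment column by column.

    The first column of a gap run costs gap_open, each further column of the
    same run costs gap_extend. Setting gap_open == gap_extend gives a linear
    penalty (a fixed cost per gap column).
    """
    score = 0
    gap_row = None  # which row carried a gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            current = "a" if x == "-" else "b"
            score += gap_extend if gap_row == current else gap_open
            gap_row = current
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score

chimp = "CT---AGGCAATCTGGAGTTTGGGTCATTCGGGGGGGTG"
human = "CTGACAGGCAATCT-GAGTTTGGG-TATTCGGGGGGGTG"

linear = score_alignment(chimp, human, gap_open=-6, gap_extend=-6)
affine = score_alignment(chimp, human, gap_open=-10, gap_extend=-1)
print("linear:", linear, "affine:", affine)
```

The alignment has 33 matches, 1 mismatch and 5 gap columns in 3 gap runs. Under the affine scheme the three-column gap costs little more than a single-column gap, which is the point of separating opening from extension.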
25
Example: edit distance
  • Concept from information theory
  • Minimum number of edit operations required to
    change one string into another
  • Hamming distance
  • Each substituted character scores -1
  • Levenshtein distance
  • Each substituted, inserted or deleted character
    scores -1
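
Both distances are short to write down. The sketch below counts edits as positive costs (the slide's "scores -1" with the sign flipped), and the function names are illustrative. Hamming distance is only defined for strings of equal length; Levenshtein distance uses the standard row-by-row dynamic programme.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    previous = list(range(len(b) + 1))       # distance from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        current = [i]                        # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

print(hamming("GATTACA", "GACTATA"))
print(levenshtein("kitten", "sitting"))
```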

26
How might we make edit distance more realistic?
  • Each inserted or deleted character scores -3
  • DNA
  • Transition (A↔G or C↔T) scores -1
  • Transversion scores -2
  • English text
  • Vowel↔vowel or consonant↔consonant scores -1
  • Vowel↔consonant or consonant↔vowel scores -2
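
The DNA variant of this weighted scheme can be sketched by swapping the unit costs in the Levenshtein recurrence for the slide's values, written here as positive costs: 3 per inserted or deleted character, 1 per transition, 2 per transversion. The function names are hypothetical, and the code assumes the input uses only the characters A, C, G and T.

```python
PURINES = {"A", "G"}  # C and T are the pyrimidines

def substitution_cost(x, y):
    """0 for identical bases, 1 for a transition, 2 for a transversion."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return 1 if same_class else 2

def weighted_edit_distance(a, b, indel=3):
    """Minimum total cost of edits turning DNA string a into DNA string b."""
    previous = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        current = [i * indel]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + indel,                        # delete x
                current[j - 1] + indel,                     # insert y
                previous[j - 1] + substitution_cost(x, y),  # substitute
            ))
        previous = current
    return previous[-1]

print(weighted_edit_distance("ACGT", "GCGA"))
```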

27
Needleman-Wunsch
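
The transcript gives only the title of this slide. As a minimal sketch of the standard Needleman-Wunsch global alignment algorithm, the code below fills a dynamic programming table and traces back one optimal alignment. It uses the simple scores mentioned earlier (5 for a match, -4 for a mismatch); the linear gap penalty of -6 per column and the function name are assumptions.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-6):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one optimal alignment.
    """
    n, m = len(a), len(b)
    # table[i][j] = best score for aligning a[:i] with b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              table[i - 1][j] + gap,       # gap in b
                              table[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return table[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

Time and memory both grow with the product of the two sequence lengths. When several alignments tie for the best score, the traceback order above (diagonal first) decides which one is returned.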