Title: Alignment basics
1Alignment basics
2Why do we need alignment?
- To predict function of proteins or RNAs
- Complication: function evolves!
- To predict structure of proteins or RNAs
- a.k.a. homology modelling
- General (X and Y have the same fold)
- Specific (comparative modeling)
- To identify conserved elements
- critical residues in proteins (active sites, binding pockets)
- functional domains in proteins
- protein-coding genes in genomes (comparative genomics)
- To study molecular evolution
In essence, alignment is the basic operation of comparing sequences to see if and how they are related.
3Function Prediction
- Function prediction by homology
- a gene or protein is compared against other genes or proteins in a database
- if a sequence can be detected whose similarity is statistically significant, the function of the unknown gene or protein is inferred from it
- first-order approximation of the molecular function of the proteins encoded in a genome
- helps prioritize experimental investigation
4Inference of function
Example: p53 tumor suppressor. Inactivated (sequestered) by MDM2; activated by DNA damage.
Many experiments in mice, fruit flies and yeast (e.g. the response to DNA damage was first demonstrated using the mouse homologue of TP53; the cell cycle was mostly figured out in yeast).
Many other examples of homology models (e.g. fruit fly limb development).
5Complications
- Many proteins belong to large families.
- Families are composed of subfamilies that arose by gene duplication events
- Gene duplication allows one copy to assume a new biological role through mutation
- Hence, subfamilies often differ in their biological function yet still exhibit a high degree of sequence similarity.
- Other complications
- Ignoring the multi-domain organization of proteins
- Error propagation
- Insufficient masking of low complexity regions
- Alternative splicing
- Recombination, gene conversion
6Why Do We Need Homology Modeling?
- Ab initio protein folding (random sampling)
- 100 aa at 3 conformations per residue gives 3^100, approximately 10^48, different overall conformations!
- Random sampling is NOT feasible, even if conformations can be sampled at picosecond (10^-12 s) rates.
- Levinthal's paradox
- Do homology modelling instead.
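The arithmetic behind this slide can be checked in a few lines. The sketch below is only a back-of-the-envelope estimate using the slide's own numbers (100 residues, 3 conformations per residue, one conformation sampled per picosecond); the function name and the year length of 365.25 days are assumptions made for illustration.

```python
def levinthal_estimate(residues=100, conformations_per_residue=3,
                       samples_per_second=1e12):
    """Return (number of conformations, years needed to enumerate them all)."""
    seconds_per_year = 365.25 * 24 * 3600  # assumed year length
    total = conformations_per_residue ** residues
    years = total / samples_per_second / seconds_per_year
    return total, years


total, years = levinthal_estimate()
print(f"{total:.2e} conformations, {years:.2e} years of exhaustive sampling")
```

With these inputs the count is about 5 x 10^47 conformations and the enumeration time is on the order of 10^28 years, which is the point of Levinthal's paradox.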
7How Is It Possible?
- The structure of a protein is uniquely determined by its amino acid sequence (but sequence is sometimes not enough)
- prions
- pH, ions, cofactors, chaperones
- Structure is conserved much longer than sequence in evolution.
- Structure > Function >> Sequence
8How Is It Done?
- Identify template(s) and build an initial alignment
- Improve alignment
- Backbone generation
- Loop modelling
- Side chains
- Refinement
- Validation
9Inference of structure by comparative (homology)
modeling
10CASP competition
- The main goal of CASP is to obtain an in-depth
and objective assessment of our current abilities
and inabilities in the area of protein structure
prediction. To this end, participants will
predict as much as possible about a set of soon
to be known structures. These will be true
predictions, not post-dictions made on already
known structures.
11Critical residue prediction
Example: aquaporins (EMBO Journal, 2000)
12Domain identification
13What is Comparative Gene Finding?
Problem: Predict genes in a target genome S based on the contents of S and of one or more informant genomes I(1) ... I(n).
S   AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC
I1  AAGGGAAGACAGGTGAGGGTCAAGCCCCAGCAAGTGCACCCAG------------ACACC
I2  AAGGGAAGACATTTACGAGTCAAGCCACAGAAAGAGCCCCTGAG-----------GTGCC
I3  AAAGGAGGACATGTGAGGGCCAAACTACTGAAGGTTCAACCAGG-----------ATGCT
I4  AAGGGGAGACAGGGGAGGGTCACACCATGGCAGAGG--CCAAG------------ACAGC
I5  AAAGGAAACAATGGGAAGGTTA-TCAACTCCAAGTATGCCCAAGATCAAGGGAACCCCTT
I6  AAAGGAAACCACTGGGAGGTTA-GAAATCACAGGTGCACCCAAGATCAAGGAA--CCCCT
Rationale: Natural selection should operate more
strongly on protein-coding DNA than on the
non-functional junk DNA between genes.
Intervals of strongly conserved DNA should
therefore be more likely to contain functional
elements.
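One simple way to turn this rationale into numbers is to score each alignment column by how many informants agree with the target, then look for windows of high agreement. The sketch below is an illustration only, not a gene finder: the function names, the window width and the threshold are assumptions, and the rows in the usage lines are the first ten columns of S, I1 and I5 from the slide.

```python
def column_identity(target, informants):
    """Fraction of informant rows matching the target at each alignment column."""
    scores = []
    for col, base in enumerate(target):
        if base == "-":
            scores.append(0.0)  # a gap in the target is never counted as conserved
            continue
        hits = sum(1 for row in informants if row[col] == base)
        scores.append(hits / len(informants))
    return scores


def conserved_windows(scores, width, threshold):
    """Start indices of windows whose mean identity reaches the threshold."""
    starts = []
    for start in range(len(scores) - width + 1):
        window = scores[start:start + width]
        if sum(window) / width >= threshold:
            starts.append(start)
    return starts


# First ten alignment columns of S, I1 and I5 from the slide.
scores = column_identity("AAGGGAAGAC", ["AAGGGAAGAC", "AAAGGAAACA"])
print(scores)
print(conserved_windows(scores, width=5, threshold=0.8))
```

Real comparative gene finders combine such conservation signals with models of codon structure and splice sites; this sketch covers only the conservation part.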
14How Does Conservation Help?
[Figure: gene models from start codon (ATG) to stop codon (TGA) in A. fumigatus and A. nidulans, each with segments labelled 1 to 4]
15Molecular evolution
- Questions like
- Where did sequence X originate?
- What is the phylogeny of X, Y and Z?
- Does genome G contain any horizontally transferred sequences?
- Are there any duplicated genes? What are their orthology/paralogy relationships?
- What are the relative rates of various kinds of mutations (synonymous, nonsynonymous, frame-preserving, etc.)?
16How to make alignments?
- Visual inspection
- dotplots
- Manual editing
- alignment editors
- Automated methods
- scoring schemes
- dynamic programming algorithms
17Dotplots
18Dotplot vs self: repetitive sequence
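A dotplot places one sequence along each axis and marks every cell where the two sequences match. The sketch below is a minimal text version with a hypothetical `dotplot` helper; the `window` parameter (number of consecutive characters that must match) is an assumption added for illustration. Plotting a sequence against itself gives the main diagonal plus parallel off-diagonal lines wherever the sequence repeats.

```python
def dotplot(a, b, window=1):
    """Return one string per position in a; '*' marks a window match with b."""
    rows = []
    for i in range(len(a) - window + 1):
        row = "".join(
            "*" if a[i:i + window] == b[j:j + window] else "."
            for j in range(len(b) - window + 1)
        )
        rows.append(row)
    return rows


# Self-comparison of a short repetitive sequence.
for line in dotplot("ACACAC", "ACACAC"):
    print(line)
```

Larger windows suppress the scattered chance matches that dominate DNA dotplots at window size 1.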
19Pairwise alignments terminology
- Substitutions, insertions/deletions (indels)
- Collinear alignment
- Place gap characters (- or .) in the sequences so that homologous residues are aligned
1433E_SHEEP  REDLVYQAKLAEQ---YDEMVESMKKVAGMDV
BMH1_YEAST   REDSVYLAKLAEQAERYEEMVENMK--ASSGQ
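Once two rows are aligned, the terminology above maps directly onto column counts. The sketch below uses a hypothetical `summarize_alignment` helper to classify each column as a match, a substitution or a gap (indel) column; the two rows in the usage lines are the 14-3-3 example from the slide.

```python
def summarize_alignment(row_a, row_b, gap_chars="-."):
    """Count (matches, substitutions, gap columns) in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = substitutions = gap_columns = 0
    for x, y in zip(row_a, row_b):
        if x in gap_chars or y in gap_chars:
            gap_columns += 1
        elif x == y:
            matches += 1
        else:
            substitutions += 1
    return matches, substitutions, gap_columns


sheep = "REDLVYQAKLAEQ---YDEMVESMKKVAGMDV"
yeast = "REDSVYLAKLAEQAERYEEMVENMK--ASSGQ"
print(summarize_alignment(sheep, yeast))
```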
20Pairwise alignment (DNA)
21Pairwise alignment (DNA-protein)
   1 MetAlaSerLeuLeuProLeuLeuCysLeuCysValValAlaAlaHisLeuAlaGlyAlaAr 21
     MetAlaSerLeuLeuProLeuLeuCysLeuCysValValAlaAlaHisLeuAlaGlyAlaAr
1045 ATGGCGTCGCTGCTGCCACTGCTCTGTCTCTGTGTCGTCGCTGCGCACCTGGCGGGGGCCCG 1105

  22 gA  >>>> Target Intron 1 >>>>  spAlaThrProThrGluGluProMetA 31
                  1265 bp
     gA                             spAlaThrProThrGluGluProMetA
1106 AGgt.........................agACGCCACCCCCACCGAGGAGCCAATGG 2400

  32 laThrAlaLeuGlyLeuGluArgArgSerValTyrThrGlyGlnProSerProAlaLeuGlu 51
     laThrAlaLeuGlyLeuGluArgArgSerValTyrThrGlyGlnProSerProAlaLeuGlu
2401 CGACTGCACTGGGCCTGGAAAGACGGTCCGTGTACACCGGCCAGCCCTCACCAGCCCTGGAG 2460

  52 AspTrpGluG  >>>> Target Intron 2 >>>>  luAlaSerGluTrpThrSe 61
                          426 bp
     AspTrpGluG                             luAlaSerGluTrpThrSe
2461 GACTGGGAAGgt.........................agAGGCCAGCGAGTGGACGTC 2916

  62 rTrpPheAsnValAspHisProGlyGlyAspGlyAspPheGluSerLeuAlaAlaIleArgP 82
     rTrpPheAsnValAspHisProGlyGlyAspGlyAspPheGluSerLeuAlaAlaIleArgP
2917 CTGGTTCAACGTGGACCACCCCGGAGGCGACGGCGACTTCGAGAGCCTGGCTGCCATCCGCT 2979

  83 heTyrTyrGlyProAlaArgValCysProArgProLeuAlaLeuGluAlaArgThrThrAsp 102
     heTyrTyrGlyProAlaArgValCysProArgProLeuAlaLeuGluAlaArgThrThrAsp
2980 TCTACTACGGGCCAGCGCGCGTGTGCCCGCGACCGCTGGCGCTGGAAGCGCGCACCACGGAC 3039
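In a DNA-protein alignment the middle row is the target DNA translated codon by codon, so each amino acid sits over three nucleotides. The sketch below shows that translation step only, assuming the standard genetic code; it uses one-letter amino acid codes where the slide uses three-letter codes, and the names `CODON_TABLE` and `translate` are chosen for illustration. It does not handle introns or frameshifts.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 0; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# First 18 nucleotides of the target sequence at position 1045 on the slide.
print(translate("ATGGCGTCGCTGCTGCCA"))
```

The result corresponds to MetAlaSerLeuLeuPro, the start of the protein row on the slide.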
22Alignment path graph
23Global vs local alignment
- Global alignment
- Entirety of sequences must match
- Local alignment
- Best match between subsequences
- Semi-local etc.
- Global w.r.t. query, local w.r.t. target
- Alignment path graph views
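The difference between global and local alignment shows up as two small changes to the same dynamic programming recurrence. The sketch below computes scores only (no traceback) under assumed illustrative parameters: +5 match, -4 mismatch and a linear gap penalty of -3. With `local=True` the first row and column start at 0, cell values are floored at 0, and the best cell anywhere is returned; otherwise the bottom-right cell is returned.

```python
def alignment_score(a, b, match=5, mismatch=-4, gap=-3, local=False):
    """Best global (default) or local alignment score with linear gap costs."""
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)       # a local alignment may restart anywhere
                best = max(best, cell)    # and may end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


a, b = "GGGACGTCCC", "TTTACGTAAA"
print(alignment_score(a, b), alignment_score(a, b, local=True))
```

The two sequences share only the core ACGT, so the local score rewards that core alone while the global score also pays for the unrelated flanks.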
24Scoring schemes
- Substitution scores
- Simple, e.g. +5 for match, -4 for mismatch
- More differentiated
- e.g. different scores for transitions/transversions
- Most flexible: substitution matrix
- Gap penalties
- Linear: fixed penalty per gap column
- Affine: separate scores for opening and extending gaps
- Local: gaps at the ends are free
chimp  CT---AGGCAATCTGGAGTTTGGGTCATTCGGGGGGGTG
human  CTGACAGGCAATCT-GAGTTTGGG-TATTCGGGGGGGTG
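The effect of the gap model is easiest to see by scoring one fixed alignment under both schemes. The sketch below assumes that a gap run of length k costs one opening penalty plus k-1 extension penalties; the default values of -10 and -1 are illustrative assumptions, not values from the slide. Setting the opening and extension penalties equal gives the linear model.

```python
def score_alignment(row_a, row_b, match=5, mismatch=-4,
                    gap_open=-10, gap_extend=-1):
    """Score a given pairwise alignment with affine gap penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_side = None  # which row holds the gap run currently being extended
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            score += gap_extend if gap_side == side else gap_open
            gap_side = side
        else:
            gap_side = None
            score += match if x == y else mismatch
    return score


top, bottom = "AC---GT", "ACTTTGT"
print(score_alignment(top, bottom, gap_open=-3, gap_extend=-3))  # linear model
print(score_alignment(top, bottom))                              # affine model
```

Affine penalties make one long gap cheaper than several short ones of the same total length, which tends to match how insertions and deletions actually occur.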
25Example: edit distance
- Concept from information theory
- Minimum number of edit operations required to change one string into another
- Hamming distance
- Each substituted character scores -1
- Levenshtein distance
- Each substituted, inserted or deleted character scores -1
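Both distances can be written in a few lines. The sketch below returns the number of edits as a positive count; the slide's score is simply the negative of that count. Hamming distance is only defined for strings of equal length, while Levenshtein distance is computed here with the usual row-by-row dynamic programming table.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or keep
        prev = curr
    return prev[-1]


print(hamming("GATTACA", "GACTATA"), levenshtein("kitten", "sitting"))
```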
26How might we make edit distance more realistic?
- Each inserted or deleted character scores -3
- DNA
- Transition (A↔G or C↔T) scores -1
- Transversion scores -2
- English text
- Vowel↔vowel or consonant↔consonant scores -1
- Vowel↔consonant or consonant↔vowel scores -2
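The DNA variant of this scheme drops straight into the edit distance recurrence. The sketch below uses the slide's magnitudes as positive costs (transition 1, transversion 2, indel 3), so the slide's score is the negative of the returned value; the function names are chosen for illustration and the input is assumed to contain only A, C, G and T.

```python
PURINES = {"A", "G"}


def substitution_cost(x, y):
    """0 for identity, 1 for a transition (A<->G, C<->T), 2 for a transversion."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return 1 if same_class else 2


def weighted_edit_distance(a, b, indel=3):
    """Minimum total cost of substitutions and indels turning a into b."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + indel,
                            curr[j - 1] + indel,
                            prev[j - 1] + substitution_cost(x, y)))
        prev = curr
    return prev[-1]


print(weighted_edit_distance("ACGT", "AGT"))
```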
27Needleman-Wunsch
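Needleman-Wunsch fills a score table for all prefix pairs and then traces back from the bottom-right corner to recover a global alignment. The sketch below is a minimal version with a linear gap penalty; the default scores (+5 match, -4 mismatch, -3 gap) are illustrative assumptions, and when several alignments are optimal the traceback returns just one of them.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-3):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Traceback: walk from the corner to the origin, re-deriving each step.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1]); row_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

The table needs time and memory proportional to the product of the two sequence lengths, which is fine for genes and proteins but not for whole genomes.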