Title: Alignments and Comparative Genomics
1Alignments and Comparative Genomics
2Welcome to CS374!
- Today
- Serafim: Alignments and Comparative Genomics
- Omkar: Administrivia
3Biology in One Slide Twentieth Century
and today
ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGAC
TACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
4Complete DNA Sequences
nearly 200 complete genomes have been sequenced
5Evolution
6Evolution at the DNA level
Deletion
Mutation
ACGGTGCAGTTACCA
SEQUENCE EDITS
AC----CAGTCCACCA
REARRANGEMENTS
Inversion
Translocation
Duplication
7Evolutionary Rates
[Figure: changes passed to the next generation, labeled OK, X, or "Still OK?"]
8Sequence conservation implies function
- Alignment is the key to
- Finding important regions
- Determining function
- Uncovering the evolutionary forces
9Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings $x = x_1 x_2 \dots x_M$ and $y = y_1 y_2 \dots y_N$, an alignment is an assignment of gaps to positions $0, \dots, M$ in $x$ and $0, \dots, N$ in $y$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
10What is a good alignment?
- Alignment
- The best way to match the letters of one sequence with those of the other
- How do we define best?
- Alignment
- A hypothesis that the two sequences come from a common ancestor through sequence edits
- Parsimonious explanation
- Find the minimum number of edits that transform one sequence into the other
11Scoring Function
- Sequence edits: AGGCCTC
- Mutations: AGGACTC
- Insertions: AGGGCCTC
- Deletions: AGG.CTC
- Scoring Function
- Match: $m$
- Mismatch: $-s$
- Gap: $-d$
- Score $F = (\#\,\text{matches}) \times m - (\#\,\text{mismatches}) \times s - (\#\,\text{gaps}) \times d$
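As a minimal sketch of the scoring function above, the following Python function counts matches, mismatches, and gap columns in two already-aligned rows and combines them with the weights m, s, and d. The function name `score_alignment` and the default values m = s = d = 1 are assumptions made for illustration; the slides do not fix them here.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1):
    """Return (#matches)*m - (#mismatches)*s - (#gaps)*d for two gapped rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1          # a letter lined up with a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


# The mutation example from the slide: 6 matches and 1 mismatch.
print(score_alignment("AGGCCTC", "AGGACTC"))
```

With the assumed unit weights, a single mutation costs one lost match plus one mismatch penalty, which is why parsimonious explanations score highest.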
12How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Too many possible alignments O( 2MN)
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
13Dynamic Programming
- Given two sequences $x = x_1 \dots x_M$ and $y = y_1 \dots y_N$
- Let $F(i, j)$ = score of the best alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$
- Then $F(M, N)$ = score of the best alignment
- Idea
- Compute $F(i, j)$ for all $i$ and $j$
- Do this by using $F(i-1, j)$, $F(i, j-1)$, $F(i-1, j-1)$
14Dynamic Programming (contd)
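- Notice three possible cases
- 1. $x_i$ aligns to $y_j$
-    $x_1 \dots x_{i-1} \;\; x_i$
-    $y_1 \dots y_{j-1} \;\; y_j$
-    $F(i, j) = F(i-1, j-1) + m$ if $x_i = y_j$, and $F(i, j) = F(i-1, j-1) - s$ if not
- 2. $x_i$ aligns to a gap
-    $x_1 \dots x_{i-1} \;\; x_i$
-    $y_1 \dots y_j \;\; -$
-    $F(i, j) = F(i-1, j) - d$
- 3. $y_j$ aligns to a gap
-    $x_1 \dots x_i \;\; -$
-    $y_1 \dots y_{j-1} \;\; y_j$
-    $F(i, j) = F(i, j-1) - d$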
15Dynamic Programming (contd)
- How do we know which case is correct?
- Inductive assumption
- $F(i, j-1)$, $F(i-1, j)$, $F(i-1, j-1)$ are optimal
- Then,
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
- Where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not
[Figure: cell (i, j) is filled in from cells (i-1, j-1), (i-1, j), and (i, j-1)]
16Example
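- x = AGTA, y = ATA
- m = 1, s = 1, d = 1 (a match scores +1; a mismatch scores -s = -1; a gap scores -d = -1)
- F(i, j) table: i = 0, 1, 2, 3, 4 across (letters of x), j = 0, 1, 2, 3 down (letters of y)
- Optimal alignment: F(4, 3) = 2
  AGTA
  A-TA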
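The table for this example can be recomputed directly from the recurrence. The sketch below assumes a match score of +1 and mismatch and gap scores of -1 (that is, m = s = d = 1 in the convention of the scoring-function slide); `fill_matrix` is a hypothetical helper name.

```python
def fill_matrix(x, y, m=1, s=1, d=1):
    """Fill F(i, j) for all 0 <= i <= len(x), 0 <= j <= len(y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d              # x1..xi aligned to gaps only
    for j in range(1, N + 1):
        F[0][j] = -j * d              # y1..yj aligned to gaps only
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,   # xi aligns to yj
                          F[i - 1][j] - d,         # xi aligns to a gap
                          F[i][j - 1] - d)         # yj aligns to a gap
    return F


F = fill_matrix("AGTA", "ATA")
# Print one row per j, with i running across, as on the slide.
for j in range(4):
    print([F[i][j] for i in range(5)])
# [0, -1, -2, -3, -4]
# [-1, 1, 0, -1, -2]
# [-2, 0, 0, 1, 0]
# [-3, -1, -1, 0, 2]
```

The bottom-right entry is F(4, 3) = 2, the score of AGTA aligned to A-TA.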
17The Needleman-Wunsch Matrix
Every nondecreasing path from (0, 0) to (M, N) corresponds to an alignment of the two sequences.
[Figure: the matrix, with x1 ... xM along one axis and y1 ... yN along the other]
18The Needleman-Wunsch Algorithm
- Initialization.
- $F(0, 0) = 0$
- $F(0, j) = -j \times d$
- $F(i, 0) = -i \times d$
- Main Iteration. Filling in partial alignments
- For each $i = 1, \dots, M$
- For each $j = 1, \dots, N$
$$F(i, j) = \max \begin{cases} F(i-1, j) - d & \text{[case 1]} \\ F(i, j-1) - d & \text{[case 2]} \\ F(i-1, j-1) + s(x_i, y_j) & \text{[case 3]} \end{cases}$$
- $Ptr(i, j)$ = UP if case 1, LEFT if case 2, DIAG if case 3
- Termination. $F(M, N)$ is the optimal score, and from $Ptr(M, N)$ we can trace back the optimal alignment
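The three phases above (initialization, main iteration with pointers, termination with traceback) can be sketched in Python as follows. This is an illustrative implementation under assumed unit scores (m = s = d = 1); ties between cases are broken in favor of DIAG, which is a choice of this sketch and not something the slides specify.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Return (optimal score, gapped x row, gapped y row)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # Initialization: a prefix aligned entirely to gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    # Main iteration: keep the best of the three cases and remember which.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sub, "DIAG"),  # xi aligns to yj
                (F[i - 1][j] - d, "UP"),          # xi aligns to a gap
                (F[i][j - 1] - d, "LEFT"),        # yj aligns to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Termination: follow pointers from (M, N) back to (0, 0).
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            row_x.append(x[i - 1]); row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "UP":
            row_x.append(x[i - 1]); row_y.append("-")
            i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1])
            j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

Both the table and the pointer matrix have (M+1)(N+1) entries, so this sketch uses time and memory proportional to MN.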
19Performance
20Alignment on a Large Scale
- Given a newly sequenced organism,
- Which subregions align with other organisms?
- Potential genes
- Other biological characteristics
- Assume we use Dynamic Programming
Our newly sequenced mammal: $3 \times 10^9$ letters
The entire genomic database: $10^{10}$ - $10^{11}$ letters
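A rough back-of-the-envelope calculation shows why full dynamic programming is impractical at this scale. The sequence sizes are taken from the slide; the throughput of 10^9 table cells per second is purely an assumed figure for illustration.

```python
query_len = 3 * 10**9               # newly sequenced mammal (from the slide)
db_low, db_high = 10**10, 10**11    # genomic database size range (from the slide)
cells_per_second = 10**9            # ASSUMPTION: illustrative DP throughput

# Full DP fills one table cell per (query position, database position) pair.
cells_low = query_len * db_low
cells_high = query_len * db_high

seconds_per_year = 365 * 24 * 3600
years_low = cells_low / cells_per_second / seconds_per_year
years_high = cells_high / cells_per_second / seconds_per_year

print(f"{cells_low:.1e} to {cells_high:.1e} table cells")
print(f"roughly {years_low:.0f} to {years_high:.0f} years at the assumed speed")
```

Even under this optimistic assumption the running time is measured in centuries, which motivates the index-based heuristics that follow.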
21Index-based Local Alignment
- Main idea
- Construct a dictionary of all the words in the query
- Initiate a local alignment for each word match between query and DB
- Running Time
- Theoretical worst case: O(MN)
- Fast in practice
[Figure: the query compared against the DB]
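The dictionary step can be sketched as below: index every length-k word of the query, then scan the database once and report each position pair where a word matches. The function names and the example strings are hypothetical, and only exact word matches are handled.

```python
from collections import defaultdict


def build_index(query, k):
    """Map each length-k word of the query to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query, db, k):
    """Return (query_pos, db_pos) pairs where a length-k word matches exactly."""
    index = build_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):
        for i in index.get(db[j:j + k], []):
            seeds.append((i, j))
    return seeds


query = "ACGAAGTAAGGTCCAGT"   # illustrative strings, not real data
db = "AGCGTTAGGTCCTTCCC"
print(find_seeds(query, db, k=4))  # [(8, 6), (9, 7), (10, 8)]
```

Building the index takes time linear in the query and the scan is linear in the database plus the number of hits; the O(MN) worst case arises only when almost every word matches almost everywhere.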
22 Index-based Local Alignment BLAST
- Dictionary
- All words of length k (11)
- Alignment initiated between exact-matching words
- (more generally, between words of alignment score ≥ T)
- Alignment
- Ungapped extensions until score falls below statistical threshold
- Output
- All local alignments with score > statistical threshold
[Figure: words of the query are scanned against the DB]
23Index-based Local Alignment BLAST
- Example: k = 4, T = 4
- The matching word GGTC initiates an alignment
- Extension to the left and right with no gaps until alignment falls < 50%
- Output:
  GTAAGGTCC
  GTTAGGTCC
[Figure: dot matrix with axis letters A C G A A G T A A G G T C C A G T and C C C T T C C T G G A T T G C G A]
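A minimal sketch of ungapped extension from a matching word follows. The stopping rule used here is a simple drop-off test (stop once the running score falls `xdrop` below the best seen in that direction), which is an assumption of this sketch and stands in for the slide's "falls < 50%" criterion. The database string is also an assumption, chosen so that it contains the slide's output GTTAGGTCC.

```python
def extend_ungapped(query, db, qpos, dpos, k, match=1, mismatch=-1, xdrop=2):
    """Extend the exact seed query[qpos:qpos+k] == db[dpos:dpos+k] without gaps.

    Returns (score, query segment, db segment) of the extended alignment.
    """
    def extend(qi, di, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= di < len(db):
            score += match if query[qi] == db[di] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length   # new best extension
            elif best - score >= xdrop:
                break                            # dropped too far; stop
            qi += step
            di += step
        return best, best_len

    right_score, right_len = extend(qpos + k, dpos + k, +1)
    left_score, left_len = extend(qpos - 1, dpos - 1, -1)
    total = k * match + left_score + right_score
    q_hit = query[qpos - left_len: qpos + k + right_len]
    d_hit = db[dpos - left_len: dpos + k + right_len]
    return total, q_hit, d_hit


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"      # assumed orientation of the DB string
# The word GGTC occurs at query position 9 and db position 7.
print(extend_ungapped(query, db, qpos=9, dpos=7, k=4))
# (7, 'GTAAGGTCC', 'GTTAGGTCC')
```

The extension keeps only the best-scoring prefix in each direction, so trailing mismatches explored before the stop are not included in the output.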
24Gapped BLAST
- Added features
- Pairs of words can initiate alignment
- Nearby alignments are merged
- Extensions with gaps until score < T below best score so far
- Output
- GTAAGGTCCAGT
- GTTAGGTC-AGT
[Figure: dot matrix with axis letters A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A]
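The first added feature, pairs of words initiating an alignment, can be sketched as a filter over word hits. The reading assumed here is that two non-overlapping hits must lie on the same diagonal (same value of db position minus query position) within a fixed distance of each other; the function name and the `window` parameter are hypothetical.

```python
from collections import defaultdict


def two_hit_seeds(hits, k, window):
    """Return pairs of non-overlapping word hits on one diagonal within `window`.

    hits: list of (query_pos, db_pos) exact matches of length-k words.
    """
    by_diagonal = defaultdict(list)
    for q, d in hits:
        by_diagonal[d - q].append(q)
    pairs = []
    for diag, qs in by_diagonal.items():
        qs.sort()
        for a, b in zip(qs, qs[1:]):
            if k <= b - a <= window:      # non-overlapping and close enough
                pairs.append(((a, a + diag), (b, b + diag)))
    return pairs


hits = [(0, 5), (6, 11), (20, 3)]          # illustrative hit positions
print(two_hit_seeds(hits, k=4, window=10))  # [((0, 5), (6, 11))]
```

Requiring two hits makes each seed more selective, which leaves room for shorter words and the more expensive gapped extension.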
25Example
- Query: gattacaccccgattacaccccgattaca (29 letters), 2 mins
- Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences); 1,726,556 sequences; 8,074,398,388 total letters
- >gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence
  Length = 144487
  Score = 34.2 bits (17), Expect = 4.5
  Identities = 20/21 (95%)
  Strand = Plus / Plus
  Query: 4      tacaccccgattacaccccga 24
                ||||||| |||||||||||||
  Sbjct: 125138 tacacccagattacaccccga 125158
  Score = 34.2 bits (17), Expect = 4.5
  Identities = 20/21 (95%)
  Strand = Plus / Plus
  Query: 4      tacaccccgattacaccccga 24
                ||||||| |||||||||||||
  Sbjct: 125104 tacacccagattacaccccga 125124
- >gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence
  Length = 139823
  Score = 34.2 bits (17), Expect = 4.5
  Identities = 20/21 (95%)
  Strand = Plus / Plus
26Efficient global alignment
27Global alignment with the chaining approach
- Find local alignments
- Chain them into a rough global map
- Align regions in-between
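The chaining step can be sketched as a small dynamic program over local alignments: pick the highest-scoring subset in which each alignment starts after the previous one ends in both sequences. This quadratic-time version is an illustrative stand-in and is not claimed to be the chaining procedure any particular tool uses; the tuple layout and function name are assumptions.

```python
def chain_local_alignments(hits):
    """Return the highest-scoring colinear chain of local alignments.

    hits: list of (x_start, x_end, y_start, y_end, score), ends exclusive,
    each hit non-empty.
    """
    hits = sorted(hits)                     # by x_start
    n = len(hits)
    if n == 0:
        return []
    best = [h[4] for h in hits]             # best chain score ending at each hit
    prev = [-1] * n
    for j in range(n):
        for i in range(j):
            # hit i may precede hit j only if it ends before j starts in x and y
            if hits[i][1] <= hits[j][0] and hits[i][3] <= hits[j][2]:
                if best[i] + hits[j][4] > best[j]:
                    best[j] = best[i] + hits[j][4]
                    prev[j] = i
    j = max(range(n), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(hits[j])
        j = prev[j]
    return chain[::-1]


hits = [(0, 10, 0, 10, 10), (12, 20, 30, 38, 8),
        (15, 25, 12, 22, 9), (30, 40, 25, 35, 10)]
for h in chain_local_alignments(hits):
    print(h)
# (0, 10, 0, 10, 10)
# (15, 25, 12, 22, 9)
# (30, 40, 25, 35, 10)
```

The hit at (12, 20, 30, 38) is left out because it is inconsistent with the later hits in y; the surviving chain is the rough global map, and only the regions between its members still need alignment.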
28LAGAN 1. FIND Local Alignments
- Find Local Alignments
- Chain Local Alignments
- Restricted DP
Mike Brudno, Chuong B Do, et al.
29LAGAN 2. CHAIN Local Alignments
- Find Local Alignments
- Chain Local Alignments
- Restricted DP
30LAGAN 3. Restricted DP
- Find Local Alignments
- Chain Local Alignments
- Restricted DP
31Restricted DP (contd)
- What if a box is too large?
- Recursive application of LAGAN, with a more sensitive word search
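To make the idea of "boxes" concrete, the sketch below takes a chain of anchors and lists the rectangular regions between consecutive anchors, where dynamic programming would still be needed, then flags the boxes whose area exceeds a threshold. This is a simplification for illustration only: it ignores any band around the anchors themselves, and the function names and the `max_area` threshold are hypothetical.

```python
def boxes_between_anchors(chain, M, N):
    """Return the (x0, x1, y0, y1) boxes left uncovered by a chain of anchors.

    chain: anchors as (x_start, x_end, y_start, y_end), ends exclusive,
    ordered along both sequences; M and N are the sequence lengths.
    """
    boxes = []
    px, py = 0, 0
    for xs, xe, ys, ye in chain:
        boxes.append((px, xs, py, ys))   # gap before this anchor
        px, py = xe, ye
    boxes.append((px, M, py, N))         # gap after the last anchor
    return boxes


def box_area(box):
    x0, x1, y0, y1 = box
    return (x1 - x0) * (y1 - y0)


chain = [(0, 10, 0, 10), (15, 25, 12, 22), (30, 40, 25, 35)]
boxes = boxes_between_anchors(chain, M=50, N=40)
restricted = sum(box_area(b) for b in boxes)
print(restricted, "cells instead of", 50 * 40)        # 75 cells instead of 2000

max_area = 20                                          # hypothetical threshold
too_large = [b for b in boxes if box_area(b) > max_area]
print(too_large)                                       # [(40, 50, 35, 40)]
```

A box flagged as too large is the natural place to search again for anchors with more sensitive settings before falling back to dynamic programming.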
32Multiple Alignment
34Scoring Function Sum Of Pairs
- Definition: Induced pairwise alignment
- A pairwise alignment induced by the multiple alignment
- Example
- x: AC-GCGG-C
- y: AC-GC-GAG
- z: GCCGC-GAG
- Induces
- x: ACGCGG-C   x: AC-GCGG-C   y: AC-GCGAG
- y: ACGC-GAG   z: GCCGC-GAG   z: GCCGCGAG
35Sum Of Pairs (contd)
- The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
- $S(m) = \sum_{k<l} s(m^k, m^l)$
- $s(m^k, m^l)$: score of the induced alignment $(k, l)$
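The definition can be sketched in Python by first projecting the multiple alignment onto each pair of rows (dropping columns where both rows have a gap) and then summing the pairwise scores. The pairwise scores m = s = d = 1 are assumed for illustration, and the function names are hypothetical.

```python
from itertools import combinations


def induced_pair(row_a, row_b):
    """Pairwise alignment induced by two rows: drop columns that are gap/gap."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def pair_score(row_a, row_b, m=1, s=1, d=1):
    """Score two aligned rows: +m per match, -s per mismatch, -d per gap."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score -= d
        elif a == b:
            score += m
        else:
            score -= s
    return score


def sum_of_pairs(rows, m=1, s=1, d=1):
    """Sum the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pair(a, b), m, s, d)
               for a, b in combinations(rows, 2))


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pair(x, y))        # ('ACGCGG-C', 'ACGC-GAG')
print(sum_of_pairs([x, y, z]))   # 2 + (-1) + 4 = 5
```

Under these assumed scores the three induced alignments (x, y), (x, z), and (y, z) contribute 2, -1, and 4 respectively.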
36Dynamic Programming for Multiple Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
37Progressive Alignment
- Multiple Alignment is NP-complete
- Most used heuristic: Progressive Alignment
- Algorithm
- Until all sequences are aligned
- Align two (multi-)sequences to each other, and treat the result as a new sequence
- Example: aligning AACTGTA with AATGTC gives
- AACTGTA
- AA-TGTC, with letters (AA), (AA), (C-), (TT), (GG), (TT), (AC)
- Running Time: $O(N L^2)$, where N = # of seqs, L = length of a seq
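A minimal sketch of the progressive idea: an alignment is stored as a list of columns (the "letters" of the new sequence), and each further sequence is aligned to it by Needleman-Wunsch over columns. Columns are scored by summing pairwise letter scores, with m = s = d = 1 assumed; sequences are added in the order given, with no guide tree. These choices and all names are assumptions of the sketch, and input sequences are assumed non-empty.

```python
def column_score(col_a, col_b, m=1, s=1, d=1):
    """Sum of pairwise scores between the letters of two columns."""
    total = 0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue                      # gap/gap pairs are not scored
            if a == "-" or b == "-":
                total -= d
            else:
                total += m if a == b else -s
    return total


def align_columns(A, B, m=1, s=1, d=1):
    """Needleman-Wunsch over two lists of columns; returns merged columns."""
    gap_a, gap_b = ("-",) * len(A[0]), ("-",) * len(B[0])
    M, N = len(A), len(B)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = F[i - 1][0] + column_score(A[i - 1], gap_b, m, s, d)
    for j in range(1, N + 1):
        F[0][j] = F[0][j - 1] + column_score(gap_a, B[j - 1], m, s, d)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + column_score(A[i - 1], B[j - 1], m, s, d),
                F[i - 1][j] + column_score(A[i - 1], gap_b, m, s, d),
                F[i][j - 1] + column_score(gap_a, B[j - 1], m, s, d))
    merged, i, j = [], M, N
    while i > 0 or j > 0:                     # trace back by re-checking cases
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + column_score(
                A[i - 1], B[j - 1], m, s, d):
            merged.append(A[i - 1] + B[j - 1]); i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + column_score(
                A[i - 1], gap_b, m, s, d):
            merged.append(A[i - 1] + gap_b); i -= 1
        else:
            merged.append(gap_a + B[j - 1]); j -= 1
    return merged[::-1]


def progressive_align(seqs, m=1, s=1, d=1):
    """Add sequences one at a time to a growing alignment; returns gapped rows."""
    profile = [(c,) for c in seqs[0]]
    for seq in seqs[1:]:
        profile = align_columns(profile, [(c,) for c in seq], m, s, d)
    return ["".join(col[r] for col in profile) for r in range(len(seqs))]


print(progressive_align(["AACTGTA", "AATGTC"]))  # ['AACTGTA', 'AA-TGTC']
```

Because earlier alignment decisions are never revisited, the result depends on the order in which sequences are added, which is why a tree-guided order is typically preferred.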
38MLAGAN Progressive Alignment
Human
Baboon
Mouse
Rat
- Given N sequences, phylogenetic tree
- Align pairwise, in order of the tree (LAGAN)
- With needed generalizations for multi-anchoring
scoring edit distance
39Evolution at the DNA level
Deletion
Mutation
ACGGTGCAGTTACCA
SEQUENCE EDITS
AC----CAGTCCACCA
REARRANGEMENTS
Inversion
Translocation
Duplication
40Local & Global Alignment
Global
Local
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
41Glocal Alignment Problem
- Find least cost transformation of one sequence into another using shuffle operations
- Sequence edits
- Inversions
- Translocations
- Duplications
- Combinations of above
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
42SLAGAN 1. Find Local Alignments
- Find Local Alignments
- Build Rough Homology Map
- Globally Align Consistent Parts
43SLAGAN 2. Build Homology Map
- Find Local Alignments
- Build Rough Homology Map
- Globally Align Consistent Parts
44SLAGAN 2. Build Homology Map
Chain using Sparse Dynamic Programming
- Penalties
- a) regular
- b) translocation
- c) inversion
- d) inverted translocation
45SLAGAN 2. Build Homology Map
- Find Local Alignments
- Build Rough Homology Map
- Globally Align Consistent Parts
46SLAGAN 3. Global Alignment
- Find Local Alignments
- Build Rough Homology Map
- Globally Align Consistent Parts
47SLAGAN Example Chromosome 20
- Human Chromosome 20 versus Mouse Chromosome 2
- 270 Segments of conserved synteny
- 70 Inversions
48SLAGAN example HOX cluster
- 10 paralogous genes
- Conserved order in Human/Mouse/Rat
49SLAGAN example HOX cluster
- 10 paralogous genes
- Conserved order in Human/Mouse/Rat
50Whole-genome alignment with SLAGAN
- Two-step Shuffle
- Shuffle for large-scale synteny map
- Shuffle each syntenic region for microrearrangements
51The ENCODE Project
52ENCODE regions shuffled
Hum/Rat
Hum/Mus
53ENCODE regions shuffled
Hum/Mus
Hum/Rat
54ENCODE regions shuffled
Hum/Rat
Hum/Mus
55ENCODE regions shuffled
Hum/Mus
Hum/Rat
56ENCODE regions shuffled
Hum/Rat
Hum/Mus
57Constrained Elements in Alignments
58Human-Mouse-Rat
Berkeley Genome Pipeline http://pipeline.lbl.gov
59Human-Mouse-Rat
60More DNA is coming