Title: DNA
1DNA Sequence Comparison and Alignment
Nick Heppenstall (Biology)
Michal Dvir (Mathematics/CS)
Andrew Dittmore (Physics)
Under guidance of Dr. Yung-Pin Chen (Mathematics)
2Outline
- Pi in base 4
- DNA Overview
- Markov Chain Models
- Sequence Alignment
- Future Plans
3π
in base 10: 3.14159...
in base 4: 3.021003331...
4The Normality of Pi
Looking at π in base 4, if π is normal (which is
conjectured but not proven), the chance of seeing
2 is 1/4
22 is 1/16
222 is 1/64
2222 is 1/256
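The run probabilities above can be checked empirically. The following is a minimal sketch, not the authors' code: it computes base-4 digits of π with Machin's formula in integer arithmetic and compares the observed frequency of runs of 2s with 1/4, 1/16, 1/64 and 1/256. The function names, the guard-digit count and the choice of 2000 digits are assumptions made for illustration.

```python
def arctan_inverse(x, scale):
    """Return approximately scale * arctan(1/x) using integer arithmetic."""
    term = scale // x          # scale / x**(2k+1), starting at k = 0
    total = term
    x_squared = x * x
    k = 1
    sign = -1
    while term:
        term //= x_squared
        total += sign * (term // (2 * k + 1))
        sign = -sign
        k += 1
    return total


def pi_base4_digits(n, guard=16):
    """Return [3, d1, ..., dn]: the leading 3 and n fractional base-4 digits."""
    scale = 4 ** (n + guard)
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    pi_scaled = 16 * arctan_inverse(5, scale) - 4 * arctan_inverse(239, scale)
    pi_scaled //= 4 ** guard   # drop the guard digits, leaving about pi * 4**n
    digits = []
    for _ in range(n + 1):
        pi_scaled, d = divmod(pi_scaled, 4)
        digits.append(d)
    return digits[::-1]


def run_frequency(digits, digit, k):
    """Fraction of overlapping length-k windows that consist only of `digit`."""
    text = "".join(str(d) for d in digits)
    target = str(digit) * k
    windows = len(text) - k + 1
    hits = sum(text.startswith(target, i) for i in range(windows))
    return hits / windows


fraction_digits = pi_base4_digits(2000)[1:]
for k in range(1, 5):
    print("2" * k, run_frequency(fraction_digits, 2, k), "expected", 4 ** -k)
```

Overlapping windows are counted, so the observed values are only estimates of the ideal probabilities and will fluctuate for short digit strings.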
5Digits of π in base 4
3, 0, 2, 1, 0, 0, 3, 3, 3, 1, 2, 2, 2, 2, 0, 2,
0, 2, 0, 1, 1, 2, 2, 0, 3, 0, 0, 2, 0, 3, 1, 0,
3, 0, 1, 0, 3, 0, 1, 2, 1, 2, 0, 2, 2, 0, 2, 3,
2, 0, 0, 0, 3, 1, 3, 0, 0, 1, 3, 0, 3, 1, 0, 1,
0, 2, 2, 1, 0, 0, 0, 2, 1, 0, 3, 2, 0, 0, 2, 0,
2, 0, 2, 2, 1, 2, 1, 3, 3, 0, 3, 0, 1, 3, 1, 0,
0, 0, 0, 2, 0, 0, 2, 3, 2, 3, 3, 2, 2, 2, 1, 2, 0,
3, 2, 3, 0, 1, 0, 3, 2, 1, 2, 3, 0, 2, 0, 2, 1,
1, 0, 1, 1, 0, 2, 2, 0, 0, 2, 0, 1, 3, 2, 1, 2,
0, 3, 2, 0, 3, 1, 0, 0, 0, 1, 0, 3, 1, 3, 1, 3,
2, 3, 3, 2, 1, 1, 1, 0, 1, 2, 1, 2, 3, 0, 3, 3, 0,
3, 1, 0, 3, 2, 2, 1, 0, 0, 3, 0, 1, 2, 3, 0, 3,
0, 0, 0, 2, 2, 3, 0, 0, 2, 2, 1, 2, 3, 1, 3, 3,
0, 2, 1, 1, 3, 3, 0, 1, 1, 0, 0, 3, 1, 3, 1, 0,
3, 3, 3, 2, 0, 1, 0, 3, 1, 1, 1, 2, 3, 1, 1, 2,
3, 1, 1, 1, 0, 1, 3, 0, 0, 2, 1, 0, 1, 1, 3, 2,
1, 0, 2, 0, 1, 1, 2, 3, 1, 1, 1, 3, 1, 2, 1, 2,
0, 2, 1, 1, 3, 2, 1, 3, 3, 2, 3, 0, 1, 2, 3, 3,
1, 0, 1, 0, 3, 0, 1, 0, 0, 2, 3, 2, 2, 1, 2, 2,
1, 2, 0, 3, 1, 3, 3, 2, 3, 1, 1, 2, 2, 3, 0, 0,
2, 3, 3, 3, 3, 3, 1, 1, 3, 0, 2, 3, 1, 2, 3, 3,
1, 0, 0, 0, 1, 2, 2, 3, 1, 3, 3, 2, 3, 1, 3, 2,
3, 2, 0, 3, 2, 0, 1, 2, 2, 3, 3, 3, 2, 3, 1, 1,
2, 2, 2, 0, 2, 1, 2, 1, 3, 3, 2, 2, 1, 1, 2, 2,
3, 2, 2, 1, 3, 3, 0, 2, 1, 0, 0, 1, 0, 1, 1, 3,
3, 0, 1, 0, 2, 3, 0, 1, 3, 3, 3, 2, 1, 2, 1, 0,
2, 1, 0, 2, 2, 0, 1, 2, 1, 2, 1, 1, 0, 1, 3, 2,
3, 0, 3, 2, 1, 0, 1, 1, 2, 3, 0, 3, 3, 1, 3, 0,
0, 2, 0, 0, 0, 0, 1, 3, 3, 0, 2, 3, 2, 0, 2, 2,
0, 1, 1, 2, 0, 3, 2, 3, 3, 3, 0, 0, 1, 1, 2, 1,
2, 0, 3, 1, 2, 2, 1, 0, 2, 0, 0, 3, 1, 2, 0, 1,
3, 0 . . .
6First 5000 digits of π in base 4.
7Bases of the cowpox genome
t,a,g,t,a,a,a,a,t,t,a,a,a,t,t,a,a,t,t,a,t,a,a,a,a,
t,t,a,t,a,t,a,t,a,t,a,a,t,t,t,a,c,t,a,a,c,t,t,t,a,
g,t,t,a,g,a,t,a,a,a,t,t,a,a,t,a,a,t,a,t,a,t,a,a,g,
t,t,t,t,a,g,t,a,c,a,t,t,a,a,t,a,t,t,a,t,a,t,t,t,t,
a,a,a,t,a,t,t,t,t,a,t,t,t,a,g,t,g,t,c,t,a,g,a,a,a,
a,a,a,a,t,g,t,g,t,a,a,c,c,c,a,t,g,a,c,t,g,t,a,g,g,
a,a,a,c,t,c,t,a,g,a,g,g,g,t,a,a,g,a,a,a,g,a,t,c,g,a
,t,c,g,c,t,t,t,a,t,a,g,a,g,a,c,c,a,t,c,a,g,a,a,a,g
,a,g,g,t,t,t,a,a,t,a,t,t,t,t,t,g,t,g,a,g,a,c,c,a,t
,t,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,a,g,a,g,a,a,t,a
,a,a,a,a,t,a,t,t,t,t,a,g,t,g,a,c,t,c,c,a,t,c,a,g,a,
a,a,g,a,g,g,t,t,t,a,a,t,a,t,t,t,t,t,g,t,g,a,g,a,c,
c,a,t,t,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,a,g,a,g,a,
a,t,a,a,a,a,a,t,a,t,t,t,t,a,g,t,g,a,c,t,c,c,a,t,c,a
,g,a,a,a,g,a,g,g,t,t,t,a,a,t,a,t,t,t,t,t,g,t,g,a,g
,a,c,c,a,t,c,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,t,a,a
,a,a,a,t,a,t,t,t,t,t,g,t,a,a,a,a,c,t,t,t,t,t,t,a,t
,g,a,g,a,c,c,a,t,t,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a
,t,a,a,a,a,a,t,a,t,t,t,t,t,g,t,a,a,a,a,c,t,t,t,t,t,
t,a,t,g,a,g,a,c,c,a,t,t,g,a,a,g,a,g,a,g,a,a,a
8First 5000 bases of the cowpox genome.
9Three-Dimensional Trajectories
[Trajectory plots: Pi (random) and DNA]
Source: H.T. Chang, N. Lo, W. Lu, C.J. Kuo,
"Visualization and Comparison of DNA Sequences by
Use of Three-Dimensional Trajectories."
10DNA
- Deoxyribonucleic Acid
- Double helix
- Chain of nucleotide subunits
- Four bases in DNA (A,T,C,G)
- Hold information for maintaining life
- Passed from parent(s) to offspring
11Mutations
- Environmental factors
- Copying errors
- Single base substitutions
- Insertions/Deletions
- Duplications
- Translocations
- Inversions
12DNA sequence comparison
- Homologous genes
- Conserved sequences
- Identify mutations
- Forensics
- Evolution
QUANTITATIVE!
13Markov Chain
Definition: A collection of random variables
having the property that, given the present, the
future is conditionally independent of the past.
Example: Annual percentage migration between city
and country
[State diagram: two states, City and Country,
with transition probabilities 0.03, 0.97, 0.95
and 0.05]
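The city/country example can be iterated year by year to see where the population settles. This is a minimal sketch under an assumption: the transcript does not show which arrow carries which number, so the code assumes City keeps 0.95 and sends 0.05 to Country, while Country keeps 0.97 and sends 0.03 to City. The names `TRANSITIONS`, `step` and `long_run` are illustrative.

```python
# Assumed assignment of the four probabilities to the arrows (see lead-in).
TRANSITIONS = {
    "City": {"City": 0.95, "Country": 0.05},
    "Country": {"City": 0.03, "Country": 0.97},
}


def step(distribution, transitions):
    """Advance the population distribution by one year."""
    return {
        target: sum(distribution[source] * transitions[source][target]
                    for source in distribution)
        for target in transitions
    }


def long_run(distribution, transitions, years=1000):
    """Apply the yearly step repeatedly; only the current state is needed."""
    for _ in range(years):
        distribution = step(distribution, transitions)
    return distribution


start = {"City": 0.5, "Country": 0.5}
print(step(start, TRANSITIONS))      # distribution after one year
print(long_run(start, TRANSITIONS))  # approaches the stationary distribution
```

Under the assumed assignment the long-run share of City is 0.03 / (0.03 + 0.05) = 0.375, whatever the starting split.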
14Hidden Markov Model
A Hidden Markov Model is a Markov chain in which
each state (City/Country) generates an
observation or emission (Pet). The hidden state
can be inferred from the observed emissions.
Example: Annual percentage migration between city
and country, with pets as emissions
[State diagram: two states, City and Country,
with transition probabilities 0.03, 0.97, 0.95
and 0.05]
Emission probabilities (one row per state):
Cow 0.5 Dog 0.3 Cat 0.1 None 0.1
Cow 0.0 Dog 0.1 Cat 0.4 None 0.5
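To make "predicting the state from emissions" concrete, here is a minimal sketch of forward filtering: it tracks the probability of each hidden state after each observed pet. Several things are assumptions, since the transcript does not label them: the row with Cow 0.5 is taken to be Country and the row with Cow 0.0 to be City, City is assumed to keep 0.95 and Country 0.97, and the starting belief is assumed uniform.

```python
# Assumed labelling of the slide's numbers (see lead-in).
TRANSITIONS = {
    "City": {"City": 0.95, "Country": 0.05},
    "Country": {"City": 0.03, "Country": 0.97},
}
EMISSIONS = {
    "Country": {"Cow": 0.5, "Dog": 0.3, "Cat": 0.1, "None": 0.1},
    "City": {"Cow": 0.0, "Dog": 0.1, "Cat": 0.4, "None": 0.5},
}
START = {"City": 0.5, "Country": 0.5}  # assumed uniform prior


def filter_states(observations, start, transitions, emissions):
    """Return P(current state | all observations so far)."""
    belief = dict(start)
    for index, observed in enumerate(observations):
        if index > 0:
            # Predict: move one year along the hidden Markov chain.
            belief = {
                target: sum(belief[source] * transitions[source][target]
                            for source in belief)
                for target in belief
            }
        # Update: weight each state by how likely it is to emit the observation.
        belief = {state: belief[state] * emissions[state][observed]
                  for state in belief}
        total = sum(belief.values())
        belief = {state: value / total for state, value in belief.items()}
    return belief


print(filter_states(["Cow"], START, TRANSITIONS, EMISSIONS))
print(filter_states(["None", "Cat", "None"], START, TRANSITIONS, EMISSIONS))
```

Seeing a cow settles the question immediately under these numbers, because the assumed City row gives Cow a probability of 0.0.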
15HMM State Transitions
States: Match, Mismatch and InDel
16HMM Emissions
Emissions by state:
Match: A, C, G, T
Mismatch: A/C, A/G, A/T, C/G, C/T, G/T
InDel: A/-, C/-, G/-, T/-
Emissions: A, C, G and T
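The three states and their emissions can be read as a classification of each aligned column. The sketch below is illustrative only; the function name `classify_column` and the use of "-" for a gap are assumptions, and unordered pairs are written alphabetically to match the labels on the slide.

```python
BASES = "ACGT"


def classify_column(first, second):
    """Map one aligned column to a (state, emission) pair."""
    if first == "-" and second == "-":
        raise ValueError("a column cannot consist of two gaps")
    if first == "-" or second == "-":
        base = second if first == "-" else first
        return "InDel", base + "/-"
    if first not in BASES or second not in BASES:
        raise ValueError("expected one of A, C, G, T or '-'")
    if first == second:
        return "Match", first
    low, high = sorted((first, second))  # A/G and G/A are the same emission
    return "Mismatch", low + "/" + high


for column in [("A", "A"), ("G", "A"), ("-", "T")]:
    print(column, classify_column(*column))
```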
17Alignment/Comparison
- Mutations are recorded in DNA
- Allow for comparison/alignment
- Types of alignment
  - Local
  - Global
  - Gapped
  - Ungapped
18Scoring matrices
    A  C  G  T
A   1  0  0  0
C   0  1  0  0
G   0  0  1  0
T   0  0  0  1
Gap: -1
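The matrix above translates directly into a scoring function for two already-aligned strings. This is a minimal sketch; the names `SCORES`, `GAP` and `alignment_score` are illustrative, and "-" is assumed to mark a gap.

```python
BASES = "ACGT"
# Identity matrix from the slide: 1 for a match, 0 for a mismatch.
SCORES = {x: {y: int(x == y) for y in BASES} for x in BASES}
GAP = -1


def alignment_score(first, second):
    """Sum the column scores of two aligned sequences of equal length."""
    if len(first) != len(second):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(first, second):
        if x == "-" or y == "-":
            total += GAP
        else:
            total += SCORES[x][y]
    return total


print(alignment_score("GAGCAAA", "GAGCAAA"))
print(alignment_score("AC-T", "ACGT"))
```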
19Local alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
Mouse: GAGCAAA
20Local alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
Mouse: GAGCAAA
Score: 0100000 = 1
21Local alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
Mouse:  GAGCAAA
Score:  0010000 = 1
22Local alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
Mouse:   GAGCAAA
Score:   0010000 = 1
23Local alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
Mouse:          GAGCAAA
Score:          1111111 = 7
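The local alignment slides slide the short Mouse fragment along the Human sequence and score each position. A minimal sketch of that ungapped sliding-window search follows; the function name `ungapped_scores` is illustrative, and the scoring assumes 1 per match and 0 per mismatch as in the scoring matrix slide.

```python
def ungapped_scores(long_sequence, short_sequence):
    """Score the short sequence at every offset along the long one."""
    width = len(short_sequence)
    scores = []
    for offset in range(len(long_sequence) - width + 1):
        window = long_sequence[offset:offset + width]
        scores.append(sum(a == b for a, b in zip(window, short_sequence)))
    return scores


human = "TATGGTGGCGAGCAAACGTTGCGTGCGTA"
mouse = "GAGCAAA"
scores = ungapped_scores(human, mouse)
best_offset = scores.index(max(scores))
print(scores)
print("best offset:", best_offset, "score:", scores[best_offset])
print(human)
print(" " * best_offset + mouse)
```

For these two sequences the first three offsets each score 1, and the full score of 7 is reached at offset 9 (counting from 0).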
24Global alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
Mouse: CATTGTGGTGAGCAAAGCGGTGGGCGGGTA
25Global alignment
14 matches, 16 mismatches
Score: 14(1) + 16(0) = 14
26Global alignment
24 matches, 5 mismatches, 1 indel
Score: 24(1) + 5(0) + 1(-1) = 23
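A gapped global alignment score like the one above is usually found by dynamic programming. The slides do not say which method was used, so the following is only a sketch of the standard Needleman-Wunsch recurrence with the slide's scores (match 1, mismatch 0, gap -1); the function name and defaults are illustrative.

```python
def global_alignment_score(first, second, match=1, mismatch=0, gap=-1):
    """Best global alignment score, computed row by row."""
    # Row 0: aligning an empty prefix of `first` against j bases costs j gaps.
    previous = [j * gap for j in range(len(second) + 1)]
    for i in range(1, len(first) + 1):
        current = [i * gap]
        for j in range(1, len(second) + 1):
            pair = match if first[i - 1] == second[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair,  # align the two bases
                previous[j] + gap,       # gap in `second`
                current[j - 1] + gap,    # gap in `first`
            ))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "ACT"))
print(global_alignment_score("TATGGTGGCGAGCAAACGTTGCGTGCGTA",
                             "CATTGTGGTGAGCAAAGCGGTGGGCGGGTA"))
```

Only the score is returned here; recovering the alignment itself would need a traceback through the full table.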
27The scoring problem
28Our Solution
What if we align the DNA sequence to a model,
instead of another sequence?
29Why is this a solution?
Start with an initial model in which all
probabilities are equally likely. Then modify the
model recursively using one or more parent
sequences, so that the uniform starting values
are replaced by values fitted to the sequences.
Initial model:
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
After recursive modification:
0.92 0.03 0.05
0.18 0.69 0.13
0.14 0.19 0.67
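The slides do not spell out the update rule, so the following is only one possible sketch of "modifying the model": treat the current transition matrix as pseudo-counts, add the transitions observed in a labelled state path, and renormalise each row. The function name, the `weight` parameter and the example path are all assumptions made for illustration.

```python
STATES = ("Match", "Mismatch", "InDel")


def uniform_model():
    """Initial model: every transition is equally likely."""
    return {s: {t: 1 / len(STATES) for t in STATES} for s in STATES}


def update_transitions(model, state_path, weight=1.0):
    """Blend the current model with transition counts from one state path."""
    counts = {s: {t: weight * model[s][t] for t in STATES} for s in STATES}
    for previous, current in zip(state_path, state_path[1:]):
        counts[previous][current] += 1
    updated = {}
    for s in STATES:
        row_total = sum(counts[s].values())
        updated[s] = {t: counts[s][t] / row_total for t in STATES}
    return updated


model = uniform_model()
example_path = ["Match"] * 5  # hypothetical labelling of five columns
model = update_transitions(model, example_path)
for state in STATES:
    print(state, [round(model[state][t], 3) for t in STATES])
```

Calling `update_transitions` again with further paths modifies the model step by step, and each row stays a valid probability distribution.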
30How does it score?
- Modification number
- Length of original sequence
- Transition matrix
- Each emission matrix
ACTGTGTAG
The Model
- Match/Match
- Match/Mismatch
- Match/Indel
- Mismatch/Match
- ...
Without knowing the initial state, the algorithm
checks all possible state transitions and
emissions for a best fit to the model.
31How does it score?
ACTGTGTAG
- Modification number
- Length of original sequence
- Transition matrix
- Each emission matrix
The Model
- Match/Match
- Match/Mismatch
- Match/Indel
Now the previous state is defined, so we have
only 3 possible transitions to consider.
32How does it score?
- Modification number
- Length of original sequence
- Transition matrix
- Each emission matrix
ACTGTGTAG
The Model
- Mismatch/Match
- Mismatch/Mismatch
- Mismatch/Indel
This process will continue through the sequence,
calculating the score and remembering the best
fit to the model.
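The step-by-step search described on these slides resembles the Viterbi algorithm, which keeps the best-scoring path into each state at every position. Below is a minimal sketch under loud assumptions: the transition rows reuse the modified matrix from the earlier slide with an assumed order of Match, Mismatch, InDel; the emission probabilities are a uniform placeholder (the slides give no values); and the start probabilities are assumed equal.

```python
import math

STATES = ("Match", "Mismatch", "InDel")
# Assumed row/column order: Match, Mismatch, InDel.
TRANSITIONS = {
    "Match": {"Match": 0.92, "Mismatch": 0.03, "InDel": 0.05},
    "Mismatch": {"Match": 0.18, "Mismatch": 0.69, "InDel": 0.13},
    "InDel": {"Match": 0.14, "Mismatch": 0.19, "InDel": 0.67},
}
# Placeholder emissions: every state emits each base with probability 0.25.
EMISSIONS = {s: {base: 0.25 for base in "ACGT"} for s in STATES}
START = {s: 1 / 3 for s in STATES}


def best_path(sequence, start, transitions, emissions):
    """Return (log score, state path) of the best fit to the model."""
    score = {s: math.log(start[s]) + math.log(emissions[s][sequence[0]])
             for s in STATES}
    history = []  # for each later position: best previous state per state
    for base in sequence[1:]:
        came_from, new_score = {}, {}
        for s in STATES:
            # Only the transitions into s need to be compared.
            prev = max(STATES,
                       key=lambda p: score[p] + math.log(transitions[p][s]))
            came_from[s] = prev
            new_score[s] = (score[prev] + math.log(transitions[prev][s])
                            + math.log(emissions[s][base]))
        history.append(came_from)
        score = new_score
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for came_from in reversed(history):
        path.append(came_from[path[-1]])
    path.reverse()
    return score[last], path


log_score, path = best_path("ACTGTGTAG", START, TRANSITIONS, EMISSIONS)
print(round(log_score, 3), path)
```

With uniform placeholder emissions the result is driven entirely by the transition matrix; real emission matrices would let the sequence itself influence the path.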
33Future Plans
- Create working Hidden Markov Model.
- Find convergence as the Model is modified.
- Apply similar model to codon analysis.
- Develop DNA trajectories as an alternative
approach to sequence comparison.
34Modeling DNA with a Tetrahedron
35Directional Vectors
G
A
C
T
36AGTTCG
G
A
C
T
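A trajectory is built by taking one step per base along that base's directional vector. The sketch below assumes the four vectors point to the vertices of a regular tetrahedron centred at the origin; which base goes to which vertex is an arbitrary choice made here and may differ from the slides and from Chang et al.

```python
# Assumed assignment of bases to tetrahedron vertices (illustrative only).
VECTORS = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def trajectory(sequence):
    """Return the list of 3D points visited, starting at the origin."""
    points = [(0, 0, 0)]
    for base in sequence:
        x, y, z = points[-1]
        dx, dy, dz = VECTORS[base]
        points.append((x + dx, y + dy, z + dz))
    return points


for base, point in zip(" AGTTCG", trajectory("AGTTCG")):
    print(base, point)
```

Because the four vectors sum to zero, a stretch with equal numbers of A, C, G and T returns to where it started, while a biased stretch drifts in a direction that reflects its composition.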
50Change Points
51Approximate Vectors Between Change Points
52Quantify Regions Between Change Points
- Trajectory Length
  - Tells the base count
- Vector Direction
  - Tells the relative frequencies of each base
- Vector Length vs. Trajectory Length
  - Tells how much the trajectory deviates from a
    straight line
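The three quantities above can be computed for any stretch of sequence between two change points. This is a minimal sketch with assumed details: unit-length tetrahedral vectors (so that trajectory length equals the base count), an arbitrary assignment of bases to vertices, and a hypothetical function name `summarize_region`.

```python
import math

SCALE = 1 / math.sqrt(3)  # makes every step exactly one unit long
# Assumed assignment of bases to tetrahedron vertices (illustrative only).
VECTORS = {
    "A": (SCALE, SCALE, SCALE),
    "C": (SCALE, -SCALE, -SCALE),
    "G": (-SCALE, SCALE, -SCALE),
    "T": (-SCALE, -SCALE, SCALE),
}


def summarize_region(region):
    """Summarise one region of sequence between two change points."""
    if not region:
        raise ValueError("region must contain at least one base")
    # Net displacement from the start of the region to its end.
    net = [sum(VECTORS[base][axis] for base in region) for axis in range(3)]
    vector_length = math.sqrt(sum(component ** 2 for component in net))
    trajectory_length = len(region)  # one unit step per base
    return {
        "trajectory_length": trajectory_length,
        "vector": tuple(net),
        "straightness": vector_length / trajectory_length,
    }


for region in ["AAAA", "ACGT", "AATAAT"]:
    print(region, summarize_region(region))
```

A straightness of 1 means a single repeated base (a straight line), and a value near 0 means the bases are balanced so the trajectory curls back on itself.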
53DNA trajectories can be used to
- Match patterns by grouping similar vectors
- Find conserved regions (vectors that do not
change from sequence to sequence)
- Perform many local alignments to assemble global
alignments
54Thanks!
- Kellar Autumn
- Jeff Ely
- Amanda Gassett
- Deborah Lycan
- Harvey Schmidt
- Collin Trail
- Greg Hermann
- Matt Wilkinson
55Work supported by
- John S. Rogers Science Research Program