Title: Identifying Patterns in DNA Change
1. Identifying Patterns in DNA Change
Jason Gilder
Bioinformatics Research Group, Wright State University
MAICS Presentation, April 12, 2003
2. ALU Background
- Well-known sequences broken into families
- Short Interspersed Repetitive Elements (SINEs)
- Approximately 280 bp long
- 10% of human genome
3. ALU Background (Cont.)
- Proliferate through retrotransposition
- - Copy is transcribed, reverse transcribed, and reinserted at a distant site
- Original progenitor sequence known
- Can trace evolutionary path
- - Number of changes
- - Types of changes
4. Problem
- Use evolutionary computation (EC) to predict substitution rates
- (e.g., the number of Cs that used to be As)
- Only use features of the repeat itself, not the progenitor
- - Content information for repeat
- - GC content in flanking regions
- Feasible?
- - Enough features?
- - Correct features?
5. Feature Set
- 16 Features
- - Length of repeat, number of As, number of Gs, number of Cs, number of Ts, and GC content percentage of ALU
- - GC content for 10 flanking regions (500 to 20,000 nts)
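A minimal sketch of how the 16-element feature vector might be assembled. The ten window sizes in `FLANK_SIZES` are hypothetical (the slide gives only the 500 to 20,000 nt range), and pooling the upstream and downstream flanks into one GC value per window is also an assumption.

```python
from collections import Counter

# Hypothetical window sizes: the slide states only "10 flanking regions"
# spanning 500 to 20,000 nts, not the individual sizes.
FLANK_SIZES = [500, 1000, 2000, 3000, 5000, 7500, 10000, 12500, 15000, 20000]


def gc_percent(seq):
    """GC content of seq as a percentage of its A/C/G/T bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def repeat_features(repeat):
    """Six content features: length, #A, #G, #C, #T, GC percentage."""
    repeat = repeat.upper()
    counts = Counter(repeat)
    return [len(repeat), counts["A"], counts["G"], counts["C"], counts["T"],
            gc_percent(repeat)]


def flanking_gc(chrom, start, end, sizes=FLANK_SIZES):
    """GC percentage of the flanks around chrom[start:end], one per window size.

    Assumption: the left and right flanks are pooled for each window.
    """
    features = []
    for size in sizes:
        left = chrom[max(0, start - size):start]
        right = chrom[end:end + size]
        features.append(gc_percent(left + right))
    return features


def feature_vector(chrom, start, end):
    """6 repeat features followed by 10 flanking GC features (16 total)."""
    return repeat_features(chrom[start:end]) + flanking_gc(chrom, start, end)
```

For a repeat at `chrom[start:end]`, `feature_vector(chrom, start, end)` returns the six content features followed by the ten flanking GC percentages.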
6. Data Set
- ALU Y Family in Chromosome 1
- CENSOR used to get substitution data
- http://www.girinst.org/Censor_Server.html
- 6,749 examples
- 5,000 Training (74%)
- 1,749 Holdout Testing (26%)
- Each chosen randomly
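A random training/holdout split of this size could be sketched as follows. The function name and the fixed seed are illustrative assumptions, not details from the talk.

```python
import random


def split_examples(examples, n_train=5000, seed=0):
    """Shuffle the examples and return (training set, holdout set)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

With 6,749 examples and `n_train=5000`, the holdout set has 1,749 members, matching the 74% / 26% split on the slide.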
7. CENSOR Alignments
ALUY 11 282 CONTIG-1P1 317456 317723 0.88 0.10 1.53 272 206.52

ALUY        GGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGA
CONTIG-1P1  GCTGGATGTC-CCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCATGAGGTCAGGAGATCGA

ALUY        GACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTGGTGGC
CONTIG-1P1  GACCATTCTGGCTAACACAGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCAGGCATGGTGGC

ALUY        GGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTT
CONTIG-1P1  ACACGCCTCTAGTCCCAACTACTCAGGAGGCTGACACAGGAGAATCACTTGGACCCGGGAGGTGGAGGTT

ALUY        GCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCA
CONTIG-1P1  GCAGTGAGCTGAGATCACGCCACTGCACTCCAGCCTGGGTGA-AAA--GAGACTCCGCCTCA

Containing 239 matches, 3 gaps, and 29 mismatches, including 19 transitions
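The summary line can be reproduced from the two aligned rows. The sketch below is not CENSOR's own code. It assumes a gap is counted once per run of `-` characters and that a transition is an A/G or C/T mismatch. The sequences are the slide's fragments joined in order, with the first row of each pair taken as ALUY.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def alignment_stats(ref, hit):
    """Count matches, mismatches, transitions, and gap runs in an alignment."""
    if len(ref) != len(hit):
        raise ValueError("aligned rows must have equal length")
    stats = {"matches": 0, "mismatches": 0, "transitions": 0, "gaps": 0}
    in_gap = False
    for a, b in zip(ref.upper(), hit.upper()):
        if a == "-" or b == "-":
            if not in_gap:          # count each run of '-' once
                stats["gaps"] += 1
            in_gap = True
            continue
        in_gap = False
        if a == b:
            stats["matches"] += 1
        else:
            stats["mismatches"] += 1
            if (a, b) in TRANSITIONS:
                stats["transitions"] += 1
    return stats


# Fragments copied from the slide and concatenated in order.
ALUY = (
    "GGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGA"
    "TCACGAGGTCAGGAGATCGA"
    "GACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAA"
    "AATTAGCCGGGCGTGGTGGC"
    "GGGCGCCTGTAGTCCCAGCTACT"
    "CGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTT"
    "GCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACA"
    "GAGCGAGACTCCGTCTCA"
)
CONTIG = (
    "GCTGGATGTC-CCTGTAATCCCAGCACTTTGGGAGG"
    "CCGAGGCGGGTGGATCATGAGGTCAGGAGATCGA"
    "GACCATTCTGGCTAACACAGTGAAACCCCGTCTCTACTAAAAATACAAAA"
    "AATTAGCCAGGCATGGTGGC"
    "ACACGCCTCTAGTCCCAACTA"
    "CTCAGGAGGCTGACACAGGAGAATCACTTGGACCCGGGAGGTGGAGGTT"
    "GCAGTGAGCTGAGATCACGCCACTGCACTCCAGCCTGGGTGA-AAA--GA"
    "GACTCCGCCTCA"
)

print(alignment_stats(ALUY, CONTIG))
# {'matches': 239, 'mismatches': 29, 'transitions': 19, 'gaps': 3}
```

Under these assumptions the counts agree with the slide's summary: 239 matches, 3 gaps, 29 mismatches, 19 transitions.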
8. Genetic Programming
- Equations built like parse trees
- Operator, input, and constant nodes
- (I6 35) 2.87
[Figure: parse tree with leaf nodes I6, 35, and 2.87]
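A parse-tree equation can be sketched as nested tuples and evaluated recursively. The operator symbols of the slide's example did not survive, so the `+` and `*` used here are placeholders, and the tuple representation is an illustrative choice rather than the talk's implementation.

```python
import operator

# Placeholder arithmetic operators; the operators in the slide's example
# expression are unknown.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(node, inputs):
    """Evaluate a tree made of operator tuples, input names, and constants."""
    if isinstance(node, tuple):          # operator node: (op, left, right)
        op, left, right = node
        return OPS[op](evaluate(left, inputs), evaluate(right, inputs))
    if isinstance(node, str):            # input node such as "I6"
        return inputs[node]
    return float(node)                   # constant node


# Same leaves as the slide's example; "+" and "*" are assumed.
tree = ("*", ("+", "I6", 35), 2.87)
value = evaluate(tree, {"I6": 5.0})      # (5.0 + 35) * 2.87
```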
9. GP Reproduction
[Figure: Parent 1 and Parent 2 recombine to produce Child 1 and Child 2; node labels I2, I4, 2.87, I6, and 35]
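Reproduction of this kind is commonly implemented as subtree crossover: pick one node in each parent and swap the subtrees rooted there. The sketch below assumes that reading of the figure and assumes trees stored as nested tuples `(op, left, right)`. The operators in the example parents are placeholders.

```python
import random


def paths(tree, prefix=()):
    """Yield the index path to every node in the tree (root is the empty path)."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(parent1, parent2, rng):
    """Swap one randomly chosen subtree between the two parents."""
    path1 = rng.choice(list(paths(parent1)))
    path2 = rng.choice(list(paths(parent2)))
    child1 = replace_subtree(parent1, path1, get_subtree(parent2, path2))
    child2 = replace_subtree(parent2, path2, get_subtree(parent1, path1))
    return child1, child2


def leaves(tree):
    """List the input and constant nodes of a tree, left to right."""
    if isinstance(tree, tuple):
        return [leaf for child in tree[1:] for leaf in leaves(child)]
    return [tree]


# Leaves follow the figure's labels; the operators are assumed.
parent1 = ("+", "I2", "I4")
parent2 = ("*", ("+", "I6", 35), 2.87)
child1, child2 = crossover(parent1, parent2, random.Random(1))
```

Because the swap only moves subtrees, the two children together contain exactly the nodes of the two parents, as in the figure.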
10. Operator Node Set
- Min: returns the minimum of two nodes or subtrees
- Max: returns the maximum of two nodes or subtrees
- Cos: if the connected nodes are x and y, it returns x Cos(y)
- Sin: if the connected nodes are x and y, it returns x Sin(y)
- Ave: if the connected nodes are x and y, it returns (x + y)/2
- Log: if the connected nodes are x and y, it returns x Log(y)
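The six binary operators might be written as below. Reading "x Cos(y)" as the product x * cos(y) is an assumption, as is the guarded logarithm (natural log of |y|, returning 0 when y is 0), which is a common GP convention and not something the slide states.

```python
import math


def op_min(x, y):
    return min(x, y)


def op_max(x, y):
    return max(x, y)


def op_cos(x, y):
    return x * math.cos(y)      # assumed to mean the product x * cos(y)


def op_sin(x, y):
    return x * math.sin(y)      # assumed to mean the product x * sin(y)


def op_ave(x, y):
    return (x + y) / 2


def op_log(x, y):
    # Assumption: protected log so that y <= 0 cannot raise an error.
    if y == 0:
        return 0.0
    return x * math.log(abs(y))


OPERATORS = {"Min": op_min, "Max": op_max, "Cos": op_cos,
             "Sin": op_sin, "Ave": op_ave, "Log": op_log}
```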
11. Mask Operator Nodes
- All features, with a mutable binary mask
- Summation: f_i m_i
- Multiplication: f_i m_i
- SumSquareRoot: f_i m_i
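One possible reading of the three mask operators, with features `f` and a binary mask `m`. The exact formulas are not recoverable from the slide, so each function below is an assumption: the product skips masked-out features (otherwise any zero mask bit would force the result to 0), and SumSquareRoot is read as a sum of square roots of the selected features.

```python
import math


def selected(features, mask):
    """Features whose mask bit is 1."""
    return [f for f, m in zip(features, mask) if m]


def mask_sum(features, mask):
    """Sum of f_i * m_i over all features."""
    return sum(f * m for f, m in zip(features, mask))


def mask_product(features, mask):
    """Product of the features selected by the mask (assumed reading)."""
    return math.prod(selected(features, mask))


def mask_sum_sqrt(features, mask):
    """Sum of sqrt(|f_i|) over the selected features (assumed reading)."""
    return sum(math.sqrt(abs(f)) for f in selected(features, mask))


def flip_bit(mask, i):
    """Mutate the mask by flipping bit i, returning a new mask."""
    return mask[:i] + [1 - mask[i]] + mask[i + 1:]
```

For example, with features `[4.0, 9.0, 16.0]` and mask `[1, 0, 1]`, these give 20.0, 64.0, and 6.0 respectively.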
12. Initial GP Results
- Classifying C -> G
- Fitness: average absolute error
- Classification rate: 46%
- Average absolute error: 0.75
13. Theta Factor Offset
- Average absolute error: 0.75
- If error < 0.5, correct classification
- Subtract theta from solution to get correct solution
- Using 0.30, classification rates jumped to 66%
- Linear search for best theta in [0, 1] in 0.01 increments
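The theta search described above can be sketched as a scan over 101 candidate offsets. The function names are illustrative. A prediction counts as correct when its absolute error after subtracting theta is below 0.5.

```python
def classification_rate(predictions, targets, theta=0.0):
    """Fraction of predictions within 0.5 of the target after subtracting theta."""
    correct = sum(1 for p, t in zip(predictions, targets)
                  if abs((p - theta) - t) < 0.5)
    return correct / len(targets)


def best_theta(predictions, targets, step=0.01):
    """Linear search over theta in [0, 1] for the highest classification rate."""
    n_steps = round(1 / step)
    candidates = [i * step for i in range(n_steps + 1)]
    return max(candidates,
               key=lambda theta: classification_rate(predictions, targets, theta))


# Toy data: every prediction overshoots its target by 0.75.
targets = [0, 1, 2, 3]
predictions = [t + 0.75 for t in targets]
theta = best_theta(predictions, targets)
```

On the toy data no prediction is correct at theta = 0, while an offset of 0.30 brings every error down to 0.45 and so classifies all four examples correctly.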
14. Initial Results
[Table: Progenitor Sequence vs. Alu; each cell gives (training classification, test classification)]
15. Regional GC Analysis
- All experiments redone utilizing only GC flanking content
- No ALU information used
16. Regional GC Analysis Results
- Features: regional GC content
[Table: Progenitor Sequence vs. Alu; previous classification rates in parentheses]
17. Context Analysis: Masking CpGs
- Substitution rates from Cs and Gs redone
- All CpGs were masked
- Removed some independent mutation factors
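A sketch of what masking CpGs before recounting substitutions could look like. It assumes masking means replacing every CG dinucleotide in the ungapped progenitor row with `N` so those positions are skipped. The slide does not say how masking was carried out, and the function names are hypothetical.

```python
import re
from collections import Counter


def mask_cpg(seq):
    """Replace every CG dinucleotide with NN (assumes seq contains no gaps)."""
    return re.sub("CG", "NN", seq.upper())


def substitution_counts(ref, hit):
    """Count substitutions by (progenitor base, repeat base), skipping N and gaps."""
    counts = Counter()
    for a, b in zip(ref.upper(), hit.upper()):
        if a in "ACGT" and b in "ACGT" and a != b:
            counts[(a, b)] += 1
    return counts


ref = "ACGTCA"   # toy progenitor containing one CpG
hit = "ATGTTA"   # toy repeat with two C -> T substitutions
unmasked = substitution_counts(ref, hit)
masked = substitution_counts(mask_cpg(ref), hit)
```

In the toy pair, the C to T change inside the CpG is counted only in the unmasked run, so the count drops from 2 to 1 after masking.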
18. Masked CpG Results
- Features: flanking GC with masked CpGs
[Table: Progenitor Sequence vs. Alu; previous classification rates in parentheses]
19. Conclusions
- Successfully predicted substitution rates
- Regional GC content holds needed information
- 10 / 12 rates > 80%
- 6 / 12 rates > 90%
- Future work: classifying entire genome
20. Acknowledgements
- Dr. Dan Krane
- Dr. Travis Doom
- Dr. Michael Raymer
- Dr. Mateen Rizki
Bioinformatics Research Group: http://birg.cs.wright.edu
This work was supported in part by the National
Science Foundation (grant EIA-0122582), and by
the Dayton Area Graduate Studies Institute.