Lior, Bernd - PowerPoint PPT Presentation

About This Presentation
Title:

Lior, Bernd

Slides: 32
Provided by: liorpa
Transcript and Presenter's Notes

Title: Lior, Bernd


1
Algebraic Statistics for Computational Biology
  • Lior, Bernd Seth

2
What is Biology? The study of living organisms.
What is Statistics? The science concerned with the collection, organization, analysis and interpretation of data.
What is Algebra? The part of mathematics that deals with generalized arithmetic.
3
What is Algebraic Statistics?
4
What is Algebraic Statistics? There is no
dictionary definition yet. The term was coined by
European statisticians interested in applying
Gröbner bases to the design of experiments. Their
book is G. Pistone, E. Riccomagno and H. Wynn,
Algebraic Statistics: Computational Commutative
Algebra in Statistics. CRC Press, 2000.
5
  • Table of Contents
  • Part I - Introduction to the four themes
  • 1. Statistics
  • 2. Computation
  • 3. Algebra
  • 4. Biology
  • Part II - Studies on the four themes
  • 5. Parametric Inference
  • 6. Polytope Propagation on Graphs
  • 7. Parametric Sequence Alignment
  • 8. Bounds for Optimal Sequence Alignment
  • 9. Inference Functions
  • 10. Geometry of Markov Chains
  • 11. Equations Defining Hidden Markov Models
  • 12. The EM Algorithm for Hidden Markov Models

New book:
Algebraic Statistics for Computational Biology,
edited by Lior Pachter and Bernd Sturmfels,
Cambridge University Press, 2005
6
Algebraic Statistics for Computational Biology
Group, Department of Mathematics, U.C. Berkeley
http://math.berkeley.edu/lpachter/ascb/
Photo courtesy of Robert Fisher Lawrence Hall of
Science March 7th, 2005
7
Who is this girl?
8
The human genome
Consists of 2.8 billion DNA bases.
Sequenced in 2001 and finished in 2004.
Contains genes:
  • these are subsequences which code for protein.
  • estimated number of genes: 20,000-25,000.
  • genes make up less than 5% of the genome.
Example: Breast-ovarian cancer susceptibility gene (BRCA1)
9
The human genome
10
The human genome
11
>hg17_dna range=chr17:38464686-38473085 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=none
ATCCAGAAGTCTAGTATACATCTCAAAATTCATGCATCTGGCCGGGCACAG
TGGCTCACACCTGCAATCCCAGCACTTTGGGAGGCCGAGGTGGGTGGATT
ACCTGAGGTCAGGAGTTTAAGACCAGCCTGGCCAACATGGTAAAACCCCA
TCTCTACTAAAAATACAAGTATTAGCCAGGCATTGTGGCAGGTGCCTGTA
ATCCCAGCTACTCGGGAGGCTGAGGCAGGAAAATCACTTGAACCGGGAGG
CGGAGGTTGGAGTGAGCTGAGATCGTGCTACCGCACTCCATGCACTCTAG
CCTGGGCAACAGAACGAGATGCTGTCACAACAACAACAACAACAACAACA
ACAACAACAACAACAACAACAAATTCTCACATCTAAAACAGAGTTCCTGG
TTCCATTCCTGCTTCCTGCCTTTCCCACTCCCCCATATTCCCTACCATGC
CTTCTTCATCTAATTTAATATTACTAACAAGATCTATTGTTCAAGCCAAA
ACCCAAGTGTCACTCCTTCAATTTCTCTTTACCTTATCCTCCAAATTTAA
TCCATTAGCAAGTCCTCTCTTCAAACCCATCCCAAACCAACCTTGTTTTT
AACCATCTCCACACCACCAATTACCACAAGGATAAAATCTGAATTCCTTA
CCACCAAATACTATGTGATCTGGCCCTCATCTATGACCTTCTCCCATTCC
TTGTGTAATCTCTGCCTCCACACATAATTTGCAAATTACTCCAGCTACAC
TGGCCTATTATTATTATTATTATTATTTTTGAGACGGAGTCTTGCTCTTT
CGCCCAGCCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAATCTCCGC
CTCCTGGGTTCAAGCGATTCTCCTGCCCCAGCCTCCCAAGTAGCTGTGAT
TACAGGCACATGCCACCATTCCCAGCTAATTTTTTTTTGTTTTTGAGATG
GAGTTTCACTCTTGTTGCCCAGGCTGGAGTGCAATGGTGCGATCTCAGCT
CACCACAACCTCCACCTCCCGGGTTGATGAAGTGATTCTCTTGTCTCAGC
CTCCCGTGTAGCTGGGATTAGAGGCACGCGCCACCACGCTGGGCAAATTT
TTGTATTTTTAGTAGAGACAGGGTTTCTACCTCAGTGATCTGTCCGCCTT
GACCTCCCAAAGTGCTGGGATTACAGGAATGAGCCACCACACCCAGCCGT
GCCCAGCTAATTTTTGCATTTTTTAGTAGAGATGGGGTTTTGCCACGTTG
GCCAGGCTGGTCTCAAACTCCTGACCTCAGGGGATCTGCCTGCCTCGGCC
TCCTAGAGTGCTGGAATTACAGGTGTGAGCCACTGTGCCCGAACCTTTTA
TCATTATTATTTCTTGAGACAGGAGTCTTGCTCTGTCGTTCAGGCTGGAG
TGCAGTGATGCGATCTTGGCTCACTGTAACTCCTACCTTTCGGTTCAAGT
GATTCTCCTGCCTCAGCCTCTGGAGTAGCTGGGATTACAGGCACTGGGAT
TACAGGCACACACCACCACACCATGCTAGTTTTTTGTATTTTTAGTAGAG
ATGGGGTTTCACCATGTTGGCCAGGCTGGTCTCGAACTCCTGACCTCAAG
TGATTTGCCTGCCTTGGCTTCCCAAAGTGCTGGGATTATAGGCACGAGCC
ACCACACACGACCAACATTGGCCTATCTTTTAAAAAATAAACCAAGCTCT
GGCCGGGCACAGTGGCTCACACCTGTGATCCCAGCACTTTGGGAGGTTGA
GGTGGTTGGATCACTTGAGTTCAGGAGTTTGAGACCAGCCTGACCAACGT
GGTAAAACCCCATCTCTACTAAAAATAAAAACTAGTCGGGTGTGGTAGCA
CGCGTGCCTGTAATACCAGCTACTCAGGAGGCCAAGGCAGGAGAATTGCT
TGAACCCAGGAGACAGAGTTTGCAGTGAGCCAAGATTGTGCCACTGCACT
CCAGCCTGGGGGATAGAGGGAGACACCATCTCAAAAAAACCAAAATACAG
AAATCAAAAAACCACACTCATTATTACCTCAAGACCTTTATGTTTGCTAT
TCCTCTGCCTATAAGATGCATTCCCTTCATTTTTCAAGGACAATTATTTC
TTGTTATTTAGGTCTCAGCTCAATTTTTTCAGAAAGGCTTTCCCTGGCCT
CCTTAAACGAAAGTAATCAACAACCTTTGACAGCTAATACTATTCCACTG
TTCTGTATATTTCTCCATAGCATTTATTGTTATCTTAAATTCATCTTTAT
TGTGTATCTCCCCTCGACAGAACCTGAATCCTACCAGGGACTTAGTTAGT
CTTATTTACTGTTGCATTCCTAGTGCCCAGAACACAGTAGGCTCCCAATA
AATAGCCACTGAATAAAAGTTAAAACCAACAAAAATAATCATTTAATTAA
TTATGAATACATCGAATTGTGCACAATAGTTTATAAAATTACTTTTTTTT
TTTTTTTAAGACAGGGTCTCATTCTGTCTCACAGGCTGGAGTGCAGTGGT
GCAATCTAGGCTCACTGCAACCTCCGCCTCCCGGGTTCAAGTGATTCTCC
TGCCTCAGCCTCCCCAGCAGCTAGGATTACAGGCACATGCCACCACGCTC
GACTAATTTTTTTGTGTTTTTAGTAGAGACAAGGTTTCACCATGTTGACC
AGGCTGGTCTCGAACTCCTGACCTCAAGTGATCCACCTGCCTTGGCCACT
CAAAGTGCTGGGATTATAGGCATGAGCCACCACGCCTGGCCTATAAAATT
ACTTTCACATTTCATTTTGCCTGATCTGTTGTCACAGAAGTTCTCAGATG
GCTGTTCTGAAATTATTCCTCCTCCTACACTCTATCTTATTTACTTCTCA
CTGTTCTCAGTATCATAAAGTGCAACATCTTTTTGAAGCAATCTGAATTA
TAAACAGATACATTTGCATGTATATATATGTATATATGCATATGCACACA
CACACTTTTTTTTTTTTAAGAGACAGGGTCTTGCTCTGTGCAAGTGCAAG
AGTGCAATGGTATGATCATAGCTCACTGCAGCCTTGAACTCCTGGGCTCA
AGTGATTCTTCTGGCTTAGCTTCCTCAGTAGCTAAGACTACAGAAGCACA
CTGCCATGCCCGGCTAATTAAAAAAAAATTTTGTGGAGACAGAGTCTCAC
TATGTTGCCCAGGCTGGTTTCAAACTCCTGGCCTCAAGTAATCTTCCTGT
CTCAGCCTCCCAAAGGGCTGAGATTATAAGTGTGAGCCACTGCATCTGGA
CTGCATATTAATATGAAGAGCTTTTCTTCAACAACAGTGAACAGTTTTCT
ACAAAGGTATATGCAAGTGGGCCCACTTCTTGTTCTTATGAATCTTTTCT
TTCCTTTTATAAAACTCCTTTTCCTTTCTCTTTTCCCCAAAGAAAGGACT
GTTTCTTTTGAAATCTAGAACAAATGAGAACAGAGGATATCCTGGTTTGC
GCTGCAAAATTTTTTTTTTTTTTAAGACGGAGTCTCGCTCTGTTGCCAGG
TTGGAGTGCAGTGGCACGATCTTGGCTCATTGCAACCTCCACCTCCCGGG
TTCAAGAGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGAACTAAAGGCG
CATGCCACCACGCTGAGTAATTTTTTGTATTTTAGTAGAGACAGGGTTTC
ACCATGTTGCCCAGGCTGATCTCGAACTCCTGAGCTCAGGCAATCTGCCT
GTCTTGGCCTCCCACAGTGTTAGGATTACAGGCATGAGCCACTGCACCCG
ATTTTTTTTTTCTTTTGATGGAGTTTTGCTCTTGTTGCCCAGGTTAGAGT
GCAATGATGCGATCTCAGCTCACTGCAACCCCCGCCTCCCAGGTTCAAGT
GATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGAATTACAGGCAAGTGCCA
CCAAGCCCGGCTAATTTTGTATTTTTAGTAGAAACGGGGTTTCTCCATGT
TGGTCAGGCTGGTCTTGAACTCCCGACATCAGGTGATCCAAGCGCCTCAG
CCTCCCAAAGCGCTGGGATTATAGGTATGAGCCACAGTGCAGGCCTGCAT
AATTCTTGATGATCCTCATTATCATGGAAAATTTGTGCATTGTTAAGGAA
AGTGGTGCATTGATGGAAGGAAGCAAATACATTTTTAACTATATGACTGA
ATGAATATCTCTGGTTAGTTTGTAACATCAAGTACTTACCTCATTCAGCA
TTTTTCTTTCTTTAATAGACTGGGTCACCCCTAAAGAGATCATAGAAAAG
ACAGGTTACATACAGCAGAAGAACGTGCTCTTTTCACGGAGATAGAGAGG
TCAGCGATTCACAAAAGAGCACAGGAAGAATGACAGAGGAGAGGTCCTTC
CCTCTAAAGCCACAGCCCTTTAATAAGGCTTGTAGCAGCAGTTTCCTTCT
GGAGACAGAGTTGATGTTTAATTTAAACATTATAAGTTTGCCTGCTGCAC
ATGGATTCCTGCCGACTATTAAATAAATCCCTAGCTCATATGCTAACATT
GCTAGGAGCAGATTAGGTCCTATTAGTTATAAAAGAGACCCATTTTCCCA
GCATCACCAGCTTATCTGAACAAAGTGATATTAAAGATAAAAGTAGTTTA
GTATTACAATTAAAGACCTTTTGGTAACTCAGACTCAGCATCAGCAAAAA
CCTTAGGTGTTAAACGTTAGGTGTAAAAATGCAATTCTGAGGTGTTAAAG
GGAGGAGGGGAGAAATAGTATTATACTTACAGAAATAGCTAACTACCCAT
TTTCCTCCCGCAATTCCTAGAAAATATTTCAGTGTCCGTTCACACACAAA
CTCAGCATCTGCAGAATGAAAAACACTCAAAGGATTAGAAGTTGAAAACA
AAATCAGGAAGTGCTGTCCTAAGAAGCTAAAGAGCCTCAGTTTTTTACAC
TCCCAAGATCAATCTGGATTTATGATTCTAAAACCCCTGGTGACAGAATC
AGAGGCTGAAAACACCACTAATTATAACCAGCAGGTATGGATATTTGGAA
GTCTAGGGGAGGCTGATATGAAGTTAAGACCAGAGGAAATATCTGTCCAC
TCCCTCTTCTCAACACCCATCTTCTAGACGCCAAGGCTAGCTATAGATCT
CCATTATAGTGTTCAAGGAATTAGGAATTATCCATGTCAATAGTTTTGAT
TAATGTGGACGGAGAACATCTATATTACTAGATGGCAATATGTGAAAGAA
GAAAACAGTATTGTTGAAAACCTAAATCTGAAATGTCAATGTAATGACAA
ATTTTCACCCCTAGAATGTCTACCTGGGGAGTCCTAACCCTCTAATATTC
CCCTGAGAGGGATGGGAGAATACAGTGCAGAGCTTTTATATAAGTATTTC
AGAAAGCAGTAGCTAAAGAATCACTTGTTTATTTCCCAGTGTTTCAAAGG
CCCTTCTGAAGAACTAAGCAAACTAAGGAAAGACCATTTAGTTTTAAACA
GGAGAAATGTATTTAACTAAATCCTAAACACAGCAGGCTATCTGCAAGCA
GCAGCAGCAGCAGCAGCCATGCTCCCTCACAGAATCCTTACAATTTTTGA
AGTTTTTTGTTTAACTGCTACAAAAGCCGATTTAGTAACATTTATTACAC
TTAAAAACTTCAGTTCATTTGTAGTTCAAAGCAAATGTATTGGCTTTGAG
TTTAAAGACTGAACTACTTTAGATTTGATTTGCATTTTTTTTTTTTTTTT
TTTTTGAGATGCAGTCTTGCTCTGTCAGCCAGGCTGGAGTGCAGTGGCTG
GATCTCAGCTCACGGCAAGCTCTGCCTCCTGGGTTCATGCCATTCTCCTG
CCTCAGCCTCCTGAGTAGCTGGGACTACAGATGCCCGCCACCATGCCCGG
CTAATTTTTTGTATTTTTACTAGAGATGGGGTTTCACCGTGTTAGCCAGG
ATGGTCTCGATCTCCTGACCTCGTGATCTGCCCGCCTTGGCCCCCCAAAG
CGCTGGGATTACAGGCCTGAGCCACCACGCTTGGCATCTTTTTACCTTTC
ATTAACTTTGATGCAAACCTATAGCTTAAGGTATCTTAAACTTTAATGAC
ATTTTTCTCTAAAATAGTAGTTTGTAATAACTTGTTCTGGCACCTGGCTC
CAATGAACACTACCCTCTGACCCTGTGGTATAATTTTCATGAGTAAGTGG
AAACCTAAGATCTTAGAAGTTCAACGGCAATGTGTCCAAGGGGTTTAGAT
CCTCTCCTTAAGTGCCTGTATCTCTGTGAAAAGAATCATCATAGGCTAGG
CGCGATGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGTAGGT
GGATCACCTGAGGTCGGGAGTCCAAGACCAGCCTGACTGACATGGAAAAA
CCCTGTCTCTACTAAAAATACAAAATTAGGTATGGTGGTGCATTCCTGTA
ATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAG
GGGGAGGTTGCAGCAAGCCAAGATCGTGCCATTGCACTCCAGCAGCCTGG
GCAACAAGAGTGAAAAACTACACCTCAAAAACAAAAACAAAAACAAAAGA
ATCATCATCAAGTGAACTGGAACACATCCAGAGAACTAATTTTGTTAGAA
AGATTTTAGAGTTGAGCCACACAATCTGCATCTTCTGCGTCCTCCATGCA
CTCGTCTGCTTTCTGGAGCCCCATGAGTGAGTCTTAATCCTGTTCCAGAT
AACAGTTCTCTTCCGGGTAACGGTTCTTCAGATACTTGAAGACAGTGTCT
TATTTCCTTAAATCTTCTCATTTCTTCTTCAAAAGACAGTATTTCAAGTT
ACTTTTATGTATCTTTACCATCTACCTCTGGATAAACACTCTCCAATTTG
TCAGTGACCATGTTAAAAACCAAGCACGGTGCTTAAAACTGACATCATCT
TTCAGGCAATCACTCCATTGGAGAATACAGTGGGGCTCTGGATCTGTACT
TCACTTGCTCCAGAGCCTCTGCTTGTGTTAATACGGCCCAGTTTCAAATA
AGCATTTTTAGCAGCCCTGAAATGTGTACTCAGATTTAGTTTATAGTCAA
CTAAAAACACCCAGAGGTCTCCTGTATTACACAAGTTATAATTAAAACCT
TAAAAGAGAAAGGTATAGGACAAATGATCTGTCTCCTCCCTTTTTTGCTT
TTTCATATGTTAAGACTATCTCGGAGCTGTTATCAGACTTTTTTCCTGAA
AAACTCTCAACAATACTCAAACTAGGTGTTACATGAAGCTGGGGTCTCCA
GGTTTTGCCTCACTTGTTCTTTCTTTTGTTGTTGTTGAGACAGAGTCTCA
CTCTGTCGCCAGGCTGGAGTGCAGTGGCAGGATCTCAGCTGACTGCAACC
TCAGCCTCCAGAGTTCAAGCAATTCTTCTGTGTCAGCCTCCCAAGTAGCT
GGGATTACAGGTGCACACCACCACGCCCAGCCA
12
Another example of annotation
13
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagag
tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag
14
Data summary
(rows: base in the first sequence; columns: base at the same position in the second sequence)
      a   c   g   t
  a   9   4   3   4
  c   4   4   6   4
  g   3   2   5   2
  t   5   9   4  13
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagag
tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag
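The data summary can be reproduced by pairing the two 81-base sequences position by position and counting each (first base, second base) combination. The sketch below is one way to do that; the names `HUMAN`, `RAT`, `BASES` and `pair_counts` are illustrative and not from the slides, and the human/rat labels follow the labels used elsewhere in the presentation.

```python
from collections import Counter

# The two ungapped 81-base sequences shown on the slide.
HUMAN = ("tctctggttagtttgtaacatcaagtacttacctcattcagcatttttct"
         "ttctttaatagactgggtcacccctaaagag")
RAT = ("tccgggattagtctgta"
       "tgaggtacccaccacactcagaagttttctttcttggatagacttgatca"
       "cccctgaagagaag")
BASES = "acgt"


def pair_counts(x, y):
    """Return a 4x4 table whose entry [i][j] counts the positions
    where x has BASES[i] and y has BASES[j]."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    tally = Counter(zip(x, y))
    return [[tally[(r, c)] for c in BASES] for r in BASES]


table = pair_counts(HUMAN, RAT)
print("  " + " ".join(BASES))
for base, row in zip(BASES, table):
    print(base, *row)
```

The row sums of the table are the base counts of the first sequence, and the column sums are the base counts of the second.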
15
Statistics Question: Are the two sequences independent?
      a   c   g   t
  a   9   4   3   4
  c   4   4   6   4
  g   3   2   5   2
  t   5   9   4  13
Algebra Question: Is the 4x4 matrix close to rank 1?
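One standard way to connect the two questions: the rank-1 matrix with the same row and column sums as the data is the maximum likelihood fit under independence, and a Pearson chi-square statistic measures how far the observed table is from it. The sketch below assumes that choice of statistic; the slides do not say which measure of closeness the presenters had in mind, and the function names are illustrative.

```python
# Observed counts from the slide (rows and columns ordered a, c, g, t).
COUNTS = [
    [9, 4, 3, 4],
    [4, 4, 6, 4],
    [3, 2, 5, 2],
    [5, 9, 4, 13],
]


def independence_fit(counts):
    """Expected counts under independence: row sum * column sum / total.
    The result is a rank-1 matrix with the same margins as the data."""
    n = sum(map(sum, counts))
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    return [[r * c / n for c in col_sums] for r in row_sums]


def pearson_chi_square(counts, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(counts, expected)
        for o, e in zip(obs_row, exp_row)
    )


expected = independence_fit(COUNTS)
statistic = pearson_chi_square(COUNTS, expected)
# A 4x4 table has (4 - 1) * (4 - 1) = 9 degrees of freedom.
print(f"chi-square statistic (9 degrees of freedom): {statistic:.3f}")
```

A small statistic means the table is close to rank 1; comparing it with a chi-square distribution on 9 degrees of freedom turns that into a test of independence.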
16
  • The independence model
  • $m = 16$ observable states: $\{A, C, G, T\}^2$
  • $d = 6$ unknown parameters
  • $(a_A, a_C, a_G, a_T, b_A, b_C, b_G, b_T)$ where
  • $a_A + a_C + a_G + a_T = b_A + b_C + b_G + b_T = 1$
  • Independence means probabilities factor:
    $p_{AG} = \mathrm{prob}(A, G) = a_A \cdot b_G$

17
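  • The independence model
  • $m = 16$ observable states: $\{A, C, G, T\}^2$
  • $d = 6$ unknown parameters
  • $(a_A, a_C, a_G, a_T, b_A, b_C, b_G, b_T)$ where
  • $a_A + a_C + a_G + a_T = b_A + b_C + b_G + b_T = 1$
  • Independence means probabilities factor:
    $p_{AG} = \mathrm{prob}(A, G) = a_A \cdot b_G$
  • The model is the polynomial map
  • $(a, b) \mapsto a^{T} b$ (with $a$, $b$ row vectors, so $a^{T} b$ is a 4x4 matrix of rank at most 1)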
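As a minimal sketch of the map described above, the code below builds the 4x4 matrix of products from two parameter vectors and checks the two properties the slide relies on: the entries sum to 1, and all 2x2 minors vanish (which is equivalent to rank at most 1). The function names and the numeric parameter values are hypothetical, chosen only for illustration.

```python
from itertools import combinations

BASES = "ACGT"


def independence_model(a, b):
    """Map (a, b) to the 4x4 matrix with entries p[i][j] = a[i] * b[j]."""
    for name, vec in (("a", a), ("b", b)):
        if len(vec) != 4 or abs(sum(vec) - 1.0) > 1e-9:
            raise ValueError(f"{name} must have 4 entries summing to 1")
    return [[ai * bj for bj in b] for ai in a]


def max_abs_2x2_minor(p):
    """Largest |p[i][k]*p[j][l] - p[i][l]*p[j][k]|; it is zero exactly
    when the matrix has rank at most 1."""
    return max(
        abs(p[i][k] * p[j][l] - p[i][l] * p[j][k])
        for i, j in combinations(range(4), 2)
        for k, l in combinations(range(4), 2)
    )


# Hypothetical parameter values (not from the slides).
a = [0.1, 0.2, 0.3, 0.4]
b = [0.25, 0.25, 0.3, 0.2]
p = independence_model(a, b)
print("prob(A,G) =", p[BASES.index("A")][BASES.index("G")])
print("total probability =", sum(map(sum, p)))
print("largest 2x2 minor =", max_abs_2x2_minor(p))
```

Each of the two vectors has 4 entries and one linear constraint, which gives the 3 + 3 = 6 free parameters stated on the slide.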

18
Models for discrete data
A statistical model is a parameterized family of
probability distributions
$\mathbb{R}^d \supset \Theta \longrightarrow \Delta \subset \mathbb{R}^m$
$d$ = number of parameters
$m$ = number of observable states
$\Theta$ = the parameter space
$\Delta$ = probability simplex on the $m$ states
19
The geometry of maximum likelihood estimation
[Figure labels: parameter space, data, probability simplex]
20
Observed data
tctctggttagtttgtaacatcaagtacttacctcattcagcatttttctttctttaatagactgggtcacccctaaagagatc
tccgggattagtctgtatgaggtacccaccacactcagaagttttctttcttggatagacttgatcacccctgaagagaag
21
Hidden data
tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCACCCctaaagagatc
tccgggattagtctgt---atgaggtacccacCACACTCAGAAGTTTTCTTTCTTGGATAGACTTGATCACCCctgaagagaag
22
The alignment problem is to find the shortest
path in the alignment graph.
This is solved with dynamic programming and is
known in computational biology as the
Needleman-Wunsch algorithm.
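The shortest-path view of alignment can be sketched as follows. This is a minimal illustration, not the presenters' implementation: it assumes unit costs for mismatches and gaps (matches cost 0), and it minimizes cost rather than maximizing a similarity score, which is an equivalent formulation of the Needleman-Wunsch recursion. The name `needleman_wunsch` and its parameters are illustrative.

```python
def needleman_wunsch(x, y, mismatch=1, gap=1):
    """Global alignment of x and y minimizing total cost (match costs 0).

    cost[i][j] is the length of the shortest path from node (0, 0) to
    node (i, j) in the alignment graph, whose edges are diagonal
    (match or mismatch), vertical (gap in y) and horizontal (gap in x).
    """
    n, m = len(x), len(y)

    def sub(i, j):
        # Cost of the diagonal edge entering node (i, j).
        return 0 if x[i - 1] == y[j - 1] else mismatch

    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub(i, j),
                cost[i - 1][j] + gap,
                cost[i][j - 1] + gap,
            )

    # Trace back one optimal path from (n, m) to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub(i, j):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("acgt", "agt")
print(score)
print(top)
print(bottom)
```

Different mismatch and gap costs generally lead to different optimal alignments, which is the starting point for parametric alignment.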
23
The algebraic statistical model for sequence
alignment, known as the pair hidden Markov
model, is the image of a map whose coordinates
are polynomials with one term for each path in
the alignment graph.
The (negated) logarithms of the 33 parameters
give the edge lengths for the shortest path
problem on the alignment graph.
24
General Mathematical Framework
  • Statistical models are algebraic varieties.
  • Algebraic varieties can be tropicalized.
  • Tropicalized models are useful for MAP inference in statistics.

L. Pachter and B. Sturmfels, Tropical Geometry of
Statistical Models, Proceedings of the National
Academy of Sciences, Volume 101, No. 46 (2004),
pp. 16132--16137.
L. Pachter and B. Sturmfels, Parametric Inference
for Biological Sequence Analysis, Proceedings of
the National Academy of Sciences, Volume 101,
No. 46 (2004), pp. 16138--16143.
25
2.1. Tropical arithmetic and dynamic programming
In tropical algebraic geometry, varieties are
piecewise linear
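To illustrate the link between tropical arithmetic and dynamic programming, the sketch below runs one recursion over two semirings. With ordinary (+, ×) it sums, over all paths in a small alignment-style grid, the product of step parameters. With tropical (min, +) and weights set to negative logarithms of the same parameters, it returns the length of the shortest path. Everything here is a simplifying assumption for illustration: the step values do not depend on the sequences, and the names `Semiring`, `path_polynomial`, `theta` and `weights` are hypothetical.

```python
import math
from typing import Callable, NamedTuple


class Semiring(NamedTuple):
    add: Callable[[float, float], float]
    mul: Callable[[float, float], float]
    zero: float  # identity for add
    one: float   # identity for mul


SUM_PRODUCT = Semiring(lambda u, v: u + v, lambda u, v: u * v, 0.0, 1.0)
MIN_PLUS = Semiring(min, lambda u, v: u + v, math.inf, 0.0)


def path_polynomial(n, m, step, sr):
    """Evaluate over the semiring sr the 'sum' over all paths from (0, 0)
    to (n, m) of the 'product' of step values along the path.
    Allowed moves are "diag", "down" and "right"."""
    table = [[sr.zero] * (m + 1) for _ in range(n + 1)]
    table[0][0] = sr.one
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = sr.zero
            if i > 0 and j > 0:
                total = sr.add(total, sr.mul(table[i - 1][j - 1], step["diag"]))
            if i > 0:
                total = sr.add(total, sr.mul(table[i - 1][j], step["down"]))
            if j > 0:
                total = sr.add(total, sr.mul(table[i][j - 1], step["right"]))
            table[i][j] = total
    return table[n][m]


# Hypothetical parameters and their tropical weights w = -log(theta).
theta = {"diag": 0.5, "down": 0.25, "right": 0.25}
weights = {move: -math.log(value) for move, value in theta.items()}

total_probability = path_polynomial(1, 1, theta, SUM_PRODUCT)
shortest = path_polynomial(1, 1, weights, MIN_PLUS)
print(total_probability)    # sum over the three paths
print(math.exp(-shortest))  # probability of the single most likely path
```

Only the arithmetic changes between the two calls; the recursion itself is identical, which is why the tropicalized model can be evaluated by the same dynamic program.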
26
Comparative Genomics
human: tctctggttagtttgtaacatcaagtacttacCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCACCCctaaagagatc
rat:   tccgggattagtctgt---atgaggtacccacCACACTCAGAAGTTTTCTTTCTTGGATAGACTTGATCACCCctgaagagaag
27
Comparative Genomics
A phylogenetic tree on 5 taxa.
28
Comparative Genomics
Petersen graph parametrizes trees on 5 taxa.
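One common reading of this statement can be checked by enumeration: take as vertices the 10 splits that separate a pair of taxa from the other three, and join two splits when the pairs are disjoint, so that both can occur in the same tree. Each edge then corresponds to one trivalent (unrooted binary) tree on 5 taxa. The sketch below builds that graph and confirms it has the Petersen graph's counts; the variable names are illustrative.

```python
from itertools import combinations

TAXA = (1, 2, 3, 4, 5)

# Vertices: the 10 splits {i, j} | rest, each recorded by its pair.
splits = [frozenset(pair) for pair in combinations(TAXA, 2)]

# Edges: two pair-splits are compatible exactly when the pairs are
# disjoint; each compatible pair of splits determines one trivalent tree.
trees = [(s, t) for s, t in combinations(splits, 2) if not (s & t)]

# Degree of each vertex: the number of trees containing that split.
degree = {s: sum(s in tree for tree in trees) for s in splits}


def describe(tree):
    """Write a tree as ((i,j),k,(l,m)), where k is the taxon in neither pair."""
    s, t = tree
    (middle,) = set(TAXA) - s - t
    left = ",".join(map(str, sorted(s)))
    right = ",".join(map(str, sorted(t)))
    return f"(({left}),{middle},({right}))"


print(len(splits), "vertices,", len(trees), "edges")
for tree in trees:
    print(describe(tree))
```

The output lists 15 trees, which agrees with the count (2n - 5)!! = 5 · 3 · 1 = 15 of trivalent trees on n = 5 labeled leaves.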
29
Trees are Ubiquitous in Biology
Fig. 1 from: Y Chromosome of D. pseudoobscura Is Not
Homologous to the Ancestral Drosophila Y,
Antonio Bernardo Carvalho and Andrew G. Clark,
Science, January 7, 2005.
30
[Figure: leaf labels of trees on 5 taxa, in four groups as extracted: (1, 5, 2, 4, 3), (1, 5, 4, 2, 3), (1, 4, 2, 5, 3), (1, 3, 2, 4, 5)]
31
Summer school Themes
Algebra, discrete mathematics and statistics
are relevant for genomics and vice versa...
Organ (liver)
Organ system (digestive)
Tissue (liver sinusoid)
Cell (hepatocyte)
Organelle (nucleus)
TAGAGACGGGGGTTTCACAATGTTGGCCA
Molecule (DNA)