Alignment of Whole Genomes: Algorithms - PowerPoint PPT Presentation

Slides: 86
Provided by: Sera161

Transcript and Presenter's Notes

Title: Alignment of Whole Genomes: Algorithms


1
Alignment of Whole Genomes: Algorithms & Tools
  • Michael Brudno
  • Department of Computer Science
  • University of Toronto
  • CBW 02/15/06

2
The Human Genome
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC
ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAG
CGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGACTTT
CCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCA
TAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCC
CAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAA
GACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGT
TTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGG
ACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGT
GGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACA
GAATGCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAA
ACCTCACCCATGGGAATGCTCACGCATTTAATTACAGACCTGAAAGGAGA
GGAAGCTCGGGAGGTGG
3
Basic Biology
  • DNA (4 residues, Double-stranded)
  • RNA (4 residues, Single-stranded)
  • Protein (20 amino acids)
  • Amino acid code: a triplet of RNA codes for 1 amino acid
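
As a rough illustration of the triplet code in the last bullet, the sketch below translates an RNA string codon by codon. The table is deliberately partial (the real code has 64 codons), and the function name `translate` and the "X" placeholder for unlisted codons are assumptions made for this example only.

```python
# Partial codon table for illustration only; the full genetic code has 64 entries.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "UUC": "F", "GGA": "G", "GGC": "G",
    "GCU": "A", "AAA": "K", "UAA": "*", "UAG": "*", "UGA": "*",
}

def translate(rna):
    """Translate RNA three bases at a time until a stop codon ('*')."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "X")  # 'X' = codon not in this partial table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGUUUGGAAAAUAA"))  # MFGK
```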

[Figure: gene structure diagram; labels: gene, E, P, UTR, exon]
4
The Human Genome
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC
ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAG
CGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGACTTT
CCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCA
TAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCC
CAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAA
GACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGT
TTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGG
ACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGT
GGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACA
GAATGCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAA
ACCTCACCCATGGGAATGCTCACGCATTTAATTACAGACCTGAAAGGAGA
GGAAGCTCGGGAGGTGG
5
Complete DNA Sequences
Nearly 200 complete genomes have been sequenced
6
Complete DNA Sequences
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC
ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAG
CGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGACTTT
CCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCA
TAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCC
CAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAA
GACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGT
TTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGG
ACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGT
GGCATGTGACCTCCGAGCAGTCACCADCCAGGCGGCAGGAAGGCGCACCC
CCCCAGCAATCCGCGCGCCGGGACAGAATGCCTGCAGGAACTTCTTCTGG
AAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGGGAATGCTCACGC
ATTTAATTACAGACCTGAAAGGAGAGGAAGCTCGGGAGGTGGGCATCTGA
CA
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC
ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAG
CGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGACTTT
CCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCA
TAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCC
CAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAA
GACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGT
TTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGG
ACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAAGGAAGCTCGGGAGG
TGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGAC
AGAATGCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAA
AACCTCACCCATGGGAATGCTCACGCATTTAATTACAGACCTGAAAGGAG
AGGAAGCTACAGTCATGTGCFCGGGAGGTGGGCATCTGACA
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC
ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAG
CGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGACTTT
CCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCA
TAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCC
CAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAA
GACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGT
TTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGG
ACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAAGGAAGCTCGGGAGG
TGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGAC
AGAATGCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAA
AACCTCACCCATGGGAATGCTCACGCATTTAATTACAGACCTGAAAGGAG
AGGAAGCTACAGTCATGTGCFCGGGAGGTGGGCATCTGACA
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC
ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAG
CGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGACTTT
CCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCA
TAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCC
CAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAA
GACCTCCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGT
TTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGG
ACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGT
GGCATGTGACCTCCGAGCAGTCACCADCCAGGCGGCAGGAAGGCGCACCC
CCCCAGCAATCCGCGCGCCGGGACAGAATGCCTGCAGGAACTTCTTCTGG
AAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGGGAATGCTCACGC
ATTTAATTACAGACCTGAAAGGAGAGGAAGCTCGGGAGGTGGGCATCTGA
C
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC
ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAG
CGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGACTTT
CCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCA
TAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCC
CAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAA
GACCTCCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGT
TTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGG
ACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGT
GGCATGTGACCTCCGAGCAGTCACCADCCAGGCGGCAGGAAGGCGCACCC
CCCCAGCAATCCGCGCGCCGGGACAGAATGCCTGCAGGAACTTCTTCTGG
AAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGGGAATGCTCACGC
ATTTAATTACAGACCTGAAAGGAGAGGAAGCTCGGGAGGTGGGCATCTGA
C
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC
ACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAG
CGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCTCCTGACTTT
CCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCA
TAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCC
CAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAA
GACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGT
TTAATTACAGACCTGAACTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGG
ACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAAGGAAGCTCGGGAGG
TGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGAC
AGAATGCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAA
AACCTCACCCATGGGAATGCTCACGCATTTAATTACAGACCTGAAAGGAG
AGGAAGCTACAGTCATGTGCFCGGGAGGTGGGCATCTGACA
7
Evolution
8
Conservation Implies Function
Gene
Exon
CNS Other Conserved
Dubchak, Brudno et al 2000
9
Whole Genome Similarity Map
10
Map of Rearrangements
11
Alignment of a Syntenic Area
12
      84790     84800     84810     84820     84830     84840
seq1  -AGGACTTCTTCCTTTCACCATATACAAAAATCAACCCAAGATTGATACAATACTTTAAT
seq2  CAGGATCGACGTCTCTCACCTTATACAAAAATCAACTGAAGATGCATCACAGACTTTAA-
      13770     13780     13790     13800     13810     13820

      84850     84860     84870     84880     84890     84900
seq1  GTAAAATAAAAAACTGTAAAACTCTAGAAGAAAATGTAGAAACA-----CCATCTGGACA
seq2  ---AAACTGAAACCATAACAATTCTAGAAGTAAACATTGAAAAAAAAACCCTTCTAGACA
      13830     13840     13850     13860     13870     13880

      84910     84920     84930     84940     84950     84960
seq1  TCAGCCTGGGCAGAGAATTTATGACTAAGTCCTCAAAAGTAATTTCAACAAAAATAAACA
seq2  TTTGCTTAGGCAAAGACTTCATGACAAAGAATGCAAAAGCAG------------------
      13890     13900     13910     13920
Local Area of Similarity
13
seq1  AAATAAAAAACTGTAAAACTCTAGAAGAAAATGTAGAAACA-----CCATCTGGACA
seq2  AAACTGAAACCATAACAATTCTAGAAGTAAACATTGAAAAAAAAACCCTTCTAGACA
How Similar are these Sequences?
14
Edit Distance Model
  • Minimal weighted sum of insertions, deletions, and
    mutations required to transform one string into
    another
  • Example: two alignments of AGGCACACA and ACACATTCA
        AGGCACA--CA         AGGCACACA
        A--CACATTCA   or    ACACATTCA

Levenshtein 1966
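
The edit distance model can be sketched with the standard dynamic programming recurrence. This is a minimal version that assumes unit costs for insertions, deletions and mutations by default; the function name and the `indel` and `mismatch` parameters are illustrative and not taken from the slides.

```python
def edit_distance(a, b, indel=1, mismatch=1):
    """Minimal weighted number of insertions, deletions and mutations turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            delete = prev[j] + indel
            insert = cur[j - 1] + indel
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]

print(edit_distance("AGGCACACA", "ACACATTCA"))  # 4
```

With unit costs, both alignments on the slide cost 4: the gapped one uses four gap characters and the ungapped one has four mismatches, which is why a weighting scheme is needed to prefer one over the other.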
15
Edit Distance Model
  • Let's figure out how to compute the sequence of
    events that gives the highest score
  • http://meetings.cshl.edu/tgac/tgac/flash/DynamicProgramming.swf

16
      84790     84800     84810     84820     84830     84840
seq1  -AGGACTTCTTCCTTTCACCATATACAAAAATCAACCCAAGATTGATACAATACTTTAAT
seq2  CAGGATCGACGTCTCTCACCTTATACAAAAATCAACTGAAGATGCATCACAGACTTTAA-
      13770     13780     13790     13800     13810     13820

      84850     84860     84870     84880     84890     84900
seq1  GTAAAATAAAAAACTGTAAAACTCTAGAAGAAAATGTAGAAACA-----CCATCTGGACA
seq2  ---AAACTGAAACCATAACAATTCTAGAAGTAAACATTGAAAAAAAAACCCTTCTAGACA
      13830     13840     13850     13860     13870     13880

      84910     84920     84930     84940     84950     84960
seq1  TCAGCCTGGGCAGAGAATTTATGACTAAGTCCTCAAAAGTAATTTCAACAAAAATAAACA
seq2  TTTGCTTAGGCAAAGACTTCATGACAAAGAATGCAAAAGCAG------------------
      13890     13900     13910     13920
Local Area of Similarity (Var-mers)
17
Local Alignment
F(i,j) = max(F(i,j), 0)
Return all paths with a position (i,j) where F(i,j) > C
Time: O(n^2) for two seqs, O(n^k) for k seqs
Smith & Waterman 1981
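
A minimal sketch of the local alignment recurrence follows, assuming a simple linear gap penalty and illustrative scores (`match`, `mismatch` and `gap` are not from the slides). For brevity it reports only the best-scoring cell rather than every path with F(i,j) > C.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, (i, j) of the cell where that alignment ends)."""
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Clamping at 0 lets a local alignment start anywhere.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    return best, best_pos

print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))  # (8, (7, 7))
```

The two nested loops over both sequences are where the O(n^2) time for two sequences comes from.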
18
Heuristic Local Alignment
BLAST
FASTA
Altschul et al 1990
Pearson 1987
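
Heuristic tools of this kind avoid the full quadratic table by first looking for short exact word matches (seeds) and extending only around them. The sketch below shows just the seeding step, as an assumption-laden illustration and not the actual BLAST or FASTA implementation; `find_seeds` and the word size `k=4` are hypothetical.

```python
from collections import defaultdict

def find_seeds(query, target, k=4):
    """Return (query_pos, target_pos) pairs where the two sequences share an exact k-mer."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)  # index every k-mer of the target
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds

print(find_seeds("TTACGTAA", "GGACGTCC"))  # [(2, 2)]
```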
19
Alignment of a Syntenic Area (LAGAN)
20
Global Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
21
LAGAN: 1. FIND Local Alignments
  • Find Local Alignments
  • Chain Local Alignments
  • Restricted DP

Brudno, Do et al 2003
22
LAGAN: 2. CHAIN Local Alignments
  • Find Local Alignments
  • Chain Local Alignments
  • Restricted DP

Brudno, Do et al 2003
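
The chaining step can be sketched as picking the highest-scoring set of local alignments that increase in both sequences. The version below is a simple quadratic dynamic program that treats each local alignment as a single point `(x, y, score)`; that simplification and the function name are assumptions for illustration, not the LAGAN implementation.

```python
def chain_anchors(anchors):
    """anchors: list of (x, y, score). Return the best chain increasing in both x and y."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    best = [a[2] for a in anchors]  # best chain score ending at each anchor
    back = [-1] * n                 # predecessor in that chain
    for i in range(n):
        xi, yi, si = anchors[i]
        for j in range(i):
            xj, yj, _ = anchors[j]
            if xj < xi and yj < yi and best[j] + si > best[i]:
                best[i] = best[j] + si
                back[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = back[i]
    return chain[::-1]

print(chain_anchors([(10, 12, 5), (20, 5, 4), (30, 33, 6), (40, 45, 3)]))
# [(10, 12, 5), (30, 33, 6), (40, 45, 3)]
```

The anchor at (20, 5) is dropped because it would force the chain to go backwards in the second sequence.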
23
LAGAN: 3. Restricted DP
  • Find Local Alignments
  • Chain Local Alignments
  • Restricted DP

Brudno, Do et al 2003
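
One way to picture restricted DP: once a chain of anchors is fixed, global dynamic programming only needs to fill the boxes between consecutive anchors instead of the whole table. The sketch below assumes anchors are grid points `(i, j)` that the alignment must pass through and uses illustrative linear gap scores; the actual LAGAN search region may differ.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch style) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def restricted_dp_score(a, b, anchors):
    """Best global score among alignments passing through every anchor (i, j).

    anchors must be sorted and increasing in both coordinates; only the
    boxes between consecutive anchors are ever filled in.
    """
    points = [(0, 0)] + list(anchors) + [(len(a), len(b))]
    total = 0
    for (i0, j0), (i1, j1) in zip(points, points[1:]):
        total += nw_score(a[i0:i1], b[j0:j1])
    return total

print(restricted_dp_score("ACGTACGT", "ACGTACGT", [(4, 4)]))  # 8
```

A badly placed anchor lowers the score but never raises it above the unrestricted optimum, which is why the quality of the chain matters.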
24
MLAGAN: 1. Progressive Alignment
Human
Baboon
Mouse
Rat
  • Given N sequences, phylogenetic tree
  • Align pairwise, in order of the tree (LAGAN)
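
The order of pairwise alignments implied by a tree can be sketched as a post-order traversal. The nested-tuple tree below, with the two primates and the two rodents grouped together, is an assumption for the example; each step would be a pairwise alignment of two sequences or two existing alignments.

```python
def alignment_order(tree):
    """Return the list of pairwise alignment steps for a binary tree of nested tuples."""
    steps = []

    def visit(node):
        if isinstance(node, str):      # leaf: a real sequence
            return node
        left, right = node
        l, r = visit(left), visit(right)
        steps.append((l, r))           # align the two children of this internal node
        return f"({l}/{r})"

    visit(tree)
    return steps

tree = (("Human", "Baboon"), ("Mouse", "Rat"))
for step in alignment_order(tree):
    print(step)
# ('Human', 'Baboon')
# ('Mouse', 'Rat')
# ('(Human/Baboon)', '(Mouse/Rat)')
```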

25
MLAGAN: 2. Multi-anchoring
To anchor the (X/Y) and (Z) alignments
[Figure: three panels with axes X vs. Z, Y vs. Z, and X/Y vs. Z]
26
Cystic Fibrosis (CFTR), 12 species
Chicken
Zebrafish
Cow
Pig
Chimp
Human
Dog
Rat
Fugufish
Cat
Baboon
Mouse
  • Human sequence length 1.8 Mb
  • Total genomic sequence 13 Mb

27
CFTR (cont'd)
Method   Species            Max memory (Mb)   Time (sec)   Exons aligned (%)
LAGAN    Mammals            90                550          99.7
LAGAN    Chicken & Fishes   90                862          84
MLAGAN   Mammals            670               4547         99.8
MLAGAN   Chicken & Fishes   (same run)        (same run)   91
28
Evolution Over a Chromosome
Cooper, Brudno et al 2004
29
How Many Genomes?
Cooper, Brudno et al 2003
30
Fish-Human Comparison (Woolfe et al 2004)
Highly conserved vertebrate non-coding elements
direct tissue-specific reporter gene expression
31
Phylo-VISTA Visualization
Shah, Brudno et al 2004
32
Summary of LAGAN Alignment
  • It is possible to build megabase-long multiple
    alignments for dozens of sequences
  • The alignments are accurate at aligning major
    biological functional areas
  • These alignments can be used to better
    understand evolution and regulation

33
Map of Rearrangements (Shuffle-LAGAN)
34
Evolution at the DNA level
SEQUENCE EDITS: Deletion, Mutation
ACGGTGCAGTTACCA
AC----CAGTCACCA
REARRANGEMENTS: Inversion, Translocation, Duplication
35
Local Global Alignment
Global
Local
36
Glocal Alignment Problem
  • Find the least-cost transformation of one sequence
    into another using new operations:
  • Sequence edits
  • Inversions
  • Translocations
  • Duplications
  • Combinations of above

37
Shuffle-LAGAN

A glocal aligner for long DNA sequences
Brudno, Malde et al 2003
38
S-LAGAN: Find Local Alignments
  • Find Local Alignments
  • Build Rough Homology Map
  • Globally Align Consistent Parts

39
S-LAGAN: Build Homology Map
  • Find Local Alignments
  • Build Rough Homology Map
  • Globally Align Consistent Parts

40
Building the Homology Map
Chain (using Eppstein & Galil): each alignment gets
a score which is the MAX over 4 possible chains.
Penalties are affine (event and distance components).
  • Penalties
  • a) regular
  • b) translocation
  • c) inversion
  • d) inverted translocation
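
The affine penalty idea can be sketched as follows. Everything specific here is an assumption for illustration: the penalty values, the tuple layout `(x_start, x_end, y_start, y_end, strand, score)`, and the rule used to classify a connection. A plain quadratic dynamic program is used in place of the Eppstein & Galil sparse method named on the slide.

```python
# Hypothetical penalty values, chosen only for illustration.
EVENT_PENALTY = {
    "regular": 0,
    "translocation": 30,
    "inversion": 20,
    "inverted translocation": 40,
}
DISTANCE_PENALTY_PER_BASE = 1

def connection_type(p, q):
    """Classify the step from alignment p to alignment q (assumed rule)."""
    same_strand = p[4] == q[4]
    in_order = q[2] >= p[3]  # q starts after p ends in the second sequence
    if same_strand:
        return "regular" if in_order else "translocation"
    return "inversion" if in_order else "inverted translocation"

def penalty(p, q):
    """Affine penalty: a fixed event cost plus a cost proportional to distance."""
    gap = q[0] - p[1]        # distance in the base sequence
    return EVENT_PENALTY[connection_type(p, q)] + DISTANCE_PENALTY_PER_BASE * gap

def best_chain_score(alignments):
    """Best chain that is monotonic in the base (first) sequence only."""
    aligns = sorted(alignments)
    best = [a[5] for a in aligns]
    for i, q in enumerate(aligns):
        for j in range(i):
            p = aligns[j]
            if p[1] <= q[0]:  # p ends before q starts in the base sequence
                best[i] = max(best[i], best[j] + q[5] - penalty(p, q))
    return max(best, default=0)

a = (0, 10, 0, 10, "+", 50)
b = (20, 30, 20, 30, "+", 50)
print(connection_type(a, b), best_chain_score([a, b]))  # regular 90
```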
41
S-LAGAN: Build Homology Map
  • Find Local Alignments
  • Build Rough Homology Map
  • Globally Align Consistent Parts

42
S-LAGAN: Global Alignment
  • Find Local Alignments
  • Build Rough Homology Map
  • Globally Align Consistent Parts

43
S-LAGAN Results (CFTR)
Local
Glocal
44
S-LAGAN Results (CFTR)
Hum/Mus
Hum/Rat
45
S-LAGAN results (IGF cluster)
46
S-LAGAN results (HOX)
  • 12 paralogous genes
  • Conserved order in mammals

47
S-LAGAN results (HOX)
  • 12 paralogous genes
  • Conserved order in mammals

48
Rearrangements in Human v. Mouse
  • Some conclusions
  • Rearrangements come in all sizes
  • Duplications are less well conserved than other
    rearranged regions
  • Half of the exons not alignable by LAGAN are
    aligned by S-LAGAN

49
Whole Genome Similarity Map
50
Handling Chromosomes & Symmetry
  • Problems
  • S-LAGAN is meant to run on two sequences
  • S-LAGAN is not symmetric (it has a base genome)
  • Solutions
  • Switch penalty
  • Super-monotonic maps

Sundararajan, Brudno et al 2004; Brudno, Kislyuk
et al unpublished
51
Handling Chromosomes: Switch Penalty
Chr 3
Chr 2
Chr 1
Chr 4
Switch Penalty
Base chromosome
52
Supermap Algorithm
Duplication Inversion Translocation
  • Build 1-monotonic maps with both base genomes
    (cyan & pink)

53
Supermap Algorithm
Duplication Inversion Translocation
  • Build 1-monotonic maps with both base genomes
    (cyan & pink)

54
Supermap Algorithm
Duplication Inversion Translocation
  • Build 1-monotonic maps with both base genomes
    (cyan & pink)
  • Whenever the maps agree, join them (blue)

55
Supermap Algorithm
Duplication Inversion Translocation
  • Build 1-monotonic maps with both base genomes
    (cyan & pink)
  • Whenever the maps agree, join them (blue)
  • Syntenic areas are those with a degree of 1
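
A loose sketch of the joining idea, under strong assumptions: each map is represented as a set of `(segment_in_A, segment_in_B)` pairs, "agree" means the same pair appears in both maps, and "degree" counts how many partners a segment has across both maps. The real Supermap algorithm works on alignment coordinates and is more involved; `join_maps` is a hypothetical name.

```python
from collections import Counter

def join_maps(map_a_base, map_b_base):
    """Return (pairs found in both maps, the subset whose segments all have degree 1)."""
    agreed = set(map_a_base) & set(map_b_base)
    union = set(map_a_base) | set(map_b_base)
    degree_a = Counter(a for a, _ in union)  # partners of each segment of genome A
    degree_b = Counter(b for _, b in union)  # partners of each segment of genome B
    syntenic = {(a, b) for a, b in agreed if degree_a[a] == 1 and degree_b[b] == 1}
    return agreed, syntenic

map_a_base = {("a1", "b1"), ("a2", "b2"), ("a3", "b2")}
map_b_base = {("a1", "b1"), ("a2", "b2")}
agreed, syntenic = join_maps(map_a_base, map_b_base)
print(sorted(agreed))    # [('a1', 'b1'), ('a2', 'b2')]
print(sorted(syntenic))  # [('a1', 'b1')]
```

In this toy input, b2 has two partners (a2 and a3), as a duplication would produce, so only the a1/b1 pair counts as syntenic.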

56
Human Mouse Rearrangement Map
57
Human Genome Alignment Results
  • Compared with the previous tandem local/global
    approach
  • 2-fold speedup
  • Sensitivity of exon alignment unchanged in
    human/mouse, improved in human/chicken
  • 9-fold reduction in the number of mapped syntenic
    segments in human/mouse, and a 2-fold reduction
    in human/chicken.

58
VISTA Genome Browser
http://pipeline.lbl.gov
Brudno, Poliakov et al 2004
59
Recap
60
      84790     84800     84810     84820     84830     84840
seq1  -AGGACTTCTTCCTTTCACCATATACAAAAATCAACCCAAGATTGATACAATACTTTAAT
seq2  CAGGATCGACGTCTCTCACCTTATACAAAAATCAACTGAAGATGCATCACAGACTTTAA-
      13770     13780     13790     13800     13810     13820

      84850     84860     84870     84880     84890     84900
seq1  GTAAAATAAAAAACTGTAAAACTCTAGAAGAAAATGTAGAAACA-----CCATCTGGACA
seq2  ---AAACTGAAACCATAACAATTCTAGAAGTAAACATTGAAAAAAAAACCCTTCTAGACA
      13830     13840     13850     13860     13870     13880

      84910     84920     84930     84940     84950     84960
seq1  TCAGCCTGGGCAGAGAATTTATGACTAAGTCCTCAAAAGTAATTTCAACAAAAATAAACA
seq2  TTTGCTTAGGCAAAGACTTCATGACAAAGAATGCAAAAGCAG------------------
      13890     13900     13910     13920
Local alignment (var-mers)
61
Global Alignment (LAGAN)
62
Rearrangements (S-LAGAN)
63
Whole Genome Similarity Map
64
Is Sequence Alignment Solved?
65
Progressive Alignment
  • We want to get an alignment that equally
    reflects all species.
  • In the phylogenetic tree:
  • - Leaves are real genomes
  • - Internal nodes are ancestors
  • At every step we do an alignment corresponding
    to some internal node, in effect building an
    estimate of the ancestor's genome.

Human
Baboon
Mouse
Rat
66
Which Organism Should be the Base?
Duplication Inversion Translocation
  • After Supermap we have 1 or 2 alternatives at
    the end of each syntenic area
  • Whenever there are 2 alternatives at most one of
    the two corresponds to the ancestor

67
What is similar to the Ancestor?
68
What is similar to the Ancestor?
  • Ask the Outgroup on the phylogenetic tree!

69
What is similar to the Ancestor?
  • Ask the Outgroup on the phylogenetic tree!

C1
U
C2
70
What is similar to the Ancestor?
  • Ask the Outgroup on the phylogenetic tree!
  • S = (UMax(C1,C2)) / Min(C1,C2)

C1
U
-log(1-S)
C2
71
What is similar to the Ancestor?
  • Ask the Outgroup on the phylogenetic tree!
  • S = (UMax(C1,C2)) / Min(C1,C2)

-log(1-S)
C1
C2
U
72
What is similar to the Ancestor?
  • Ask the Outgroup on the phylogenetic tree!
  • S = (UMax(C1,C2)) / Min(C1,C2)
  • Find a set of connections such that there is at
    most 1 way in and 1 way out of each syntenic area

-log(1-S1)
-log(1-S2)
73
Transforming the plot into a graph
Duplication Inversion Translocation
74
Reduction to Matching
  • Solve maximum weighted matching for connected
    components (colored edges)

75
Reduction to Matching
  • Solve maximum weighted matching for connected
    components (colored edges)
  • Any edge present in solution joins syntenic
    regions giving us ancestral ordering
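
For intuition, the matching step can be sketched with a brute-force solver that is only practical for small connected components. The edge weight follows the -log(1-S) form shown on the earlier slides; the node names, the example weights, and the exhaustive search itself are assumptions for illustration, not the solver used in the talk.

```python
import math

def edge_weight(s):
    """Weight of a connection with score S (0 <= S < 1), as -log(1 - S)."""
    return -math.log(1.0 - s)

def max_weight_matching(edges):
    """edges: list of (u, v, weight). Return (total weight, chosen edges).

    Exhaustive search: each edge is either skipped or taken if both of its
    endpoints are still free, so no node ends up in more than one chosen edge.
    """
    def solve(k, used):
        if k == len(edges):
            return 0.0, []
        best = solve(k + 1, used)                       # option 1: skip edge k
        u, v, w = edges[k]
        if u not in used and v not in used:             # option 2: take edge k
            total, chosen = solve(k + 1, used | {u, v})
            if total + w > best[0]:
                best = (total + w, [edges[k]] + chosen)
        return best

    return solve(0, frozenset())

edges = [("A", "B", 3.0), ("B", "C", 5.0), ("C", "D", 3.0)]
print(max_weight_matching(edges))
# (6.0, [('A', 'B', 3.0), ('C', 'D', 3.0)])
```

Taking the single heaviest edge (B-C) blocks both of its neighbours, so the best matching here uses the two lighter edges instead.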

76
Back to the sequence
77
Back to the sequence
78
How well does it work?
  • It takes 20 minutes to churn through mouse/rat
    (using human as the outgroup); most of the time
    is spent calculating scores

79
How well does it work?
  • It takes 20 minutes to churn through mouse/rat
    (using human as the outgroup); most of the time
    is spent calculating scores
  • The resulting alignment (to human) is visually
    better

80
How well does it work?
Old New
81
How well does it work?
  • LP_Solve takes 20 minutes to churn through
    mouse/rat (using human as the outgroup); most of
    the time is spent calculating scores
  • The resulting alignment (to human) is visually
    better
  • The new alignments are significantly shorter, but
    have higher coverage

82
Ancestral Alignment (HMR)
  • Used Berkeley Genome Pipeline
  • Human genome aligned to mouse & rat
  • Conservation criteria from Waterston et al

83
Future Work
  • Verify accuracy of ancestral reconstruction
  • Do we actually get the ancestral sequence, or
    something that is easy to align?
  • Build a multi-alignment of all mammals and flies
  • Molecular Evolution
  • We have (for the first time) a dating for the
    various rearrangement events (they are mapped to
    a particular branch of the tree). Does the event
    contribute to evolutionary rate?

84
Overall Conclusions
  • Sequence comparison shows evolution
  • Evolution key to understanding the Human Genome
  • Computational Biology is data-driven
  • - follow the data

85
Acknowledgments
  • Stanford
  • Serafim Batzoglou
  • Arend Sidow
  • Gregory Cooper
  • Chuong (Tom) Do
  • Kerrin Small
  • Mukund Sundararajan

  • Berkeley
  • Gene Myers
  • Inna Dubchak
  • Alexander Poliakov
  • Göttingen
  • Burkhard Morgenstern

http://lagan.stanford.edu
http://pipeline.lbl.gov