Title: Locating conserved genes in whole genome scale
1Locating conserved genes in whole genome scale
- Prudence Wong
- University of Liverpool
- June 2005
- joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS)
2Outline
- Motivation
- Challenges of Whole Genome Alignment
- Four approaches and their performance
- Longest Common Subsequence
- Clustering Approach
- Mutation Sensitive Selection
- Hybrid Approach
- Remarks
4Mouse Human
Mouse and human are genetically very similar
Do they look the same?
What do we mean by similar?
Many genes found in human are also found in mouse: these are conserved genes
[Figure: Mouse Chromosome 16 and Human Chromosome 16; labels m16, h03]
5Whole Genome Alignment
Identify regions on the genomes that possibly contain their conserved genes.
[Figure annotation: possibly a mutation]
Differences in the ordering of conserved genes could be related to mutations. For related species, the number of mutations is usually small.
7Data size
- Usually very large (e.g., human chromosomes vs. mouse chromosomes)
Examples:
Human Chr No.   Length   Mouse Chr No.   Length
1               245M     1               134M
3               200M     2               181M
11              135M     7               134M
15              100M     8               129M
20              64M      16              99M
Cannot use global alignment tools because of the large size
8Observations
- A conserved gene may not be identical in the two genomes; nevertheless, there are some common substrings unique to this conserved gene (called MUMs, maximal unique matches)
- Locate all MUMs over the two genomes; yet not every MUM corresponds to a conserved gene
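To make the notion of a MUM concrete, here is a minimal sketch that finds them by brute force on two short strings. It assumes the usual definition (a common substring that occurs exactly once in each sequence and cannot be extended on either side). The names `find_mums`, `count_occurrences` and the `min_len` threshold are illustrative, not taken from the talk. It is far too slow for chromosomes; real tools build a suffix tree instead.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for every maximal unique match.

    Brute force for illustration only; min_len is a hypothetical
    threshold that filters out very short matches.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two strings agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


print(find_mums("GGACGTCC", "TTACGTAA"))
```

In this toy input the only MUM of length at least 3 is `ACGT`, found at position 2 in both strings.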
9Number of MUMs
Mouse Chr No.   Human Chr No.   No. of MUMs
7               19              52,394
15              22              71,613
16              16              66,536
16              22              61,200
17              16              29,001
17              19              56,236
19              11              29,814
The number of MUMs is small compared with the chromosome lengths
10MUMs for M16-H03
[Figure: MUMs and conserved genes; labels Mouse Chromosome 16, Human Chromosome 03]
11How to choose the right MUMs?
Generation of MUMs using a suffix tree
13MUM Selection
- MUMmer-1 (Delcher et al., Nucleic Acids Research 1999): longest common subsequence (effectively assumes no mutations)
- MUMmer-2 (Delcher et al., Nucleic Acids Research 2002), MUMmer-3 (Kurtz et al., Genome Biology 2004): clustering heuristics; most popular tool to uncover conserved genes at whole-genome scale
- MaxMinCluster (Wong et al., Bioinformatics 2004): clustering, optimization
- MSS, Mutation Sensitive Selection (Chan et al., Bioinformatics 2005): capture mutations
- Hybrid approach (Chan et al., Bioinformatics 2005): combine mutation sensitive and clustering approaches
(MaxMinCluster, MSS and the hybrid approach are our results)
14Overview of Results
- Average coverage (sensitivity) in %
                      Mouse/Human   Intragenus Baculoviridae   Intergenus Baculoviridae
MUMmer-3              77 (27)       66 (71)                    43 (62)
MaxMinCluster         84 (29)       69 (75)                    45 (59)
MSS                   91 (29)       79 (75)                    36 (53)
MUMmer-3 + MSS        91 (28)       79 (75)                    48 (43)
MaxMinCluster + MSS   91 (27)       79 (82)                    51 (53)
- coverage: percentage of published conserved genes reported
- sensitivity: percentage of reported MUMs that reside in published conserved genes
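The two measures can be sketched as follows. This is only an illustration under stated assumptions: genes and MUMs are given as `(start, end)` intervals on one genome, and a gene is counted as "reported" when at least one selected MUM lies inside it. The talk does not spell out the exact criterion, so that rule and the function name `coverage_and_sensitivity` are hypothetical.

```python
def coverage_and_sensitivity(genes, mums):
    """Return (coverage, sensitivity) as fractions between 0 and 1.

    genes, mums: lists of (start, end) intervals on the same genome.
    Assumption: a gene is "reported" if some MUM lies entirely inside it.
    """
    def inside(mum, gene):
        return gene[0] <= mum[0] and mum[1] <= gene[1]

    # Coverage: share of published genes hit by at least one reported MUM.
    covered = sum(1 for g in genes if any(inside(m, g) for m in mums))
    # Sensitivity: share of reported MUMs that fall inside a published gene.
    hits = sum(1 for m in mums if any(inside(m, g) for g in genes))

    coverage = covered / len(genes) if genes else 0.0
    sensitivity = hits / len(mums) if mums else 0.0
    return coverage, sensitivity


genes = [(0, 100), (200, 300), (400, 500)]
mums = [(10, 30), (40, 60), (250, 270), (350, 370)]
print(coverage_and_sensitivity(genes, mums))
```

Here two of the three genes are hit and three of the four MUMs lie inside a gene.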
15Overview of Results
MSS outperforms MaxMinCluster and MUMmer-3 on
closely related species
16Overview of Results
BUT MSS performs worse on species relatively
farther apart
17Overview of Results
both hybrid approaches perform well for species
farther apart
19Longest Common Subsequence
LCS
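The LCS idea can be sketched as picking a maximum-weight set of MUMs that appear in the same order in both genomes. The code below is a simple quadratic dynamic program written for illustration. It is not MUMmer-1's implementation. MUMs are assumed to be `(pos_a, pos_b, length)` triples, weight is taken to be the MUM length, and overlaps between MUMs are ignored.

```python
def heaviest_common_subsequence(mums):
    """Select MUMs that are increasing in both genomes with maximum total length.

    mums: list of (pos_a, pos_b, length). Returns the chosen MUMs in order.
    """
    order = sorted(mums)  # sort by position in genome A
    n = len(order)
    if n == 0:
        return []
    best = [m[2] for m in order]  # best[i]: heaviest chain ending at MUM i
    prev = [-1] * n               # back-pointers for reconstructing the chain
    for i in range(n):
        for j in range(i):
            same_order = order[j][0] < order[i][0] and order[j][1] < order[i][1]
            if same_order and best[j] + order[i][2] > best[i]:
                best[i] = best[j] + order[i][2]
                prev[i] = j
    # Walk back from the heaviest chain end.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(order[i])
        i = prev[i]
    return chain[::-1]


mums = [(0, 50, 10), (20, 0, 30), (60, 40, 25), (100, 100, 5), (120, 10, 40)]
print(heaviest_common_subsequence(mums))
```

Any MUM that breaks the common order, for example one moved by a transposition, is dropped by this selection. That is the sense in which the approach assumes no mutations.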
20Outline
- Motivation
- Challenges of Whole Genome Alignment
- Four approaches and their performance
- Longest Common Subsequence
- Clustering Approach
- Mutation Sensitive Selection
- Hybrid Approach
- Remarks
- LCS approach (MUMmer-1) does not take mutations into account
- MUMmer-2 and MUMmer-3 cluster by heuristics
- MaxMinCluster formalizes clustering as a combinatorial optimization problem
21Clustering approach
- Observations
- Noise MUMs are usually short and isolated
- A conserved gene usually contains a sequence of MUMs that are close together and have sufficient length -> clusters
[Figure: clusters of MUMs for Gene X and Gene Y, plus isolated noise MUMs]
22Challenge
- Challenge: some conserved genes do not induce clusters of sufficient length
- Solution: relax the definition of clusters to allow the presence of noise
23Noisy cluster
- Suppose Gap = 100, MinSize = 40
[Figure labels: > 100 apart; length 20; a 1-noisy cluster]
24Noisy cluster
- Suppose Gap = 100, MinSize = 40
[Figure labels: > 100 apart; length 20; a 2-noisy cluster]
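The roles of Gap and MinSize can be illustrated with plain gap-based clustering, without any noise tolerance. This sketch does not implement k-noisy clusters or MaxMinCluster. It only shows, under the assumption that MUMs are `(pos_a, pos_b, length)` triples, how nearby MUMs are chained and how short chains are discarded. The function name `gap_clusters` is hypothetical.

```python
def gap_clusters(mums, gap=100, min_size=40):
    """Group consecutive MUMs that are at most `gap` apart in both genomes.

    Clusters whose total MUM length is below `min_size` are dropped as noise.
    """
    clusters, current = [], []
    for mum in sorted(mums):
        if current:
            pos_a, pos_b, length = current[-1]
            # Distance from the end of the previous MUM in each genome.
            gap_a = mum[0] - (pos_a + length)
            gap_b = mum[1] - (pos_b + length)
            if not (0 <= gap_a <= gap and 0 <= gap_b <= gap):
                clusters.append(current)
                current = []
        current.append(mum)
    if current:
        clusters.append(current)
    return [c for c in clusters if sum(m[2] for m in c) >= min_size]


mums = [(0, 0, 20), (50, 60, 20), (400, 400, 20), (1000, 900, 25), (1040, 930, 30)]
for cluster in gap_clusters(mums):
    print(cluster)
```

The isolated MUM at position 400 has length 20, below MinSize, so it is treated as noise. A noisy cluster in the sense of the slides would be allowed to bridge over such a MUM instead of stopping there.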
25MaxMinCluster
- Problem formulation
- find a collection of k-noisy clusters such that the smallest cluster has the maximum weight
- Dynamic programming: O(k^2 n^2) time, O(k^2 n) space
26Outline
- Motivation
- Challenges of Whole Genome Alignment
- Four approaches and their performance
- Longest Common Subsequence
- Clustering Approach
- Mutation Sensitive Selection
- Hybrid Approach
- Remarks
Capture mutations more directly
27Mutation Sensitive Selection
[Figure: a subset of MUMs, transformed by a few mutations]
- three types of mutations: reversal, transposition, reversed-transposition
28k-mutated subsequences
- Given two sequences A and B and an integer k,
- a pair of subsequences, X of A and Y of B, is called a pair of k-mutated subsequences if X can be transformed to Y by at most k mutations
[Figure: a pair of 2-mutated subsequences, related by a reversal and a transposition]
- MUMs are signed; a reversal flips the sign of the reversed MUMs
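The definition can be checked directly on tiny inputs by exhaustive search. The sketch below represents a sequence of signed MUMs as a tuple of non-zero integers and tries every reversal, transposition and reversed-transposition up to k steps. It is exponential and meant only to illustrate the definition. The names `neighbours` and `is_k_mutated` are hypothetical.

```python
def neighbours(seq):
    """Yield every signed sequence reachable from seq by one mutation."""
    n = len(seq)
    for i in range(n):
        for j in range(i + 1, n + 1):
            block = seq[i:j]
            flipped = tuple(-x for x in reversed(block))
            # Reversal: reverse the block in place and flip its signs.
            yield seq[:i] + flipped + seq[j:]
            rest = seq[:i] + seq[j:]
            for k in range(len(rest) + 1):
                if k == i:
                    continue  # the block would land where it started
                # Transposition: move the block elsewhere, unchanged.
                yield rest[:k] + block + rest[k:]
                # Reversed-transposition: move it reversed, with signs flipped.
                yield rest[:k] + flipped + rest[k:]


def is_k_mutated(x, y, k):
    """True if x can be turned into y by at most k mutations (brute force)."""
    target = tuple(y)
    frontier = {tuple(x)}
    seen = set(frontier)
    for _ in range(k):
        if target in seen:
            return True
        next_frontier = set()
        for s in frontier:
            for t in neighbours(s):
                if t not in seen:
                    seen.add(t)
                    next_frontier.add(t)
        frontier = next_frontier
    return target in seen


x = (1, 2, 3, 4, 5)
y = (4, 1, -3, -2, 5)  # reverse (2, 3), then move 4 to the front
print(is_k_mutated(x, y, 1), is_k_mutated(x, y, 2))
```

In this example `y` needs both a reversal and a transposition, so the pair is 2-mutated but not 1-mutated.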
29Mutation Sensitive Selection
- Problem formulation
- To find a pair of k-mutated subsequences with maximum weight
- We believe that the problem is NP-hard
- The Genome Rearrangement Problem, believed to be NP-hard, can be reduced to this problem
- We give an efficient approximation algorithm
- the resulting weight is close to (at least 1/(3k+1) times) the maximum possible weight
- O(n^2 log n + k n^2) time, O(n^2) space
31Hybrid Approach
- first apply the clustering approach to identify clusters which are obviously conserved genes
- can apply either MUMmer-3 or MaxMinCluster
- these clusters are treated as MUMs with bigger weight
- then apply MSS to process these MUMs together with the remaining MUMs
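The step that turns clusters into heavier MUMs might look like the sketch below. It assumes MUMs are `(pos_a, pos_b, weight)` triples and that a cluster's weight is the sum of its members' weights. Both are illustrative choices, not details given in the talk. The clustering and the MSS selection that would run before and after this step are not shown.

```python
def collapse_clusters(clusters, leftover_mums):
    """Replace each cluster by one heavy pseudo-MUM and merge with the rest.

    clusters: list of lists of (pos_a, pos_b, weight)
    leftover_mums: MUMs that were not placed in any cluster
    Returns one sorted list ready for a later selection step.
    """
    merged = []
    for cluster in clusters:
        first = min(cluster)  # leftmost member gives the position
        total = sum(weight for _, _, weight in cluster)
        merged.append((first[0], first[1], total))
    merged.extend(leftover_mums)
    return sorted(merged)


clusters = [[(0, 0, 20), (50, 60, 20)], [(1000, 900, 25), (1040, 930, 30)]]
leftover = [(400, 400, 20)]
print(collapse_clusters(clusters, leftover))
```

Because the collapsed clusters carry more weight than single MUMs, a weight-maximizing selection run afterwards is less likely to discard them.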
33Remarks
- Experiments show that
- MaxMinCluster > LCS
- MSS > MaxMinCluster for closely related species
- MSS does not perform well for species relatively farther apart
- Hybrid approach is the best for both closely related and farther apart species
34Thank you!
35Approximation Algorithm
- Super-Backbone
- maximum weight common subsequences
- Identify k mutation blocks
- having high weight
- do not overlap with Super-Backbone too much
- this is formulated as a sub-problem and solved optimally by dynamic programming
- Report Super-Backbone + k mutation blocks
- O(n^2 log n + k n^2) time, O(n^2) space
36Mutations
- three types of mutations: reversal, transposition, reversed-transposition
a b c d e f g h i j k l m n o p q r s t u v w x y z
a d c b e f g h i j k l m n o p q r s t u v w x y z   (reversal of b c d)
a d c b e k l m n o p q r s t u v w x y f g h i j z   (transposition of f g h i j)
a d c b e k l t s r q p o m n u v w x y f g h i j z   (reversed-transposition of o p q r s t)
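The alphabet example above can be replayed in a few lines. The helpers below are a sketch on unsigned strings (signs, which a reversal would flip for MUMs, are left out). Their names and index conventions are illustrative only: `i:j` is the block, and `k` is the insertion index in the sequence with the block removed.

```python
def reversal(seq, i, j):
    """Reverse the block seq[i:j] in place."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def transposition(seq, i, j, k):
    """Cut out seq[i:j] and reinsert it at index k of what remains."""
    block, rest = seq[i:j], seq[:i] + seq[j:]
    return rest[:k] + block + rest[k:]


def reversed_transposition(seq, i, j, k):
    """Like transposition, but the block is reinserted reversed."""
    block, rest = seq[i:j][::-1], seq[:i] + seq[j:]
    return rest[:k] + block + rest[k:]


start = "abcdefghijklmnopqrstuvwxyz"
step1 = reversal(start, 1, 4)                    # reverse b c d
step2 = transposition(step1, 5, 10, 20)          # move f g h i j before z
step3 = reversed_transposition(step2, 9, 15, 7)  # move o..t, reversed, before m
for s in (start, step1, step2, step3):
    print(" ".join(s))
```

The four printed lines match the four sequences listed on the slide.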