Title: Using PQ Trees For Comparative Genomics
1Using PQ Trees For Comparative Genomics
- Gad M. Landau
- Laxmi Parida
- Oren Weimann
2Gene Clusters
- Genes that appear together consistently across
genomes are believed to be functionally related;
their order, however, does not have to be the same.
3What is a πPattern?
- Given a string S = s1 s2 s3 … sn and an integer K, a
pattern P = {p1, p2, p3, …, pm} is a πpattern if P
occurs (possibly permuted) in at least K places
in S.
- Example: S = a b c d b a c d a b a c b, P = {a,b,c}, K = 4.
- P is a 4-πPattern with location list {1, 5, 10, 11}.
- For the moment we will assume that every
character appears once in the pattern.
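The definition above can be checked directly on the slide's example. The following is a minimal sketch, not the linear-time method of the talk: it slides a window over S and compares character counts. The names `location_list` and `is_pi_pattern` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter

def location_list(s, pattern):
    """1-based start positions where a window of s is a permutation of pattern."""
    m = len(pattern)
    target = Counter(pattern)
    return [i + 1 for i in range(len(s) - m + 1)
            if Counter(s[i:i + m]) == target]

def is_pi_pattern(s, pattern, k):
    """True if pattern occurs (possibly permuted) in at least k places in s."""
    return len(location_list(s, pattern)) >= k

S = "abcdbacdabacb"
print(location_list(S, "abc"))      # [1, 5, 10, 11]
print(is_pi_pattern(S, "abc", 4))   # True
```

The window comparison costs O(m) per position, which is fine for illustration but is not meant to reflect the complexity claimed in the talk.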
4Maximal πPatterns
- A πpattern p1 is non-maximal with respect to a
πpattern p2 if each occurrence of p1 is covered by an
occurrence of p2 and each occurrence of p2 covers an
occurrence of p1.
- Example: S = a b c d e b a d c e. {a,b} is
non-maximal with respect to {a,b,c,d,e}.
- Maximal notation: a representation of a maximal
πpattern p that illustrates all the non-maximal
πpatterns with respect to p.
- The maximal notation of {a,b,c,d,e} is
((a,b)-(c,d)-e).
- Our goal: given a string S, find all πpatterns p
and their maximal notation.
- Our solution: a linear-time algorithm based on
PQ trees.
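A brute-force sketch of the non-maximality test, assuming the covering definition (every occurrence of p1 lies inside an occurrence of p2, and every occurrence of p2 contains an occurrence of p1). The function names are hypothetical, and a separator character `x` is inserted between the two occurrences of the example string, an assumption made here so that windows do not straddle both occurrences.

```python
from collections import Counter

def occurrences(s, pattern):
    """0-based (start, end) spans of windows of s that are permutations of pattern."""
    m, target = len(pattern), Counter(pattern)
    return [(i, i + m - 1) for i in range(len(s) - m + 1)
            if Counter(s[i:i + m]) == target]

def is_non_maximal(s, p1, p2):
    """Assumed covering test between the occurrences of p1 and p2 in s."""
    occ1, occ2 = occurrences(s, p1), occurrences(s, p2)
    if not occ1 or not occ2:
        return False

    def covers(big, small):
        return big[0] <= small[0] and small[1] <= big[1]

    every_p1_is_covered = all(any(covers(b, a) for b in occ2) for a in occ1)
    every_p2_covers_one = all(any(covers(b, a) for a in occ1) for b in occ2)
    return every_p1_is_covered and every_p2_covers_one

S = "abcde" + "x" + "badce"   # "x" is an assumed separator, not part of the slide
print(is_non_maximal(S, "ab", "abcde"))   # True
print(is_non_maximal(S, "bc", "abcde"))   # False: "bc" is absent from the second occurrence
```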
5PQ trees [Booth, Lueker]: Definitions
- PQ trees [Booth, Lueker, 1976]
- Character-labeled leaves.
- P-nodes
  - Represent truly permuted components
  - Arbitrary permutations of children
- Q-nodes
  - Represent bi-connected components
  - Only reversal of the children's order
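6PQ trees: Definitions
- Equivalent PQ trees (denoted T ≡ T'): one can be obtained from the
other by arbitrarily permuting the children of P-nodes and reversing
the children of Q-nodes.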
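7PQ trees: Definitions
- FRONTIER(T): the leaf labels of T read from left to right.
- C(T): the set of frontiers of all trees
equivalent to T.
- Example: FRONTIER(T) = A B C D E F G H I J K and, for an
equivalent tree T', FRONTIER(T') = A B C G H I J K E F D.
- Theorem: If C(T1) = C(T2) then T1 ≡ T2.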
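To make C(T) concrete, here is a small sketch that enumerates the frontiers of all trees equivalent to a given PQ tree. The tree encoding (a leaf is a string, an internal node is a `("P", [...])` or `("Q", [...])` tuple) and the example tree are assumptions chosen for illustration; they are not taken from the slide's figure.

```python
from itertools import permutations, product

def frontiers(node):
    """Set of frontiers (tuples of leaf labels) over all equivalent trees."""
    if isinstance(node, str):
        return {(node,)}
    kind, children = node
    if kind == "P":
        # P-node: children may be permuted arbitrarily
        orders = set(permutations(range(len(children))))
    else:
        # Q-node: children may only be reversed
        forward = tuple(range(len(children)))
        orders = {forward, forward[::-1]}
    child_sets = [frontiers(c) for c in children]
    result = set()
    for order in orders:
        for combo in product(*(child_sets[i] for i in order)):
            result.add(tuple(label for part in combo for label in part))
    return result

# Hypothetical tree: a P-node over leaf x and a Q-node over a, b, c
tree = ("P", ["x", ("Q", ["a", "b", "c"])])
print(sorted("".join(f) for f in frontiers(tree)))
# ['abcx', 'cbax', 'xabc', 'xcba']
```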
8Our Use of the PQ tree
- Suppose the πPattern {a,b,c,d} appears in 4
locations as Π = {abcd, acbd, dbca, dcba}.
- Our goal: a PQ tree T with C(T) = {abcd,
acbd, dbca, dcba}.
- Write the P-nodes as ',' and the Q-nodes as '-'
and get (a-(b,c)-d), which is exactly the maximal
notation of the πPattern {a,b,c,d}.
[Figure: PQ tree with a Q-node root whose children are a, a P-node over b and c, and d]
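The notation rule (P-nodes written with ',' and Q-nodes with '-') is easy to state as code. This is a minimal sketch under an assumed encoding in which a leaf is a string and an internal node is a `("P", [...])` or `("Q", [...])` tuple; `to_notation` is a hypothetical helper name.

```python
def to_notation(node):
    """Write a PQ tree in the ','/'-' notation: P-nodes use ',', Q-nodes use '-'."""
    if isinstance(node, str):
        return node
    kind, children = node
    separator = "," if kind == "P" else "-"
    return "(" + separator.join(to_notation(child) for child in children) + ")"

# The tree for the occurrences abcd, acbd, dbca, dcba
tree = ("Q", ["a", ("P", ["b", "c"]), "d"])
print(to_notation(tree))   # (a-(b,c)-d)
```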
9The minimal Consensus PQ tree
- It is not always possible to find a tree T where
Π = C(T).
- Consider a πPattern {a,b,c,d} that appears as Π = {abcd, bdac}.
The best we can get is {abcd, bdac} ⊂ C(T).
- Given permutations Π = {π1, π2, …, πk}, the consensus
PQ tree T of Π is such that Π ⊆ C(T), and the
consensus is minimal when there exists no other
T' such that Π ⊆ C(T') and |C(T')| < |C(T)|.
- The problem of obtaining a maximal notation for a
πPattern is the same as obtaining a minimal
consensus PQ tree of all the k occurrences.
- Theorem: The minimal consensus PQ tree T is
unique.
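The size |C(T)| used in the minimality condition can be computed without enumerating frontiers. The sketch below assumes distinct leaf labels, P-nodes with at least two children, and Q-nodes with at least two children (so that reversal always yields a different order); under those assumptions a P-node with c children contributes a factor of c! and a Q-node a factor of 2. The tuple encoding and the name `count_frontiers` are hypothetical.

```python
from math import factorial

def count_frontiers(node):
    """|C(T)| for a PQ tree with distinct leaves (see assumptions in the text)."""
    if isinstance(node, str):
        return 1
    kind, children = node
    total = factorial(len(children)) if kind == "P" else 2
    for child in children:
        total *= count_frontiers(child)
    return total

# For abcd and bdac no proper subset of {a,b,c,d} stays contiguous in both,
# so a consensus tree is a single P-node and admits far more than 2 frontiers.
print(count_frontiers(("P", ["a", "b", "c", "d"])))            # 24
print(count_frontiers(("Q", ["a", ("P", ["b", "c"]), "d"])))   # 4
```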
10The original use of the PQ Tree
- The consecutive 1s problem
- The restriction sets: F = {{a,b,c}, {b,c}, {b,c,d}, {b}}
- The solution [Booth, Lueker, 1976]: Reduce(F)
- The result will be C(T); in our case C(T) = {abcd,
acbd, dbca, dcba}, and the tree was constructed in O(n²) time (for
an n x n matrix).
[Figure: the PQ tree (a-(b,c)-d) produced by Reduce(F) [Booth, Lueker, 1976]]
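What Reduce(F) has to produce can be illustrated by brute force: keep exactly those orderings in which every restriction set is contiguous. This is a stand-in that enumerates all permutations, not the Booth and Lueker algorithm, and the helper names are hypothetical.

```python
from itertools import permutations

def is_contiguous(order, subset):
    """True if the elements of subset occupy consecutive positions in order."""
    indices = sorted(order.index(x) for x in subset)
    return indices[-1] - indices[0] + 1 == len(indices)

def brute_force_reduce(universe, restrictions):
    """All orderings of universe that keep every restriction set contiguous."""
    return {"".join(p) for p in permutations(universe)
            if all(is_contiguous(p, r) for r in restrictions)}

F = [{"a", "b", "c"}, {"b", "c"}, {"b", "c", "d"}, {"b"}]
print(sorted(brute_force_reduce("abcd", F)))
# ['abcd', 'acbd', 'dbca', 'dcba']
```

The enumeration is exponential in the number of characters, so it only serves to check small examples against the tree.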
11Obtaining the Minimal Consensus PQ tree
- Some definitions [Heber, Stoye, 2001]
- Common interval: an interval that appears as a
consecutive sequence in all the appearances,
e.g. [4,8] in the example.
- We denote all common intervals of Π by C_Π:
C_Π = {[1,2], [2,3], [1,3], [1,9], [1,8], [4,5], [4,6], [4,7], [4,8], [5,6]}
- A list p of common intervals is a chain if every
two successive intervals in p have a non-trivial
overlap. For example, p = ([1,2], [2,3]).
- A common interval is called reducible if there
is a chain that generates it; otherwise it is
called irreducible. [1,3] is a reducible interval
since it can be generated by the irreducible
intervals [1,2] and [2,3].
- We denote all irreducible intervals of Π by I_Π:
I_Π = {[1,2], [1,8], [2,3], [4,5], [4,7], [4,8], [5,6]}
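The definitions can be tried on a small input. The sketch below is a brute-force illustration, not the Heber and Stoye algorithm. It uses the four occurrences abcd, acbd, dbca, dcba from the earlier slide, renamed a=1, b=2, c=3, d=4 (an assumption made so that the first permutation is the identity). The reducibility test looks for two properly overlapping common intervals whose union is the interval; this pairwise shortcut relies on the property that the union of two overlapping common intervals is again a common interval.

```python
def common_intervals(perms):
    """Intervals (lo, hi), lo < hi, contiguous in every permutation.
    Assumes perms[0] is the identity 1..n."""
    n = len(perms[0])
    positions = [{v: i for i, v in enumerate(p)} for p in perms]
    found = set()
    for lo in range(1, n + 1):
        for hi in range(lo + 1, n + 1):
            values = range(lo, hi + 1)
            if all(max(pos[v] for v in values) - min(pos[v] for v in values) == hi - lo
                   for pos in positions):
                found.add((lo, hi))
    return found

def irreducible_intervals(common):
    """Common intervals that are not the union of two overlapping common intervals."""
    def reducible(interval):
        lo, hi = interval
        return any((lo, b) in common and (a, hi) in common
                   for b in range(lo + 1, hi)
                   for a in range(lo + 1, b + 1))
    return {iv for iv in common if not reducible(iv)}

perms = [(1, 2, 3, 4), (1, 3, 2, 4), (4, 2, 3, 1), (4, 3, 2, 1)]
C = common_intervals(perms)
print(sorted(C))                          # [(1, 3), (1, 4), (2, 3), (2, 4)]
print(sorted(irreducible_intervals(C)))   # [(1, 3), (2, 3), (2, 4)]
```

Here (1, 4) is reducible because the chain ((1, 3), (2, 4)) generates it.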
12Obtaining the Minimal Consensus PQ tree
- Theorem: Reduce(C_Π) = Reduce(I_Π) = minimal
consensus tree.
- The Algorithm
  - Compute I_Π:
    I_Π = {[1,2], [1,8], [2,3], [4,5], [4,7], [4,8], [5,6]}
  - Compute Reduce(I_Π) to get the minimal
    consensus tree of Π.
- The πPattern notation is ((1-2-3)-(((4-5-6),7),8)-9).
- Time complexity: for a πpattern of size n that
appears in k places, it takes a total
of O(kn²) to compute the maximal notation.
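The resulting notation can be read back into a tree, which makes it easy to inspect which leaves each internal node groups together. The recursive-descent parser below is a sketch under assumptions: leaves contain none of the characters `( ) , -`, each parenthesized node uses a single separator type, and nodes are encoded as `("P", [...])` or `("Q", [...])` tuples. All names are hypothetical.

```python
def parse_notation(text):
    """Parse e.g. '(a-(b,c)-d)' into nested ('P'|'Q', [children]) tuples."""
    text = text.replace(" ", "")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] != "(":
            start = pos
            while pos < len(text) and text[pos] not in "(),-":
                pos += 1
            return text[start:pos]          # a leaf label
        pos += 1                            # consume "("
        children = [node()]
        separators = set()
        while text[pos] != ")":
            separators.add(text[pos])
            pos += 1                        # consume the separator
            children.append(node())
        pos += 1                            # consume ")"
        if len(separators) != 1:
            raise ValueError("a node must use exactly one separator type")
        return ("P" if separators == {","} else "Q", children)

    tree = node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree

def leaves(node):
    """Leaf labels of a parsed tree, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1] for leaf in leaves(child)]

tree = parse_notation("((1-2-3)-(((4-5-6),7),8)-9)")
print(tree[0], len(tree[1]))     # Q 3
print("".join(leaves(tree)))     # 123456789
```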
13Improving the Time Complexity to O(kn)
- In Heber and Stoye's algorithm for obtaining I_Π,
a data structure S was maintained to hold the
chains of the irreducible intervals:
[1,2], [1,8], [2,3], [4,5], [4,7], [4,8], [5,6]
- REPLACE(S)
  - Replace every chain by a Q-node.
  - Replace every element that is not a leaf or a
    Q-node and is pointed to by a vertical link with a
    P-node.
14Maximal πPatterns and Sub-Trees
- A sub-tree of the PQ tree T is obtained by
picking a P-node in T with all its descendants,
or by picking a Q-node in T with any number of
consecutive descendants.
- Suppose the πPattern {a,b,c,d} appears in 4
locations as Π = {abcd, acbd, dbca, dcba}.
- Theorem 4: If p1 and p2 are πpatterns, and p1 is
non-maximal with respect to p2, then the PQ tree
T1 that represents p1 is a sub-tree of the PQ
tree T2 that represents p2.
[Figure: the PQ tree (a-(b,c)-d)]
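The sub-tree rule can be turned into a short enumeration of the leaf sets it yields. This is a sketch under assumptions: trees are encoded as `("P", [...])` / `("Q", [...])` tuples with string leaves, and for a Q-node only runs of at least two consecutive children are listed (a single child is either a leaf or is handled when that child is visited). The function names are hypothetical.

```python
def leaves(node):
    """Leaf labels of a PQ tree, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1] for leaf in leaves(child)]

def subtree_patterns(node):
    """Leaf sets of sub-trees: a whole P-node, or a run of consecutive Q-node children."""
    if isinstance(node, str):
        return set()
    kind, children = node
    found = set()
    if kind == "P":
        found.add(frozenset(leaves(node)))
    else:
        for i in range(len(children)):
            for j in range(i + 1, len(children)):
                run = children[i:j + 1]
                found.add(frozenset(l for child in run for l in leaves(child)))
    for child in children:
        found |= subtree_patterns(child)
    return found

tree = ("Q", ["a", ("P", ["b", "c"]), "d"])
print(sorted("".join(sorted(p)) for p in subtree_patterns(tree)))
# ['abc', 'abcd', 'bc', 'bcd']
```

For the example tree this lists {b,c}, {a,b,c}, {b,c,d} and the full pattern {a,b,c,d}.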
15So what did we achieve?
- The first algorithm (optimal in time) that
generates the maximal notation of a pattern,
allowing:
  - A visualization of the inner structure of a
    pattern.
  - Filtering of meaningful from apparently
    meaningless (non-maximal) clusters.
- Experimental results showing that this tool can aid
in predicting gene functions.
- Clustering for the various genome models.
16Using Our Tool for Various Genome Models
- Genome model I (orthologs only)
- A sequence is a permutation of the set
{1, 2, …, n}. There is only one maximal πpattern, {1, 2, …, n}. In
O(kn) time we get a PQ tree that describes all
patterns of all sizes and their non-maximal
relations.
17Using Our Tool for Various Genome Models
- Genome model II: A gene may appear once in a
sequence or not appear at all in that sequence.
- We can extend the algorithm to work on
sequences that are not permutations of the same
set.
- Example: consider the 2 sequences
1 2 3 4 5 6 7 and 1 8 2 4 3 7 6.
- Add characters as needed:
8 1 2 3 4 5 5 6 7 8 and 5 1 8 8 2 4 3 7 6 5.
- Build the PQ tree on the new sequences.
- The sub-trees that have no red leaves are all
the maximal patterns.
[Figure: PQ tree built on the extended sequences; the added characters appear as red leaves]
18Using Our Tool for Various Genome Models
- Genome model III (paralogs and orthologs)
- A gene may appear any number of times in a
sequence (including zero).
- The minimal consensus PQ tree is not necessarily
unique.
- Solution: label the copies of each repeated character.
- Example: consider 2 appearances of the πpattern
{a,a,b} as Π = {aab, baa}.
  1. Π = {a1a2b, ba2a1} gives
     C(T) = {a1a2b, ba2a1}.
  2. Π = {a1a2b, ba1a2} gives
     C(T) = {a1a2b, ba2a1, a2a1b, ba1a2}.
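The two labelings in the example can be generated mechanically. The sketch below enumerates every way of numbering the repeated characters of one occurrence; it only illustrates where the alternative consensus trees come from and makes no claim about how the authors choose among them. The name `labelings` is hypothetical.

```python
from itertools import permutations, product

def labelings(occurrence):
    """All ways to number the repeated characters of one occurrence."""
    positions = {}
    for index, ch in enumerate(occurrence):
        positions.setdefault(ch, []).append(index)
    per_char = [
        [(ch, idxs, order) for order in permutations(range(1, len(idxs) + 1))]
        for ch, idxs in positions.items()
    ]
    results = []
    for choice in product(*per_char):
        labeled = [None] * len(occurrence)
        for ch, idxs, order in choice:
            for index, k in zip(idxs, order):
                # characters that appear once keep their plain name
                labeled[index] = f"{ch}{k}" if len(idxs) > 1 else ch
        results.append(tuple(labeled))
    return results

# Fix the first occurrence as a1 a2 b and vary the labeling of the second
print(labelings("baa"))
# [('b', 'a1', 'a2'), ('b', 'a2', 'a1')]
```

The number of labelings grows with the product of the factorials of the copy counts, so this is only practical for small patterns.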
19Our Current work
- We are extending the notion of permutation
patterns to permutation patterns of trees
(connected components with the same vertex label
set) and graphs. We are developing algorithms to
represent the maximal notation of a πpattern in
trees and graphs.
20Using PQ Trees For Comparative Genomics
- Cast: Trees, Lines, Arrows, Intervals, Strings,
Patterns, Frontiers
- Based on a true story
by Gad M. Landau, Laxmi Parida, Oren Weimann
- No trees were harmed during the making of this
presentation