Title: Activities in Combinatorial Pattern Matching
1 Activities in Combinatorial Pattern Matching
- FDK SAB Meeting
- March 22, 2004
2 Algorithms on strings
- members: J Kärkkäinen, V Mäkinen, K Fredriksson (-12/2002),
  K Lemström, H Tamm, S Inenaga (9/2003-), S Burkhardt (2/2004-)
- direct linear time construction of suffix arrays
  (J Kärkkäinen, ICALP 03): elegant algorithm based on a novel
  approach; textbook material
- gapped q-gram filters for approximate string matching
  (Burkhardt/Kärkkäinen): generalized q-grams give better filters
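As a rough illustration of what a gapped q-gram is, the sketch below extracts q-grams according to a shape and counts how many of them two strings share. The shape notation (`#` for a sampled position, `.` for a skipped one), the function names and the example strings are assumptions made for this illustration; it is a simplified counting scheme, not the Burkhardt/Kärkkäinen filter itself.

```python
from collections import Counter


def gapped_qgrams(s, shape):
    """Return the gapped q-grams of s for a shape such as '##.#'.

    '#' marks a sampled position and '.' a skipped one (assumed notation).
    A shape without gaps, e.g. '###', gives ordinary contiguous q-grams.
    """
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    return ["".join(s[start + o] for o in offsets)
            for start in range(len(s) - span + 1)]


def shared_qgrams(a, b, shape):
    """Count gapped q-grams common to a and b (multiset intersection)."""
    count_a = Counter(gapped_qgrams(a, shape))
    count_b = Counter(gapped_qgrams(b, shape))
    return sum((count_a & count_b).values())


print(gapped_qgrams("ACGTAC", "##.#"))
print(shared_qgrams("ACGTAC", "ACCTAC", "##.#"))
```

A filter built on this idea would discard text regions whose shared q-gram count falls below a threshold; the threshold depends on the shape and the number of allowed errors and is not derived here.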
3 Algorithms on strings (cont.)
- string database query systems
- efficient implementations of finite multitape automata
- minimization of finite-state automata
- THM: A bideterministic finite-state automaton is minimal
- PhD project of H Tamm
4 Transposition invariance
[Figure: transposition by -2]
5 Algorithms on strings (cont.)
- transposition invariant string matching
- motivated by music information retrieval
- $A = a_1 \ldots a_m$ translated by $t$: $A + t = a_1 + t, \ldots, a_m + t$
- translation invariant distance: $d_T(A,B) = \min_{t \in \Sigma} d(A + t, B)$
6 Algorithms on strings (cont.)
- exact case is simple
- interval: $a_{i+1} - a_i$
- $\mathrm{intervals}(A + t) = \mathrm{intervals}(A)$
- use interval sequences $a_2 - a_1, a_3 - a_2, \ldots, a_m - a_{m-1}$
  instead of originals in exact matching
- transposition invariance for free
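The interval trick for the exact case can be sketched in a few lines. The function names and the MIDI-style pitch numbers are assumptions for illustration, and the naive scan stands in for any exact string matcher (for example Knuth-Morris-Pratt) run on the interval sequences.

```python
def intervals(seq):
    """Differences between consecutive elements: a2-a1, a3-a2, ..."""
    return [b - a for a, b in zip(seq, seq[1:])]


def find_transposed(pattern, text):
    """Start positions in text where some transposition of pattern occurs.

    Matching is done on interval sequences, so the transposition itself
    never has to be guessed.
    """
    if not pattern or len(pattern) > len(text):
        return []
    p, t = intervals(pattern), intervals(text)
    return [i for i in range(len(t) - len(p) + 1) if t[i:i + len(p)] == p]


pattern = [60, 62, 64]
text = [65, 67, 69, 70, 72, 74]
print(find_transposed(pattern, text))
```

Adding a constant to every pitch leaves the intervals unchanged, which is why `intervals([62, 64, 66])` equals `intervals([60, 62, 64])`.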
7 Algorithms on strings (cont.)
- approximate case: edit distance
- repeat dynamic programming for all $O(mn)$ relevant transpositions:
  $O(m^2 n^2)$
- apply sparse dynamic programming: $O(mn \log m)$
- V Mäkinen's PhD Thesis (2003)
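The brute-force baseline mentioned on this slide can be sketched as follows: run the edit distance dynamic programming once for every transposition of the form $b_j - a_i$ and keep the minimum. Unit-cost (Levenshtein) edit distance and the function names are assumptions for illustration; the sparse dynamic programming improvement is not shown.

```python
def edit_distance(a, b):
    """Unit-cost edit distance by row-wise dynamic programming, O(mn) time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def transposition_invariant_distance(a, b):
    """Minimum edit distance over all transpositions t = b_j - a_i.

    Any other t produces no matching pair at all, so it cannot beat
    the candidates tried here.
    """
    candidates = {y - x for x in a for y in b} or {0}
    return min(edit_distance([x + t for x in a], b) for t in candidates)


print(transposition_invariant_distance([60, 62, 64], [62, 64, 66]))
print(transposition_invariant_distance([1, 2, 3], [5, 6, 8]))
```

With up to $mn$ candidate transpositions and $O(mn)$ work per candidate, this sketch has the $O(m^2 n^2)$ cost that the sparse approach avoids.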
8 Matches cover all pairs in DP table
[Figure: DP table; cell $(i, j)$ is a match under transposition $b_j - a_i$]
10 [Figure: piano-roll representation of notes over time]
11 Algorithms on strings: Piano-roll matching
- geometric pattern matching under translations (Brass 2002)
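To make the problem statement concrete, the sketch below finds all translations that map a point pattern into a point set, treating each note as an assumed `(onset time, pitch)` pair. It is a naive hash-based baseline written for illustration, not the algorithm of Brass or the methods developed in the group.

```python
def translated_occurrences(pattern, target):
    """All translations (dx, dy) such that pattern + (dx, dy) is a subset of target.

    The first pattern point is aligned with each target point in turn and
    the remaining points are checked by set lookup.
    """
    points = set(target)
    if not pattern:
        return []
    px, py = pattern[0]
    result = []
    for tx, ty in sorted(points):
        dx, dy = tx - px, ty - py
        if all((x + dx, y + dy) in points for x, y in pattern):
            result.append((dx, dy))
    return result


pattern = [(0, 60), (1, 62), (2, 64)]
target = [(0, 60), (1, 62), (2, 64), (4, 65), (5, 67), (6, 69), (5, 50)]
print(translated_occurrences(pattern, target))
```

A translation in time corresponds to where the pattern starts, and a translation in pitch corresponds to a musical transposition, so both are handled by the same search.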
13 Algorithms on strings: Patterns with small tree dimension
[Figure: example pattern with $k_T = 2$]
- Type of a positive edge $(a,b)$: $b - a$
- Tree dimension $k_T$ of $P$: smallest number of edge types in a
  positive spanning tree of $P$
14 Algorithms on strings: Geometric generalization of the
Knuth-Morris-Pratt algorithm
- THM: Translated occurrences of P with tree dimension kT can be
  found in O(2mmkTn) bitvector operations on m-bit vectors.
- THM: Finding kT is NP-complete; it can be approximated within a
  logarithmic factor by the greedy set cover algorithm.
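For reference, the generic greedy set cover heuristic named in the second theorem can be sketched as below. The slide does not give the reduction from tree dimension to set cover, so the input format (a dictionary from hypothetical set names to sets) and the example are assumptions; only the standard greedy rule is shown.

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset covering the most still-uncovered elements.

    subsets maps a name to a set of elements; returns the chosen names in
    the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


subsets = {"a": {1, 2, 3}, "b": {2, 4}, "c": {3, 4}, "d": {4, 5}}
print(greedy_set_cover({1, 2, 3, 4, 5}, subsets))
```

The greedy rule is what yields the logarithmic approximation factor for set cover in general.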
15 Algorithms on strings: plans
- software library of string algorithms (J Kärkkäinen et al)
- continue with basic research: index structures, ...
- inversion problems on sequences: given a set of sequences, find a
  model for them
- combine combinatorial and probabilistic approaches
16 Music retrieval and analysis
- members: K Lemström, V Mäkinen, A Pienimäki
- efficient computational methods for music comparison, retrieval
  and analysis: monophonic vs polyphonic music
- content-based retrieval: query-by-humming
- geometric (piano-roll) sweepline algorithms
- query engine prototype
- open: indexing of polyphonic music
- open: distance measures for music, computational characterization
  of musical styles
17 http://www.cs.helsinki.fi/group/cbrahms/demoengine/
18 Biological sequence analysis
- members: T Kivioja, K Palin, P Rastas, J Vilo (EBI)
- bioinformatics component of a novel DNA expression measurement
  technique, TRAC, developed at VTT Biotechnology: selection and
  pooling of hybridization probes for entire genomes
- computational methods for optimizing cDNA-AFLP experiments
- T Kivioja's PhD Thesis (expected in 2004)
19 Biological sequence analysis (cont.)
- SPEXS: tool for finding regulatory patterns (common motifs) from a
  set of sequences, with several applications; J Vilo's PhD Thesis
  (2002)
- general method for finding correlations between observed and
  predicted effects of gene knockout experiments (ECCB2002); gene
  regulatory networks
- the effect of SNPs on the binding affinities of transcription
  factors; comparative genomics approach
- clustering of binding site motifs over multiple genomes
  (unpublished)
20 Biological sequence analysis (cont.)
- algorithms for finding haplotype blocks and haplotype mosaics
- inversion of recombinations
- new Hidden Markov Model (unpublished)
- Minimum Description Length Principle, EM algorithm
- cf. group Mannila / group Toivonen
21 Haplotype data
22 Hidden Markov Model
[Figure: cross-over probabilities]
23 Metabolic modeling
- members: J Rousu (postdoc in London), A Rantanen, E Pitkänen
- data from isotopic-tracing experiments
- new closed-form method for estimation of metabolic fluxes in
  steady state
- underdetermined linear systems
- propagation of information through the metabolic network, guided
  by the carbon maps of individual reactions
- prototype software system
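To show why steady-state flux estimation leads to underdetermined linear systems, the sketch below counts the free fluxes left by the balance equations of a small branching network. The toy network, the matrix layout (rows are internal metabolites, columns are fluxes) and the function names are hypothetical; this is not the group's closed-form method.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix by exact Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            if m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def free_fluxes(equations, n_fluxes):
    """Degrees of freedom of a consistent linear system in n_fluxes unknowns."""
    return n_fluxes - rank(equations)


# Hypothetical network: v1 produces B; B splits into C (v2) and D (v3);
# C and D are drained by v4 and v5. Steady state: net production is zero.
balances = [
    [1, -1, -1, 0, 0],   # B: v1 - v2 - v3 = 0
    [0, 1, 0, -1, 0],    # C: v2 - v4 = 0
    [0, 0, 1, 0, -1],    # D: v3 - v5 = 0
]
measured_uptake = [[1, 0, 0, 0, 0]]  # assume v1 is measured directly

print(free_fluxes(balances, 5))
print(free_fluxes(balances + measured_uptake, 5))
```

Even with the uptake flux measured, one degree of freedom remains (the split at the branch point), which is the kind of gap that additional data such as isotopic-tracing measurements are meant to close.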
24 Metabolic modeling: plans
- www server based on current prototype software (Neobio/TEKES)
- utilizing gene expression data
- what-if analysis: hypothetical reactions
- better algorithms: bi-directional reactions, cell compartments,
  higher-order balance equations, improved propagation of
  information, incremental versions
25 Computational structural biology
- members: J Ravantti, K Fredriksson (-12/2002), T Mielikäinen,
  T Ojamies
26 Computational struct. biology (cont.)
- constraint-based algorithm for tomographic reconstruction
- sinogram-based (sound) method for noise reduction
- finding consistent orientations is NP-hard and inapproximable
- J Ravantti's PhD project
27 Computational struct. biology (cont.)
- BLAST for 3D density models (arrays of voxels): lots of
  applications
- model comparison, substructure search, shared substructures
  (all-against-all)
- translation-rotation invariance: 6D search
- differences in distance and density scales
- solution under development: contour extraction, geometric hashing
28 Computational struct. biology (cont.)
[Figure: contour extraction]
29 Computational struct. biology (cont.)
[Figure: original model; model assembled from substructures]