Title: Learning From String Sequences
1. Learning From String Sequences
- David Lindsay
- Supervisors: Zhiyuan Luo, Alex Gammerman and Volodya Vovk
2. Overview
- Background
- Pattern Recognition of String Data
- Traditional Approach: String-to-Word-Vector (SWV) and its Implementation Issues
- Learning with K-Nearest Neighbours (K-NN)
- The Universal Similarity Metric (USM) as an alternative to SWV
- Kolmogorov Complexity Estimation using Compression Algorithms
- Experiments using the USM-K-NN Learner
3. Lots of string data!!!
- Examples of data:
  - Text data: email messages, news articles, web pages
  - Biological data: DNA, proteins
- Main problems presented:
  - Highly symbolic representation
  - Complex underlying syntax
  - Variable length (could be 100 characters long or 100,000 long!)
4. Pattern recognition using strings
- Goal of pattern recognition: find the best label for each new test object.
5. Traditional Approach: String-to-Word-Vector (SWV)
- (1) Break the string down into a fixed set of words; (2) use the word frequencies as features of the string. Example email:
    Date: Tue, 8 Jun 2004 09:51:15 +0100 (BST)
    From: Steve Schneider
    To: CompSci research postgrads
    Cc: CompSci academic staff
    Subject: postgrad colloquium

    Dear all, Just a reminder to those who have not already provided
    your titles and abstracts for your postgraduate colloquium talks,
    that these are due by the end of today. Steve
    --------------------------------------------------------------------
    Professor Steve Schneider
    Department of Computer Science
    Royal Holloway, University of London,
    Egham, Surrey TW20 0EX, UK.
    Tel: +44 1784 443431   Fax: +44 1784 439786
    S.Schneider_at_cs.rhul.ac.uk
6. Implementation of SWV
- Pick stop words: words that occur too often to be useful for classification, e.g. and, it, are, of, then, etc.
- Lemmatise: group similar words, e.g. postgrad → postgraduate, compsci → computer science, etc.
- Choose which and how many words to use as features.
- Lots of domain knowledge must be incorporated! (A toy sketch follows below.)
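As a rough illustration of these steps, a toy SWV pipeline might look like the sketch below. The stop-word list, lemma map, and vocabulary are hypothetical stand-ins for the domain knowledge the slide mentions, not the ones used in the experiments:

    from collections import Counter

    STOP_WORDS = {"and", "it", "are", "of", "then", "the", "to", "a"}
    LEMMAS = {"postgrad": "postgraduate", "compsci": "computer-science"}

    def swv_features(text, vocabulary):
        # Map a raw string to word frequencies over a fixed vocabulary.
        words = []
        for w in text.lower().split():
            w = w.strip(".,!?;:")              # crude tokenisation
            if w and w not in STOP_WORDS:      # drop stop words
                words.append(LEMMAS.get(w, w)) # lemmatise similar words
        counts = Counter(words)
        return [counts[v] for v in vocabulary]

    vocab = ["postgraduate", "colloquium", "abstracts", "titles"]
    print(swv_features("Just a reminder: postgrad colloquium titles are due today", vocab))
    # -> [1, 1, 0, 1]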
7. K-Nearest Neighbours
- Find the K closest training examples
- Choose the majority class label observed
- Easy estimation of probabilities using label frequencies (see the sketch below)
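A minimal sketch of this procedure with a generic, pluggable distance. The toy data and 1-D distance are purely illustrative; the talk substitutes the USM as the distance function:

    from collections import Counter

    def knn_predict(train, query, k, dist):
        # train: list of (object, label) pairs; dist: any distance function
        neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
        counts = Counter(label for _, label in neighbours)
        probs = {label: n / k for label, n in counts.items()}  # label frequencies
        return counts.most_common(1)[0][0], probs

    data = [(0.1, "spam"), (0.2, "spam"), (0.9, "ham"), (1.0, "ham")]
    print(knn_predict(data, 0.15, k=3, dist=lambda a, b: abs(a - b)))
    # -> ('spam', {'spam': 0.666..., 'ham': 0.333...})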
8. Universal Similarity Metric (USM), Li et al. (2003)
- Based on the non-computable notion of Kolmogorov complexity.
- Proven universality: recognises all effective similarities between strings.
- Essentially a normalised information distance → copes with variable-length strings (see the formula below).
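For reference, the distance of Li et al. (2003) can be written (glossing over the plain-versus-prefix complexity technicalities of the original paper) as

    d(x, y) = max{ K(x|y), K(y|x) } / max{ K(x), K(y) }

The normalisation by max{K(x), K(y)} is what lets the metric compare strings of very different lengths.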
9. Kolmogorov Complexity Estimations
- String x: "dear john, how are you doing"
- Definition: K(x) is the length of the shortest UTM program that writes string x to its output tape.
- Approximation: K(x) ≈ size of the compressed string (sketched below).
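A minimal sketch of this approximation; using zlib as the compressor is an assumption for illustration, and any off-the-shelf compressor plays the same role:

    import zlib

    def K_approx(s: str) -> int:
        # Approximate K(x) by the byte length of the compressed string.
        return len(zlib.compress(s.encode("utf-8"), 9))

    x = "dear john, how are you doing"
    print(K_approx(x))        # compressed size approximates K(x)
    print(K_approx(x * 10))   # repetitive input grows far slower than 10x

Note that short strings carry compressor header overhead, so the estimate is only meaningful up to an additive constant.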
10. Experiments
- Experimented using well-tested real-life data:
  - Spam Email dataset
  - Protein Localisation dataset
- Implemented a K-NN learner with a USM distance and tested it on the data.
- Compared with other methods that used the SWV approach (and variants).
11. Spam Results
    Precision (%)   Recall (%)   Algorithm
    99.1            92.5         USM-1-NN
    98.7            95.01        USM-10-NN
    99.14           95.43        USM-20-NN
    98.49           94.8         USM-30-NN
    99.02           82.35        Naive Bayes
    95.92           85.27        TiMBL KNN (Trad. 1-NN)
    87.93           53.01        MS Outlook patterns
12. Protein Results
13. Reliable probability forecasts
- Empirical Reliability Curves on protein data (loss definitions below):
  - Naïve Bayes: Error 24.7%, Square Loss 0.375, Log Loss 2.686
  - USM 30-NN: Error 26.0%, Square Loss 0.323, Log Loss 0.972
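For context, these are presumably the standard losses for a forecast distribution p and true label y; the slide does not spell out the definitions, nor the base of the logarithm:

    Square Loss = Σ_c ( p(c) − 1{c = y} )²
    Log Loss    = −log p(y)

Lower is better for both, so the much lower log loss of USM 30-NN, despite its slightly higher error rate, is what makes its probability forecasts the more reliable ones.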
14. Summary
- The USM distance is natural and successful for use in K-NN learners.
- USM K-NN learners:
  - give competitive classification accuracy
  - provide reliable probability forecasts
  - need less pre-processing of data
- Provides a new focus when designing learners → find a compression algorithm for the data.
- However, the USM approach is very slow (×50) and memory intensive!
15. Current and future work
- Parallels with cognitive science → cognition ≈ compression
- Try lossy compression and alternative compression algorithms
- Try multi-media data:
  - use MP3 for music
  - DivX for video
  - JPEG for images
16. Questions?
17. Conditional Kolmogorov Complexity K(x|y): Theoretical Definition
- W.l.o.g. consider a TM which has 3 tapes: input, work and output.
  - Input Tape: unidirectional
  - Work Tape: bi-directional
  - Output Tape: unidirectional
- To calculate K(x|y) we introduce y on the work tape and empty input on the input tape (formal definition below).
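In symbols, this amounts to the standard definition (stated here for completeness; the slide itself gives only the tape setup):

    K(x | y) = min{ ℓ(p) : U(p, y) = x }

i.e. the length ℓ(p) of the shortest program p that makes the universal machine U output x when y is supplied on the work tape.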
18. Conditional Kolmogorov Complexity K(x|y): Practical Approximation
- Use the Symmetry of Information theorem to simplify the computation, as sketched below.
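The Symmetry of Information theorem states, up to logarithmic terms, that K(xy) ≈ K(y) + K(x|y), which rearranges to the compressor-friendly estimate

    K(x | y) ≈ K(xy) − K(y)

A minimal sketch of the resulting USM approximation, with a real compressor C standing in for K (zlib and the helper names are illustrative assumptions, not necessarily the talk's implementation):

    import zlib

    def C(s: str) -> int:
        # Compressed size in bytes: a computable stand-in for K.
        return len(zlib.compress(s.encode("utf-8"), 9))

    def K_cond(x: str, y: str) -> int:
        # K(x|y) ~ K(yx) - K(y), via the Symmetry of Information theorem.
        return C(y + x) - C(y)

    def usm(x: str, y: str) -> float:
        # d(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}
        return max(K_cond(x, y), K_cond(y, x)) / max(C(x), C(y))

    # Similar strings score near 0, dissimilar ones near 1.
    print(usm("dear john, how are you doing", "dear jane, how are you doing"))
    print(usm("dear john, how are you doing", "ACGTACGTACGTACGTACGT"))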