1
Learning From String Sequences
  • David Lindsay
  • Supervisors: Zhiyuan Luo, Alex Gammerman and
    Volodya Vovk

2
Overview
  • Background
  • Pattern Recognition of String Data
  • Traditional approach: String-to-Word-Vector
    (SWV) and implementation issues
  • Learning with K-Nearest Neighbours (K-NN)
  • The Universal Similarity Metric (USM) as an
    alternative to SWV
  • Kolmogorov Complexity Estimation using
    Compression Algorithms
  • Experiments using USM-K-NN Learner

3
Lots of string data!!!
  • Examples of data
  • Text data: email messages, news articles, web
    pages
  • Biological data: DNA, proteins
  • Main problems presented:
  • Highly symbolic representation
  • Complex underlying syntax
  • Variable length (could be 100 characters long
    or 100,000 long!)

4
Pattern recognition using strings
  • Goal of pattern recognition: find the best
    label for each new test object.

5
Traditional Approach: String-to-Word-Vector (SWV)
  • (1) Break the string down into a fixed set of
    words; (2) use word frequencies as the string's features.


Date: Tue, 8 Jun 2004 09:51:15 +0100 (BST)
From: Steve Schneider
To: CompSci research postgrads
Cc: CompSci academic staff
Subject: postgrad colloquium

Dear all,
Just a reminder to those who have not already provided
your titles and abstracts for your postgraduate
colloquium talks, that these are due by the end
of today.
Steve
--------------------------------------------------------------------
Professor Steve Schneider
Department of Computer Science
Royal Holloway, University of London,
Egham, Surrey TW20 0EX, UK.
Tel: +44 1784 443431  Fax: +44 1784 439786
S.Schneider@cs.rhul.ac.uk
6
Implementation of SWV
  • Pick stop words: words that occur too often to
    be useful for classification, e.g.
    and, it, are, of, then, etc.
  • Lemmatise: group similar words, e.g. postgrad →
    postgraduate, compsci → computer science,
    etc.
  • Choose which and how many words to use as
    features.
  • Lots of domain knowledge must be incorporated!
    (See the sketch after this list.)
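
As a concrete illustration of steps (1)-(2) and the bullets above, here is a minimal Python sketch of an SWV feature extractor. The stop-word list, lemma table and vocabulary are tiny placeholders of my own, not the ones used in the talk.

  import re
  from collections import Counter

  # Tiny illustrative stop-word list and lemma table -- a real SWV system
  # needs much larger, domain-tuned versions (the "domain knowledge" above).
  STOP_WORDS = {"and", "it", "are", "of", "then", "the", "to", "a"}
  LEMMAS = {"postgrad": "postgraduate", "compsci": "computer science"}

  def string_to_word_vector(text, vocabulary):
      """(1) Break the string into words; (2) return word frequencies
      over a fixed vocabulary as the feature vector."""
      words = re.findall(r"[a-z]+", text.lower())
      words = [LEMMAS.get(w, w) for w in words if w not in STOP_WORDS]
      counts = Counter(words)
      return [counts[w] for w in vocabulary]

  vocab = ["postgraduate", "colloquium", "title", "abstract"]
  print(string_to_word_vector("Reminder: postgrad colloquium titles due", vocab))
  # -> [1, 1, 0, 0]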

7
K-Nearest Neighbours
  • Find the K closest training examples
  • Choose the majority class label observed
  • Easy estimation of probabilities using label
    frequencies
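
A minimal sketch of the K-NN rule on this slide, with probabilities read off as label frequencies among the neighbours; the function and variable names are illustrative, not from the talk.

  from collections import Counter

  def knn_predict(distance, train, x, k):
      """Find the k closest training examples, return the majority label
      and per-class probabilities from the neighbours' label frequencies."""
      neighbours = sorted(train, key=lambda ex: distance(ex[0], x))[:k]
      votes = Counter(label for _, label in neighbours)
      probs = {label: count / k for label, count in votes.items()}
      return votes.most_common(1)[0][0], probs

  # Toy usage with a Hamming distance on equal-length strings:
  train = [("aaaa", "A"), ("aaab", "A"), ("bbbb", "B")]
  hamming = lambda s, t: sum(c != d for c, d in zip(s, t))
  print(knn_predict(hamming, train, "aaba", k=3))
  # -> ('A', {'A': 0.666..., 'B': 0.333...})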
8
Universal Similarity Metric (USM), Li et al. (2003)
  • Based on the non-computable notion of Kolmogorov
    complexity.
  • Proven universality: recognises all effective
    similarities between strings.
  • Essentially a normalised information distance →
    copes with variable-length strings.

9
Kolmogorov Complexity Estimations
  • String x = "dear john, how are you doing"
  • Definition: K(x) = length of the shortest UTM program
    that writes string x to its output tape.
  • Approximation: K(x) ≈ size of the compressed string
    (see the sketch below).
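
A sketch of this approximation using Python's zlib; the talk does not say which compressor was used, so zlib here is an assumption.

  import zlib

  def K(x: str) -> int:
      """Approximate K(x) by the size (in bytes) of the compressed string."""
      return len(zlib.compress(x.encode("utf-8"), 9))

  print(K("dear john, how are you doing"))  # short, irregular: little saving
  print(K("a" * 1000))                      # highly regular: compresses heavily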

10
Experiments
  • Experimented using well-tested real-life data:
  • Spam Email dataset
  • Protein Localisation dataset
  • Implemented a K-NN learner with a USM distance
    and tested it on the data (sketched below).
  • Compared with other methods that used the SWV
    approach (and variants).
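
Combining the two sketches above gives a hedged outline of a USM-style K-NN learner. The distance below is the standard compression-based approximation of the USM (in the style of Li et al.); the exact variant used in these experiments is not specified in the transcript.

  import zlib

  def C(s: str) -> int:
      return len(zlib.compress(s.encode("utf-8"), 9))

  def usm_distance(x: str, y: str) -> float:
      """Compression-based approximation of the USM: a normalised
      information distance estimated from compressed sizes."""
      cx, cy = C(x), C(y)
      return (C(x + y) - min(cx, cy)) / max(cx, cy)

  # Plugged into the earlier K-NN sketch:
  #   label, probs = knn_predict(usm_distance, train_strings, test_string, k=20)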

11
Spam Results
Algorithm              Prec (%)   Recall (%)
USM-1-NN               99.1       92.5
USM-10-NN              98.7       95.01
USM-20-NN              99.14      95.43
USM-30-NN              98.49      94.8
Naive Bayes            99.02      82.35
TiMBL KNN (Trad 1NN)   95.92      85.27
MS Outlook patterns    87.93      53.01
12
Protein Results
13
Reliable probability forecasts
  • Empirical Reliability Curves on protein data

Algorithm     Error (%)   Square Loss   Log Loss
Naïve Bayes   24.7        0.375         2.686
USM 30-NN     26.0        0.323         0.972
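
For reference, the usual definitions of the two losses quoted above, for a forecast probability vector p and true label y (taking log base 2 is an assumption; the slide does not state the base):

  \mathrm{SquareLoss}(p, y) \;=\; \sum_{c=1}^{C} \bigl(p_c - \mathbb{1}[y = c]\bigr)^2,
  \qquad
  \mathrm{LogLoss}(p, y) \;=\; -\log_2 p_{y}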
14
Summary
  • USM distance: natural and successful for use in
    K-NN learners.
  • USM K-NN learners:
  • give competitive classification accuracy
  • provide reliable probability forecasts
  • require less pre-processing of data
  • Provides a new focus when designing learners → find
    a compression algorithm for the data
  • However, the USM approach is very slow (roughly ×50)
    and memory intensive!

15
Current and future work
  • Parallels with cognitive science → cognition as
    compression
  • Try lossy compression, and alternative
    compression algorithms
  • Try multi-media data:
  • use MP3 for music
  • DivX for video
  • JPEG for images

16
Questions?
17
Conditional Kolmogorov Complexity K(x | y)
Theoretical Definition
  • w.l.o.g. consider a TM which has 3 tapes: input,
    work and output

  Input tape: unidirectional
  Work tape: bi-directional
  Output tape: unidirectional

  • To calculate K(x | y) we introduce y on the
    work tape and an empty input on the input tape
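
In symbols, this is the standard definition the 3-tape setup formalises:

  K(x \mid y) \;=\; \min \bigl\{\, |p| : U(p, y) = x \,\bigr\}

i.e. the length of the shortest program p that makes the universal machine U write x to the output tape when y is supplied on the work tape.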

18
Conditional Kolmogorov Complexity K(x | y)
Practical Approximation
  • USM Distance (formula reconstructed below)
  • Use the Symmetry of Information theorem to simplify
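
The formulas on this slide were an image in the original deck; the following is reconstructed from Li et al. (2003/2004), so the deck's exact rendering may differ. The USM distance is

  d(x, y) \;=\; \frac{\max\{\, K(x \mid y^*),\; K(y \mid x^*) \,\}}{\max\{\, K(x),\; K(y) \,\}}

By Symmetry of Information, K(xy) \approx K(x) + K(y \mid x) \approx K(y) + K(x \mid y) up to logarithmic terms, so K(x \mid y) \approx K(xy) - K(y). Replacing K with the compressed size C gives the computable estimate

  d(x, y) \;\approx\; \frac{C(xy) - \min\{\, C(x),\; C(y) \,\}}{\max\{\, C(x),\; C(y) \,\}}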