Title: Learning From String Sequences
1. Learning From String Sequences
- David Lindsay
- Supervisors: Zhiyuan Luo, Alex Gammerman and Volodya Vovk
2. Overview
- Background
- Pattern Recognition of String Data
- Traditional Approach: String-to-Word-Vector (SWV) and its Implementation Issues
- Learning with K-Nearest Neighbours (K-NN)
- The Universal Similarity Metric (USM) as an alternative to SWV
- Kolmogorov Complexity Estimation using Compression Algorithms
- Experiments using the USM-K-NN Learner
3. Lots of string data!!!
- Examples of data:
  - Text data: email messages, news articles, web pages
  - Biological data: DNA, proteins
- Main problems presented:
  - Highly symbolic representation
  - Complex underlying syntax
  - Variable length (could be 100 characters long or 100,000 long!)
4. Pattern recognition using strings
- Goal of pattern recognition: find the best label for each new test object.
5. Traditional Approach: String-to-Word-Vector (SWV)
- (1) Break the string down into a fixed set of words; (2) use the word frequencies as features of the string. Example email:
    Date: Tue, 8 Jun 2004 09:51:15 +0100 (BST)
    From: Steve Schneider
    To: CompSci research postgrads
    Cc: CompSci academic staff
    Subject: postgrad colloquium

    Dear all, Just a reminder to those who have not already provided
    your titles and abstracts for your postgraduate colloquium talks,
    that these are due by the end of today. Steve
    --------------------------------------------------------------------
    Professor Steve Schneider
    Department of Computer Science
    Royal Holloway, University of London,
    Egham, Surrey TW20 0EX, UK.
    Tel: +44 1784 443431   Fax: +44 1784 439786
    S.Schneider_at_cs.rhul.ac.uk
6. Implementation of SWV
- Pick stop words: words that occur too often to be useful for classification, e.g. and, it, are, of, then, etc.
- Lemmatise: group similar words, e.g. postgrad → postgraduate, compsci → computer science, etc.
- Choose which and how many words to use as features.
- Lots of domain knowledge must be incorporated! (A toy sketch follows below.)
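As a rough illustration of these steps, a toy SWV pipeline might look like the sketch below. The stop-word list, lemma map, and vocabulary are hypothetical stand-ins for the domain knowledge the slide mentions, not the ones used in the experiments:

    from collections import Counter

    STOP_WORDS = {"and", "it", "are", "of", "then", "the", "to", "a"}
    LEMMAS = {"postgrad": "postgraduate", "compsci": "computer-science"}

    def swv_features(text, vocabulary):
        # Map a raw string to word frequencies over a fixed vocabulary.
        words = []
        for w in text.lower().split():
            w = w.strip(".,!?;:")              # crude tokenisation
            if w and w not in STOP_WORDS:      # drop stop words
                words.append(LEMMAS.get(w, w)) # lemmatise similar words
        counts = Counter(words)
        return [counts[v] for v in vocabulary]

    vocab = ["postgraduate", "colloquium", "abstracts", "titles"]
    print(swv_features("Just a reminder: postgrad colloquium titles are due today", vocab))
    # -> [1, 1, 0, 1]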
7. K-Nearest Neighbours
- Find the K closest training examples
- Choose the majority class label observed
- Easy estimation of probabilities using label frequencies (see the sketch below)
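A minimal sketch of this procedure with a generic, pluggable distance. The toy data and 1-D distance are purely illustrative; the talk substitutes the USM as the distance function:

    from collections import Counter

    def knn_predict(train, query, k, dist):
        # train: list of (object, label) pairs; dist: any distance function
        neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
        counts = Counter(label for _, label in neighbours)
        probs = {label: n / k for label, n in counts.items()}  # label frequencies
        return counts.most_common(1)[0][0], probs

    data = [(0.1, "spam"), (0.2, "spam"), (0.9, "ham"), (1.0, "ham")]
    print(knn_predict(data, 0.15, k=3, dist=lambda a, b: abs(a - b)))
    # -> ('spam', {'spam': 0.666..., 'ham': 0.333...})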
8. Universal Similarity Metric (USM), Li et al. (2003)
- Based on the non-computable notion of Kolmogorov complexity.
- Proven universality: recognises all effective similarities between strings.
- Essentially a normalised information distance → copes with variable-length strings (see the formula below).
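For reference, the distance of Li et al. (2003) can be written (glossing over the plain-versus-prefix complexity technicalities of the original paper) as

    d(x, y) = max{ K(x|y), K(y|x) } / max{ K(x), K(y) }

The normalisation by max{K(x), K(y)} is what lets the metric compare strings of very different lengths.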
9. Kolmogorov Complexity Estimations
- String x: "dear john, how are you doing"
- Definition: K(x) is the length of the shortest UTM program that writes string x to its output tape.
- Approximation: K(x) ≈ size of the compressed string (sketched below).
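A minimal sketch of this approximation; using zlib as the compressor is an assumption for illustration, and any off-the-shelf compressor plays the same role:

    import zlib

    def K_approx(s: str) -> int:
        # Approximate K(x) by the byte length of the compressed string.
        return len(zlib.compress(s.encode("utf-8"), 9))

    x = "dear john, how are you doing"
    print(K_approx(x))        # compressed size approximates K(x)
    print(K_approx(x * 10))   # repetitive input grows far slower than 10x

Note that short strings carry compressor header overhead, so the estimate is only meaningful up to an additive constant.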
10. Experiments
- Experimented using well-tested real-life data:
  - Spam Email dataset
  - Protein Localisation dataset
- Implemented a K-NN learner with a USM distance and tested it on the data.
- Compared with other methods that used the SWV approach (and variants).
11. Spam Results
    Precision (%)   Recall (%)   Algorithm
    99.1            92.5         USM-1-NN
    98.7            95.01        USM-10-NN
    99.14           95.43        USM-20-NN
    98.49           94.8         USM-30-NN
    99.02           82.35        Naive Bayes
    95.92           85.27        TiMBL KNN (Trad. 1-NN)
    87.93           53.01        MS Outlook patterns
12. Protein Results
13. Reliable probability forecasts
- Empirical Reliability Curves on protein data (loss definitions below):
  - Naïve Bayes: Error 24.7%, Square Loss 0.375, Log Loss 2.686
  - USM 30-NN: Error 26.0%, Square Loss 0.323, Log Loss 0.972
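For context, these are presumably the standard losses for a forecast distribution p and true label y; the slide does not spell out the definitions, nor the base of the logarithm:

    Square Loss = Σ_c ( p(c) − 1{c = y} )²
    Log Loss    = −log p(y)

Lower is better for both, so the much lower log loss of USM 30-NN, despite its slightly higher error rate, is what makes its probability forecasts the more reliable ones.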
14. Summary
- The USM distance is natural and successful for use in K-NN learners.
- USM K-NN learners:
  - give competitive classification accuracy
  - provide reliable probability forecasts
  - need less pre-processing of data
- Provides a new focus when designing learners → find a compression algorithm for the data.
- However, the USM approach is very slow (×50) and memory intensive!
15. Current and future work
- Parallels with cognitive science → cognition ≈ compression
- Try lossy compression and alternative compression algorithms
- Try multi-media data:
  - use MP3 for music
  - DivX for video
  - JPEG for images
16. Questions?
17. Conditional Kolmogorov Complexity K(x|y): Theoretical Definition
- W.l.o.g. consider a TM which has 3 tapes: input, work and output.
  - Input Tape: unidirectional
  - Work Tape: bi-directional
  - Output Tape: unidirectional
- To calculate K(x|y) we introduce y on the work tape and empty input on the input tape (formal definition below).
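In symbols, this amounts to the standard definition (stated here for completeness; the slide itself gives only the tape setup):

    K(x | y) = min{ ℓ(p) : U(p, y) = x }

i.e. the length ℓ(p) of the shortest program p that makes the universal machine U output x when y is supplied on the work tape.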
18. Conditional Kolmogorov Complexity K(x|y): Practical Approximation
- Use the Symmetry of Information theorem to simplify the computation, as sketched below.
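The Symmetry of Information theorem states, up to logarithmic terms, that K(xy) ≈ K(y) + K(x|y), which rearranges to the compressor-friendly estimate

    K(x | y) ≈ K(xy) − K(y)

A minimal sketch of the resulting USM approximation, with a real compressor C standing in for K (zlib and the helper names are illustrative assumptions, not necessarily the talk's implementation):

    import zlib

    def C(s: str) -> int:
        # Compressed size in bytes: a computable stand-in for K.
        return len(zlib.compress(s.encode("utf-8"), 9))

    def K_cond(x: str, y: str) -> int:
        # K(x|y) ~ K(yx) - K(y), via the Symmetry of Information theorem.
        return C(y + x) - C(y)

    def usm(x: str, y: str) -> float:
        # d(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}
        return max(K_cond(x, y), K_cond(y, x)) / max(C(x), C(y))

    # Similar strings score near 0, dissimilar ones near 1.
    print(usm("dear john, how are you doing", "dear jane, how are you doing"))
    print(usm("dear john, how are you doing", "ACGTACGTACGTACGTACGT"))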