Extracting Key-Substring-Group Features for Text Classification

About This Presentation

Slides: 12

Provided by: Pay27

Learn more at: https://www.public.asu.edu

Transcript and Presenter's Notes

1
Extracting Key-Substring-Group Features for Text
Classification

2
Motivation

Treating a text document as a string of characters
rather than a bag of words may provide a better
feature representation of the document for
classification purposes.
Sub-word features are captured, e.g.
morphological variants: work, worker, works,
worked.
Super-word features are captured, e.g. phrasal
effects such as the noun phrase "affected cells".
Word boundary detection problems can be avoided
(particularly useful for Eastern languages such as
Chinese).

3
Motivation continued

String-based classification can be achieved using
generative classifiers (e.g. Markov-based
classifiers), but
discriminative classifiers (e.g. SVM) have proven
to be superior. However,
for discriminative classifiers we need to
represent documents as a bag of features, where
the features are string-based rather than
word-based.

4
Challenges

Naïve approach: a bag of all possible substrings
Very high-dimensional: O(n^2) substrings for a
corpus of n characters
Redundant features
Better approach:
Group all substrings that have the same
distribution and treat each group as a single
feature.
Throw out groups that are not statistically
significant.
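
To make the size of the naïve feature space concrete, the following minimal Python sketch enumerates every substring of a short string by brute force. The function name all_substrings and the example string "banana" are illustrative assumptions, not taken from the presentation.

```python
from collections import Counter

def all_substrings(text: str) -> Counter:
    """Count every occurrence of every non-empty substring of text.

    Brute force: a string of n characters has n(n+1)/2 substring
    occurrences, which is the O(n^2) blow-up of the naive approach.
    """
    counts = Counter()
    n = len(text)
    for start in range(n):
        for end in range(start + 1, n + 1):
            counts[text[start:end]] += 1
    return counts

counts = all_substrings("banana")
total_occurrences = sum(counts.values())  # n(n+1)/2 with n = 6
distinct_features = len(counts)           # size of the naive feature space
```

Even for a six-character string the bag already holds 15 distinct features, and many of them (for example "an" and "ana") always occur together, which is the redundancy that grouping is meant to remove.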

5
Approach

Use a generalized suffix tree to capture all
substrings of a corpus.
Efficiently compute frequency statistics on the
substrings and create substring-groups.
Extract key-substring-groups by eliminating
groups that are:
Too frequent or not frequent enough
Context-dependent
Redundant (based on mutual information)
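
As a rough illustration of two of these filters, the Python sketch below keeps only groups whose frequency lies between two thresholds and computes the mutual information between two binary presence vectors, which could be used to flag a pair of groups as redundant. The function names, the thresholds min_freq and max_freq, and the use of plain pairwise mutual information are all assumptions made for this sketch; the slides do not give the exact criteria used in the paper.

```python
import math
from collections import Counter

def filter_by_frequency(group_freq, min_freq, max_freq):
    """Keep groups whose frequency is neither too low nor too high.

    group_freq maps a (hypothetical) group id to its frequency.
    """
    return {g: f for g, f in group_freq.items() if min_freq <= f <= max_freq}

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length sequences.

    xs[i] and ys[i] could be 0/1 flags saying whether two substring
    groups occur in document i; a high value suggests redundancy.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

kept = filter_by_frequency({"g1": 1, "g2": 40, "g3": 5000}, min_freq=5, max_freq=1000)
mi_same = mutual_information([1, 0, 1, 0], [1, 0, 1, 0])  # identical presence patterns
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent presence patterns
```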

6
Suffix Tree

A directed tree with exactly n numbered leaves
and at most n internal nodes, where n = |S|.
The path from the root to leaf i spells out
the suffix of the string S that starts at position
i.
If S contains a substring P, at least one suffix
will begin with that substring, so we can check for
the existence of P by searching the tree
starting at the root.
The frequency of a substring can be calculated by
counting the leaves in the subtree rooted at the
child node of the edge where the substring search
ended.
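
The lookup and frequency ideas can be sketched in Python with an uncompressed suffix trie. This is a simplification, not a true suffix tree: it takes O(n^2) time and space to build, and instead of counting leaves at query time it stores at each node the number of suffixes that pass through it, which gives the same frequency. The names Node, build_suffix_trie and frequency are illustrative assumptions.

```python
class Node:
    def __init__(self):
        self.children = {}  # maps a character to the child Node
        self.count = 0      # number of suffixes whose path passes through here

def build_suffix_trie(s: str) -> Node:
    """Insert every suffix of s into a character trie (naive O(n^2) build)."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root

def frequency(root: Node, pattern: str) -> int:
    """Number of occurrences of a non-empty pattern in the indexed string."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0  # pattern is not a prefix of any suffix
    return node.count

trie = build_suffix_trie("banana")
```

A query such as frequency(trie, "ana") walks down from the root exactly as described above and reads the count at the node where the search ends.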

7
Suffix Tree continued

Each internal node v has a path string spelled by
the path r -> v from the root r.
If the path string of a node v is the path string
of another node u with its first character removed,
there is a suffix link from u to v.
The suffix tree (including suffix links) for a
corpus of documents with a total of n characters
can be built in O(n) time using Ukkonen's algorithm.
All substrings whose paths end on the edge above
the same node occur at exactly the same positions,
so they have identical distributions and can
be treated as a substring-group.
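
The grouping can be checked by brute force on a small string. The Python sketch below groups substrings by their set of start positions, which is what sharing an edge in the suffix tree amounts to. It works on a single string rather than a corpus and does not build a suffix tree; the function name substring_groups is an assumption for illustration.

```python
from collections import defaultdict

def substring_groups(s: str) -> dict:
    """Group the substrings of s by the tuple of positions where they start.

    Substrings with identical start positions have identical
    distributions and fall into the same group.
    """
    positions = defaultdict(list)
    n = len(s)
    for start in range(n):
        for end in range(start + 1, n + 1):
            positions[s[start:end]].append(start)
    groups = defaultdict(list)
    for sub, starts in positions.items():
        groups[tuple(starts)].append(sub)
    return dict(groups)

groups = substring_groups("banana")
for starts, members in sorted(groups.items()):
    print(starts, members)
```

For "banana" the 15 distinct substrings collapse into 6 groups; for example "an" and "ana" both start at positions 1 and 3 and therefore form one group.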

8
Feature Selection

9
Feature Extraction

Every possible substring is a prefix of the
path string of some node.
Accumulate the key-substring-groups for each node
by traversing the suffix tree and collecting
every group that wasn't thrown out.
For each document, start with the node that
represents the entire document and follow the
suffix links, extracting the feature set for
each node.
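
The end result of this step is a bag of key-substring-group features per document. The Python sketch below produces such a bag by brute force, without suffix links: it assumes a hypothetical lookup table substring_to_group that maps each retained substring to its group id, and counts each group at most once per start position so that members of the same group are not double counted. All names and the example table are assumptions for illustration.

```python
from collections import Counter

def extract_features(document: str, substring_to_group: dict) -> Counter:
    """Count, per key-substring-group, the start positions where it occurs."""
    features = Counter()
    n = len(document)
    max_len = max((len(s) for s in substring_to_group), default=0)
    for start in range(n):
        matched = set()
        # Only substrings up to the longest retained one can match.
        for end in range(start + 1, min(n, start + max_len) + 1):
            group = substring_to_group.get(document[start:end])
            if group is not None:
                matched.add(group)
        for group in matched:
            features[group] += 1
    return features

# Hypothetical table: "an"/"ana" share one group, "n"/"na" another.
table = {"an": "G1", "ana": "G1", "n": "G2", "na": "G2"}
features = extract_features("banana", table)
```

The resulting counts can be fed to a discriminative classifier as an ordinary feature vector.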

10
Experiments

In experiments with English, Chinese and Greek
text, the approach outperformed the other methods.
Parameters were optimized using cross-validation.

11
Comments

The good:
A creative use of an existing algorithm and data
structure (the suffix tree) to do efficient
string-based feature extraction and selection for
text data.
The bad:
The authors did not run their own experiments for
the other methods; results are compared to
published results of other researchers.
Did not compare to word-based feature selection.
Did not experiment with spam classification.