Extracting Key-Substring-Group Features for Text Classification

  • KDD 2006
  • Dell Zhang, Univ. of London
  • Wee Sun Lee, Nat. Univ. of Singapore
  • Presented by Payam Refaeilzadeh

Motivation
  • Treating a text document as a string of characters
    rather than a bag of words may provide a better
    feature representation of the document for
    classification purposes
  • Sub-word features are captured, e.g.
    morphological variants: work, worker, works
  • Super-word features are captured, e.g. phrasal
    effects such as the noun phrase "affected cells"
  • Word boundary detection problems can be avoided
    (particularly useful for eastern languages)

Motivation continued
  • String-based classification can be achieved using
    generative classifiers (e.g. Markov-based
    classifiers), but
  • Discriminative classifiers (e.g. SVM) have proven
    to be superior, but
  • For discriminative classifiers we need to
    represent documents as a bag of features where
    the features are string-based rather than
    word-based

  • Naïve approach: bag of all possible substrings
  • Very high-dimensional: O(n^2) substrings for a
    document of n characters
  • Redundant features
  • Better approach:
  • Group all substrings that have the same
    distribution and treat each as a single feature.
  • Throw out groups that are not statistically
    significant
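
To make the dimensionality and redundancy of the naive approach concrete, here is a minimal Python sketch that builds the bag of all substrings for one short string. The function name `substring_bag` and the example string are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def substring_bag(text):
    """Naive bag of all substrings: one count per distinct substring.

    A string of n characters has n * (n + 1) / 2 substring occurrences,
    which is where the O(n^2) feature count comes from.
    """
    bag = Counter()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            bag[text[start:end]] += 1
    return bag

bag = substring_bag("banana")
print(sum(bag.values()))       # substring occurrences: 6 * 7 / 2 = 21
print(len(bag))                # distinct substrings (features): 15
print(bag["an"], bag["ana"])   # 2 2
```

In this example "an" and "ana" occur at exactly the same positions, so keeping both as separate features is redundant; this is the kind of redundancy that grouping is meant to remove.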

Approach
  • Use a generalized suffix-tree to capture all
    substrings of a corpus.
  • Efficiently compute frequency statistics on the
    substrings and create substring-groups.
  • Extract key-substring-groups by eliminating
    groups that are:
  • Too frequent or not frequent enough
  • Context-dependent
  • Redundant (based on mutual information)

Suffix Tree
  • A directed tree with exactly n numbered leaves
    and at most n internal nodes, where n = |S|
  • The path from the root to leaf i spells out
    the suffix of the string that starts at position i
  • If S contains a substring P, at least one suffix
    will begin with that substring, so we can check for
    the existence of P by searching the tree
    starting at the root
  • The frequency of a substring can be calculated by
    counting the leaves in the sub-tree rooted at the
    child node of the edge where the substring search
    ends
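
The search and leaf-counting idea can be sketched in Python as follows. This is a simplification under stated assumptions: it builds an uncompressed suffix trie for a single string in quadratic time, not a generalized suffix tree, and the names `TrieNode`, `build_suffix_trie` and `frequency` are hypothetical. Each node stores how many suffixes pass through it, which equals the number of leaves below the corresponding point in a suffix tree.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.count = 0      # number of suffixes that pass through this node

def build_suffix_trie(text):
    """Insert every suffix of text into a character trie (O(n^2) time)."""
    root = TrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
    return root

def frequency(root, pattern):
    """Occurrences of a non-empty pattern: walk from the root, then read
    the count stored at the node where the search ends."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0  # the search fell off the trie: pattern is absent
    return node.count

root = build_suffix_trie("banana")
print(frequency(root, "ana"))  # 2
print(frequency(root, "nan"))  # 1
print(frequency(root, "x"))    # 0
```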

Suffix Tree continued
  • Each internal node v has a path string spelled by
    the path r -> v from the root r
  • If the path string of a node v is the longest proper
    suffix of the path string of another node u, there
    is a suffix link from u to v
  • The suffix tree (including suffix links) for a
    corpus of documents with a total of n characters
    can be built in O(n) time using Ukkonen's algorithm
  • All substrings whose path ends in the edge above
    the same node have identical distribution and can
    be treated as a substring-group
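
As a rough illustration of what "identical distribution" means, the sketch below groups substrings by the exact set of positions at which they occur. It is a brute-force stand-in for the edge-based grouping, not the suffix-tree construction itself; the function name `substring_groups` and the toy corpus are assumptions made for the example.

```python
from collections import defaultdict

def substring_groups(docs):
    """Group distinct substrings that occur at exactly the same
    (document, start position) pairs across the corpus."""
    positions = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for start in range(len(text)):
            for end in range(start + 1, len(text) + 1):
                positions[text[start:end]].add((doc_id, start))
    groups = defaultdict(list)
    for sub, where in positions.items():
        groups[frozenset(where)].append(sub)
    # One list per group, members ordered from shortest to longest.
    return [sorted(members, key=len) for members in groups.values()]

for group in substring_groups(["abab"]):
    print(group)
```

For the single document "abab" the seven distinct substrings collapse into four groups: "a" with "ab", "aba" with "abab", "b" alone, and "ba" with "bab".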

Feature Selection
  • Compute the leaf frequency for each internal node
  • Mark out the nodes whose frequency is too low or
    too high
  • Mark out the nodes that have too few children
    (contextual independence)
  • Mutual information:
  • Mark out the nodes for which
    freq(node)/freq(parent) is too large
  • Mark out the nodes for which
    freq(node)/freq(suffix) is too large
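
The selection rules above can be sketched as a single predicate over per-node statistics. Everything named here is hypothetical: the `NodeStats` record, the threshold parameter names and the example values are assumptions for illustration, and whether each bound is inclusive is also a guess.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    freq: int         # leaf frequency of the node
    children: int     # number of child nodes
    parent_freq: int  # leaf frequency of the parent node (assumed > 0)
    suffix_freq: int  # leaf frequency of the suffix-link node (assumed > 0)

def is_key_node(node, min_freq, max_freq, min_children,
                max_parent_ratio, max_suffix_ratio):
    """Return True if the node survives all of the marking-out rules."""
    if not (min_freq <= node.freq <= max_freq):
        return False  # too rare or too common
    if node.children < min_children:
        return False  # too few children
    if node.freq / node.parent_freq > max_parent_ratio:
        return False  # freq(node)/freq(parent) too large
    if node.freq / node.suffix_freq > max_suffix_ratio:
        return False  # freq(node)/freq(suffix) too large
    return True

# Hypothetical thresholds, chosen only for the example.
limits = dict(min_freq=10, max_freq=1000, min_children=3,
              max_parent_ratio=0.8, max_suffix_ratio=0.8)
print(is_key_node(NodeStats(40, 5, 100, 80), **limits))   # True
print(is_key_node(NodeStats(90, 5, 100, 95), **limits))   # False
```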

Feature Extraction
  • Each possible substring is the start of a suffix
    that is the path string for a node.
  • Accumulate the key-substring-groups for each node
    by traversing the suffix tree and collecting
    anything that wasn't thrown out
  • For each document, start with the node that
    represents the entire document and follow the
    suffix links, extracting the feature set for
    each node
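
A simplified way to picture the per-document walk: following a suffix link drops the first character, so the walk visits every suffix of the document in turn. The sketch below imitates that with string slicing and direct prefix matching rather than a suffix tree, so it is a swapped-in technique and not the presented algorithm. The function `extract_features` and the mapping from one representative substring per group to a feature id are hypothetical.

```python
from collections import Counter

def extract_features(doc, key_substrings):
    """Count, for each feature id, how many suffixes of doc begin with
    that feature's representative substring.

    key_substrings: dict mapping a representative substring of each kept
    group to a feature id (a hypothetical input format).
    """
    features = Counter()
    for start in range(len(doc)):
        suffix = doc[start:]  # stands in for following one suffix link
        for sub, feature_id in key_substrings.items():
            if suffix.startswith(sub):
                features[feature_id] += 1
    return features

keys = {"ana": "g1", "na": "g2", "xyz": "g3"}
print(dict(extract_features("banana", keys)))  # {'g1': 2, 'g2': 2}
```

The nested loop is quadratic in the worst case; the point of the suffix-tree version is to get the same feature counts without rescanning the document for every key substring.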

  • In experiments with English, Chinese and Greek text,
    the approach outperformed other methods in every case.
  • Parameters optimized using cross-validation

  • The good
  • A creative use of an existing algorithm /
    structure (suffix-tree) to do efficient
    string-based feature extraction and selection for
    text data
  • The bad
  • Did not run own experiments. Results compared to
    published results of other researchers.
  • Did not compare to word-based feature selection
  • Did not experiment with spam classification