Supervised learning for text

1
Supervised learning for text
2
Organizing knowledge
  • Systematic knowledge structures
  • Ontologies
  • Dewey decimal system, the Library of Congress
    catalog, the AMS Mathematics Subject
    Classification, and the US Patent subject
    classification
  • Web catalogs
  • Yahoo!, Dmoz
  • Problem: manual maintenance

3
Topic Tagging
  • Finding similar documents
  • Guiding queries
  • Naïve Approach
  • Syntactic similarity between documents
  • Better approach
  • Topic tagging

4
Topic Tagging
  • Advantages
  • Increase vocabulary of classes
  • Hierarchical visualization and browsing aids
  • Applications
  • Email/Bookmark organization
  • News Tracking
  • Tracking authors of anonymous texts
  • E.g. using the Flesch-Kincaid index
  • Classifying the purpose of hyperlinks

5
Supervised learning
  • Learning to assign objects to classes given
    examples
  • Learner (classifier)

A typical supervised text learning scenario.
6
Difference with texts
  • ML classification techniques were developed for
    structured data
  • Text: many features and a lot of noise
  • No fixed number of columns
  • No categorical attribute values
  • Data scarcity
  • Larger number of class labels
  • Hierarchical relationships between classes are
    less systematic than in structured data

7
Techniques
  • Nearest Neighbor Classifier
  • Lazy learner: remembers all training instances
  • Decision on a test document: based on the
    distribution of labels over the training documents
    most similar to it
  • Assigns large weights to rare terms
  • Feature selection
  • Removes terms in the training documents which are
    statistically uncorrelated with the class labels
  • Bayesian classifier
  • Fit a generative term distribution Pr(d|c) to
    each class c of documents d
  • Testing: the distribution most likely to have
    generated a test document is used to label it

8
Other Classifiers
  • Maximum entropy classifier
  • Estimate a direct distribution Pr(c|d) from term
    space to the probability of various classes.
  • Support vector machines
  • Represent classes by numbers
  • Construct a direct function from term space to
    the class variable.
  • Rule induction
  • Induce rules for classification over diverse
    features
  • E.g. information from ordinary terms, the
    structure of the HTML tag tree in which terms are
    embedded, link neighbors, citations

9
Other Issues
  • Tokenization
  • E.g. replacing monetary amounts by a special
    token
  • Evaluating text classifiers
  • Accuracy
  • Training speed and scalability
  • Simplicity, speed, and scalability for document
    modifications
  • Ease of diagnosis, interpretation of results, and
    adding human judgment and feedback (subjective
    criteria)

10
Benchmarks for accuracy
  • Reuters
  • 10,700 labeled documents
  • About 10% of documents have multiple class labels
  • OHSUMED
  • 348,566 abstracts from medical journals
  • 20NG
  • 18,800 labeled USENET postings
  • 20 leaf classes, 5 root level classes
  • WebKB
  • 8,300 documents in 7 academic categories
  • Industry
  • 10,000 home pages of companies from 105 industry
    sectors
  • Shallow hierarchies of sector names

11
Measures of accuracy
  • Assumptions
  • Each document is associated with exactly one
    class.
  • OR
  • Each document is associated with a subset of
    classes.
  • Confusion matrix (M)
  • For more than 2 classes
  • M[i,j]: number of test documents belonging to
    class i which were assigned to class j
  • Perfect classifier: only the diagonal elements
    M[i,i] would be nonzero

12
Evaluating classifier accuracy
  • Two-way ensemble
  • To avoid searching over the power-set of class
    labels in the subset scenario
  • Create a positive and a negative class for each
    label (e.g. "Sports" and "Not sports", i.e. all
    remaining documents)
  • Recall and precision
  • contingency matrix per (d,c) pair

13
Evaluating classifier accuracy (contd.)
  • micro averaged contingency matrix
  • micro averaged precision and recall
  • Equal importance for each document
  • Macro averaged precision and recall
  • Equal importance for each class

14
Evaluating classifier accuracy (contd.)
  • Precision-recall tradeoff
  • Plot of precision vs. recall: a better classifier
    has a curve that stays higher
  • Harmonic mean (F1): penalizes classifiers that
    sacrifice one measure for the other
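
The standard formulas behind these bullets (not reproduced on the
original slides) can be sketched as follows, where TP_c, FP_c, FN_c are
the true-positive, false-positive and false-negative counts for class c:

  P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}

  \text{micro: } P = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}, \qquad
  \text{macro: } P = \frac{1}{|C|}\sum_c P_c

  F_1 = \frac{2PR}{P + R}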

15
Nearest Neighbor classifiers
  • Intuition
  • similar documents are expected to be assigned the
    same class label.
  • Vector space model with cosine similarity
  • Training
  • Index each document and remember class label
  • Testing
  • Fetch the k most similar documents to the given
    document
  • Majority class wins
  • Alternative: weighted counts, i.e. counts of
    classes weighted by the corresponding similarity
    measure
  • Alternative: a per-class offset b_c, tuned by
    testing the classifier on a portion of the
    training data held out for this purpose
    (a minimal sketch follows below)
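
A minimal sketch of the scheme above, assuming length-normalized
document vectors; names such as train_vecs, train_labels and offsets
are illustrative, not from the slides:

# Hypothetical sketch of a similarity-weighted k-NN text classifier.
from collections import defaultdict
import numpy as np

def knn_classify(query_vec, train_vecs, train_labels, k=30, offsets=None):
    # Assume all vectors are L2-normalized, so a dot product is cosine similarity.
    sims = train_vecs @ query_vec
    top = np.argsort(-sims)[:k]
    scores = defaultdict(float)
    for i in top:
        scores[train_labels[i]] += sims[i]        # similarity-weighted vote
    if offsets:                                   # optional tuned per-class offsets b_c
        for c, b in offsets.items():
            scores[c] += b
    return max(scores, key=scores.get)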

16
Nearest neighbor classification
17
Pros
  • Easy availability and reuse of the inverted index
  • Collection updates trivial
  • Accuracy comparable to best known classifiers

18
Cons
  • Iceberg category questions
  • Classifying a test document dq involves as many
    inverted index lookups as there are distinct
    terms in dq,
  • scoring the (possibly large number of) candidate
    documents which overlap with dq in at least one
    word,
  • sorting by overall similarity,
  • picking the best k documents,
  • Space overhead and redundancy
  • Data stored at level of individual documents
  • No distillation

19
Workarounds
  • To reduce space requirements and speed up
    classification
  • Find clusters in the data
  • Store only a few statistical parameters per
    cluster.
  • Compare with documents in only the most promising
    clusters.
  • But again:
  • Ad-hoc choices for number and size of clusters
    and parameters
  • k is corpus sensitive

20
TF-IDF
  • TF-IDF is computed over the whole corpus
  • Interclass correlations and class-specific term
    frequencies are unaccounted for
  • Terms which occur relatively frequently in some
    classes compared to others should have higher
    importance
  • Overall rarity in the corpus is not as important.
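
For reference, one common TF-IDF variant (the slides do not fix a
specific formula); note that the IDF factor depends only on corpus-wide
rarity, which is exactly the limitation pointed out above:

  w(d, t) = n(d, t) \cdot \log\frac{N}{\mathrm{df}(t)}

where N is the number of documents and df(t) the number of documents
containing term t.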

21
Feature selection
  • Data sparsity
  • The term distribution could be estimated reliably
    if the training set were large enough
  • Not the case, however
  • Vocabulary size exceeds the number of documents
  • For Reuters, only about 10,300 documents are
    available
  • Over-fitting problem
  • The joint distribution may fit the training
    instances
  • But may not fit unseen test data that well

22
Marginals rather than joint
  • Marginal distribution of each term in each class
  • Empirical distributions may still not reflect
    actual distributions if data is sparse
  • Therefore feature selection
  • Purposes
  • Improve accuracy by avoiding overfitting
  • maintain accuracy while discarding as many
    features as possible to save a great deal of
    space for storing statistics
  • Heuristic, guided by linguistic and domain
    knowledge, or statistical.

23
Feature selection
  • Perfect feature selection
  • goal-directed
  • pick all possible subsets of features
  • for each subset, train and test a classifier
  • retain that subset which resulted in the highest
    accuracy.
  • COMPUTATIONALLY INFEASIBLE
  • Simple heuristics
  • Stop words like 'a', 'an', 'the', etc.
  • Empirically chosen thresholds (task and corpus
    sensitive) for ignoring too frequent or too
    rare terms
  • Discard too frequent and too rare terms
  • For larger and more complex data sets, frequent
    terms can be confused with stop words
  • Especially in topic hierarchies
  • Greedy inclusion (bottom up) vs. top-down

24
Greedy inclusion algorithm
  • Most commonly used in text
  • Algorithm
  • Compute, for each term, a measure of
    discrimination amongst classes.
  • Arrange the terms in decreasing order of this
    measure.
  • Retain a number of the best terms or features
    for use by the classifier.
  • Greedy because
  • measure of discrimination of a term is computed
    independently of other terms
  • Over-inclusion has only mild effects on accuracy

25
Measure of discrimination
  • Dependent on
  • model of documents
  • desired speed of training
  • ease of updates to documents and class
    assignments.
  • Observations
  • sets included for acceptable accuracy tend to
    have large overlap.

26
The χ² test
  • Similar to the likelihood ratio test
  • Build a 2 x 2 contingency matrix per class-term
    pair
  • Under the independence hypothesis
  • χ² aggregates the deviations of observed values
    from expected values
  • The larger the value of χ², the lower is our
    belief that the independence assumption is upheld
    by the observed data
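
A sketch of the standard χ² statistic for a 2 x 2 class-term
contingency table with observed counts O_ij, expected counts E_ij
(computed from the marginals under the independence hypothesis), and n
documents in total:

  \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
         = \frac{n\,(O_{11}O_{00} - O_{10}O_{01})^2}
                {(O_{11}+O_{10})(O_{01}+O_{00})(O_{11}+O_{01})(O_{10}+O_{00})}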

27
The χ² test (contd.)
  • Feature selection process
  • Sort terms in decreasing order of their χ²
    values
  • Train several classifiers with varying numbers of
    features
  • Stop at the point of maximum accuracy

28
Mutual information
  • Useful when the multinomial document model is
    used
  • X and Y are discrete random variables taking
    values x,y
  • Mutual information (MI) between them is defined
    as shown in the sketch below
  • Measure of extent of dependence between random
    variables,
  • Extent to which the joint deviates from the
    product of the marginals
  • Weighted with the distribution mass at (x, y)
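
The standard definition implied above (the slide's own formula was lost
in transcription):

  \mathrm{MI}(X;Y) = \sum_{x,y} \Pr(x,y)\,\log\frac{\Pr(x,y)}{\Pr(x)\,\Pr(y)}

i.e. the KL distance between the joint distribution and the product of
the marginals.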

29
Mutual Information
  • Advantages
  • To the extent MI(X,Y) is large, X and Y are
    dependent.
  • Deviations from independence at rare values of
    (x,y) are played down
  • Interpretations
  • Reduction in the entropy of Y given X (and vice
    versa):
    MI(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
  • KL distance between no-independence hypothesis
    and independence hypothesis
  • KL distance gives the average number of bits
    wasted by encoding events from the correct
    distribution using a code based on a
    not-quite-right distribution

30
Feature selection with MI
  • Fix a term t and let x_t be an event associated
    with that term
  • E.g. for the binary model, x_t is 0 or 1
  • Pr(x_t): the empirical fraction of documents in
    the training set in which the event x_t occurred
  • Pr(x_t, c): the empirical fraction of training
    documents which are in class c and in which the
    event occurred
  • Pr(c): fraction of training documents belonging
    to class c
  • MI is then computed between the term event and
    the class label (using the definition above)
  • Problem: document lengths are not normalized

31
Fisher's discrimination index
  • Useful when documents are scaled to constant
    length
  • Term occurrences are regarded as fractional real
    numbers.
  • E.g. Two class case
  • Let X and Y be the sets of length-normalized
    document vectors corresponding to the two classes
  • Let μ_X and μ_Y be the centroids of the two
    classes, and S_X and S_Y their covariance matrices

32
Fisher's discrimination index (contd.)
  • Goal find a projection of the data sets X and Y
    on to a line such that
  • the two projected centroids are far apart
    compared to the spread of the point sets
    projected on to the same line.
  • Find a column vector α such that
  • the ratio of
  • the square of the difference in mean vectors
    projected onto it, to
  • the average projected variance,
  • is maximized.
  • This gives the Fisher discriminant direction
    (sketched below)
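
A sketch of the standard two-class Fisher discriminant implied above;
μ_X, μ_Y are the class centroids and S_X, S_Y the class covariance
matrices (this notation is mine, since the slide's symbols were lost):

  J(\alpha) = \frac{\big(\alpha^{\top}(\mu_X - \mu_Y)\big)^2}
                   {\alpha^{\top}(S_X + S_Y)\,\alpha},
  \qquad
  \alpha^{*} \propto (S_X + S_Y)^{-1}(\mu_X - \mu_Y)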

33
Fisher's discrimination index
  • Formula
  • Assume X and Y, for both the training and test
    data, are generated from multivariate Gaussian
    distributions
  • Assume further that the two class covariance
    matrices are equal
  • Then this value of α induces the optimal (minimum
    error) classifier by suitable thresholding on
    α^T q for a test point q.
  • Problems
  • Inverting S would be unacceptably slow for tens
    of thousands of dimensions.
  • Linear transformations would destroy the already
    existing sparsity.

34
Solution
  • Recall
  • Goal was to eliminate terms from consideration.
  • Not to arrive at linear projections involving
    multiple terms
  • Regard each term t as providing a candidate
    direction α_t which is parallel to the
    corresponding axis in the vector space model.
  • Compute the Fisher index FI(t) of each term t
    separately

35
FI Solution (contd.)
  • Formula
  • For the two-class case (per-term form sketched
    below)
  • Can be generalized to a set C of more than two
    classes
  • Feature selection
  • Terms are sorted in decreasing order of FI(t)
  • Best ones chosen as features.
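
A sketch of one common per-term form of the Fisher index consistent
with the description above (the slide's exact formula was lost);
μ_{X,t}, μ_{Y,t} and σ²_{X,t}, σ²_{Y,t} denote the mean and variance of
the length-normalized frequency of term t in the two classes:

  FI(t) = \frac{(\mu_{X,t} - \mu_{Y,t})^2}{\sigma^2_{X,t} + \sigma^2_{Y,t}}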

36
Validation
  • How to decide a cut-off rank?
  • Validation approach
  • A portion of the training documents are held out
  • The rest is used to do term ranking
  • The held-out set used as a test set.
  • Various cut-off ranks can be tested using the
    same held-out set.
  • Leave-one-out cross-validation/partitioning data
    into two
  • An aggregate accuracy is computed over all
    trials.
  • A wrapper searches for the number of features
    (taken in decreasing order of discriminative
    power) that yields the highest accuracy

37
Validation (contd.)
  • Simple search heuristic
  • Keep adding one feature at every step until the
    classifier's accuracy ceases to improve.

A general illustration of wrapping for feature
selection.
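
A minimal sketch of the wrapper heuristic above; ranked_terms (terms
sorted by χ², MI or Fisher index), train, held_out and train_and_score
are illustrative placeholders, not names from the slides:

# Hypothetical sketch of wrapper-based selection of the feature cut-off rank.
def choose_cutoff(ranked_terms, train, held_out, train_and_score, step=100):
    best_k, best_acc = 0, 0.0
    k = step
    while k <= len(ranked_terms):
        features = set(ranked_terms[:k])            # greedy inclusion: top-k terms
        acc = train_and_score(train, held_out, features)
        if acc <= best_acc:                         # stop when accuracy ceases to improve
            break
        best_k, best_acc = k, acc
        k += step
    return best_k, best_acc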
38
Validation (contd.)
  • For naive Bayes-like classifiers
  • Evaluation on many choices of feature sets can be
    done at once.
  • For Maximum Entropy/Support vector machines
  • Essentially involves training a classifier from
    scratch for each choice of the cut-off rank.
  • Therefore inefficient

39
Validation observations
  • Bayesian classifiers cannot overfit much

Effect of feature selection on Bayesian
classifiers
40
Truncation algorithms
  • Start from the complete set of terms T
  • Keep selecting terms to drop
  • Till you end up with a feature subset
  • Question: when should you stop truncating?
  • Two objectives
  • minimize the size of selected feature set F.
  • Keep the distorted distribution Pr(C|F) as
    similar as possible to the original Pr(C|T)

41
Truncation Algorithms Example
  • Kullback-Leibler (KL) divergence
  • Measures similarity or distance between two
    distributions
  • Markov Blanket
  • Let X be a feature in T, and let M be a subset of
    T that does not contain X
  • The presence of M renders the presence of X
    unnecessary as a feature => M is a Markov blanket
    for X
  • Technically
  • M is called a Markov blanket for X
    if X is conditionally independent of the remaining
    features and the class label, given M
  • Eliminating a variable because it has a Markov
    blanket contained in other existing features does
    not increase the KL distance between Pr(C|T) and
    Pr(C|F).

42
Finding Markov Blankets
  • Absence of Markov Blanket in practice
  • Finding approximate Markov blankets
  • Purpose To cut down computational complexity
  • Restrict the search for Markov blankets M to
    those with at most k features
  • For a given feature X, restrict the members of M
    to those features which are most strongly
    correlated with X (using tests similar to the χ²
    or MI tests)
  • Example: for the Reuters dataset, over two-thirds
    of T could be discarded while increasing
    classification accuracy

43
Feature Truncation algorithm
  • while the truncated Pr(C|F) is reasonably close
    to the original Pr(C|T) do
  • for each remaining feature X do
  • Identify a candidate Markov
    blanket M
  • For some tuned constant k, find
    the set M of k variables in F \ X that are most
    strongly correlated with X
  • Estimate how good a blanket M is
    (how little dropping X, given M, distorts the
    class distribution)
  • end for
  • Eliminate the feature having the best
    surviving Markov blanket
  • end while
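
A minimal sketch of this truncation loop; correlation, blanket_quality
(e.g. an expected-KL estimate) and close_enough are illustrative
placeholders for the statistics described on the slide:

# Hypothetical sketch of Markov-blanket feature truncation.
def truncate_features(features, k, correlation, blanket_quality, close_enough):
    F = set(features)
    while F and close_enough(F):        # truncate while Pr(C|F) stays close to Pr(C|T)
        best_x, best_q = None, None
        for x in F:
            # Candidate blanket: the k features most strongly correlated with x.
            M = sorted(F - {x}, key=lambda y: correlation(x, y), reverse=True)[:k]
            q = blanket_quality(x, M)   # how well M renders x redundant
            if best_q is None or q > best_q:
                best_x, best_q = x, q
        F.remove(best_x)                # drop the feature with the best blanket
    return F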

44
General observations on feature selection
  • The issue of document length should be addressed
    properly.
  • Choice of association measures does not make a
    dramatic difference
  • Greedy inclusion algorithms scale nearly linearly
    with the number of features
  • Markov blanket technique takes time proportional
    to at least .
  • Advantage of the Markov blanket algorithm over
    greedy inclusion
  • The greedy algorithm may include features with
    high individual correlations even though one
    subsumes the other
  • Features individually uncorrelated could be
    jointly more correlated with the class
  • This rarely happens
  • The binary include/exclude view of feature
    selection may not be the only view to subscribe to
  • Suggestion combine features into fewer, simpler
    ones
  • E.g. project the document vectors to a lower
    dimensional space

45
Bayesian Learner
  • Very practical text classifier
  • Assumption
  • A document can belong to exactly one of a set of
    classes or topics.
  • Each class c has an associated prior probability
    Pr(c),
  • There is a class-conditional document
    distribution Pr(d|c) for each class
  • Posterior probability Pr(c|d)
  • Obtained using Bayes rule (sketched below)
  • The parameter set Θ consists of the parameters of
    all the distributions Pr(d|c)
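
The Bayes rule step referred to above, written out:

  \Pr(c \mid d) = \frac{\Pr(c)\,\Pr(d \mid c)}{\sum_{c'} \Pr(c')\,\Pr(d \mid c')}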

46
Parameter Estimation for Bayesian Learner
  • The estimate of Θ is based on two sources of
    information
  • Prior knowledge on the parameter set before
    seeing any training documents
  • Terms in the training documents D.
  • Bayes Optimal Classifier
  • Taking the expectation of each parameter over the
    posterior Pr(Θ | D)
  • Computationally infeasible
  • Maximum likelihood estimate
  • Replace the expectation above with the value of
    the summand Pr(c | d, Θ) at Θ = arg max Pr(D | Θ)
  • Works poorly

47
Naïve Bayes Classifier
  • Naïve
  • assumption of independence between terms,
  • joint term distribution is the product of the
    marginals.
  • Widely used owing to
  • simplicity and speed of training, applying, and
    updating
  • Two kinds of widely used marginals for text
  • Binary model
  • Multinomial model

48
Naïve Bayes Models
  • Binary model
  • Each parameter φ_{c,t} indicates the probability
    that a document in class c will mention term t at
    least once.
  • Multinomial model
  • Each class has an associated die with W faces,
    one per term in the vocabulary.
  • Each parameter θ_{c,t} denotes the probability of
    face t turning up on tossing the die.
  • Term t occurs n(d,t) times in document d,
  • Document length is a random variable denoted L
  • (Both class-conditional likelihoods are sketched
    below)
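
A sketch of the two class-conditional likelihoods, using the notation
φ_{c,t} and θ_{c,t} introduced above (the slide's own formulas were
lost in transcription); ℓ_d = Σ_t n(d,t) is the document length:

  \text{Binary: } \Pr(d \mid c) = \prod_{t \in d} \phi_{c,t}
                                  \prod_{t \notin d} (1 - \phi_{c,t})

  \text{Multinomial: } \Pr(d \mid c) = \Pr(L = \ell_d \mid c)\,
      \frac{\ell_d!}{\prod_t n(d,t)!}\, \prod_t \theta_{c,t}^{\,n(d,t)},
      \qquad \sum_t \theta_{c,t} = 1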

49
Analysis of Naïve Bayes Models
  • Multiply together a large number of small
    probabilities,
  • Result extremely tiny probabilities as answers.
  • Solution store all numbers as logarithms
  • Class which comes out at the top wins by a huge
    margin
  • Sanitize scores using the likelihood ratio
  • Also called the logit function (see the sketch
    below)
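
A minimal sketch of log-space scoring, to avoid underflow from
multiplying many tiny probabilities; log_prior and log_theta are
illustrative parameter tables holding log Pr(c) and log θ_{c,t}:

# Hypothetical sketch of log-space naive Bayes scoring.
def log_score(doc_counts, c, log_prior, log_theta):
    # doc_counts maps term -> n(d, t); parameters are assumed already smoothed.
    return log_prior[c] + sum(n * log_theta[c][t] for t, n in doc_counts.items())

def log_odds(doc_counts, log_prior, log_theta, pos, neg):
    # Two-class logit / log likelihood-ratio score instead of a raw probability.
    return (log_score(doc_counts, pos, log_prior, log_theta)
            - log_score(doc_counts, neg, log_prior, log_theta))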

50
Parameter smoothing
  • What if a test document contains a term t
    that never occurred in any training document in
    class c?
  • Answer: the estimated Pr(d|c) will be zero
  • Even if many other terms clearly hint at a high
    likelihood of class c generating the document.
  • Bayesian Estimation
  • Estimating a probability from insufficient data.
  • If you toss a coin n times and it always comes up
    heads, what is the probability that the (n+1)-th
    toss will also come up heads?
  • Posit a prior distribution on the unknown
    parameter
  • E.g. the uniform distribution
  • Work with the resultant posterior distribution

51
Laplace Smoothing
  • Based on Bayesian Estimation
  • Laplace's law of succession
  • Choose a loss function (penalty) for picking a
    smoothed value as against the 'true' value.
  • E.g. the loss function as the squared error
  • For this choice of loss, the best choice of the
    smoothed parameter is simply the expectation of
    the posterior distribution, having observed
    the data

52
Laplace Smoothing (contd.)
  • Heuristic alternatives
  • Lidstone's law of succession (add a fraction λ
    instead of 1)
  • Derivation for the multinomial model
  • There are W possible events, where W is the
    vocabulary size.
  • (Smoothed estimates sketched below)
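
Sketches of the smoothed estimates referred to above; n(c,t) denotes
the total count of term t in training documents of class c (notation
mine, since the slide's formulas were lost):

  \text{Laplace (coin, k heads in n tosses): } \hat{p} = \frac{k + 1}{n + 2}

  \text{Laplace (multinomial): }
  \hat{\theta}_{c,t} = \frac{1 + n(c,t)}{W + \sum_{t'} n(c,t')}

  \text{Lidstone: }
  \hat{\theta}_{c,t} = \frac{\lambda + n(c,t)}{\lambda W + \sum_{t'} n(c,t')}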

53
Performance analysis
  • Multinomial naive Bayes classifier generally
    outperforms the binary variant
  • K-NN may outperform naïve Bayes
  • Naïve Bayes is faster and more compact
  • decision boundaries
  • regions of potential confusion

54
NB Decision boundaries
  • The Bayesian classifier partitions the
    multidimensional term space into regions
  • Within each region, the probability of one class
    is higher than others
  • On the boundaries, the probability of two or more
    classes are exactly equal
  • NB is a linear classifier
  • It makes a decision between c = 1 and c = -1
  • by thresholding the value of α·d + b for a
    suitable weight vector α and a bias b derived
    from the class priors

55
Pitfalls
  • Strong bias
  • fixes the policy that α_t (the t-th component of
    the linear discriminant) depends only on the
    statistics of term t in the corpus
  • Therefore it cannot pick from the entire set of
    possible linear discriminants

56
Bayesian Networks
  • Attempt to capture statistical dependencies
    between terms themselves
  • Approximations to the joint distribution over
    terms
  • Probability of a term occurring depends on
    observation about other terms as well as the
    class variable.
  • A directed acyclic graph
  • All random variables (classes and terms) are
    nodes
  • Dependency edges are drawn from c to t for each
    t (parent-child edges)
  • To represent additional dependencies between
    terms, dependency edges (parent to child) are
    drawn between terms

57
Bayesian networks. For the naive Bayes
assumption, the only edges are from the
class variable to individual terms. Towards
better approximations to the joint distribution
over terms the probability of a term occurring
may now depend on observation about other terms
as well as the class variable.
58
Bayesian Belief Network (BBN)
  • DAG
  • Parents Pa(X)
  • nodes that are connected by directed edges to a
    node X
  • Fixing the values of the parent variables
    completely determines the conditional
    distribution of X
  • Conditional Probability tables
  • For discrete variables, the distribution data for
    X can be stored in the obvious way as a table
    with each row showing a set of values of the
    parents, the value of X, and a conditional
    probability.
  • Unlike Naïve Bayes
  • Pr(d|c) is not a simple product over all terms;
    each term's factor is conditioned on its parents

59
BBN difficulty
  • Getting a good network structure.
  • At least quadratic time
  • Enumeration of all pairs of features
  • Exploited only for binary model
  • Multinomial model
  • Prohibitive CPT sizes

60
Exploiting hierarchy among topics
  • Ordering between the class labels
  • For Data warehousing
  • E.g. high, medium, or low cancer risk patients.
  • Text Class labels
  • Taxonomy
  • large and complex class hierarchy that relates
    the class labels
  • Tree structure
  • Simplest form of taxonomy
  • widely used in directory browsing,
  • often the output of clustering algorithms.
  • inheritance
  • If class c0 is the parent of class c1, any
    training document which belongs to c1 also
    belongs to c0.

61
Topic Hierarchies Feature selection
  • Discriminating ability of a term sensitive to the
    node (or class) in the hierarchy
  • Measure of discrimination of a term
  • Can be evaluated with respect to only internal
    nodes of the hierarchy.
  • 'can' may be a noisy word at the root node of
    Yahoo!
  • But it can help classify documents under the
    subtree of /Science/Environment/Recycling

62
Topic Hierarchies Enhanced parameter estimation
  • Uniform priors not good
  • Idea
  • If a parameter estimate is shaky at a node with
    few training documents, perhaps we can impose a
    strong prior from a well-trained parent to repair
    the estimates.
  • Shrinkage
  • Seeks to improve estimates of descendants using
    data from ancestors,

63
Shrinkage
  • Assume multinomial model
  • introducing a dummy class c0 as the parent of the
    root c1, where all terms are equally likely.
  • For a specific path c0, c1, ..., cn
  • the 'shrunk' estimate is determined by a convex
    linear interpolation of the MLE parameters at the
    ancestor nodes up through c0 (sketched below)
  • Estimation of mixing weights
  • Simple form of EM algorithm
  • Determined empirically, by iteratively maximizing
    the probability of a held-out portion Hn of the
    training set for node cn.
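
A sketch of the shrunk estimate described above, with mixing weights
λ_i and \hat{θ}_{c_i,t} the MLE at ancestor c_i (c_0 being the uniform
dummy class); the λ_i are what the held-out EM procedure estimates:

  \tilde{\theta}_{c_n,t} = \sum_{i=0}^{n} \lambda_i\, \hat{\theta}_{c_i,t},
  \qquad \lambda_i \ge 0,\ \ \sum_{i=0}^{n} \lambda_i = 1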

64
Shrinkage Observation
  • Improves accuracy beyond hierarchical naïve
    Bayes,
  • Improvement is high when data is sparse
  • Capable of utilizing many more features than
    Naïve Bayes

65
Topic search in Hierarchy
  • By definition
  • All documents are relevant to the root topic
  • Pr(root|d) = 1
  • Given a test document d
  • Find one or more of the most likely leaf nodes in
    the hierarchy.
  • A document cannot belong to more than one path

66
Topic search in Hierarchy Greedy Search strategy
  • Search starts at the root
  • Decisions are made greedily
  • At each internal node pick the highest
    probability class
  • Continue
  • Drawback
  • Early errors cause compounding effect

67
Topic search in Hierarchy Best-first search
strategy
  • For finding m most probable leaf classes
  • Find the weighted shortest path from the root to
    a leaf
  • Edge (c0, ci) is assigned a (non-negative) edge
    weight of Pr(ci | c0, d)
  • To make Best first search different from greedy
    search
  • Rescale/smoothen the probabilities

68
Using best-first search on a hierarchy can
improve both accuracy and speed. Because the
hierarchy has four internal nodes, the second
column shows the number of features for each.
These were tuned so that the total number of
features for both flat and best-first are roughly
the same (so that the model complexity is
comparable). Because each document belonged to
exactly one leaf node, recall equals precision in
this case and is called 'accuracy'.
69
The semantics of hierarchical classification
  • Asymmetry
  • training document can be associated with any
    node,
  • test document must be routed to a leaf,
  • Routing test documents to internal nodes
  • none of the children matches the document
  • many children match the document
  • the chances of making a mistake while pushing
    down the test document one more level may be too
    high.
  • Research issue

70
Maximum entropy learners Motivation
  • Bayesian learner
  • first model Pr(d|c) at training time
  • Apply Bayes rule at test time
  • Two problems with Bayesian learners
  • d is represented in a high-dimensional term space
  • So Pr(d|c) cannot be estimated accurately from a
    training set of limited size
  • No systematic way of adding synthetic features
  • Such an addition may result in
  • highly correlated features
  • high subsumption

71
Maximum entropy learners
  • Assume that each document has only one class
    label
  • Indicator functions fj(c,d)
  • Flag the jth condition relating class c to
    document d
  • The model is constrained so that the expectation
    of each indicator fj matches its empirical
    expectation
  • Approximate Pr(d,c) and Pr(d) with their
    empirical estimates when computing these
    expectations
  • (Constraints and solution form sketched after the
    next slide)

72
Principle of Maximum Entropy
  • Constraints don't determine Pr(c|d) uniquely
  • Principle of Maximum Entropy
  • prefer the simplest model to explain observed
    data.
  • Choose the Pr(c|d) that maximizes the entropy of
    Pr(c|d)
  • In the event of an empty training set we should
    consider all classes to be equally likely
  • Constrained Optimization
  • Maximize the entropy of the model distribution
    Pr(c|d)
  • While obeying the constraints for all j
  • Optimize by the method of Lagrange multipliers
    (solution form sketched below)
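
A sketch of the constrained problem and its well-known solution form;
λ_j are the Lagrange multipliers and Z(d) a per-document normalizer
(the slide's own formulas were lost in transcription):

  \text{Constraints: } \sum_{d} \tilde{\Pr}(d) \sum_{c} \Pr(c \mid d)\, f_j(c,d)
      = \sum_{(d,c)} \tilde{\Pr}(d,c)\, f_j(c,d) \quad \forall j

  \text{Solution: } \Pr(c \mid d) = \frac{1}{Z(d)}
      \exp\Big(\sum_j \lambda_j f_j(c,d)\Big),
  \qquad Z(d) = \sum_{c'} \exp\Big(\sum_j \lambda_j f_j(c',d)\Big)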

73
Maximum Entropy solution
  • Fitting the distribution to the data involves two
    steps
  • Identify a set of indicator functions derived
    from the data.
  • Iteratively arrive at values for the parameters
    that satisfy the constraints while maximizing the
    entropy of the distribution being modeled.
  • An equivalent optimization problem

74
Text Classification using Maximum Entropy Model
  • Example
  • Pick an indicator for each (class, term)
    combination.
  • For the binary document model,
  • For the multinomial document model
  • What we gain with Maximum Entropy over naïve
    Bayes
  • does not suffer from the independence assumptions
  • E.g.
  • if the terms t1 = "machine" and t2 = "learning"
    are often found together in class c,
  • the corresponding parameters would be suitably
    discounted rather than double-counted

75
Performance of Maximum Entropy Classifier
  • Outperforms naive Bayes in accuracy, but not
    consistently.
  • Table of figures

76
Discriminative classification
  • Naïve Bayes and Maximum Entropy Classifiers
  • induce linear decision boundaries between
    classes in the feature space.
  • Discriminative classifiers
  • Directly map the feature space to class labels
  • Class labels are encoded as numbers
  • e.g. +1 and -1 for a two-class problem
  • Two examples
  • Linear least-square regression
  • Support Vector Machines

77
Linear least-square regression
  • No inherent reason for going through the modeling
    step as in Bayesian or maximum entropy classifier
    to get a linear discriminant.
  • Linear regression problem
  • Look for some arbitrary weight vector α such that
    α·di directly predicts the label ci of
    document di.
  • Minimize the squared error between the observed
    and predicted class variable
  • Widrow-Hoff (WH) update rule (sketched below)
  • Scale the learned α to norm 1
  • Two equivalent interpretations
  • Classifier is a hyperplane
  • Documents are projected on to a direction
  • Performance
  • Comparable to Naïve Bayes and Max Ent
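
One common form of the Widrow-Hoff (least-mean-squares) update
mentioned above, with learning rate η (the slide's exact formula was
lost in transcription):

  \alpha^{(i+1)} = \alpha^{(i)} + \eta\,\big(c_i - \alpha^{(i)} \cdot d_i\big)\, d_i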

78
Support vector machines
  • Assumption training and test population are
    drawn from the same distribution
  • Hypothesis
  • Hyperplane that is close to many training data
    points has a greater chance of misclassifying
    test instances
  • A hyperplane which passes through a no-man's
    land, has lower chances of misclassifications
  • Make a decision by thresholding α·d + b
  • Seek an (α, b) which maximizes the distance of
    any training point from the hyperplane

79
Support vector machines
  • Optimal separator
  • Orthogonal to the shortest line connecting the
    convex hull of the two classes
  • Intersects this shortest line halfway
  • Margin
  • Distance of any training point from the optimized
    hyperplane
  • It is at least 1/||α|| under the canonical
    scaling (see the formulation sketched after
    slide 81)

80
Illustration of the SVM optimization problem.
81
SVMs non separable classes
  • Classes in the training data not always
    separable.
  • Introduce slack ("fudge") variables
  • Solve the equivalent dual problem (primal
    formulation sketched below)
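
A sketch of the standard primal formulations behind these slides, in
the document's α notation; C is a tuning constant and ξ_i are the slack
("fudge") variables:

  \text{Separable: } \min_{\alpha, b}\ \tfrac{1}{2}\|\alpha\|^2
      \ \text{ s.t. } c_i(\alpha \cdot d_i + b) \ge 1 \ \ \forall i

  \text{Non-separable: } \min_{\alpha, b, \xi}\ \tfrac{1}{2}\|\alpha\|^2
      + C\sum_i \xi_i
      \ \text{ s.t. } c_i(\alpha \cdot d_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0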

82
SVMs Complexity
  • Quadratic optimization problem.
  • Working set: refine a few Lagrange multipliers at
    a time, holding the others fixed.
  • On-demand computation of inner-products
  • n documents
  • Recent SVM packages
  • Linear time by clever selection of working sets.

83
Performance
  • Comparison with other classifiers
  • Amongst the most accurate classifiers for text
  • Better accuracy than naive Bayes and decision
    tree classifier,
  • interesting revelation
  • Linear SVMs suffice
  • standard text classification tasks have classes
    almost separable using a hyperplane in feature
    space
  • Research issues
  • Non-linear SVMs

84
SVM training time variation as the training set
size is increased, with and without sufficient
memory to hold the training set. In the latter
case, the memory is set to about a quarter of
that needed by the training set.
85
Comparison of LSVM with previous classifiers on
the Reuters data set (data taken from Dumais).
(The naive Bayes classifier used binary features,
so its accuracy can be improved)
86
Comparison of accuracy across three classifiers
(Naive Bayes, Maximum Entropy and Linear SVM),
using three data sets: 20 newsgroups, the
Recreation sub-tree of the Open Directory, and
University Web pages from WebKB.
87
Comparison between several classifiers using the
Reuters collection.
88
Hypertext classification
  • Techniques to address hypertextual features.
  • Document Object Model or DOM
  • A well-formed HTML document is a properly nested
    hierarchy of regions represented by a
    tree-structured DOM tree
  • Internal nodes are elements
  • Some of the leaf nodes are segments of text
  • Other nodes are hyperlinks to other Web pages,
    which in turn have their own DOM trees

89
Representing hypertext for supervised learning
  • Paying special attention to tags can help with
    learning
  • keyword-based search
  • assign heuristic weights to terms that occur in
    specific HTML tags
  • Example.. (next slide)

90
Prefixing with tags
  • Distinguishing between the two occurrences of the
    word "surfing":
  • Prefix each term by the sequence of tags that
    we need to follow from the DOM root to get to the
    term (see the sketch below)
  • A repeated term in different sections should
    reinforce belief in a class label
  • Using a maximum entropy classifier
  • Accumulate evidence from different features
  • Maintain both forms of a term
  • Plain text and prefixed text (all path prefixes)
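
A minimal sketch of path prefixing, assuming an HTML fragment parsed
with Python's standard html.parser; tag and attribute handling is kept
deliberately minimal and the example page is invented:

# Hypothetical sketch of prefixing terms with their DOM path.
from html.parser import HTMLParser

class PathPrefixer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.features = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
    def handle_data(self, data):
        prefix = ".".join(self.stack)
        for term in data.lower().split():
            self.features.append(term)                   # plain form
            self.features.append(prefix + "." + term)    # path-prefixed form

p = PathPrefixer()
p.feed("<html><body><h1>surfing</h1><p>surfing the Web</p></body></html>")
print(p.features)  # e.g. 'html.body.h1.surfing' vs 'html.body.p.surfing'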

91
Experiments
  • 10,705 patents from the US Patent Office
  • 70% error with a plain-text classifier
  • 24% error with path-tagged terms
  • 17% error with path prefixes
  • 1,700 resumes (with a naive Bayes classifier)
  • 53% error with flattened HTML
  • 40% error with prefix-tagged terms

92
Limitations
  • Prefix representations
  • ad-hoc
  • inflexible.
  • Generalizability
  • How to incorporate additional features ?
  • E.g. adding features derived from hyperlinks.
  • Relations
  • uniform way to codify hypertextual features.
  • Example

93
Rule Induction for relational learning
  • Inductive classifiers
  • discover rules from a collection of relations.
  • Example solution for above
  • Goal Discover a set of predicate rules
  • Consider 2 class setting
  • Positive examples D+ and negative examples D-
  • Test instance
  • If the learned rules evaluate to true, it is a
    positive instance; else a negative instance.

94
Rule induction with First Order Inductive Logic
(FOIL)
  • Well-known rule learner
  • Start with empty rule set
  • learn new (disjunctive) rule
  • add conjunctive literals to the new rule until no
    negative example is covered by the new rule.
  • pick a literal which increases the ratio of
    surviving positive to negative bindings rapidly.
  • Remove positive examples covered by any rule
    generated thus far.
  • Till no positive instances are left (a sketch of
    this covering loop follows below)
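
A hypothetical high-level sketch of FOIL's covering loop;
candidate_literals, gain and covers stand in for FOIL's actual literal
generation, information-gain heuristic and binding/coverage tests:

# Hypothetical sketch of FOIL's outer covering loop.
def foil(positives, negatives, candidate_literals, gain, covers):
    rules, pos = [], set(positives)
    while pos:                                    # until no positive instances remain
        rule, neg = [], set(negatives)
        while neg:                                # specialize until no negatives covered
            lit = max(candidate_literals(rule),
                      key=lambda l: gain(l, rule, pos, neg))
            rule.append(lit)                      # add the best conjunctive literal
            neg = {e for e in neg if covers(rule, e)}
        rules.append(rule)                        # learned one disjunctive rule
        pos = {e for e in pos if not covers(rule, e)}   # drop covered positives
    return rules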

95
Literals Explored
  • Q(X1, ..., Xk), where Q is a relation and the Xi
    are variables, at least one of which must be
    already bound.
  • not(L), where L is a literal of the above forms.

96
Analysis
  • Can learn class labels for individual pages
  • Can learn relationships between labels
  • member(homePage, department)
  • teaches(homePage, coursePage)
  • advises(homePage, homePage)
  • writes(homePage, paper)
  • Hybrid approaches
  • Statistical classifier
  • more complex search for literals
  • Inductive learning
  • comparing the estimated probabilities of various
    classes.
  • Recursively labeling relations
  • E.g. relating page label in terms of labels of
    neighboring pages
  • classified(A, facultyPage) :-
  • links-to(A, B), classified(B, studentPage),
  • links-to(A, C), classified(C, coursePage),
  • links-to(A, D), classified(D, publicationsPage).