Supervised learning for text

1
Supervised learning for text
2
Organizing knowledge
  • Systematic knowledge structures
  • Ontologies
  • Dewey decimal system, the Library of Congress
    catalog, the AMS Mathematics Subject
    Classification, and the US Patent subject
    classification
  • Web catalogs
  • Yahoo!, Dmoz
  • Problem: manual maintenance

3
Topic Tagging
  • Finding similar documents
  • Guiding queries
  • Naïve Approach
  • Syntactic similarity between documents
  • Better approach
  • Topic tagging

4
Topic Tagging
  • Advantages
  • Increase vocabulary of classes
  • Hierarchical visualization and browsing aids
  • Applications
  • Email/Bookmark organization
  • News Tracking
  • Tracking authors of anonymous texts
  • E.g. using the Flesch-Kincaid index
  • Classifying the purpose of hyperlinks

5
Supervised learning
  • Learning to assign objects to classes given
    examples
  • Learner (classifier)

A typical supervised text learning scenario.
6
Difference with texts
  • ML classification techniques were developed for
    structured data
  • Text: many features and a lot of noise
  • No fixed number of columns
  • No categorical attribute values
  • Data scarcity
  • Larger number of class labels
  • Hierarchical relationships between classes are
    less systematic than in structured data

7
Techniques
  • Nearest Neighbor Classifier
  • Lazy learner: remembers all training instances
  • Decision on a test document: based on the
    distribution of labels over the training documents
    most similar to it
  • Assigns large weights to rare terms
  • Feature selection
  • Removes terms in the training documents which are
    statistically uncorrelated with the class labels
  • Bayesian classifier
  • Fit a generative term distribution Pr(d|c) to
    each class c of documents d
  • Testing: the distribution most likely to have
    generated a test document is used to label it

8
Other Classifiers
  • Maximum entropy classifier
  • Estimate a direct distribution Pr(c|d) from term
    space to the probability of various classes.
  • Support vector machines
  • Represent classes by numbers
  • Construct a direct function from term space to
    the class variable.
  • Rule induction
  • Induce rules for classification over diverse
    features
  • E.g. information from ordinary terms, the
    structure of the HTML tag tree in which terms are
    embedded, link neighbors, citations

9
Other Issues
  • Tokenization
  • E.g. replacing monetary amounts by a special
    token
  • Evaluating text classifiers
  • Accuracy
  • Training speed and scalability
  • Simplicity, speed, and scalability for document
    modifications
  • Ease of diagnosis, interpretation of results, and
    adding human judgment and feedback (subjective
    criteria)

10
Benchmarks for accuracy
  • Reuters
  • 10,700 labeled documents
  • About 10% of documents have multiple class labels
  • OHSUMED
  • 348,566 abstracts from medical journals
  • 20NG
  • 18,800 labeled USENET postings
  • 20 leaf classes, 5 root level classes
  • WebKB
  • 8,300 documents in 7 academic categories
  • Industry
  • 10,000 home pages of companies from 105 industry
    sectors
  • Shallow hierarchies of sector names

11
Measures of accuracy
  • Assumptions
  • Each document is associated with exactly one
    class.
  • OR
  • Each document is associated with a subset of
    classes.
  • Confusion matrix (M)
  • For more than 2 classes
  • M[i,j]: number of test documents belonging to
    class i which were assigned to class j
  • Perfect classifier: only the diagonal elements
    M[i,i] would be nonzero

12
Evaluating classifier accuracy
  • Two-way ensemble
  • To avoid searching over the power-set of class
    labels in the subset scenario
  • Create a positive and a negative class for each
    label (e.g. "Sports" and "Not sports", i.e. all
    remaining documents)
  • Recall and precision
  • contingency matrix per (d,c) pair

13
Evaluating classifier accuracy (contd.)
  • micro averaged contingency matrix
  • micro averaged precision and recall
  • Equal importance for each document
  • Macro averaged precision and recall
  • Equal importance for each class

14
Evaluating classifier accuracy (contd.)
  • Precision-recall tradeoff
  • Plot of precision vs. recall: a better classifier
    has a curve that stays higher
  • Harmonic mean (F1): penalizes classifiers that
    sacrifice one measure for the other
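
The standard formulas behind these bullets (not reproduced on the
original slides) can be sketched as follows, where TP_c, FP_c, FN_c are
the true-positive, false-positive and false-negative counts for class c:

  P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}

  \text{micro: } P = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}, \qquad
  \text{macro: } P = \frac{1}{|C|}\sum_c P_c

  F_1 = \frac{2PR}{P + R}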

15
Nearest Neighbor classifiers
  • Intuition
  • similar documents are expected to be assigned the
    same class label.
  • Vector space model with cosine similarity
  • Training
  • Index each document and remember class label
  • Testing
  • Fetch the k most similar documents to the given
    document
  • Majority class wins
  • Alternative: weighted counts, i.e. counts of
    classes weighted by the corresponding similarity
    measure
  • Alternative: a per-class offset b_c, tuned by
    testing the classifier on a portion of the
    training data held out for this purpose
    (a minimal sketch follows below)
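
A minimal sketch of the scheme above, assuming length-normalized
document vectors; names such as train_vecs, train_labels and offsets
are illustrative, not from the slides:

# Hypothetical sketch of a similarity-weighted k-NN text classifier.
from collections import defaultdict
import numpy as np

def knn_classify(query_vec, train_vecs, train_labels, k=30, offsets=None):
    # Assume all vectors are L2-normalized, so a dot product is cosine similarity.
    sims = train_vecs @ query_vec
    top = np.argsort(-sims)[:k]
    scores = defaultdict(float)
    for i in top:
        scores[train_labels[i]] += sims[i]        # similarity-weighted vote
    if offsets:                                   # optional tuned per-class offsets b_c
        for c, b in offsets.items():
            scores[c] += b
    return max(scores, key=scores.get)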

16
Nearest neighbor classification
17
Pros
  • Easy availability and reuse of the inverted index
  • Collection updates trivial
  • Accuracy comparable to best known classifiers

18
Cons
  • Iceberg category questions
  • Classifying a test document dq involves as many
    inverted index lookups as there are distinct
    terms in dq,
  • scoring the (possibly large number of) candidate
    documents which overlap with dq in at least one
    word,
  • sorting by overall similarity,
  • picking the best k documents,
  • Space overhead and redundancy
  • Data stored at level of individual documents
  • No distillation

19
Workarounds
  • To reduce space requirements and speed up
    classification
  • Find clusters in the data
  • Store only a few statistical parameters per
    cluster.
  • Compare with documents in only the most promising
    clusters.
  • But again:
  • Ad-hoc choices for number and size of clusters
    and parameters
  • k is corpus sensitive

20
TF-IDF
  • TF-IDF is computed over the whole corpus
  • Interclass correlations and class-specific term
    frequencies are unaccounted for
  • Terms which occur relatively frequently in some
    classes compared to others should have higher
    importance
  • Overall rarity in the corpus is not as important.
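
For reference, one common TF-IDF variant (the slides do not fix a
specific formula); note that the IDF factor depends only on corpus-wide
rarity, which is exactly the limitation pointed out above:

  w(d, t) = n(d, t) \cdot \log\frac{N}{\mathrm{df}(t)}

where N is the number of documents and df(t) the number of documents
containing term t.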

21
Feature selection
  • Data sparsity
  • The term distribution could be estimated reliably
    if the training set were large enough
  • Not the case, however
  • Vocabulary size exceeds the number of documents
  • For Reuters, only about 10,300 documents are
    available
  • Over-fitting problem
  • The joint distribution may fit the training
    instances
  • But may not fit unseen test data that well

22
Marginals rather than joint
  • Marginal distribution of each term in each class
  • Empirical distributions may still not reflect
    actual distributions if data is sparse
  • Therefore feature selection
  • Purposes
  • Improve accuracy by avoiding overfitting
  • maintain accuracy while discarding as many
    features as possible to save a great deal of
    space for storing statistics
  • Heuristic, guided by linguistic and domain
    knowledge, or statistical.

23
Feature selection
  • Perfect feature selection
  • goal-directed
  • pick all possible subsets of features
  • for each subset, train and test a classifier
  • retain that subset which resulted in the highest
    accuracy.
  • COMPUTATIONALLY INFEASIBLE
  • Simple heuristics
  • Stop words like 'a', 'an', 'the', etc.
  • Empirically chosen thresholds (task and corpus
    sensitive) for ignoring too frequent or too
    rare terms
  • Discard too frequent and too rare terms
  • For larger and more complex data sets, frequent
    terms can be confused with stop words
  • Especially in topic hierarchies
  • Greedy inclusion (bottom up) vs. top-down

24
Greedy inclusion algorithm
  • Most commonly used in text
  • Algorithm
  • Compute, for each term, a measure of
    discrimination amongst classes.
  • Arrange the terms in decreasing order of this
    measure.
  • Retain a number of the best terms or features
    for use by the classifier.
  • Greedy because
  • measure of discrimination of a term is computed
    independently of other terms
  • Over-inclusion has only mild effects on accuracy

25
Measure of discrimination
  • Dependent on
  • model of documents
  • desired speed of training
  • ease of updates to documents and class
    assignments.
  • Observations
  • sets included for acceptable accuracy tend to
    have large overlap.

26
The χ² test
  • Similar to the likelihood ratio test
  • Build a 2 x 2 contingency matrix per class-term
    pair
  • Under the independence hypothesis
  • χ² aggregates the deviations of observed values
    from expected values
  • The larger the value of χ², the lower is our
    belief that the independence assumption is upheld
    by the observed data
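
A sketch of the standard χ² statistic for a 2 x 2 class-term
contingency table with observed counts O_ij, expected counts E_ij
(computed from the marginals under the independence hypothesis), and n
documents in total:

  \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
         = \frac{n\,(O_{11}O_{00} - O_{10}O_{01})^2}
                {(O_{11}+O_{10})(O_{01}+O_{00})(O_{11}+O_{01})(O_{10}+O_{00})}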

27
The χ² test (contd.)
  • Feature selection process
  • Sort terms in decreasing order of their χ²
    values
  • Train several classifiers with varying numbers of
    features
  • Stop at the point of maximum accuracy

28
Mutual information
  • Useful when the multinomial document model is
    used
  • X and Y are discrete random variables taking
    values x,y
  • Mutual information (MI) between them is defined
    as shown in the sketch below
  • Measure of extent of dependence between random
    variables,
  • Extent to which the joint deviates from the
    product of the marginals
  • Weighted with the distribution mass at (x, y)
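
The standard definition implied above (the slide's own formula was lost
in transcription):

  \mathrm{MI}(X;Y) = \sum_{x,y} \Pr(x,y)\,\log\frac{\Pr(x,y)}{\Pr(x)\,\Pr(y)}

i.e. the KL distance between the joint distribution and the product of
the marginals.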

29
Mutual Information
  • Advantages
  • To the extent MI(X,Y) is large, X and Y are
    dependent.
  • Deviations from independence at rare values of
    (x,y) are played down
  • Interpretations
  • Reduction in the entropy of Y given X (and vice
    versa):
    MI(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
  • KL distance between no-independence hypothesis
    and independence hypothesis
  • KL distance gives the average number of bits
    wasted by encoding events from the correct
    distribution using a code based on a
    not-quite-right distribution

30
Feature selection with MI
  • Fix a term t and let x_t be an event associated
    with that term
  • E.g. for the binary model, x_t is 0 or 1
  • Pr(x_t): the empirical fraction of documents in
    the training set in which the event x_t occurred
  • Pr(x_t, c): the empirical fraction of training
    documents which are in class c and in which the
    event occurred
  • Pr(c): fraction of training documents belonging
    to class c
  • MI is then computed between the term event and
    the class label (using the definition above)
  • Problem: document lengths are not normalized

31
Fisher's discrimination index
  • Useful when documents are scaled to constant
    length
  • Term occurrences are regarded as fractional real
    numbers.
  • E.g. Two class case
  • Let X and Y be the sets of length-normalized
    document vectors corresponding to the two classes
  • Let μ_X and μ_Y be the centroids of the two
    classes, and S_X and S_Y their covariance matrices

32
Fisher's discrimination index (contd.)
  • Goal find a projection of the data sets X and Y
    on to a line such that
  • the two projected centroids are far apart
    compared to the spread of the point sets
    projected on to the same line.
  • Find a column vector α such that
  • the ratio of
  • the square of the difference in mean vectors
    projected onto it, to
  • the average projected variance,
  • is maximized.
  • This gives the Fisher discriminant direction
    (sketched below)
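
A sketch of the standard two-class Fisher discriminant implied above;
μ_X, μ_Y are the class centroids and S_X, S_Y the class covariance
matrices (this notation is mine, since the slide's symbols were lost):

  J(\alpha) = \frac{\big(\alpha^{\top}(\mu_X - \mu_Y)\big)^2}
                   {\alpha^{\top}(S_X + S_Y)\,\alpha},
  \qquad
  \alpha^{*} \propto (S_X + S_Y)^{-1}(\mu_X - \mu_Y)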

33
Fisher's discrimination index
  • Formula
  • Assume X and Y, for both the training and test
    data, are generated from multivariate Gaussian
    distributions
  • Assume further that the two class covariance
    matrices are equal
  • Then this value of α induces the optimal (minimum
    error) classifier by suitable thresholding on
    α^T q for a test point q.
  • Problems
  • Inverting S would be unacceptably slow for tens
    of thousands of dimensions.
  • Linear transformations would destroy the already
    existing sparsity.

34
Solution
  • Recall
  • Goal was to eliminate terms from consideration.
  • Not to arrive at linear projections involving
    multiple terms
  • Regard each term t as providing a candidate
    direction α_t which is parallel to the
    corresponding axis in the vector space model.
  • Compute the Fisher index FI(t) of each term t
    separately

35
FI Solution (contd.)
  • Formula
  • For the two-class case (per-term form sketched
    below)
  • Can be generalized to a set C of more than two
    classes
  • Feature selection
  • Terms are sorted in decreasing order of FI(t)
  • Best ones chosen as features.
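
A sketch of one common per-term form of the Fisher index consistent
with the description above (the slide's exact formula was lost);
μ_{X,t}, μ_{Y,t} and σ²_{X,t}, σ²_{Y,t} denote the mean and variance of
the length-normalized frequency of term t in the two classes:

  FI(t) = \frac{(\mu_{X,t} - \mu_{Y,t})^2}{\sigma^2_{X,t} + \sigma^2_{Y,t}}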

36
Validation
  • How to decide a cut-off rank?
  • Validation approach
  • A portion of the training documents are held out
  • The rest is used to do term ranking
  • The held-out set used as a test set.
  • Various cut-off ranks can be tested using the
    same held-out set.
  • Leave-one-out cross-validation/partitioning data
    into two
  • An aggregate accuracy is computed over all
    trials.
  • A wrapper searches for the number of features
    (taken in decreasing order of discriminative
    power) that yields the highest accuracy

37
Validation (contd.)
  • Simple search heuristic
  • Keep adding one feature at every step until the
    classifier's accuracy ceases to improve.

A general illustration of wrapping for feature
selection.
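
A minimal sketch of the wrapper heuristic above; ranked_terms (terms
sorted by χ², MI or Fisher index), train, held_out and train_and_score
are illustrative placeholders, not names from the slides:

# Hypothetical sketch of wrapper-based selection of the feature cut-off rank.
def choose_cutoff(ranked_terms, train, held_out, train_and_score, step=100):
    best_k, best_acc = 0, 0.0
    k = step
    while k <= len(ranked_terms):
        features = set(ranked_terms[:k])            # greedy inclusion: top-k terms
        acc = train_and_score(train, held_out, features)
        if acc <= best_acc:                         # stop when accuracy ceases to improve
            break
        best_k, best_acc = k, acc
        k += step
    return best_k, best_acc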
38
Validation (contd.)
  • For naive Bayes-like classifiers
  • Evaluation on many choices of feature sets can be
    done at once.
  • For Maximum Entropy/Support vector machines
  • Essentially involves training a classifier from
    scratch for each choice of the cut-off rank.
  • Therefore inefficient

39
Validation observations
  • Bayesian classifiers cannot overfit much

Effect of feature selection on Bayesian
classifiers
40
Truncation algorithms
  • Start from the complete set of terms T
  • Keep selecting terms to drop
  • Till you end up with a feature subset
  • Question: when should you stop truncating?
  • Two objectives
  • minimize the size of selected feature set F.
  • Keep the distorted distribution Pr(C|F) as
    similar as possible to the original Pr(C|T)

41
Truncation Algorithms Example
  • Kullback-Leibler (KL) divergence
  • Measures similarity or distance between two
    distributions
  • Markov Blanket
  • Let X be a feature in T, and let M be a subset of
    T that does not contain X
  • The presence of M renders the presence of X
    unnecessary as a feature => M is a Markov blanket
    for X
  • Technically
  • M is called a Markov blanket for X
    if X is conditionally independent of the remaining
    features and the class label, given M
  • Eliminating a variable because it has a Markov
    blanket contained in other existing features does
    not increase the KL distance between Pr(C|T) and
    Pr(C|F).

42
Finding Markov Blankets
  • Absence of Markov Blanket in practice
  • Finding approximate Markov blankets
  • Purpose To cut down computational complexity
  • Restrict the search for Markov blankets M to
    those with at most k features
  • For a given feature X, restrict the members of M
    to those features which are most strongly
    correlated with X (using tests similar to the χ²
    or MI tests)
  • Example: for the Reuters dataset, over two-thirds
    of T could be discarded while increasing
    classification accuracy

43
Feature Truncation algorithm
  • while the truncated Pr(C|F) is reasonably close
    to the original Pr(C|T) do
  • for each remaining feature X do
  • Identify a candidate Markov
    blanket M
  • For some tuned constant k, find
    the set M of k variables in F \ X that are most
    strongly correlated with X
  • Estimate how good a blanket M is
    (how little dropping X, given M, distorts the
    class distribution)
  • end for
  • Eliminate the feature having the best
    surviving Markov blanket
  • end while
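
A minimal sketch of this truncation loop; correlation, blanket_quality
(e.g. an expected-KL estimate) and close_enough are illustrative
placeholders for the statistics described on the slide:

# Hypothetical sketch of Markov-blanket feature truncation.
def truncate_features(features, k, correlation, blanket_quality, close_enough):
    F = set(features)
    while F and close_enough(F):        # truncate while Pr(C|F) stays close to Pr(C|T)
        best_x, best_q = None, None
        for x in F:
            # Candidate blanket: the k features most strongly correlated with x.
            M = sorted(F - {x}, key=lambda y: correlation(x, y), reverse=True)[:k]
            q = blanket_quality(x, M)   # how well M renders x redundant
            if best_q is None or q > best_q:
                best_x, best_q = x, q
        F.remove(best_x)                # drop the feature with the best blanket
    return F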

44
General observations on feature selection
  • The issue of document length should be addressed
    properly.
  • Choice of association measures does not make a
    dramatic difference
  • Greedy inclusion algorithms scale nearly linearly
    with the number of features
  • Markov blanket technique takes time proportional
    to at least .
  • Advantage of the Markov blanket algorithm over
    greedy inclusion
  • The greedy algorithm may include features with
    high individual correlations even though one
    subsumes the other
  • Features individually uncorrelated could be
    jointly more correlated with the class
  • This rarely happens
  • The binary include/exclude view of feature
    selection may not be the only view to subscribe to
  • Suggestion combine features into fewer, simpler
    ones
  • E.g. project the document vectors to a lower
    dimensional space

45
Bayesian Learner
  • Very practical text classifier
  • Assumption
  • A document can belong to exactly one of a set of
    classes or topics.
  • Each class c has an associated prior probability
    Pr(c),
  • There is a class-conditional document
    distribution Pr(d|c) for each class
  • Posterior probability Pr(c|d)
  • Obtained using Bayes rule (sketched below)
  • The parameter set Θ consists of the parameters of
    all the distributions Pr(d|c)
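
The Bayes rule step referred to above, written out:

  \Pr(c \mid d) = \frac{\Pr(c)\,\Pr(d \mid c)}{\sum_{c'} \Pr(c')\,\Pr(d \mid c')}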

46
Parameter Estimation for Bayesian Learner
  • The estimate of Θ is based on two sources of
    information
  • Prior knowledge on the parameter set before
    seeing any training documents
  • Terms in the training documents D.
  • Bayes Optimal Classifier
  • Taking the expectation of each parameter over the
    posterior Pr(Θ | D)
  • Computationally infeasible
  • Maximum likelihood estimate
  • Replace the expectation above with the value of
    the summand Pr(c | d, Θ) at Θ = arg max Pr(D | Θ)
  • Works poorly

47
Naïve Bayes Classifier
  • Naïve
  • assumption of independence between terms,
  • joint term distribution is the product of the
    marginals.
  • Widely used owing to
  • simplicity and speed of training, applying, and
    updating
  • Two kinds of widely used marginals for text
  • Binary model
  • Multinomial model

48
Naïve Bayes Models
  • Binary model
  • Each parameter φ_{c,t} indicates the probability
    that a document in class c will mention term t at
    least once.
  • Multinomial model
  • Each class has an associated die with W faces,
    one per term in the vocabulary.
  • Each parameter θ_{c,t} denotes the probability of
    face t turning up on tossing the die.
  • Term t occurs n(d,t) times in document d,
  • Document length is a random variable denoted L
  • (Both class-conditional likelihoods are sketched
    below)
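
A sketch of the two class-conditional likelihoods, using the notation
φ_{c,t} and θ_{c,t} introduced above (the slide's own formulas were
lost in transcription); ℓ_d = Σ_t n(d,t) is the document length:

  \text{Binary: } \Pr(d \mid c) = \prod_{t \in d} \phi_{c,t}
                                  \prod_{t \notin d} (1 - \phi_{c,t})

  \text{Multinomial: } \Pr(d \mid c) = \Pr(L = \ell_d \mid c)\,
      \frac{\ell_d!}{\prod_t n(d,t)!}\, \prod_t \theta_{c,t}^{\,n(d,t)},
      \qquad \sum_t \theta_{c,t} = 1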

49
Analysis of Naïve Bayes Models
  • Multiply together a large number of small
    probabilities,
  • Result extremely tiny probabilities as answers.
  • Solution store all numbers as logarithms
  • Class which comes out at the top wins by a huge
    margin
  • Sanitize scores using the likelihood ratio
  • Also called the logit function (see the sketch
    below)
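
A minimal sketch of log-space scoring, to avoid underflow from
multiplying many tiny probabilities; log_prior and log_theta are
illustrative parameter tables holding log Pr(c) and log θ_{c,t}:

# Hypothetical sketch of log-space naive Bayes scoring.
def log_score(doc_counts, c, log_prior, log_theta):
    # doc_counts maps term -> n(d, t); parameters are assumed already smoothed.
    return log_prior[c] + sum(n * log_theta[c][t] for t, n in doc_counts.items())

def log_odds(doc_counts, log_prior, log_theta, pos, neg):
    # Two-class logit / log likelihood-ratio score instead of a raw probability.
    return (log_score(doc_counts, pos, log_prior, log_theta)
            - log_score(doc_counts, neg, log_prior, log_theta))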

50
Parameter smoothing
  • What if a test document contains a term t
    that never occurred in any training document in
    class c?
  • Answer: the estimated Pr(d|c) will be zero
  • Even if many other terms clearly hint at a high
    likelihood of class c generating the document.
  • Bayesian Estimation
  • Estimating a probability from insufficient data.
  • If you toss a coin n times and it always comes up
    heads, what is the probability that the (n+1)-th
    toss will also come up heads?
  • Posit a prior distribution on the unknown
    parameter
  • E.g. the uniform distribution
  • Work with the resultant posterior distribution

51
Laplace Smoothing
  • Based on Bayesian Estimation
  • Laplace's law of succession
  • Choose a loss function (penalty) for picking a
    smoothed value as against the 'true' value.
  • E.g. the loss function as the squared error
  • For this choice of loss, the best choice of the
    smoothed parameter is simply the expectation of
    the posterior distribution, having observed
    the data

52
Laplace Smoothing (contd.)
  • Heuristic alternatives
  • Lidstone's law of succession (add a fraction λ
    instead of 1)
  • Derivation for the multinomial model
  • There are W possible events, where W is the
    vocabulary size.
  • (Smoothed estimates sketched below)
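
Sketches of the smoothed estimates referred to above; n(c,t) denotes
the total count of term t in training documents of class c (notation
mine, since the slide's formulas were lost):

  \text{Laplace (coin, k heads in n tosses): } \hat{p} = \frac{k + 1}{n + 2}

  \text{Laplace (multinomial): }
  \hat{\theta}_{c,t} = \frac{1 + n(c,t)}{W + \sum_{t'} n(c,t')}

  \text{Lidstone: }
  \hat{\theta}_{c,t} = \frac{\lambda + n(c,t)}{\lambda W + \sum_{t'} n(c,t')}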

53
Performance analysis
  • Multinomial naive Bayes classifier generally
    outperforms the binary variant
  • K-NN may outperform naïve Bayes
  • Naïve Bayes is faster and more compact
  • decision boundaries
  • regions of potential confusion

54
NB Decision boundaries
  • The Bayesian classifier partitions the
    multidimensional term space into regions
  • Within each region, the probability of one class
    is higher than others
  • On the boundaries, the probability of two or more
    classes are exactly equal
  • NB is a linear classifier
  • It makes a decision between c = 1 and c = -1
  • by thresholding the value of α·d + b for a
    suitable weight vector α and a bias b derived
    from the class priors

55
Pitfalls
  • Strong bias
  • fixes the policy that α_t (the t-th component of
    the linear discriminant) depends only on the
    statistics of term t in the corpus
  • Therefore it cannot pick from the entire set of
    possible linear discriminants

56
Bayesian Networks
  • Attempt to capture statistical dependencies
    between terms themselves
  • Approximations to the joint distribution over
    terms
  • Probability of a term occurring depends on
    observation about other terms as well as the
    class variable.
  • A directed acyclic graph
  • All random variables (classes and terms) are
    nodes
  • Dependency edges are drawn from c to t for each
    t (parent-child edges)
  • To represent additional dependencies between
    terms, dependency edges (parent to child) are
    drawn between terms

57
Bayesian networks. For the naive Bayes
assumption, the only edges are from the
class variable to individual terms. Towards
better approximations to the joint distribution
over terms the probability of a term occurring
may now depend on observation about other terms
as well as the class variable.
58
Bayesian Belief Network (BBN)
  • DAG
  • Parents Pa(X)
  • nodes that are connected by directed edges to a
    node X
  • Fixing the values of the parent variables
    completely determines the conditional
    distribution of X
  • Conditional Probability tables
  • For discrete variables, the distribution data for
    X can be stored in the obvious way as a table
    with each row showing a set of values of the
    parents, the value of X, and a conditional
    probability.
  • Unlike Naïve Bayes
  • Pr(d|c) is not a simple product over all terms;
    each term's factor is conditioned on its parents

59
BBN difficulty
  • Getting a good network structure.
  • At least quadratic time
  • Enumeration of all pairs of features
  • Exploited only for binary model
  • Multinomial model
  • Prohibitive CPT sizes

60
Exploiting hierarchy among topics
  • Ordering between the class labels
  • For Data warehousing
  • E.g. high, medium, or low cancer risk patients.
  • Text Class labels
  • Taxonomy
  • large and complex class hierarchy that relates
    the class labels
  • Tree structure
  • Simplest form of taxonomy
  • widely used in directory browsing,
  • often the output of clustering algorithms.
  • inheritance
  • If class c0 is the parent of class c1, any
    training document which belongs to c1 also
    belongs to c0.

61
Topic Hierarchies Feature selection
  • Discriminating ability of a term sensitive to the
    node (or class) in the hierarchy
  • Measure of discrimination of a term
  • Can be evaluated with respect to only internal
    nodes of the hierarchy.
  • 'can' may be a noisy word at the root node of
    Yahoo!
  • But it can help classify documents under the
    subtree of /Science/Environment/Recycling

62
Topic Hierarchies Enhanced parameter estimation
  • Uniform priors not good
  • Idea
  • If a parameter estimate is shaky at a node with
    few training documents, perhaps we can impose a
    strong prior from a well-trained parent to repair
    the estimates.
  • Shrinkage
  • Seeks to improve estimates of descendants using
    data from ancestors,

63
Shrinkage
  • Assume multinomial model
  • introducing a dummy class c0 as the parent of the
    root c1, where all terms are equally likely.
  • For a specific path c0, c1, ..., cn
  • the 'shrunk' estimate is determined by a convex
    linear interpolation of the MLE parameters at the
    ancestor nodes up through c0 (sketched below)
  • Estimation of mixing weights
  • Simple form of EM algorithm
  • Determined empirically, by iteratively maximizing
    the probability of a held-out portion Hn of the
    training set for node cn.
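
A sketch of the shrunk estimate described above, with mixing weights
λ_i and \hat{θ}_{c_i,t} the MLE at ancestor c_i (c_0 being the uniform
dummy class); the λ_i are what the held-out EM procedure estimates:

  \tilde{\theta}_{c_n,t} = \sum_{i=0}^{n} \lambda_i\, \hat{\theta}_{c_i,t},
  \qquad \lambda_i \ge 0,\ \ \sum_{i=0}^{n} \lambda_i = 1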

64
Shrinkage Observation
  • Improves accuracy beyond hierarchical naïve
    Bayes,
  • Improvement is high when data is sparse
  • Capable of utilizing many more features than
    Naïve Bayes

65
Topic search in Hierarchy
  • By definition
  • All documents are relevant to the root topic
  • Pr(root|d) = 1
  • Given a test document d
  • Find one or more of the most likely leaf nodes in
    the hierarchy.
  • A document cannot belong to more than one path

66
Topic search in Hierarchy Greedy Search strategy
  • Search starts at the root
  • Decisions are made greedily
  • At each internal node pick the highest
    probability class
  • Continue
  • Drawback
  • Early errors cause compounding effect

67
Topic search in Hierarchy Best-first search
strategy
  • For finding m most probable leaf classes
  • Find the weighted shortest path from the root to
    a leaf
  • Edge (c0, ci) is assigned a (non-negative) edge
    weight of Pr(ci | c0, d)
  • To make Best first search different from greedy
    search
  • Rescale/smoothen the probabilities

68
Using best-first search on a hierarchy can
improve both accuracy and speed. Because the
hierarchy has four internal nodes, the second
column shows the number of features for each.
These were tuned so that the total number of
features for both flat and best-first are roughly
the same (so that the model complexity is
comparable). Because each document belonged to
exactly one leaf node, recall equals precision in
this case and is called 'accuracy'.
69
The semantics of hierarchical classification
  • Asymmetry
  • training document can be associated with any
    node,
  • test document must be routed to a leaf,
  • Routing test documents to internal nodes
  • none of the children matches the document
  • many children match the document
  • the chances of making a mistake while pushing
    down the test document one more level may be too
    high.
  • Research issue

70
Maximum entropy learners Motivation
  • Bayesian learner
  • first model Pr(d|c) at training time
  • Apply Bayes rule at test time
  • Two problems with Bayesian learners
  • d is represented in a high-dimensional term space
  • So Pr(d|c) cannot be estimated accurately from a
    training set of limited size
  • No systematic way of adding synthetic features
  • Such an addition may result in
  • highly correlated features
  • high subsumption

71
Maximum entropy learners
  • Assume that each document has only one class
    label
  • Indicator functions fj(c,d)
  • Flag the jth condition relating class c to
    document d
  • The model is constrained so that the expectation
    of each indicator fj matches its empirical
    expectation
  • Approximate Pr(d,c) and Pr(d) with their
    empirical estimates when computing these
    expectations
  • (Constraints and solution form sketched after the
    next slide)

72
Principle of Maximum Entropy
  • Constraints don't determine Pr(c|d) uniquely
  • Principle of Maximum Entropy
  • prefer the simplest model to explain observed
    data.
  • Choose the Pr(c|d) that maximizes the entropy of
    Pr(c|d)
  • In the event of an empty training set we should
    consider all classes to be equally likely
  • Constrained Optimization
  • Maximize the entropy of the model distribution
    Pr(c|d)
  • While obeying the constraints for all j
  • Optimize by the method of Lagrange multipliers
    (solution form sketched below)
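
A sketch of the constrained problem and its well-known solution form;
λ_j are the Lagrange multipliers and Z(d) a per-document normalizer
(the slide's own formulas were lost in transcription):

  \text{Constraints: } \sum_{d} \tilde{\Pr}(d) \sum_{c} \Pr(c \mid d)\, f_j(c,d)
      = \sum_{(d,c)} \tilde{\Pr}(d,c)\, f_j(c,d) \quad \forall j

  \text{Solution: } \Pr(c \mid d) = \frac{1}{Z(d)}
      \exp\Big(\sum_j \lambda_j f_j(c,d)\Big),
  \qquad Z(d) = \sum_{c'} \exp\Big(\sum_j \lambda_j f_j(c',d)\Big)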

73
Maximum Entropy solution
  • Fitting the distribution to the data involves two
    steps
  • Identify a set of indicator functions derived
    from the data.
  • Iteratively arrive at values for the parameters
    that satisfy the constraints while maximizing the
    entropy of the distribution being modeled.
  • An equivalent optimization problem

74
Text Classification using Maximum Entropy Model
  • Example
  • Pick an indicator for each (class, term)
    combination.
  • For the binary document model,
  • For the multinomial document model
  • What we gain with Maximum Entropy over naïve
    Bayes
  • does not suffer from the independence assumptions
  • E.g.
  • if the terms t1 = "machine" and t2 = "learning"
    are often found together in class c,
  • the corresponding parameters would be suitably
    discounted rather than double-counted

75
Performance of Maximum Entropy Classifier
  • Outperforms naive Bayes in accuracy, but not
    consistently.
  • Table of figures

76
Discriminative classification
  • Naïve Bayes and Maximum Entropy Classifiers
  • induce linear decision boundaries between
    classes in the feature space.
  • Discriminative classifiers
  • Directly map the feature space to class labels
  • Class labels are encoded as numbers
  • e.g. +1 and -1 for a two-class problem
  • Two examples
  • Linear least-square regression
  • Support Vector Machines

77
Linear least-square regression
  • No inherent reason for going through the modeling
    step as in Bayesian or maximum entropy classifier
    to get a linear discriminant.
  • Linear regression problem
  • Look for some arbitrary weight vector α such that
    α·di directly predicts the label ci of
    document di.
  • Minimize the squared error between the observed
    and predicted class variable
  • Widrow-Hoff (WH) update rule (sketched below)
  • Scale the learned α to norm 1
  • Two equivalent interpretations
  • Classifier is a hyperplane
  • Documents are projected on to a direction
  • Performance
  • Comparable to Naïve Bayes and Max Ent
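
One common form of the Widrow-Hoff (least-mean-squares) update
mentioned above, with learning rate η (the slide's exact formula was
lost in transcription):

  \alpha^{(i+1)} = \alpha^{(i)} + \eta\,\big(c_i - \alpha^{(i)} \cdot d_i\big)\, d_i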

78
Support vector machines
  • Assumption training and test population are
    drawn from the same distribution
  • Hypothesis
  • Hyperplane that is close to many training data
    points has a greater chance of misclassifying
    test instances
  • A hyperplane which passes through a no-man's
    land, has lower chances of misclassifications
  • Make a decision by thresholding α·d + b
  • Seek an (α, b) which maximizes the distance of
    any training point from the hyperplane

79
Support vector machines
  • Optimal separator
  • Orthogonal to the shortest line connecting the
    convex hull of the two classes
  • Intersects this shortest line halfway
  • Margin
  • Distance of any training point from the optimized
    hyperplane
  • It is at least 1/||α|| under the canonical
    scaling (see the formulation sketched after
    slide 81)

80
Illustration of the SVM optimization problem.
81
SVMs non separable classes
  • Classes in the training data not always
    separable.
  • Introduce slack ("fudge") variables
  • Solve the equivalent dual problem (primal
    formulation sketched below)
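
A sketch of the standard primal formulations behind these slides, in
the document's α notation; C is a tuning constant and ξ_i are the slack
("fudge") variables:

  \text{Separable: } \min_{\alpha, b}\ \tfrac{1}{2}\|\alpha\|^2
      \ \text{ s.t. } c_i(\alpha \cdot d_i + b) \ge 1 \ \ \forall i

  \text{Non-separable: } \min_{\alpha, b, \xi}\ \tfrac{1}{2}\|\alpha\|^2
      + C\sum_i \xi_i
      \ \text{ s.t. } c_i(\alpha \cdot d_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0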

82
SVMs Complexity
  • Quadratic optimization problem.
  • Working set: refine a few Lagrange multipliers at
    a time, holding the others fixed.
  • On-demand computation of inner-products
  • n documents
  • Recent SVM packages
  • Linear time by clever selection of working sets.

83
Performance
  • Comparison with other classifiers
  • Amongst the most accurate classifiers for text
  • Better accuracy than naive Bayes and decision
    tree classifier,
  • interesting revelation
  • Linear SVMs suffice
  • standard text classification tasks have classes
    almost separable using a hyperplane in feature
    space
  • Research issues
  • Non-linear SVMs

84
SVM training time variation as the training set
size is increased, with and without sufficient
memory to hold the training set. In the latter
case, the memory is set to about a quarter of
that needed by the training set.
85
Comparison of LSVM with previous classifiers on
the Reuters data set (data taken from Dumais).
(The naive Bayes classifier used binary features,
so its accuracy can be improved)
86
Comparison of accuracy across three classifiers
(Naive Bayes, Maximum Entropy and Linear SVM),
using three data sets: 20 newsgroups, the
Recreation sub-tree of the Open Directory, and
University Web pages from WebKB.
87
Comparison between several classifiers using the
Reuters collection.
88
Hypertext classification
  • Techniques to address hypertextual features.
  • Document Object Model or DOM
  • A well-formed HTML document is a properly nested
    hierarchy of regions represented by a
    tree-structured DOM tree
  • Internal nodes are elements
  • Some of the leaf nodes are segments of text
  • Other nodes are hyperlinks to other Web pages,
    which in turn have their own DOM trees

89
Representing hypertext for supervised learning
  • Paying special attention to tags can help with
    learning
  • keyword-based search
  • assign heuristic weights to terms that occur in
    specific HTML tags
  • Example.. (next slide)

90
Prefixing with tags
  • Distinguishing between the two occurrences of the
    word "surfing":
  • Prefix each term by the sequence of tags that
    we need to follow from the DOM root to get to the
    term (see the sketch below)
  • A repeated term in different sections should
    reinforce belief in a class label
  • Using a maximum entropy classifier
  • Accumulate evidence from different features
  • Maintain both forms of a term
  • Plain text and prefixed text (all path prefixes)
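
A minimal sketch of path prefixing, assuming an HTML fragment parsed
with Python's standard html.parser; tag and attribute handling is kept
deliberately minimal and the example page is invented:

# Hypothetical sketch of prefixing terms with their DOM path.
from html.parser import HTMLParser

class PathPrefixer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.features = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
    def handle_data(self, data):
        prefix = ".".join(self.stack)
        for term in data.lower().split():
            self.features.append(term)                   # plain form
            self.features.append(prefix + "." + term)    # path-prefixed form

p = PathPrefixer()
p.feed("<html><body><h1>surfing</h1><p>surfing the Web</p></body></html>")
print(p.features)  # e.g. 'html.body.h1.surfing' vs 'html.body.p.surfing'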

91
Experiments
  • 10,705 patents from the US Patent Office
  • 70% error with a plain-text classifier
  • 24% error with path-tagged terms
  • 17% error with path prefixes
  • 1,700 resumes (with a naive Bayes classifier)
  • 53% error with flattened HTML
  • 40% error with prefix-tagged terms

92
Limitations
  • Prefix representations
  • ad-hoc
  • inflexible.
  • Generalizability
  • How to incorporate additional features ?
  • E.g. adding features derived from hyperlinks.
  • Relations
  • uniform way to codify hypertextual features.
  • Example

93
Rule Induction for relational learning
  • Inductive classifiers
  • discover rules from a collection of relations.
  • Example solution for above
  • Goal Discover a set of predicate rules
  • Consider 2 class setting
  • Positive examples D+ and negative examples D-
  • Test instance
  • If the learned rules evaluate to true, it is a
    positive instance; else a negative instance.

94
Rule induction with First Order Inductive Logic
(FOIL)
  • Well-known rule learner
  • Start with empty rule set
  • learn new (disjunctive) rule
  • add conjunctive literals to the new rule until no
    negative example is covered by the new rule.
  • pick a literal which increases the ratio of
    surviving positive to negative bindings rapidly.
  • Remove positive examples covered by any rule
    generated thus far.
  • Till no positive instances are left (a sketch of
    this covering loop follows below)
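
A hypothetical high-level sketch of FOIL's covering loop;
candidate_literals, gain and covers stand in for FOIL's actual literal
generation, information-gain heuristic and binding/coverage tests:

# Hypothetical sketch of FOIL's outer covering loop.
def foil(positives, negatives, candidate_literals, gain, covers):
    rules, pos = [], set(positives)
    while pos:                                    # until no positive instances remain
        rule, neg = [], set(negatives)
        while neg:                                # specialize until no negatives covered
            lit = max(candidate_literals(rule),
                      key=lambda l: gain(l, rule, pos, neg))
            rule.append(lit)                      # add the best conjunctive literal
            neg = {e for e in neg if covers(rule, e)}
        rules.append(rule)                        # learned one disjunctive rule
        pos = {e for e in pos if not covers(rule, e)}   # drop covered positives
    return rules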

95
Literals Explored
  • Q(X1, ..., Xk), where Q is a relation and the Xi
    are variables, at least one of which must be
    already bound.
  • not(L), where L is a literal of the above forms.

96
Analysis
  • Can learn class labels for individual pages
  • Can learn relationships between labels
  • member(homePage, department)
  • teaches(homePage, coursePage)
  • advises(homePage, homePage)
  • writes(homePage, paper)
  • Hybrid approaches
  • Statistical classifier
  • more complex search for literals
  • Inductive learning
  • comparing the estimated probabilities of various
    classes.
  • Recursively labeling relations
  • E.g. relating page label in terms of labels of
    neighboring pages
  • classified(A, facultyPage) :-
  • links-to(A, B), classified(B, studentPage),
  • links-to(A, C), classified(C, coursePage),
  • links-to(A, D), classified(D, publicationsPage).