Provided by: anal113
Learn more at: https://cs.gmu.edu

1
Mustafa Cayci, INFS 795
An Evaluation on Feature Selection for Text Clustering
2
Introduction
  • Text clustering is the problem of automatically
    grouping free-text documents into clusters of
    similar documents, without predefined categories.
  • Effective and efficient information retrieval
  • Organized results
  • Generating taxonomies and ontologies
  • A text or document is represented as a bag of words.

3
Introduction
  • The major problem of this approach is the high
    dimensionality of the feature space.
  • The feature space consists of the unique terms
    that occur in the documents, which can number in
    the tens or hundreds of thousands.
  • This is prohibitively high for many learning
    algorithms.

4
Introduction
  • High dimensionality of feature space is a
    challenge for clustering algorithms because of
    the inherent data sparseness.
  • The concept of proximity or clustering may not be
    meaningful in a high-dimensional feature space.
  • The solution is to reduce the feature space
    dimensionality.

5
Feature Selection
  • Feature selection methods include the removal of
    non-informative terms.
  • The focus of this presentation is the evaluation
    and comparison of feature selection methods in
    the reduction of a high dimensional feature space
    in text clustering problems.

6
Feature Selection
  • What are the strengths and weaknesses of existing
    feature selection methods applied to text
    clustering?
  • To what extent can feature selection improve the
    accuracy of a classifier?
  • How much of the document vocabulary can be
    reduced without losing useful information in
    category prediction?

7
Feature Selection Methods
  • A brief introduction to several feature
    selection methods:
  • Information Gain (IG)
  • χ² Statistic (CHI)
  • Document Frequency (DF)
  • Term Strength (TS)
  • Entropy-based Ranking (En)
  • Term Contribution (TC)

8
Information Gain (IG)
  • Information gain is frequently employed as a
    term-goodness criterion in the field of machine
    learning.
  • It measures the number of bits of information
    obtained for category prediction by knowing the
    presence or absence of a term in a document.

9
Information Gain (IG)
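  • Let $\{c_i\}_{i=1}^{m}$ denote the set of categories in
    the target space.
  • The information gain of term $t$ is defined to be
    $$G(t) = -\sum_{i=1}^{m} \Pr(c_i)\log \Pr(c_i)
      + \Pr(t)\sum_{i=1}^{m} \Pr(c_i \mid t)\log \Pr(c_i \mid t)
      + \Pr(\bar{t})\sum_{i=1}^{m} \Pr(c_i \mid \bar{t})\log \Pr(c_i \mid \bar{t})$$
  • Here $\bar{t}$ denotes the absence of term $t$.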
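
As a rough illustration of the definition above, the following sketch estimates the probabilities by simple counting over a labelled corpus. The function name `information_gain`, the set-of-terms document representation, and the use of base-2 logarithms are assumptions made for the example, not details given in the slides.

```python
import math


def information_gain(docs, labels, term):
    """Estimate G(term) from counts.

    docs   -- list of sets of terms (one set per document; assumed format)
    labels -- list of category labels, parallel to docs
    """
    n = len(docs)
    categories = sorted(set(labels))

    def plogp_sum(indices):
        # Sum over categories of Pr(c | subset) * log2 Pr(c | subset).
        if not indices:
            return 0.0
        total = 0.0
        for c in categories:
            p = sum(1 for i in indices if labels[i] == c) / len(indices)
            if p > 0:
                total += p * math.log2(p)
        return total

    with_t = [i for i in range(n) if term in docs[i]]
    without_t = [i for i in range(n) if term not in docs[i]]
    p_t = len(with_t) / n

    # -sum Pr(c) log Pr(c) + Pr(t) sum Pr(c|t) log Pr(c|t)
    #                      + Pr(not t) sum Pr(c|not t) log Pr(c|not t)
    return (-plogp_sum(list(range(n)))
            + p_t * plogp_sum(with_t)
            + (1 - p_t) * plogp_sum(without_t))
```

A term that perfectly separates two equally sized categories gains one full bit under this estimate, while a term that appears in no document gains nothing.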

10
Information Gain (IG)
  • Given a training corpus, information gain is
    computed for each unique term, and terms whose
    information gain is below some predetermined
    threshold are removed from the feature space.
  • The computation includes the estimation of the
    conditional probabilities of a category given a
    term, and entropy computations.
  • The probability estimation has a time complexity
    of O(N) and space complexity of O(VN) where N is
    the number of training documents and V is the
    vocabulary size.

11
χ² Statistic (CHI)
  • The χ² statistic measures the lack of
    independence between t and c and can be compared
    to the χ² distribution with one degree of freedom.
  • Using the contingency table of a term t and a
    category c, where A is the number of times t and
    c co-occur, B is the number of times t occurs
    without c, C is the number of times c occurs
    without t, D is the number of times neither c nor
    t occurs, and N is the total number of documents,
    the term-goodness measure is
    $$\chi^2(t, c) = \frac{N \times (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$

12
χ² Statistic (CHI)
  • The χ² statistic has a natural value of zero if
    t and c are independent.
  • For each category, the χ² statistic is computed
    between each unique term in the training corpus
    and that category; the category-specific scores
    of a term are then averaged:
    $$\chi^2_{avg}(t) = \sum_{i=1}^{m} \Pr(c_i)\,\chi^2(t, c_i)$$
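
The contingency counts and the weighted average can be sketched as follows. This assumes the standard 2×2 form of the statistic, N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), and estimates Pr(c) as the fraction of documents labelled c; the function names and the set-of-terms representation are illustrative only.

```python
def chi_square(a, b, c, d):
    """Chi-square for one (term, category) pair from its 2x2 contingency counts.

    a: documents with the term and in the category
    b: documents with the term, not in the category
    c: documents in the category without the term
    d: documents with neither
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as independent.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_avg(docs, labels, term):
    """Average of chi_square(term, c) over categories, weighted by Pr(c)."""
    n = len(docs)
    score = 0.0
    for cat in set(labels):
        a = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab == cat)
        b = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab != cat)
        c = sum(1 for doc, lab in zip(docs, labels) if term not in doc and lab == cat)
        d = n - a - b - c
        score += (a + c) / n * chi_square(a, b, c, d)
    return score
```

With all four cells equal the statistic is zero, matching the independence case noted above.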

13
Document Frequency (DF)
  • Document frequency is the number of documents in
    which a term occurs.
  • Document frequency is computed for each unique
    term in the training corpus, and terms whose DF
    is less than some predetermined threshold are
    removed from the feature space.
  • The basic assumption is that rare terms are
    either non-informative for category prediction
    or not influential in global performance.
  • Observation: in information retrieval, low-DF
    terms are commonly assumed to be relatively
    informative, so they should not be removed
    aggressively.
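
DF thresholding is simple enough to sketch in a few lines. The names `document_frequency` and `select_by_df`, and the token-list document format, are assumptions for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for doc in docs:
        # set() so that repeated occurrences within one document count once.
        df.update(set(doc))
    return df


def select_by_df(docs, min_df):
    """Keep only the terms whose document frequency is at least min_df."""
    df = document_frequency(docs)
    return {term for term, count in df.items() if count >= min_df}
```

No category labels are needed, which is why DF can be used for clustering as well as for classification.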

14
Term Strength (TS)
  • Term strength was originally proposed and
    evaluated by Wilbur and Sirotkin for vocabulary
    reduction in text retrieval.
  • This method estimates term importance based on
    how commonly a term is likely to appear in
    closely related documents.
  • It uses a training set of documents to derive
    document pairs whose similarity is above a
    threshold.
  • Term strength is then computed based on the
    estimated conditional probability that a term
    occurs in the second half of a pair of related
    documents given that it occurs in the first half.
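
A minimal sketch of this estimate follows. The slides do not say which similarity measure or pairing convention is used, so cosine similarity over binary term sets and ordered pairs (x, y) are assumptions here, as are the function names.

```python
import math
from collections import Counter
from itertools import permutations


def cosine(x, y):
    """Cosine similarity of two documents given as sets of terms."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


def term_strength(docs, threshold):
    """Estimate Pr(t in y | t in x) over related ordered pairs (x, y)."""
    in_first = Counter()  # related pairs where t occurs in the first half
    in_both = Counter()   # related pairs where t occurs in both halves
    for x, y in permutations(docs, 2):
        if cosine(x, y) > threshold:
            for t in x:
                in_first[t] += 1
                if t in y:
                    in_both[t] += 1
    # Terms that never occur in a related pair get no score.
    return {t: in_both[t] / in_first[t] for t in in_first}
```

Enumerating all pairs is quadratic in the number of documents, so on a large corpus the pairs would have to be sampled or found with an index.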

15
Entropy Based Ranking
  • Consider each feature $F_i$ as a random variable
    and $f_i$ as its value. From entropy theory,
    entropy is
    $$E(F_1, \dots, F_M) = -\sum_{f_1} \cdots \sum_{f_M} p(f_1, \dots, f_M) \log p(f_1, \dots, f_M)$$
  • where $p(f_1, \dots, f_M)$ is the probability or
    density at the point $(f_1, \dots, f_M)$.
  • If the probability is uniformly distributed, we
    are most uncertain about the outcome, and
    entropy is at its maximum.

16
Entropy Based Ranking
  • When the data has well-formed clusters, the
    uncertainty is low and so is the entropy.
  • In real-world data, there are few cases in which
    the clusters are well-formed.
  • Two points belonging to the same cluster or to
    two different clusters contribute less to the
    total entropy than if they were uniformly
    separated.
  • The similarity $S_{i_1,i_2}$ between two instances
    $X_{i_1}$ and $X_{i_2}$ is high if the two instances
    are very close and low if they are far apart. The
    entropy $E_{i_1,i_2}$ will be low if $S_{i_1,i_2}$ is
    either high or low, and high otherwise.

17
Entropy Based Ranking
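  • The entropy is computed from pairwise
    similarities, where $S_{i,j}$ is the similarity
    value between documents $d_i$ and $d_j$, defined as
    follows:
    $$S_{i,j} = e^{-\alpha \times dist_{i,j}}, \qquad \alpha = -\frac{\ln(0.5)}{\overline{dist}}$$
  • where $dist_{i,j}$ is the distance between
    documents $d_i$ and $d_j$ after the term $t$ is
    removed, and $\overline{dist}$ is the average
    distance among the documents after $t$ is removed.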
18
Term Contribution
  • Text clustering is highly dependent on
    document similarity:
    $$Sim(d_i, d_j) = \sum_{t} f(t, d_i) \times f(t, d_j)$$
  • where $f(t, d_i)$ represents the weight of term $t$
    in document $d_i$.
  • The tf·idf value can serve as this weight, where
    tf is the term frequency and idf is the inverse
    document frequency.

19
Term Contribution
  • The contribution of a term is its overall
    contribution to the document similarities, given
    by the following equation:
    $$TC(t) = \sum_{i,j,\ i \neq j} f(t, d_i) \times f(t, d_j)$$
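
A sketch of the term contribution computation, assuming the sum runs over all ordered pairs of distinct documents and that each document is given as a mapping from term to weight (both assumptions; the function name is illustrative). It uses the identity Σ_{i≠j} f_i f_j = (Σ_i f_i)² − Σ_i f_i² to avoid enumerating pairs.

```python
from collections import defaultdict


def term_contribution(weighted_docs):
    """Compute TC(t) for every term from per-document term weights.

    weighted_docs -- list of dicts mapping term -> weight f(t, d)
    """
    total = defaultdict(float)    # sum of f(t, d) over documents
    squares = defaultdict(float)  # sum of f(t, d)^2 over documents
    for doc in weighted_docs:
        for term, weight in doc.items():
            total[term] += weight
            squares[term] += weight * weight
    # (sum f)^2 counts every ordered pair including i == j; subtract those.
    return {term: total[term] ** 2 - squares[term] for term in total}
```

This runs in time linear in the number of non-zero weights, and like DF it needs no category labels.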

20
Experiments
  • Supervised feature selection methods evaluated:
    IG and CHI
  • Unsupervised feature selection methods
    evaluated: DF, TS and TC

21
Experiments
  • The K-Means algorithm is chosen to perform the
    actual clustering.
  • Entropy and precision measures are used to
    evaluate the clustering performance.
  • 10 sets of initial centroids are chosen randomly.
  • Before clustering is performed, tf·idf (with the
    "ltc" scheme) is used to calculate the weight of
    each term.
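
The weighting step can be sketched as follows, assuming "ltc" refers to the SMART notation: logarithmic term frequency (1 + log tf), idf as log(N / df), and cosine normalization of each document vector. That reading, the natural logarithm, and the function name are assumptions.

```python
import math
from collections import Counter


def ltc_weights(docs):
    """Return one {term: weight} dict per document using ltc tf-idf.

    docs -- list of token lists (assumed format)
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # l: 1 + log(tf), t: log(N / df)
        raw = {t: (1 + math.log(f)) * math.log(n / df[t]) for t, f in tf.items()}
        # c: cosine normalization, so each document vector has unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted
```

A term that occurs in every document receives weight zero, since its idf is log(1) = 0.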

22
Performance Measure
  • Entropy
  • Entropy measures the uniformity or purity of a
    cluster. The overall entropy of a clustering is
    defined as the weighted sum of the entropies of
    the individual clusters.
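
One common way to compute this measure is sketched below. The slide's own formula is not available, so weighting each cluster by its share of the documents and using the base-2 entropy of the class distribution inside each cluster are assumptions, as is the function name.

```python
import math
from collections import Counter


def clustering_entropy(clusters, classes):
    """Size-weighted sum of per-cluster class entropies.

    clusters -- cluster id assigned to each document
    classes  -- true class label of each document (parallel list)
    """
    n = len(clusters)
    members = {}
    for cluster_id, label in zip(clusters, classes):
        members.setdefault(cluster_id, []).append(label)
    total = 0.0
    for labels in members.values():
        size = len(labels)
        counts = Counter(labels)
        # Entropy of the class distribution inside this cluster.
        e = -sum((m / size) * math.log2(m / size) for m in counts.values())
        total += (size / n) * e
    return total
```

Lower values indicate purer clusters; a clustering in which every cluster contains a single class scores zero.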

23
Performance Measure
  • Precision
  • For each cluster, the class label shared by the
    most documents in the cluster becomes its final
    class label.
  • The final precision is defined as the weighted
    sum of the precision for all clusters.
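
The precision measure described above can be sketched as follows, assuming each cluster is weighted by its share of the documents (the weights are not spelled out in the slides) and using an illustrative function name.

```python
from collections import Counter


def clustering_precision(clusters, classes):
    """Size-weighted sum of per-cluster precision under majority labelling.

    clusters -- cluster id assigned to each document
    classes  -- true class label of each document (parallel list)
    """
    n = len(clusters)
    members = {}
    for cluster_id, label in zip(clusters, classes):
        members.setdefault(cluster_id, []).append(label)
    total = 0.0
    for labels in members.values():
        size = len(labels)
        # Number of documents carrying the cluster's majority class label.
        majority = Counter(labels).most_common(1)[0][1]
        total += (size / n) * (majority / size)
    return total
```

For example, with clusters `[0, 0, 0, 1, 1]` and classes `["a", "a", "b", "b", "b"]`, four of the five documents match their cluster's majority label.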

24
Data Sets
  • Data sets are Reuters-21578, 20 Newsgroups and
    one web directory dataset (Web)
  • Data set properties

25
Results and Analysis
  • Supervised Feature Selection
  • The IG and CHI feature selection methods are
    applied.
  • In general, feature selection makes little
    progress on Reuters and 20NG.
  • It achieves much improvement on the Web
    directory dataset.
  • Unsupervised Feature Selection
  • The DF, TS, TC and En feature selection methods
    are applied.
  • When 90% of terms are removed, entropy is reduced
    by 2% and precision is increased by 1%.
  • When more terms are removed, the performance of
    the unsupervised methods drops quickly, whereas
    the performance of the supervised methods still
    improves.