Title: COMP5318/4044, Lecture 10: Knowledge Discovery and Data Mining
1. COMP5318/4044, Lecture 10: Knowledge Discovery and Data Mining
- Getting Started
- What if I only have a small training set?
- This set of slides is adapted from two different pieces of work with Irena Koprinska, Jason Chan, Martin Buchholz and Dirk Pflüger
2. Motivation
- Machine Learning usually depends on a large training set, but obtaining one is not always feasible in the real world:
  - Rare events
  - Expensive to obtain
  - Relevant domain knowledge may be lacking
- How can we improve classification performance even when we have only a small set of training examples?
  - Make use of the unlabelled set
  - Create artificial examples
Co-Training with a Random Split of a Single Natural Feature Set

3. Available Methods
- Goal: make more use of unlabelled data while using very few labelled instances to help classifiers learn
- Co-Testing
  - An active learning algorithm that exploits multiple views. It is based on the idea of learning from mistakes: it queries examples on which the views predict different labels.
- Expectation Maximization (EM)
  - EM is a statistical algorithm, here used with the finite Gaussian mixtures model. Like the K-means procedure, a set of parameters is re-computed until a desired convergence value is achieved. The finite mixtures model assumes all attributes to be independent random variables. This algorithm is part of the Weka clustering package. (A minimal sketch follows after this list.)
- Self-Training
- Co-Training
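To make the EM description concrete, here is a minimal sketch of EM over a finite Gaussian mixture. The slide refers to Weka's implementation; this uses scikit-learn's GaussianMixture as an equivalent illustration, with synthetic data and parameter choices that are mine, not the slides'.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic clusters; in the lecture's setting these would be instances
# described by (assumed independent) attributes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

# EM re-estimates the mixture parameters until the convergence tolerance
# (`tol`) is reached, much like the K-means re-estimation loop.
gmm = GaussianMixture(n_components=2, tol=1e-3, random_state=0).fit(X)
clusters = gmm.predict(X)
```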
4. The Co-Training Algorithm
- Introduced by Avrim Blum and Tom Mitchell in 1998
- Applied to web-page classification
- Problem: identify the home pages of courses
- How:
  1. Build two classifiers with separate (disjoint) feature sets, trained on the same small labelled set:
     - words in the body of the web page
     - words in hyperlinks of other documents referring to that particular page
  2. The classifiers label instances in the unlabelled set
  3. Each classifier selects its most confidently predicted examples and adds them to the labelled set
  4. Repeat from 1
- Pseudocode (a Python sketch follows after the reference below):
  - Obtain small set L of labelled examples
  - Obtain large set U of unlabelled examples
  - Obtain 2 sets F1 and F2 of features describing the dataset
  - while U ≠ ∅ do
    - train classifier C1 from L based on F1
    - train classifier C2 from L based on F2
    - for each classifier Ci do
      - Ci labels examples in U based on Fi
      - Ci chooses its most confidently predicted examples E from U
      - E is removed from U and added (with their given labels) to L
    - end for
  - end while
A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT'98), 1998.
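Here is a minimal Python sketch of the loop above, assuming scikit-learn. The choice of Naive Bayes, the pool size k, and the confidence criterion (top-k most confident overall, a simplification of choosing p positives and n negatives) are illustrative; this is a sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, labelled, rounds=10, k=6):
    """Co-training sketch. X1, X2: the two feature views (e.g. term counts);
    y: labels with -1 marking unlabelled instances; labelled: indices of the
    initial labelled set (must contain both classes)."""
    y = y.copy()
    L = set(labelled)
    U = set(range(len(y))) - L
    for _ in range(rounds):
        for Xi in (X1, X2):
            if not U:
                return y
            idx_L = sorted(L)
            clf = MultinomialNB().fit(Xi[idx_L], y[idx_L])
            pool = sorted(U)
            proba = clf.predict_proba(Xi[pool])
            # move the most confidently predicted examples from U to L
            for j in np.argsort(proba.max(axis=1))[::-1][:k]:
                i = pool[j]
                y[i] = clf.classes_[proba[j].argmax()]
                L.add(i)
                U.discard(i)
    return y
```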
5. Assumptions of Co-Training: Assumptions of the Feature Sets
- Conditional independence
  - Knowing the values of F1 does not allow you to predict the values of F2
- Redundant sufficiency
  - Using only one of F1 or F2 separately still gives good classification accuracy
6. Experiment 1 (Web): Setup (1)
- Domain: web-browsing agent
- 4 users
- 4 topics:
  - nuclear fusion
  - circulatory system
  - food pyramid
  - greenhouse effect / ozone layer
- 80 pages per topic
7. Experiment 1: Setup (2)
- Each document is represented using bag-of-words
- Information Gain was used for feature selection (a pipeline sketch follows after this list)
  - The top 100 words were selected
  - A reduction of about 98%
- Term frequency was used in the feature vector
- Four types of classifiers were tested:
  - Decision Tree (DT)
  - Random Forest (RF)
  - Naïve Bayes (NB)
  - Support Vector Machine (SVM)
- Classification performance was measured using 10-fold cross-validation
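A minimal sketch of the feature pipeline just described, assuming scikit-learn: bag-of-words term frequencies with the top 100 words selected by an information-gain-style criterion (mutual information stands in for Information Gain here). The documents and labels are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

docs = ["nuclear fusion reactor plasma", "food pyramid diet servings"]  # placeholder pages
labels = [0, 1]                                                          # placeholder topics

vectorizer = CountVectorizer()                 # bag-of-words term frequencies
X = vectorizer.fit_transform(docs)

# Mutual information plays the role of the slide's Information Gain criterion.
k = min(100, X.shape[1])                       # top 100 words (~98% reduction)
X_top = SelectKBest(mutual_info_classif, k=k).fit_transform(X, labels)
```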
8. Feature Sets
- Natural feature sets:
  - Titles/Headings/Hyperlinks: all words that appear in titles, headings or hyperlinks
  - Body: all words that appear in the web page, not counting occurrences in the titles, headings, or hyperlinks
- Random-selection feature sets (a splitting sketch follows after this list):
  - Half1: a random selection of half of the feature set Body
  - Half2: the other half, i.e. the words not in Half1
  - Fifth1: a random selection of a fifth of the feature set Body
  - Fifth2: a random selection of another fifth of the feature set Body
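A small sketch of how the random splits above can be produced; the Body vocabulary is represented by placeholder indices.

```python
import numpy as np

rng = np.random.default_rng(0)
body = np.arange(5000)                         # placeholder indices for Body words
perm = rng.permutation(body)

half1, half2 = perm[:2500], perm[2500:]        # Half1 and its disjoint complement Half2
fifth1, fifth2 = perm[:1000], perm[1000:2000]  # two disjoint fifths of Body
```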
9. WebSL: F-measures in Supervised Learning
- Plain supervised learning, no co-training
- 10-fold cross-validation, with 90% of the data as the training set
- Results were averaged over all users and topics
- The numbers in the table are macro-averaged F-measures (defined below)
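For reference, the macro-averaged F-measure combines per-class precision $P_c$ and recall $R_c$, then averages over the $C$ classes:

$$F_1^{(c)} = \frac{2\,P_c R_c}{P_c + R_c}, \qquad F_{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C} F_1^{(c)}$$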
10. WebCT: The Comparison
- Co-training with natural features:
  - words in the main body of the page
  - words in titles, headings, and hyperlinks
- Co-training with a random split:
  - using only the words in the body
- Is co-training with a random split still beneficial?
11. WebCT: F-measures in Co-Training
Table 1. Maximum increase in classification performance of the combined classifier using co-training (table not reproduced here).
Initial size of labelled examples: 8 instances
12. Experiment 2 (Spam): Setup
- Domain: spam detection
- LingSpam: emails sent to the Linguist mailing list
  - emails: 2893
  - legitimate emails: 2412 (83.4%)
  - spam: 481 (16.6%)
13. Experiment 2: Results (1)
- SpamSL
  - Plain supervised learning, no co-training
  - 10-fold cross-validation, with 90% of the data as the training set (2595 instances in training)
  - Results were averaged over all users and topics
  - The numbers in the table are macro-averaged F-measures
- SpamCT
  - Initial size of labelled examples: 8 instances
14. SpamCT: Results (2)
NB results (chart not reproduced here). Initial labelled spam: 1; initial labelled non-spam: 1; p = 1, n = 5.
15. Why is the random split so good?
- One of the natural feature sets is considerably weaker than the other
  - The classifier using Title/Subject incorrectly labels many instances compared with classifiers built using the random-selection feature sets, hence transferring many incorrectly labelled instances into the labelled set.
- As found in the supervised learning experiments, classifiers using a random selection of half of all the features perform only slightly worse than a classifier using all the available attributes.
  - As a result, when performing co-training, both classifiers, each using its half of the features, are able to improve the training set by labelling unlabelled instances with sufficiently high classification performance.
16. Conclusion
- In our experiments:
  - Comparison between co-training with random splitting and co-training using two natural feature sets
  - Co-training with a random split of a single natural feature set can be just as competitive as co-training with two natural feature sets
- Reasons cited for the observed experimental results:
  - Random splitting is favoured when the data contains many weak attributes and/or the two natural feature sets differ significantly in strength
  - These conditions are very common (e.g. text categorization), which indicates that co-training with random splitting has great practical potential
17. Another Approach to Tackling a Small Training Set: Artificial Examples
- Domain problem:
  - Vast amount of information available
  - Use of search engines
    - Thousands of results
    - Often many irrelevant pages
18. User's Perspective
- Ranking
  - Ranking is optimised for the whole web community, not for a user's individual needs
- Query formulation
  - Difficult to make one's intention explicit
  - NEC study: 50% one-word queries
- → Use example pages to specify the user's intention more precisely
19. Common Approaches
- Restrict to a limited domain, construct domain-specific search engines
  - e.g. create ontologies/hierarchies [1]
- Query refinement (add keywords) [2]
- Clustering instead of ranking [3]

[1] Dwi H. Widyantoro, John Yen: A Fuzzy Ontology-based Abstract Search Engine and Its User Studies. Proc. of the 10th IEEE International Conference on Fuzzy Systems, vol. 2, pp. 1291-1294, 2001.
[2] Satoshi Oyama, Takashi Kokubo, Toru Ishida: Domain-Specific Web Search with Keyword Spices. IEEE Transactions on Knowledge and Data Engineering, 2003.
[3] Michael Chau, Daniel Zeng, Hsinchun Chen: Personalized Spiders for Web Search and Analysis. Proc. of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'01), Roanoke, Virginia, June 24-28, 2001, pp. 79-87.
20. Our Approach
- Use of a general-purpose search engine
  - No restriction to a special domain
- The user provides:
  - a general query
  - example pages (implicit knowledge)
- Re-ranking based on the confidence of the ML classifier (see the sketch below)
- Keep a representation of the results
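A hedged sketch of the confidence-based re-ranking step, assuming a fitted scikit-learn-style classifier; `clf` and `X_results` are placeholders for the trained model and the feature matrix of the returned pages.

```python
import numpy as np

def rerank(clf, X_results):
    """Return result indices sorted by the classifier's confidence that
    the page is relevant (positive class), highest first."""
    confidence = clf.predict_proba(X_results)[:, 1]
    return np.argsort(confidence)[::-1]
```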
21. System Design (diagram not reproduced here)
22. Feature Selection
- Feature reduction: nouns
- A threshold on the number of features supplied
- Use features from positive and unclassified documents
- Use of a modified tf-idf (one possible formalisation below)
  - tf additionally weighted with the inverse document length
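The slide does not give the exact formula; one plausible reading of "tf additionally weighted with inverse document length" is

$$w(t,d) = \frac{\mathrm{tf}(t,d)}{|d|}\cdot\log\frac{N}{\mathrm{df}(t)}$$

where $\mathrm{tf}(t,d)$ is the frequency of term $t$ in document $d$, $|d|$ the length of $d$ (the inverse-document-length weighting), $N$ the number of documents, and $\mathrm{df}(t)$ the number of documents containing $t$.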
23. Evaluation Method
- Precision and recall are not feasible
  - No classification, only a ranking
  - The user looks only at the first few documents
- → cost-benefit analysis
24. Evaluation Method (cont.)
Assume we have a set of 15 documents consisting of 5 positive and 10 negative documents.
- User goal: as many good documents as soon as possible (a sketch of the resulting benefit curve follows below)
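The slides do not define the exact cost-benefit metric, so the following is one plausible reading: for a hypothetical ranking of the 15 documents, count how many positives the user has seen after inspecting the top k results and compare against the ideal ranking.

```python
import numpy as np

# 1 = positive, 0 = negative; a hypothetical ranking of 5 pos / 10 neg docs
ranking = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0])

benefit = np.cumsum(ranking)                 # positives seen after top-k docs
ideal = np.cumsum(np.sort(ranking)[::-1])    # all 5 positives ranked first

for k in (1, 5, 10, 15):
    print(f"top {k:2d}: {benefit[k-1]} of {ideal[k-1]} possible positives")
```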
25. Creating Artificial Examples
- Main problems for the system:
  - Very small training set
  - No negative documents, yet 2 classes are necessary for training
- Normalization → feature vectors ∈ [0,1]^d
- Create negative docs where no positive ones are expected
26. Creating Artificial Examples (cont.)
- Artificial document zero (AD0)
  - All features zero
  - A document that is completely off-topic
- Further artificial documents (a sketch follows after the references):
  - Value 0 for attributes that occur in positive docs
  - Higher values for the other attributes

Manabu Sassano (2003). Virtual Examples for Text Classification with Support Vector Machines. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing.
Partha Niyogi, Federico Girosi, Tomaso Poggio (1998). Incorporating Prior Information in Machine Learning by Creating Virtual Examples. Proceedings of the IEEE, vol. 86, pp. 2196-2207.
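A minimal sketch of the idea above, assuming normalized feature vectors in [0,1]^d; the value 0.8 used for the "higher" attributes and the number of extra documents are illustrative choices, not from the slides.

```python
import numpy as np

def artificial_negatives(X_pos, n_extra=3, high=0.8, seed=0):
    """X_pos: positive documents as rows of a matrix in [0, 1]^d."""
    rng = np.random.default_rng(seed)
    d = X_pos.shape[1]
    in_pos = X_pos.max(axis=0) > 0        # attributes occurring in pos. docs
    ad0 = np.zeros((1, d))                # artificial document zero (AD0)
    extras = rng.uniform(0.0, high, size=(n_extra, d))
    extras[:, in_pos] = 0.0               # zero where positive docs have mass
    return np.vstack([ad0, extras])
```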
27. Typical Observations
- Significant improvement of the initial ranking
  - Especially for the topmost and bottommost positions of the ranking
- Later, feedback yields enough real negative examples
28. Different Test Cases
- Scenario 1
  - Unspecific query (large domain)
  - Positive documents from different subdomains
  - Very low percentage of positive results
- Scenario 2
  - Very specific query (small domain)
  - Returned documents closely related
  - Search for a very specific subdomain
- Scenario 3
  - Example documents from a related domain
  - Tests generalization capabilities
29. Evaluation: Scenario 1
- Tourist searching for information (Australia)
- First 100 Google results
- Example pages about accommodation, travel, tourists
- Low percentage (4%) of positive results, all from different subdomains
30. Scenario 1 (cont.)
- Works with a small number of positive results
- SVM outperforms the others
31. Evaluation: Scenario 2
- How to play a movie on a TV connected to a laptop (query: "laptop tv video overlay")
- First 100 Google results
- Low percentage (7%) of positive results
- Very specialized, difficult for humans
- Target docs are mainly FAQs and forum pages
32. Scenario 2 (cont.)
- Performance well above average, even though the documents are closely related
- Half of the positive docs at the top of the list using SVM
33. Evaluation: Scenario 3
- Search for a special recipe (apple pie)
- First 500 Google results
- Some other recipes available (fish, bread, berry pie)
- High percentage (40%) of positive results
- Negative documents: e.g. movies, books, ...
34. Scenario 3 (cont.)
- The benefit axis is most important
- Similar results with only the fish example doc → good generalization capabilities
- Leads to a domain-specific search engine about recipes
35. Conclusions
- Similar observations for all scenarios
- SVM outperforms all other ML methods
- Generation of artificial negative examples is reasonable (more work to be done)
- The system is able to cope with a very small training set