Title: COMP5318/4044, Lecture 10: Knowledge Discovery and Data Mining
1. COMP5318/4044, Lecture 10: Knowledge Discovery and Data Mining
- Getting Started
- What if I only have a small training set?
- This set of slides is adapted from two different pieces of work with Irena Koprinska, Jason Chan, Martin Buchholz and Dirk Pflüger
2. Motivation
- Machine Learning usually depends on a large training set, but obtaining one is not always feasible in the real world:
  - Rare events
  - Expensive to obtain
  - Relevant domain knowledge may be lacking
- How can we improve classification performance even when we have only a small set of training examples?
  - Make use of the unlabelled set
  - Create artificial examples
Co-Training with a Random Split of a Single Natural Feature Set

3. Available Methods
- Goal: make more use of unlabelled data while using very few labelled instances to help classifiers learn
- Co-Testing
  - An active learning algorithm that exploits multiple views. It is based on the idea of learning from mistakes: it queries examples on which the views predict different labels.
- Expectation Maximization (EM)
  - EM is a statistical algorithm, here used with the finite Gaussian mixtures model. Like the K-means procedure, a set of parameters is re-computed until a desired convergence value is achieved. The finite mixtures model assumes all attributes to be independent random variables. This algorithm is part of the Weka clustering package. (A minimal sketch follows after this list.)
- Self-Training
- Co-Training
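To make the EM description concrete, here is a minimal sketch of EM over a finite Gaussian mixture. The slide refers to Weka's implementation; this uses scikit-learn's GaussianMixture as an equivalent illustration, with synthetic data and parameter choices that are mine, not the slides'.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic clusters; in the lecture's setting these would be instances
# described by (assumed independent) attributes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

# EM re-estimates the mixture parameters until the convergence tolerance
# (`tol`) is reached, much like the K-means re-estimation loop.
gmm = GaussianMixture(n_components=2, tol=1e-3, random_state=0).fit(X)
clusters = gmm.predict(X)
```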
4. The Co-Training Algorithm
- Introduced by Avrim Blum and Tom Mitchell in 1998
- Applied to web-page classification
- Problem: identify the home pages of courses
- How:
  1. Build two classifiers with separate (disjoint) feature sets, trained on the same small labelled set:
     - words in the body of the web page
     - words in hyperlinks of other documents referring to that particular page
  2. The classifiers label instances in the unlabelled set
  3. Each classifier selects its most confidently predicted examples and adds them to the labelled set
  4. Repeat from 1
- Pseudocode (a Python sketch follows after the reference below):
  - Obtain small set L of labelled examples
  - Obtain large set U of unlabelled examples
  - Obtain 2 sets F1 and F2 of features describing the dataset
  - while U ≠ ∅ do
    - train classifier C1 from L based on F1
    - train classifier C2 from L based on F2
    - for each classifier Ci do
      - Ci labels examples in U based on Fi
      - Ci chooses its most confidently predicted examples E from U
      - E is removed from U and added (with their given labels) to L
    - end for
  - end while
A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT'98), 1998.
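Here is a minimal Python sketch of the loop above, assuming scikit-learn. The choice of Naive Bayes, the pool size k, and the confidence criterion (top-k most confident overall, a simplification of choosing p positives and n negatives) are illustrative; this is a sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, labelled, rounds=10, k=6):
    """Co-training sketch. X1, X2: the two feature views (e.g. term counts);
    y: labels with -1 marking unlabelled instances; labelled: indices of the
    initial labelled set (must contain both classes)."""
    y = y.copy()
    L = set(labelled)
    U = set(range(len(y))) - L
    for _ in range(rounds):
        for Xi in (X1, X2):
            if not U:
                return y
            idx_L = sorted(L)
            clf = MultinomialNB().fit(Xi[idx_L], y[idx_L])
            pool = sorted(U)
            proba = clf.predict_proba(Xi[pool])
            # move the most confidently predicted examples from U to L
            for j in np.argsort(proba.max(axis=1))[::-1][:k]:
                i = pool[j]
                y[i] = clf.classes_[proba[j].argmax()]
                L.add(i)
                U.discard(i)
    return y
```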
5. Assumptions of Co-Training: Assumptions of the Feature Sets
- Conditional independence
  - Knowing the values of F1 does not allow you to predict the values of F2
- Redundant sufficiency
  - Using only one of F1 or F2 separately still gives good classification accuracy
6. Experiment 1 (Web): Setup (1)
- Domain: web-browsing agent
- 4 users
- 4 topics:
  - nuclear fusion
  - circulatory system
  - food pyramid
  - greenhouse effect / ozone layer
- 80 pages per topic
7. Experiment 1: Setup (2)
- Each document is represented using bag-of-words
- Information Gain was used for feature selection (a pipeline sketch follows after this list)
  - The top 100 words were selected
  - A reduction of about 98%
- Term frequency was used in the feature vector
- Four types of classifiers were tested:
  - Decision Tree (DT)
  - Random Forest (RF)
  - Naïve Bayes (NB)
  - Support Vector Machine (SVM)
- Classification performance was measured using 10-fold cross-validation
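A minimal sketch of the feature pipeline just described, assuming scikit-learn: bag-of-words term frequencies with the top 100 words selected by an information-gain-style criterion (mutual information stands in for Information Gain here). The documents and labels are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

docs = ["nuclear fusion reactor plasma", "food pyramid diet servings"]  # placeholder pages
labels = [0, 1]                                                          # placeholder topics

vectorizer = CountVectorizer()                 # bag-of-words term frequencies
X = vectorizer.fit_transform(docs)

# Mutual information plays the role of the slide's Information Gain criterion.
k = min(100, X.shape[1])                       # top 100 words (~98% reduction)
X_top = SelectKBest(mutual_info_classif, k=k).fit_transform(X, labels)
```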
8. Feature Sets
- Natural feature sets:
  - Titles/Headings/Hyperlinks: all words that appear in titles, headings or hyperlinks
  - Body: all words that appear in the web page, not counting occurrences in the titles, headings, or hyperlinks
- Random-selection feature sets (a splitting sketch follows after this list):
  - Half1: a random selection of half of the feature set Body
  - Half2: the other half, i.e. the words not in Half1
  - Fifth1: a random selection of a fifth of the feature set Body
  - Fifth2: a random selection of another fifth of the feature set Body
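A small sketch of how the random splits above can be produced; the Body vocabulary is represented by placeholder indices.

```python
import numpy as np

rng = np.random.default_rng(0)
body = np.arange(5000)                         # placeholder indices for Body words
perm = rng.permutation(body)

half1, half2 = perm[:2500], perm[2500:]        # Half1 and its disjoint complement Half2
fifth1, fifth2 = perm[:1000], perm[1000:2000]  # two disjoint fifths of Body
```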
9. WebSL: F-measures in Supervised Learning
- Plain supervised learning, no co-training
- 10-fold cross-validation, with 90% of the data as the training set
- Results were averaged over all users and topics
- The numbers in the table are macro-averaged F-measures (defined below)
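For reference, the macro-averaged F-measure combines per-class precision $P_c$ and recall $R_c$, then averages over the $C$ classes:

$$F_1^{(c)} = \frac{2\,P_c R_c}{P_c + R_c}, \qquad F_{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C} F_1^{(c)}$$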
10. WebCT: The Comparison
- Co-training with natural features:
  - words in the main body of the page
  - words in titles, headings, and hyperlinks
- Co-training with a random split:
  - using only the words in the body
- Is co-training with a random split still beneficial?
11. WebCT: F-measures in Co-Training
Table 1. Maximum increase in classification performance of the combined classifier using co-training (table not reproduced here).
Initial size of labelled examples: 8 instances
12. Experiment 2 (Spam): Setup
- Domain: spam detection
- LingSpam: emails sent to the Linguist mailing list
  - emails: 2893
  - legitimate emails: 2412 (83.4%)
  - spam: 481 (16.6%)
13. Experiment 2: Results (1)
- SpamSL
  - Plain supervised learning, no co-training
  - 10-fold cross-validation, with 90% of the data as the training set (2595 instances in training)
  - Results were averaged over all users and topics
  - The numbers in the table are macro-averaged F-measures
- SpamCT
  - Initial size of labelled examples: 8 instances
14. SpamCT: Results (2)
NB results (chart not reproduced here). Initial labelled spam: 1; initial labelled non-spam: 1; p = 1, n = 5.
15. Why is the random split so good?
- One of the natural feature sets is considerably weaker than the other
  - The classifier using Title/Subject incorrectly labels many instances compared with classifiers built using the random-selection feature sets, hence transferring many incorrectly labelled instances into the labelled set.
- As found in the supervised learning experiments, classifiers using a random selection of half of all the features perform only slightly worse than a classifier using all the available attributes.
  - As a result, when performing co-training, both classifiers, each using its half of the features, are able to improve the training set by labelling unlabelled instances with sufficiently high classification performance.
16. Conclusion
- In our experiments:
  - Comparison between co-training with random splitting and co-training using two natural feature sets
  - Co-training with a random split of a single natural feature set can be just as competitive as co-training with two natural feature sets
- Reasons cited for the observed experimental results:
  - Random splitting is favoured when the data contains many weak attributes and/or the two natural feature sets differ significantly in strength
  - These conditions are very common (e.g. text categorization), which indicates that co-training with random splitting has great practical potential
17. Another Approach to Tackling a Small Training Set: Artificial Examples
- Domain problem:
  - Vast amount of information available
  - Use of search engines
    - Thousands of results
    - Often many irrelevant pages
18. User's Perspective
- Ranking
  - Ranking is optimised for the whole web community, not for a user's individual needs
- Query formulation
  - Difficult to make one's intention explicit
  - NEC study: 50% one-word queries
- → Use example pages to specify the user's intention more precisely
19. Common Approaches
- Restrict to a limited domain, construct domain-specific search engines
  - e.g. create ontologies/hierarchies [1]
- Query refinement (add keywords) [2]
- Clustering instead of ranking [3]

[1] Dwi H. Widyantoro, John Yen: A Fuzzy Ontology-based Abstract Search Engine and Its User Studies. Proc. of the 10th IEEE International Conference on Fuzzy Systems, vol. 2, pp. 1291-1294, 2001.
[2] Satoshi Oyama, Takashi Kokubo, Toru Ishida: Domain-Specific Web Search with Keyword Spices. IEEE Transactions on Knowledge and Data Engineering, 2003.
[3] Michael Chau, Daniel Zeng, Hsinchun Chen: Personalized Spiders for Web Search and Analysis. Proc. of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'01), Roanoke, Virginia, June 24-28, 2001, pp. 79-87.
20. Our Approach
- Use of a general-purpose search engine
  - No restriction to a special domain
- The user provides:
  - a general query
  - example pages (implicit knowledge)
- Re-ranking based on the confidence of the ML classifier (see the sketch below)
- Keep a representation of the results
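A hedged sketch of the confidence-based re-ranking step, assuming a fitted scikit-learn-style classifier; `clf` and `X_results` are placeholders for the trained model and the feature matrix of the returned pages.

```python
import numpy as np

def rerank(clf, X_results):
    """Return result indices sorted by the classifier's confidence that
    the page is relevant (positive class), highest first."""
    confidence = clf.predict_proba(X_results)[:, 1]
    return np.argsort(confidence)[::-1]
```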
21. System Design (diagram not reproduced here)
22. Feature Selection
- Feature reduction: nouns
- A threshold on the number of features supplied
- Use features from positive and unclassified documents
- Use of a modified tf-idf (one possible formalisation below)
  - tf additionally weighted with the inverse document length
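The slide does not give the exact formula; one plausible reading of "tf additionally weighted with inverse document length" is

$$w(t,d) = \frac{\mathrm{tf}(t,d)}{|d|}\cdot\log\frac{N}{\mathrm{df}(t)}$$

where $\mathrm{tf}(t,d)$ is the frequency of term $t$ in document $d$, $|d|$ the length of $d$ (the inverse-document-length weighting), $N$ the number of documents, and $\mathrm{df}(t)$ the number of documents containing $t$.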
23. Evaluation Method
- Precision and recall are not feasible
  - No classification, only a ranking
  - The user looks only at the first few documents
- → cost-benefit analysis
24. Evaluation Method (cont.)
Assume we have a set of 15 documents consisting of 5 positive and 10 negative documents.
- User goal: as many good documents as soon as possible (a sketch of the resulting benefit curve follows below)
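The slides do not define the exact cost-benefit metric, so the following is one plausible reading: for a hypothetical ranking of the 15 documents, count how many positives the user has seen after inspecting the top k results and compare against the ideal ranking.

```python
import numpy as np

# 1 = positive, 0 = negative; a hypothetical ranking of 5 pos / 10 neg docs
ranking = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0])

benefit = np.cumsum(ranking)                 # positives seen after top-k docs
ideal = np.cumsum(np.sort(ranking)[::-1])    # all 5 positives ranked first

for k in (1, 5, 10, 15):
    print(f"top {k:2d}: {benefit[k-1]} of {ideal[k-1]} possible positives")
```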
25. Creating Artificial Examples
- Main problems for the system:
  - Very small training set
  - No negative documents, yet 2 classes are necessary for training
- Normalization → feature vectors ∈ [0,1]^d
- Create negative docs where no positive ones are expected
26. Creating Artificial Examples (cont.)
- Artificial document zero (AD0)
  - All features zero
  - A document that is completely off-topic
- Further artificial documents (a sketch follows after the references):
  - Value 0 for attributes that occur in positive docs
  - Higher values for the other attributes

Manabu Sassano (2003). Virtual Examples for Text Classification with Support Vector Machines. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing.
Partha Niyogi, Federico Girosi, Tomaso Poggio (1998). Incorporating Prior Information in Machine Learning by Creating Virtual Examples. Proceedings of the IEEE, vol. 86, pp. 2196-2207.
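A minimal sketch of the idea above, assuming normalized feature vectors in [0,1]^d; the value 0.8 used for the "higher" attributes and the number of extra documents are illustrative choices, not from the slides.

```python
import numpy as np

def artificial_negatives(X_pos, n_extra=3, high=0.8, seed=0):
    """X_pos: positive documents as rows of a matrix in [0, 1]^d."""
    rng = np.random.default_rng(seed)
    d = X_pos.shape[1]
    in_pos = X_pos.max(axis=0) > 0        # attributes occurring in pos. docs
    ad0 = np.zeros((1, d))                # artificial document zero (AD0)
    extras = rng.uniform(0.0, high, size=(n_extra, d))
    extras[:, in_pos] = 0.0               # zero where positive docs have mass
    return np.vstack([ad0, extras])
```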
27. Typical Observations
- Significant improvement of the initial ranking
  - Especially for the topmost and bottommost positions of the ranking
- Later, feedback yields enough real negative examples
28. Different Test Cases
- Scenario 1
  - Unspecific query (large domain)
  - Positive documents from different subdomains
  - Very low percentage of positive results
- Scenario 2
  - Very specific query (small domain)
  - Returned documents closely related
  - Search for a very specific subdomain
- Scenario 3
  - Example documents from a related domain
  - Tests generalization capabilities
29. Evaluation: Scenario 1
- Tourist searching for information (Australia)
- First 100 Google results
- Example pages about accommodation, travel, tourists
- Low percentage (4%) of positive results, all from different subdomains
30. Scenario 1 (cont.)
- Works with a small number of positive results
- SVM outperforms the others
31. Evaluation: Scenario 2
- How to play a movie on a TV connected to a laptop (query: "laptop tv video overlay")
- First 100 Google results
- Low percentage (7%) of positive results
- Very specialized, difficult for humans
- Target docs are mainly FAQs and forum pages
32. Scenario 2 (cont.)
- Performance well above average, even though the documents are closely related
- Half of the positive docs at the top of the list using SVM
33. Evaluation: Scenario 3
- Search for a special recipe (apple pie)
- First 500 Google results
- Some other recipes available (fish, bread, berry pie)
- High percentage (40%) of positive results
- Negative documents: e.g. movies, books, ...
34. Scenario 3 (cont.)
- The benefit axis is most important
- Similar results with only the fish example doc → good generalization capabilities
- Leads to a domain-specific search engine about recipes
35. Conclusions
- Similar observations for all scenarios
- SVM outperforms all other ML methods
- Generation of artificial negative examples is reasonable (more work to be done)
- The system is able to cope with a very small training set