Title: WEKA and Machine Learning Algorithms
1. WEKA and Machine Learning Algorithms
2. Algorithm Types
- Classification (supervised)
- Given -> a set of classified examples (instances)
- Produce -> a way of classifying new examples
- Instances described by a fixed set of features
  (attributes)
- Classes discrete (classification) or continuous
  (regression)
- Interested in
- Results? (classifying new instances)
- Model? (how the decision is made)
- Clustering (unsupervised)
- There are no classes
- Association rules
- Look for rules that relate features to other
features
3. Classification
4. Clustering
5. Clustering
- It is expected that similarity among members of a
  cluster should be high and similarity among
  objects of different clusters should be low.
- The objectives of clustering
- knowing which data objects belong to which
  cluster
- understanding common characteristics of the
  members of a specific cluster
6. Clustering vs Classification
- There is some similarity between clustering and
  classification.
- Both classification and clustering are about
  assigning appropriate class or cluster labels to
  data records. However, clustering differs from
  classification in two aspects.
- First, in clustering there are no pre-defined
  classes. This means that the number of clusters
  and the cluster label of each data record are not
  known before the operation.
- Second, clustering is about grouping data rather
  than developing a classification model.
  Therefore, there is no distinction between data
  records and examples; the entire data population
  is used as input to the clustering process.
7. Association Mining
8. Overfitting
- Memorization vs generalization
- To fix, use
- Training data to form rules
- Validation data to decide on best rule
- Test data to determine system performance
- Cross-validation
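The fold assignment behind cross-validation can be sketched in a few lines. This is illustrative Java, not WEKA's implementation (WEKA's `Instances` class provides `trainCV`/`testCV` for real use); `FoldDemo` and the round-robin assignment are assumptions for the sketch.

```java
import java.util.*;

// Sketch: assign n instances to k cross-validation folds round-robin.
// Each fold serves once as test data; the remaining folds form the training data.
public class FoldDemo {
    static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(i); // instance i -> fold i mod k
        return folds;
    }

    public static void main(String[] args) {
        // 10 instances, 3 folds: fold 0 gets indices 0,3,6,9, etc.
        System.out.println(folds(10, 3));
    }
}
```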
9. Baseline Experiments
- In order to evaluate the performance of the
  classifiers used in experiments, we use baselines
- Majority-based random classification (Kappa = 0)
- Class-distribution-based random classification
  (Kappa = 0)
- The Kappa statistic is used as a measure to
  assess the improvement of a classifier's accuracy
  over a predictor employing chance as its guide.
- Kappa = (P0 - Pc) / (1 - Pc), where P0 is the
  accuracy of the classifier and Pc is the expected
  accuracy that can be achieved by a randomly
  guessing classifier on the same data set.
- The Kappa statistic ranges between -1 and 1,
  where -1 is total disagreement (i.e., total
  misclassification) and 1 is perfect agreement
  (i.e., a 100% accurate classification).
- A Kappa score over 0.4 indicates a reasonable
  agreement beyond chance.
10. Data Mining Process
11. WEKA the software
- Machine learning/data mining software written in
  Java (distributed under the GNU General Public
  License)
- Used for research, education, and applications
- Complements the book Data Mining by Witten & Frank
- Main features
- Comprehensive set of data pre-processing tools,
  learning algorithms, and evaluation methods
- Graphical user interfaces (incl. data
  visualization)
- Environment for comparing learning algorithms
12. WEKA's Role in the Big Picture
13. WEKA Terminology
- Some synonyms/explanations for the terms used by
  WEKA
- Attribute = feature
- Relation = collection of examples
- Instance = example (one record in the collection
  in use)
- Class = category
14. WEKA only deals with flat files
@relation heart-disease-simplified
@attribute age numeric
@attribute sex {female, male}
@attribute chest_pain_type {typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina {no, yes}
@attribute class {present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
(age and cholesterol are numeric attributes; the others are nominal)
16. Explorer: pre-processing the data
- Data can be imported from a file in various
  formats: ARFF, CSV, C4.5, binary
- Data can also be read from a URL or from an SQL
  database (using JDBC)
- Pre-processing tools in WEKA are called filters
- WEKA contains filters for
- discretization, normalization, resampling,
  attribute selection, transforming and combining
  attributes, ...
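To show what a normalization filter does, here is a min-max rescaling of one numeric attribute to [0, 1]. WEKA's own filter for this is weka.filters.unsupervised.attribute.Normalize; the `NormalizeDemo` class below is only an illustrative sketch, not WEKA code.

```java
// Sketch: min-max normalization of a numeric attribute to the range [0, 1].
public class NormalizeDemo {
    static double[] minMax(double[] values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++)
            // A constant attribute maps to 0 to avoid division by zero.
            out[i] = (max == min) ? 0 : (values[i] - min) / (max - min);
        return out;
    }

    public static void main(String[] args) {
        // e.g. the age values from the ARFF example above
        System.out.println(java.util.Arrays.toString(minMax(new double[]{63, 67, 67, 38})));
    }
}
```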
17. Explorer: building classifiers
- Classifiers in WEKA are models for predicting
  nominal or numeric quantities
- Implemented learning schemes include
- decision trees and lists, instance-based
  classifiers, support vector machines, multi-layer
  perceptrons, logistic regression, Bayes nets, ...
- Meta-classifiers include
- bagging, boosting, stacking, error-correcting
  output codes, locally weighted learning, ...
18. Classifiers - Workflow
- learning algorithm -> classifier -> predictions
19. Evaluation
- Accuracy
- percentage of predictions that are correct
- problematic for some disproportional (imbalanced)
  data sets
- Precision
- percentage of positive predictions that are
  correct
- Recall (Sensitivity)
- percentage of positively labeled samples
  predicted as positive
- Specificity
- percentage of negatively labeled samples
  predicted as negative
20. Confusion matrix
- Contains information about the actual and the
  predicted classification
- All measures can be derived from it
- accuracy = (a + d) / (a + b + c + d)
- recall = d / (c + d) = R
- precision = d / (b + d) = P
- F-measure = 2PR / (P + R)
- false positive (FP) rate = b / (a + b)
- true negative (TN) rate = a / (a + b)
- false negative (FN) rate = c / (c + d)

              predicted -   predicted +
  true -      a             b
  true +      c             d
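The measures above are plain arithmetic over the four cells. This is an illustrative sketch (WEKA's Evaluation class computes these for you); `ConfusionDemo` and the example counts are assumptions.

```java
// Sketch: deriving the listed measures from cells a (TN), b (FP), c (FN), d (TP).
public class ConfusionDemo {
    static double accuracy(double a, double b, double c, double d) { return (a + d) / (a + b + c + d); }
    static double recall(double c, double d)    { return d / (c + d); }   // = R
    static double precision(double b, double d) { return d / (b + d); }   // = P
    static double fMeasure(double p, double r)  { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        double a = 50, b = 10, c = 5, d = 35;   // illustrative counts
        double p = precision(b, d), r = recall(c, d);
        System.out.printf("acc=%.3f P=%.3f R=%.3f F=%.3f%n",
                accuracy(a, b, c, d), p, r, fMeasure(p, r));
    }
}
```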
21. Explorer: clustering data
- WEKA contains clusterers for finding groups of
  similar instances in a dataset
- Implemented schemes are
- k-Means, EM, Cobweb, X-means, FarthestFirst
- Clusters can be visualized and compared to true
  clusters (if given)
- Evaluation is based on log-likelihood if the
  clustering scheme produces a probability
  distribution
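The k-Means scheme listed above can be sketched in one dimension to show the assign/update loop. This is a toy illustration, not WEKA's implementation (WEKA provides weka.clusterers.SimpleKMeans for real data); `KMeansDemo` and the sample points are assumptions.

```java
import java.util.*;

// Sketch: 1-D k-means. Repeat: assign each point to its nearest centroid,
// then move each centroid to the mean of its assigned points.
public class KMeansDemo {
    static double[] kMeans(double[] xs, double[] centroids, int iters) {
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double x : xs) {
                int best = 0;                        // nearest centroid for x
                for (int j = 1; j < centroids.length; j++)
                    if (Math.abs(x - centroids[j]) < Math.abs(x - centroids[best])) best = j;
                sum[best] += x;
                count[best]++;
            }
            for (int j = 0; j < centroids.length; j++)  // update step
                if (count[j] > 0) centroids[j] = sum[j] / count[j];
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Two well-separated groups; centroids converge to their means.
        double[] c = kMeans(new double[]{1, 2, 3, 10, 11, 12}, new double[]{0, 5}, 10);
        System.out.println(Arrays.toString(c));
    }
}
```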
22. Explorer: finding associations
- WEKA contains an implementation of the Apriori
  algorithm for learning association rules
- Works only with discrete data
- Can identify statistical dependencies between
  groups of attributes
- milk, butter => bread, eggs (with confidence 0.9
  and support 2000)
- Apriori can compute all rules that have a given
  minimum support and exceed a given confidence
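Support and confidence, the two counts Apriori filters on, can be computed directly over a transaction list. This is an illustrative sketch with made-up transactions, not WEKA's Apriori code; `RuleDemo` and the item names are assumptions.

```java
import java.util.*;

// Sketch: support = number of transactions containing an itemset;
// confidence(X => Y) = support(X ∪ Y) / support(X).
public class RuleDemo {
    static double support(List<Set<String>> tx, Set<String> items) {
        return tx.stream().filter(t -> t.containsAll(items)).count();
    }

    static double confidence(List<Set<String>> tx, Set<String> lhs, Set<String> both) {
        return support(tx, both) / support(tx, lhs);
    }

    public static void main(String[] args) {
        List<Set<String>> tx = List.of(
            Set.of("milk", "butter", "bread"),
            Set.of("milk", "butter", "bread", "eggs"),
            Set.of("milk", "eggs"),
            Set.of("butter", "bread"));
        // Confidence of the rule {milk, butter} => {bread}
        System.out.println(confidence(tx,
            Set.of("milk", "butter"), Set.of("milk", "butter", "bread")));
    }
}
```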
23. Explorer: attribute selection
- Panel that can be used to investigate which
  (subsets of) attributes are the most predictive
  ones
- Attribute selection methods contain two parts
- a search method: best-first, forward selection,
  random, exhaustive, genetic algorithm, ranking
- an evaluation method: correlation-based, wrapper,
  information gain, chi-squared, ...
- Very flexible: WEKA allows (almost) arbitrary
  combinations of these two
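Information gain, one of the evaluation methods listed above, can be spelled out for a binary attribute and a binary class. This is a sketch of the measure itself (WEKA's evaluator is InfoGainAttributeEval); `InfoGainDemo` and the counts are illustrative assumptions.

```java
// Sketch: information gain = class entropy minus the weighted entropy
// remaining after splitting on the attribute.
public class InfoGainDemo {
    static double entropy(double p, double n) {   // entropy of a pos/neg class split
        double t = p + n, h = 0;
        if (p > 0) h -= (p / t) * (Math.log(p / t) / Math.log(2));
        if (n > 0) h -= (n / t) * (Math.log(n / t) / Math.log(2));
        return h;
    }

    // (p0, n0) and (p1, n1) are the class counts in the attribute's two branches.
    static double infoGain(double p0, double n0, double p1, double n1) {
        double t = p0 + n0 + p1 + n1;
        double remainder = (p0 + n0) / t * entropy(p0, n0)
                         + (p1 + n1) / t * entropy(p1, n1);
        return entropy(p0 + p1, n0 + n1) - remainder;
    }

    public static void main(String[] args) {
        // A perfectly separating attribute gains the full class entropy (1 bit here).
        System.out.println(infoGain(5, 0, 0, 5));
    }
}
```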
24. Explorer: data visualization
- Visualization is very useful in practice, e.g. it
  helps to determine the difficulty of the learning
  problem
- WEKA can visualize single attributes (1-d) and
  pairs of attributes (2-d)
- To do: rotating 3-d visualizations (Xgobi-style)
- Color-coded class values
- "Jitter" option to deal with nominal attributes
  (and to detect hidden data points)
- Zoom-in function
25. Performing experiments
- The Experimenter makes it easy to compare the
  performance of different learning schemes
- For classification and regression problems
- Results can be written to a file or database
- Evaluation options: cross-validation, learning
  curve, hold-out
- Can also iterate over different parameter
  settings
- Significance testing built in!
26. The Knowledge Flow GUI
- New graphical user interface for WEKA
- JavaBeans-based interface for setting up and
  running machine learning experiments
- Data sources, classifiers, etc. are beans and can
  be connected graphically
- Data flows through the components, e.g.,
- data source -> filter -> classifier -> evaluator
- Layouts can be saved and loaded again later
27. Beyond the GUI
- How to reproduce experiments with the
  command-line/API
- GUI, API, and command-line all rely on the same
  set of Java classes
- It is generally easy to determine which classes
  and parameters were used in the GUI
- Tree displays in WEKA reflect its Java class
  hierarchy

> java -cp galley/weka/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 2 -t <train_arff> -T <test_arff>
28. Important command-line parameters
- where options are
- Create/load/save a classification model
- -t <file>: training set
- -l <file>: load model file
- -d <file>: save model file
- Testing
- -x <N>: N-fold cross-validation
- -T <file>: test set
- -p <S>: print predictions (with attribute
  selection S)

> java -cp galley/weka/weka.jar weka.classifiers.<classifier_name> <classifier_options> <options>
29. Problem with Running WEKA
- Problem: out of memory for a large data set
- Solution: increase the JVM heap size, e.g.
  java -Xmx1000m -jar weka.jar