Title: WEKA and Machine Learning Algorithms
1. WEKA and Machine Learning Algorithms
2. Algorithm Types
- Classification (supervised)
- Given -> a set of classified examples (instances)
- Produce -> a way of classifying new examples
- Instances described by a fixed set of features
  (attributes)
- Classes discrete (classification) or continuous
  (regression)
- Interested in
- Results? (classifying new instances)
- Model? (how the decision is made)
- Clustering (unsupervised)
- There are no classes
- Association rules
- Look for rules that relate features to other
features
3. Classification
4. Clustering
5. Clustering
- It is expected that similarity among members of a
  cluster should be high and similarity among
  objects of different clusters should be low.
- The objectives of clustering
- knowing which data objects belong to which
  cluster
- understanding common characteristics of the
  members of a specific cluster
6. Clustering vs Classification
- There is some similarity between clustering and
  classification.
- Both classification and clustering are about
  assigning appropriate class or cluster labels to
  data records. However, clustering differs from
  classification in two aspects.
- First, in clustering there are no pre-defined
  classes. This means that the number of clusters
  and the cluster label of each data record are not
  known before the operation.
- Second, clustering is about grouping data rather
  than developing a classification model.
  Therefore, there is no distinction between data
  records and examples; the entire data population
  is used as input to the clustering process.
7. Association Mining
8. Overfitting
- Memorization vs generalization
- To fix, use
- Training data to form rules
- Validation data to decide on best rule
- Test data to determine system performance
- Cross-validation
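The fold assignment behind cross-validation can be sketched in a few lines. This is illustrative Java, not WEKA's implementation (WEKA's `Instances` class provides `trainCV`/`testCV` for real use); `FoldDemo` and the round-robin assignment are assumptions for the sketch.

```java
import java.util.*;

// Sketch: assign n instances to k cross-validation folds round-robin.
// Each fold serves once as test data; the remaining folds form the training data.
public class FoldDemo {
    static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(i); // instance i -> fold i mod k
        return folds;
    }

    public static void main(String[] args) {
        // 10 instances, 3 folds: fold 0 gets indices 0,3,6,9, etc.
        System.out.println(folds(10, 3));
    }
}
```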
9. Baseline Experiments
- In order to evaluate the performance of the
  classifiers used in experiments, we use baselines
- Majority-based random classification (Kappa = 0)
- Class-distribution-based random classification
  (Kappa = 0)
- The Kappa statistic is used as a measure to
  assess the improvement of a classifier's accuracy
  over a predictor employing chance as its guide.
- Kappa = (P0 - Pc) / (1 - Pc), where P0 is the
  accuracy of the classifier and Pc is the expected
  accuracy that can be achieved by a randomly
  guessing classifier on the same data set.
- The Kappa statistic ranges between -1 and 1,
  where -1 is total disagreement (i.e., total
  misclassification) and 1 is perfect agreement
  (i.e., a 100% accurate classification).
- A Kappa score over 0.4 indicates a reasonable
  agreement beyond chance.
10. Data Mining Process
11. WEKA the software
- Machine learning/data mining software written in
  Java (distributed under the GNU General Public
  License)
- Used for research, education, and applications
- Complements the book Data Mining by Witten & Frank
- Main features
- Comprehensive set of data pre-processing tools,
  learning algorithms, and evaluation methods
- Graphical user interfaces (incl. data
  visualization)
- Environment for comparing learning algorithms
12. WEKA's Role in the Big Picture
13. WEKA Terminology
- Some synonyms/explanations for the terms used by
  WEKA
- Attribute = feature
- Relation = collection of examples
- Instance = example (one record in the collection
  in use)
- Class = category
14. WEKA only deals with flat files
@relation heart-disease-simplified
@attribute age numeric
@attribute sex {female, male}
@attribute chest_pain_type {typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina {no, yes}
@attribute class {present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
(age and cholesterol are numeric attributes; the others are nominal)
16. Explorer: pre-processing the data
- Data can be imported from a file in various
  formats: ARFF, CSV, C4.5, binary
- Data can also be read from a URL or from an SQL
  database (using JDBC)
- Pre-processing tools in WEKA are called filters
- WEKA contains filters for
- discretization, normalization, resampling,
  attribute selection, transforming and combining
  attributes, ...
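To show what a normalization filter does, here is a min-max rescaling of one numeric attribute to [0, 1]. WEKA's own filter for this is weka.filters.unsupervised.attribute.Normalize; the `NormalizeDemo` class below is only an illustrative sketch, not WEKA code.

```java
// Sketch: min-max normalization of a numeric attribute to the range [0, 1].
public class NormalizeDemo {
    static double[] minMax(double[] values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++)
            // A constant attribute maps to 0 to avoid division by zero.
            out[i] = (max == min) ? 0 : (values[i] - min) / (max - min);
        return out;
    }

    public static void main(String[] args) {
        // e.g. the age values from the ARFF example above
        System.out.println(java.util.Arrays.toString(minMax(new double[]{63, 67, 67, 38})));
    }
}
```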
17. Explorer: building classifiers
- Classifiers in WEKA are models for predicting
  nominal or numeric quantities
- Implemented learning schemes include
- decision trees and lists, instance-based
  classifiers, support vector machines, multi-layer
  perceptrons, logistic regression, Bayes nets, ...
- Meta-classifiers include
- bagging, boosting, stacking, error-correcting
  output codes, locally weighted learning, ...
18. Classifiers - Workflow
- learning algorithm -> classifier -> predictions
19. Evaluation
- Accuracy
- percentage of predictions that are correct
- problematic for some disproportional (imbalanced)
  data sets
- Precision
- percentage of positive predictions that are
  correct
- Recall (Sensitivity)
- percentage of positively labeled samples
  predicted as positive
- Specificity
- percentage of negatively labeled samples
  predicted as negative
20. Confusion matrix
- Contains information about the actual and the
  predicted classification
- All measures can be derived from it
- accuracy = (a + d) / (a + b + c + d)
- recall = d / (c + d) = R
- precision = d / (b + d) = P
- F-measure = 2PR / (P + R)
- false positive (FP) rate = b / (a + b)
- true negative (TN) rate = a / (a + b)
- false negative (FN) rate = c / (c + d)

              predicted -   predicted +
  true -      a             b
  true +      c             d
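The measures above are plain arithmetic over the four cells. This is an illustrative sketch (WEKA's Evaluation class computes these for you); `ConfusionDemo` and the example counts are assumptions.

```java
// Sketch: deriving the listed measures from cells a (TN), b (FP), c (FN), d (TP).
public class ConfusionDemo {
    static double accuracy(double a, double b, double c, double d) { return (a + d) / (a + b + c + d); }
    static double recall(double c, double d)    { return d / (c + d); }   // = R
    static double precision(double b, double d) { return d / (b + d); }   // = P
    static double fMeasure(double p, double r)  { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        double a = 50, b = 10, c = 5, d = 35;   // illustrative counts
        double p = precision(b, d), r = recall(c, d);
        System.out.printf("acc=%.3f P=%.3f R=%.3f F=%.3f%n",
                accuracy(a, b, c, d), p, r, fMeasure(p, r));
    }
}
```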
21. Explorer: clustering data
- WEKA contains clusterers for finding groups of
  similar instances in a dataset
- Implemented schemes are
- k-Means, EM, Cobweb, X-means, FarthestFirst
- Clusters can be visualized and compared to true
  clusters (if given)
- Evaluation is based on log-likelihood if the
  clustering scheme produces a probability
  distribution
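The k-Means scheme listed above can be sketched in one dimension to show the assign/update loop. This is a toy illustration, not WEKA's implementation (WEKA provides weka.clusterers.SimpleKMeans for real data); `KMeansDemo` and the sample points are assumptions.

```java
import java.util.*;

// Sketch: 1-D k-means. Repeat: assign each point to its nearest centroid,
// then move each centroid to the mean of its assigned points.
public class KMeansDemo {
    static double[] kMeans(double[] xs, double[] centroids, int iters) {
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double x : xs) {
                int best = 0;                        // nearest centroid for x
                for (int j = 1; j < centroids.length; j++)
                    if (Math.abs(x - centroids[j]) < Math.abs(x - centroids[best])) best = j;
                sum[best] += x;
                count[best]++;
            }
            for (int j = 0; j < centroids.length; j++)  // update step
                if (count[j] > 0) centroids[j] = sum[j] / count[j];
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Two well-separated groups; centroids converge to their means.
        double[] c = kMeans(new double[]{1, 2, 3, 10, 11, 12}, new double[]{0, 5}, 10);
        System.out.println(Arrays.toString(c));
    }
}
```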
22. Explorer: finding associations
- WEKA contains an implementation of the Apriori
  algorithm for learning association rules
- Works only with discrete data
- Can identify statistical dependencies between
  groups of attributes
- milk, butter => bread, eggs (with confidence 0.9
  and support 2000)
- Apriori can compute all rules that have a given
  minimum support and exceed a given confidence
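Support and confidence, the two counts Apriori filters on, can be computed directly over a transaction list. This is an illustrative sketch with made-up transactions, not WEKA's Apriori code; `RuleDemo` and the item names are assumptions.

```java
import java.util.*;

// Sketch: support = number of transactions containing an itemset;
// confidence(X => Y) = support(X ∪ Y) / support(X).
public class RuleDemo {
    static double support(List<Set<String>> tx, Set<String> items) {
        return tx.stream().filter(t -> t.containsAll(items)).count();
    }

    static double confidence(List<Set<String>> tx, Set<String> lhs, Set<String> both) {
        return support(tx, both) / support(tx, lhs);
    }

    public static void main(String[] args) {
        List<Set<String>> tx = List.of(
            Set.of("milk", "butter", "bread"),
            Set.of("milk", "butter", "bread", "eggs"),
            Set.of("milk", "eggs"),
            Set.of("butter", "bread"));
        // Confidence of the rule {milk, butter} => {bread}
        System.out.println(confidence(tx,
            Set.of("milk", "butter"), Set.of("milk", "butter", "bread")));
    }
}
```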
23. Explorer: attribute selection
- Panel that can be used to investigate which
  (subsets of) attributes are the most predictive
  ones
- Attribute selection methods contain two parts
- a search method: best-first, forward selection,
  random, exhaustive, genetic algorithm, ranking
- an evaluation method: correlation-based, wrapper,
  information gain, chi-squared, ...
- Very flexible: WEKA allows (almost) arbitrary
  combinations of these two
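Information gain, one of the evaluation methods listed above, can be spelled out for a binary attribute and a binary class. This is a sketch of the measure itself (WEKA's evaluator is InfoGainAttributeEval); `InfoGainDemo` and the counts are illustrative assumptions.

```java
// Sketch: information gain = class entropy minus the weighted entropy
// remaining after splitting on the attribute.
public class InfoGainDemo {
    static double entropy(double p, double n) {   // entropy of a pos/neg class split
        double t = p + n, h = 0;
        if (p > 0) h -= (p / t) * (Math.log(p / t) / Math.log(2));
        if (n > 0) h -= (n / t) * (Math.log(n / t) / Math.log(2));
        return h;
    }

    // (p0, n0) and (p1, n1) are the class counts in the attribute's two branches.
    static double infoGain(double p0, double n0, double p1, double n1) {
        double t = p0 + n0 + p1 + n1;
        double remainder = (p0 + n0) / t * entropy(p0, n0)
                         + (p1 + n1) / t * entropy(p1, n1);
        return entropy(p0 + p1, n0 + n1) - remainder;
    }

    public static void main(String[] args) {
        // A perfectly separating attribute gains the full class entropy (1 bit here).
        System.out.println(infoGain(5, 0, 0, 5));
    }
}
```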
24. Explorer: data visualization
- Visualization is very useful in practice, e.g. it
  helps to determine the difficulty of the learning
  problem
- WEKA can visualize single attributes (1-d) and
  pairs of attributes (2-d)
- To do: rotating 3-d visualizations (Xgobi-style)
- Color-coded class values
- "Jitter" option to deal with nominal attributes
  (and to detect hidden data points)
- Zoom-in function
25. Performing experiments
- The Experimenter makes it easy to compare the
  performance of different learning schemes
- For classification and regression problems
- Results can be written to a file or database
- Evaluation options: cross-validation, learning
  curve, hold-out
- Can also iterate over different parameter
  settings
- Significance testing built in!
26. The Knowledge Flow GUI
- New graphical user interface for WEKA
- JavaBeans-based interface for setting up and
  running machine learning experiments
- Data sources, classifiers, etc. are beans and can
  be connected graphically
- Data flows through the components, e.g.,
- data source -> filter -> classifier -> evaluator
- Layouts can be saved and loaded again later
27. Beyond the GUI
- How to reproduce experiments with the
  command-line/API
- GUI, API, and command-line all rely on the same
  set of Java classes
- It is generally easy to determine which classes
  and parameters were used in the GUI
- Tree displays in WEKA reflect its Java class
  hierarchy

> java -cp galley/weka/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 2 -t <train_arff> -T <test_arff>
28. Important command-line parameters
- where options are
- Create/load/save a classification model
- -t <file>: training set
- -l <file>: load model file
- -d <file>: save model file
- Testing
- -x <N>: N-fold cross-validation
- -T <file>: test set
- -p <S>: print predictions (with attribute
  selection S)

> java -cp galley/weka/weka.jar weka.classifiers.<classifier_name> <classifier_options> <options>
29. Problem with Running WEKA
- Problem: out of memory for a large data set
- Solution: increase the JVM heap size, e.g.
  java -Xmx1000m -jar weka.jar