Title: What is data mining?
1What is data mining?
- Wlodzislaw Duch
- Dept. of Informatics, Nicholas Copernicus University, Torun, Poland - http://www.phys.uni.torun.pl/duch
ISEP Porto, 8-12 July 2002
2What is it about?
- Data used to be precious! Now it is overwhelming ...
- In many areas of science, business and commerce people are drowning in data.
- Ex. astronomy: data mining in existing databases can act as a super-telescope.
- Database technology allows us to store and retrieve large amounts of data of any kind.
- There is knowledge hidden in data.
- Data analysis requires intelligence.
3Ancient history
- 1960: first databases, collections of data.
- 1970: RDBMS; the relational data model, still the most popular today; large centralized systems.
- 1980: application-oriented data models, specialized for scientific, geographic and engineering data, time series, text; object-oriented models, distributed databases.
- 1990: multimedia and Web databases, data warehousing (subject-oriented DB for decision support), on-line analytical processing (OLAP), deduction and verification of hypothetical patterns.
- Data mining: first conference in 1989, first book in 1996; the goal is to discover something useful!
4Data Mining History
- 1989: IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994: Workshops on KDD
- 1996: Advances in Knowledge Discovery and Data Mining (Fayyad et al.)
- 1995-1998: International Conferences on Knowledge Discovery in Databases and Data Mining (KDD-95 to KDD-98)
- 1997: Journal of Data Mining and Knowledge Discovery
- 1998: ACM SIGKDD; SIGKDD 1999-2001 conferences and SIGKDD Explorations
- Many conferences on data mining: PAKDD, PKDD, SIAM Data Mining, (IEEE) ICDM, etc.
5References, papers
- KDD WWW Resources:
- http://www.kdd.org
- http://www.kdnuggets.com
- http://www.the-data-mine.com
- http://www.acm.org/sigkdd/
- ResearchIndex: http://citeseer.nj.nec.com/cs
- AI/ML aspects: http://www.phys.uni.torun.pl/kmk
- NN/Statistics: http://www.phys.uni.torun.pl/kmk
- Comparison of results on many datasets: http://www.phys.uni.torun.pl/kmk
6Data Mining and statistics
- Statisticians deal with data; what's new in DM?
- Many DM methods have roots in statistics.
- Statistics used to deal with small, controlled experiments, while DM deals with large, messy collections of data.
- Statistics is based on analytical probabilistic models; DM is based on algorithms that find patterns in data.
- Many DM algorithms came from other sources and only slowly acquire some statistical justification.
- A key factor for DM is computer cost/performance.
- Sometimes DM is more art than science.
7Types of Data
- Statistical data: clean, numerical, controlled experiments, vector space model.
- Relational data: marketing, finances.
- Textual data: Web, NLP, search.
- Complex structures: chemistry, economics.
- Sequence data: bioinformatics.
- Multimedia data: images, video.
- Signals: dynamic data, biosignals.
- AI data: logical problems, games, behavior.
8What is DM?
- Discovering interesting patterns, finding useful summaries of large databases.
- DM is more than database technology and On-Line Analytic Processing (OLAP) tools.
- DM is more than statistical analysis, although it includes classification, association, clustering, outlier and trend analysis, decision rules, prototype cases, multidimensional visualization, etc. Understanding of data has not been an explicit goal of statistics, which focuses on predictive data models.
9DM applications
- Many applications, but spectacular new knowledge is rarely discovered. Some examples:
- Diapers and beer correlation: place them close together and put potato chips in between.
- Mining astronomical catalogs (Skycat, Sloan Sky Survey): a new subtype of stars has been discovered!
- Bioinformatics: more precise characterization of some diseases; many discoveries still to be made?
- Credit card fraud detection (HNC company).
- Discounts on air/hotel for frequent travelers.
10Important issues in data mining.
- Use of statistical and CI methods for KDD.
- What makes a pattern interesting?
- Handling uncertainty in the data.
- Handling noise, outliers and missing or unknown data.
- Finding linguistic variables, discretization of continuous data, presentation and evaluation of knowledge.
- Knowledge representation for structural data, heterogeneous information, textual databases, NLP.
- Performance, scalability, distributed data, incremental or on-line processing.
- The best form of explanation depends on the application.
11DM dangers
- If there are too many conclusions to draw, some inferences will be true by chance because the data samples are too small (Bonferroni's theorem).
- Example 1: David Rhine (Duke Univ.) ESP tests. 1 person in 1000 correctly guessed the color (red or black) of 10 cards; is this evidence for ESP? Retesting of these people gave average results. Rhine's conclusion: telling people that they have ESP interferes with their ability. (A rough numerical check follows below.)
- Example 2: using m letters to form a random sequence of length N, all possible subsequences of length log_m N are found => the Bible code!
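A minimal check of the Example 1 arithmetic (a sketch, not from the original slides): guessing 10 red/black cards has probability (1/2)^10, roughly 1/1024, so among 1000 tested people one perfect run is expected by chance alone.

```python
# Chance of "ESP-like" perfect guessing among many tested people (illustrative sketch).
p_single = 0.5 ** 10                              # probability of guessing 10 red/black cards
n_people = 1000
expected = n_people * p_single                    # expected number of perfect scorers
p_at_least_one = 1 - (1 - p_single) ** n_people   # chance that somebody scores perfectly
print(f"p(single) = {p_single:.5f}")              # ~0.001
print(f"expected perfect scorers = {expected:.2f}")
print(f"P(at least one) = {p_at_least_one:.2f}")
```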
12Data Mining process
- Knowledge discovery in databases (KDD): a search process for understandable and useful patterns in data.
(Figure: the KDD process, with a note on which stage requires most of the effort.)
13Stages of DM process
- Data gathering, data warehousing, Web crawling.
- Preparation of the data: cleaning, removing outliers and impossible values, removing wrong records, handling missing data.
- Exploratory data analysis: visualization of different aspects of the data.
- Finding features relevant to the questions being asked, preparing data structures for predictive methods, converting symbolic values to a numerical representation.
- Pattern extraction and discovery: rules, prototypes.
- Evaluation of the knowledge gained, finding useful patterns, consultation with experts.
14Multidimensional Data Cuboids
- Data warehouses use a multidimensional data model.
- Projections (views) of the data on different dimensions (attributes) form data cuboids.
- In the DB warehousing literature: base cuboid = original, N-dimensional data; apex cuboid = 0-D cuboid, the highest-level summary; data cube = the lattice of cuboids.
- Ex: sales data cube, viewed in multiple dimensions.
- Dimension tables, e.g. item (item_name, brand, type) or time (day, week, month, quarter, year).
- Fact tables: measures (such as cost) and keys to each of the related dimension tables. (A small aggregation sketch follows.)
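To make the cuboid idea concrete, here is a minimal sketch treating each cuboid as a GROUP BY aggregation over a tiny, made-up sales fact table; the table and its column names are illustrative assumptions, not data from the talk.

```python
# Cuboids as GROUP BY aggregations over a toy sales fact table (illustrative sketch).
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "item":     ["bread", "milk", "bread", "milk"],
    "location": ["Porto", "Torun", "Porto", "Torun"],
    "cost":     [100, 80, 120, 90],            # the measure stored in the fact table
})

dims = ["time", "item", "location"]
# The lattice of cuboids: every subset of dimensions, from the apex cuboid (no
# dimensions, a single total) up to the base cuboid (all dimensions).
for k in range(len(dims) + 1):
    for subset in combinations(dims, k):
        if subset:
            cuboid = sales.groupby(list(subset))["cost"].sum()
        else:
            cuboid = sales["cost"].sum()        # apex cuboid: overall total
        print(subset, "\n", cuboid, "\n")
```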
15Data Cube A Lattice of Cuboids
(Figure: the lattice of cuboids, e.g. the (time, item) and (time, item, location) cuboids.)
16Forms of useful knowledge
AI/Machine Learning camp: "Neural nets are black boxes. Unacceptable! Symbolic rules forever."
- But ... knowledge accessible to humans comes in the form of:
- symbols,
- similarity to prototypes,
- images, visual representations.
- What type of explanation is satisfactory?
- An interesting question for cognitive scientists.
- Different answers in different fields.
17Forms of knowledge
- Humans remember examples of each category and refer to such examples, as similarity-based or nearest-neighbor methods do.
- Humans create prototypes out of many examples, as Gaussian classifiers, RBF networks and neurofuzzy systems do.
- Logical rules are the highest form of summarization of knowledge.
- Types of explanation:
- exemplar-based: prototypes and similarity;
- logic-based: symbols and rules;
- visualization-based: exploratory data analysis, maps, diagrams, relations ...
18Computational Intelligence
(Figure: Soft Computing and Artificial Intelligence overlap in Computational Intelligence, which turns Data => Knowledge.)
19CI methods for data mining
- Provide non-parametric (universal), predictive models of data.
- Classify new data into pre-defined categories, supporting diagnosis and prognosis.
- Discover new categories, clusters, patterns.
- Discover interesting associations, correlations.
- Help to understand the data by creating fuzzy or crisp logical rules, or prototypes.
- Help to visualize multi-dimensional relationships among data samples.
20Association rules
- Classification rules: X => C(X).
- Association rules: looking for correlations between the components of X, i.e. the probability p(X_i | X_1, ..., X_{i-1}, X_{i+1}, ..., X_n).
- Market basket problem: many items are selected from an available pool into a basket; what are the correlations?
- Only frequent itemsets are interesting: itemsets with high support, i.e. appearing together in many baskets. Search for rules above a support threshold (e.g. > 1%). (A small sketch follows.)
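A minimal brute-force sketch of support and confidence on a toy set of baskets (illustrative data, not from the talk); real systems use Apriori-style pruning instead of enumerating all itemsets.

```python
# Frequent itemsets and support/confidence by brute force (illustrative sketch).
from itertools import combinations

baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]

def support(itemset):
    """Fraction of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

items = sorted(set().union(*baskets))
min_support = 0.5
frequent = [frozenset(c) for k in (1, 2)
            for c in combinations(items, k) if support(set(c)) >= min_support]
print("frequent itemsets:", frequent)

# Confidence of the rule {diapers} => {beer}.
conf = support({"diapers", "beer"}) / support({"diapers"})
print("confidence(diapers => beer) =", conf)
```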
21Association rules - related
- Problems related to the market basket: correlation between documents (high for plagiarism); phrases in documents (high for semantically related documents).
- Causal relations matter, although they may be difficult to determine: lower the price of diapers and keep the beer price high, or try the reverse; what will happen?
- More general approaches: Bayesian belief networks, causal networks, graphical models.
22Clustering
- Given points in a multidimensional space, divide them into groups that are similar.
- Ex: if an epidemic breaks out, look at the locations of cases on the map (cholera in London). Documents in the space of words cluster according to their topics.
- How to measure similarity?
- Hierarchical approaches start from single cases and join them to form clusters; ex: dendrogram. Centroid approaches assume a few centers and adapt their positions; ex: k-means, LVQ, SOM. (A small k-means sketch follows.)
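A tiny NumPy sketch of the centroid approach (k-means) on artificial 2D data; the data and the choice of k = 2 are illustrative assumptions.

```python
# A minimal k-means sketch in NumPy (centroid approach mentioned above).
import numpy as np

rng = np.random.default_rng(0)
# Two artificial clusters of points in 2D.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])

k = 2
centers = X[rng.choice(len(X), k, replace=False)]   # initial centers
for _ in range(20):
    # Assign each point to the nearest center (Euclidean distance).
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    # Move each center to the mean of its assigned points.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
print("cluster centers:\n", centers)
```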
23Neural networks
- Inspired by neurobiology: simple elements cooperate, changing internal parameters.
- A large field: dozens of different models, over 500 papers on NN in medicine each year.
- Supervised networks: heteroassociative mapping X => Y, symptoms => diseases; universal approximators.
- Unsupervised networks: clustering, competitive learning, autoassociation.
- Reinforcement learning: modeling behavior, playing games, sequential data.
24Supervised learning
- Compare the desired outputs with the achieved ones: you can't always get what you want.
- Examples: MLP/RBF NN, kNN, SVM, LDA, DT. (A small cross-validation sketch comparing a few of these follows.)
25Unsupervised learning
- Find interesting structures in data.
- SOM, many variants.
26Reinforcement learning
- Reward comes after the sequence of actions.
- Games, survival behavior, planning sequences of
actions.
27Unsupervised NN example
Clustering and visualization of the quality-of-life index (UN data) by a SOM map.
Poor classification, inaccurate visualization.
28Real and artificial neurons
(Figure: a real neuron with dendrites, synapses and an axon, compared with an artificial neuron: nodes connected by weighted links, the "synapses", passing signals.)
29Neural network for MI diagnosis
(Figure: a neural network for myocardial infarction diagnosis. Inputs such as sex, age, smoking, pain duration and ECG ST elevation are combined through input and output weights to give p(MI|X) = 0.7.)
30MI network function
- Training: setting the values of weights and thresholds; efficient algorithms exist.
The effect is a non-linear regression function.
Such networks are universal approximators: they may learn any mapping X => Y. (A small forward-pass sketch follows.)
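A minimal NumPy sketch of the forward pass of a one-hidden-layer network, i.e. the non-linear regression function mentioned above; the sizes (7 inputs, 4 hidden nodes) and the random weights are illustrative assumptions, not the MI network from the slides.

```python
# Forward pass of a one-hidden-layer network: a nested non-linear regression (sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, W1, b1, W2, b2):
    """F(X) = sigmoid(W2 . sigmoid(W1 . X + b1) + b2); with enough hidden nodes
    such functions can approximate any reasonable mapping X -> Y."""
    hidden = sigmoid(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 7))               # 5 cases, 7 input features (illustrative)
W1, b1 = rng.normal(size=(7, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
print(mlp_forward(X, W1, b1, W2, b2))     # outputs in [0, 1], like p(MI|X)
```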
31Knowledge from networks
- Simplify networks: force most weights to 0, quantize the remaining parameters, be constructive!
- Regularization: a mathematical technique improving the predictive abilities of the network.
- Result: MLP2LN neural networks that are equivalent to logical rules.
32MLP2LN
- Converts MLP neural networks into a network performing logical operations (LN).
- Input layer.
- Output: one node per class.
- Aggregation: better features.
- Rule units: threshold logic.
- Linguistic units: windows, filters.
33Learning dynamics
Decision regions shown every 200 training epochs in the x3, x4 coordinates; borders are optimally placed, with wide margins.
34Neurofuzzy systems
Fuzzy logic: membership m(x) ∈ {0,1} (no/yes) is replaced by a degree m(x) ∈ [0,1]. Triangular, trapezoidal, Gaussian ... membership functions.
Membership functions in many dimensions:
- Feature Space Mapping (FSM) neurofuzzy system.
- Neural adaptation, estimation of the probability density distribution (PDF) using a single-hidden-layer network (RBF-like) with nodes realizing separable functions. (A short membership-function sketch follows.)
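A minimal sketch of triangular and Gaussian membership functions in NumPy; the breakpoints and widths are illustrative assumptions.

```python
# Triangular and Gaussian membership functions used by neurofuzzy systems (sketch).
import numpy as np

def triangular(x, a, b, c):
    """Membership rising from 0 at a to 1 at b, then falling back to 0 at c."""
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

def gaussian(x, center, sigma):
    """Smooth membership with maximum 1 at the center."""
    return np.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

x = np.linspace(0, 3, 7)
print("triangular:", np.round(triangular(x, 1.0, 1.5, 2.0), 2))
print("gaussian:  ", np.round(gaussian(x, 1.5, 0.3), 2))
```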
35GhostMiner Philosophy
- GhostMiner, data mining tools from our lab: http://www.fqspl.com.pl/ghostminer/
- Separate the process of model building and knowledge discovery from model use => GhostMiner Developer and GhostMiner Analyzer.
- There is no free lunch: provide different types of tools for knowledge discovery. Decision tree, neural, neurofuzzy, similarity-based, committees.
- Provide tools for visualization of data.
- Support the process of knowledge discovery/model building and evaluation, organizing it into projects.
36Heterogeneous systems
Homogeneous systems: one type of building block, the same type of decision borders. Ex: neural networks, SVMs, decision trees, kNNs. Committees combine many models together, but lead to complex models that are difficult to understand.
- Discovering the simplest class structures (the right inductive bias) requires heterogeneous adaptive systems (HAS).
- Ockham's razor: simpler systems are better.
- HAS examples:
- NN with many types of neuron transfer functions.
- k-NN with different distance functions.
- DT with different types of test criteria.
37Wine data example
Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars. Task: recognize the source of a wine sample. 13 quantities measured, continuous features:
- alcohol content
- ash content
- magnesium content
- flavanoids content
- proanthocyanins phenols content
- OD280/D315 of diluted wines
- malic acid content
- alkalinity of ash
- total phenols content
- nonanthocyanins phenols content
- color intensity
- hue
- proline.
38Exploration and visualization
- General info about the data
39Exploration data
40Exploration data statistics
- Distribution of feature values
Proline has very large values; the data should be standardized before further processing.
41Exploration data standardized
- Standardized data: unit standard deviation; about 2/3 of all data should fall within [mean - std, mean + std].
Other options: normalize to fit in [-1, 1], or normalize after rejecting some extreme values. (A small sketch of both transformations follows.)
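A minimal NumPy sketch of both options, z-score standardization and [-1, 1] normalization; the three-row example matrix is an illustrative assumption mimicking features on very different scales (e.g. alcohol vs. proline).

```python
# Standardization and [-1, 1] normalization of features (illustrative sketch).
import numpy as np

X = np.array([[14.2, 1065.0],     # e.g. alcohol and proline on very different scales
              [13.2,  735.0],
              [12.4,  520.0]])

# z-score standardization: zero mean, unit standard deviation per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization to [-1, 1] per feature.
mn, mx = X.min(axis=0), X.max(axis=0)
X_norm = 2.0 * (X - mn) / (mx - mn) - 1.0

print(X_std.round(2))
print(X_norm.round(2))
```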
42Exploration 1D histograms
- Distribution of feature values in classes
Some features are more useful than others.
43Exploration 1D/3D histograms
- Distribution of feature values in classes, 3D
44Exploration 2D projections
- Projections (cuboids) onto selected 2D subspaces.
45Visualize data
Relations in more than 3 dimensions are hard to imagine. SOM mappings are popular for visualization, but rather inaccurate, with no measure of distortions. A measure of topographical distortions: map all X_i points from R^n to x_i points in R^m, m < n, and ask how well the distances R_ij = D(X_i, X_j) are reproduced by the distances r_ij = d(x_i, x_j). Use m = 2 for visualization, higher m for dimensionality reduction.
46Visualize data MDS
Multidimensional scaling (MDS): invented in psychometry by Torgerson (1952), re-invented by Sammon (1969) and myself (1994). Minimize the measure of topographical distortions by moving the x coordinates. (A small MDS sketch on the Wine data follows.)
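A minimal scikit-learn sketch of MDS on the Wine data, minimizing a stress measure of the distance distortions; using scikit-learn's MDS here (rather than the exact Sammon/Duch measure) is an assumption for illustration.

```python
# MDS visualization of the Wine data, minimizing distance distortions (sketch).
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import MDS

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)     # standardize first (see earlier slides)

# Map from R^13 to R^2 while trying to preserve pairwise distances R_ij ~ r_ij.
mds = MDS(n_components=2, random_state=0)
x2 = mds.fit_transform(X)
print("stress (total distortion):", round(mds.stress_, 1))
print("first three mapped points:\n", np.round(x2[:3], 2))
```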
47Visualize data Wine
3 clusters are clearly distinguished, 2D is fine.
The green outlier can be identified easily.
48Decision trees
Simplest things first: use a decision tree to find logical rules.
Test a single attribute and find a good point to split the data, separating vectors from different classes. DT advantages: fast, simple, easy to understand, easy to program, many good algorithms. (A small tree-fitting sketch follows.)
4 attributes used, 10 errors, 168 correct, 94.4% correct.
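A minimal scikit-learn sketch fitting a small decision tree to the Wine data and printing its root-to-leaf paths as rules; the depth limit of 3 is an illustrative assumption, and the resulting rules and error counts will differ from the SSV results quoted later.

```python
# Fitting a small decision tree to the Wine data and printing its rules (sketch).
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_wine()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

print("training accuracy:", round(tree.score(data.data, data.target), 3))
# Each root-to-leaf path is a readable logical rule on single attributes.
print(export_text(tree, feature_names=list(data.feature_names)))
```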
49Decision borders
Univariate trees: test the value of a single attribute, x < a.
Multivariate trees: test combinations of attributes, hyperplanes.
Result: the feature space is divided into cuboids.
Wine data: univariate decision tree borders for proline and flavanoids.
50Logical rules
Crisp logic rules: for continuous x use linguistic variables (predicate functions).
s_k(x) = True[ X_k ≤ x ≤ X'_k ], for example:
small(x) = True{ x | x < 1 }
medium(x) = True{ x | x ∈ [1, 2] }
large(x) = True{ x | x > 2 }
Linguistic variables are used in crisp (propositional, Boolean) logic rules:
IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
(A small predicate sketch follows.)
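A minimal sketch of the linguistic variables above written as Boolean predicate functions and combined into the Brownie rule; the height threshold used for small-height is an illustrative assumption.

```python
# Linguistic variables as Boolean predicates, combined into a crisp rule (sketch).
def small(x):          # small(x) = True{x | x < 1}
    return x < 1.0

def medium(x):         # medium(x) = True{x | x in [1, 2]}
    return 1.0 <= x <= 2.0

def large(x):          # large(x) = True{x | x > 2}
    return x > 2.0

def is_brownie(height, has_hat, has_beard):
    """IF small-height(X) AND has-hat(X) AND has-beard(X) THEN X is a Brownie."""
    return small(height) and has_hat and has_beard

print(is_brownie(0.7, True, True))    # True
print(is_brownie(1.8, True, False))   # False
```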
51Crisp logic decisions
Crisp logic is based on rectangular membership functions:
True/False values jump from 0 to 1. Step functions are used for partitioning the feature space.
Very simple hyper-rectangular decision borders.
This is a severe limitation on the expressive power of crisp logical rules!
52Logical rules - advantages
Logical rules, if simple enough, are preferable.
- Rules may expose limitations of black-box solutions.
- Only relevant features are used in rules.
- Rules may sometimes be more accurate than NN and other CI methods.
- Overfitting is easy to control; rules usually have a small number of parameters.
- Rules forever!? A logical rule about logical rules is ...
53Logical rules - limitations
- Logical rules are preferred, but ...
- Only one class is predicted: p(Ci|X,M) = 0 or 1; a black-and-white picture may be inappropriate in many applications.
- A discontinuous cost function allows only non-gradient optimization.
- Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules.
- Reliable crisp rules may reject some cases as unclassified.
- Interpretation of crisp rules may be misleading.
- Fuzzy rules are not so comprehensible.
54Rules - choices
Simplicity vs. accuracy. Confidence vs. rejection rate.
p++ is a hit, p-+ a false alarm, p+- a miss (first index: true class, second index: predicted class, with "r" meaning rejected; p+ and p- are the class fractions).
Accuracy (overall): A(M) = p++ + p--
Error rate: L(M) = p+- + p-+
Rejection rate: R(M) = p+r + p-r = 1 - L(M) - A(M)
Sensitivity: S+(M) = p+|+ = p++ / p+
Specificity: S-(M) = p-|- = p-- / p-
55Rules error functions
- The overall accuracy is equal to a combination of sensitivity and specificity, weighted by the a priori probabilities:
A(M) = p+ S+(M) + p- S-(M)
Optimization of rules for the C+ class: a large γ means no errors but a high rejection rate.
E(M; γ) = γ L(M) - A(M) = γ (p+- + p-+) - (p++ + p--)
min_M E(M; γ)  <=>  min_M [ (1 + γ) L(M) + R(M) ]
Optimization with different costs of errors:
min_M E(M; α) = min_M { p+- + α p-+ }
= min_M { p+ (1 - S+(M)) - p+r(M) + α [ p- (1 - S-(M)) - p-r(M) ] }
ROC (Receiver Operating Characteristic) curve: p++ as a function of p-+, i.e. hit rate vs. false alarm rate. (A small computation sketch follows.)
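A minimal sketch computing the quantities defined on the previous two slides from confusion-matrix fractions; the example numbers are illustrative assumptions.

```python
# Confusion-matrix fractions and the rule-quality measures defined above (sketch).
def rule_quality(p_pp, p_pm, p_mp, p_mm, p_pr=0.0, p_mr=0.0):
    """p_ab = fraction of class-a cases assigned to class b ('r' = rejected)."""
    p_plus, p_minus = p_pp + p_pm + p_pr, p_mm + p_mp + p_mr
    accuracy = p_pp + p_mm                 # A(M)
    error = p_pm + p_mp                    # L(M)
    rejection = p_pr + p_mr                # R(M)
    sensitivity = p_pp / p_plus            # S+(M)
    specificity = p_mm / p_minus           # S-(M)
    return accuracy, error, rejection, sensitivity, specificity

# Example: 45% true +, 5% missed +, 10% false alarms, 40% true -.
print(rule_quality(0.45, 0.05, 0.10, 0.40))
```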
56Wine example SSV rules
- Decision trees provide rules of different complexity.
The simplest tree has 5 nodes, corresponding to 3 rules; 25 errors, mostly Class 2/3 wines mixed.
57Wine SSV 5 rules
- Lower pruning leads to a more complex tree.
7 nodes, corresponding to 5 rules; 10 errors, mostly Class 2/3 wines mixed.
58Wine SSV optimal rules
What is the optimal complexity of rules? Use crossvalidation to estimate generalization.
Various solutions may be found, depending on the search: 5 rules with 12 premises, making 6 errors; 6 rules with 16 premises and 3 errors; 8 rules, 25 premises, and 1 error.
if OD280/D315 > 2.505 ∧ proline > 726.5 ∧ color > 3.435 then class 1
if OD280/D315 > 2.505 ∧ proline > 726.5 ∧ color < 3.435 then class 2
if OD280/D315 < 2.505 ∧ hue > 0.875 ∧ malic-acid < 2.82 then class 2
if OD280/D315 > 2.505 ∧ proline < 726.5 then class 2
if OD280/D315 < 2.505 ∧ hue < 0.875 then class 3
if OD280/D315 < 2.505 ∧ hue > 0.875 ∧ malic-acid > 2.82 then class 3
59Wine FSM rules
SSV: hierarchical rules. FSM: density estimation with feature selection.
The complexity of the rules depends on the desired accuracy. Use rectangular functions for crisp rules. Optimal accuracy may be evaluated using crossvalidation.
FSM discovers simpler rules, for example:
if proline > 929.5 then class 1 (48 cases, 45 correct, 2 recovered by other rules)
if color < 3.79285 then class 2 (63 cases, 60 correct)
60Examples of interesting knowledge discovered!
- The most famous example of knowledge discovered by data mining: the correlation between beer, milk and diapers.
Other examples: 2 subtypes of galactic spectra forced astrophysicists to reconsider stellar evolutionary processes. Several examples of knowledge found by us in medical and other datasets follow.
61Mushrooms
- The Mushroom Guide: there is no simple rule for mushrooms, no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
8124 cases, 51.8% edible, the rest non-edible. 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, or 2^118 ≈ 3·10^35 possible input vectors.
Odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy.
Spore print color: black, brown, buff, chocolate, green, orange, purple, white, yellow.
Safe rule for edible mushrooms:
odor = (almond ∨ anise ∨ none) ∧ spore-print-color ≠ green
48 errors, 99.41% correct.
This is why animals have such a good sense of smell! What does it tell us about odor receptors?
62Mushrooms rules
- To eat or not to eat, this is the question! Not any more ...
A mushroom is poisonous if:
R1) odor ≠ (almond ∨ anise ∨ none); 120 errors, 98.52%
R2) spore-print-color = green; 48 errors, 99.41%
R3) odor = none ∧ stalk-surface-below-ring = scaly ∧ stalk-color-above-ring ≠ brown; 8 errors, 99.90%
R4) habitat = leaves ∧ cap-color = white; no errors!
R1 + R2 are quite stable, found even with 10% of the data. R3 and R4 may be replaced by other rules, e.g.:
R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly)
R'4) gill-size = narrow ∧ population = clustered
Only 5 of the 22 attributes are used! The simplest possible rules? 100% in CV tests; the structure of this data is completely clear.
63Recurrence of breast cancer
- Data from the Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia.
286 cases: 201 no-recurrence (70.3%), 85 recurrence cases (29.7%).
Example record: no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes.
9 nominal features: age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1, 2, 3), breast, breast quadrant, radiation.
64Rules for breast cancer
- Data from the Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia.
Many systems have been used, with 65-78% accuracy reported.
Single rule:
IF (nodes-involved ∉ [0, 2] ∧ degree-malignant = 3) THEN recurrence, ELSE no-recurrence
76.2% accuracy; only trivial knowledge is in the data: highly malignant breast cancer involving many nodes is likely to strike back.
65Recurrence - comparison.
Method: 10xCV accuracy (%)
MLP2LN, 1 rule: 76.2
SSV DT, stable rules: 75.7 ± 1.0
k-NN, k=10, Canberra: 74.1 ± 1.2
MLP+backprop: 73.5 ± 9.4 (Zarndt)
CART DT: 71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes: 71.7 ± 6.8
Naive Bayes: 69.3 ± 10.0 (Zarndt)
Other decision trees: < 70.0
66Breast cancer diagnosis.
- Data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
699 cases, 9 features quantized from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses.
Task: distinguish benign from malignant cases.
67Breast cancer rules.
- Data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
Simplest rule from MLP2LN, with large regularization:
If uniformity of cell size < 3 Then benign Else malignant
Sensitivity = 0.97, Specificity = 0.85.
More complex NN solutions, from a 10xCV estimate: Sensitivity = 0.98, Specificity = 0.94.
68Breast cancer comparison.
Method: 10xCV accuracy (%)
k-NN, k=3, Manhattan: 97.0 ± 2.1 (GM)
FSM, neurofuzzy: 96.9 ± 1.4 (GM)
Fisher LDA: 96.8
MLP+backprop: 96.7 (Ster, Dobnikar)
LVQ: 96.6 (Ster, Dobnikar)
IncNet (neural): 96.4 ± 2.1 (GM)
Naive Bayes: 96.4
SSV DT, 3 crisp rules: 96.0 ± 2.9 (GM)
LDA (linear discriminant): 96.0
Various decision trees: 93.5-95.6
69Melanoma skin cancer
- Collected in the Outpatient Center of Dermatology in Rzeszów, Poland.
- Four types of melanoma: benign, blue, suspicious, or malignant.
- 250 cases, with almost equal class distribution.
- Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5).
- TDS (Total Dermatoscopy Score): a single aggregated index.
- Goal: a hardware scanner for preliminary diagnosis.
70Melanoma rules
R1: IF TDS ≤ 4.85 AND C-BLUE IS absent THEN MELANOMA IS benign-nevus
R2: IF TDS ≤ 4.85 AND C-BLUE IS present THEN MELANOMA IS blue-nevus
R3: IF TDS > 5.45 THEN MELANOMA IS malignant
R4: IF TDS > 4.85 AND TDS < 5.45 THEN MELANOMA IS suspicious
5 errors (98.0%) on the training set, 0 errors (100%) on the test set.
Feature aggregation is important! Without TDS, 15 rules are needed.
71Melanoma results
Method | Rules | Training | Test
MLP2LN, crisp rules | 4 | 98.0 (all) | 100
SSV Tree, crisp rules | 4 | 97.5 ± 0.3 | 100
FSM, rectangular f. | 7 | 95.5 ± 1.0 | 100
kNN + prototype selection | 13 | 97.5 ± 0.0 | 100
FSM, Gaussian f. | 15 | 93.7 ± 1.0 | 95 ± 3.6
kNN, k=1, Manhattan, 2 features | -- | 97.4 ± 0.3 | 100
LERS, rough rules | 21 | -- | 96.2
72Summary
- Data mining is a large field; only a few issues have been mentioned here.
- DM involves many steps; here only those related to pattern recognition were stressed, but in practice scalability and efficiency issues may be the most important.
Neural networks are still used mostly for building predictive data models, but they may also provide simplified descriptions in the form of rules. Rules are not the only form of data understanding. Rules may be the beginning of a practical application. Some interesting knowledge has been discovered.
73Challenges
- Fully automatic, universal data analysis systems: press the button and wait for the truth.
- Discovery of theories rather than data models.
- Integration with image/signal analysis.
- Integration with reasoning in complex domains.
- Combining expert systems with neural networks.
We are slowly getting there. More and more computational intelligence tools (including our own) are available.
74Disclaimer
- A few slides/figures were taken from various presentations found on the Internet; unfortunately I cannot identify the original authors at the moment, since these slides went through different iterations.
- I have to apologize for that.