Title: Data Mining: An Overview
1Data Mining: An Overview
David Madigan, dmadigan@rci.rutgers.edu, http://stat.rutgers.edu/madigan
2Overview
- Brief Introduction to Data Mining
- Data Mining Algorithms
- Specific Examples
- Algorithms: Disease Clusters
- Algorithms: Model-Based Clustering
- Algorithms: Frequent Items and Association Rules
- Future Directions, etc.
3Of Laws, Monsters, and Giants
- Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)
- Its more aggressive cousin: disk storage capacity doubles every 9 months
4What is Data Mining?
- Finding interesting structure in data
- "Structure" refers to statistical patterns, predictive models, hidden relationships
- Examples of tasks addressed by Data Mining
- Predictive Modeling (classification, regression)
- Segmentation (Data Clustering)
- Summarization
- Visualization
7Ronny Kohavi, ICML 1998
8Ronny Kohavi, ICML 1998
9Ronny Kohavi, ICML 1998
10Stories Online Retailing
11Chapter 4 Data Analysis and Uncertainty
- Elementary statistical concepts: random variables, distributions, densities, independence, point and interval estimation, bias-variance, MLE
- Model (global, represents prominent structures) vs. Pattern (local, idiosyncratic deviations)
- Frequentist vs. Bayesian
- Sampling methods
12Bayesian Estimation
e.g. beta-binomial model
Predictive distribution
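The formulas on this slide were images and did not survive the transcript; the standard beta-binomial updating the slide refers to can be written as:

```latex
% Beta(\alpha, \beta) prior, x successes observed in n Bernoulli trials:
p(\theta \mid x) \propto \theta^{\alpha + x - 1}(1-\theta)^{\beta + n - x - 1}
\;\Longrightarrow\; \theta \mid x \sim \mathrm{Beta}(\alpha + x,\; \beta + n - x)

% Predictive probability that the next trial succeeds:
p(x_{n+1} = 1 \mid x) = \frac{\alpha + x}{\alpha + \beta + n}
```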
13Issues to do with p-values
- Using thresholds of 0.05 or 0.01 regardless of sample size
- Multiple testing (e.g. Friedman (1983): selecting highly significant regressors from noise)
- Subtle interpretation. Jeffreys (1980): "I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened."
14p-value as measure of evidence
Schervish (1996): if hypothesis H implies hypothesis H', then there should be at least as much support for H' as for H. This is not satisfied by p-values.

Grimmet and Ridenhour (1996): one might expect an outlying data point to lend support to the alternative hypothesis in, for instance, a one-way analysis of variance. In fact, the value of the outlying data point that minimizes the significance level can lie within the range of the data.
15Chapter 5 Data Mining Algorithms
A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns.
"well-defined": can be encoded in software
"algorithm": must terminate after some finite number of steps
16Data Mining Algorithms
A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns. (Hand, Mannila, and Smyth)
"well-defined": can be encoded in software
"algorithm": must terminate after some finite number of steps
17Algorithm Components
1. The task the algorithm is used to address (e.g. classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)
3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)
4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)
5. The data management technique used for storing, indexing, and retrieving data (critical when the data are too large to reside in memory)
19Backpropagation data mining algorithm
[Figure: feed-forward network with inputs x1-x4 (p = 4), hidden units h1, h2 (d1 = 2), and one output y]
- vector of p input values multiplied by p × d1 weight matrix
- resulting d1 values individually transformed by non-linear function
- resulting d1 values multiplied by d1 × d2 weight matrix
20Backpropagation (cont.)
Parameters
Score
Search: steepest descent; search for structure?
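The parameter, score, and search formulas on this slide were images. As an illustrative sketch only (the data, names, and learning rate here are all made up, not the lecture's), steepest descent with backpropagated gradients for the slide 19 network (4 inputs, 2 hidden units, 1 output, squared-error score) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup matching the slide: p = 4 inputs, d1 = 2 hidden units, 1 output.
X = rng.normal(size=(50, 4))
y = np.tanh(X @ rng.normal(size=4))      # synthetic targets a small net can fit
W1 = rng.normal(scale=0.5, size=(4, 2))  # p x d1 weight matrix
W2 = rng.normal(scale=0.5, size=(2, 1))  # d1 x d2 weight matrix

def forward(X, W1, W2):
    H = np.tanh(X @ W1)   # d1 values, individually transformed (non-linearity)
    return H, H @ W2      # multiplied by the d1 x d2 weight matrix

def score(X, y, W1, W2):
    # Score function: mean squared error
    return np.mean((forward(X, W1, W2)[1].ravel() - y) ** 2)

score0 = score(X, y, W1, W2)
for step in range(2000):                           # steepest descent
    H, yhat = forward(X, W1, W2)
    g_out = (yhat.ravel() - y)[:, None] / len(y)   # d(loss)/d(output)
    g_W2 = H.T @ g_out                             # chain rule, output layer
    g_H = (g_out @ W2.T) * (1 - H ** 2)            # back through tanh
    g_W1 = X.T @ g_H                               # chain rule, input layer
    W1 -= 0.1 * g_W1
    W2 -= 0.1 * g_W2

final = score(X, y, W1, W2)   # should be well below score0
```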
21Models and Patterns
Models
Probability Distributions
Structured Data
Prediction
- Linear regression
- Piecewise linear
22Models
Probability Distributions
Structured Data
Prediction
- Linear regression
- Piecewise linear
- Nonparametric regression
24Models
Probability Distributions
Structured Data
Prediction
- Linear regression
- Piecewise linear
- Nonparametric regression
- Classification
logistic regression, naïve Bayes/TAN/Bayesian networks, NN, support vector machines, trees, etc.
25Models
Probability Distributions
Structured Data
Prediction
- Linear regression
- Piecewise linear
- Nonparametric regression
- Classification
- Parametric models
- Mixtures of parametric models
- Graphical Markov models (categorical, continuous,
mixed)
26Models
Probability Distributions
Structured Data
Prediction
- Time series
- Markov models
- Mixture Transition Distribution models
- Hidden Markov models
- Spatial models
- Linear regression
- Piecewise linear
- Nonparametric regression
- Classification
- Parametric models
- Mixtures of parametric models
- Graphical Markov models (categorical, continuous,
mixed)
27Markov Models
First-order
e.g.
g linear ⇒ standard first-order auto-regressive model
[Figure: chain graph y1 → y2 → y3 → ... → yT]
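The equations on this slide were lost in the transcript; the first-order Markov property and its linear special case (with my notation for the noise term) are:

```latex
% First-order Markov property:
p(y_t \mid y_1, \dots, y_{t-1}) \;=\; p(y_t \mid y_{t-1})

% e.g. y_t = g(y_{t-1}) + e_t ; with g linear:
y_t = a\,y_{t-1} + e_t, \qquad e_t \sim N(0, \sigma^2)
```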
28First-Order HMM/Kalman Filter
[Figure: chain of hidden states x1, x2, x3, ..., xT, each emitting an observation y1, y2, y3, ..., yT]
Note: to compute p(y1, ..., yT) we need to sum/integrate over all possible state sequences...
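The sum the note refers to can be written out, together with the forward recursion that makes it tractable (S denotes the number of hidden states; notation mine, with p(x_1 | x_0) read as the initial distribution p(x_1)):

```latex
p(y_1,\dots,y_T) \;=\; \sum_{x_1,\dots,x_T} \; \prod_{t=1}^{T} p(x_t \mid x_{t-1})\, p(y_t \mid x_t)

% Forward recursion: O(T S^2) work instead of the O(S^T) brute-force sum
\alpha_t(x_t) \;=\; p(y_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1})\,\alpha_{t-1}(x_{t-1})
```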
29Bias-Variance Tradeoff
High Bias - Low Variance
Low Bias - High Variance: overfitting, i.e., modeling the random component
Score function should embody the compromise
30The Curse of Dimensionality
X ~ MVN_p(0, I)
- Gaussian kernel density estimation
- Bandwidth chosen to minimize MSE at the mean
- Suppose we want:

Dimension   # data points required
1           4
2           19
3           67
6           2,790
10          842,000
31Patterns
Local:
- Outlier detection
- Changepoint detection
- Bump hunting
- Scan statistics
- Association rules
Global:
- Clustering via partitioning
- Hierarchical Clustering
- Mixture Models
32Scan Statistics via Permutation Tests
[Figure: accidents marked as x's along a curved road]
The curve represents a road. Each x marks an accident. A red x denotes an injury accident; a black x means no injury. Is there a stretch of road where there is an unusually large fraction of injury accidents?
33Scan with Fixed Window
- If we know the length of the stretch of road that we seek, e.g., we could slide this window along the road and find the most unusual window location
[Figure: fixed-length window slid along the road of x's]
34How Unusual is a Window?
- Let p_W and p_W̄ denote the true probability of being red inside and outside the window, respectively. Let (x_W, n_W) and (x_W̄, n_W̄) denote the corresponding counts
- Use the GLRT for comparing H0: p_W = p_W̄ versus H1: p_W ≠ p_W̄
- λ measures how unusual a window is
- −2 log λ here has an asymptotic chi-square distribution with 1 df
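The GLRT statistic itself was an image on the slide; for Bernoulli counts it is (in the notation above):

```latex
\lambda \;=\;
\frac{\sup_{p}\; p^{\,x_W + x_{\bar W}}\,(1-p)^{\,(n_W + n_{\bar W}) - (x_W + x_{\bar W})}}
     {\sup_{p_W,\,p_{\bar W}}\; p_W^{\,x_W}(1-p_W)^{\,n_W - x_W}\;
      p_{\bar W}^{\,x_{\bar W}}(1-p_{\bar W})^{\,n_{\bar W} - x_{\bar W}}}
```

The numerator is maximized at the pooled proportion, the denominator at the separate inside/outside proportions; small λ (equivalently, large −2 log λ) marks an unusual window.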
35Permutation Test
- Since we look at the smallest λ over all window locations, we need to find the distribution of the smallest λ under the null hypothesis that there are no clusters
- Look at the distribution of the smallest λ over, say, 999 random relabellings of the colors of the x's
[Illustration: four example relabellings, with smallest-λ values 0.376, 0.233, 0.412, 0.222]
- Look at the position of the observed smallest λ in this distribution to get the scan statistic p-value (e.g., if the observed smallest λ is 5th smallest, the p-value is 5/1000 = 0.005)
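A minimal sketch of this permutation scan test in Python; the function names and toy road data are mine, and I score windows by the equivalent −2 log λ (so "largest" rather than "smallest λ" is most unusual):

```python
import random
from math import log

def window_stat(colors, i, j):
    """-2 log(lambda) for window [i, j): Bernoulli GLRT (large = unusual)."""
    xw, nw = sum(colors[i:j]), j - i
    xo, no = sum(colors) - xw, len(colors) - nw
    def ll(x, n):            # maximized binomial log-likelihood
        if x == 0 or x == n:
            return 0.0
        return x * log(x / n) + (n - x) * log(1 - x / n)
    return 2 * (ll(xw, nw) + ll(xo, no) - ll(xw + xo, nw + no))

def scan_stat(colors, w):
    """Score of the most unusual fixed-length-w window."""
    return max(window_stat(colors, i, i + w) for i in range(len(colors) - w + 1))

def scan_pvalue(colors, w, n_perm=999, seed=1):
    """Permutation p-value: rank of the observed scan statistic."""
    observed = scan_stat(colors, w)
    rng, perm, hits = random.Random(seed), list(colors), 1
    for _ in range(n_perm):
        rng.shuffle(perm)                  # random relabelling of the colors
        if scan_stat(perm, w) >= observed:
            hits += 1
    return hits / (n_perm + 1)

# 8 injury accidents (1s) clustered on one stretch of 48 accidents
road = [0] * 20 + [1] * 8 + [0] * 20
p = scan_pvalue(road, w=8, n_perm=199)     # small p-value: a real cluster
```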
36Variable Length Window
- No need to use a fixed-length window; examine all possible windows up to, say, half the length of the entire road
O fatal accident, O non-fatal accident
37Spatial Scan Statistics
- Spatial scan statistic uses, e.g., circles
instead of line segments
39Spatial-Temporal Scan Statistics
- The spatial-temporal scan statistic uses cylinders, where the height of the cylinder represents a time window
40Other Issues
- Poisson model also common (instead of the Bernoulli model)
- Covariate adjustment
- Andrew Moore's group at CMU: efficient algorithms for scan statistics
41Software: SaTScan and others
http://www.satscan.org
http://www.phrl.org
http://www.terraseer.com
42Association Rules: Support and Confidence
- Find all the rules Y ⇒ Z with minimum confidence and support
- support, s: probability that a transaction contains Y ∪ Z
- confidence, c: conditional probability that a transaction containing Y also contains Z
[Figure: Venn diagram - customer buys diaper, customer buys beer, customer buys both]
- With minimum support 50% and minimum confidence 50%, we have
- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)
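The transaction table behind these numbers was an image; the small database below is hypothetical but reproduces the slide's figures exactly, and shows how support and confidence are computed:

```python
# Hypothetical transaction database consistent with the slide's numbers
# (the actual table did not survive the transcript).
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in D) / len(D)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

s = support({"A", "C"})        # 0.5  -> A => C has 50% support
c = confidence({"A"}, {"C"})   # 2/3  -> A => C has 66.6% confidence
c2 = confidence({"C"}, {"A"})  # 1.0  -> C => A has 100% confidence
```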
43Mining Association Rules: An Example
Min. support 50%, Min. confidence 50%
- For rule A ⇒ C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle:
- Any subset of a frequent itemset must be frequent
44Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
- A subset of a frequent itemset must also be a frequent itemset
- i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules
45The Apriori Algorithm
- Join Step: Ck is generated by joining Lk-1 with itself
- Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k
  L1 = {frequent items}
  for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk
    for each transaction t in database do
      increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk
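The pseudo-code above can be turned into a compact runnable sketch. The transaction database here is assumed (the slide's "Database D" table was an image; this is the classic four-transaction example), and the function names are mine:

```python
from itertools import combinations

def apriori(D, min_support):
    """Return all frequent itemsets (frozensets) mapped to their supports."""
    n = len(D)
    def supp(c):
        return sum(c <= t for t in D) / n
    # L1: frequent 1-itemsets
    items = {i for t in D for i in t}
    Lk = {frozenset([i]) for i in items if supp(frozenset([i])) >= min_support}
    frequent = {c: supp(c) for c in Lk}
    k = 1
    while Lk:
        # Join step: unions of pairs from Lk that give (k+1)-itemsets
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: drop candidates with an infrequent k-subset
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Scan D: keep candidates meeting min_support
        Lk = {c for c in Ck1 if supp(c) >= min_support}
        frequent.update({c: supp(c) for c in Lk})
        k += 1
    return frequent

# Assumed example database: L3 ends up as {B, C, E} with 50% support
freq = apriori([{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}], 0.5)
```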
46The Apriori Algorithm: Example
[Figure: worked example on Database D - scan D for counts to get C1, prune to L1; join to form C2, scan D, prune to L2; join to form C3, scan D, prune to L3]
47Association Rule Mining: A Road Map
- Boolean vs. quantitative associations (based on the types of values handled)
- buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
- age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
- Single-dimension vs. multi-dimensional associations (see examples above)
- Single-level vs. multiple-level analysis
- What brands of beers are associated with what brands of diapers?
- Various extensions (thousands!)
49Model-based Clustering
Padhraic Smyth, UCI
57Mixtures of Sequences, Curves, ...
Generative model:
- select a component ck for individual i
- generate data according to p(Di | ck)
- p(Di | ck) can be very general, e.g., sets of sequences, spatial patterns, etc.
Note: given p(Di | ck), we can define an EM algorithm
58Example Mixtures of SFSMs
- Simple model for traversal on a Web site (equivalent to first-order Markov with end state)
- Generative model for large sets of Web users: different behaviors ⇔ mixture of SFSMs
- EM algorithm is quite simple: weighted counts
59WebCanvas (Cadez, Heckerman, et al., KDD 2000)
63Discussion
- What is data mining? Hard to pin down; who cares?
- Textbook statistical ideas with a new focus on algorithms
- Lots of new ideas too
64Privacy and Data Mining
Ronny Kohavi, ICML 1998
65Analyzing Hospital Discharge Data
David Madigan Rutgers University
66Comparing Outcomes Across Providers
- Florence Nightingale wrote in 1863:
"In attempting to arrive at the truth, I have applied everywhere for information, but in scarcely an instance have I been able to obtain hospital records fit for any purposes of comparison... I am fain to sum up with an urgent appeal for adopting some uniform system of publishing the statistical records of hospitals."
67Data
- Data of various kinds are now available; e.g., data concerning all Medicare/Medicaid hospital admissions in the standard format UB-92 cover >95% of all admissions nationally
- Considerable interest in using these data to compare providers (hospitals, physician groups, physicians, etc.)
- In Pennsylvania, large corporations such as Westinghouse and Hershey Foods are a motivating force and use the data to select providers.
68SYSID DCSTATUS PPXDOW CANCER1
YEAR LOS SPX1DOW CANCER2
QUARTER DCHOUR SPX2DOW MDCHC4
PAF DCDOW SPX3DOW MQSEV
HREGION ECODE SPX4DOW MQNRSP
MAID PDX SPX5DOW PROFCHG
PTSEX SDX1 REFID TOTALCHG
ETHNIC SDX2 ATTID NONCVCHG
RACE SDX3 OPERID ROOMCHG
PSEUDOID SDX4 PAYTYPE1 ANCLRCHG
AGE SDX5 PAYTYPE2 DRUGCHG
AGECAT SDX6 PAYTYPE3 EQUIPCHG
PRIVZIP SDX7 ESTPAYER SPECLCHG
MKTSHARE SDX8 NAIC MISCCHG
COUNTY PPX OCCUR1 APRMDC
STATE SPX1 OCCUR2 APRDRG
ADTYPE SPX2 BILLTYPE APRSOI
ADSOURCE SPX3 DRGHOSP APRROM
ADHOUR SPX4 PCMU MQGCLUST
ADMDX SPX5 DRGHC4 MQGCELL
ADDOW
Pennsylvania Health Care Cost Containment Council, 2000-1, n = 800,000
69Risk Adjustment
- Discharge data like these allow for comparisons of, e.g., mortality rates for the CABG procedure across hospitals.
- Some hospitals accept riskier patients than others; a fair comparison must account for such differences.
- PHC4 (and many other organizations) use indirect standardization
- http://www.phc4.org
71Hospital Responses
73p-value computation
- n = 463; suppose the actual number of deaths = 40
- e = 29.56
- p-value < 0.05
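The p-value formula on this slide was an image, so PHC4's exact calculation isn't shown here. A common choice for indirect standardization is the Poisson tail probability of seeing at least the observed number of deaths given the expected count; under that assumption:

```python
from math import exp, factorial

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): 1 - CDF(k - 1)."""
    return 1.0 - sum(exp(-lam) * lam ** i / factorial(i) for i in range(k))

# Slide's numbers: e = 29.56 expected deaths, 40 observed
p = poisson_tail(40, 29.56)   # comes out below 0.05, as on the slide
```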
74Concerns
- Ad hoc groupings of strata
- Adequate risk adjustment for outcomes other than mortality? Sensitivity analysis? Hopeless?
- Statistical testing versus estimation
- Simpson's paradox
75
Hospital A:
Risk Cat.   N     Rate   Actual   Expected (ref. rate)
Low         800   1%     8        8 (1%)
High        200   8%     16       10 (5%)
SMR = 24/18 = 1.33, p-value = 0.07

Hospital B:
Risk Cat.   N     Rate   Actual   Expected (ref. rate)
Low         200   1%     2        2 (1%)
High        800   8%     64       40 (5%)
SMR = 66/42 = 1.57, p-value = 0.0002
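The SMRs on this slide follow directly from the table: observed deaths divided by the expected deaths at the reference rates. A short check in Python (variable names mine):

```python
# Strata from the slide: (N, observed rate, reference rate)
hospitals = {
    "A": [(800, 0.01, 0.01), (200, 0.08, 0.05)],   # low risk, high risk
    "B": [(200, 0.01, 0.01), (800, 0.08, 0.05)],
}

def smr(strata):
    """Standardized mortality ratio: observed / expected deaths."""
    observed = sum(n * rate for n, rate, _ in strata)
    expected = sum(n * ref for n, _, ref in strata)
    return observed / expected

# Identical stratum-specific rates, different case mix -> different SMRs
smr_a = smr(hospitals["A"])   # 24/18 = 1.33
smr_b = smr(hospitals["B"])   # 66/42 = 1.57
```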
76Hierarchical Model
- Patients → physicians → hospitals
- Build a model using data at each level and estimate quantities of interest
77Bayesian Hierarchical Model
MCMC via WinBUGS
78Goldstein and Spiegelhalter, 1996
79Discussion
- Markov chain Monte Carlo and growing compute power enable hierarchical modeling
- Software is a significant barrier to the widespread application of better methodology
- Are these data useful for the study of disease?