Title: ICS 278: Data Mining Exploratory Data Analysis and Visualization
1ICS 278 Data Mining: Exploratory Data Analysis and Visualization
2Lecture 4
- Project proposals due next Thursday
- Today's lecture
- Exploratory Data Analysis and Visualization
- Summary statistics
- 1 and 2 dimensional data visualization
- Higher dimensional visualization
- Examples of complex visualization
- Reading: Chapter 3 in the text
- Note: slides at end of last lecture on covariance, Mahalanobis distance, etc., will be revisited in next lecture
3Exploratory Data Analysis (EDA)
- get a general sense of the data
- interactive and visual
- (cleverly/creatively) exploit human visual power to see patterns
- 1 to 5 dimensions (e.g. spatial, color, time, sound)
- e.g. plot raw data/statistics, reduce dimensions as needed
- data-driven (model-free)
- especially useful in early stages of data mining
- detect outliers (e.g. assess data quality)
- test assumptions (e.g. normal distributions or skewed?)
- identify useful raw data transforms (e.g. log(x))
- http://www.itl.nist.gov/div898/handbook/eda/eda.htm
- Bottom line: it is always well worth looking at your data!
4Summary Statistics
- not visual
- sample statistics of data X
- mean: $\mu = \sum_i X_i / n$; $\mu$ minimizes $\sum_i (X_i - \mu)^2$
- mode: most common value in X
- median: X = sort(X), median = $X_{n/2}$ (half below, half above)
- quartiles of sorted X: Q1 = value $X_{0.25n}$, Q3 = value $X_{0.75n}$
- interquartile range: value(Q3) - value(Q1)
- range: max(X) - min(X) = $X_n - X_1$
- variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
- skewness: $\sum_i (X_i - \mu)^3 / (\sum_i (X_i - \mu)^2)^{3/2}$
- zero if symmetric; right-skewed more common (e.g. you v. Bill Gates)
- number of distinct values for a variable (see unique.m in MATLAB)
- Note: all of these are estimates based on the sample at hand; they may be different from the true values (e.g., median age in US).
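The statistics listed above can be sketched in a few lines of Python (the slides themselves use MATLAB). The quartile and median indexing below is one simple zero-based convention chosen for illustration, the skewness follows the formula as written on the slide, and the income values are hypothetical.

```python
from collections import Counter

def summary_statistics(values):
    """Sample statistics following the slide's definitions (1/n variance)."""
    x = sorted(values)
    n = len(x)
    mean = sum(x) / n
    # Simple index-based quantiles on the sorted data (zero-based indexing);
    # other conventions interpolate between neighbouring values.
    q1 = x[int(0.25 * n)]
    median = x[n // 2]
    q3 = x[int(0.75 * n)]
    sum_sq = sum((v - mean) ** 2 for v in x)
    sum_cube = sum((v - mean) ** 3 for v in x)
    return {
        "mean": mean,
        "mode": Counter(x).most_common(1)[0][0],
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": x[-1] - x[0],
        "variance": sum_sq / n,
        # Skewness as written on the slide, with no extra normalisation.
        "skewness": sum_cube / sum_sq ** 1.5 if sum_sq > 0 else 0.0,
        "distinct": len(set(x)),
    }

# Hypothetical incomes (in thousands): one very large value gives right skew.
incomes = [30, 35, 40, 40, 45, 50, 55, 1000]
stats = summary_statistics(incomes)
for name, value in stats.items():
    print(f"{name:>9}: {value}")
```

With these made-up incomes the mean is pulled far above the median by the single large value, which is the "you v. Bill Gates" effect in miniature.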
5Exploratory Data Analysis
- Tools for Displaying Single Variables
6Histogram
- Most common form: split data range into equal-sized bins. Then for each bin, count the number of points from the data set that fall into the bin.
- Vertical axis: Frequency (i.e., counts for each bin)
- Horizontal axis: Response variable
- The histogram graphically shows the following:
- center (i.e., the location) of the data
- spread (i.e., the scale) of the data
- skewness of the data
- presence of outliers, and
- presence of multiple modes in the data.
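The binning step described above (equal-sized bins, then a count per bin) can be sketched in Python as follows. The function name `histogram_counts` and the data are hypothetical, and the text-mode bars stand in for a real plot.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin. If all values are equal, use bin 0.
        k = min(int((v - lo) / width), num_bins - 1) if width > 0 else 0
        counts[k] += 1
    edges = [lo + k * width for k in range(num_bins + 1)]
    return edges, counts

# Hypothetical data with one outlier at 20.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
edges, counts = histogram_counts(data, 4)
for left, right, c in zip(edges, edges[1:], counts):
    print(f"[{left:6.2f}, {right:6.2f}) {'*' * c}")
```

The empty middle bins followed by a single count in the last bin are how an outlier typically shows up in a histogram.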
7Issues with Histograms
- For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.
- Interactive bin-width example (online applet)
- http://www.stat.sc.edu/west/javahtml/Histogram.html
- For large data sets, histograms can be quite effective at illustrating general properties of the distribution.
- Can smooth histogram using a variety of techniques
- E.g., kernel density estimation (pages 59-61 in text)
- Histograms effectively only work with 1 variable at a time
- Difficult to extend to 2 dimensions, not possible for > 2
- So histograms tell us nothing about the relationships among variables
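As a rough illustration of the smoothing idea mentioned above, here is a minimal kernel density estimate in Python. It assumes a Gaussian kernel and a hand-picked bandwidth; the sample values are hypothetical, and the treatment in the text may differ in its details.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x using Gaussian kernels."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # One Gaussian bump per data point, summed and normalised.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

data = [1.0, 1.2, 1.9, 2.1, 2.3, 5.0]        # hypothetical 1-d sample
density = gaussian_kde(data, bandwidth=0.5)  # bandwidth chosen by hand (assumption)

# Evaluate on a grid from -3.0 to 10.0 and check the estimate integrates to about 1.
step = 0.1
grid = [i * step for i in range(-30, 101)]
area = sum(density(x) * step for x in grid)
print(f"approximate area under the estimate: {area:.3f}")
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual points more closely, which mirrors the bin-width sensitivity of histograms.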
8Histogram Example
classical bell-shaped, symmetric histogram with
most of the frequency counts bunched in the
middle and with the counts dying off out in the
tails. From a physical science/engineering point
of view, the Normal/Gaussian distribution often
occurs in nature (due in part to the central
limit theorem).
9ZipCode Data Population
10ZipCode Data Population
- MATLAB code:
X = zipcode_data(:,2);       % second column from zipcode array
histogram(X, 50);            % histogram of X with 50 bins
histogram(X, 500);           % 500 bins
index = X < 5000;            % identify X values lower than 5000
histogram(X(index), 50);     % now plot just these X values
11Histogram: Detecting Outlier (Missing Data)
12Right Skewness Example: Credit Card Usage
similarly right-skewed are power law distributions ($P_i \propto 1/i^a$, where $a > 1$), e.g. for $a \approx 1$ we have Zipf's law for word frequencies in text
13Box (and Whisker) Plots: Pima Indians Data
plots all data outside whiskers
Q3-Q1: box contains middle 50% of data
whiskers: up to 1.5 x (Q3-Q1) (or shorter, if no data that far above Q3)
Q2 (median)
healthy
diabetic
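The quantities that a box plot draws can be computed directly, as in this Python sketch. The quartile indexing is a simple convention chosen for illustration, and the readings are hypothetical values, not the Pima Indians data.

```python
def box_plot_stats(values):
    """Box, whisker and outlier values using the 1.5 x (Q3-Q1) rule."""
    x = sorted(values)
    n = len(x)
    q1, q2, q3 = x[int(0.25 * n)], x[n // 2], x[int(0.75 * n)]
    iqr = q3 - q1
    # Whiskers stop at the most extreme data points that lie within
    # 1.5 x IQR of the box, so they can be shorter than 1.5 x IQR.
    lower_whisker = min(v for v in x if v >= q1 - 1.5 * iqr)
    upper_whisker = max(v for v in x if v <= q3 + 1.5 * iqr)
    outliers = [v for v in x if v < lower_whisker or v > upper_whisker]
    return {
        "q1": q1, "median": q2, "q3": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }

# Hypothetical measurements with one unusually high value.
readings = [62, 70, 72, 74, 76, 78, 80, 82, 84, 90, 130]
box = box_plot_stats(readings)
print(box)
```

Computing one such summary per group (for example healthy and diabetic) and drawing them side by side gives the comparison shown on the slide.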
14Time Series Example 1
annual fees introduced in UK (many users cut back to 1 credit card)
15Time Series Example 2
summer bifurcations in air travel (favor
early/late)
summer peaks
steady growth trend
New Year bumps
16Time-Series Example 3
mean weight vs mean age for 10k control group
Scotland experiment: does milk in kids' diet lead to better health? 20,000 kids: 5k raw, 5k pasteurized, 10k control (no supplement)
Possible explanations: Grow less early in year than later? No steps in height plots, so why would height increase uniformly but weight increase in spurts? Kids weighed in clothes: summer garb lighter than winter?
Would expect smooth weight growth plot. Visually
reveals unexpected pattern (steps), not apparent
from raw data table.
17Non-Stationarity
- Stationarity
- (loose definition)
- A probability distribution p(x | t) is stationary with respect to t if
- p(x | t) = p(x) for all t,
- where x is the set of variables of interest, and
- t is some other varying quantity (e.g., usually t = time, but could represent spatial information, group information, etc)
- Examples
- p(customer demographics today) = p(customer demographics next month)?
- p(weights in Scotland) = p(weights in US)?
- p(income of customers in Bank 1) = p(income of customers in Bank 2)?
- Non-stationarity is common in real data sets
- Solutions?
- Model the non-stationarity (e.g., increasing trend over time) and extrapolate
- Build model only on most recent/most similar data
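Both suggested solutions can be illustrated with a small Python sketch: fit a straight-line trend by least squares and extrapolate it, or summarise only the most recent observations. The monthly values are hypothetical and exactly linear to keep the arithmetic easy to follow; real data would be noisier, and a straight line is only one possible trend model.

```python
def fit_line(t, y):
    """Least-squares straight line y = intercept + slope * t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    slope = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
             / sum((ti - t_mean) ** 2 for ti in t))
    intercept = y_mean - slope * t_mean
    return intercept, slope

# Hypothetical mean monthly spend that drifts upward by 5 per month.
months = list(range(12))
mean_spend = [100 + 5 * m for m in months]

# Option 1: model the trend and extrapolate to the next month.
intercept, slope = fit_line(months, mean_spend)
forecast = intercept + slope * 12

# Option 2: use only the most recent data (here the last 3 months).
recent_mean = sum(mean_spend[-3:]) / 3
overall_mean = sum(mean_spend) / len(mean_spend)

print(f"trend forecast for month 12: {forecast:.1f}")
print(f"mean of last 3 months: {recent_mean:.1f}, mean of all months: {overall_mean:.1f}")
```

The overall mean sits well below both the forecast and the recent mean, which is what goes wrong when non-stationary data are treated as if p(x | t) = p(x).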
18Exploratory Data Analysis
- Tools for Displaying Pairs of Variables
192D Scatter Plots
- standard tool to display relation between 2 variables
- e.g. y-axis = response, x-axis = suspected indicator
- useful to answer
- x,y related?
- no
- linearly
- nonlinearly
- variance(y) depend on x?
- outliers present?
- MATLAB
- plot(X(1,:),X(2,:),'.')
credit card repayment: low-low, high-high
20Scatter Plot: No apparent relationship
21Scatter Plot: Linear relationship
22Scatter Plot: Quadratic relationship
23Scatter Plot: Homoscedastic
Variation of Y does not depend on X
24Scatter Plot: Heteroscedastic
Variation in Y differs depending on the value of X, e.g., Y = annual tax paid, X = income
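One crude numerical check for heteroscedasticity is to split the x range into bins and compare the variance of y in each bin. The Python sketch below does this on hypothetical income and tax values whose scatter grows with income; the function names and the choice of two bins are illustrative assumptions.

```python
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_by_x_bin(xs, ys, num_bins):
    """Variance of y within each of num_bins equal-width bins of x."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / num_bins   # assumes the x values are not all equal
    groups = [[] for _ in range(num_bins)]
    for x, y in zip(xs, ys):
        k = min(int((x - lo) / width), num_bins - 1)
        groups[k].append(y)
    return [variance(g) if g else None for g in groups]

# Hypothetical data: tax is about 20% of income, with a deviation
# that alternates in sign and grows in proportion to income.
income = list(range(10, 210, 10))
tax = [0.2 * x + (0.05 * x if i % 2 == 0 else -0.05 * x)
       for i, x in enumerate(income)]

spread = variance_by_x_bin(income, tax, 2)
print(f"variance of tax, low incomes:  {spread[0]:.2f}")
print(f"variance of tax, high incomes: {spread[1]:.2f}")
```

Because each bin also contains some variation due to the trend itself, this is only a rough indicator; looking at the scatter plot remains the more direct check.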
25(from US Zip code data: each point = 1 Zip code) units = dollars
26Problems with Scatter Plots of Large Data
appears: later apps older; reality: downward slope (more apps, more variance)
96,000 bank loan applicants
scatter plot degrades into black smudge ...
27Contour Plots Can Help
recall
(same 96,000 bank loan apps as before)
shows that the change in variance(y) with x is indeed due to horizontal skew in density
unimodal, skewed
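A contour plot is drawn from an estimate of the point density rather than from the individual points. A minimal way to get such an estimate is to count points on a 2-d grid, as in this Python sketch. The Gaussian test data are hypothetical (not the bank loan data), and the 5 x 5 grid size is an arbitrary choice.

```python
import random

def density_grid(xs, ys, nx, ny):
    """Count points in an nx-by-ny grid of equal-sized cells."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    grid = [[0] * nx for _ in range(ny)]
    for x, y in zip(xs, ys):
        # Assumes x and y each take more than one distinct value.
        i = min(int((x - x_lo) / (x_hi - x_lo) * nx), nx - 1)
        j = min(int((y - y_lo) / (y_hi - y_lo) * ny), ny - 1)
        grid[j][i] += 1
    return grid

rng = random.Random(0)                       # fixed seed for repeatability
xs = [rng.gauss(0, 1) for _ in range(10000)]
ys = [rng.gauss(0, 1) for _ in range(10000)]
grid = density_grid(xs, ys, 5, 5)

# Print with the largest y values at the top, like a plot.
for row in reversed(grid):
    print(" ".join(f"{c:5d}" for c in row))
```

Contour lines (or a color scale) drawn over these counts stay readable no matter how many points overlap, which is where the raw scatter plot turns into a black smudge.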
28Problems with Scatter Plots of Large Data
one week's credit card buys: gas vs groceries (10,000 customers)
actual correlation (0.48) higher than appears
(overprinting)
also demands explanation
29A simple data set
Data:
X: 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00
Y: 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68
Anscombe, Francis (1973), "Graphs in Statistical Analysis", The American Statistician, 27(1), pp. 17-21.
30A simple data set
Data:
X: 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00
Y: 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68
Summary Statistics:
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
31A simple data set
323 more data sets
X2     Y2     X3     Y3     X4     Y4
10.00  9.14   10.00  7.46   8.00   6.58
8.00   8.14   8.00   6.77   8.00   5.76
13.00  8.74   13.00  12.74  8.00   7.71
9.00   8.77   9.00   7.11   8.00   8.84
11.00  9.26   11.00  7.81   8.00   8.47
14.00  8.10   14.00  8.84   8.00   7.04
6.00   6.13   6.00   6.08   8.00   5.25
4.00   3.10   4.00   5.39   19.00  12.50
12.00  9.13   12.00  8.15   8.00   5.56
7.00   7.26   7.00   6.42   8.00   7.91
5.00   4.74   5.00   5.73   8.00   6.89
33Summary Statistics
Summary Statistics of Data Set 2:
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
34Summary Statistics
Summary Statistics            Data Set 2   Data Set 3   Data Set 4
N                             11           11           11
Mean of X                     9.0          9.0          9.0
Mean of Y                     7.5          7.5          7.5
Intercept                     3            3            3
Slope                         0.5          0.5          0.5
Residual standard deviation   1.237        1.237        1.237
Correlation                   0.816        0.816        0.816
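The near-identical summaries can be reproduced from the four data sets given on the slides. The Python sketch below fits a least-squares line to each one; the helper name `regression_summary` is illustrative, and the residual standard deviation uses n - 2 degrees of freedom, which is assumed to be the convention behind the 1.237 figure.

```python
import math

def regression_summary(x, y):
    """Means, least-squares line, residual standard deviation and correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return {
        "n": n, "mean_x": mx, "mean_y": my,
        "intercept": intercept, "slope": slope,
        "resid_sd": math.sqrt(rss / (n - 2)),   # n - 2 degrees of freedom (assumption)
        "corr": sxy / math.sqrt(sxx * syy),
    }

# Data as listed on the slides (data sets 1 to 3 share the same X values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
datasets = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summaries = {k: regression_summary(x, y) for k, (x, y) in datasets.items()}
for k, s in summaries.items():
    print(f"set {k}: mean_x={s['mean_x']:.1f} mean_y={s['mean_y']:.2f} "
          f"intercept={s['intercept']:.2f} slope={s['slope']:.3f} "
          f"resid_sd={s['resid_sd']:.3f} corr={s['corr']:.3f}")
```

The printed rows agree to within rounding, even though the four scatter plots look nothing alike.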
35Visualization really helps!
36Exploratory Data Analysis
- Tools for Displaying More than 2 Variables
37Multivariate Visualization
- Multivariate -> multiple variables
- 2 variables: scatter plots, etc
- 3 variables:
- 3-dimensional plots
- Look impressive, but often not used
- Can be cognitively challenging to interpret
- Alternatives: overlay color-coding (e.g., categorical data) on 2d scatter plot
- 4 variables:
- 3d with color or time
- Can be effective in certain situations, but tricky
- Higher dimensions
- Generally difficult
- Scatter plots, icon plots, parallel coordinates: all have weaknesses
- Alternative: map data to lower dimensions, e.g., PCA or multidimensional scaling
- Main problem: high-dimensional structure may not be apparent in low-dimensional views
38Scatter Plot Matrix
For interactive visualization the concept of
linked plots is generally useful
39Trellis Plot
Older
Younger
Male
Female
40Using Icons to Encode Information, e.g., Star Plots
- Each star represents a single observation. Star plots are used to examine the relative values for a single data point
- The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables.
- Useful for small data sets with up to 10 or so variables
- Limitations?
- Small data sets, small dimensions
- Ordering of variables may affect perception
Variables: 1 Price, 2 Mileage (MPG), 3 1978 Repair Record (1 = Worst, 5 = Best), 4 1977 Repair Record (1 = Worst, 5 = Best), 5 Headroom, 6 Rear Seat Room, 7 Trunk Space, 8 Weight, 9 Length
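The geometry of a single star can be sketched in Python as follows: each variable gets a spoke at an equal angular spacing, and the spoke length is the variable's value rescaled to the range 0 to 1. Scaling by the minimum and maximum of each variable is an assumption made for this sketch, and the observation is hypothetical.

```python
import math

def star_vertices(observation, minima, maxima):
    """(x, y) endpoints of the spokes for one observation."""
    p = len(observation)
    points = []
    for k, (v, lo, hi) in enumerate(zip(observation, minima, maxima)):
        radius = (v - lo) / (hi - lo)      # 0 at the variable's minimum, 1 at its maximum
        angle = 2.0 * math.pi * k / p      # equi-angular spokes
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points

# Hypothetical observation with 4 variables, each ranging from 0 to 10.
observation = [10, 5, 0, 2.5]
minima = [0, 0, 0, 0]
maxima = [10, 10, 10, 10]
vertices = star_vertices(observation, minima, maxima)
for k, (x, y) in enumerate(vertices, start=1):
    print(f"spoke {k}: ({x:+.2f}, {y:+.2f})")
```

Joining the endpoints in order gives the star outline. Changing the order of the variables changes the outline's shape, which is the perception issue noted in the limitations.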
41Chernoff's Faces
- described by ten facial characteristic parameters: head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length and degree of mouth opening
- Chernoff faces applet
- http://people.cs.uchicago.edu/wiseman/chernoff/
- more icon plots
- http://www.statsoft.com/textbook/glosi.html
42Parallel Coordinates
(epileptic seizure data from text)
1 (of n) cases
dimensions (possibly all p of them!) often
(re)ordered to better distinguish among
interesting subsets of n total cases
(this case is a brushed one, with a darker line, to stand out from the n-1 other cases)
interactive brushing is useful for seeing such
distinctions
43More elaborate parallel coordinates example (from E. Wegman, 1999). 12,000 bank customers with 8 variables. Additional dependent variable is profit (green for positive, red for negative)
44Interactive Grand Tour Techniques
- Grand Tour idea
- Cycle continuously through multiple projections of the data
- Cycles through all possible projections (depending on time constraints)
- Projections can be 1, 2, or 3d typically (often 2d)
- Can link with scatter plot matrices (see following example)
- Asimov (1985)
- e.g. XGOBI visualization package (available on the Web)
- http://public.research.att.com/stat/xgobi/
- Example on following 2 slides
- 7-dimensional physics data, color-coded by group, shown with
- Standard scatter matrix
- static snapshot of grand tour
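The basic building block of a tour, projecting p-dimensional points onto a 2d plane, can be sketched in Python as below. This is a simplification and not the algorithm of Asimov (1985) or XGobi: a real grand tour moves smoothly between planes, whereas this sketch only draws independent random planes. The 7-dimensional points are hypothetical.

```python
import math
import random

def random_2d_projection(p, rng):
    """Two orthonormal p-vectors spanning a random plane (Gram-Schmidt)."""
    u = [rng.gauss(0, 1) for _ in range(p)]
    v = [rng.gauss(0, 1) for _ in range(p)]
    norm_u = math.sqrt(sum(a * a for a in u))
    u = [a / norm_u for a in u]
    # Remove the component of v along u, then normalise what is left.
    dot = sum(a * b for a, b in zip(u, v))
    v = [b - dot * a for a, b in zip(u, v)]
    norm_v = math.sqrt(sum(b * b for b in v))
    v = [b / norm_v for b in v]
    return u, v

def project(point, u, v):
    """Coordinates of a p-dimensional point in the plane spanned by u and v."""
    return (sum(a * b for a, b in zip(point, u)),
            sum(a * b for a, b in zip(point, v)))

rng = random.Random(1)                 # fixed seed for repeatability
points = [[rng.gauss(0, 1) for _ in range(7)] for _ in range(5)]

# Three "frames" of a very crude tour: a new random plane each time.
for frame in range(3):
    u, v = random_2d_projection(7, rng)
    projected = [project(pt, u, v) for pt in points]
    print(f"frame {frame}: first point -> ({projected[0][0]:+.2f}, {projected[0][1]:+.2f})")
```

Each frame would be drawn as an ordinary 2d scatter plot; structure that is hidden in one projection may become visible in another.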
47Example of displaying 4d categorical data,
e.g., as used in OLAP/databases
48Other aspects (not discussed)
- Cognitive and human-factors aspects of visualization
- In creating visualizations of data it is important to be aware of how the human brain perceives visual information
- E.g., Rules and principles of scientific data visualization
- http://www.siggraph.org/education/materials/HyperVis/percept/visrules.htm
- Artistic aspects of visualization
- Classic books by Edward Tufte: http://www.edwardtufte.com/tufte/
- Visualization of other data
- 2d, 3d, 4d volume data (fluid flow, brain images, etc)
- Network/graph data
- Issues: graph layout/drawing, issues of graph size
- Many others, e.g.,
- http://www.cybergeography.org/
- CHI conference, etc
49Visualization of weather states for Kenya
Daily data from 20 year history clustered into 3 different weather states
Mean image for each state:
- wind direction (arrows)
- wind intensity (size of arrows)
- rainfall (size of circles)
- pressure (contours)
S. Kirshner, A. Robertson, P. Smyth, 2004.
50Summary of 189k newsgroups and 257M postings
Green: increase in postings in 2004 over 2003; red: decrease. Uses treemap technique. Details at http://jcmc.indiana.edu/vol10/issue4/turner.html
51Time-line Visualization of Research Fronts (Morris et al, JASIST, 2003)
52Interactive MultiTile Visualization (Falko Kuester's HIPerWall system, Calit2, UCI)
53Summary
- EDA and Visualization
- Can be very useful for
- data checking
- getting a general sense of individual or pairs of variables
- But
- do not necessarily reveal structure in high dimensions
- Reading: Chapter 3
- Next lecture
- projecting/mapping data from high dimensions to low dimensions