ICS 278: Data Mining Exploratory Data Analysis and Visualization

Data Mining Lectures: Lecture on EDA and Visualization. Padhraic Smyth, UC Irvine.


1
ICS 278: Data Mining
Exploratory Data Analysis and Visualization

2
Lecture 4
  • Project proposals due next Thursday
  • Today's lecture:
  • Exploratory Data Analysis and Visualization
  • Summary statistics
  • 1- and 2-dimensional data visualization
  • Higher-dimensional visualization
  • Examples of complex visualization
  • Reading: Chapter 3 in the text
  • Note: slides at end of last lecture on
    covariance, Mahalanobis distance, etc., will be
    revisited in next lecture

3
Exploratory Data Analysis (EDA)
  • get a general sense of the data
  • interactive and visual
  • (cleverly/creatively) exploit human visual power
    to see patterns
  • 1 to 5 dimensions (e.g. spatial, color, time,
    sound)
  • e.g. plot raw data/statistics, reduce dimensions
    as needed
  • data-driven (model-free)
  • especially useful in early stages of data mining
  • detect outliers (e.g. assess data quality)
  • test assumptions (e.g. normal distributions or
    skewed?)
  • identify useful raw data transforms (e.g.
    log(x))
  • http://www.itl.nist.gov/div898/handbook/eda/eda.htm
  • Bottom line: it is always well worth looking at
    your data!

4
Summary Statistics
  • not visual
  • sample statistics of data X
  • mean: $\mu = \sum_i X_i / n$, which
    minimizes $\sum_i (X_i - \mu)^2$
  • mode: most common value in X
  • median: X = sort(X), median = $X_{n/2}$ (half below,
    half above)
  • quartiles of sorted X: Q1 = value $X_{0.25n}$, Q3 =
    value $X_{0.75n}$
  • interquartile range: value(Q3) - value(Q1)
  • range: max(X) - min(X) = $X_n - X_1$
  • variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
  • skewness: $\sum_i (X_i - \mu)^3 / \left(\sum_i (X_i - \mu)^2\right)^{3/2}$
  • zero if symmetric; right-skewed more common (e.g.
    you v. Bill Gates)
  • number of distinct values for a variable (see
    unique.m in MATLAB)
  • Note: all of these are estimates based on the
    sample at hand; they may be different from the
    true values (e.g., median age in US).
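
The statistics on this slide can be sketched in a few lines of Python (standard library only, rather than the MATLAB used in the lecture). The function name summary_statistics and the sample list incomes are illustrative assumptions, not part of the lecture. Quartiles here use the interpolation rule of statistics.quantiles, which can differ slightly from the simple index rule on the slide, and the skewness follows the slide's formula, which differs from the usual standardized third moment by a factor of sqrt(n).

```python
import statistics

def summary_statistics(xs):
    """Return the slide's summary statistics for a list of numbers."""
    n = len(xs)
    mu = sum(xs) / n
    # Quartiles by linear interpolation (an assumption; the slide uses X[0.25n]).
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    s2 = sum((x - mu) ** 2 for x in xs)   # sum of squared deviations
    s3 = sum((x - mu) ** 3 for x in xs)   # sum of cubed deviations
    return {
        "mean": mu,
        "mode": statistics.mode(xs),
        "median": statistics.median(xs),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": max(xs) - min(xs),
        "variance": s2 / n,
        # Slide's skewness formula; zero when all values are identical.
        "skewness": s3 / s2 ** 1.5 if s2 > 0 else 0.0,
        "n_distinct": len(set(xs)),
    }

# Hypothetical right-skewed sample: one very large value pulls the mean up.
incomes = [1, 2, 2, 3, 4, 5, 100]
stats = summary_statistics(incomes)
for name, value in stats.items():
    print(f"{name:>10}: {value}")
```

In this made-up sample the single large value drags the mean well above the median and gives a positive skewness, which is the "you v. Bill Gates" effect from the slide.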

5
Exploratory Data Analysis
  • Tools for Displaying Single Variables

6
Histogram
  • Most common form: split data range into
    equal-sized bins. Then for each bin, count the
    number of points from the data set that fall into
    the bin.
  • Vertical axis: frequency (i.e., counts for each
    bin)
  • Horizontal axis: response variable
  • The histogram graphically shows the following:
  • center (i.e., the location) of the data
  • spread (i.e., the scale) of the data
  • skewness of the data
  • presence of outliers and
  • presence of multiple modes in the data.
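
The binning step can be sketched as follows in Python (standard library only). The function name histogram_counts and the sample data are assumptions made for illustration; the convention of putting the maximum value into the last bin is one common choice, not something the slide specifies.

```python
def histogram_counts(xs, n_bins):
    """Count points per bin, using n_bins equal-width bins over [min, max]."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in xs:
        # The maximum value goes into the last bin rather than a new one.
        k = min(int((x - lo) / width), n_bins - 1) if width > 0 else 0
        counts[k] += 1
    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, counts

# Hypothetical sample with one large value on the right.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 10]
edges, counts = histogram_counts(data, 3)
for left, right, c in zip(edges, edges[1:], counts):
    print(f"[{left:5.2f}, {right:5.2f})  {'*' * c}")
```

Rerunning with a different n_bins on the same data is a quick way to see how much the picture depends on the bin width.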

7
Issues with Histograms
  • For small data sets, histograms can be
    misleading. Small changes in the data or to the
    bucket boundaries can result in very different
    histograms.
  • Interactive bin-width example (online applet):
  • http://www.stat.sc.edu/west/javahtml/Histogram.html
  • For large data sets, histograms can be quite
    effective at illustrating general properties of
    the distribution.
  • Can smooth histogram using a variety of
    techniques
  • E.g., kernel density estimation (pages 59-61 in
    text)
  • Histograms effectively only work with 1 variable
    at a time
  • Difficult to extend to 2 dimensions, not possible
    for > 2
  • So histograms tell us nothing about the
    relationships among variables
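
As a rough illustration of the smoothing idea, here is a minimal kernel density estimate in Python (standard library only). It assumes a Gaussian kernel and a hand-picked bandwidth; the function name gaussian_kde, the data, and the bandwidth value are all illustrative choices, and the text's own treatment (pages 59-61) may differ in detail.

```python
import math

def gaussian_kde(xs, bandwidth):
    """Return a density estimate built as an average of Gaussian bumps."""
    n = len(xs)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        # One Gaussian bump centred on each data point, summed and normalised.
        return norm * sum(math.exp(-0.5 * ((t - x) / bandwidth) ** 2) for x in xs)

    return density

# Hypothetical sample: a cluster near 2 and one isolated point at 5.
data = [1.0, 1.2, 1.9, 2.1, 2.4, 5.0]
density = gaussian_kde(data, bandwidth=0.5)

# Check numerically that the estimate integrates to about 1 (Riemann sum).
grid = [-5 + 0.01 * k for k in range(1501)]   # covers -5 .. 10
area = sum(density(t) for t in grid) * 0.01
print(f"density at 2.0: {density(2.0):.3f}, at 4.0: {density(4.0):.3f}")
print(f"area under curve: {area:.4f}")
```

The bandwidth plays the same role as the bin width in a histogram: small values give a spiky estimate, large values smooth away real structure.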

8
Histogram Example
classical bell-shaped, symmetric histogram with
most of the frequency counts bunched in the
middle and with the counts dying off out in the
tails. From a physical science/engineering point
of view, the Normal/Gaussian distribution often
occurs in nature (due in part to the central
limit theorem).
9
ZipCode Data: Population
10
ZipCode Data: Population
  • MATLAB code:
      X = zipcode_data(:,2);      % second column from zipcode array
      histogram(X, 50);           % histogram of X with 50 bins
      histogram(X, 500);          % 500 bins
      index = X < 5000;           % identify X values lower than 5000
      histogram(X(index), 50);    % now plot just these X values

11
Histogram: Detecting Outlier (Missing Data)
12
Right Skewness Example: Credit Card Usage
similarly right-skewed are power-law
distributions ($P_i \propto 1/i^a$, where $a > 1$), e.g., for
$a \approx 1$ we have Zipf's law for word frequencies
in text
13
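Box (and Whisker) Plots: Pima Indians Data
Plot annotations:
  • box (Q1 to Q3): contains the middle 50% of the data
  • line inside the box: Q2 (median)
  • whiskers: extend up to 1.5 x (Q3 - Q1) from the box (or
    shorter, if no data that far above Q3)
  • all data outside the whiskers are plotted individually
  • groups compared: healthy, diabetic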
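
The quantities a box plot draws can be computed directly. The sketch below is in Python (standard library only); the function name box_plot_stats and the sample values are illustrative assumptions, not the Pima Indians data from the slide, and the quartile interpolation rule is that of statistics.quantiles.

```python
import statistics

def box_plot_stats(xs, whisker=1.5):
    """Compute box, whisker ends, and outliers for a box-and-whisker plot."""
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whisker * iqr, q3 + whisker * iqr
    # Whiskers stop at the most extreme data points still inside the fences.
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    outliers = sorted(x for x in xs if x < lo_fence or x > hi_fence)
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical measurements with one extreme value.
values = [2, 3, 4, 5, 6, 7, 8, 9, 30]
box = box_plot_stats(values)
print(box)
```

Here the upper whisker ends at the largest value within 1.5 x (Q3 - Q1) of the box, and the extreme value is reported separately, matching the "plots all data outside whiskers" annotation.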
14
Time Series Example 1
annual fees introduced in UK (many users cut back
to 1 credit card)
15
Time Series Example 2
summer bifurcations in air travel (favor
early/late)
summer peaks
steady growth trend
New Year bumps
16
Time-Series Example 3
mean weight vs mean age for 10k control group
Scotland experiment: does milk in kids' diet lead to better
health? 20,000 kids: 5k raw, 5k pasteurized,
10k control (no supplement)
Possible explanations: Grow less early in year
than later? But there are no steps in the height plots, so
why would height increase uniformly and weight increase in
spurts? Kids weighed in clothes: summer garb lighter than
winter?
Would expect smooth weight growth plot. Visually
reveals unexpected pattern (steps), not apparent
from raw data table.
17
Non-Stationarity
  • Stationarity (loose definition):
  • A probability distribution $p(x \mid t)$ is
    stationary with respect to $t$ if
    $p(x \mid t) = p(x)$ for all $t$,
  • where $x$ is the set of variables of
    interest, and
  • $t$ is some other varying quantity
    (e.g., usually $t$ = time, but could
    represent spatial information, group information,
    etc.)
  • Examples:
  • p(customer demographics today) = p(customer
    demographics next month)?
  • p(weights in Scotland) = p(weights in US)?
  • p(income of customers in Bank 1) = p(income of
    customers in Bank 2)?
  • Non-stationarity is common in real data sets
  • Solutions?
  • Model the non-stationarity (e.g., increasing trend over
    time) and extrapolate
  • Build model only on most recent/most similar data
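
One simple way to act on the first suggestion is to fit a straight-line trend and extrapolate it. The Python sketch below (standard library only) does this with ordinary least squares on a made-up monthly series; the function name fit_line and the variables months and sales are hypothetical, and a linear trend is only one possible model of how the distribution drifts.

```python
def fit_line(ts, ys):
    """Ordinary least-squares fit of y = intercept + slope * t."""
    n = len(ts)
    t_bar, y_bar = sum(ts) / n, sum(ys) / n
    sxx = sum((t - t_bar) ** 2 for t in ts)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * t_bar

# Hypothetical series: steady growth of 5 per month plus a small wiggle.
months = list(range(12))
sales = [100 + 5 * t + (1 if t % 2 else -1) for t in months]

# Crude non-stationarity check: the mean shifts between the two halves.
first, second = sales[:6], sales[6:]
drift = sum(second) / len(second) - sum(first) / len(first)

# Model the trend and extrapolate one step beyond the data.
slope, intercept = fit_line(months, sales)
forecast = intercept + slope * 12
print(f"drift between halves: {drift:.1f}")
print(f"slope: {slope:.3f}, forecast for month 12: {forecast:.1f}")
```

The second suggestion on the slide corresponds to fitting on only the last few months instead of the whole series.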

18
Exploratory Data Analysis
  • Tools for Displaying Pairs of Variables

19
2D Scatter Plots
  • standard tool to display relation between 2
    variables
  • e.g. y-axis: response, x-axis: suspected
    indicator
  • useful to answer:
  • x, y related?
  • no
  • linearly
  • nonlinearly
  • variance(y) depends on x?
  • outliers present?
  • MATLAB:
  • plot(X(1,:), X(2,:), '.')

credit card repayment: low-low, high-high
20
Scatter Plot: No apparent relationship
21
Scatter Plot: Linear relationship
22
Scatter Plot: Quadratic relationship
23
Scatter Plot: Homoscedastic
Variation of Y does not depend on X
24
Scatter Plot: Heteroscedastic
Variation in Y differs depending on the value of
X, e.g., Y = annual tax paid, X = income
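
A numerical counterpart to eyeballing the scatter plot is to fit a line and compare the spread of the residuals at low and high X. The Python sketch below (standard library only) uses synthetic data generated so that the noise grows with X; the data, the names income and tax, and the 20% rate are all assumptions for illustration, not figures from the lecture.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Synthetic data (assumption): noise standard deviation is proportional to X.
rng = random.Random(0)
income = [rng.uniform(10, 100) for _ in range(2000)]
tax = [0.2 * x + rng.gauss(0, 0.05 * x) for x in income]

slope, intercept = fit_line(income, tax)
# Residuals sorted by X, then split into a low-X half and a high-X half.
residuals = sorted((x, y - (intercept + slope * x)) for x, y in zip(income, tax))
half = len(residuals) // 2
spread_low = statistics.pstdev([r for _, r in residuals[:half]])
spread_high = statistics.pstdev([r for _, r in residuals[half:]])
print(f"residual spread, low X: {spread_low:.2f}, high X: {spread_high:.2f}")
```

Roughly equal spreads would suggest the homoscedastic case; a clearly larger spread in one half suggests the heteroscedastic case.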
25
(from US Zip code data: each point = 1 Zip
code; units = dollars)
26
Problems with Scatter Plots of Large Data
appears: later apps older; reality: downward
slope (more apps, more variance)
96,000 bank loan applicants
scatter plot degrades into black smudge ...
27
Contour Plots Can Help
recall
(same 96,000 bank loan apps as before)
shows that variance(y) increasing with x is indeed due to
horizontal skew in density
unimodal
skewed (two regions marked on the plot)
28
Problems with Scatter Plots of Large Data
weeks in which credit card buys gas vs groceries (10,000
customers)
actual correlation (0.48) higher than appears
(overprinting)
also demands explanation
29
A simple data set
Data:
X: 10.00  8.00 13.00  9.00 11.00 14.00  6.00  4.00 12.00  7.00  5.00
Y:  8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26 10.84  4.82  5.68
Anscombe, Francis (1973), "Graphs in Statistical
Analysis," The American Statistician, 27(1), pp.
17-21.
30
A simple data set
Data:
X: 10.00  8.00 13.00  9.00 11.00 14.00  6.00  4.00 12.00  7.00  5.00
Y:  8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26 10.84  4.82  5.68
Summary Statistics:
  • N = 11
  • Mean of X = 9.0
  • Mean of Y = 7.5
  • Intercept = 3
  • Slope = 0.5
  • Residual standard deviation = 1.237
  • Correlation = 0.816
31
A simple data set
Data:
X: 10.00  8.00 13.00  9.00 11.00 14.00  6.00  4.00 12.00  7.00  5.00
Y:  8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26 10.84  4.82  5.68
32
3 more data sets
   X2     Y2     X3     Y3     X4     Y4
10.00   9.14  10.00   7.46   8.00   6.58
 8.00   8.14   8.00   6.77   8.00   5.76
13.00   8.74  13.00  12.74   8.00   7.71
 9.00   8.77   9.00   7.11   8.00   8.84
11.00   9.26  11.00   7.81   8.00   8.47
14.00   8.10  14.00   8.84   8.00   7.04
 6.00   6.13   6.00   6.08   8.00   5.25
 4.00   3.10   4.00   5.39  19.00  12.50
12.00   9.13  12.00   8.15   8.00   5.56
 7.00   7.26   7.00   6.42   8.00   7.91
 5.00   4.74   5.00   5.73   8.00   6.89
33
Summary Statistics
Summary Statistics of Data Set 2: N = 11, Mean of
X = 9.0, Mean of Y = 7.5, Intercept = 3, Slope =
0.5, Residual standard deviation = 1.237,
Correlation = 0.816
34
Summary Statistics
Summary Statistics of Data Set 2: N = 11, Mean of
X = 9.0, Mean of Y = 7.5, Intercept = 3, Slope =
0.5, Residual standard deviation = 1.237,
Correlation = 0.816
Summary Statistics of Data Set 3: N = 11, Mean of
X = 9.0, Mean of Y = 7.5, Intercept = 3, Slope =
0.5, Residual standard deviation = 1.237,
Correlation = 0.816
Summary Statistics of Data Set 4: N = 11, Mean of
X = 9.0, Mean of Y = 7.5, Intercept = 3, Slope =
0.5, Residual standard deviation = 1.237,
Correlation = 0.816
35
Visualization really helps!
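
The claim that all four data sets share the same summary statistics can be checked directly. The Python sketch below (standard library only) recomputes them from the values tabulated on the slides; the function name summarize and the dictionary layout are illustrative choices, and the residual standard deviation is taken as sqrt(SSE / (n - 2)), which is an assumption about how the slide's 1.237 was obtained.

```python
import math

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Means, least-squares line, residual SD, and correlation for one data set."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return {
        "n": n,
        "mean_x": mx,
        "mean_y": my,
        "intercept": intercept,
        "slope": slope,
        "resid_sd": math.sqrt(sse / (n - 2)),   # assumed n - 2 denominator
        "corr": sxy / math.sqrt(sxx * syy),
    }

for k, (xs, ys) in quartet.items():
    s = summarize(xs, ys)
    print(k, {name: round(value, 3) for name, value in s.items()})
```

The printed rows agree to about two decimal places, yet the four scatter plots look completely different, which is the point of the example.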
36
Exploratory Data Analysis
  • Tools for Displaying More than 2 Variables

37
Multivariate Visualization
  • Multivariate -> multiple variables
  • 2 variables: scatter plots, etc.
  • 3 variables:
  • 3-dimensional plots
  • Look impressive, but often not used
  • Can be cognitively challenging to interpret
  • Alternatives: overlay color-coding (e.g.,
    categorical data) on 2d scatter plot
  • 4 variables:
  • 3d with color or time
  • Can be effective in certain situations, but
    tricky
  • Higher dimensions:
  • Generally difficult
  • Scatter plots, icon plots, parallel coordinates:
    all have weaknesses
  • Alternative: map data to lower dimensions,
    e.g., PCA or multidimensional scaling
  • Main problem: high-dimensional structure may not
    be apparent in low-dimensional views

38
Scatter Plot Matrix
For interactive visualization the concept of
linked plots is generally useful
39
Trellis Plot
Older
Younger
Male
Female
40
Using Icons to Encode Information, e.g., Star
Plots
  • Each star represents a single observation. Star
    plots are used to examine the relative values for
    a single data point
  • The star plot consists of a sequence of
    equi-angular spokes, called radii, with each
    spoke representing one of the variables.
  • Useful for small data sets with up to 10 or so
    variables
  • Limitations?
  • Small data sets, small dimensions
  • Ordering of variables may affect perception

Variables (one per spoke):
1 = Price
2 = Mileage (MPG)
3 = 1978 Repair Record (1 = Worst, 5 = Best)
4 = 1977 Repair Record (1 = Worst, 5 = Best)
5 = Headroom
6 = Rear Seat Room
7 = Trunk Space
8 = Weight
9 = Length
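
The geometry of a star plot is simple enough to sketch: each variable gets one equi-angular spoke, and the spoke length is the observation's value rescaled to [0, 1]. The Python code below (standard library only) computes the polygon vertices for one observation; the function name star_vertices, the min-max rescaling, and the four-variable example are illustrative assumptions, not the car data from the slide.

```python
import math

def star_vertices(values, lows, highs):
    """Return (x, y) vertices of the star polygon for one observation."""
    p = len(values)
    points = []
    for k, (v, lo, hi) in enumerate(zip(values, lows, highs)):
        # Spoke length: the value rescaled to [0, 1] over the variable's range.
        r = (v - lo) / (hi - lo) if hi > lo else 0.0
        angle = 2.0 * math.pi * k / p      # equi-angular spokes
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points

# Hypothetical observation with 4 variables, each ranging over 0..10.
values = [5.0, 10.0, 0.0, 7.5]
vertices = star_vertices(values, lows=[0.0] * 4, highs=[10.0] * 4)
for k, (x, y) in enumerate(vertices, start=1):
    print(f"spoke {k}: ({x:+.2f}, {y:+.2f})")
```

Changing the order of the variables changes the shape of the polygon, which is the perception issue noted in the limitations.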
41
Chernoff's Faces
  • described by ten facial characteristic
    parameters: head eccentricity, eye eccentricity,
    pupil size, eyebrow slant, nose size, mouth
    shape, eye spacing, eye size, mouth length and
    degree of mouth opening
  • Chernoff faces applet:
  • http://people.cs.uchicago.edu/wiseman/chernoff/
  • more icon plots:
  • http://www.statsoft.com/textbook/glosi.html

42
Parallel Coordinates
(epileptic seizure data from text)
1 (of n) cases
dimensions (possibly all p of them!) often
(re)ordered to better distinguish among
interesting subsets of n total cases
(this case is a brushed one, with a darker
line, to stand out from the n-1 other cases)
interactive brushing is useful for seeing such
distinctions
43
More elaborate parallel coordinates example (from
E. Wegman, 1999). 12,000 bank customers with 8
variables. Additional dependent variable is
profit (green for positive, red for negative).
44
Interactive Grand Tour Techniques
  • Grand Tour idea:
  • Cycle continuously through multiple projections
    of the data
  • Cycles through all possible projections
    (depending on time constraints)
  • Projections can be 1, 2, or 3d typically (often 2d)
  • Can link with scatter plot matrices (see
    following example)
  • Asimov (1985)
  • e.g. XGOBI visualization package (available on
    the Web)
  • http://public.research.att.com/stat/xgobi/
  • Example on following 2 slides:
  • 7-dimensional physics data, color-coded by group,
    shown with
  • Standard scatter matrix
  • static snapshot of grand tour
45
(No Transcript)
46
(No Transcript)
47
Example of displaying 4d categorical data,
e.g., as used in OLAP/databases
48
Other aspects (not discussed)
  • Cognitive and human-factors aspects of
    visualization
  • In creating visualizations of data it is
    important to be aware of how the human brain
    perceives visual information
  • E.g., Rules and principles of scientific data
    visualization:
  • http://www.siggraph.org/education/materials/HyperVis/percept/visrules.htm
  • Artistic aspects of visualization
  • Classic books by Edward Tufte:
    http://www.edwardtufte.com/tufte/
  • Visualization of other data
  • 2d, 3d, 4d volume data (fluid flow, brain
    images, etc.)
  • Network/graph data
  • Issues: graph layout/drawing, issues of graph
    size
  • Many others, e.g.,
  • http://www.cybergeography.org/
  • CHI conference, etc.

49
Visualization of weather states for Kenya
Daily data from 20-year history clustered into 3
different weather states
Mean image for each state:
  • wind direction (arrows)
  • wind intensity (size of arrows)
  • rainfall (size of circles)
  • pressure (contours)
S. Kirshner, A. Robertson, P. Smyth, 2004.
50
Summary of 189k newsgroups and 257M
postings. Green: increase in postings in 2004
over 2003; red: decrease. Uses treemap
technique. Details at
http://jcmc.indiana.edu/vol10/issue4/turner.html
51
Time-line Visualization of Research Fronts
(Morris et al., JASIST, 2003)
52
Interactive MultiTile Visualization
(Falko Kuester's HIPerWall
system, Calit2, UCI)
53
Summary
  • EDA and Visualization
  • Can be very useful for:
  • data checking
  • getting a general sense of individual or pairs of
    variables
  • But:
  • do not necessarily reveal structure in high
    dimensions
  • Reading: Chapter 3
  • Next lecture:
  • projecting/mapping data from high dimensions to
    low dimensions