Visualizing and Exploring Data - PowerPoint PPT Presentation

Provided by: Jir89

1
Visualizing and Exploring Data
  • Summary statistics for data (mean, median, mode,
    quartile, variance, skewness)
  • Distribution of values for single variables
    (histogram, smoothing, box and whisker plot)
  • Relationships between pairs of variables
    (scatterplot, contour plot)
  • Relationship between multiple variables
    (scatterplot matrix, trellis plotting, star
    icons, parallel coordinates)
  • Projection pursuit methods (principal component
    analysis)
  • Parallel coordinates plots

2
Summarizing data
  • Mean: $\mu = \frac{1}{n} \sum_i x_i$
  • Median: the value that has an equal number of data
    points above and below it.
  • Quartile: the first quartile is the value that is
    greater than a quarter of the data points.
  • Variance: $\sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2$
  • Skewness: measures whether or not a distribution
    has a single long tail:
    $\sum_i (x_i - \mu)^3 / \left( \sum_i (x_i - \mu)^2 \right)^{3/2}$
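
The formulas on this slide can be checked on a small sample. The following is a minimal sketch, not part of the original presentation: the function name `summary` and the sample values are made up for illustration, and the first quartile uses the default convention of Python's `statistics.quantiles`, which is only one of several in use.

```python
import statistics

def summary(xs):
    """Summary statistics using the 1/n formulas from the slide."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    m3 = sum((x - mean) ** 3 for x in xs)  # sum of cubed deviations
    return {
        "mean": mean,
        "median": statistics.median(xs),
        "mode": statistics.mode(xs),
        "first_quartile": statistics.quantiles(xs, n=4)[0],
        "variance": m2 / n,
        "skewness": m3 / m2 ** 1.5,  # positive for a long right tail
    }

data = [1, 2, 2, 3, 12]  # illustrative sample with one large value
stats = summary(data)
for name, value in stats.items():
    print(name, value)
```

The single large value pulls the mean above the median and gives a positive skewness, which is the "single long tail" the slide refers to.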

3
Histogram (Microsoft Excel)

4
Smoothing estimates
  • The contribution of a data point $x_i$ to the
    estimate at some point $x$ depends on $K((x - x_i)/h)$.
  • $K()$: kernel function with $\int K(t)\,dt = 1$, e.g. the
    normal (Gaussian) density
  • $h$: bandwidth
  • Estimated density at point $x$:
    $f(x) = \frac{1}{n} \sum_i K((x - x_i)/h)$
  • Example (Xgobi koule.txt -var2)
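
A kernel estimate of this kind can be sketched in a few lines. This is an illustration under assumptions, not the method used by Xgobi: the names `gaussian_kernel` and `kde` and the sample values are hypothetical, and the code divides by an extra factor h (a common convention that is not in the slide's formula) so that the estimate itself integrates to 1.

```python
import math

def gaussian_kernel(t):
    """Standard normal density; integrates to 1."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Kernel density estimate at x with bandwidth h.

    Assumption: the sum is divided by n*h rather than n, so that the
    estimated density integrates to 1.
    """
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

data = [1.0, 1.5, 2.0, 4.0, 4.2]  # illustrative sample
for h in (0.2, 1.0):
    # A small h gives a spiky estimate, a large h a smooth one.
    print(h, [round(kde(x, data, h), 3) for x in (1.5, 3.0, 4.0)])
```

Comparing the two bandwidths shows the usual trade-off: with a small h the gap between the two clusters is almost empty, while a large h smears mass into it.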

5
Box and whisker plots
  • Upper and lower boundaries of each box represent
    the first and third quartiles.
  • Horizontal line within each box represents the
    median.
  • The whiskers extend 1.5 times the interquartile
    range from the end of each box.
  • All data points outside the whiskers are plotted
    individually.
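
The quantities behind a box and whisker plot can be computed directly. The sketch below is illustrative only: `box_stats` is a hypothetical helper, the quartiles use the "inclusive" method of `statistics.quantiles` (other conventions give slightly different values), and the sample is made up. Many plotting tools draw each whisker only as far as the most extreme data point inside the 1.5 × IQR limit.

```python
import statistics

def box_stats(xs, whisker=1.5):
    """Quartiles, whisker limits and individually plotted points."""
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range = height of the box
    low_limit = q1 - whisker * iqr
    high_limit = q3 + whisker * iqr
    outliers = [x for x in xs if x < low_limit or x > high_limit]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "low_limit": low_limit,
        "high_limit": high_limit,
        "outliers": outliers,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # illustrative sample
box = box_stats(data)
print(box)
```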






6
Scatterplot
  • Two variables at a time
  • One point for each data record
  • Example (Xgobi koule.txt)
  • Scatterplots can reveal anomalies and
    shortcomings in data. Example: changes in the
    measured weight of children in summer and winter
    periods.
  • Problems:
      • With many points we may get a solid black
        rectangle.
      • Overprinting can conceal the strength of
        correlation. A solution is the contour plot,
        with contour lines like those on a topographic map.
      • Only two-dimensional.
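
One simple way to get around overprinting is to count points per grid cell and draw contours (or shading) from the counts instead of plotting every point. The sketch below shows only the counting step; `bin_counts`, the grid size and the points are hypothetical, and a real contour plot would usually be drawn from a smoothed density rather than raw counts.

```python
def bin_counts(points, x_range, y_range, bins):
    """Count points per cell of a bins x bins grid over the given ranges."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # ignore points outside the plotted area
        # Map each coordinate to a cell index; the upper edge goes to the last cell.
        i = min(int((x - x0) / (x1 - x0) * bins), bins - 1)
        j = min(int((y - y0) / (y1 - y0) * bins), bins - 1)
        grid[j][i] += 1
    return grid

points = [(0.1, 0.1), (0.12, 0.11), (0.9, 0.9), (0.5, 0.5)]  # illustrative
grid = bin_counts(points, (0.0, 1.0), (0.0, 1.0), bins=2)
for row in grid:
    print(row)
```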

7
More than two variables
  • Scatterplot matrix: multivariate data are
    projected into two-dimensional plots, one per pair
    of variables (all other variables are ignored).
    Example: Crystal Vision pollen.data
  • Trellis plot: a series of scatterplots conditioned
    on levels of one or more other variables
  • Brushing: highlights corresponding points across
    the plots
  • Star icons: different directions from the origin
    correspond to different variables; the lengths
    correspond to the magnitudes.
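
The geometry of a star icon is easy to compute. In this sketch (an illustration, not taken from any particular plotting package) variable k of p is assigned the direction 2πk/p, and each ray length is the value divided by that variable's maximum; both the helper name `star_vertices` and this scaling are assumptions.

```python
import math

def star_vertices(values, maxima):
    """Return the (x, y) end point of each ray of one star icon."""
    p = len(values)
    vertices = []
    for k, (value, vmax) in enumerate(zip(values, maxima)):
        angle = 2.0 * math.pi * k / p  # one direction per variable
        radius = value / vmax          # length encodes the magnitude
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices

# One data case with three variables, each measured on a 0..4 scale (made up).
icon = star_vertices([2.0, 1.0, 4.0], [4.0, 4.0, 4.0])
print(icon)
```

Connecting the end points in order gives the outline that makes cases with similar profiles look alike.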

8
Interactive graphics
  • Rotating directions of projections in search of
    structure:
      • Random rotations
      • Manual rotations. Example (Xgobi koule)
  • Projection pursuit methods: letting the computer
    search for interesting directions using a
    criterion. Example (Xgobi krychle)
      • A special case: principal component analysis

9
Principal component analysis
  • Assumption: the data lie in a two-dimensional linear
    subspace spanned by linear combinations of the
    measured variables.
  • Criterion for an interesting direction: a plane for
    which the sum of squared distances between the
    data points and their projections onto this plane
    is minimized.
  • Solution in polynomial time: the plane is spanned
    by
      • the linear combination of variables that has
        maximum sample variance, and
      • the linear combination that has maximum variance
        subject to being uncorrelated with the first
        linear combination.

10
Principal component analysis
  • $X$: $n \times p$ data matrix, rows are data cases
  • $a$: $p \times 1$ column vector of projection weights
  • $a^T x$: projection of a vector $x$
  • $Xa$: projected values of all data vectors
  • $\sigma_a^2 = (Xa)^T (Xa) = a^T V a$: variance along $a$,
    with $V = X^T X$ for mean-centered $X$
  • Maximize the variance under the normalization constraint
    $a^T a = 1$, i.e. maximize $a^T V a - \lambda (a^T a - 1)$.
    This reduces to the eigenvalue form $(V - \lambda I) a = 0$.
  • The first principal component $a$ is the
    eigenvector associated with the largest
    eigenvalue $\lambda$. The second principal component
    is the eigenvector associated with the second
    largest eigenvalue, etc.
  • Scree plot: amount of variance explained by each
    consecutive eigenvalue
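
The eigenvalue formulation can be tried out on a tiny data set. The sketch below is one possible illustration, not the presentation's own code: it finds the leading eigenvectors of V = XᵀX by power iteration with deflation (a simple technique chosen here to avoid dependencies; in practice a library eigen-solver would normally be used). The helper names and the data are made up.

```python
import math
import random

def center(X):
    """Subtract each column mean so that X^T X measures variance."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def scatter(X):
    """V = X^T X for an already centered n x p matrix X."""
    p = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]

def top_eigenpair(V, iters=500):
    """Power iteration: largest eigenvalue of V and a unit eigenvector."""
    p = len(V)
    rng = random.Random(0)  # fixed seed, so the result is reproducible
    a = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(iters):
        b = [sum(V[i][j] * a[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in b))
        if norm == 0.0:
            break
        a = [v / norm for v in b]  # keeps the constraint a^T a = 1
    lam = sum(a[i] * V[i][j] * a[j] for i in range(p) for j in range(p))
    return lam, a

def pca(X, k):
    """First k (eigenvalue, component) pairs of V = X^T X."""
    V = scatter(center(X))
    p = len(V)
    total = sum(V[i][i] for i in range(p))  # trace = total variance
    pairs = []
    for _ in range(k):
        lam, a = top_eigenpair(V)
        pairs.append((lam, a))
        # Deflation: remove the found direction so the next one dominates.
        V = [[V[i][j] - lam * a[i] * a[j] for j in range(p)] for i in range(p)]
    return pairs, total

X = [[2.0, 1.9], [0.0, 0.1], [-2.0, -2.1], [1.0, 1.2], [-1.0, -1.1]]
(pairs, total) = pca(X, 2)
(lam1, a1), (lam2, a2) = pairs
explained = [lam1 / total, lam2 / total]  # the heights in a scree plot
print(a1, explained)
```

Because the illustrative points lie almost on a line, the first component carries nearly all of the variance and the scree values drop off sharply after it.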

11
Example (Huba et al. 1981)
  • Data on 1684 students in LA showing consumption
    of 13 legal and illegal psychoactive substances
  • The weights of the first principal component
    were: cigarettes 0.278, beer 0.286, wine 0.265,
    spirits 0.318, cocaine 0.208, tranquilizers
    0.293, medications 0.176, heroin 0.202,
    marijuana 0.339, hashish 0.329, inhalants
    0.276, hallucinogens 0.248, amphetamines 0.329.
    This component measures how often students use
    psychoactive substances, regardless of which
    substance they use.
  • The weights of the second principal component
    were: 0.280, 0.396, 0.392, 0.325, -0.288,
    -0.259, -0.189, -0.315, 0.163,
    -0.050, -0.169, -0.329, -0.232.
    It gives positive weights to legal
    substances and negative weights to illegal ones.
    Once the overall substance use is controlled for, the
    major difference lies in legal versus illegal
    use.
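
The quoted weights can be checked against the properties derived on the previous slide: each component should have unit length, and the two should be orthogonal. The sketch below assumes that the second list of weights follows the same substance order as the first (the transcript does not say so explicitly); the variable names are illustrative.

```python
import math

substances = ["cigarettes", "beer", "wine", "spirits", "cocaine",
              "tranquilizers", "medications", "heroin", "marijuana",
              "hashish", "inhalants", "hallucinogens", "amphetamines"]
pc1 = [0.278, 0.286, 0.265, 0.318, 0.208, 0.293, 0.176,
       0.202, 0.339, 0.329, 0.276, 0.248, 0.329]
# Assumption: same substance order as pc1.
pc2 = [0.280, 0.396, 0.392, 0.325, -0.288, -0.259, -0.189,
       -0.315, 0.163, -0.050, -0.169, -0.329, -0.232]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

norm1 = math.sqrt(dot(pc1, pc1))  # should be close to 1 (a^T a = 1)
norm2 = math.sqrt(dot(pc2, pc2))
overlap = dot(pc1, pc2)           # should be close to 0 (orthogonal)

# Substances with a positive weight on the second component.
positive = [s for s, w in zip(substances, pc2) if w > 0]
print(round(norm1, 3), round(norm2, 3), round(overlap, 3), positive)
```

Under that ordering assumption, listing the positively weighted substances also shows where the legal versus illegal reading of the second component is only approximate.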

12
General Multidimensional Scaling
  • A crumpled piece of paper is two-dimensional, but
    principal components analysis would fail on it.
  • The goal of scaling methods: preserving distances
    in a lower-dimensional space
  • Methods differ in
      • the distances that are to be preserved, $\delta_{jk}$
      • the distances they map to, $d_{jk}$
      • how the calculations are performed

13
General Multidimensional Scaling
  • The most common distance measure is the Euclidean metric.
  • A common score function is
    stress $= \left( \sum_j \sum_k (\delta_{jk}^2 - d_{jk}^2)^2 / \sum_j \sum_k d_{jk}^2 \right)^{1/2}$
  • The methods may start from distances between data
    vectors (metric scaling) or from rank order or a
    monotonic relationship (non-metric scaling).
  • The methods can be iterative: 1) regression of
    distances and 2) minimization of the stress.
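
A stress score can be evaluated for any candidate low-dimensional configuration. The sketch below follows the stress formula as it reads on this slide (squared distances in the numerator); with `squared=False` it instead compares unsquared distances, which is the form of the widely used Kruskal stress-1. The function names and the three example points are made up for illustration.

```python
import math

def pairwise(points):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
             for v in points] for u in points]

def stress(delta, d, squared=True):
    """Stress of mapped distances d against original distances delta."""
    num = 0.0
    den = 0.0
    n = len(delta)
    for j in range(n):
        for k in range(n):
            if squared:
                diff = delta[j][k] ** 2 - d[j][k] ** 2  # form on the slide
            else:
                diff = delta[j][k] - d[j][k]            # Kruskal stress-1
            num += diff ** 2
            den += d[j][k] ** 2
    return math.sqrt(num / den)

original = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # collinear points in 2-D
delta = pairwise(original)

good = pairwise([(0.0,), (5.0,), (10.0,)])  # positions along the line
bad = pairwise([(0.0,), (3.0,), (6.0,)])    # simply dropping the y value

print(stress(delta, good), stress(delta, bad))
```

The first one-dimensional configuration reproduces every distance and has zero stress; the second shrinks them and is penalized, which is the quantity an iterative scaling method would try to reduce.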

14
Parallel coordinates plots
  • Variables as parallel axes
  • Each data case is a piecewise linear plot
    connecting the values of the case
  • Wegman, E. J. (1990), Hyperdimensional data
    analysis using parallel coordinates, J. American
    Statistical Association, 85, 664-675.
  • Example (Crystal Vision krychle.data)
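
The construction of a parallel coordinates plot can be sketched as follows. Each variable becomes one vertical axis, and each data case becomes a list of vertices to be joined by line segments. The helper name, the min-max rescaling of every axis to [0, 1], and the data are assumptions made for illustration; this is not how Crystal Vision is implemented.

```python
def parallel_coordinates(rows):
    """Return, for each data case, its polyline as (axis index, height) pairs."""
    p = len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(p)]
    hi = [max(r[j] for r in rows) for j in range(p)]
    lines = []
    for r in rows:
        line = []
        for j in range(p):
            span = hi[j] - lo[j]
            # Rescale to [0, 1]; a constant variable is drawn at mid-axis.
            y = (r[j] - lo[j]) / span if span else 0.5
            line.append((j, y))
        lines.append(line)
    return lines

rows = [[1.0, 10.0, 5.0], [2.0, 30.0, 5.0], [3.0, 20.0, 5.0]]  # made-up cases
lines = parallel_coordinates(rows)
for line in lines:
    print(line)
```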