Data Basics - PowerPoint PPT Presentation

About This Presentation

Title:

Data Basics

Description:

Data Basics The following list of s from Andrew Moore Measures of Dispersion (Inter-Quartile Range) Inter-Quartile Range (IQR): More robust Sample IQR: Measures ... – PowerPoint PPT presentation

Slides: 128

Provided by: temp208

Learn more at: https://www.cs.kent.edu

Transcript and Presenter's Notes

Title: Data Basics

1
Data Basics
2
Data Matrix

Many datasets can be represented as a data matrix.
Rows correspond to entities.
Columns represent attributes.
n: the size of the data (number of rows)
d: the dimensionality of the data (number of columns)
Univariate analysis: the analysis of a single attribute.
Bivariate analysis: the simultaneous analysis of two attributes.
Multivariate analysis: the simultaneous analysis of multiple attributes.
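
As a minimal sketch of the data matrix idea, the snippet below stores a tiny matrix as a list of rows and reads off its size and dimensionality. The attribute names and values are made up for illustration, and `column` is a hypothetical helper, not something from the slides.

```python
# A tiny, made-up data matrix: each row is an entity, each column an attribute.
attributes = ["sepal_length", "sepal_width"]  # illustrative names only
D = [
    [5.9, 3.0],
    [6.9, 3.1],
    [6.6, 2.9],
    [4.6, 3.2],
]

n = len(D)      # size of the data (number of rows)
d = len(D[0])   # dimensionality of the data (number of columns)


def column(matrix, j):
    """Return attribute j as a list of n values."""
    return [row[j] for row in matrix]


# Univariate analysis looks at one column at a time.
x1 = column(D, 0)

# Bivariate analysis looks at two columns together, as n pairs.
x1_x2 = list(zip(column(D, 0), column(D, 1)))
```

Multivariate analysis would simply work with all d columns of `D` at once.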

3
Example for Data Matrix
4
Attributes

Categorical Attributes
Composed of a set of symbols
Has a set-valued domain
E.g., Sex with domain(Sex) = {M, F}; Education with domain(Education) = {High School, BS, MS, PhD}.
Two types of categorical attributes
Nominal
values in the domain are unordered
Only equality comparisons are allowed
E.g. Sex
Ordinal
Values are ordered
Both equality and inequality comparisons are
allowed
E.g. Education

5
Attributes Cont.

Numeric Attributes
Has a real-valued or integer-valued domain
E.g., Age with domain(Age) = N, where N denotes the set of natural numbers (non-negative integers).
Two types of numeric attributes
Discrete: values come from a finite or countably infinite set.
Continuous: values can be any real value.
Another classification
Interval-scaled
Only differences between values are meaningful
E.g., temperature.
Ratio-scaled
Both differences and ratios are meaningful
E.g., Age

6
Algebraic View of Data

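If the d attributes in the data matrix D are all numeric,
each row can be considered as a d-dimensional point $x_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$,
or equivalently, each row may be considered a d-dimensional column vector.
Each such vector is a linear combination of the standard basis vectors: $x_i = \sum_{j=1}^{d} x_{ij} e_j$.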
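
The linear-combination view can be checked with a short sketch: a row of the data matrix is rebuilt from the standard basis vectors, using the row's own entries as coefficients. The function names and the sample point are illustrative assumptions.

```python
def standard_basis(d):
    """Return e_1, ..., e_d: e_j has a 1 in position j and 0 elsewhere."""
    return [[1.0 if i == j else 0.0 for i in range(d)] for j in range(d)]


def linear_combination(coeffs, vectors):
    """Return sum_j coeffs[j] * vectors[j], computed coordinate by coordinate."""
    d = len(vectors[0])
    return [sum(c * v[i] for c, v in zip(coeffs, vectors)) for i in range(d)]


# A made-up 4-dimensional point (one row of a numeric data matrix).
x = [5.9, 3.0, 4.2, 1.5]
e = standard_basis(len(x))

# Using the coordinates of x as coefficients reproduces x itself.
rebuilt = linear_combination(x, e)
```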

7
Example of Algebraic View of Data
8
Geometric View of Data
9
Distance and Angle
12
Example of Distance and Angle
13
Mean and Total Variance
14
Centered Data Matrix

The centered data matrix is obtained by
subtracting the mean from all the points
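
Centering can be sketched in a few lines: compute the mean of each column, then subtract it from every row. The data values are made up, and the helper names are assumptions for illustration.

```python
def column_means(D):
    """Mean of each attribute (column) of the data matrix D."""
    n = len(D)
    return [sum(row[j] for row in D) / n for j in range(len(D[0]))]


def center(D):
    """Subtract the mean vector from every point (row) of D."""
    mu = column_means(D)
    return [[x - m for x, m in zip(row, mu)] for row in D]


D = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
Z = center(D)   # centered data matrix
```

After centering, every column of `Z` has mean zero, which is what makes the centered matrix convenient for computing variances and covariances.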

15
Orthogonality

Two vectors a and b are said to be orthogonal if and only if $a^T b = 0$.
This implies that the angle between them is 90° or $\pi/2$ radians.
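
A small sketch of the orthogonality test, assuming the usual dot product and Euclidean norm: the angle comes from the cosine formula, cos(theta) = a·b / (|a| |b|). The two vectors are made-up examples.

```python
import math


def dot(a, b):
    """Dot product a^T b."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a."""
    return math.sqrt(dot(a, a))


def angle(a, b):
    """Angle between a and b in radians, from the cosine formula."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)


a = [1.0, 2.0]
b = [-2.0, 1.0]

is_orthogonal = dot(a, b) == 0.0   # exact here because the products are small integers
theta = angle(a, b)                # pi/2 radians for orthogonal vectors
```

For general floating-point data, comparing the dot product against a small tolerance is safer than testing for exact zero.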

16
Orthogonal Projection
p: the orthogonal projection of b onto the vector a
$r = b - p$: the error vector between points b and p
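
The projection can be sketched with the standard formula p = (a·b / a·a) a, with the error vector taken as b minus p. The vectors are made-up values chosen so the arithmetic is easy to follow.

```python
def dot(a, b):
    """Dot product a^T b."""
    return sum(x * y for x, y in zip(a, b))


def project(b, a):
    """Orthogonal projection of b onto a: (a.b / a.a) * a."""
    scale = dot(a, b) / dot(a, a)
    return [scale * x for x in a]


a = [2.0, 1.0]
b = [1.0, 3.0]

p = project(b, a)                      # component of b along a
r = [bi - pi for bi, pi in zip(b, p)]  # error vector, perpendicular to a
```

The check `dot(r, a) == 0` confirms that the error vector is orthogonal to a, which is what makes p the closest point to b on the line through a.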
17
Example of Projection
18
Linear Independence and Dimensionality

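$\mathrm{span}(v_1, \ldots, v_k)$: the set of all possible linear combinations of the vectors $v_1, \ldots, v_k$.
If $\mathrm{span}(v_1, \ldots, v_k) = \mathbb{R}^m$, then we say that $v_1, \ldots, v_k$ is a spanning set for $\mathbb{R}^m$.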

19
Row and Column Space

The column space of D, denoted col(D) is the set
of all linear combinations of the d column
vectors or attributes
The row space of D, denoted row(D), is the set of
all linear combinations of the n row vectors or
points
Note also that the row space of D is the column space of $D^T$.

20
Linear Independence
21
Dimension and Rank

Let S be a subspace of $\mathbb{R}^m$.
Basis for S: a set of linearly independent vectors $v_1, \ldots, v_k$ such that $\mathrm{span}(v_1, \ldots, v_k) = S$.
Orthogonal basis for S: a basis whose vectors are pairwise orthogonal.
If in addition they are normalized to be unit vectors, they make up an orthonormal basis for S.
For instance, the standard basis for $\mathbb{R}^m$ is an orthonormal basis consisting of the vectors $e_1 = (1, 0, \ldots, 0)^T, \; \ldots, \; e_m = (0, 0, \ldots, 1)^T$.

Any two bases for S must have the same number of vectors.
Dimension: the number of vectors in a basis for S, denoted dim(S).
For any matrix, the dimensions of its row space and column space are the same, and this dimension is called the rank of the matrix.

23
Data: Probabilistic View

Assumes that each numeric attribute $X_j$ is a random variable, defined as a function that assigns a real number to each outcome of an experiment.
Given as $X_j : \mathcal{O} \to \mathbb{R}$, where $\mathcal{O}$, the domain of $X_j$, is called the sample space, and $\mathbb{R}$, the range of $X_j$, is the set of real numbers.
If the outcomes are numeric and represent the observed values of the random variable, then $X_j : \mathcal{O} \to \mathcal{O}$ is simply the identity function $X_j(v) = v$ for all $v \in \mathcal{O}$.

24
Data: Probabilistic View

A random variable X is called a discrete random
variable if it takes on only a finite or
countably infinite number of values in its range.
X is called a continuous random variable if it
can take on any value in its range.

25
Example

By default, consider the attribute $X_1$ to be a continuous random variable, given as the identity function $X_1(v) = v$, since the outcomes are all numeric.
On the other hand, if we want to distinguish between iris flowers with short and long sepal lengths, we define a discrete random variable A as follows: $A(v) = 0$ if $v < 7$, and $A(v) = 1$ if $v \geq 7$.
In this case the domain of A is $[4.3, 7.9]$. The range of A is $\{0, 1\}$, and thus A assumes non-zero probability only at the discrete values 0 and 1.

27
Example Bernoulli and Binomial Distribution

Only 13 irises have a sepal length of at least 7 cm.
In this case we say that A has a Bernoulli distribution with parameter $p \in [0, 1]$. p denotes the probability of a success, whereas $1 - p$ represents the probability of a failure.
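
The long/short indicator and its Bernoulli distribution can be sketched as follows. Two things here are assumptions rather than statements from the slide: the cut-off is taken to be 7 cm (matching the "at least 7cm" count), and the sample is taken to be the standard 150-flower Iris dataset, which gives p = 13/150.

```python
def A(v):
    """Indicator of a long sepal: 1 if the length v (in cm) is at least 7, else 0."""
    return 1 if v >= 7.0 else 0


def bernoulli_pmf(x, p):
    """P(A = x) for a Bernoulli variable: p if x == 1, 1 - p if x == 0."""
    if x not in (0, 1):
        return 0.0
    return p if x == 1 else 1.0 - p


# Assumption: 13 long-sepal flowers out of 150 in total.
n_long, n_total = 13, 150
p = n_long / n_total   # empirical probability of a "success" (long sepal)
```

Rounded to three decimals, this estimate agrees with the value p = 0.087 quoted on the next slide.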

28
Example Bernoulli and Binomial Distribution

Let us consider another discrete random variable
B, denoting the number of irises with long sepal
lengths in m independent Bernoulli trials with
probability of success p.
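B takes on the discrete values $0, 1, \ldots, m$, and its probability mass function is given by the Binomial distribution $f(k) = P(B = k) = \binom{m}{k} p^k (1 - p)^{m - k}$.
For example, taking $p = 0.087$ from above, the probability of observing exactly $k = 2$ long sepal length irises in $m = 10$ trials is given as $f(2) = \binom{10}{2} (0.087)^2 (0.913)^8 \approx 0.164$.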
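
The binomial example can be reproduced with a short sketch using the standard binomial formula and the values p = 0.087, m = 10, k = 2 from the slide. The function name is an illustrative choice.

```python
import math


def binomial_pmf(k, m, p):
    """P(B = k): probability of exactly k successes in m independent trials."""
    if k < 0 or k > m:
        return 0.0
    return math.comb(m, k) * p**k * (1.0 - p) ** (m - k)


p, m = 0.087, 10

# Probability of exactly two long-sepal irises in ten trials (about 0.164).
f2 = binomial_pmf(2, m, p)

# The probabilities over k = 0, ..., m sum to 1, as any PMF must.
total = sum(binomial_pmf(k, m, p) for k in range(m + 1))
```

Evaluating `binomial_pmf(k, m, p)` for every k from 0 to m gives the full probability mass function.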

29
full probability mass function for different
values of k
30
Probability Density Function

If X is continuous, its range is the entire set of real numbers $\mathbb{R}$.
The probability density function f specifies the probability that the variable X takes on values in any interval $[a, b] \subset \mathbb{R}$: $P(X \in [a, b]) = \int_a^b f(x)\,dx$.

31
Cumulative Distribution Function

For any random variable X, whether discrete or continuous, we can define the cumulative distribution function (CDF) $F : \mathbb{R} \to [0, 1]$, which gives the probability of observing a value at most some given value x: $F(x) = P(X \leq x)$.

45
Probability Density Function f(x)

What is $P(X = x)$ when x is on a real domain?
$f(x) \geq 0$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$

46
Normal Distribution

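Let us assume that these values follow a Gaussian or normal density function, given as $f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$.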
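
A minimal sketch of the normal density, assuming the standard parameterisation by mean mu and standard deviation sigma. The crude Riemann sum is only there to illustrate that the density integrates to roughly 1.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Density at the mean of a standard normal: 1 / sqrt(2 * pi).
peak = normal_pdf(0.0, 0.0, 1.0)

# Crude numerical integration over [-8, 8] with step 0.001.
step = 0.001
area = sum(normal_pdf(-8.0 + i * step, 0.0, 1.0) * step for i in range(16000))
```

Python's standard library offers the same density as `statistics.NormalDist(mu, sigma).pdf(x)`, which can serve as a cross-check.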

53
Bivariate Random Variables

considering a pair of attributes, X1 and X2, as a
bivariate random variable

57
In 2-Dimensions
78
Multivariate Random Variable
79
Multivariate Random Variable
80
Numeric Attribute Analysis

Sample and Statistics
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Normal Distribution

81
Random Sample and Statistics

Population is used to refer to the set or
universe of all entities under study.
However, looking at the entire population may not
be feasible, or may be too expensive.
Instead, we draw a random sample from the
population, and compute appropriate statistics
from the sample, that give estimates of the
corresponding population parameters of interest.

82
Univariate Sample

Let X be a random variable, and let $x_i$ ($1 \leq i \leq n$) denote the observed values of attribute X in the given data, where n is the data size.
Given a random variable X, a random sample of size n from X is defined as a set of n independent and identically distributed (IID) random variables $S_1, S_2, \ldots, S_n$.
Since the variables $S_i$ are all independent, their joint probability function is given as $f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i)$.

83
Multivariate Sample

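$x_i$: the value of a d-dimensional vector random variable $S_i = (X_1, X_2, \ldots, X_d)$.
The $S_i$ are independent and identically distributed, and thus their joint distribution is given as $f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i)$ \quad (1.43).
Assuming the d attributes $X_1, X_2, \ldots, X_d$ are independent, (1.43) can be rewritten as $f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \prod_{j=1}^{d} f_{X_j}(x_{ij})$.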

84
Statistic

Let $S_i$ denote the random variable corresponding to data point $x_i$; then a statistic $\hat{\theta}$ is a function $\hat{\theta} : (S_1, S_2, \ldots, S_n) \to \mathbb{R}$.
If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.

86
Numeric Attribute Analysis

Sample and Statistics
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Normal Distribution

87
Univariate Analysis
Univariate analysis focuses on a single attribute at a time, thus the data matrix D can be thought of as an $n \times 1$ matrix, or simply a column vector.
88
Univariate Analysis
X is assumed to be a random variable, and each point $x_i$ ($1 \leq i \leq n$) is assumed to be the value of a random variable $S_i$, where the variables $S_i$ are all independent and identically distributed as X, i.e., they constitute a random sample drawn from X. In the vector view, we treat the sample as an n-dimensional vector, and write $X \in \mathbb{R}^n$.
89
What can sample analysis do?

Unknown f(X) and F(X)
Parameters(µ,d)

90
Empirical Cumulative Distribution Function

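$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \leq x)$
where $I(x_i \leq x)$ is 1 if $x_i \leq x$ and 0 otherwise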
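
The empirical CDF is simply the fraction of sample points that are at most x. The sketch below uses a made-up sample, and `ecdf` is a hypothetical helper name.

```python
def ecdf(sample):
    """Return the empirical CDF of the sample as a function of x."""
    n = len(sample)

    def F_hat(x):
        # Fraction of observations that are less than or equal to x.
        return sum(1 for xi in sample if xi <= x) / n

    return F_hat


sample = [2, 1, 3, 3]
F_hat = ecdf(sample)
```

For this sample, `F_hat` is a step function that jumps by 1/4 at 1 and at 2, and by 2/4 at the repeated value 3.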

91
Inverse Cumulative Distribution Function
92
Empirical Probability Mass Function

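$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i = x)$
where $I(x_i = x)$ is 1 if $x_i = x$ and 0 otherwise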
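
The empirical PMF assigns each value its relative frequency in the sample. A minimal sketch with a made-up sample; `epmf` is a hypothetical helper name.

```python
from collections import Counter


def epmf(sample):
    """Return the empirical PMF of the sample as a function of x."""
    n = len(sample)
    counts = Counter(sample)

    def f_hat(x):
        # Relative frequency of x; values never observed get probability 0.
        return counts.get(x, 0) / n

    return f_hat


sample = [1, 2, 2, 3]
f_hat = epmf(sample)
```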

93
Measures of Central Tendency (Mean)

Population Mean

Sample Mean (Unbiased, not robust)
94
Measures of Central Tendency (Median)

Population Median

or
Sample Median
95
Measures of Central Tendency (Mode)

Sample Mode

May not be a very useful measure of central tendency,
but it is not affected much by outliers
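
The robustness remarks on the mean, median, and mode slides can be illustrated with Python's `statistics` module. The two samples are made up; the second replaces one value with an extreme outlier.

```python
import statistics

clean = [1, 2, 2, 3, 4]
with_outlier = [1, 2, 2, 3, 100]   # same sample, but the largest value is an outlier

# The sample mean is pulled strongly toward the outlier.
mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)

# The median and the mode do not move at all in this example.
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)
mode_clean = statistics.mode(clean)
mode_outlier = statistics.mode(with_outlier)
```

Here a single extreme value shifts the mean from 2.4 to 21.6 while the median and mode both stay at 2, which is the sense in which the mean is not robust.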

96
Example
97
Measures of Dispersion (Range)

Range

Sample Range

Not robust, sensitive to extreme values

98
Measures of Dispersion (Inter-Quartile Range)

Inter-Quartile Range (IQR)

Sample IQR

More robust
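
A sketch comparing the range and the IQR on made-up data. The quartiles are taken from the inverse empirical CDF (the smallest sample value whose empirical CDF is at least q); this is one of several common quantile conventions, so other tools may report slightly different quartiles.

```python
import math


def quantile(sample, q):
    """Inverse empirical CDF: smallest x in the sample with F_hat(x) >= q, 0 < q <= 1."""
    xs = sorted(sample)
    k = max(1, math.ceil(q * len(xs)))   # number of points needed to reach level q
    return xs[k - 1]


def sample_range(sample):
    """Largest value minus smallest value."""
    return max(sample) - min(sample)


def iqr(sample):
    """Inter-quartile range: third quartile minus first quartile."""
    return quantile(sample, 0.75) - quantile(sample, 0.25)


clean = [1, 2, 3, 4, 5, 6, 7, 8]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 800]   # one extreme value
```

On these samples the range grows from 7 to 799 when the outlier is introduced, while the IQR stays at 4.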

99
Measures of Dispersion (Variance and Standard Deviation)
Variance
Standard Deviation
100
Measures of Dispersion (Variance and Standard Deviation)
Variance
Standard Deviation
Sample Variance and Standard Deviation
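
A sketch of the sample variance and standard deviation on made-up data. The slide's own formula did not survive in the transcript, so the divisor is left as a parameter: `ddof=0` divides by n, and `ddof=1` divides by n - 1 (the unbiased version). Which of the two the slides use is not recoverable here.

```python
import math


def variance(xs, ddof=0):
    """Average squared deviation from the mean; divides by n - ddof."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / (n - ddof)


def std(xs, ddof=0):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(xs, ddof))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_n = variance(data)             # divides by n
std_n = std(data)
var_unbiased = variance(data, 1)   # divides by n - 1
```

The standard library functions `statistics.pvariance` and `statistics.variance` correspond to the `ddof=0` and `ddof=1` cases respectively.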
101
Normalization
Linear Normalization
Z-Score
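
Both normalizations can be sketched briefly. It is assumed here that "linear normalization" means rescaling to [0, 1] by the range (min-max scaling), and that the z-score uses the standard deviation computed with divisor n. The data are made up.

```python
import math


def range_normalize(xs):
    """Linear (min-max) normalization: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    """Z-score normalization: subtract the mean, then divide by the standard deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return [(x - mu) / sigma for x in xs]


ages = [10.0, 20.0, 30.0]
scaled = range_normalize(ages)
standardized = z_score(ages)
```

Both functions divide by a spread measure, so they fail with a division by zero on a constant attribute; real code would need to handle that case.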
102
Normalization Example
103
Topics

Sample and Statistics
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Normal Distribution

104
Bivariate Analysis
Bivariate analysis focuses on two attributes at a time, thus the data matrix D can be thought of as an $n \times 2$ matrix, or two column vectors.
105
Empirical Joint Probability Mass Function
or
where
106
Measures of Central Tendency (Mean)

Population Mean

Sample Mean
107
Measures of Association (Covariance)
Covariance
Sample Covariance
108
Measures of Association (Correlation)
Correlation
Sample Correlation
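
A sketch of the sample covariance and correlation for two attributes, on made-up data. The covariance here divides by n, which is an assumption since the slide's formula is not in the transcript; the correlation is unaffected by that choice because the divisor cancels.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Average product of deviations from the two means (divisor n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance scaled by both standard deviations; always in [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


x = [1.0, 2.0, 3.0, 4.0]
y_up = [2.0, 4.0, 6.0, 8.0]     # perfectly increasing with x
y_down = [8.0, 6.0, 4.0, 2.0]   # perfectly decreasing with x

cov_up = covariance(x, y_up)
rho_up = correlation(x, y_up)
rho_down = correlation(x, y_down)
```

A perfectly linear increasing relationship gives a correlation of 1 and a perfectly linear decreasing one gives -1, regardless of the units of either attribute.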
109
Measures of Association (Correlation)
110
Correlation Example
111
Topics

Sample and Statistic
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Normal Distribution

112
Multivariate Analysis
Multivariate analysis focuses on multiple attributes at a time, thus the data matrix D can be thought of as an $n \times d$ matrix, or d column vectors.
113
Measures of Central Tendency (Mean)

Population Mean

Sample Mean
114
Measures of Association (Covariance Matrix)
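
As a minimal sketch, a sample covariance matrix can be built from the centered data: entry (a, b) is the average product of the centered values of attributes a and b. The divisor n and the data values are assumptions for illustration.

```python
def covariance_matrix(D):
    """d x d matrix whose (a, b) entry is the covariance of attributes a and b."""
    n, d = len(D), len(D[0])
    mu = [sum(row[j] for row in D) / n for j in range(d)]
    # Center the data by subtracting the mean vector from each row.
    Z = [[row[j] - mu[j] for j in range(d)] for row in D]
    return [[sum(z[a] * z[b] for z in Z) / n for b in range(d)] for a in range(d)]


D = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
S = covariance_matrix(D)
```

The diagonal of `S` holds the variance of each attribute, and the matrix is symmetric because the covariance of a with b equals that of b with a.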
115
Measures of Association (Correlation)
Correlation
Sample Correlation
116
Topics

Sample and Statistic
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Normal Distribution

117
Univariate Normal Distribution
118
Multivariate Normal Distribution
119

Thank You!

