Title: Data Basics
1Data Basics
2Data Matrix
- Many datasets can be represented as a data matrix.
- Rows correspond to entities.
- Columns represent attributes.
- n: size of the data (number of rows)
- d: dimensionality of the data (number of columns)
- Univariate analysis: the analysis of a single attribute.
- Bivariate analysis: simultaneous analysis of two attributes.
- Multivariate analysis: simultaneous analysis of multiple attributes.
3Example for Data Matrix
4Attributes
- Categorical Attributes
  - Composed of a set of symbols
  - Has a set-valued domain
  - E.g., Sex with domain(Sex) = {M, F}; Education with domain(Education) = {High School, BS, MS, PhD}.
- Two types of categorical attributes
  - Nominal
    - Values in the domain are unordered
    - Only equality comparisons are allowed
    - E.g., Sex
  - Ordinal
    - Values are ordered
    - Both equality and inequality comparisons are allowed
    - E.g., Education
5Attributes Cont.
- Numeric Attributes
  - Has a real-valued or integer-valued domain
  - E.g., Age with domain(Age) = N, where N denotes the set of natural numbers (non-negative integers).
- Two types of numeric attributes
  - Discrete: values take on a finite or countably infinite set.
  - Continuous: values can take on any real value.
- Another Classification
  - Interval-scaled
    - For these attributes only differences make sense
    - E.g., temperature
  - Ratio-scaled
    - Both differences and ratios are meaningful
    - E.g., Age
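The attribute types above differ in which comparisons are meaningful. The following is a minimal sketch of that idea; the helper names (`ordinal_less`, `celsius_to_fahrenheit`) and the ordering stored in `EDUCATION_ORDER` are assumptions made for illustration, not part of the slides.

```python
# Assumed ordering of the Education domain, lowest to highest (illustrative).
EDUCATION_ORDER = ["High School", "BS", "MS", "PhD"]


def ordinal_less(a, b, order=EDUCATION_ORDER):
    """Inequality comparison; only meaningful for ordinal attributes."""
    return order.index(a) < order.index(b)


def celsius_to_fahrenheit(c):
    """Unit change for an interval-scaled attribute."""
    return c * 9 / 5 + 32


# Nominal (Sex): only equality comparisons make sense.
print("M" == "F")

# Ordinal (Education): order comparisons make sense as well.
print(ordinal_less("BS", "PhD"))

# Interval-scaled (temperature): the ratio changes with the unit,
# so "twice as hot" is not a meaningful statement.
ratio_c = 20 / 10
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)
print(ratio_c, ratio_f)

# Ratio-scaled (Age): the ratio is the same in years and in months.
print(40 / 20, (40 * 12) / (20 * 12))
```

The temperature ratio differs between Celsius and Fahrenheit, which is why only differences are treated as meaningful for interval-scaled attributes.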
6Algebraic View of Data
- If the d attributes in the data matrix D are all numeric:
  - each row can be considered as a d-dimensional point,
  - or equivalently, each row may be considered a d-dimensional column vector.
- Each such vector is a linear combination of the standard basis vectors: $x_i = \sum_{j=1}^{d} x_{ij} e_j$, where $e_j$ is the j-th standard basis vector.
7Example of Algebraic View of Data
8Geometric View of Data
9Distance and Angle
12Example of Distance and Angle
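As a rough illustration of distance and angle between two points, the sketch below assumes the usual Euclidean definitions: the distance is the norm of the difference vector, and the cosine of the angle is the dot product divided by the product of the norms. The vectors `a` and `b` are made-up values.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) norm."""
    return math.sqrt(dot(a, a))


def distance(a, b):
    """Euclidean distance: norm of the difference vector."""
    return norm([x - y for x, y in zip(a, b)])


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


# Hypothetical 2-dimensional points.
a = [3.0, 4.0]
b = [4.0, 3.0]

print(distance(a, b))
theta = math.acos(cosine(a, b))
print(math.degrees(theta))
```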
13Mean and Total Variance
14Centered Data Matrix
- The centered data matrix is obtained by
subtracting the mean from all the points
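A minimal sketch of centering, assuming the data matrix is stored as a list of rows and the mean is taken per attribute (per column). The matrix `D` below is a made-up example.

```python
def column_means(D):
    """Mean of each attribute (column) of the data matrix."""
    n = len(D)
    return [sum(col) / n for col in zip(*D)]


def center(D):
    """Subtract the mean vector from every point (row)."""
    mu = column_means(D)
    return [[x - m for x, m in zip(row, mu)] for row in D]


# Hypothetical 3 x 2 data matrix.
D = [[5.0, 3.0],
     [7.0, 1.0],
     [6.0, 5.0]]

Z = center(D)
print(column_means(D))
print(Z)
print(column_means(Z))  # the centered data has mean zero in every column
```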
15Orthogonality
- Two vectors a and b are said to be orthogonal if and only if $a^T b = 0$.
- It implies that the angle between them is 90° or π/2 radians.
16Orthogonal Projection
- p: orthogonal projection of b on the vector a
- r: error vector between points b and p
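The projection can be sketched as follows, assuming the standard formula $p = \frac{a^T b}{a^T a}\, a$ and $r = b - p$ for a non-zero vector a. The function name `project` and the example vectors are illustrative choices.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(b, a):
    """Orthogonal projection of b onto the (non-zero) vector a."""
    scale = dot(a, b) / dot(a, a)
    return [scale * x for x in a]


# Hypothetical vectors.
a = [2.0, 0.0]
b = [3.0, 4.0]

p = project(b, a)
r = [bi - pi for bi, pi in zip(b, p)]  # error vector

print(p, r)
print(dot(a, r))  # zero when r is orthogonal to a
```

The last line checks the defining property of the projection: the error vector r is orthogonal to a.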
17Example of Projection
18Linear Independence and Dimensionality
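- $\mathrm{span}(v_1, \dots, v_k)$: the set of all possible linear combinations of the vectors.
- If $\mathrm{span}(v_1, \dots, v_k) = \mathbb{R}^m$, then we say that $v_1, \dots, v_k$ is a spanning set for $\mathbb{R}^m$.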
19Row and Column Space
- The column space of D, denoted col(D), is the set of all linear combinations of the d column vectors or attributes.
- The row space of D, denoted row(D), is the set of all linear combinations of the n row vectors or points.
- Note also that the row space of D is the column space of $D^T$.
20Linear Independence
21Dimension and Rank
- Let S be a subspace of $\mathbb{R}^m$.
- A basis for S: a set of linearly independent vectors $v_1, \dots, v_k$ such that $\mathrm{span}(v_1, \dots, v_k) = S$.
- Orthogonal basis for S: the vectors in the basis are pairwise orthogonal.
- If in addition they are also normalized to be unit vectors, then they make up an orthonormal basis for S.
- For instance, the standard basis for $\mathbb{R}^m$ is an orthonormal basis consisting of the vectors $e_1 = (1, 0, \dots, 0)^T$, $e_2 = (0, 1, \dots, 0)^T$, ..., $e_m = (0, 0, \dots, 1)^T$.
22
- Any two bases for S must have the same number of vectors.
- Dimension: the number of vectors in a basis for S, denoted as dim(S).
- For any matrix, the dimensions of its row and column spaces are the same, and this dimension is also called the rank of the matrix.
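The statement that the row space and column space have the same dimension can be checked numerically on a small matrix. The sketch below computes the rank by Gaussian elimination with exact fractions; `matrix_rank`, `transpose`, and the matrix `D` are hypothetical names and values chosen for this illustration.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


def transpose(rows):
    """Swap rows and columns."""
    return [list(col) for col in zip(*rows)]


# Hypothetical matrix: the second row is twice the first.
D = [[1, 2, 3],
     [2, 4, 6],
     [1, 0, 1]]

print(matrix_rank(D), matrix_rank(transpose(D)))
```

Both calls report the same value, since the rank of D (dimension of the row space) equals the rank of its transpose (dimension of the column space).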
23Data Probabilistic View
- Assumes that each numeric attribute $X_j$ is a random variable, defined as a function that assigns a real number to each outcome of an experiment.
- Given as $X_j : \mathcal{O} \to \mathbb{R}$, where $\mathcal{O}$, the domain of $X_j$, is called the sample space.
- $\mathbb{R}$, the range of $X_j$, is the set of real numbers.
- If the outcomes are numeric, and represent the observed values of the random variable, then $X_j : \mathcal{O} \to \mathcal{O}$ is simply the identity function: $X_j(v) = v$ for all $v \in \mathcal{O}$.
24Data Probabilistic View
- A random variable X is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range.
- X is called a continuous random variable if it can take on any value in its range.
25Example
- By default, consider the attribute $X_1$ to be a continuous random variable, given as the identity function $X_1(v) = v$, since the outcomes are all numeric.
- On the other hand, if we want to distinguish between iris flowers with short and long sepal lengths, we define a discrete random variable A as follows: $A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \ge 7 \end{cases}$
- In this case the domain of A is [4.3, 7.9]. The range of A is {0, 1}, and thus A assumes non-zero probability only at the discrete values 0 and 1.
27Example Bernoulli and Binomial Distribution
- Only 13 irises have a sepal length of at least 7 cm.
- In this case we say that A has a Bernoulli distribution with parameter $p \in [0, 1]$. p denotes the probability of a success, whereas 1 - p represents the probability of a failure.
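A small sketch of the indicator variable A and its Bernoulli distribution. The 7 cm threshold and the count of 13 come from the text; the total of 150 flowers is an assumption (the size of the standard Iris dataset), and the `lengths` list is made-up data rather than actual measurements.

```python
def long_sepal(v, threshold=7.0):
    """Indicator variable A: 1 if sepal length is at least `threshold` cm, else 0."""
    return 1 if v >= threshold else 0


def bernoulli_pmf(x, p):
    """P(A = x) for a Bernoulli variable with success probability p."""
    if x not in (0, 1):
        return 0.0
    return p if x == 1 else 1 - p


# Hypothetical sepal lengths in cm (not the real Iris measurements).
lengths = [5.1, 7.2, 6.3, 7.0, 4.9]
outcomes = [long_sepal(v) for v in lengths]
print(outcomes)

# 13 long-sepal irises out of an assumed 150 flowers.
p = 13 / 150
print(round(p, 3), bernoulli_pmf(1, p), bernoulli_pmf(0, p))
```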
28Example Bernoulli and Binomial Distribution
- Let us consider another discrete random variable B, denoting the number of irises with long sepal lengths in m independent Bernoulli trials with probability of success p.
- B takes on the discrete values [0, m], and its probability mass function is given by the Binomial distribution: $f(k) = P(B = k) = \binom{m}{k} p^k (1 - p)^{m - k}$
- For example, taking p = 0.087 from above, the probability of observing exactly k = 2 long sepal length irises in m = 10 trials is given as $f(2) = \binom{10}{2} (0.087)^2 (0.913)^8 \approx 0.164$
29Full probability mass function for different values of k
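The Binomial probability mass function can be tabulated directly. This is a sketch using the values p = 0.087 and m = 10 from the example; the function name `binomial_pmf` is an illustrative choice.

```python
import math


def binomial_pmf(k, m, p):
    """P(B = k): probability of exactly k successes in m Bernoulli(p) trials."""
    return math.comb(m, k) * p**k * (1 - p) ** (m - k)


p, m = 0.087, 10

# Full probability mass function for k = 0, ..., m.
pmf = [binomial_pmf(k, m, p) for k in range(m + 1)]
for k, value in enumerate(pmf):
    print(k, round(value, 4))

print(sum(pmf))  # the probabilities add up to 1 (up to rounding error)
```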
30Probability Density Function
- If X is continuous, its range is the entire set of real numbers $\mathbb{R}$.
- The probability density function f specifies the probability that the variable X takes on values in any interval $[a, b] \subset \mathbb{R}$: $P(X \in [a, b]) = \int_a^b f(x)\,dx$
31Cumulative Distribution Function
- For any random variable X, whether discrete or continuous, we can define the cumulative distribution function (CDF) $F : \mathbb{R} \to [0, 1]$, which gives the probability of observing a value at most some given value x: $F(x) = P(X \le x)$ for all $-\infty < x < \infty$
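For a discrete variable the CDF is a running sum of the probability mass function. The sketch below illustrates this for a Binomial variable; the parameters p = 0.087 and m = 10 are reused from the earlier example, and the function names are assumptions for illustration.

```python
import math


def binomial_pmf(k, m, p):
    """P(B = k) for a Binomial(m, p) variable."""
    return math.comb(m, k) * p**k * (1 - p) ** (m - k)


def binomial_cdf(x, m, p):
    """F(x) = P(B <= x), defined for any real x."""
    if x < 0:
        return 0.0
    upper = min(math.floor(x), m)
    return sum(binomial_pmf(k, m, p) for k in range(upper + 1))


p, m = 0.087, 10
for x in (-1, 0, 1, 2.5, 10):
    print(x, round(binomial_cdf(x, m, p), 4))
```

The CDF is 0 below the smallest value, non-decreasing, constant between integer values, and reaches 1 at x = m.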
45Probability Density Function f(x)
- What is P(X = x) when x is on a real domain?
- $f(x) \ge 0$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$
46Normal Distribution
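- Let us assume that these values follow a Gaussian or normal density function, given as $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}$, where $\mu$ is the mean and $\sigma^2$ is the variance.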
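The normal density can be evaluated with a few lines of code. This sketch implements the standard formula and compares it with `statistics.NormalDist` from the Python standard library; the values of `mu` and `sigma` are made-up parameters, not estimates taken from the slides.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2 * math.pi * sigma**2)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma**2))


# Hypothetical parameters.
mu, sigma = 5.84, 0.83

for x in (mu - sigma, mu, mu + sigma):
    # Hand-written formula next to the standard-library implementation.
    print(x, normal_pdf(x, mu, sigma), NormalDist(mu, sigma).pdf(x))
```

The density is symmetric around the mean and attains its maximum there.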
53Bivariate Random Variables
- A pair of attributes, X1 and X2, can be considered together as a bivariate random variable.
57In 2-Dimensions
78Multivariate Random Variable
79Multivariate Random Variable
80Numeric Attribute Analysis
- Sample and Statistics
- Univariate Analysis
- Bivariate Analysis
- Multivariate Analysis
- Normal Distribution
81Random Sample and Statistics
- Population is used to refer to the set or universe of all entities under study.
- However, looking at the entire population may not be feasible, or may be too expensive.
- Instead, we draw a random sample from the population, and compute appropriate statistics from the sample that give estimates of the corresponding population parameters of interest.
82Univariate Sample
- Let X be a random variable, and let $x_i$ ($1 \le i \le n$) denote the observed values of attribute X in the given data, where n is the data size.
- Given a random variable X, a random sample of size n from X is defined as a set of n independent and identically distributed (IID) random variables $S_1, S_2, \dots, S_n$.
- Since the variables $S_i$ are all independent, their joint probability function is given as $f(x_1, \dots, x_n) = \prod_{i=1}^{n} f_X(x_i)$
83Multivariate Sample
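- $x_i$: the value of a d-dimensional vector random variable $S_i = (X_1, X_2, \dots, X_d)$.
- The $S_i$ are independent and identically distributed, and thus their joint distribution is given as $f(x_1, \dots, x_n) = \prod_{i=1}^{n} f_X(x_i)$ (1.43)
- Assuming the d attributes $X_1, X_2, \dots, X_d$ are independent, (1.43) can be rewritten as $f(x_1, \dots, x_n) = \prod_{i=1}^{n} \prod_{j=1}^{d} f_{X_j}(x_{ij})$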
84Statistic
- Let $S_i$ denote the random variable corresponding to data point $x_i$; then a statistic $\hat{\theta}$ is a function $\hat{\theta} : (S_1, S_2, \dots, S_n) \to \mathbb{R}$.
- If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.
87Univariate Analysis
Univariate analysis focuses on a single attribute
at a time, thus the data matrix D can be thought
of as an n × 1 matrix, or simply a column vector.
88Univariate Analysis
X is assumed to be a random variable, and each
point $x_i$ ($1 \le i \le n$) is assumed to be the value
of a random variable $S_i$, where the variables $S_i$
are all independent and identically distributed
as X, i.e., they constitute a random sample drawn
from X. In the vector view, we treat the sample
as an n-dimensional vector, and write $X \in \mathbb{R}^n$.
89What can sample analysis do?
- Unknown f(X) and F(X)
- Parameters(µ,d)
90Empirical Cumulative Distribution Function
91Inverse Cumulative Distribution Function
92Empirical Probability Mass Function
93Measures of Central Tendency (Mean)
Sample Mean (Unbiased, not robust)
94Measures of Central Tendency (Median)
or
Sample Median
95Measures of Central Tendency (Mode)
- May not be a very useful measure of central tendency
- But it is not affected much by outliers
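The robustness remarks for mean, median, and mode can be illustrated with the standard-library `statistics` module. The sample below is made-up data, and the single extreme value appended to it plays the role of an outlier.

```python
import statistics

# Hypothetical sample and the same sample with one extreme value added.
data = [5.9, 6.0, 6.0, 6.1, 6.3]
with_outlier = data + [60.0]

for name, sample in (("original", data), ("with outlier", with_outlier)):
    print(
        name,
        "mean:", statistics.mean(sample),
        "median:", statistics.median(sample),
        "mode(s):", statistics.multimode(sample),
    )
```

The mean shifts strongly when the outlier is added, while the median moves only slightly and the mode stays the same.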
96Example
97Measures of Dispersion (Range)
Sample Range
- Not robust, sensitive to extreme values
98Measures of Dispersion (Inter-Quartile Range)
- Inter-Quartile Range (IQR)
Sample IQR
99Measures of Dispersion (Variance and Standard Deviation)
Variance
Standard Deviation
100Measures of Dispersion (Variance and Standard Deviation)
Variance
Standard Deviation
Sample Variance and Standard Deviation
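The dispersion measures can be computed with the standard-library `statistics` module, as sketched below on made-up data. Two conventions are shown for the sample variance because the slides do not state which one they use: dividing by n (`pvariance`) and dividing by n - 1 (`variance`). The quartiles depend on the interpolation method; `statistics.quantiles` is called with its default settings here.

```python
import statistics

# Hypothetical sample.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Range: difference between the largest and smallest value.
sample_range = max(data) - min(data)

# Inter-quartile range: third quartile minus first quartile.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Variance and standard deviation, dividing by n ...
var_n = statistics.pvariance(data)
std_n = statistics.pstdev(data)
# ... and dividing by n - 1.
var_n1 = statistics.variance(data)
std_n1 = statistics.stdev(data)

print(sample_range, iqr)
print(var_n, std_n)
print(var_n1, std_n1)
```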
101Normalization
Linear Normalization
Z-Score
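Both normalizations can be sketched as below, assuming the usual definitions: linear (range) normalization maps each value to (x - min) / (max - min), and the z-score maps it to (x - mean) / standard deviation. The function names and the `ages` sample are illustrative, and the standard deviation here divides by n, which is an assumption.

```python
import statistics


def range_normalize(xs):
    """Linear normalization to [0, 1]; assumes max(xs) != min(xs)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    """Standard score; assumes a non-zero standard deviation (divide-by-n form)."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


# Hypothetical attribute values.
ages = [12.0, 14.0, 18.0, 23.0, 27.0]

print(range_normalize(ages))
print(z_score(ages))
```

After range normalization the smallest value is 0 and the largest is 1; after z-scoring the values have mean 0 and standard deviation 1.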
102Normalization Example
104Bivariate Analysis
Bivariate analysis focuses on two attributes at a
time, thus the data matrix D can be thought of as
an n × 2 matrix, or two column vectors.
105Empirical Joint Probability Mass Function
or
where
106Measures of Central Tendency (Mean)
Sample Mean
107Measures of Association (Covariance)
Covariance
Sample Covariance
108Measures of Association (Correlation)
Correlation
Sample Correlation
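A sketch of the sample covariance and correlation between two attributes. It assumes the divide-by-n form of the covariance (the slides do not show which normalization they use); the correlation is unaffected by that choice because the factor cancels. The attribute values are made up.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance, dividing by n (an assumed convention)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance normalized by the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical attributes: y grows with x, z shrinks with x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
z = [8.0, 6.0, 4.0, 2.0]

print(covariance(x, y), correlation(x, y))
print(covariance(x, z), correlation(x, z))
```

A perfectly increasing linear relationship gives a correlation of 1, and a perfectly decreasing one gives -1.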
109Measures of Association (Correlation)
110Correlation Example
112Multivariate Analysis
Multivariate analysis focuses on multiple
attributes at a time, thus the data matrix D can
be thought of as an n × d matrix, or d column
vectors.
113Measures of Central Tendency (Mean)
Sample Mean
114Measures of Association (Covariance Matrix)
115Measures of Association (Correlation)
Correlation
Sample Correlation
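For d attributes, the pairwise covariances can be collected in a d × d matrix. The sketch below centers the data and averages the products of the centered columns, again assuming the divide-by-n convention; the matrix `D` and the function names are illustrative.

```python
import math


def covariance_matrix(D):
    """d x d sample covariance matrix of an n x d data matrix (divide by n)."""
    n, d = len(D), len(D[0])
    mu = [sum(col) / n for col in zip(*D)]
    Z = [[x - m for x, m in zip(row, mu)] for row in D]  # centered data
    return [
        [sum(Z[i][a] * Z[i][b] for i in range(n)) / n for b in range(d)]
        for a in range(d)
    ]


def correlation_matrix(sigma):
    """Normalize each covariance by the two standard deviations."""
    d = len(sigma)
    return [
        [sigma[a][b] / math.sqrt(sigma[a][a] * sigma[b][b]) for b in range(d)]
        for a in range(d)
    ]


# Hypothetical 4 x 3 data matrix.
D = [[1.0, 2.0, 0.0],
     [2.0, 4.0, 1.0],
     [3.0, 6.0, 0.0],
     [4.0, 8.0, 1.0]]

Sigma = covariance_matrix(D)
R = correlation_matrix(Sigma)
for row in Sigma:
    print(row)
for row in R:
    print(row)
```

The diagonal of the covariance matrix holds the attribute variances, and the matrix is symmetric.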
117Univariate Normal Distribution
118Multivariate Normal Distribution