Title: Lecture 3: Exploratory Data Analysis
1Lecture 3 Exploratory Data Analysis
- Main topics
- Normal distribution and probability
- Non-normal distributions (other distributions)
- Measures of non-normality: skewness and kurtosis
- Data transformation
2Statistical Analysis
- Normal distribution
- parametric tests, e.g. t-tests, LSD, ANOVA, regression, etc.
- Non-normal distributions (other distributions)
- non-parametric tests, e.g. Wilcoxon's test, Kruskal-Wallis test, Mann-Whitney test
3Normal distribution
- Normal curve: bell-shaped (unimodal)
- Symmetrical around the mean (skewness = 0)
- Mean, median and mode are the same (equal)
- Variance is less than mean
- The curve is neither too peaked nor too flat (kurtosis = 3)
[Figure: normal curve centred on μ; 50% of values lie within μ ± 0.67σ, 95% within μ ± 1.96σ, and 99% within μ ± 2.58σ]
Standard normal deviate: standard distance from μ
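The 50/95/99% areas quoted for the normal curve can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `central_coverage` is an illustrative choice, not something from the lecture.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def central_coverage(z: float) -> float:
    """Probability that a standard normal value lies within +/- z of the mean."""
    return STANDARD_NORMAL.cdf(z) - STANDARD_NORMAL.cdf(-z)


# The three standard normal deviates shown on the slide.
for z in (0.67, 1.96, 2.58):
    print(f"within +/-{z} SD of the mean: {central_coverage(z):.4f}")
```

The three printed values should come out close to 0.50, 0.95 and 0.99 respectively.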
4Normal distribution probability
- Probability: basics
- 1. In a population of fish, males and females have an equal chance of being sampled.
- Probability for male p = 0.5, and for female q = 0.5
- p + q = 1 (or 100%)
- 2. If a district has 20 rich, 200 middle-class and 80 poor families, what is the probability of sampling:
- Poor and rich families?
- Poor or middle-class families?
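A minimal sketch of the family example, using exact fractions. The "or" question is treated with the addition rule for one draw. The "and" question is ambiguous on the slide; here it is read, as an assumption, as two independent draws with replacement (first a poor family, then a rich one). Other readings give other answers.

```python
from fractions import Fraction

# Family counts from the slide.
families = {"rich": 20, "middle": 200, "poor": 80}
total = sum(families.values())  # 300 families in the district


def p(group: str) -> Fraction:
    """Probability that one randomly sampled family belongs to `group`."""
    return Fraction(families[group], total)


# Addition rule: the classes are mutually exclusive, so P(A or B) = P(A) + P(B).
p_poor_or_middle = p("poor") + p("middle")

# Multiplication rule (assumed reading): two independent draws with
# replacement, first poor and then rich, so P(A and B) = P(A) * P(B).
p_poor_then_rich = p("poor") * p("rich")

print("P(poor or middle) =", p_poor_or_middle)
print("P(poor, then rich) =", p_poor_then_rich)
```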
5Normal distribution probability
- If the population mean (μ) = 350 g
- and the standard deviation (SD) = 15 g
- What is the probability (chance) of obtaining the following measurements?
- 360 g or bigger
- 380 g or bigger
- 500 g and higher
- Lower than 340 g and higher than 360 g?
6Normal distribution Probability
- a) 360 g
- Z value = (xi - μ) / σ
- = (360 - 350)/15 = 0.67
- Probability = 0.2514 = 25.14% (from the table on the next slide)
- b) 380 g
- Z value = (380 - 350)/15 = 2.00
- Probability = 0.0228, p = 2.28%
- c) 500 g
- Z value = (500 - 350)/15 = 10.00
- Probability is < 0.0001, p < 0.01%
- Probability from the table of the normal curve
[Figure: normal curve with 350 and 360 marked on the X axis]
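The Z-value calculations can be reproduced without a printed table. This is a sketch using the standard library; the function names are illustrative. Because the code keeps the unrounded Z (0.667 rather than 0.67), the first probability comes out near 0.2525 instead of the tabulated 0.2514. The last line also covers the fourth question (lower than 340 g or higher than 360 g), which the slide leaves unanswered.

```python
from statistics import NormalDist

MU = 350.0  # population mean in grams (from the slide)
SD = 15.0   # standard deviation in grams (from the slide)


def z_value(x: float) -> float:
    """Standard normal deviate: distance of x from the mean in SD units."""
    return (x - MU) / SD


def upper_tail(x: float) -> float:
    """Probability of obtaining a measurement of x or bigger."""
    return 1.0 - NormalDist().cdf(z_value(x))


def lower_tail(x: float) -> float:
    """Probability of obtaining a measurement lower than x."""
    return NormalDist().cdf(z_value(x))


for grams in (360, 380, 500):
    print(f"{grams} g: Z = {z_value(grams):.2f}, P = {upper_tail(grams):.4f}")

# Lower than 340 g or higher than 360 g: add the two tails.
p_outside = lower_tail(340) + upper_tail(360)
print(f"outside 340-360 g: P = {p_outside:.4f}")
```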
7(Table of the normal curve; no transcript)
8Non-normality
- To know normality, you need to know non-normality
- Two words for non-normality:
- Skewness: asymmetry/pointedness
- Kurtosis: peakedness
First step: exploratory analysis to find out whether the data are normal or not!
9Non-normality
10Non-normality
2. Kurtosis: peakedness
[Figure labels: platykurtic (flatter than normal), leptokurtic (more peaked than normal)]
11Measure of non-normality
Skewness: $\gamma_1 = \frac{\sum (x_i - \mu)^3}{n \sigma^3}$
Kurtosis: $\gamma_2 = \frac{\sum (x_i - \mu)^4}{n \sigma^4}$
Note: for a sample, $n$ should be $n - 1$ and $\gamma$ is replaced by $g$
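As a rough illustration, the population forms of the two measures can be computed directly from their definitions. The function name is an assumption, and the sample correction mentioned in the note (n - 1) is deliberately not implemented here. The example data are the family sizes used later in this lecture.

```python
from math import sqrt


def skewness_kurtosis(values):
    """Population skewness and kurtosis from the definitions on the slide."""
    n = len(values)
    mu = sum(values) / n
    # Population standard deviation (divides by n, not n - 1).
    sigma = sqrt(sum((x - mu) ** 2 for x in values) / n)
    skewness = sum((x - mu) ** 3 for x in values) / (n * sigma ** 3)
    kurtosis = sum((x - mu) ** 4 for x in values) / (n * sigma ** 4)
    return skewness, kurtosis


family_size = [5, 2, 3, 3, 4, 5, 3, 4, 4, 3, 4, 5, 2, 3, 2, 6, 4, 4, 6, 5]
skew, kurt = skewness_kurtosis(family_size)
print(f"skewness = {skew:.3f}, kurtosis = {kurt:.3f}")
```

For a normal distribution these values would be close to 0 and 3; a small data set like this one will usually deviate somewhat.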
12Test of normality: very important!
- 1. Calculate the mean and standard deviation, then see whether 95% of the data are within the range of mean ± 1.96 SD
- We consider a 95% confidence level for the agricultural or biological fields (5% of the data can be due to random factors). In the medical field, 99% is the confidence level
- 2. Draw a frequency diagram to see the shape of the graph and the nature of the variance (or SD)
- 3. Perform statistical tests, e.g. Chi-square, Kolmogorov-Smirnov (K-S test); see Lecture 7
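Step 1 can be sketched in a few lines: compute the mean and SD, then count how many observations fall inside mean ± 1.96 SD. The function name and the use of the sample SD (`statistics.stdev`) are assumptions; the farm-size values are the ones listed later in this lecture.

```python
from statistics import mean, stdev


def fraction_within(values, z=1.96):
    """Fraction of observations lying inside mean +/- z * SD."""
    m = mean(values)
    sd = stdev(values)  # sample SD; an assumption, the slide does not specify
    low, high = m - z * sd, m + z * sd
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


farm_size_ha = [5.4, 2.3, 3.5, 3.2, 4.5, 5.6, 3.2, 4.0, 4.4, 3.6,
                4.3, 5.2, 2.3, 3.5, 2.5, 6.3, 4.5, 4.2, 6.2, 5.3]
print(f"fraction within mean +/- 1.96 SD: {fraction_within(farm_size_ha):.2f}")
```

A fraction near 0.95 is consistent with normality, but with only 20 observations this is a coarse check and not a formal test.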
13Frequency distribution and probability
- Frequency: count of repeated occurrences of a particular event or object
- Discrete variables: family size (family no. 1-20 respectively)
- 5, 2, 3, 3, 4, 5, 3, 4, 4, 3, 4, 5, 2, 3, 2, 6, 4, 4, 6, 5
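For a discrete variable like this, the frequency of each value can be tallied directly. A small sketch with `collections.Counter`; the relative-frequency column is an addition for illustration.

```python
from collections import Counter

# Family sizes for family no. 1-20, as listed on the slide.
family_size = [5, 2, 3, 3, 4, 5, 3, 4, 4, 3, 4, 5, 2, 3, 2, 6, 4, 4, 6, 5]

frequency = Counter(family_size)
n = len(family_size)

print("size  frequency  relative frequency")
for size in sorted(frequency):
    print(f"{size:>4}  {frequency[size]:>9}  {frequency[size] / n:>18.2f}")
```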
14Frequency distribution
- Frequency: repetition of the same value or event
- Discrete variables
- Bar graph
15Frequency distribution
If a discrete variable also has a large range (min to max), grouping is necessary, e.g.
No. of insects (Plant no. 1-100.. respectively)
- 25, 402, 203, 303, 204, 125, 38, 441, 200, 50, 112, 45, 200, 111, 0, 36, 14, 445, 60, 500, 1200, 300, 600, 20, 400, 30, 20, 22, 40, 300, 200, 1150, 300.
Rules: depends on the purpose and nature of the data. The number of groups or classes should not be too many or too few (maximum 20).
16Frequency distribution
Find out the range first and work out the class interval.
For example: Min = 0, Max = 1200. If the class interval is 100, the no. of classes will be 12.
But if these figures represent the number of small animals, e.g. chickens or ducks, that farmers have at your site, and you need to present these data in terms of farm size (small, medium, large and very large farms), you may need to find a suitable class interval; it could be 300 or 400.
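The range-then-interval procedure can be sketched as follows, using the insect counts from the previous slide. The function name is illustrative, and one convention is assumed: each class excludes its upper limit, except the last class, which also takes the maximum value so that 1200 does not open a 13th class.

```python
import math
from collections import Counter

insects = [25, 402, 203, 303, 204, 125, 38, 441, 200, 50, 112, 45, 200, 111,
           0, 36, 14, 445, 60, 500, 1200, 300, 600, 20, 400, 30, 20, 22, 40,
           300, 200, 1150, 300]


def group_counts(values, width):
    """Return (lower limit, upper limit, frequency) for each class."""
    low, high = min(values), max(values)
    n_classes = max(1, math.ceil((high - low) / width))  # range / interval
    # Class index of each value; the maximum is folded into the last class.
    counts = Counter(min((v - low) // width, n_classes - 1) for v in values)
    return [(low + i * width, low + (i + 1) * width, counts.get(i, 0))
            for i in range(n_classes)]


for lower, upper, freq in group_counts(insects, 100):
    print(f"{lower:>5} to {upper:<5} {freq}")
```

Calling `group_counts(insects, 300)` instead gives four classes, which matches the small / medium / large / very large presentation suggested above.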
17Frequency distribution
- Continuous variables: unlike discrete variables, there is no repetition of the same value
- Farm size (ha) (family no. 1-20 respectively)
- 5.4, 2.3, 3.5, 3.2, 4.5, 5.6, 3.2, 4.0, 4.4, 3.6, 4.3, 5.2, 2.3, 3.5, 2.5, 6.3, 4.5, 4.2, 6.2, 5.3
18Frequency distribution
19Frequency Distribution
- Grouping classes
- Ungrouped distribution: observed values against frequency of observation
- Grouped distribution: grouping data into a number of classes
- Class: a data group
- Class limits: extreme boundaries (be careful not to overlap boundaries)
- Lower limit: the left-hand number in a class limit
- Upper limit: the right-hand number in a class limit
- Open classes, e.g.
- 0-25, 25-50, 50-75, 75-100, 100 or higher (even 300 will be included in this class)
- Class interval (class width): the difference between the true or mathematical upper and lower class limits (or the difference in stated limits + 1)
20- Example 1: upper limit excluded
- 0 to under 10 (from 0 right up to 9.999, but not including 10)
- 10 to under 20 (up to 19.9)
- 20 to under 30 (up to 29.9)
- Example 2: upper limit included
- 0-10 (-0.5 up to 10.4)
- 10-20 (9.5 up to 20.4)
- 20-30
21- Stated limits: class limits shown in a frequency distribution table
- True or mathematical limits: the true boundary of a limit, e.g. 0-10 (this is the age up to 10 years and 5 months)
- Mathematical or implied limits should be taken into account when considering true limits (especially when handling discrete data).
- If an integer is used, mathematical limits usually extend a stated limit by 0.5 for both lower and upper limits.
22- Class mark: mid-point of the true or mathematical limits. If the class interval is odd, the class mark is the middle figure, or equal to (upper mark + lower mark)/2.
- Example
- Stated limits: 0-9
- True limits: -0.5 to 9.49
- Class interval: 10 = 9.5 - (-0.5)
- Mid-point: 4.5 (of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9) = (0 + 9)/2 = 4.5
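The worked example can be written as a small helper. This is a sketch that assumes integer stated limits, so the mathematical limits extend each stated limit by 0.5; the function and key names are illustrative.

```python
def class_summary(stated_lower: int, stated_upper: int) -> dict:
    """True limits, class interval and class mark for integer stated limits."""
    true_lower = stated_lower - 0.5  # mathematical lower limit
    true_upper = stated_upper + 0.5  # mathematical upper limit
    return {
        "true_limits": (true_lower, true_upper),
        "class_interval": true_upper - true_lower,
        "class_mark": (true_lower + true_upper) / 2,
    }


print(class_summary(0, 9))
```

For stated limits 0-9 this gives a class interval of 10 and a class mark of 4.5, in line with the example.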
23- Histogram: bar diagram for continuous data. A graph of a frequency distribution with the X axis extending from one class limit to the other and the observed frequency on the Y axis. The area of a rectangle is proportional to the observed frequency in the class. If mathematical limits are used, they should be used on the X axis so there is no gap. The vertical bars show the frequency density.
- Frequency polygon: a line graph with the mid-point (class mark) on the X axis and frequency on the Y axis. Extend the curve downward so that the line cuts the mid-point of the class OUTSIDE of the distribution.
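To make the frequency polygon concrete, the sketch below groups the farm-size data from this lecture into classes and returns the (class mark, frequency) points, including one empty class on each side so the line drops to zero outside the distribution. The 1 ha class width and the upper-limit-excluded convention are assumptions; plotting is left out so the example needs only the standard library.

```python
import math
from collections import Counter

farm_size_ha = [5.4, 2.3, 3.5, 3.2, 4.5, 5.6, 3.2, 4.0, 4.4, 3.6,
                4.3, 5.2, 2.3, 3.5, 2.5, 6.3, 4.5, 4.2, 6.2, 5.3]


def frequency_polygon(values, width):
    """(class mark, frequency) points, padded with one empty class per side."""
    # Class k covers k*width up to, but not including, (k+1)*width.
    counts = Counter(math.floor(v / width) for v in values)
    first, last = min(counts), max(counts)
    points = []
    for k in range(first - 1, last + 2):
        class_mark = (k + 0.5) * width  # mid-point of the class
        points.append((class_mark, counts.get(k, 0)))
    return points


for mark, freq in frequency_polygon(farm_size_ha, 1.0):
    print(f"class mark {mark:.1f} ha: frequency {freq}")
```

The same class frequencies, drawn as adjoining bars from lower to upper class limit, would give the histogram.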
24Bar graph for the number of females in groups
Discrete data
25Histogram for the height of Females in Statistics
Class - continuous data
26Frequency polygon/curve
Mid-points connected by lines. If the frequency polygon is smooth, it is called a frequency curve.
[Figure: frequency polygon for female height]
27Cumulative frequency curve (Ogive)
A graph with the X axis denoting the UPPER class limit and the Y axis showing the cumulative frequency, from 0 to the highest number
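The points of an ogive follow from a running total of the class frequencies. In this sketch the class frequencies are the farm-size data from this lecture grouped into 1 ha classes (an assumed grouping), and the curve is started at zero at the lower limit of the first class.

```python
from itertools import accumulate

# (lower limit, upper limit, frequency); farm size (ha) grouped into
# 1 ha classes with the upper limit excluded (assumed grouping).
classes = [(2, 3, 3), (3, 4, 5), (4, 5, 6), (5, 6, 4), (6, 7, 2)]


def ogive_points(grouped):
    """(upper class limit, cumulative frequency) points, starting from 0."""
    upper_limits = [upper for _, upper, _ in grouped]
    cumulative = list(accumulate(freq for _, _, freq in grouped))
    start = (grouped[0][0], 0)  # nothing lies below the first lower limit
    return [start] + list(zip(upper_limits, cumulative))


for limit, cum in ogive_points(classes):
    print(f"below {limit} ha: {cum} farms")
```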
28Cumulative frequency and normality
Normal
Bimodal two populations
Right skewed/pointed
29Cumulative frequency and normality
Note: using the nature of the cumulative frequency curve, data can be tested for whether they are normally distributed or not; you will learn this in Lecture 7, Kolmogorov-Smirnov Test (KS-Test)
30Variance heterogeneity
Variance (s²) > mean (µ) means lack of normality
[Figure panels: homogeneous variance (two panels), heterogeneous variance (two panels)]
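The variance-versus-mean rule of thumb stated on this slide is easy to check for any data set. The sketch below applies it to the two count data sets used earlier in this lecture; the function name and the use of the sample variance are assumptions.

```python
from statistics import mean, variance

family_size = [5, 2, 3, 3, 4, 5, 3, 4, 4, 3, 4, 5, 2, 3, 2, 6, 4, 4, 6, 5]
insects = [25, 402, 203, 303, 204, 125, 38, 441, 200, 50, 112, 45, 200, 111,
           0, 36, 14, 445, 60, 500, 1200, 300, 600, 20, 400, 30, 20, 22, 40,
           300, 200, 1150, 300]


def variance_exceeds_mean(values) -> bool:
    """True when the sample variance is larger than the mean."""
    return variance(values) > mean(values)


for name, data in (("family size", family_size), ("insects", insects)):
    print(f"{name}: mean = {mean(data):.2f}, variance = {variance(data):.2f}, "
          f"variance > mean: {variance_exceeds_mean(data)}")
```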
31Most statistical tests, e.g. ANOVA and the t-test, are based on the normal distribution. If data are not normally distributed, what will you do?
1. Data transformation
2. Non-parametric tests (to learn later)
32- 1. Square root: √x
- If variance tends to be equal and proportional to the mean
- Mostly count data
- Percentage data with a wide range, e.g. < 30 to > 70
- Add 0.5 to the data if values are 0 or < 10, e.g. √(x + 0.5)
- 2. Log or Ln: Log(x) or Ln(x), or Log/Ln(x + 1)
- Add 1 to the data if values are < 10
- Standard deviation is proportional to the mean or effects are multiplicative (e.g. SGR)
- Skewed to the right
- Growth of organisms (SGR)
- Whole numbers with a wide range
33- 3. ArcSine or angular transformation: Asin(x)
- Normally coupled with square root transformation, i.e. Asin(√x)
- 0 should be replaced by 1/4n and 100 by 100 - 1/4n
- Notes
- Percentage data ranging from 30-70 do not need to be transformed
- Divide by 100 in case of percentage data before transformation
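The three transformations can be sketched as plain functions. Several details here are assumptions and not taken from the lecture: the function names, base-10 logarithms, reading 1/4n as 1/(4n) with n the number of observations behind each percentage, and returning the arcsine result in radians.

```python
import math


def sqrt_transform(x: float, add_half: bool = True) -> float:
    """Square root transformation; adds 0.5 first for zero or small counts."""
    return math.sqrt(x + 0.5) if add_half else math.sqrt(x)


def log_transform(x: float, add_one: bool = True) -> float:
    """Log (base 10, assumed) transformation; adds 1 first for small values."""
    return math.log10(x + 1) if add_one else math.log10(x)


def arcsine_transform(percent: float, n: int) -> float:
    """Arcsine square-root transformation of a percentage, in radians."""
    if percent == 0:
        percent = 1 / (4 * n)          # replace 0 by 1/(4n)
    elif percent == 100:
        percent = 100 - 1 / (4 * n)    # replace 100 by 100 - 1/(4n)
    proportion = percent / 100         # divide by 100 before transforming
    return math.asin(math.sqrt(proportion))


print(sqrt_transform(0))
print(log_transform(9))
print(arcsine_transform(50, n=10))
```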
34- Transformation is done to bring the data to normality (reduce variation)
- Once the parametric test is carried out, data have to be converted back to the original scale and then presented (but you should mention which transformation was done while analyzing the data)
- 1. Square root: square it, (√x)²
- 2. Log: antilog(log x), or
- power(base, log x)
- 3. ArcSine or angular transformation, or square root-ArcSine: (sin y)², where y is the transformed value Asin(√x)
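Each back-transformation simply undoes the corresponding transformation, which a round trip makes easy to see. The sketch assumes base-10 logs, arcsine values in radians, and that no constant (0.5 or 1) was added before transforming; if one was, it would have to be subtracted again. The function names are illustrative.

```python
import math


def back_sqrt(y: float) -> float:
    """Undo a square root transformation by squaring."""
    return y ** 2


def back_log(y: float, base: float = 10.0) -> float:
    """Undo a log transformation: antilog, i.e. base raised to the power y."""
    return base ** y


def back_arcsine(y: float) -> float:
    """Undo an arcsine square-root transformation; returns a proportion."""
    return math.sin(y) ** 2


# Round trips: transform a value, then convert it back to the original scale.
print(back_sqrt(math.sqrt(16)))
print(back_log(math.log10(1000)))
print(back_arcsine(math.asin(math.sqrt(0.25))))
```

The arcsine back-transformation returns a proportion; multiply by 100 to report a percentage.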
35Practical session 3: Frequency and data transformation
Next class: Measures of central tendency (mean, median, mode, etc.)