Lecture 3: Exploratory Data Analysis - PowerPoint PPT Presentation

Provided by: stweb

1
Lecture 3: Exploratory Data Analysis
  • Main topics
  • Normal distribution and probability
  • Non-normal distributions (other distributions)
  • Measures of non-normality: skewness and kurtosis
  • Data transformation

2
Statistical Analysis
  • Normal distribution
  • parametric tests e.g. t-tests, LSD, ANOVA,
    regression etc.
  • Non-normal distributions (other distributions)
  • non-parametric tests e.g. Wilcoxon's test,
    Kruskal-Wallis test, Mann-Whitney test

3
Normal distribution
  • Normal curve - bell shaped (Unimodal)
  • Symmetrical around the mean (skewness = 0)
  • Mean, median and mode are the same (equal)
  • Variance is less than mean
  • The curve is neither too peaked nor too flat
    (kurtosis = 3)

[Figure: normal curve with 50%, 95% and 99% of the area lying within ±0.67σ, ±1.96σ and ±2.58σ of the mean µ]
Standard normal deviate: standard distance from µ
4
Normal distribution probability
  • Probability - basics
  • 1. In a population of fish, there is equal chance
    of being sampled for male and female.
  • Probability for male p = 0.5, and for female q = 0.5
  • p + q = 1 (or 100%)
  • 2. If a district has 20 rich families, 200 middle
    class and 80 poor families, what is the
    probability of sampling
  • Poor and rich families?
  • Poor or middle class families?
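
A minimal sketch of the second example, assuming a single family is drawn at random, so that the three classes are mutually exclusive and their probabilities simply add. The names `families` and `prob_any` are illustrative and do not come from the lecture.

```python
from fractions import Fraction

# Family counts from the slide: 20 rich, 200 middle class, 80 poor.
families = {"rich": 20, "middle": 200, "poor": 80}
total = sum(families.values())  # 300 families in the district


def prob_any(*classes):
    """Probability that one randomly sampled family falls in any of the given classes."""
    return Fraction(sum(families[c] for c in classes), total)


print("P(poor or middle class) =", prob_any("poor", "middle"))
print("P(poor or rich)         =", prob_any("poor", "rich"))
```

Under this reading, "poor or middle class" is (80 + 200)/300. If "poor and rich" is instead meant as two separate draws, the multiplication rule would apply rather than the addition rule used here.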

5
Normal distribution probability
  • If population mean (µ) = 350 g
  • standard deviation (SD) = 15 g
  • What are the probabilities or chances of obtaining
    the following measurements?
  • 360 g or bigger
  • 380 g or bigger
  • 500 g and higher
  • Lower than 340 g or higher than 360 g?

6
Normal distribution Probability
  • a) 360 g
  • Z value = (xi - µ) / σ
  • = (360 - 350)/15 = 0.67
  • Probability = 0.2514 = 25.14% (from the table on the
    next slide)
  • b) 380 g
  • Z value = (380 - 350)/15 = 2.00
  • probability = 0.0228, p = 2.28%
  • c) 500 g
  • Z value = (500 - 350)/15 = 10.00
  • probability is < 0.0001, p < 0.01%
  • Probabilities are read from the table of the normal curve
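
The worked answers above can be checked without a printed table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper names are illustrative. Because the code keeps the unrounded Z value (0.6667 rather than 0.67), the result for 360 g comes out near 0.2525 instead of the tabulated 0.2514.

```python
from statistics import NormalDist

# Values from the slides: population mean 350 g, standard deviation 15 g.
MEAN_G = 350.0
SD_G = 15.0
weights = NormalDist(mu=MEAN_G, sigma=SD_G)


def z_value(x):
    """Standard normal deviate: distance of x from the mean in SD units."""
    return (x - MEAN_G) / SD_G


def prob_at_least(x):
    """Upper-tail probability P(X >= x)."""
    return 1.0 - weights.cdf(x)


def prob_outside(low, high):
    """Two-tailed probability P(X < low) + P(X > high)."""
    return weights.cdf(low) + (1.0 - weights.cdf(high))


for grams in (360, 380, 500):
    print(f"{grams} g: Z = {z_value(grams):.2f}, P = {prob_at_least(grams):.4f}")

print(f"below 340 g or above 360 g: P = {prob_outside(340, 360):.4f}")
```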

[Figure: normal curve with 350 g and 360 g marked]
7
(No Transcript)
8
Non-normality
  • To know normality, you need to know non-normality
  • Two words for non-normality
  • Skewness - asymmetry/pointedness
  • Kurtosis - peakedness

First step: exploratory analysis to find whether the
data are normal or not!
9
Non-normality
  • 1. Skewness - asymmetry

10
Non-normality
2. Kurtosis - peakedness
[Figure: platykurtic (flatter than normal) and leptokurtic (more peaked than normal) curves]
11
Measure of non-normality
Skewness (γ1) = Σ(xi - µ)³ / (nσ³)
Kurtosis (γ2) = Σ(xi - µ)⁴ / (nσ⁴)
Note: for a sample, n should be n - 1 and γ is
replaced by g
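
A minimal sketch of the two population formulas above, written in plain Python. The function name `skewness_kurtosis` is illustrative. With this definition a normal distribution has kurtosis 3, not 0 (some software reports "excess" kurtosis, which subtracts 3).

```python
import math


def skewness_kurtosis(values):
    """Population skewness (gamma1) and kurtosis (gamma2) as defined on the slide."""
    n = len(values)
    mu = sum(values) / n
    # Population standard deviation (n in the denominator).
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    gamma1 = sum((x - mu) ** 3 for x in values) / (n * sigma ** 3)
    gamma2 = sum((x - mu) ** 4 for x in values) / (n * sigma ** 4)
    return gamma1, gamma2


# Family sizes for families 1-20, as listed in this lecture.
family_sizes = [5, 2, 3, 3, 4, 5, 3, 4, 4, 3, 4, 5, 2, 3, 2, 6, 4, 4, 6, 5]
g1, g2 = skewness_kurtosis(family_sizes)
print(f"skewness = {g1:.3f}, kurtosis = {g2:.3f}")
```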
12
Test of normality: very important!
  • 1. Calculate the mean and standard deviation,
    then see whether 95% of the data are within the
    range Mean ± 1.96 × SD
  • We consider the 95% confidence level for agricultural
    or biological fields (5% of the data can be due to random
    factors). In the medical field, 99% is the confidence
    level
  • 2. Draw a frequency diagram to see the shape of
    the graph and the nature of the variance (or SD)
  • 3. Perform statistical tests e.g. Chi-square,
    Kolmogorov-Smirnov (K-S test); see Lecture 7
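
The first check can be sketched as follows, assuming the sample standard deviation (n - 1 denominator) is used. The data are the farm sizes listed elsewhere in this lecture; `share_within_range` is an illustrative name. This is only a rough screen, not a formal test.

```python
from statistics import mean, stdev

# Farm sizes (ha) for families 1-20, as listed in this lecture.
farm_sizes = [5.4, 2.3, 3.5, 3.2, 4.5, 5.6, 3.2, 4.0, 4.4, 3.6,
              4.3, 5.2, 2.3, 3.5, 2.5, 6.3, 4.5, 4.2, 6.2, 5.3]


def share_within_range(values, z=1.96):
    """Fraction of values lying inside mean +/- z * SD."""
    m = mean(values)
    sd = stdev(values)  # sample SD, n - 1 in the denominator
    low, high = m - z * sd, m + z * sd
    inside = sum(1 for x in values if low <= x <= high)
    return inside / len(values)


print(f"share within mean ± 1.96 SD: {share_within_range(farm_sizes):.0%}")
```

A share close to 95% is consistent with normality; a much lower share suggests heavy tails or outliers.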

13
Frequency distribution and probability
  • Frequency: count of repeated occurrences of a
    particular event or object
  • Discrete variables: family size (family no. 1-20
    respectively)
  • 5, 2, 3, 3, 4, 5, 3, 4, 4, 3, 4, 5, 2, 3, 2, 6,
    4, 4, 6, 5
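
A short sketch of an ungrouped frequency table for the family sizes above, using `collections.Counter` from the standard library. The text bar chart is only a stand-in for the bar graph on the next slide.

```python
from collections import Counter

# Family sizes for families 1-20, from the slide.
family_sizes = [5, 2, 3, 3, 4, 5, 3, 4, 4, 3, 4, 5, 2, 3, 2, 6, 4, 4, 6, 5]

# Count how often each family size occurs.
frequency = Counter(family_sizes)

for size in sorted(frequency):
    count = frequency[size]
    print(f"{size}: {'#' * count} ({count})")
```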

14
Frequency distribution
  • Frequency: repetition of the same value or event
  • Discrete variables
  • Bar graph

15
Frequency distribution
If a discrete variable also has a large range (min to
max), grouping is necessary, e.g.
No. of insects (plant no. 1-100.. respectively):
25, 402, 203, 303, 204, 125, 38, 441, 200, 50,
112, 45, 200, 111, 0, 36, 14, 445, 60, 500, 1200,
300, 600, 20, 400, 30, 20, 22, 40, 300, 200,
1150, 300.
Rules: depends on the purpose and nature of the data. The no.
of groups or classes should not be too many or too
few (maximum 20)
16
Frequency distribution
Find out the range first and work out the class
interval.
For example: Min = 0, Max = 1200
If the class interval is 100, the no. of classes will be 12.
But if these figures represent the number of
small animals, e.g. chickens or ducks, that farmers
have at your site, and you need to present
these data in terms of farm size (small, medium,
large and very large farms), you may need to find
a suitable class interval; it could be 300 or 400.
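
The grouping step can be sketched as below, using the insect counts listed earlier in this lecture. The function `group_into_classes` is illustrative. One assumption is made loudly here: the maximum value (1200) is placed in the last class, so that range / interval gives 12 classes as stated above, with the final class closed at its upper end.

```python
import math

# Insect counts per plant, as listed in this lecture.
insect_counts = [25, 402, 203, 303, 204, 125, 38, 441, 200, 50,
                 112, 45, 200, 111, 0, 36, 14, 445, 60, 500, 1200,
                 300, 600, 20, 400, 30, 20, 22, 40, 300, 200,
                 1150, 300]


def group_into_classes(values, interval):
    """Count values per class of the given width, starting at the minimum."""
    low = min(values)
    n_classes = max(1, math.ceil((max(values) - low) / interval))
    counts = [0] * n_classes
    for v in values:
        # Assumption: the maximum value is folded into the last class.
        index = min((v - low) // interval, n_classes - 1)
        counts[index] += 1
    return counts


for width in (100, 300):
    counts = group_into_classes(insect_counts, width)
    print(f"interval {width}: {len(counts)} classes -> {counts}")
```

A wider interval gives fewer, fuller classes, which is the trade-off the slide describes for small, medium, large and very large farms.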
17
Frequency distribution
  • Continuous variables: unlike discrete variables,
    there is no repetition of the same value
  • Farm size (ha) (Family no. 1-20 respectively)
  • 5.4, 2.3, 3.5, 3.2, 4.5, 5.6, 3.2, 4.0, 4.4, 3.6,
    4.3, 5.2, 2.3, 3.5, 2.5, 6.3, 4.5, 4.2, 6.2, 5.3

18
Frequency distribution
  • Frequency Bar graph

19
Frequency Distribution
  • Grouping classes
  • Ungrouped distribution: observed values against
    frequency of observation
  • Grouped distribution: grouping data into a number
    of classes
  • Class: a data group
  • Class limits: extreme boundaries (be careful not
    to overlap boundaries)
  • Lower limit: the left-hand number in a class
    limit
  • Upper limit: the right-hand number in a class
    limit
  • Open classes, e.g.
  • 0-25, 25-50, 50-75, 75-100, 100 or higher (even
    300 will be included in this class)
  • Class interval (class width): the difference
    between the true or mathematical upper and lower
    class limits (or the difference in stated limits + 1)

20
  • Example 1: upper limit excluded
  • 0 - under 10 (from 0 right up to 9.999 but
    not including 10)
  • 10 - under 20 (up to 19.9)
  • 20 - under 30 (up to 29.9)
  • Example 2: upper limit included
  • 0 - 10 (-0.5 up to 10.4)
  • 10 - 20 (9.5 up to 20.4)
  • 20 - 30

21
  • Stated limits: class limits shown in a frequency
    distribution table
  • True or mathematical limits: the true boundary of a
    limit, e.g. 0 - 10 (this is the age up to 10 years
    and 5 months)
  • Mathematical or implied limits should be taken
    into account when considering true limits
    (especially when handling discrete data).
  • If an integer is used, mathematical limits
    usually extend a stated limit by 0.5 for both
    lower and upper limits.

22
  • Class mark: mid-point of the true or
    mathematical limits. If the class interval is
    odd, the class mark is the middle figure, or equal to
    (upper mark + lower mark)/2.
  • Example
  • Stated limits: 0 - 9
  • True limits: -0.5 to 9.49
  • Class interval: 10 = 9.5 - (-0.5)
  • Mid-point: 4.5 (of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9), i.e. (0 + 9)/2 =
    4.5
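
The example above can be reproduced with a small helper, assuming integer stated limits that are extended by 0.5 on each side. The name `class_summary` is illustrative.

```python
def class_summary(stated_lower, stated_upper):
    """True limits, class interval and class mark for integer stated limits."""
    true_lower = stated_lower - 0.5
    true_upper = stated_upper + 0.5
    interval = true_upper - true_lower          # class width
    class_mark = (true_lower + true_upper) / 2  # mid-point of the true limits
    return true_lower, true_upper, interval, class_mark


for lower, upper in ((0, 9), (10, 19), (20, 29)):
    lo, hi, width, mark = class_summary(lower, upper)
    print(f"{lower}-{upper}: true limits {lo} to {hi}, interval {width}, class mark {mark}")
```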

23
  • Histogram - Bar diagram for continuous data. A
    graph of frequency distribution with X axis
    extending from one class limit to the other and
    the observed frequency in the Y-axis. The area of
    a rectangle is proportional to the observed
    frequency in the class. If mathematical limits
    are used, they should be used in the X-axis so
    there is no gap. The vertical bars show the
    frequency density.
  • Frequency polygon - A line graph with mid-point
    (class mark) in the X axis and frequency in the Y
    axis. Extend the curve downward so that the line
    cuts the mid point of the class OUTSIDE of the
    distribution.

24
[Figure: bar graph of the number of females in groups (discrete data)]
25
[Figure: histogram of the height of females in the statistics class (continuous data)]
26
Frequency polygon/curve
Mid-points connected by lines. If the frequency
polygon is smooth, it is called a frequency
curve.
[Figure: frequency polygon for female height]
27
Cumulative frequency curve (Ogive)
A graph with the X axis denoting the UPPER class limit
and the Y axis the cumulative frequency, from 0
to the highest number
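
A sketch of the numbers behind an ogive, using the family sizes listed earlier in this lecture. For these ungrouped discrete data each distinct value is treated as its own class, so the value itself plays the role of the upper class limit; that simplification is an assumption of this example.

```python
from collections import Counter
from itertools import accumulate

# Family sizes for families 1-20, as listed in this lecture.
family_sizes = [5, 2, 3, 3, 4, 5, 3, 4, 4, 3, 4, 5, 2, 3, 2, 6, 4, 4, 6, 5]

frequency = Counter(family_sizes)
values = sorted(frequency)
# Running total of the frequencies, in increasing order of value.
cumulative = list(accumulate(frequency[v] for v in values))

for value, total in zip(values, cumulative):
    print(f"up to {value}: {total}")
```

The last cumulative figure equals the number of observations, which is the "highest number" on the Y axis.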
28
Cumulative frequency and normality
[Figure panels: normal; bimodal (two populations); right skewed/pointed]
29
Cumulative frequency and normality
Note: using the shape of the cumulative frequency
curve, data can be tested for whether they follow a
normal distribution or not; you will learn this in
Lecture 7, Kolmogorov-Smirnov Test (KS-Test)
30
Variance heterogeneity
Variance (s²) > mean (µ) means lack of normality
[Figure panels: homogeneous variance; heterogeneous variance]
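
The slide's rule of thumb (variance greater than the mean) can be checked quickly as sketched below. It is a screening heuristic as stated in this lecture, not a formal test, and `variance_exceeds_mean` is an illustrative name. The two data sets are the insect counts and family sizes listed earlier.

```python
from statistics import mean, variance

insect_counts = [25, 402, 203, 303, 204, 125, 38, 441, 200, 50,
                 112, 45, 200, 111, 0, 36, 14, 445, 60, 500, 1200,
                 300, 600, 20, 400, 30, 20, 22, 40, 300, 200,
                 1150, 300]
family_sizes = [5, 2, 3, 3, 4, 5, 3, 4, 4, 3, 4, 5, 2, 3, 2, 6, 4, 4, 6, 5]


def variance_exceeds_mean(values):
    """True when the sample variance (n - 1 denominator) is larger than the mean."""
    return variance(values) > mean(values)


for name, data in (("insects", insect_counts), ("family size", family_sizes)):
    print(f"{name}: mean = {mean(data):.2f}, s² = {variance(data):.2f}, "
          f"s² > mean: {variance_exceeds_mean(data)}")
```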
31
Most statistical tests, e.g. ANOVA and the t-test,
are based on the normal distribution. If data are not
normally distributed, what will you do?
1. Data transformation
2. Non-parametric tests (to learn later)
32
  • 1. Square root: √x
  • If variance tends to be equal and proportional to
    the mean
  • Mostly count data
  • Percentage data with wide range e.g. < 30 to > 70
  • Add 0.5 to the data if values are 0 or < 10, e.g.
    √(x + 0.5)
  • 2. Log or Ln: log(x) or ln(x), or log(x + 1) / ln(x + 1)
  • Add 1 to the data if values are < 10
  • Standard deviation is proportional to the mean or
    effects are multiplicative (e.g. SGR)
  • Skewed to the right
  • Growth of organisms (SGR)
  • Whole numbers with wide range
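
A minimal sketch of the square root and log transformations, assuming the offset (0.5 or 1) is applied to every value in the data set once it has been judged necessary. The function names and the choice of base 10 are illustrative; natural log works the same way with `math.log`.

```python
import math


def sqrt_transform(x, offset=0.5):
    """Square root transformation, sqrt(x + offset); use offset=0 for plain sqrt(x)."""
    return math.sqrt(x + offset)


def log_transform(x, offset=1.0):
    """Base-10 log transformation, log10(x + offset); use offset=0 for plain log10(x)."""
    return math.log10(x + offset)


# A few count values, including a zero, to show why the offsets are needed.
counts = [0, 3, 9, 40, 300]
print([round(sqrt_transform(c), 3) for c in counts])
print([round(log_transform(c), 3) for c in counts])
```

Without the offset, `log10(0)` is undefined, which is one reason the slide adds a constant to small values before transforming.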

33
  • 3. Arcsine or angular transformation: asin(x)
  • Normally coupled with square root transformation,
    i.e. asin(√x)
  • 0 should be replaced by 1/(4n) and 100 by 100
    - 1/(4n)
  • Notes
  • Percentage data ranging from 30 to 70 do not
    need to be transformed
  • Divide by 100 in case of percentage data before
    transformation
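
A sketch of the square root-arcsine transformation described above. Two assumptions are made here: the result is returned in degrees (radians are equally common), and the 0 and 100 replacements are applied in percent units only when the sample size `n` is supplied. The function name `arcsine_sqrt` is illustrative.

```python
import math


def arcsine_sqrt(percent, n=None):
    """Angular transformation asin(sqrt(p)) of a percentage, returned in degrees."""
    if n is not None:
        # Replace the boundary values as the slide recommends.
        if percent == 0:
            percent = 1 / (4 * n)
        elif percent == 100:
            percent = 100 - 1 / (4 * n)
    proportion = percent / 100  # divide by 100 before transforming
    return math.degrees(math.asin(math.sqrt(proportion)))


for pct in (0, 10, 50, 90, 100):
    print(f"{pct}% -> {arcsine_sqrt(pct, n=25):.2f} degrees")
```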

34
  • Transformation is done to bring the data to
    normality (reduce variation)
  • Once the parametric test is carried out, data have to
    be converted back to the original scale and then presented
    (but you should mention which transformation was
    done while analyzing the data)
  • 1. Square root: square it, (√x)²
  • 2. Log: antilog(log x), or
  • power(base, log x)
  • 3. Arcsine or angular transformation, or square
    root-arcsine: (sin(asin(√x)))²
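
The three back-transformations can be sketched as follows. The function names are illustrative, the log is assumed to be base 10, and the angular value is assumed to be in degrees. If an offset such as 0.5 or 1 was added before transforming, it should be subtracted again after back-transforming; that step is left out here.

```python
import math


def back_sqrt(y):
    """Undo a square root transformation by squaring."""
    return y ** 2


def back_log10(y):
    """Undo a base-10 log transformation (antilog)."""
    return 10 ** y


def back_arcsine(y_degrees):
    """Undo asin(sqrt(p)): returns the proportion p = sin(y)^2 (multiply by 100 for %)."""
    return math.sin(math.radians(y_degrees)) ** 2


print(back_sqrt(math.sqrt(7.0)))
print(back_log10(math.log10(250.0)))
print(back_arcsine(45.0))
```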

35
Practical session 3: Frequency and data transformation
Next class: Measures of central tendency (mean, median, mode etc.)