QM1 Week 1 Descriptive Statistics

1
QM1 Week 1 Descriptive Statistics

Dr Alexander Moradi
GPRG/CSAE Nuffield College
Dept. of Economics, University of Oxford
Email alexander.moradi_at_economics.ox.ac.uk

2
3. Descriptive Statistics

Definition: Statistics used to summarise or describe a set of numbers
Absolute frequency: Number of times a certain value occurs
Relative frequency: Ratio of the number of observations in a statistical category (with a certain value) to the total number of observations
Cumulative frequency: Number of observations that are less than or equal to a specified number
Histogram: Visual representation of the number or proportion of observations falling into each of several categories or intervals
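
The three kinds of frequency defined above can be illustrated with a minimal Python sketch. The sample values are hypothetical and chosen only so that the result is easy to check by hand.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample; the values are illustrative only.
values = [5, 3, 5, 8, 3, 5, 6]
n = len(values)

# Absolute frequency: how often each value occurs (sorted by value).
absolute = dict(sorted(Counter(values).items()))
# Relative frequency: absolute frequency divided by the number of observations.
relative = {v: count / n for v, count in absolute.items()}
# Cumulative frequency: running total of the counts in ascending order.
cumulative = dict(zip(absolute, accumulate(absolute.values())))

for v in absolute:
    print(v, absolute[v], round(relative[v], 3), cumulative[v])
```

The relative frequencies sum to 1, and the cumulative frequency of the largest value equals the number of observations.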

3
3. Frequencies
Example: Frequencies of relief payments (in intervals of 6 shillings)
4
3. Histogram
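
A histogram rests on counting the observations per interval. The following Python sketch shows that counting step for bins 6 units wide starting at 0. The function name `bin_counts`, the payment values, and the convention that each bin includes its lower limit but not its upper limit are assumptions made for illustration; statistical packages may treat values on a bin boundary differently.

```python
from collections import Counter

def bin_counts(values, width, lower=0):
    """Count observations per interval [lower + k*width, lower + (k+1)*width)."""
    counts = Counter((v - lower) // width for v in values)
    return {
        (lower + k * width, lower + (k + 1) * width): counts[k]
        for k in sorted(counts)
    }

# Hypothetical relief payments in shillings (not taken from the data set).
payments = [2, 5, 7, 8, 11, 13, 13, 17, 20, 26]
table = bin_counts(payments, width=6)

for (start, end), count in table.items():
    print(f"{start}-{end}: {count}")
```

Changing `width` regroups the same observations into wider or narrower bars, which is what changing the bin width of a histogram does.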
5
3. Quartiles, Deciles, and Percentiles

Values that divide the data into certain fractions. Common divisions are:
Quartiles: divide the observations into four equal quarters
1st quartile is the value that has 25% of the observations below it
3rd quartile is the value that has 75% of the observations below it
Deciles: divide the data into 10 portions of equal size
1st decile: 10% of the observations have a lower value than this
2nd decile: 20% of the observations have a lower value than this
Percentiles: divide the data into 100 portions of equal size
Quartiles are the 25th, 50th and 75th percentiles
Deciles are the 10th, 20th, 30th, ..., 90th percentiles
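
The relationship between quartiles, deciles and percentiles can be checked with a short Python sketch using the standard-library function `statistics.quantiles`. The data are hypothetical. Interpolation rules for quantiles differ between software packages, so other programs may return slightly different cut points for the same data.

```python
import statistics

# Hypothetical observations, small enough to check by hand.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 20]

quartiles = statistics.quantiles(data, n=4)      # 3 cut points
deciles = statistics.quantiles(data, n=10)       # 9 cut points
percentiles = statistics.quantiles(data, n=100)  # 99 cut points

# The quartiles coincide with the 25th, 50th and 75th percentiles
# (list index 24 holds the 25th percentile, and so on).
print(quartiles)
print(percentiles[24], percentiles[49], percentiles[74])
# The 5th decile is the median.
print(deciles[4], statistics.median(data))
```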

6
3. Four Moments of a Distribution

Statistics to summarize the distribution of a
variable
Arithmetic mean
Variance
Skewness
Kurtosis

7
3.1 Measures of Central Tendency

1. Median: Value that divides the higher half of a sample from the lower half
Arrange the series in ascending order
Position = (Number of observations + 1)/2
If the position is a whole number (odd number of observations), the median is the value at this position
If the position is not a whole number (even number of observations), the median is the average of the values at the two adjacent positions
-> The median is the 2nd quartile, the 5th decile, and the 50th percentile
2. Mode: Most frequent value
3. Arithmetic mean

In words: Sum of all values divided by the total number of observations
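
The position rule for the median can be written out as a minimal Python sketch and compared with the standard-library results for mode and mean. The function name `median_by_position` and both samples are hypothetical and are not the data from the slides.

```python
import statistics

def median_by_position(values):
    """Median via the position rule (n + 1) / 2 on the ordered series."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2               # 1-based position in the ordered series
    if position.is_integer():            # odd n: a single middle value
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]   # even n: average the two neighbours
    upper = ordered[int(position)]
    return (lower + upper) / 2

# Hypothetical sample, then the same sample with one extreme value added.
sample = [3, 5, 5, 6, 11]
with_outlier = sample + [30]

for series in (sample, with_outlier):
    print(median_by_position(series), statistics.mode(series), statistics.mean(series))
```

In this sketch the extreme value moves the mean far more than the median, and leaves the mode unchanged.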
8
3.1 Numeric Example

Random sample of whatsoever

What is the median, mode and mean?

5
5
6

What is the effect of adding case H?

5
6
9

What are the absolute and the relative frequencies of the value 5?

9
3.1 Weighted Average

The arithmetic mean gives the value of each case the same weight
Sometimes it is unreasonable to give all observations the same weight, e.g. when data are aggregated by regions that differ in population size, or when a data set consists of compounds, villages and towns
We assign each observation a weight w
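
A weighted average is the sum of weight times value divided by the sum of the weights. The Python sketch below applies this under the assumption that the weights are population sizes. The function name `weighted_mean` and all numbers are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical rates for three regions, weighted by hypothetical populations.
rates = [2.0, 4.0, 10.0]
population = [100, 300, 600]

simple = sum(rates) / len(rates)
weighted = weighted_mean(rates, population)
print(simple, weighted)
```

With equal weights the weighted average reduces to the ordinary arithmetic mean.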

10
3.1 Exercise Descriptive Statistics

Data set: 1699_RELIEF.dta
The file contains data on 311 parishes c. 1831 that Boyer analysed in his study of the Old Poor Law. A short introduction and variable definitions can be found in FT, p. 496-498
Plot a histogram of the value of real property (land and building) per head in 1815 (WEALTH)
Graphics/Histogram
Alternatively: Graphics/Two-way graph (Scatterplot, line, etc.)/Plot type: Histogram
Change the bin width of the histogram
Import the graph into a word processor
In STATA: File/Save Graph as .png or .tif
In Word: Insert/Picture/From File/
Or try right-clicking the graph in STATA, then Copy and Paste
Plot a bar graph of average wealth across English counties
Graphics/Bar charts/Summary Statistics/
Main: Statistic mean, Variables wealth
Over groups: Over1, Variable county
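
A bar graph of average wealth by county plots one arithmetic mean per group. The calculation behind the bars can be sketched in Python as follows. The function name `mean_by_group` and the (county, wealth) pairs are hypothetical and are not taken from 1699_RELIEF.dta.

```python
from collections import defaultdict

def mean_by_group(rows):
    """Return {group: arithmetic mean of the values in that group}."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {g: sum(vals) / len(vals) for g, vals in groups.items()}

# Hypothetical (county, wealth) pairs, for illustration only.
rows = [
    ("Kent", 4.0), ("Kent", 6.0),
    ("Sussex", 3.0),
    ("Essex", 5.0), ("Essex", 7.0), ("Essex", 9.0),
]
means = mean_by_group(rows)
print(means)
```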

11
3.1 Exercise Descriptive Statistics

Assign value labels to the COUNTY variable
label define ccode 1 "Kent" 2 "Sussex" 3 "Essex" 4 "Suffolk"
label values county ccode
Alternatively: Use the data editor
What inferences can we draw about spatial differences in wealth in 1815?
Calculate percentiles, quartiles and the deciles
of WEALTH
Use the centile command
Calculate the absolute, relative and cumulative
frequency as well as median, mean and mode of
variable WEALTH
Statistics/Summaries, tables, tests/
Tables/Table of summary statistics (table)
Use the tabulate command
Use the mode command
(you need to download and install this command)

12
3.2 Measures of Dispersion Variance

Two variables with equal arithmetic mean, but
different spread

[Figure: density curves f(x) and f(y) centred on the same mean m; horizontal axis: x, y]

Variable x is more densely distributed around the
mean m than variable y
Variance

The variance is the arithmetic mean of the
squared deviations from the mean
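
The verbal definition of the variance translates directly into a short Python sketch. The two variables are hypothetical, chosen to have the same mean but different spread. The sketch divides by the number of observations n, as in the definition above; statistical packages commonly report the sample variance, which divides by n - 1 and is therefore slightly larger.

```python
import statistics

def variance(values):
    """Arithmetic mean of the squared deviations from the mean (divisor n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Two hypothetical variables with the same mean (5) but different spread.
x = [4, 5, 5, 5, 6]
y = [1, 3, 5, 7, 9]

print(variance(x), variance(y))
# Sample variance (divisor n - 1) for comparison.
print(statistics.variance(x), statistics.variance(y))
```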
13
3.2 Measures of Dispersion Standard Deviation

Standard deviation of variable x

[Figure: density curve f(x) centred on the mean m, with the standard deviation sx marked; horizontal axis: x]

Interpretation: Average or typical deviation of variable x from the arithmetic mean

14
3.2 Other Measures of Dispersion

Range: Difference between minimum and maximum
Inter-quartile range: Range of the central half of observations / distance between the first and third quartile
Coefficient of variation: Measure of relative rather than absolute variation
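
The measures of dispersion can be compared on one small data set with the Python sketch below. The data are hypothetical. The coefficient of variation is computed here as the standard deviation divided by the mean, the standard deviation uses the divisor n, and the quartiles follow the default interpolation rule of `statistics.quantiles`, so other software may give slightly different values.

```python
import math
import statistics

# Hypothetical observations.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 20]

mean = statistics.mean(data)
# Standard deviation: square root of the variance (divisor n).
sd = math.sqrt(sum((v - mean) ** 2 for v in data) / len(data))
# Range: difference between maximum and minimum.
value_range = max(data) - min(data)
# Inter-quartile range: distance between the first and third quartile.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
# Coefficient of variation: dispersion relative to the mean.
cv = sd / mean

print(round(sd, 3), value_range, iqr, round(cv, 3))
```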

15
3.3 Shape of the Distribution Skewness

Values need not be symmetrically distributed around the central point: distributions can be skewed
Mean and standard deviation are insufficient to describe the distribution

[Figure: frequency curve over x that is skewed to the right (positively skewed), with mode, median and mean marked]
16
3.3 Kurtosis

Two variables with equal mean and standard
deviation, and symmetrically distributed, but a
different kurtosis

[Figure: density curves f(x) and f(y) with common mean m and standard deviations sx and sy; horizontal axis: x, y]
-> Here, variable y has a larger kurtosis than variable x
17
3.3 Skewness and Kurtosis

Measures of skewness and kurtosis of a
distribution

Skewness and kurtosis of a normally distributed variable are zero and three, respectively
Skewness
a3 > 0: distribution skewed to the right / positively skewed
a3 < 0: distribution skewed to the left / negatively skewed
Kurtosis
a4 > 3: thicker tails and a higher peak than a normal distribution
a4 < 3: thinner tails and a lower peak than a normal distribution
For a meaningful and comparable measure of a4, the distribution should be symmetrical
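
The Python sketch below computes a3 and a4, assuming they are the usual moment-based measures: the mean of the third (or fourth) power of the deviations from the mean, divided by the third (or fourth) power of the standard deviation. This choice matches the reference values of 0 and 3 for the normal distribution, but the function names and data are hypothetical.

```python
import math

def standardised_moment(values, k):
    """k-th central moment divided by the k-th power of the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum((v - mean) ** k for v in values) / n / sd ** k

def skewness(values):
    return standardised_moment(values, 3)   # a3

def kurtosis(values):
    return standardised_moment(values, 4)   # a4

# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

The symmetric sample has a3 equal to zero, the sample with the long right tail has a3 greater than zero, and mirroring that sample flips the sign of a3.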

18
3.3 Consequences of a Skewed Distribution

Socio-economic data in particular (wages, income, wealth and related variables) are frequently skewed
Skewed variables can lead to undesirable effects in regressions
-> Non-normally distributed residuals (misspecification)
-> Heteroscedasticity: test statistics and confidence intervals are biased
(Roughly) normally distributed variables help to avoid these problems. Take a look at the variable:
If the variable is not significantly skewed, continue
If the variable is skewed, transform the variable (Ladder of Powers). For this reason you often find the logarithm of income, the square root of the mortality rate, etc.
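
The effect of a logarithmic transformation on a right-skewed variable can be illustrated with a small Python sketch. The income values are hypothetical, and skewness is measured by the third standardised moment with divisor n, which is an assumption about the measure used.

```python
import math

def skewness(values):
    """Third standardised moment a3 (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum((v - mean) ** 3 for v in values) / n / sd ** 3

# Hypothetical incomes with one very large value.
incomes = [10, 12, 15, 18, 20, 25, 30, 40, 60, 200]
log_incomes = [math.log(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
print(round(raw_skew, 2), round(log_skew, 2))
```

In this example the logarithm reduces the skewness but does not remove it; whether a transformation is sufficient has to be checked for each variable.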

19
3.4 Normal Distribution

The normal distribution is a symmetrical, smooth,
bell-shaped distribution that is fully described
by the arithmetic mean and standard deviation
Mode, median and mean are equal
Measures of skewness and kurtosis of the normal
distribution are equal to 0 and 3
Key role in inductive
statistics

20
3.4 Exercise The Four Moments of a Distribution
(Mean, Variance, Skewness, and Kurtosis)

Data set: 1699_RELIEF.dta
Calculate the standard deviation, inter-quartile range, skewness, and kurtosis of the WEALTH variable
Statistics/Summaries, tables, tests/Tables/Table of summary statistics (tabstat)
Is the distribution positively skewed? Does the distribution have thicker tails than a normal distribution?
Plot a histogram of the variable WEALTH and add a normal curve
Graphics/Histogram
Density Plots/Add normal density plot
Does the visual test agree with the measures of skewness and kurtosis?
Test whether the WEALTH variable can reasonably be expected to be normally distributed
Use the command sktest

21
3. STATA commands
22
3. Homework Exercises

Read chapters 1 & 2 of FT
Replicate Figure 2.2, upper & lower panel, in FT (p. 38) using STATA. Hints: see next page
Calculate the weighted arithmetic mean of the migration rate (CNTYMIG) in 1911 (Table 2.4, p. 45)
Do the following exercises from FT (p. 66-70): 2, 7, 8
Read The Economist, Sep 30th 2006, "Soho Surprise - What happened when drinking hours were liberalized"
Give a short outline of how data can be used to analyse the effect of the relaxation of licensing hours
What are the difficulties?
Is the graph informative? Point to flaws in the graph!
Try the UCLA online course at <http://www.ats.ucla.edu/stat/stata/notes3/default.htm>
!!!! Don't send me your log file !!!! Use a word processor !!!! Include all answers in one file !!!! Generate concise tables !!!! No more than 4 pages !!!!
DEADLINE: 14 Oct, midnight

23
Hints Replicating the histograms in FT (p.38)

1. Use the relief data set - the 1699_RELIEF.dta file
2. Open it in STATA
3. Choose from the menu bar Graphics/Histogram. A window opens where you enter the instructions
4. Enter under Main/Variable: relief. FT used intervals, each 6 shillings wide
5. Tick "Width of bins" and enter 6
6. Tick "Lower limit of first bin" and enter 0
7. Click on Submit -> the shape of the histogram should be exactly the same as in Figure 2.2, upper panel. The rest is labelling the axes appropriately
8. On the right hand side of the Main tab you see Y-axis: Tick Frequency (it gives you the absolute frequency as units on the y-axis)
9. Click on the tab "Y-Axis" and enter as Title "Number of parishes" (with or without quotation marks)
10. Click on the tab "X-Axis" and enter as Title "Relief payments (shillings)"
11. Tab "X-Axis": Under Ticks/Lines, enter in the Custom field 3 "0-6" 9 "6-12" 15 "12-18" 21 "18-24" 27 "24-30" 33 "30-36" 39 "36-42" 45 "42-48". The first number is where you want to have a tick on the x-axis, i.e. at 3 shillings (the centre between 0 and 6) you want to have a tick and below the tick a label "0-6", at 9 shillings you want to have another tick and below the tick a label "6-12"... It is a very special labelling of the bins, therefore "Custom". Alternatively, try labelling the centres of the bins/bars. Try the following
12. Delete what you entered under Custom. Enter in the "Rule" field 3 (6) 45, i.e. you want to have ticks starting from 3 to 45 in steps of 6. If you don't like these labels, enter in the "Rule" field 0 (6) 48 instead
13. For the lower panel, you need to change the width of the bins, the lower limit and adjust the labels, i.e. you have to change the entries of steps 5, 6 and 11
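
The custom tick entries follow a simple pattern: one tick at the centre of each bin, labelled with the bin's limits. For a different bin width or lower limit, the entry can be generated rather than typed by hand. The Python sketch below is one way to do this; the function name `bin_ticks` is hypothetical, and the sketch assumes integer limits and an even bin width.

```python
def bin_ticks(lower, upper, width):
    """Return (tick position, label) pairs, one per bin, placed at the bin centres."""
    ticks = []
    for start in range(lower, upper, width):
        centre = start + width // 2   # assumes an even integer width
        ticks.append((centre, f"{start}-{start + width}"))
    return ticks

# Bins 6 shillings wide from 0 to 48, as in the hints above.
ticks = bin_ticks(0, 48, 6)
custom_field = " ".join(f'{pos} "{label}"' for pos, label in ticks)
print(custom_field)
```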