Title: QM1 Week 1 Descriptive Statistics
- Dr Alexander Moradi
- GPRG/CSAE Nuffield College
- Dept. of Economics, University of Oxford
- Email: alexander.moradi_at_economics.ox.ac.uk
3. Descriptive Statistics
- Def: Statistics used to summarise/describe a set of numbers
- Absolute frequency: Number of times a certain value occurs
- Relative frequency: Ratio of the number of observations in a statistical category (with a certain value) to the total number of observations
- Cumulative frequency: Number of observations which are less than or equal to any specified number
- Histogram: Visual representation of the number or proportion of observations falling into each of several categories or intervals
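The three frequency definitions can be illustrated with a minimal Python sketch. The sample values are hypothetical and made up for illustration only; they are not taken from the relief data set.

```python
from collections import Counter

# Hypothetical sample values (illustration only, not from the course data).
values = [5, 3, 5, 8, 3, 5, 9, 8]
n = len(values)

# Absolute frequency: how often each value occurs.
absolute = Counter(values)

# Relative frequency: absolute frequency divided by the number of observations.
relative = {v: count / n for v, count in absolute.items()}

# Cumulative frequency: number of observations less than or equal to each value.
cumulative = {}
running_total = 0
for v in sorted(absolute):
    running_total += absolute[v]
    cumulative[v] = running_total

for v in sorted(absolute):
    print(v, absolute[v], relative[v], cumulative[v])
```

The cumulative frequency of the largest value always equals the total number of observations, which is a quick check on the calculation.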
3. Frequencies
Example: Frequencies of relief payments (in intervals of 6 shillings)
3. Histogram
3. Quartiles, Deciles, and Percentiles
- Values that divide the data into certain fractions. Common divisions are:
- Quartiles divide the observations into four equal quarters
  - 1st quartile is the value that has 25% of the observations below it
  - 3rd quartile is the value that has 75% of the observations below it
- Deciles divide the data into 10 portions of equal size
  - 1st decile: 10% of the observations have a lower value than this
  - 2nd decile: 20% of the observations have a lower value than this
- Percentiles divide the data into 100 portions of equal size
  - Quartiles are the 25th, 50th, 75th percentile
  - Deciles are the 10th, 20th, 30th, ..., 90th percentile
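As a rough illustration of how quartiles, deciles and percentiles relate, the sketch below uses Python's standard `statistics.quantiles` function on hypothetical data. Statistical packages interpolate between observations in slightly different ways, so the values need not match the output of STATA's centile command exactly.

```python
import statistics

# Hypothetical, already ordered sample (illustration only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns the 3 cut points between quarters, n=10 the 9 cut points
# between tenths, n=100 the 99 cut points between hundredths.
quartiles = statistics.quantiles(data, n=4)
deciles = statistics.quantiles(data, n=10)
percentiles = statistics.quantiles(data, n=100)

print("quartiles:", quartiles)
print("1st quartile vs 25th percentile:", quartiles[0], percentiles[24])
print("2nd quartile, 5th decile, 50th percentile:",
      quartiles[1], deciles[4], percentiles[49])
```

The second quartile, the fifth decile and the 50th percentile coincide, and all three equal the median.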
3. Four Moments of a Distribution
- Statistics to summarize the distribution of a variable
  - Arithmetic mean
  - Variance
  - Skewness
  - Kurtosis
3.1 Measures of Central Tendency
- 1. Median: Value that divides the higher half of a sample from the lower half
  - Order the series in ascending order
  - Position = (Number of observations + 1)/2
  - If the position is a whole number, then the median is the value at this position
  - If the position is not a whole number, then the median is the average of the values at the two adjacent positions
  - => The median is the 2nd quartile, the 5th decile, and the 50th percentile
- 2. Mode: Most frequent value
- 3. Arithmetic mean
  In words: Sum of all values divided by total number of observations
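The position rule for the median, together with the mode and the arithmetic mean, can be sketched in a few lines of Python. The function names and the sample values are hypothetical and chosen only for illustration.

```python
from collections import Counter

def median_by_position(values):
    """Median via the rule: position = (number of observations + 1) / 2."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2      # 1-based position in the ordered series
    if position.is_integer():
        return ordered[int(position) - 1]  # whole number: take the value there
    lower = ordered[int(position) - 1]     # value just below the position
    upper = ordered[int(position)]         # value just above the position
    return (lower + upper) / 2

def mode(values):
    """Most frequent value (the first one encountered if there is a tie)."""
    return Counter(values).most_common(1)[0][0]

def mean(values):
    """Sum of all values divided by the number of observations."""
    return sum(values) / len(values)

odd_sample = [1, 3, 3, 4, 9]   # 5 observations: position 3
even_sample = [1, 3, 4, 9]     # 4 observations: position 2.5

print(median_by_position(odd_sample), mode(odd_sample), mean(odd_sample))
print(median_by_position(even_sample))
```

With five observations the position is 3, so the median is the third value. With four observations the position is 2.5, so the median is the average of the second and third values.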
3.1 Numeric Example
- Random sample (arbitrary values)
- What is the median, mode and mean?
  [Values shown on the slide: 5, 5, 6]
- What is the effect of adding case H?
  [Values shown on the slide: 5, 6, 9]
- What are the absolute and the relative frequency of value 5?
3.1 Weighted Average
- The arithmetic mean gives the value of each case the same weight
- Sometimes it is unreasonable to give all observations the same weight, e.g. when the data are aggregated by regions that do not have a similar population size, or when a data set consists of compounds, villages and towns
- We assign each observation a weight w
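A minimal sketch of a weighted arithmetic mean, assuming the usual definition: the sum of weight times value, divided by the sum of the weights. The regional rates and populations below are hypothetical numbers used only to show the effect of weighting.

```python
def weighted_mean(values, weights):
    """Sum of w * x divided by the sum of the weights w."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical regional rates and populations (illustration only).
rates = [10.0, 20.0, 30.0]
population = [1000, 3000, 6000]

unweighted = sum(rates) / len(rates)
weighted = weighted_mean(rates, population)

print(unweighted)  # every region counts equally
print(weighted)    # large regions count more
```

Because the most populous region has the highest rate, the weighted mean lies above the unweighted mean in this example.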
3.1 Exercise: Descriptive Statistics
- Data set: 1699_RELIEF.dta
- The file contains data on 311 parishes c.1831 that Boyer analysed in his study of the Old Poor Law. A short introduction and variable definitions can be found in FT, p. 496-498
- Plot a histogram of the value of real property (land and building) per head in 1815 (WEALTH)
  - Graphics/Histogram
  - Alternatively: Graphics/Two-way graph (Scatterplot, line, etc.)/Plot type: Histogram
- Change the bin width of the histogram
- Import the graph into a word processor
  - In STATA: File/Save Graph as .png or .tif
  - In Word: Insert/Picture/From File/
  - Try right-clicking the graph in STATA, then Copy and Paste
- Plot a bar graph of average wealth across English counties
  - Graphics/Bar charts/Summary Statistics/
  - Main: Statistic: mean, Variables: wealth
  - Over groups: Over1, Variable: county
3.1 Exercise: Descriptive Statistics
- Assign value labels to the COUNTY variable
  - label define ccode 1 "Kent" 2 "Sussex" 3 "Essex" 4 "Suffolk"
  - label values county ccode
  - Alternatively: Use the data editor
- What inferences can we draw about spatial differences in wealth in 1815?
- Calculate percentiles, quartiles and the deciles of WEALTH
  - Use the centile command
- Calculate the absolute, relative and cumulative frequency as well as median, mean and mode of variable WEALTH
  - Statistics/Summaries, tables, tests/Tables/Table of summary statistics (table)
  - Use the tabulate command
  - Use the mode command (you need to download and install this command)
3.2 Measures of Dispersion: Variance
- Two variables with equal arithmetic mean, but different spread
[Figure: distributions f(x) and f(y) centred on the same mean m; horizontal axis x, y]
- Variable x is more densely distributed around the mean m than variable y
- Variance: The variance is the arithmetic mean of the squared deviations from the mean
3.2 Measures of Dispersion: Standard Deviation
- Standard deviation of variable x
[Figure: distribution f(x) with the mean m and the standard deviation sx marked; horizontal axis x]
- Interpretation: Average or typical deviation of variable x from the arithmetic mean
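The verbal definitions of variance and standard deviation can be turned into a short Python sketch. It follows the wording on the slides literally and divides by the number of observations; many statistical packages divide by the number of observations minus one instead, so their output can differ slightly. The data are hypothetical.

```python
import math

# Hypothetical sample (illustration only).
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)

mean = sum(x) / n

# Variance: arithmetic mean of the squared deviations from the mean.
variance = sum((value - mean) ** 2 for value in x) / n

# Standard deviation: square root of the variance, in the units of x.
sd = math.sqrt(variance)

print(mean, variance, sd)
```

Taking the square root brings the measure back to the original units of x, which is why the standard deviation can be read as a typical deviation from the mean.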
3.2 Other Measures of Dispersion
- Range: Difference between minimum and maximum
- Inter-quartile range: Range of the central half of observations/distance between the first and third quartile
- Coefficient of variation: Measure of relative rather than absolute variation
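The three measures can be sketched as follows. The coefficient of variation is taken here to be the standard deviation divided by the arithmetic mean, which is the usual definition but is an assumption, since the formula is not reproduced in these notes. The data are hypothetical, and the quartiles depend on the interpolation rule of `statistics.quantiles`.

```python
import statistics

# Hypothetical sample (illustration only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Range: difference between maximum and minimum.
value_range = max(data) - min(data)

# Inter-quartile range: distance between the first and third quartile.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Coefficient of variation (assumed definition): standard deviation / mean.
# pstdev divides by the number of observations.
cv = statistics.pstdev(data) / statistics.mean(data)

print(value_range, iqr, round(cv, 3))
```

Unlike the range and the inter-quartile range, the coefficient of variation has no units, so it can be compared across variables measured on different scales.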
3.3 Shape of the Distribution: Skewness
- Values need not be symmetrically distributed around the central point: distributions can be skewed
- Mean and standard deviation are insufficient to describe the distribution
[Figure: frequency distribution over x with mode, median and mean marked. This distribution is skewed to the right (positively skewed)]
3.3 Kurtosis
- Two variables with equal mean and standard deviation, and symmetrically distributed, but a different kurtosis
[Figure: distributions f(x) and f(y) with common mean m and standard deviations sx and sy marked; horizontal axis x, y]
=> Here, variable y has a larger kurtosis than variable x
3.3 Skewness and Kurtosis
- Measures of skewness (a3) and kurtosis (a4) of a distribution
- Skewness and kurtosis of a normally distributed variable are zero and three, respectively
- Skewness
  - a3 > 0: distribution skewed to the right/positively skewed
  - a3 < 0: distribution skewed to the left/negatively skewed
- Kurtosis
  - a4 > 3: thicker tails and a higher peak than a normal distribution
  - a4 < 3: thinner tails and a lower peak than a normal distribution
- For a meaningful and comparable measure of a4, the distribution should be symmetrical
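The formulas for the skewness and kurtosis measures are not reproduced in these notes. The sketch below assumes the common moment-based definitions: the third and fourth central moments divided by the third and fourth power of the standard deviation, with all moments averaged over the number of observations. The function names and sample values are hypothetical.

```python
def central_moment(values, k):
    """Average of the k-th power of the deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Assumed definition: third central moment / standard deviation cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Assumed definition: fourth central moment / variance squared."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hypothetical samples (illustration only).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
left_skewed = [-x for x in right_skewed]

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), skewness(left_skewed))
```

A symmetric sample gives a skewness of zero, a long right tail gives a positive value, and mirroring the sample flips the sign.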
3.3 Consequences of a Skewed Distribution
- Socio-economic data in particular (wages, income, wealth and related variables) are frequently skewed
- Skewed variables can lead to undesirable effects in regressions
  - => Non-normally distributed residuals (misspecification)
  - => Heteroscedasticity: test statistics and confidence intervals are biased
- (Roughly) normally distributed variables help to avoid these problems. Take a look at the variable:
  - If the variable is not significantly skewed, continue
  - If the variable is skewed, transform the variable (Ladder of Powers). For this reason you often find the logarithm of income, the square root of the mortality rate, etc.
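The effect of a transformation from the ladder of powers can be seen in a small Python sketch. The income figures are hypothetical and deliberately extreme, and the skewness measure is the moment-based one (third central moment divided by the cubed standard deviation), which is an assumption about the definition used.

```python
import math

def skewness(values):
    """Assumed definition: third central moment / standard deviation cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical incomes with a long right tail (illustration only).
income = [1, 10, 100, 1000, 10000]

# Logarithm: one rung of the ladder of powers, pulls in large values.
log_income = [math.log10(x) for x in income]

print(skewness(income))      # clearly positive
print(skewness(log_income))  # approximately zero
```

In this constructed example the logarithms are evenly spaced, so the transformed variable is symmetric. Real data will usually be only roughly symmetric after transformation.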
3.4 Normal Distribution
- The normal distribution is a symmetrical, smooth, bell-shaped distribution that is fully described by the arithmetic mean and standard deviation
- Mode, median and mean are equal
- Measures of skewness and kurtosis of the normal distribution are equal to 0 and 3
- Key role in inductive statistics
3.4 Exercise: The Four Moments of a Distribution (Mean, Variance, Skewness, and Kurtosis)
- Data set: 1699_RELIEF.dta
- Calculate the standard deviation, inter-quartile range, skewness, and kurtosis of the WEALTH variable
  - Statistics/Summaries, tables, tests/Tables/Table of summary statistics (tabstat)
- Is the distribution positively skewed? Does the distribution have thicker tails than a normal distribution?
- Plot a histogram of the variable WEALTH and add a normal curve
  - Graphics/Histogram
  - Density Plots/Add normal density plot
- Does the visual test agree with the measures of skewness and kurtosis?
- Test whether the WEALTH variable can reasonably be expected to be normally distributed
  - Use the command sktest
3. STATA commands
3. Homework Exercises
- Read chapters 1 and 2 of FT
- Replicate Figure 2.2, upper and lower panel, in FT (p. 38) using STATA. Hints: see next page
- Calculate the weighted arithmetic mean of the migration rate (CNTYMIG) in 1911 (Table 2.4, p. 45)
- Do the following exercises from FT (p. 66-70): 2, 7, 8
- Read The Economist, Sep 30th 2006, "Soho Surprise"
  - What happened when drinking hours were liberalized
  - Give a short outline of how data can be used to analyse the effect of the relaxation of licensing hours
  - What are the difficulties?
  - Is the graph informative? Point to flaws in the graph!
- Try the UCLA online course at <http://www.ats.ucla.edu/stat/stata/notes3/default.htm>
- !!!! Don't send me your log file !!!! Use a word processor !!!! Include all answers in one file !!!! Generate concise tables !!!! No more than 4 pages !!!!
- DEADLINE: 14 Oct, midnight
Hints: Replicating the histograms in FT (p. 38)
1. Use the relief data set - the 1699_RELIEF.dta file
2. Open it in STATA
3. Choose from the menu bar Graphics/Histogram. A window opens where you enter the instructions
4. Enter under Main/Variable: relief. FT used intervals, each 6 shillings wide
5. Tick "Width of bins" and enter 6
6. Tick "Lower limit of first bin" and enter 0
7. Click on Submit => the shape of the histogram should be exactly the same as in Figure 2.2, upper panel. The rest is labelling the axes appropriately
8. On the right hand side of the Main tab you see Y-axis: Tick Frequency (it gives you the absolute frequency as units on the y-axis)
9. Click on the tab "Y-Axis" and enter as Title "Number of parishes" (with or without quotation marks)
10. Click on the tab "X-Axis" and enter as Title "Relief payments (shillings)"
11. Tab "X-Axis": Under Ticks/Lines, enter in the Custom field: 3 "0-6" 9 "6-12" 15 "12-18" 21 "18-24" 27 "24-30" 33 "30-36" 39 "36-42" 45 "42-48". The first number is where you want to have a tick on the x-axis, i.e. at 3 shillings (the centre between 0 and 6) you want to have a tick and below the tick a label "0-6", at 9 shillings you want to have another tick and below the tick a label "6-12"... It is a very special labelling of the bins, therefore "Custom". Try labelling the centres of the bins/bars. Try the following:
12. Delete what you entered under Custom. Enter in the "Rule" field: 3 (6) 45, i.e. you want to have ticks starting from 3 to 45 in steps of 6. If you don't like these labels, enter in the "Rule" field 0 (6) 48 instead
13. For the lower panel, you need to change the width of the bins, the lower limit and adjust the labels, i.e. you have to change the entries of steps 5, 6 and 11
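The logic behind the bin width, the lower limit and the tick rule can be checked with a short Python sketch. It uses the settings from the steps above (width 6, lower limit 0, last bin ending at 48). The payment values are hypothetical, and treating each bin as including its lower edge but not its upper edge is an assumption about how boundary values are assigned.

```python
from collections import Counter

width = 6    # "Width of bins"
lower = 0    # "Lower limit of first bin"
upper = 48   # right edge of the last bin in the labelling above

# Left edges of the bins: 0, 6, ..., 42.
left_edges = list(range(lower, upper, width))

# Bin centres, i.e. the tick positions given by the rule 3 (6) 45.
centres = [left + width // 2 for left in left_edges]

# Custom labels such as "0-6", "6-12", ...
labels = [f"{left}-{left + width}" for left in left_edges]

def bin_index(value):
    """Index of the bin containing value (assumption: lower edge included)."""
    return int((value - lower) // width)

# Hypothetical relief payments in shillings (illustration only).
payments = [2.5, 7, 11.9, 12, 40]
counts = Counter(labels[bin_index(p)] for p in payments)

print(centres)
print(labels)
print(dict(counts))
```

For the lower panel, only `width`, `lower` and `upper` would need to change, which mirrors the note on which steps to adjust.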
Appendix: County codes in the Relief data set (1699_RELIEF.dta)