Title: ARCH 2126/6126
1. ARCH 2126/6126
2. To recap, the purpose of statistics...
- To provide insight into situations and problems by means of numbers
- How is this provided?
- Data are available or are collected
- Data are organized, summarized, analysed and results presented
- Conclusions are drawn, in context
- Whole process is often guided by critical appraisal of similar work already done
3. From variation to variables
- Variables that can be analysed numerically are of several different sorts
- Categorical/qualitative/nominal variables
- Ranked/ordered/ordinal variables
- Numerical/quantitative/metric variables
- Different kinds of variables allow different kinds of numerical analysis
- This applies to the method of description or measurement, not the basic property
4. Data sets
- Usually data do not come singly; they come in, or are collected in, sets
- We collect them because we want to test some idea fairly against them
- E.g. we might want to test whether the stone artefacts from one site differ in size from stone artefacts from another
- For this, we measure artefact sizes systematically and consistently
5. What belongs in a data-set?
- "We have considered it prudent to adopt the years 1919-1925, excluding the drought year of 1926, as a fair standard for the future" (Queensland Land Settlement Advisory Board, 1927)
- "The tacit assumption that drought is an exceptional visitation to the inland country has shaped and infected public thought and official policy alike" (Francis Ratcliffe, 1937)
6. Making a measurement
- A variable is a measured property of a case; measuring assigns numbers representing each case's value for that variable
- Variables must be exactly defined and measurements reliably carried out
- Some variables are relatively simple but still need explicit specification, e.g. length
- Some are more complex and/or depend on non-obvious definitions, e.g. unemployment
7. Measurement is never perfectly accurate, but...
- Our measurement of scraper length is valid, to the extent that it measures what it is supposed to measure
- Our measurement is reliable, to the extent that repetitions of the same measurement give the same result
- Our measurement is unbiased, to the extent that it does not tend to under-state or over-state the true value of the variable
8. Recording a measurement
- Rare important observations deserve recording as insight-giving anecdotes
- But in many fields the bread and butter of research is common observations, where the issue is varying frequency
- Importance of a recording system
- Unsystematic recording is likely to lead to omissions or inconsistencies
- Limits to the benefits of precision
9. Recording technology
- Pen and paper still have their place
- Complex technology has its traps: its vulnerability, your dependence
- But early, direct or automatic data entry into computers can bring big benefits in efficient use of time and labour, error reduction and cross-checks
- Importance of duplicates and back-ups
10. How much data to collect?
- Limits to the benefit from measuring variables to many significant figures
- Limits to the benefit from increasing sample size indefinitely
- Limits to the benefit from increasing number of variables: how many will you analyse?
- Attention to limits can save lots of time
- Limits not fixed, but depend on the situation under study and the ideas under test
11. Spreadsheets (e.g. Excel) and databases (e.g. Access)
- End point of data collection is often a matrix or table: a column for each variable, a row for each case
- Often convenient to enter these into a spreadsheet or database (linkable, searchable)
- These can store, check, transform, calculate, apply conditions, select, test statistically, output to statpack
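As a rough illustration of the case-by-variable table described above, the sketch below reads a small table (one row per case, one column per variable) and applies a check and a selection. The artefact records, column names and the `load_cases` helper are all invented for this example; a spreadsheet or database would do the same job interactively.

```python
import csv
import io

# Hypothetical records: one row per case (artefact), one column per variable.
RAW = """artefact_id,site,length_mm,raw_material
A1,Site 1,42.5,chert
A2,Site 1,38.0,quartzite
A3,Site 2,51.2,chert
A4,Site 2,,chert
"""

def load_cases(text):
    """Parse CSV text into a list of dicts; length becomes a float, or None if blank."""
    cases = []
    for row in csv.DictReader(io.StringIO(text)):
        value = row["length_mm"].strip()
        row["length_mm"] = float(value) if value else None
        cases.append(row)
    return cases

cases = load_cases(RAW)

# Check: flag cases with a missing measurement.
missing = [c["artefact_id"] for c in cases if c["length_mm"] is None]

# Select: keep only Site 1 cases that have a recorded length.
site1 = [c for c in cases if c["site"] == "Site 1" and c["length_mm"] is not None]

print("Missing length:", missing)
print("Site 1 cases:", [c["artefact_id"] for c in site1])
```

Keeping missing values explicit (here as `None`) rather than entering a zero is one simple way to avoid the inconsistencies that unsystematic recording tends to produce.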
12. Study design: experiment versus observation
- How do we define? Variously, but element of control often the key
- For practical, ethical etc. reasons, experiments rare in our subjects
- But experimental design important
- Dependent variable = response variable under study
- Independent variable = explanatory variable or factor
13. Contexts and confounds
- Treatment: a combination of specific conditions (levels of experimental factors)
- Extraneous variables: ones not being studied but which may influence dependent variable, thus part of relevant context
- Effects of different (independent or extraneous) variables are said to be confounded if they cannot be distinguished
- Good study design requires data on context
14. Observational studies: the risks of confounding
- Well designed experiments minimize confounding by appropriate choice of variables, cases and treatments; random sequence of treatments; random allocation of cases to groups
- Observational surveys lack this control
- Groups may be self-selected
- Differences in groups may have causes other than the variables under study
- But much can be done despite limitations
15. ARCH 2126/6126
16. Examples of presentation
- Even the simplest forms of stating findings (percentages, averages) and the simplest graphical presentations emphasize selected aspects
- This can be legitimate but can also be misleading; much depends on the honesty and clarity with which procedure is described
- What as a percentage of what?
- Does the graph have linear scales? A zero?
- Please bring in examples yourselves
17. How can we see patterns inherent in our data-set?
- Start simple, e.g. frequency tables
- Frequency or
- Relative frequency
- Value of mental arithmetic and cross-checks: do figures make sense?
- Note rounding errors
- Keep an eye on sample sizes: do they change?
18. Frequency: absolute and relative
- Frequency of any value of a variable is the number of times that value is found, i.e. it is a count, a whole number
- Relative frequency of any value is its frequency, expressed as a proportion of all observations (often a percent)
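A minimal sketch of a frequency table, assuming a made-up list of raw-material observations; the `frequency_table` helper is hypothetical and simply pairs each count with its share of all observations.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (raw material of each artefact).
observations = ["chert", "quartzite", "chert", "silcrete", "chert", "quartzite"]

def frequency_table(values):
    """Return {value: (frequency, relative frequency)} for a list of observations."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, count / n) for value, count in counts.items()}

table = frequency_table(observations)
for value, (freq, rel) in sorted(table.items()):
    print(f"{value:10s} {freq:3d} {rel:6.1%}")

# Cross-checks: frequencies should sum to the sample size,
# and relative frequencies to 1 (up to rounding).
total_freq = sum(freq for freq, _ in table.values())
total_rel = sum(rel for _, rel in table.values())
print(total_freq, round(total_rel, 6))
```

Printed percentages are rounded, so a column of them may not add up to exactly 100; the unrounded proportions still sum to 1.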
19. Rates and ratios
- Ratio: the size of a number relative to another number
- Proportion: a ratio in which the second number includes the first
- Percentage: a proportion multiplied by 100
- Rate: a ratio of the number of events to the number of cases at risk of experiencing that event
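The four definitions above differ only in what goes in the denominator. The short sketch below works them through with invented counts (the artefact and breakage figures are assumptions for illustration, not data from the course).

```python
# Hypothetical counts, invented for illustration only.
flakes = 30                      # flakes found at a site
cores = 10                       # cores found at the same site
artefacts = flakes + cores       # all artefacts (includes the flakes)

ratio = flakes / cores           # one number relative to another
proportion = flakes / artefacts  # denominator includes the numerator
percentage = proportion * 100    # proportion multiplied by 100

# Rate: events relative to the cases at risk of experiencing the event.
breakages = 4                    # artefacts broken during transport (the event)
transported = 40                 # artefacts that were transported (cases at risk)
rate = breakages / transported

print(f"ratio of flakes to cores: {ratio:.1f}")
print(f"proportion of flakes:     {proportion:.2f}")
print(f"percentage of flakes:     {percentage:.0f}%")
print(f"breakage rate:            {rate:.2f} per artefact transported")
```

Stating the denominator alongside the figure answers the question "what as a percentage of what?".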
20. Good to graph results, but graphs can also mislead
- Graphical depictions include line graphs, bar charts, histograms, pie charts, stem-and-leaf plots, scatterplots
- Line graphs usually used to plot variable against time (on horizontal axis); show seasons and trends
- Different scales give different impressions, e.g. non-zero base to vertical axis, unequal units, log scale
21. Bar charts and histograms
- Bar charts compare the values of different variables, often categorical
- Histograms display frequency or relative frequency distributions of one variable at a time
- Width of histogram bars has meaning
- Eyes respond to impressions of area: symbols, unequal widths, pseudo-3D can give a misleading impression
22. Scatterplots
- Show the distributions of two variables at once (i.e. bivariate data)
- If one variable is independent and one dependent, independent goes on horizontal axis
- Essence of any relationship between them is apparent visually: +ve or -ve, strong or weak, simple or complex
- This can affect future statistical testing
23. Measures of central tendency
- The arithmetic mean (average): add all observations together, divide the total by the number of observations
- (Also geometric and harmonic mean)
- The median: arrange all observations in order, find the middle one or the mid-point of the middle two
- The mode: find the commonest value
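The three measures can be checked by hand against Python's standard `statistics` module. The scraper lengths below are hypothetical, chosen so that the sample has an even number of observations and the median is the mid-point of the middle two.

```python
import statistics

lengths_mm = [32, 35, 35, 38, 41, 44]  # hypothetical scraper lengths

mean = statistics.mean(lengths_mm)      # 225 / 6 = 37.5
median = statistics.median(lengths_mm)  # mid-point of 35 and 38 = 36.5
mode = statistics.mode(lengths_mm)      # commonest value = 35

# The other means mentioned above, for positive data only.
geometric = statistics.geometric_mean(lengths_mm)
harmonic = statistics.harmonic_mean(lengths_mm)

print(mean, median, mode)
print(round(geometric, 2), round(harmonic, 2))
```

For positive values that are not all equal, the harmonic mean is smaller than the geometric mean, which is smaller than the arithmetic mean.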
24. Central tendencies and distributions
- In a normal distribution, graph is symmetrical; mean, median and mode are similar
- But distributions may be different, e.g. may be skewed to left or right
- Mean is often convenient but is strongly affected by outliers
- To avoid this, can use median: less affected by outliers and skews
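To see how differently the mean and median react to a single outlier, the sketch below adds one extreme value to a small invented sample and recomputes both.

```python
import statistics

values = [30, 32, 35, 38, 40]   # hypothetical sample
with_outlier = values + [500]   # the same sample plus one extreme observation

mean_before = statistics.mean(values)            # 175 / 5 = 35
median_before = statistics.median(values)        # middle value = 35
mean_after = statistics.mean(with_outlier)       # 675 / 6 = 112.5
median_after = statistics.median(with_outlier)   # mid-point of 35 and 38 = 36.5

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

One extreme observation more than triples the mean here, while the median moves by only 1.5.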
25. Measures of central tendency are useful, but...
- If average income is above the poverty line, is poverty abolished?
- If the average child is at a weight-for-age which is thought to indicate healthy growth, are all growing healthily?
- If first agriculturalists diffused into Europe at an average rate of 1 km / year, does that imply rate was constant?
- Variation is ubiquitous; sample is not fully characterized by its mean/median
26. So we also need measures of dispersion (variability)
- Range: maximum minus minimum (affected by outliers)
- Percentiles: median is 50th percentile; we can also find 25th and 75th percentiles (dividing sample into quartiles), or 20th, 40th, 60th and 80th (quintiles), or 3rd and 97th percentiles etc.
- Interquartile range (75th percentile minus 25th) is more stable than range
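A short sketch of range, quartiles and interquartile range on an invented sample containing one large value. Quartile conventions differ between textbooks and packages; this example assumes the `"inclusive"` method of `statistics.quantiles`, which treats the data as the whole population of interest, so other software may give slightly different cut points.

```python
import statistics

# Hypothetical lengths (mm), sorted, with one unusually large value.
lengths_mm = [28, 31, 33, 36, 38, 40, 43, 47, 90]

value_range = max(lengths_mm) - min(lengths_mm)   # 90 - 28 = 62

# Cut points dividing the sample into four equal parts: 25th, 50th, 75th percentiles.
q1, q2, q3 = statistics.quantiles(lengths_mm, n=4, method="inclusive")
iqr = q3 - q1                                     # 43 - 33 = 10

print("range:", value_range)
print("quartiles:", q1, q2, q3)
print("interquartile range:", iqr)
```

Dropping the value 90 would cut the range from 62 to 19, whereas the quartiles depend on the middle of the sample and shift far less.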
27. Box plots
- A simple box-and-whisker plot consists of a box (interquartile range) with a central line (median) and a further line each side of the box (to the extremes)
- More elaborate versions represent outliers (>1½ x box length from box) by dots not joined to the main whisker, far outliers (>3 x box length)
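The outlier rule above can be sketched without drawing anything: measure how far each observation lies beyond the box and compare that distance with 1½ and 3 box lengths. The data and the `classify` helper are hypothetical, and the quartiles again assume the `"inclusive"` method of `statistics.quantiles`.

```python
import statistics

def classify(values):
    """Label each value as 'inside', 'outlier' or 'far outlier' by distance from the box."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1  # interquartile range
    labels = {}
    for v in values:
        # Distance beyond the nearer edge of the box (0 if inside the box).
        distance = max(q1 - v, v - q3, 0)
        if distance > 3 * box_length:
            labels[v] = "far outlier"
        elif distance > 1.5 * box_length:
            labels[v] = "outlier"
        else:
            labels[v] = "inside"
    return labels

# Hypothetical lengths (mm): quartiles are 34 and 44, so the box length is 10.
lengths_mm = [30, 32, 34, 36, 38, 40, 44, 62, 90]
labels = classify(lengths_mm)
for value, label in labels.items():
    print(value, label)
```

Here 62 lies 18 mm beyond the box (more than 15) and 90 lies 46 mm beyond it (more than 30), so they fall into the two outlier classes.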
28. Dispersion around the mean
- Variance and standard deviation
- Standard deviation = √variance
- A little more complex to calculate but has some very useful properties
- Main component of variance is sum of squares, i.e. subtract mean from each observation, square result, add them up; then divide SoS by (sample size - 1)
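The recipe above (sum of squares divided by sample size minus 1, then a square root) can be written out step by step and compared with the standard library. The data values and the `sample_variance` helper are made up for illustration.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (sample size - 1)."""
    mean = sum(values) / len(values)
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return sum_of_squares / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations; mean = 5, SoS = 32

variance = sample_variance(data)  # 32 / 7
sd = math.sqrt(variance)          # standard deviation = square root of variance

print(round(variance, 3), round(sd, 3))
# The library functions use the same (n - 1) divisor.
print(round(statistics.variance(data), 3), round(statistics.stdev(data), 3))
```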
29. Means, standard deviations and normal distributions
- Normal curves are symmetric, bell-shaped, drop off quickly, few outliers
- Mean = median = mode
- There are many normal curves: mean and standard deviation specify shape
- In a normal curve, the inflection point, where the curve stops steepening and the tails begin to flatten out, is 1 SD from mean
- SD/mean = coefficient of variation
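Because the coefficient of variation divides the SD by the mean, it has no units and can be compared across variables measured on different scales. A small sketch with invented length and weight samples (the `coefficient_of_variation` helper is hypothetical):

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unit-free)."""
    return statistics.stdev(values) / statistics.mean(values)

lengths_mm = [30, 40, 50]   # hypothetical: mean 40, SD 10
weights_g = [3, 4, 5]       # hypothetical: mean 4, SD 1

cv_length = coefficient_of_variation(lengths_mm)
cv_weight = coefficient_of_variation(weights_g)

print(round(cv_length, 3), round(cv_weight, 3))
```

The SDs differ tenfold (10 mm against 1 g), but relative to their means the two samples are equally variable.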
30. Properties of normal curve
- About 68% of observations fall within 1 SD of mean
- About 95% fall within 2 SDs of mean
- About 99.7% fall within 3 SDs of mean
- A transformation may help to make a distribution approximately normal
- A raw observation can be converted into a standardized (z) score, on a scale with mean 0 and SD 1, to find the probability of its occurrence
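The percentages above, and the use of a z score, can be checked against the standard normal curve with `statistics.NormalDist`. The mean, SD and observation in the second half are assumed values for illustration, and the `share_within` helper is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")

# Converting a raw observation to a z score (assumed mean and SD).
mean_mm, sd_mm = 40.0, 5.0
observation = 50.0
z = (observation - mean_mm) / sd_mm      # 2 SDs above the mean

# Probability of an observation at least this large, if lengths are normal.
p_at_least = 1 - standard_normal.cdf(z)
print(f"z = {z}, P(value >= {observation}) = {p_at_least:.4f}")
```

The exact share within 2 SDs is slightly above 95%; the round figure of 95% corresponds more closely to 1.96 SDs.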