Title: ARCH 2126/6126
1. ARCH 2126/6126
- Session 3: Summarizing data visually
2. To recap, the purpose of statistics...
- To provide insight into situations and problems by means of numbers
- How is this provided?
- Data are available or are collected
- Data are organized, summarized, analysed and results presented
- Conclusions are drawn, in context
- Whole process is often guided by critical appraisal of similar work already done
3. Data sets
- Usually data do not come singly: they come in, or are collected in, sets
- We collect them because we want to test some idea fairly against them
- E.g. we might want to test whether the stone artefacts from one site differ in size from stone artefacts from another
- For this, we measure artefact sizes systematically and consistently
4. Some issues implied by this
- Definition of data-set: what belongs in it and what does not
- Performance of each measurement: accurate and repeatable
- Methods of summarizing and analysing patterns in the data-set as a whole: visual and numerical
- Today mainly: defining data-sets, measurement and visual summary
5. What belongs in a data-set?
- "We have considered it prudent to adopt the years 1919-1925, excluding the drought year of 1926, as a fair standard for the future" (Queensland Land Settlement Advisory Board, 1927)
- "The tacit assumption that drought is an exceptional visitation to the inland country has shaped and infected public thought and official policy alike" (Francis Ratcliffe, 1937)
6. Making a measurement
- A variable is a measured property of a case; measuring assigns numbers representing each case's value for that variable
- Variables must be exactly defined and measurements reliably carried out
- Some variables are relatively simple but still need explicit specification, e.g. length
- Some are more complex and/or depend on non-obvious definitions, e.g. unemployment
7. Measurement is never perfectly accurate, but...
- Our measurement of scraper length is valid, to the extent that it measures what it is supposed to measure
- Our measurement is reliable, to the extent that repetitions of the same measurement give the same result
- Our measurement is unbiased, to the extent that it does not tend to under-state or over-state the true value of the variable
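Reliability and bias can be estimated from repeated measurements, as in the minimal sketch below. The readings and the "true" reference length are hypothetical, invented purely for illustration; in practice the true value is rarely known and has to be approximated by a trusted standard. Validity cannot be checked this way, because it depends on whether length as defined is the right thing to measure.

```python
from statistics import mean, stdev

# Hypothetical data: one scraper measured five times (mm), compared with a
# reference length that is ASSUMED known for the sake of the example.
TRUE_LENGTH_MM = 50.0
repeat_measurements_mm = [50.4, 50.6, 50.5, 50.3, 50.7]


def bias(measurements, true_value):
    """Mean signed error: positive means the method tends to over-state."""
    return mean(measurements) - true_value


def spread(measurements):
    """Sample standard deviation of the repeats: smaller means more reliable."""
    return stdev(measurements)


print(f"bias   = {bias(repeat_measurements_mm, TRUE_LENGTH_MM):+.2f} mm")
print(f"spread = {spread(repeat_measurements_mm):.2f} mm")
```

Here the repeats agree closely with each other (reliable) yet all sit above the reference (biased), which shows that the two properties are separate.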
8. Recording a measurement
- Rare important observations deserve recording as insight-giving anecdotes
- But in many fields the bread and butter of research are common observations where the issue is varying frequency
- Importance of a recording system
- Unsystematic recording is likely to lead to omissions or inconsistencies
- Limits to the benefits of precision
9. Recording technology
- Pen and paper still have their place
- Complex technology has its traps: its vulnerability, your dependence
- But early, direct or automatic data entry into computers can bring big benefits in efficient use of time and labour, error reduction and cross-checks
- Importance of duplicates and back-ups
10. How much data to collect?
- Limits to the benefit from measuring variables to many significant figures
- Limits to the benefit from increasing sample size indefinitely
- Limits to the benefit from increasing number of variables: how many will you analyse?
- Attention to limits can save lots of time
- Limits not fixed, but depend on the situation under study and the ideas under test
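One common way to see the limit on sample size is the standard error of a mean, which shrinks only with the square root of the number of cases. The slide does not name this measure; it is brought in here as an illustration, and the standard deviation of 10 mm is an assumed value. The sketch also assumes a simple random sample.

```python
import math

# ASSUMED spread of artefact lengths in the population (mm); illustrative only.
ASSUMED_SD_MM = 10.0


def standard_error(sample_sd, n):
    """Standard error of the mean for a simple random sample of size n."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sample_sd / math.sqrt(n)


# Each step quadruples the number of cases measured,
# but only halves the uncertainty of the mean.
for n in (25, 100, 400, 1600):
    print(f"n = {n:5d}  standard error = {standard_error(ASSUMED_SD_MM, n):.2f} mm")
```

Quadrupling the fieldwork halves the uncertainty, so beyond some point the extra cases cost more than they are worth for the idea under test.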
11. Spreadsheets (e.g. Excel) and databases (e.g. Access)
- End point of data collection is often a matrix or table: a column for each variable, a row for each case
- Often convenient to enter these into a spreadsheet or database (linkable, searchable)
- These can store, check, transform, calculate, apply conditions, select, test statistically, output to statpack
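The case-by-variable table can be sketched in a few lines of Python instead of a spreadsheet. The column names and values below are hypothetical, not course data; the point is only the layout (one row per case, one column per variable) and the store, transform, select and calculate steps.

```python
import csv
import io

# Hypothetical data matrix: a row for each case, a column for each variable.
RAW_TABLE = """site,artefact_id,length_mm,material
A,1,42.5,chert
A,2,38.0,quartz
B,3,51.2,chert
B,4,47.9,chert
"""


def load_cases(text):
    """Read the table into a list of dicts, one dict per case."""
    cases = []
    for row in csv.DictReader(io.StringIO(text)):
        row["length_mm"] = float(row["length_mm"])  # transform: text to number
        cases.append(row)
    return cases


cases = load_cases(RAW_TABLE)

# Select: apply a condition to pick out the cases from site B.
site_b_lengths = [c["length_mm"] for c in cases if c["site"] == "B"]

# Calculate: a simple summary of the selected cases.
mean_b = sum(site_b_lengths) / len(site_b_lengths)
print(f"{len(cases)} cases loaded; mean length at site B = {mean_b:.2f} mm")
```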
12. Study design: experiment versus observation
- How do we define? Variously, but element of control often the key
- For practical, ethical etc. reasons, experiments rare in our subjects
- But experimental design important
- Dependent variable: response variable under study
- Independent variable: explanatory variable or factor
13. Contexts and confounds
- Treatment: a combination of specific conditions (levels of experimental factors)
- Extraneous variables: ones not being studied but which may influence dependent variable, thus part of relevant context
- Effects of different (independent or extraneous) variables are said to be confounded if they cannot be distinguished
- Good study design requires data on context
14. Observational studies: the risks of confounding
- Well designed experiments minimize confounding by appropriate choice of variables, cases and treatments; random sequence of treatments; random allocation of cases to groups
- Observational surveys lack this control
- Groups may be self-selected
- Differences in groups may have causes other than the variables under study
- But much can be done despite limitations
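Random allocation of cases to groups, mentioned above as one of the controls an experiment offers, can be sketched as follows. The case identifiers, the number of groups and the seed are all arbitrary choices for the example; the seed is there only so the allocation can be reproduced and checked.

```python
import random


def allocate(case_ids, n_groups=2, seed=None):
    """Shuffle the cases, then deal them out in turn to n_groups groups."""
    rng = random.Random(seed)       # private generator, reproducible if seeded
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]


# Ten hypothetical cases split into two treatment groups of five.
groups = allocate(range(1, 11), n_groups=2, seed=2126)
for number, members in enumerate(groups, start=1):
    print(f"group {number}: {sorted(members)}")
```

Because chance alone decides membership, the groups cannot be self-selected, which is exactly the protection an observational survey lacks.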
15. Examples of presentation
- Even the simplest forms of stating findings numerically (percentages, averages), and the simplest graphical presentations, emphasize selected aspects
- This can be legitimate; can also be misleading: much depends on honesty and clarity with which procedure is described
- What as a percentage of what?
- Please bring in examples yourselves
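The question "what as a percentage of what?" matters because the same count gives very different percentages against different bases. A minimal sketch, using invented counts of decorated sherds:

```python
# Hypothetical counts, invented for illustration.
decorated_at_site = 30        # decorated sherds found at one site
all_sherds_at_site = 120      # all sherds found at that site
decorated_in_region = 200     # decorated sherds found across the region


def percentage(part, whole):
    """Express part as a percentage of whole; the base must be stated."""
    if whole == 0:
        raise ValueError("the base of a percentage cannot be zero")
    return 100.0 * part / whole


share_of_site = percentage(decorated_at_site, all_sherds_at_site)
share_of_region = percentage(decorated_at_site, decorated_in_region)

print(f"{share_of_site:.0f}% of this site's sherds are decorated")
print(f"{share_of_region:.0f}% of the region's decorated sherds come from this site")
```

Both statements are true of the same 30 sherds; quoting either figure without its base invites misreading.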
17. Value of examining data visually first
- Even if you will eventually do sophisticated statistical testing
- Start clear and simple
- This familiarizes the researcher with the characteristics of the data set
- At the end of the process, it also helps the researcher to communicate the patterns found
19. Graphical displays should
- Show the data
- Lead viewer to think about content, not graphic technology itself
- Avoid distorting data
- Present much info concisely and coherently
- Encourage eye to compare
- Show both overview and detail
- Serve clear purpose
- Be integrated with text and/or numerical descriptions
20. Graphical depictions include
- line graphs
- bar charts
- histograms
- pie charts
- stem-and-leaf plots
- scatterplots
- An important principle is data density
21. Line graphs
- Usually used to plot a variable against time (on horizontal axis)
- Shows seasons and trends
- Does the graph have linear scales? A zero?
- Different scales give different impressions, e.g. non-zero base to vertical axis, unequal units, log scale
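The effect of a non-zero base can be put in numbers by comparing how far each plotted point stands above the bottom of the axis. This is a rough sketch with invented values (100 rising to 110) and an invented axis base of 95; it measures only plotted height, not everything that shapes a viewer's impression.

```python
def apparent_change(first, last, axis_base=0.0):
    """Relative change in plotted height above the base of the vertical axis."""
    if first <= axis_base:
        raise ValueError("first value must lie above the axis base")
    return (last - first) / (first - axis_base)


# Hypothetical series rising from 100 to 110, a 10% increase in the data.
honest = apparent_change(100.0, 110.0, axis_base=0.0)
truncated = apparent_change(100.0, 110.0, axis_base=95.0)

print(f"axis from 0:  plotted height grows by {honest:.0%}")
print(f"axis from 95: plotted height grows by {truncated:.0%}")
```

With the axis starting at 95, the last point stands three times as high above the base as the first, although the data rose by only a tenth.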
23. Bar charts and histograms
- Bar charts compare the values of different variables, often categorical
- Histograms display frequency or relative frequency distributions of one variable at a time
- Width of histogram bars has meaning
- Eyes respond to impressions of area: symbols, unequal widths, pseudo-3D can give a misleading impression
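A histogram's frequencies come from counting cases in intervals of the variable, which is why the bar widths carry meaning. The sketch below uses equal-width intervals and invented artefact lengths; the starting point, width and number of bins are choices the analyst has to make and report.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values falling in n_bins equal intervals [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:      # values outside the range are ignored
            counts[index] += 1
    return counts


# Hypothetical artefact lengths in mm.
lengths_mm = [31, 34, 38, 41, 42, 45, 47, 48, 52, 55, 59, 63]
counts = histogram_counts(lengths_mm, start=30, width=10, n_bins=4)
relative = [count / len(lengths_mm) for count in counts]

# Crude text histogram: one '#' per case in each interval.
for k, (count, share) in enumerate(zip(counts, relative)):
    low = 30 + 10 * k
    print(f"{low}-{low + 9} mm | {'#' * count:<5} {count:2d}  ({share:.0%})")
```

Because every interval here is 10 mm wide, bar height and bar area tell the same story; with unequal widths they would not.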
26. Pie charts
- Circular symbols divided into sections according to divisions of a category into sub-categories
- Pies sometimes also vary in size
- And sometimes are presented in pseudo-3D manner
- Will return to pie charts a bit later
28. Stem-and-leaf plots
- A simple way of showing the pattern in a set of numbers
- Truncate the numbers at an appropriate point, write out the truncated numbers in an even, systematic way (forming stem)
- For each number, add the amount left next to the truncated one (forming leaf)
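The stem-and-leaf recipe above can be sketched in code. This version assumes non-negative whole numbers, truncates at the tens digit (stem) and keeps the units digit as the leaf; the values are invented. Empty stems are kept so that the stems are written out evenly and gaps in the data stay visible.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: [leaves]} with stem = tens and above, leaf = units digit."""
    if not values:
        return {}
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # truncate at the tens; remainder is the leaf
        leaves[stem].append(leaf)
    lowest, highest = min(leaves), max(leaves)
    # Write out every stem in the range, including those with no leaves.
    return {stem: leaves.get(stem, []) for stem in range(lowest, highest + 1)}


# Hypothetical artefact lengths in mm.
lengths_mm = [31, 34, 38, 41, 42, 45, 47, 48, 52, 55, 59, 73]
plot = stem_and_leaf(lengths_mm)

for stem, stem_leaves in plot.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in stem_leaves)}")
```

Read sideways, the rows of leaves form a histogram that still shows every individual value.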
30. Scatterplots
- Show the distributions of two variables at once (i.e. bivariate data)
- If one variable is independent and one dependent, independent goes on horizontal axis
- Essence of any relationship between them is apparent visually: +ve or -ve, strong or weak, simple or complex
- This can affect future statistical testing
32. Graphics can also mislead or even deliberately deceive
- The principle should be: show data variation, not design variation
- Perception depends on individual, experience, context
- Perception of circle area grows more slowly than its actual physical area
- Computer packages increase ease of making both good and bad images
33. So
- Beware irregular scales
- Beware symbols with changing area or volume, including pie charts
- Beware pseudo-3-dimensional depictions
- Beware excessively busy images which distract attention from the information to be conveyed
34. Ed Tufte (1983) on pie charts
- "The only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities in spatial disarray both within and between pies ... Given their low data-density and failure to order numbers along a visual dimension, pie charts should never be used."
36. Lie factor (effect in graphic / effect in data) = 14.8 (should not be less than 0.95 or greater than 1.05)
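The lie factor is simple arithmetic once the two effects are expressed as relative changes. The slide's image did not survive, so the numbers below are an assumption: they are the figures from Tufte's widely reproduced fuel-economy example (a line drawn 0.6 inches long for 18 miles per gallon and 5.3 inches long for 27.5 miles per gallon), which yields the same 14.8.

```python
def relative_change(first, last):
    """Size of an effect as a fraction of its starting value."""
    return (last - first) / first


def lie_factor(graphic_first, graphic_last, data_first, data_last):
    """Effect shown in the graphic divided by the effect present in the data."""
    return (relative_change(graphic_first, graphic_last)
            / relative_change(data_first, data_last))


def is_acceptable(factor, low=0.95, high=1.05):
    """Apply the tolerance quoted on the slide."""
    return low <= factor <= high


# ASSUMED to be the slide's example: Tufte's fuel-economy graphic.
factor = lie_factor(graphic_first=0.6, graphic_last=5.3,
                    data_first=18.0, data_last=27.5)

print(f"effect in graphic = {relative_change(0.6, 5.3):.0%}")
print(f"effect in data    = {relative_change(18.0, 27.5):.0%}")
print(f"lie factor        = {factor:.1f}, acceptable: {is_acceptable(factor)}")
```

A graphic drawn to scale would give a factor of 1; here the picture shows a change almost fifteen times larger than the data contain.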
38. 5 different vertical scales, 2 different horizontal scales; lie factor = 15.1
40. Lie factor of 2.8, plus effects of perspective and horizontal spacing
41. The pie chart problem