Title: Dealing with Data
1Dealing with Data
- Coding
- Descriptive Statistics
- Measures of Central Tendency
- Measures of Variability
2Dealing with data
- Analysis of quantitative data is a complex field
of knowledge
- Analysis starts from coding and cleaning data
- Coding: reorganizing raw data into a format that
is machine readable (easy to analyze using
computers)
3Coding
- Can be a simple clerical task when the data are
recorded as numbers on well-organized recording
sheets
- Can be difficult when a researcher wants to code
answers to open-ended survey questions
6Open-ended questions
- Open-ended questions are questions that encourage
people to talk about whatever is important to
them. They are the opposite of closed-ended
questions, which typically require a simple brief
response such as yes or no.
- Open-ended questions invite others to tell their
story in their own words.
7Closed-ended vs. Open-ended
- Did you have a good relationship with your
parents? (yes/no)
- Tell me about your relationship with your parents.
8Codebook
- A set of rules stating that certain numbers are
assigned to variable attributes
- A codebook is a document describing the coding and
the location of data variables in a format that
computers can use
- For example, a researcher codes males as 1 and
females as 2
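A codebook like this can be sketched as a small lookup table. The following is only an illustration: the sex codes (male = 1, female = 2) come from the slide, while the "employed" variable, its codes, the sample records, and the function name code_record are assumptions made up for the example.

```python
# Hypothetical codebook: variable name -> {answer label: numeric code}.
# Only the sex codes (male = 1, female = 2) come from the slide;
# the "employed" variable and its codes are illustrative assumptions.
CODEBOOK = {
    "sex": {"male": 1, "female": 2},
    "employed": {"no": 0, "yes": 1},
}


def code_record(raw, codebook=CODEBOOK):
    """Translate one raw record (text answers) into numeric codes.

    Answers that are not listed in the codebook are coded as None (missing).
    """
    coded = {}
    for variable, answer in raw.items():
        codes = codebook[variable]
        coded[variable] = codes.get(answer.strip().lower())
    return coded


# Made-up raw answers as they might appear on recording sheets.
raw_records = [
    {"sex": "Male", "employed": "yes"},
    {"sex": "female", "employed": "No"},
    {"sex": "femal", "employed": "yes"},  # typo: not in the codebook
]

coded_records = [code_record(record) for record in raw_records]
print(coded_records)
```

Coding unknown answers as None rather than guessing keeps entry errors visible for the cleaning step.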
11Computer file
13The first thing to do
- Descriptive analysis
- Possible Outliers/Entry Errors/Missing cases
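A first descriptive pass of this kind might look like the sketch below. It assumes a single coded variable where 1 = male and 2 = female, with None standing for a missing answer; the data values are invented for illustration.

```python
from collections import Counter

# Hypothetical coded column for sex (1 = male, 2 = female); None = no answer.
sex_codes = [1, 2, 2, 1, None, 3, 2, 1, 22, 2]
VALID_CODES = {1, 2}

# Frequency table of every value that was entered.
frequencies = Counter(sex_codes)

# Missing cases are counted separately and removed from the table.
missing = frequencies.pop(None, 0)

# Any remaining code that the codebook does not define is a likely entry error.
invalid = {code: n for code, n in frequencies.items() if code not in VALID_CODES}

print("frequencies:", dict(frequencies))
print("missing cases:", missing)
print("codes outside the codebook:", invalid)
```

A plain frequency table is often enough to spot impossible codes such as 3 or 22 before any further analysis.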
16Outliers
- A scatterplot can show any outliers in the data
set
17Outliers
- "Rare" event syndrome. Another reason for
outliers is the "rare" event syndrome--extreme
observations that for some legitimate reason do
not fit within the typical range of other data
values. Such unusual observations might include:
- a 70 degree day in January in Oregon
- a 500 point rise/drop in a stock market index
- an unusually high score on an aggressiveness
scale for a troubled child
- All these events may be quite unusual, but
they're still part of the overall picture
18What Should You Do About Them?
- Effectively working with outliers in numerical
data can be a rather difficult and frustrating
experience
- Neither ignoring them nor deleting them at will is
a good solution
- If you do nothing, you will end up with a model
that describes essentially none of the
data--neither the bulk of the data nor the
outliers
19What Should You Do About Them?
- Transformation
- Deletion
- Accommodation
20Transformation
- Transforming data is one way to soften the impact
of outliers, since the most commonly used
expressions, square roots and logarithms, shrink
larger values to a much greater extent than they
shrink smaller values
- However, transformations may not fit into the
theory of the model or they may affect its
interpretation. Taking the log of a variable does
more than make a distribution less skewed; it
changes the relationship between the original
variable and the other variables in your model.
In addition, the most commonly used transformations
require non-negative data (square root) or data
greater than zero (logarithm), so they are not
always the answer
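The shrinking effect can be seen on a small example. This is a sketch with invented scores, where 400 plays the role of the outlier; spread_ratio is a hypothetical helper (largest value divided by the median) used only to gauge how far the outlier sits from the bulk of the data.

```python
import math

# Hypothetical positive scores; 400 plays the role of the outlier.
values = [4, 9, 16, 25, 400]

# The two transformations named on the slide.
sqrt_values = [math.sqrt(v) for v in values]
log_values = [math.log10(v) for v in values]


def spread_ratio(xs):
    """Largest value divided by the middle value of an odd-length list."""
    ordered = sorted(xs)
    return ordered[-1] / ordered[len(ordered) // 2]


print(spread_ratio(values))       # raw scale
print(spread_ratio(sqrt_values))  # after the square root
print(spread_ratio(log_values))   # after the base-10 logarithm
```

Note that math.sqrt raises an error for negative numbers and math.log10 for zero or negative numbers, which is the restriction the slide mentions.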
21Deletion
- Only as a last resort should you delete outliers,
and then only if you find they are legitimate
errors that can't be corrected, or lie so far
outside the range of the remainder of the data
that they distort statistical inferences. When in
doubt, you can report model results both with and
without outliers to see how much they change
22Accommodation
- One very effective plan is to use methods that
are robust in the presence of outliers
- Nonparametric statistical methods fit into this
category and should be more widely applied to
continuous or interval data
23Descriptive Statistics
- Describe numerical data
- Can be categorized by the number of variables
involved
- Univariate
- Bivariate
- Multivariate
24Univariate statistics (males and females)
25Males only
26Females only
27Using graphs
- A graph is a visual representation of a
relationship between, but not restricted to, two
variables
- A graph generally takes the form of a
two-dimensional figure
- Although three-dimensional graphs are
available, they are usually considered too
complex to understand
28What is a graph?
- A graph commonly consists of two axes called the
x-axis (horizontal) and y-axis (vertical)
- Each axis corresponds to one variable.
- The axes are labeled with different names
- The place where the two axes intersect is called
the origin. The origin is also identified as the
point (0,0).
29Parts of a graph
30A good graph
- accurately shows the facts
- grabs the reader's attention
- has a title and labels
- is simple and uncluttered
- clearly shows any trends or differences in the
data
- is visually accurate (i.e., if one chart value is
15 and another 30, then 30 should appear to be
twice the size of 15).
31Why use graphs when presenting data?
- Graphs
- are quick and direct
- highlight the most important facts
- facilitate understanding of the data
- can convince readers
- can be easily remembered
32When is it not appropriate to use a graph?
- The data are very dispersed
Division of votes for the major political parties
in a federal election, Anytowne
33When is it not appropriate to use a graph?
- There are too few data (one, two or three data
points)
Figure 12. Number of students enrolled
in Greenfield Secondary School
34When is it not appropriate to use a graph?
- The data are very numerous
Figure 13. Number of
students taking English as a second language at
West High School,
35When is it not appropriate to use a graph?
- The data show little or no variation
Figure 14. Number of young adults who exercise at least
once weekly, by age, 1996 to 2002
36Types of graphs
- Histograms
- Bar charts
- Pie charts
- Dot charts
- Line graphs
- Scatterplots
37Histogram
- A histogram is the graphical version of a table
which shows what proportion of cases fall into
each of several or many specified categories of
one variable
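The table behind a histogram can be sketched in a few lines. The ages and the 10-year category width below are invented for illustration; the text bars are only a rough stand-in for the columns of a real chart.

```python
from collections import Counter

# Hypothetical ages, grouped into 10-year categories (bins).
ages = [5, 10, 20, 25, 26, 27, 30, 30, 33, 34, 35, 40, 50, 55]
BIN_WIDTH = 10

# Count how many cases fall into each category, keyed by the bin's lower edge.
bin_counts = Counter((age // BIN_WIDTH) * BIN_WIDTH for age in ages)

# Convert counts to the proportion of all cases in each category.
proportions = {
    start: count / len(ages) for start, count in sorted(bin_counts.items())
}

# Print one text bar per category, with its share of the cases.
for start, share in proportions.items():
    label = f"{start:>2}-{start + BIN_WIDTH - 1:<2}"
    print(f"{label} {'#' * round(share * 50)} {share:.0%}")
```

Because every case falls into exactly one category, the proportions add up to 1.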
40Histographs
- A histograph, or frequency polygon, is a graph
formed by joining the midpoints of histogram
column tops
- These graphs are used only when depicting data
from the continuous variables shown on a
histogram
- A histograph smooths out the abrupt changes that
may appear in a histogram, and is useful for
demonstrating continuity of the variable being
studied
41Distribution of salaries for the Acme Corporation
42Bar charts
- A bar chart is used to graphically summarize and
display the differences between groups of data
(or several variables)
45Disadvantage of vertical bar graph
- One disadvantage of vertical bar graphs is that
they lack space for text labeling at the foot of
each bar
- When category labels in the graph are too long,
you might find a horizontal bar graph better for
displaying information
46Horizontal bar graphs
- The horizontal bar graph uses the y-axis
(vertical line) for labeling
- There is more room to fit text labels for
categorical variables on the y-axis.
47A double or group horizontal bar graph
- Similar to a double or group vertical bar graph;
it is used when the labels are too long
to fit on the x-axis
48Stacked bar graphs
- The stacked bar graph is a preliminary data
analysis tool used to show segments of totals
- The stacked bar graph can be very difficult to
analyze if too many items are in each stack
- It can contrast values, but not necessarily in
the simplest manner
49Example
- Triathlon, percentage of time spent on each
event, by competitor
50A split bar graph
- Is a better choice for displaying information
than a double pie chart
- The key point in preparing this type of graph is
to ensure that you are using the same scale for
both sides of the bar graph
Earnings in Utopia, by sex
51Pie Charts
- A pie chart is a circle graph divided into
pieces, each displaying the size of some related
piece of information
- Pie charts are used to display the sizes of parts
that make up some whole.
53Example
- The pie chart below shows the ingredients used to
make a sausage and mushroom pizza, with the fraction
of each ingredient by weight
- Note that the sum of the decimal sizes of all the
slices is equal to 1 (the "whole" pizza)
54Area Chart
55Dot graphs
- One of the simplest ways to represent information
pictorially
56Line graphs
- Line graphs are more popular than all other
graphs combined because their visual
characteristics reveal data trends clearly and
these graphs are easy to create
- Line graphs, especially useful in the fields of
statistics and science, are one of the most
common tools used to present data
57Line graphs
- A line graph shows how two variables are related
by drawing a continuous line between all the
points on a grid
59Using correct scale
- When drawing a line, it is important that you use
the correct scale. Otherwise, the line's shape
can give readers the wrong impression about the
data
Number of guilty crime offenders, Grishamville
60Scatterplots
- In science, the scatterplot is widely used to
present measurements of two or more related
variables
- It is particularly useful when the values of
the variable on the y-axis are thought to depend on
the values of the variable on the x-axis (usually an
independent variable).
61Scatterplots
- Car ownership in Anytowne, by household income
62Scattered data points
63Data widely spread
64Measures of Central Tendency
- Measure of the center of the frequency
distribution
- Mean
- Median
- Mode
65Mean
- The mean of a list of numbers is also called the
average. It is found by adding all the numbers in
the list and dividing by the number of numbers in
the list.
- Example: Find the mean of 3, 6, 11, and 8.
- We add all the numbers, and divide by the number
of numbers in the list, which is 4.
- (3 + 6 + 11 + 8) / 4 = 28 / 4 = 7
- So the mean of these four numbers is 7.
66Mean
- The mean is strongly affected by a change in extreme
values
- Example: 3, 6, 11, 8, and 50
- Mean = 78 / 5 = 15.6
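The two mean calculations can be checked with Python's standard statistics module. This is a minimal sketch using the numbers from the slides; the variable names are arbitrary.

```python
from statistics import mean

# The four numbers from the first example.
scores = [3, 6, 11, 8]

# The same list with one extreme value added.
with_extreme = scores + [50]

# Sum of the values divided by how many values there are.
print(mean(scores))
print(mean(with_extreme))

# How far a single extreme score pulls the mean upward.
shift = mean(with_extreme) - mean(scores)
print(shift)
```

One added score more than doubles the mean here, which is why the mean is usually reported together with the median when extreme values are present.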
67Median
- Is the middle point
- It is also the 50th percentile, or the point at
which half the cases are above it and half below
it
- The median of a list of numbers is found by
ordering them from least to greatest
- If the list has an odd number of numbers, the
middle number in this ordering is the median
- If there is an even number of numbers, the median
is the sum of the two middle numbers, divided by 2
68Median
- Example
- The students in Bjorn's class have the following
ages: 29, 4, 3, 4, 11, 16, 14, 17, 3. Find the
median of their ages. Placed in order, the ages
are 3, 3, 4, 4, 11, 14, 16, 17, 29
- Median = 11
69Median
- The students in Bjorn's class have the following
ages: 4, 29, 4, 3, 4, 11, 16, 14, 17, 3
- Find the median of their ages. Placed in order,
the ages are 3, 3, 4, 4, 4, 11, 14, 16, 17, 29
- The number of ages is 10, so the middle numbers
are 4 and 11, which are the 5th and 6th entries
on the ordered list. The median is the average of
these two numbers
- (4 + 11)/2 = 15/2 = 7.5
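The odd and even cases can be written out as a short function. This is a sketch of the rule as stated on the slides, applied to the two class lists; the function and variable names are arbitrary.

```python
def median(values):
    """Middle value of the ordered list; mean of the two middle values
    when the list has an even number of entries."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle entry exists.
        return ordered[middle]
    # Even count: average the two entries around the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_class = [29, 4, 3, 4, 11, 16, 14, 17, 3]       # 9 ages
even_class = [4, 29, 4, 3, 4, 11, 16, 14, 17, 3]   # 10 ages

print(median(odd_class))
print(median(even_class))
```

Python's standard library offers the same calculation as statistics.median.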
70Mode
- The mode in a list of numbers is the number that
occurs most often, if there is one.
- Example: The students in Bjorn's class have the
following ages: 5, 9, 1, 3, 4, 6, 6, 6, 7, 3
- Find the mode of their ages
- The most common number to appear on the list is
6, which appears three times.
- The mode of their ages is 6.
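Counting occurrences is enough to find the mode. The sketch below uses the ages from the example; treating a tie for first place as "no mode" is an assumption made here to reflect the phrase "if there is one", not a rule stated on the slide.

```python
from collections import Counter


def mode(values):
    """Most frequent value, or None when there is no single winner."""
    counts = Counter(values).most_common()
    if not counts:
        return None  # empty list: nothing to report
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie for first place (assumed to mean "no mode")
    return counts[0][0]


ages = [5, 9, 1, 3, 4, 6, 6, 6, 7, 3]
print(mode(ages))
```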
71Measures of Variation
- Another characteristic of a distribution
- Spread, dispersion, or variability around the
center
- Two distributions can have identical measures of
central tendency but differ in their spread about
the center
72Example
- Seven people are at the bus stop in front of a
bar
- Their ages are 25, 26, 27, 30, 33, 34, 35
- Both the median and the mean are 30
- At a bus stop in front of an ice-cream store,
seven people have an identical median and mean, but
their ages are 5, 10, 20, 30, 40, 50, 55
- The ages in the second group are spread more from
the center, or the distribution of ages has more
variability
73Variability
- In city X, the median and mean family income is
25,000 and it has zero variation (every family
in this city has an income of exactly 25,000)
- City Y has roughly the same mean family income
(0.95 x 12,000 + 0.05 x 300,000 = 26,400), but 95
percent of its families have an income of 12,000
per year and 5 percent have incomes of 300,000
per year, so its median is only 12,000
- City X has perfect income equality, while there
is great inequality in city Y.
74Measures of Variation
- Range
- Percentiles
- Standard Deviation
75Range
- It is the distance between the largest and
smallest scores
- In our examples with people at the bus stop:
- Range 1: 35 - 25 = 10
- Range 2: 55 - 5 = 50
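The range of each bus-stop group can be computed directly from the ages listed in the earlier example. This is a minimal sketch; value_range is a hypothetical helper name.

```python
# Ages of the two bus-stop groups from the earlier example.
bar_ages = [25, 26, 27, 30, 33, 34, 35]
ice_cream_ages = [5, 10, 20, 30, 40, 50, 55]


def value_range(scores):
    """Return (smallest, largest, largest - smallest)."""
    lowest, highest = min(scores), max(scores)
    return lowest, highest, highest - lowest


print(value_range(bar_ages))
print(value_range(ice_cream_ages))
```

Because the range uses only the two most extreme scores, a single unusual case can change it a great deal.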
76Percentiles
- Tells the score at a specific place within the
distribution
- Median is the 50th percentile
- 25th and 75th percentiles are often used
- 25th percentile is the score at which 25 percent
of the distribution have either that score or a
lower one
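One way to turn this definition into a calculation is the nearest-rank method, sketched below on the ice-cream bus-stop ages. This is an assumption about the method: statistical packages often interpolate between scores instead and can give slightly different answers.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest score such that at least
    p percent of the scores are equal to it or lower."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be greater than 0 and at most 100")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based position in the list
    return ordered[rank - 1]


ice_cream_ages = [5, 10, 20, 30, 40, 50, 55]

for p in (25, 50, 75):
    print(p, percentile(ice_cream_ages, p))
```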
77Standard Deviation (SD)
- It is based on the mean and gives an average
distance between all scores and the mean
- People rarely compute SD by hand
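The steps a computer follows can be sketched as below, using the two bus-stop groups. The sketch assumes the population form of the SD (dividing by the number of scores); the slides do not say which form is intended, and the sample form divides by one less than the number of scores.

```python
import math
from statistics import pstdev


def population_sd(scores):
    """Square root of the average squared distance from the mean."""
    m = sum(scores) / len(scores)
    squared_distances = [(x - m) ** 2 for x in scores]
    return math.sqrt(sum(squared_distances) / len(scores))


bar_ages = [25, 26, 27, 30, 33, 34, 35]
ice_cream_ages = [5, 10, 20, 30, 40, 50, 55]

print(round(population_sd(bar_ages), 2))
print(round(population_sd(ice_cream_ages), 2))

# The standard library gives the same population SD.
print(round(pstdev(bar_ages), 2))
```

Both groups have a mean of 30, but the second group's SD is several times larger, which matches the wider spread of its ages.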
78Results with two variables
- Bivariate relationship
- First step: seeing the relationship
- The scattergram is a graph with points plotted on a
coordinate plane
- Correlation (association)
79Correlation
- A correlation is a single number that describes
the degree of relationship between two variables
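The slides do not name a particular coefficient; the sketch below assumes the widely used Pearson correlation, which runs from -1 (perfect negative relationship) through 0 (no linear relationship) to +1 (perfect positive relationship). The study-hours data are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with 2+ values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How the two variables move together around their means.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return co_variation / (spread_x * spread_y)


# Hypothetical data: hours studied and test score for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 70, 68, 80]

print(round(pearson_r(hours, scores), 2))
```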
80Example
82Bivariate Tables
- A contingency table is formed by cross-tabulating
two or more variables
- Constructing percentaged tables
- Usually computers do that
- We need to learn how to read them
- The row and column percentages let a researcher
address different questions
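The difference between row and column percentages can be sketched with a small cross-tabulation. The survey data below (sex by a yes/no answer) are invented for illustration, and the variable names are arbitrary.

```python
from collections import Counter

# Hypothetical survey: one (sex, answer) pair per respondent.
responses = (
    [("male", "yes")] * 3
    + [("male", "no")] * 1
    + [("female", "yes")] * 3
    + [("female", "no")] * 3
)

# Cell counts of the contingency table, plus row and column totals.
cell_counts = Counter(responses)
row_totals = Counter(sex for sex, _ in responses)
col_totals = Counter(answer for _, answer in responses)

# Row percentages: within each sex, what share gave each answer?
row_pct = {cell: 100 * n / row_totals[cell[0]] for cell, n in cell_counts.items()}

# Column percentages: within each answer, what share is each sex?
col_pct = {cell: 100 * n / col_totals[cell[1]] for cell, n in cell_counts.items()}

print("Percent of males who said yes:", row_pct[("male", "yes")])
print("Percent of yes answers given by males:", col_pct[("male", "yes")])
```

The same cell (males who said yes) gives 75 percent when read by row and 50 percent when read by column, so the two percentages answer different questions.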