Dealing with Data - PowerPoint PPT Presentation

About This Presentation
Title:

Dealing with Data

Description:

Dealing with Data Coding Descriptive Statistics Measures of Central Tendency Measures of Variability Dealing with data Analysis of quantitative data is a complex ... – PowerPoint PPT presentation

Slides: 84
Provided by: ArinaGe4

Transcript and Presenter's Notes

Title: Dealing with Data


1
Dealing with Data
  1. Coding
  2. Descriptive Statistics
  3. Measures of Central Tendency
  4. Measures of Variability

2
Dealing with data
  • Analysis of quantitative data is a complex field
    of knowledge
  • Analysis starts from coding and cleaning data
  • Coding: reorganizing raw data into a format that
    is machine readable (easy to analyze using
    computers)

3
Coding
  • Can be a simple clerical task when the data are
    recorded as numbers on well-organized recording
    sheets
  • Can be difficult when a researcher wants to code
    answers to open-ended survey questions

4
(No Transcript)
5
(No Transcript)
6
Open-ended questions
  • Open-ended questions are questions that encourage
    people to talk about whatever is important to
    them. They are the opposite of closed-ended
    questions that typically require a simple brief
    response such as yes or no.
  • Open-ended questions invite others to tell their
    story in their own words.

7
Closed-ended vs. Open-ended
  • Did you have a good relationship with your
    parents? (yes/no)
  • Tell me about your relationship with your parents.


8
Codebook
  • A set of rules stating which numbers are
    assigned to variable attributes
  • A codebook is a document describing the coding
    and the location of data variables in a format
    that computers can use
  • For example, a researcher codes males as 1 and
    females as 2
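A codebook of this kind can be sketched as a small lookup table. This is only an illustration, not the presentation's own file: the variable name "sex", the helper code_response, and the missing-data code 9 are assumptions. Only the codes 1 (male) and 2 (female) come from the slide.

```python
# Hypothetical codebook: maps each variable to its numeric codes.
# Only male = 1 and female = 2 come from the slide; the missing-data
# code 9 is an assumption added for illustration.
CODEBOOK = {
    "sex": {"male": 1, "female": 2, "missing": 9},
}


def code_response(variable, raw_answer):
    """Return the numeric code for a raw answer, or the missing code."""
    codes = CODEBOOK[variable]
    key = raw_answer.strip().lower()
    return codes.get(key, codes["missing"])


raw_answers = ["Male", "female", " FEMALE ", ""]
coded = [code_response("sex", answer) for answer in raw_answers]
print(coded)  # [1, 2, 2, 9]
```

Keeping the rules in one table means the same coding is applied to every case, and unexpected answers are flagged rather than silently guessed.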

9
(No Transcript)
10
(No Transcript)
11
Computer file
12
(No Transcript)
13
The first thing to do
  • Descriptive analysis
  • Possible Outliers/Entry Errors/Missing cases

14
(No Transcript)
15
(No Transcript)
16
Outliers
  • A scatterplot can show any outliers in the data
    set

17
Outliers
  • "Rare" event syndrome. Another reason for
    outliers is the "rare" event syndrome--extreme
    observations that for some legitimate reason do
    not fit within the typical range of other data
    values. Such unusual observations might include:
  • a 70 degree day in January in Oregon
  • a 500 point rise/drop in a stock market index
  • an unusually high score on an aggressiveness
    scale for a troubled child
  • All these events may be quite unusual, but
    they're still part of the overall picture

18
What Should You Do About Them?
  • Effectively working with outliers in numerical
    data can be a rather difficult and frustrating
    experience
  • Neither ignoring them nor deleting them at will
    is a good solution
  • If you do nothing, you will end up with a model
    that describes essentially none of the
    data--neither the bulk of the data nor the
    outliers

19
What Should You Do About Them?
  • Transformation
  • Deletion
  • Accommodation

20
Transformation
  • Transforming data is one way to soften the impact
    of outliers since the most commonly used
    expressions, square roots and logarithms, shrink
    larger values to a much greater extent than they
    shrink smaller values
  • However, transformations may not fit into the
    theory of the model or they may affect its
    interpretation. Taking the log of a variable does
    more than make a distribution less skewed; it
    changes the relationship between the original
    variable and the other variables in your model.
    In addition, the most commonly used
    transformations require non-negative data (square
    roots) or data greater than zero (logarithms), so
    they are not always the answer
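The shrinking effect can be seen with a few numbers. This is a minimal sketch; the values below are illustrative and do not come from the slides, and base-10 logs are an arbitrary choice.

```python
import math

values = [1, 10, 100, 1000]  # illustrative data (assumed)

# Print each value next to its square root and base-10 logarithm.
for v in values:
    print(v, round(math.sqrt(v), 2), round(math.log10(v), 2))

# Distance between the largest and smallest value on each scale.
raw_spread = max(values) - min(values)
sqrt_spread = math.sqrt(max(values)) - math.sqrt(min(values))
log_spread = math.log10(max(values)) - math.log10(min(values))
print(raw_spread, round(sqrt_spread, 2), round(log_spread, 2))

# Note: math.log10 raises ValueError for zero or negative input,
# and math.sqrt raises ValueError for negative input.
```

The largest value sits 999 units from the smallest on the raw scale, but only about 30.6 units away after a square root and 3 units away after a log.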

21
Deletion
  • Only as a last resort should you delete outliers,
    and then only if you find they are genuine
    errors that can't be corrected, or lie so far
    outside the range of the remainder of the data
    that they distort statistical inferences. When in
    doubt, you can report model results both with and
    without outliers to show how much they change

22
Accommodation
  • One very effective plan is to use methods that
    are robust in the presence of outliers
  • Nonparametric statistical methods fit into this
    category and should be more widely applied to
    continuous or interval data
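As a small illustration of robustness, the median (a rank-based statistic, used here as a simple stand-in for the nonparametric methods the slide mentions) barely reacts to an extreme value, while the mean does. The scores are assumed for illustration only.

```python
from statistics import mean, median

typical = [12, 14, 15, 15, 17, 18]   # illustrative scores (assumed)
with_outlier = typical + [95]        # same scores plus one extreme value

mean_without, mean_with = mean(typical), mean(with_outlier)
median_without, median_with = median(typical), median(with_outlier)

print("mean:  ", round(mean_without, 2), "->", round(mean_with, 2))
print("median:", median_without, "->", median_with)
```

Here the mean moves from about 15.2 to about 26.6, while the median stays at 15.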

23
Descriptive Statistics
  • Describe numerical data
  • Can be categorized by the number of the variables
    involved
  • Univariate
  • Bivariate
  • Multivariate

24
Univariate statistics (males and females)

25
Males only
26
Females only
27
Using graphs
  • A graph is a visual representation of a
    relationship between two or more variables
  • A graph generally takes the form of a
    two-dimensional figure
  • Although three-dimensional graphs are
    available, they are usually considered too
    complex to understand

28
What is a graph?
  • A graph commonly consists of two axes called the
    x-axis (horizontal) and y-axis (vertical)
  • Each axis corresponds to one variable.
  • The axes are labeled with different names
  • The place where the two axes intersect is called
    the origin. The origin is also identified as the
    point (0,0).

29
Parts of a graph
30
A good graph
  • accurately shows the facts
  • grabs the reader's attention
  • has a title and labels
  • is simple and uncluttered
  • clearly shows any trends or differences in the
    data
  • is visually accurate (i.e., if one chart value is
    15 and another 30, then 30 should appear to be
    twice the size of 15).

31
Why use graphs when presenting data?
  • Graphs
  • are quick and direct
  • highlight the most important facts
  • facilitate understanding of the data
  • can convince readers
  • can be easily remembered

32
When is it not appropriate to use a graph?
  • The data are very dispersed
    (Figure: Division of votes for the major political
    parties in a federal election, Anytowne)

33
When is it not appropriate to use a graph?
  • There are too few data (one, two or three data
    points)
    (Figure 12. Number of students enrolled in
    Greenfield Secondary School)

34
When is it not appropriate to use a graph?
  • the data are very numerous Figure 13. Number of
    students taking English as a second language at
    West High School,

35
When is it not appropriate to use a graph?
  • The data show little or no variation
    (Figure 14. Number of young adults who exercise at
    least once weekly, by age, 1996 to 2002)

36
Types of graphs
  • Histograms
  • Bar charts
  • Pie charts
  • Dot charts
  • Line graphs
  • Scatterplots

37
Histogram
  • A histogram is the graphical version of a table
    which shows what proportion of cases fall into
    each of several or many specified categories of
    one variable
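The table behind a histogram can be sketched in a few lines. The ages are the nine values used in the median example later in this presentation; the bin width of 10 is an assumption made for illustration.

```python
from collections import Counter

ages = [3, 3, 4, 4, 11, 14, 16, 17, 29]
BIN_WIDTH = 10  # assumed bin width

# Assign each case to the lower edge of its bin (0, 10, 20, ...).
counts = Counter((age // BIN_WIDTH) * BIN_WIDTH for age in ages)
proportions = {low: n / len(ages) for low, n in sorted(counts.items())}

# Print a text histogram: one row per bin, one '#' per case.
for low, share in proportions.items():
    label = f"{low}-{low + BIN_WIDTH - 1}"
    print(f"{label:>6} | {'#' * counts[low]} ({share:.0%})")
```

Each row shows one category of the variable and the proportion of cases that fall into it; the proportions add up to 1.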

38
(No Transcript)
39
(No Transcript)
40
Histographs
  • A histograph, or frequency polygon, is a graph
    formed by joining the midpoints of histogram
    column tops
  • These graphs are used only when depicting data
    from the continuous variables shown on a
    histogram
  • A histograph smoothes out the abrupt changes that
    may appear in a histogram, and is useful for
    demonstrating continuity of the variable being
    studied

41
Distribution of salaries for the Acme Corporation
42
Bar charts
  • A bar chart is used to graphically summarize and
    display the differences between groups of data
    (or several variables)

43
(No Transcript)
44
(No Transcript)
45
Disadvantage of vertical bar graph
  • One disadvantage of vertical bar graphs is that
    they lack space for text labeling at the foot of
    each bar
  • When category labels in the graph are too long,
    you might find a horizontal bar graph better for
    displaying information

46
Horizontal bar graphs
  • The horizontal bar graph uses the y-axis
    (vertical line) for labeling
  • There is more room to fit text labels for
    categorical variables on the y-axis.

47
A double or group horizontal bar graph
  • Similar to a double or group vertical bar graph;
    it is used when the labels are too long to fit
    on the x-axis

48
Stacked bar graphs
  • The stacked bar graph is a preliminary data
    analysis tool used to show segments of totals
  • The stacked bar graph can be very difficult to
    analyze if too many items are in each stack
  • It can contrast values, but not necessarily in
    the simplest manner

49
Example
  • Triathlon, percentage of time spent on each
    event, by competitor

50
A split bar graph
  • Is a better choice for displaying information
    than a double pie chart
  • The key point in preparing this type of graph is
    to ensure that you are using the same scale for
    both sides of the bar graph

Earnings in Utopia, by sex
51
Pie Charts
  • A pie chart is a circle graph divided into
    pieces, each displaying the size of some related
    piece of information
  • Pie charts are used to display the sizes of parts
    that make up some whole.

52
(No Transcript)
53
Example
  • The pie chart below shows the ingredients used to
    make a sausage and mushroom pizza, with the
    fraction of each ingredient by weight
  • Note that the sum of the decimal sizes of each
    slice is equal to 1 (the "whole" pizza)

54
Area Chart
55
Dot graphs
  • One of the simplest ways to represent
    information pictorially

56
Line graphs
  • Line graphs are more popular than all other
    graphs combined because their visual
    characteristics reveal data trends clearly and
    these graphs are easy to create
  • Line graphs, especially useful in the fields of
    statistics and science, are one of the most
    common tools used to present data

57
Line graphs
  • A line graph shows how two variables are related
    by drawing a continuous line between all the
    points on a grid

58
(No Transcript)
59
Using correct scale
  • When drawing a line, it is important that you use
    the correct scale. Otherwise, the line's shape
    can give readers the wrong impression about the
    data

Number of guilty crime offenders, Grishamville
60
Scatterplots
  • In science, the scatterplot is widely used to
    present measurements of two or more related
    variables
  • It is particularly useful when the variable on
    the y-axis is thought to depend on the values of
    the variable on the x-axis (usually an
    independent variable).

61
Scatterplots
  • Car ownership in Anytowne, by household income

62
Scattered data points
63
Data widely spread
64
Measures of Central Tendency
  • Measure of the center of the frequency
    distribution
  • Mean
  • Median
  • Mode

65
Mean
  • The mean of a list of numbers is also called the
    average. It is found by adding all the numbers in
    the list and dividing by the number of numbers in
    the list.
  • Example: Find the mean of 3, 6, 11, and 8.
  • We add all the numbers, and divide by the number
    of numbers in the list, which is 4.
  • (3 + 6 + 11 + 8) / 4 = 28 / 4 = 7
  • So the mean of these four numbers is 7.

66
Mean
  • The mean is strongly affected by extreme values
  • Example: 3, 6, 11, 8, and 50
  • Mean = (3 + 6 + 11 + 8 + 50) / 5 = 15.6
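The two mean calculations can be checked with a short function. This is a minimal sketch of the add-and-divide rule; the function name is illustrative.

```python
def mean(numbers):
    """Add all the numbers and divide by how many there are."""
    return sum(numbers) / len(numbers)


mean_small = mean([3, 6, 11, 8])          # 28 / 4
mean_extreme = mean([3, 6, 11, 8, 50])    # 78 / 5

print(mean_small)    # 7.0
print(mean_extreme)  # 15.6
```

Adding the single value 50 more than doubles the mean, even though the other four numbers are unchanged.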

67
Median
  • Is the middle point
  • It is also the 50th percentile, or the point at
    which half the cases are above it and half below
    it
  • The median of a list of numbers is found by
    ordering them from least to greatest
  • If the list has an odd number of numbers, the
    middle number in this ordering is the median
  • If there is an even number of numbers, the median
    is the sum of the two middle numbers, divided by 2

68
Median
  • Example
  • The students in Bjorn's class have the following
    ages: 29, 4, 3, 4, 11, 16, 14, 17, 3. Find the
    median of their ages. Placed in order, the ages
    are 3, 3, 4, 4, 11, 14, 16, 17, 29
  • Median = 11

69
Median
  • The students in Bjorn's class have the following
    ages: 4, 29, 4, 3, 4, 11, 16, 14, 17, 3
  • Find the median of their ages. Placed in order,
    the ages are 3, 3, 4, 4, 4, 11, 14, 16, 17, 29
  • The number of ages is 10, so the middle numbers
    are 4 and 11, which are the 5th and 6th entries
    on the ordered list. The median is the average of
    these two numbers
  • (4 + 11)/2 = 15/2 = 7.5
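The odd and even cases of the median rule can be sketched as one function and run on both class lists from the examples. The function name is illustrative.

```python
def median(numbers):
    """Order the numbers, then take the middle one (odd count) or
    the average of the two middle ones (even count)."""
    ordered = sorted(numbers)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


median_odd = median([29, 4, 3, 4, 11, 16, 14, 17, 3])
median_even = median([4, 29, 4, 3, 4, 11, 16, 14, 17, 3])

print(median_odd)   # 11
print(median_even)  # 7.5
```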

70
Mode
  • The mode in a list of numbers is the number that
    occurs most often, if there is one.
  • Example: The students in Bjorn's class have the
    following ages: 5, 9, 1, 3, 4, 6, 6, 6, 7, 3
  • Find the mode of their ages
  • The most common number to appear on the list is
    6, which appears three times.
  • The mode of their ages is 6.
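Counting how often each value occurs gives the mode directly. In this sketch the helper name modes is illustrative, and returning every tied value is a design choice, since a list can have more than one most frequent number.

```python
from collections import Counter


def modes(numbers):
    """Return every value tied for the highest count, in sorted order."""
    counts = Counter(numbers)
    top = max(counts.values())
    return sorted(value for value, n in counts.items() if n == top)


ages = [5, 9, 1, 3, 4, 6, 6, 6, 7, 3]
print(modes(ages))  # [6]
```

If every value appears only once, the function returns the whole list, which is a sign that the data have no useful mode.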

71
Measures of Variation
  • Another characteristic of a distribution
  • Spread, dispersion, or variability around the
    center
  • Two distributions can have identical measures of
    central tendency but differ in their spread about
    the center

72
Example
  • Seven people are at the bus stop in front of a
    bar
  • Their ages are 25, 26, 27, 30, 33, 34, 35
  • Both the median and the mean are 30
  • At a bus stop in front of an ice-cream store,
    seven people have the identical median and mean,
    but their ages are 5, 10, 20, 30, 40, 50, 55
  • The ages in the second group are spread further
    from the center; that is, the distribution of
    ages has more variability

73
Variability
  • In city X, the median and mean family income is
    25,000 and it has zero variation (every family
    in this city has income of exactly 25,000)
  • City Y has a similar mean family income (about
    26,400), but 95 percent of its families have
    income of 12,000 per year and 5 percent have
    incomes of 300,000 per year, so its median is
    only 12,000
  • City X has perfect income equality, while there
    is great inequality in city Y.
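The two cities can be compared by working the figures through directly. This sketch assumes 100 hypothetical families per city, split 95/5 in City Y as stated above.

```python
from statistics import mean, median

# 100 hypothetical families per city, built from the stated percentages.
city_x = [25_000] * 100
city_y = [12_000] * 95 + [300_000] * 5

# Report the center (mean, median) and the spread (max - min) per city.
for name, incomes in (("X", city_x), ("Y", city_y)):
    spread = max(incomes) - min(incomes)
    print(name, mean(incomes), median(incomes), spread)
```

Under these assumptions City Y's mean comes out at 26,400 and its median at 12,000, and the gap between its richest and poorest families is 288,000 compared with 0 in City X.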

74
Measures of Variation
  • Range
  • Percentiles
  • Standard Deviation

75
Range
  • The distance between the largest and smallest
    scores
  • In our examples with people at the bus stop:
  • Range 1 = 35 - 25 = 10
  • Range 2 = 55 - 5 = 50
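The subtraction for the two bus-stop groups can be sketched as follows. The function name value_range is illustrative.

```python
bar_ages = [25, 26, 27, 30, 33, 34, 35]
ice_cream_ages = [5, 10, 20, 30, 40, 50, 55]


def value_range(scores):
    """Largest score minus smallest score."""
    return max(scores) - min(scores)


range_bar = value_range(bar_ages)
range_ice_cream = value_range(ice_cream_ages)

print(range_bar)        # 10
print(range_ice_cream)  # 50
```

Both groups have a mean and median of 30, so the range is what distinguishes them here.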

76
Percentiles
  • A percentile tells the score at a specific place
    within the distribution
  • Median is the 50th percentile
  • 25th and 75th percentiles are often used
  • 25th percentile is the score at which 25 percent
    of the distribution have either that score or a
    lower one
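One way to turn this definition into a calculation is the nearest-rank rule: take the smallest score that has at least the given percentage of cases at or below it. This is a sketch under that assumption; statistical packages often interpolate between scores instead and can give slightly different values. The ages are those of the ice-cream store group from the earlier example.

```python
import math


def percentile(scores, p):
    """Nearest-rank percentile for 0 < p <= 100: the smallest score
    with at least p percent of cases at or below it."""
    ordered = sorted(scores)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


ages = [5, 10, 20, 30, 40, 50, 55]
p25, p50, p75 = (percentile(ages, p) for p in (25, 50, 75))
print(p25, p50, p75)  # 10 30 50
```

The 50th percentile (30) matches the median of this group.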

77
Standard Deviation (SD)
  • It is based on the mean and gives an "average
    distance" between all scores and the mean
  • People rarely compute SD by hand
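For reference, the calculation a computer performs can be sketched as follows. Strictly, the SD is the square root of the average squared distance from the mean. This sketch assumes the population form (dividing by n); many packages report the sample form (dividing by n - 1), which is slightly larger. The ages are the two bus-stop groups from the earlier example.

```python
import math


def standard_deviation(scores):
    """Population SD: square root of the mean squared distance
    between each score and the mean."""
    m = sum(scores) / len(scores)
    squared_distances = [(x - m) ** 2 for x in scores]
    return math.sqrt(sum(squared_distances) / len(scores))


sd_bar = standard_deviation([25, 26, 27, 30, 33, 34, 35])
sd_ice_cream = standard_deviation([5, 10, 20, 30, 40, 50, 55])

print(round(sd_bar, 2), round(sd_ice_cream, 2))
```

The second group's SD is several times larger, which matches the wider spread of its ages around the shared mean of 30.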

78
Results with two variables
  • Bivariate relationship
  • First step: seeing the relationship
  • A scattergram is a graph with points plotted on
    a coordinate plane
  • Correlation (association)

79
Correlation
  • A correlation is a single number that describes
    the degree of relationship between two variables
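The slide does not say which correlation coefficient is meant. A common choice is Pearson's r, sketched here as an assumption, with illustrative data that do not come from the presentation. It ranges from -1 (perfect negative relationship) through 0 (no linear relationship) to +1 (perfect positive relationship).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by the
    product of their individual variations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)


hours = [1, 2, 3, 4, 5]      # illustrative data (assumed)
scores = [2, 4, 6, 8, 10]    # rises exactly in step with hours

r_positive = pearson_r(hours, scores)
r_negative = pearson_r(hours, scores[::-1])
print(r_positive, r_negative)
```

With these made-up values the first pair is perfectly positively related and the reversed pair perfectly negatively related.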

80
Example
81
(No Transcript)
82
Bivariate Tables
  • A contingency table is formed by cross-tabulating
    two or more variables
  • Constructing percentaged tables
  • Usually computers do that
  • We need to learn how to read them
  • The row and column percentages let a researcher
    address different questions
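The difference between row and column percentages can be sketched with a small cross-tabulation. The survey records and the helper names below are illustrative assumptions, not data from the presentation.

```python
from collections import Counter

# Illustrative survey records (sex, answer); not data from the slides.
records = (
    [("male", "yes")] * 3 + [("male", "no")] * 1
    + [("female", "yes")] * 2 + [("female", "no")] * 4
)

cells = Counter(records)                                 # table cells
row_totals = Counter(sex for sex, _ in records)          # per sex
col_totals = Counter(answer for _, answer in records)    # per answer


def row_percent(sex, answer):
    """Of all respondents of this sex, what share gave this answer?"""
    return 100 * cells[(sex, answer)] / row_totals[sex]


def col_percent(sex, answer):
    """Of all respondents giving this answer, what share are this sex?"""
    return 100 * cells[(sex, answer)] / col_totals[answer]


print(row_percent("male", "yes"))  # 75.0
print(col_percent("male", "yes"))  # 60.0
```

The same cell answers two different questions: 75 percent of the men said yes, while 60 percent of those who said yes were men.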

83
(No Transcript)