Looking at real data - PowerPoint PPT Presentation

Provided by: AgaW3

Transcript and Presenter's Notes

Title: Looking at real data


1
Looking at real data
  • Lecture: Statistics of time series

2
Introduction
  • Statistics uses data to gain understanding. Understanding
    comes from combining knowledge of the background of the
    data with the ability to use graphs and calculations to
    see what the data tell us.
  • The numbers in a medical study, for example, mean little
    without some knowledge of the goals of the study and of
    what blood pressure, heart rate, and other measurements
    contribute to those goals.
  • On the other hand, measurements from the study's several
    hundred subjects are of little value to even the most
    knowledgeable medical expert until the tools of
    statistics organize, display, and summarize them.
  • We begin our study of statistics by mastering the art of
    examining data.

3
Individuals and Variables
  • Any set of data contains information about some group of
    individuals. The information is organized in variables.
  • Individuals are the objects described by a set of data.
    Individuals may be people, animals, or things.
  • A variable is any characteristic of an individual. A
    variable can take different values for different
    individuals.
  • Example: A student database includes data about every
    currently enrolled student. The students are the
    individuals described by the data set. For each
    individual, the data contain the values of variables
    such as date of birth, gender (male or female), choice
    of major, and grade point average.
  • In practice, any set of data is accompanied by background
    information that helps us understand the data.

4
Individuals and Variables
5
Statistical study: the main questions
  • Why? What purpose do the data have? Do we hope to answer
    some specific questions? Do we want to draw conclusions
    about individuals other than those for whom we actually
    have data?
  • Who? What individuals do the data describe? How many
    individuals appear in the data?
  • What? How many variables do the data contain? What are
    the exact definitions of these variables? In what units
    of measurement is each variable recorded? Weights, for
    example, might be recorded in pounds or in kilograms.

6
Categorical and Quantitative Variables
  • Some variables, like gender and college major, simply
    place individuals into categories. Others, like height
    and grade point average, take numerical values for which
    we can do arithmetic.
  • A categorical variable places an individual into one of
    several groups or categories.
  • A quantitative variable takes numerical values for which
    arithmetic operations such as adding and averaging make
    sense.
  • The distribution of a variable tells us what values it
    takes and how often it takes these values.
  • Most statistical software (e.g., Microsoft Excel) uses
    the following format for entering data: each row is an
    individual, each column is a variable.
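
The row-and-column layout can be sketched in a few lines of Python. This is only an illustration: the student names, majors, and grade point averages are invented, and the helper `column` is a hypothetical name, not part of any library.

```python
# Hypothetical student records: each row (dict) is one individual,
# each key is a variable. All names and values are made up.
students = [
    {"name": "Ana",   "major": "Physics", "gpa": 3.6},
    {"name": "Ben",   "major": "History", "gpa": 3.1},
    {"name": "Chloe", "major": "Physics", "gpa": 3.9},
    {"name": "Dev",   "major": "Biology", "gpa": 2.8},
]

def column(rows, variable):
    """Return the values of one variable (one column) for all individuals."""
    return [row[variable] for row in rows]

majors = column(students, "major")   # categorical: labels only
gpas = column(students, "gpa")       # quantitative: arithmetic makes sense

mean_gpa = sum(gpas) / len(gpas)     # averaging is meaningful for GPA
print(majors)
print(round(mean_gpa, 2))
```

Averaging the `major` column would be meaningless, which is the practical difference between a categorical and a quantitative variable.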

7
Displaying Distributions with Graphs
  • Statistical tools and ideas help us examine data in order
    to describe their main features. This examination is
    called exploratory data analysis. Like an explorer
    crossing unknown lands, we want first to simply describe
    what we see.
  • Two basic strategies help us organize our exploration of
    a set of data:
  • Begin by examining each variable by itself. Then move on
    to study the relationships among the variables.
  • Begin with a graph or graphs. Then add numerical
    summaries of specific aspects of the data.

8
Graphs for categorical variables
  • The distribution of a categorical variable lists the
    categories and gives either the count or the percent of
    individuals in each category.
  • Example question: How well educated are 30-something
    young adults? Here is the distribution of the highest
    level of education for people aged 25 to 34.
  • excel1.xls
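
As a rough sketch of how such a distribution is tabulated, the Python snippet below counts categories and converts the counts to percents. The ten education levels are invented for illustration; they are not the data from excel1.xls, and `distribution` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical highest education level for ten people aged 25 to 34.
# These values are invented; they are not the data from excel1.xls.
education = [
    "High school", "Bachelor's", "Some college", "High school",
    "Bachelor's", "Advanced degree", "High school", "Some college",
    "Less than high school", "Bachelor's",
]

def distribution(values):
    """Map each category to a (count, percent) pair."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

dist = distribution(education)
for category, (count, percent) in dist.items():
    print(f"{category:<22} {count:>2} {percent:5.1f}%")
```

The counts are what a bar graph would show as bar heights, and the percents are what a pie chart would show as slices.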

9
Graphs for categorical variables: Bar graph
10
Graphs for categorical variables: Pie chart of education data
11
Examining distributions
  • Making a statistical graph is not an end in itself. The
    purpose of the graph is to help us understand the data.
    After you make a graph, always ask, "What do I see?"
    Once you have displayed a distribution, you can see its
    important features as follows:
  • In any graph of data, look for the overall pattern and
    for striking deviations from that pattern.
  • You can describe the overall pattern of a distribution by
    its shape, center, and spread.
  • An important kind of deviation is an outlier, an
    individual value that falls outside the overall pattern.

12
Histogram
  • A histogram breaks the range of values of a variable into
    intervals and displays only the count or percent of the
    observations that fall into each interval.
  • You can choose any convenient number of intervals, but
    you should always choose intervals of equal width.
  • Histograms are slow to construct by hand and do not
    display the actual values observed. The construction of
    a histogram is described in the next example.

13
Histogram
14
Histogram
  • To make a histogram of the given data (see the table on
    the previous slide), proceed as follows:
  • Divide the range of the data into classes of equal width.
    Our data range from 0.6 to 38. Be sure to specify the
    classes precisely so that each individual falls into
    exactly one class. A state with 5% Hispanic residents
    would fall into the first class, but a state with 5.1%
    falls into the second.
  • Count the number of individuals in each class. These
    counts are called frequencies, and a table of
    frequencies for all classes is a frequency table.
  • Draw the histogram. First mark the scale for the variable
    whose distribution you are displaying on the horizontal
    axis. That is the percent of adults who are Hispanic.
    The vertical axis contains the scale of counts. The base
    of each bar covers a class, and the bar height is the
    class count. There is no horizontal space between the
    bars unless a class is empty, so that its bar has height
    zero.
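
The steps above can be sketched in Python. This is a minimal illustration under stated assumptions: a class width of 5 with classes of the form 0 < x <= 5, 5 < x <= 10, and so on (consistent with 5 falling in the first class and 5.1 in the second), and invented data values. Only the range 0.6 to 38 comes from the slide; `frequency_table` is a hypothetical helper name.

```python
import math

def frequency_table(values, width, n_classes):
    """Count values in right-closed classes (0, w], (w, 2w], ...

    Assumes every value is greater than 0 and at most width * n_classes.
    """
    counts = [0] * n_classes
    for x in values:
        index = math.ceil(x / width) - 1   # 5.0 -> class 0, 5.1 -> class 1
        counts[index] += 1
    return counts

# Hypothetical percents; only the range 0.6 to 38 comes from the slide.
percents = [0.6, 2.4, 5.0, 5.1, 9.9, 12.3, 25.7, 38.0]
counts = frequency_table(percents, width=5, n_classes=8)

# Text version of the histogram: one row per class, one star per individual.
for i, count in enumerate(counts):
    low, high = i * 5, (i + 1) * 5
    print(f"{low:>2} < x <= {high:>2} | {'*' * count} ({count})")
```

Empty classes still get a row, which corresponds to a bar of height zero in the drawn histogram.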

15
Histogram
  • Large sets of data are often reported in the form of
    frequency tables when it is not practical to publish the
    individual observations.
  • In addition to the frequency (count) for each class, the
    fraction or percent of the observations that fall in
    each class can be reported. These fractions are
    sometimes called relative frequencies.
  • Use histograms of relative frequencies for comparing
    several distributions with different numbers of
    observations.
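
A short Python sketch shows why relative frequencies make distributions of different sizes comparable. The class counts below are invented, and `relative_frequencies` is a hypothetical helper name.

```python
def relative_frequencies(counts):
    """Convert class counts to fractions of the total number of observations."""
    total = sum(counts)
    return [count / total for count in counts]

# Hypothetical class counts for two groups of different size.
small_group = [2, 5, 3]        # 10 observations
large_group = [40, 100, 60]    # 200 observations

rel_small = relative_frequencies(small_group)
rel_large = relative_frequencies(large_group)
print(rel_small)
print(rel_large)
```

The raw counts differ by a factor of 20, yet the relative frequencies coincide, so histograms drawn on the relative scale would have the same shape.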

16
Histogram
17
Histogram
18
Time plot
  • A time plot of a variable plots each observation against
    the time at which it was measured. Always put time on
    the horizontal scale of your plot and the variable you
    are measuring on the vertical scale. Connecting the data
    points by lines helps emphasize any change over time.
  • Whenever data are collected over time, it is a good idea
    to plot the observations in time order. Summaries of the
    distribution of a variable that ignore time order, such
    as a histogram, can be misleading when there is
    systematic change over time.
  • Many interesting data sets are time series, measurements
    of a variable taken at regular intervals over time.
    Government economic and social data are often published
    as time series.
  • Time plots can reveal the main features of a time series.
    We look first for overall patterns and then for striking
    deviations from those patterns.

19
Time plot
20
Seasonal Variation and Trend
  • A pattern in a time series that repeats itself at
    known regular intervals of time is called
    seasonal variation.
  • A trend in a time series is a persistent,
    long-term rise or fall.

21
Trend
22
Seasonally adjusted
  • Because many economic time series show strong seasonal
    variation, this variation is often adjusted for before
    the data are released or analyzed further. The data are
    then said to be seasonally adjusted.
  • Seasonal adjustment helps avoid misinterpretation.
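
The slides do not say how the adjustment is computed. As one crude illustration only, the Python sketch below estimates an additive seasonal effect for each season (that season's average minus the overall average) and subtracts it. This is an assumed simplification, not the procedure statistical agencies use; the quarterly sales figures are invented, and `seasonally_adjust` is a hypothetical helper name.

```python
def seasonally_adjust(series, period):
    """Subtract each season's average deviation from the overall mean.

    A crude additive adjustment; assumes len(series) is a multiple of period.
    """
    overall = sum(series) / len(series)
    effects = []
    for season in range(period):
        values = series[season::period]          # all observations of one season
        effects.append(sum(values) / len(values) - overall)
    return [x - effects[i % period] for i, x in enumerate(series)]

# Hypothetical quarterly sales for three years: a yearly rise of 4
# plus a repeating seasonal pattern of (-3, 0, +5, -2).
pattern = [-3, 0, 5, -2]
sales = [100 + 4 * (i // 4) + pattern[i % 4] for i in range(12)]
adjusted = seasonally_adjust(sales, period=4)
print(sales)
print(adjusted)
```

In this constructed example the adjusted series keeps the year-to-year rise and loses the repeating quarterly swings, which is the intent of seasonal adjustment.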

23
Seasonality
24
Seasonally adjusted
25
Describing Distributions with Numbers: measuring center (the mean)
  • To find the mean of a set of observations, add their
    values and divide by the number of observations. If the
    n observations are $x_1, x_2, \dots, x_n$, their mean is
    $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
26
Measuring center: the median
  • The median M is the midpoint of a distribution, the
    number such that half the observations are smaller and
    the other half are larger. To find the median of a set
    of observations:
  • Arrange all observations in order of size, from smallest
    to largest.
  • If the number of observations n is odd, the median M is
    the center observation in the ordered list. Find the
    location of the median by counting (n+1)/2 observations
    up from the bottom of the list.
  • If the number of observations n is even, the median M is
    the mean of the two center observations in the ordered
    list. The location of the median is again (n+1)/2 from
    the bottom of the list, which now falls halfway between
    the two center observations.
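
The rule can be written out directly in Python. This is a sketch for illustration; the function name `median` and the sample values are chosen here, and the standard library's `statistics.median` gives the same result.

```python
def median(observations):
    """Median by the rule on the slide: sort, then look at position (n + 1) / 2."""
    ordered = sorted(observations)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a whole number (1-based),
        # which is index n // 2 in 0-based Python indexing.
        return ordered[middle]
    # Even n: (n + 1) / 2 falls halfway between two positions,
    # so average the two center observations.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([7, 1, 5]))        # odd n: center observation
print(median([7, 1, 5, 3]))     # even n: mean of the two center observations
```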

27
Mean versus median
  • The median and mean are the most common measures of the
    center of a distribution.
  • The mean and median of a symmetric distribution are close
    together. If the distribution is exactly symmetric, the
    mean and median are exactly the same.
  • In a skewed distribution, the mean is farther out in the
    long tail than is the median.
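
A small numerical check of this behaviour, using Python's standard `statistics` module and two invented data sets:

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]   # hypothetical data with a long right tail

# Symmetric data: mean and median coincide.
print(mean(symmetric), median(symmetric))
# Skewed data: the mean is pulled toward the long tail, the median is not.
print(mean(right_skewed), median(right_skewed))
```

Changing a single observation from 5 to 50 moves the mean a long way while the median stays put.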

28
Measuring spread: the quartiles
  • To calculate the quartiles Q1 and Q3:
  • Arrange the observations in increasing order and locate
    the median M in the ordered list of observations.
  • The first quartile Q1 is the median of the observations
    whose position in the ordered list is to the left of the
    location of the overall median.
  • The third quartile Q3 is the median of the observations
    whose position in the ordered list is to the right of
    the location of the overall median.

29
The five number summary
  • The five-number summary of a set of observations consists
    of the smallest observation, the first quartile, the
    median, the third quartile, and the largest observation,
    written in order from smallest to largest. In symbols,
    the five-number summary is
  • Minimum, Q1, M, Q3, Maximum
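
A possible Python sketch of the five-number summary follows, taking Q1 and Q3 as the medians of the observations to the left and right of the median's location. The data are invented and `five_number_summary` is a hypothetical helper name. Other software may use different quartile conventions and return slightly different values.

```python
from statistics import median

def five_number_summary(observations):
    """Return (minimum, Q1, M, Q3, maximum) using the median-of-halves rule.

    Assumes at least two observations.
    """
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # positions left of the median location
    upper = ordered[n - half:]        # positions right of the median location
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

data = [13, 2, 8, 21, 5, 3, 1]       # hypothetical observations
print(five_number_summary(data))
```

When n is odd the middle observation is left out of both halves; when n is even the list is simply split in two.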

30
Boxplot
  • A boxplot is a graph of the five-number summary:
  • A central box spans the quartiles Q1 and Q3.
  • A line in the box marks the median.
  • Lines extend from the box out to the smallest and
    largest observations.

31
Boxplot
32
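Measuring spread: the standard deviation
  • The variance of a set of observations is the average of
    the squares of the deviations of the observations from
    their mean. In symbols, the variance of n observations
    $x_1, x_2, \dots, x_n$ is
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
    (the average here uses the divisor n - 1, the usual
    convention for s).
  • The standard deviation s is the square root of the
    variance.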
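
The calculation can be sketched in Python as below. The sketch assumes the divisor n - 1 that is conventionally used with the symbol s (the formula itself did not survive in this transcript). The data are invented and `sample_variance` is a hypothetical helper name. The standard library's `statistics.variance` and `statistics.stdev` use the same divisor.

```python
import math

def sample_variance(observations):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(observations)
    x_bar = sum(observations) / n
    return sum((x - x_bar) ** 2 for x in observations) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical observations
s2 = sample_variance(data)           # variance
s = math.sqrt(s2)                    # standard deviation
print(s2, s)
```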

33
Properties of the standard deviation
  • The standard deviation s measures spread about the mean
    and should be used only when the mean is chosen as the
    measure of center.
  • s = 0 only when there is no spread. This happens only
    when all observations have the same value. Otherwise,
    s > 0. As the observations become more spread out about
    their mean, s gets larger.
  • s, like the mean, is not resistant. A few outliers can
    make s very large.

34
Changing the unit of measurement
  • The same variable can be recorded in different units of
    measurement. Americans commonly record distances in
    miles and temperatures in degrees Fahrenheit, while the
    rest of the world measures distances in kilometers and
    temperatures in degrees Celsius.
  • Fortunately, it is easy to convert numerical descriptions
    of a distribution from one unit of measurement to
    another. This is true because a change in the
    measurement unit is a linear transformation of the
    measurements.

35
Linear transformation
  • A linear transformation changes the original variable x
    into the new variable $x_{new}$ given by an equation of
    the form
    $x_{new} = a + bx$
  • Adding the constant a shifts all values of x upward or
    downward by the same amount. In particular, such a shift
    changes the origin (zero point) of the variable.
    Multiplying by the positive constant b changes the size
    of the unit of measurement.
  • Linear transformations do not change the shape of a
    distribution.
  • Linear transformations do change the parameters of the
    distribution (its center and spread).

36
Effect of a linear transformation
  • To see the effect of a linear transformation on measures
    of center and spread, apply these rules:
  • Multiplying each observation by a positive number b
    multiplies both measures of center (mean and median) and
    the standard deviation by b.
  • Adding the same number a (either positive or negative) to
    each observation adds a to measures of center and to
    quartiles but does not change the standard deviation.
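
These rules can be checked numerically with the Fahrenheit-to-Celsius conversion, which has the form x_new = a + b x with a = -160/9 and b = 5/9. The temperatures below are invented, and `linear_transform` is a hypothetical helper name.

```python
from statistics import mean, median, stdev

def linear_transform(values, a, b):
    """Apply x_new = a + b * x to every observation."""
    return [a + b * x for x in values]

# Hypothetical temperatures in degrees Fahrenheit.
fahrenheit = [32.0, 50.0, 68.0, 86.0, 104.0]
a, b = -160 / 9, 5 / 9            # Celsius = (F - 32) * 5/9 = a + b * F
celsius = linear_transform(fahrenheit, a, b)

# Center: transformed like a single observation (a + b * center).
print(mean(fahrenheit), median(fahrenheit), stdev(fahrenheit))
# Spread: multiplied by b only; the shift a has no effect.
print(mean(celsius), median(celsius), stdev(celsius))
```

Up to floating-point rounding, the Celsius mean and median equal a + b times the Fahrenheit values, while the Celsius standard deviation equals b times the Fahrenheit one.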

37
The normal distributions
38
The standard normal distribution
  • The standard normal distribution is the normal
    distribution N(0,1) with mean 0 and standard deviation 1.
  • If a variable X has any normal distribution
    $N(\mu, \sigma)$ with mean $\mu$ and standard deviation
    $\sigma$, then the standardized variable
    $Z = \frac{X - \mu}{\sigma}$
    has the standard normal distribution.
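
Standardization can be illustrated with Python's `statistics.NormalDist` (available from Python 3.8). The sketch uses the N(2,1) distribution that appears on a later slide. The symbols mu and sigma for the mean and standard deviation, the value x = 3, and the helper `standardize` are assumptions made for this example.

```python
from statistics import NormalDist

def standardize(x, mu, sigma):
    """Return z = (x - mu) / sigma."""
    return (x - mu) / sigma

mu, sigma = 2.0, 1.0                 # the N(2, 1) example
x = 3.0
z = standardize(x, mu, sigma)

p_original = NormalDist(mu, sigma).cdf(x)   # P(X <= 3) for X ~ N(2, 1)
p_standard = NormalDist(0, 1).cdf(z)        # P(Z <= z) for Z ~ N(0, 1)
print(z, p_original, p_standard)
```

The two probabilities agree, which is why a single table of the standard normal distribution is enough for every normal distribution.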

39
Standard normal distribution
40
Non-standard normal distribution N(2,1)