Statistical Review - PowerPoint PPT Presentation

About This Presentation
Title:

Statistical Review

Description:

Otis would like each rail to have a diameter of 1 inch. The company has obtained samples of ten elevator rails from each supplier. ... – PowerPoint PPT presentation

Number of Views:61
Avg rating:3.0/5.0
Slides: 139
Provided by: christoph182
Category:

less

Transcript and Presenter's Notes

Title: Statistical Review


1
Statistical Review
  • Measures of Central Location

2
SALARY.XLS
  • Lists starting salaries for 190 graduates from an
    undergraduate school of business.
  • The data is in the range named Salary on a sheet
    called Data.

3
The Mean
  • We calculate the mean salary by entering the
    formula AVERAGE(Salary) in cell B6 of the
    Excel Functions worksheet.
  • The mean salary is 29,762.
  • The mean in this example is a representative
    measure because the distribution of salaries is
    nearly symmetric.
  • The mean can be misleading due to skewness.

4
The Median
  • The median is the middle observation when the
    data are listed from smallest to largest.
  • If there is an odd number of observations, the
    median is the middle observation.
  • If there is an even number of observations, we
    take the median to be the average of the two
    middle observations.

5
The Median -- continued
  • We calculate the median salary in Example 3.1 by
    entering the formula MEDIAN(Salary) in cell B7
    of the Excel Functions worksheet.
  • The median in this example is 29,850.
  • In this case, the mean and the median values are
    nearly the same because the distribution is
    approximately symmetric.

6
The Median -- continued
  • If the salary distribution were skewed (for
    example, a few graduates received abnormally
    large salaries), the mean would be biased upward
    while the median would not be affected by the
    unusual values.
  • Thus, it is better to use the median in
    characterizing the center of a distribution when
    that distribution is skewed.

7
The Mode
  • The mode is the most frequently occurring value.
  • If the values are essentially continuous, as with
    the salaries in Example 3.1, then the mode is
    essentially irrelevant. There is typically no
    single value that occurs more than once.
  • Thus, the mode is not likely to provide much
    information.

8
  • The Mode

9
SHOES.XLS
  • This file lists shoe sizes purchased at a shoe
    store.
  • We seek to find the best-selling shoe size at
    this store.

10
The Mode
  • The mode is the most frequently occurring value.
  • If the values are essentially continuous, then
    the mode is usually not relevant. There is
    typically no single value that occurs more than
    once.

11
Why is the mode relevant here?
  • Shoe sizes come in discrete increments, rather
    than a continuum so it makes sense to find the
    mode, the size that is requested most often, in
    this example.

12
Computing the Mode in EXCEL
  • The mode can be found in Excel by entering the
    formula MODE(Range of Data).
  • Applying this formula in the present example, we
    find that size 11 is the most frequently
    purchased shoe size.
  • This is also apparent form the histogram on the
    next slide.

13
Distribution of Shoe Sizes
14
  • Measures of Variability Variance and Standard
    Deviation

15
OTIS4.XLS
  • Suppose that Otis Elevator is going to stop
    manufacturing elevator rails. Instead, it is
    going to buy them from an outside supplier.
  • Otis would like each rail to have a diameter of 1
    inch.
  • The company has obtained samples of ten elevator
    rails from each supplier. They are listed in
    columns A and B of this Excel file.

16
Which should Otis prefer?
  • Observe that the mean, median, and mode are all
    exactly 1 inch for each of the two suppliers.
  • Based on these measures, the two suppliers are
    equally good and right on the mark. However, we
    when we consider measures of variability,
    supplier 1 is somewhat better than supplier 2.
    Why?

17
Explanation
  • The reason is that supplier 2s rails exhibit
    more variability about the mean than do supplier
    1s rails.
  • If we want rails to have a diameter of 1 inch,
    then more variability around the mean is very
    undesirable!

18
Variance
  • The most commonly used measures of variability
    are the variance and standard deviation.
  • The variance is essentially the average of the
    squared deviations from the mean.
  • We say essentially because there are two
    versions of the variance the population variance
    and the sample variance.

19
More on the Variance
  • The variance tends to increase when there is more
    variability around the mean.
  • Indeed, large deviations from the mean contribute
    heavily to the variance because they are squared.
  • One consequence of this is that the variance is
    expressed in squared units (squared dollars, for
    example) rather than original units.

20
Standard Deviation
  • A more intuitive measure of variability is the
    standard deviation.
  • The standard deviation is defined to be the
    square root of the variance.
  • Thus, the standard deviation is measured in
    original units, such as dollars, and it is much
    easier to interpret.

21
Computing Variance and Standard Deviation in Excel
  • Excel has built-in functions for computing these
    measures of variability.
  • The sample variances and standard deviations of
    the rail diameters from the suppliers in the
    present example can be found by entering the
    following formulas VAR(Supplier1) in cell E8
    and STDEV(Supplier1) in cell E9.

22
Computing Variances Standard Deviations --
continued
  • Of course, enter similar formulas for supplier 2
    in cells F8 and F9.
  • As we mentioned earlier, it is difficult to
    interpret the variances numerically because they
    are expressed in squared inches, not inches.
  • All we can say is that the variance from supplier
    2 is considerably larger than the variance from
    supplier 1.

23
Interpretation of the Standard Deviation
  • The standard deviations, on the other hand, are
    expressed in inches. The standard deviation for
    supplier 1 is approximately 0.012 inch, and
    supplier 2s standard deviation is approximately
    three times this large.
  • This is quite a disparity. Hence, Otis will
    prefer to buy rails from supplier 1.

24
  • Interpretation of the Standard Deviation Rules
    of Thumb

25
DOW.XLS
  • This file contains monthly closing prices for the
    Dow Jones Index from January 1947 through January
    1993.
  • The monthly returns from the index are also shown
    starting with February 1947. Each return is the
    monthly percentage change (expressed) as a
    decimal) in the index.
  • How well do the rules of thumb work for these
    data?

26
Rules of Thumb
  • Many data sets follow rules of thumb.
  • Approximately 68 of the observations are within
    one standard deviation of the mean.
  • Approximately 95 of the observations are within
    two standard deviations of the mean.
  • Approximately 99.7 - almost all - of the
    observations are within three standard deviations
    of the mean.

27
Index Time Series Plot
  • A time series plot of the index show that the
    index has been increasingly fairly steadily over
    the period.
  • Whenever a series indicates a clear trend such as
    the index does, most of the measures we have been
    discussing are less relevant.
  • For example, the mean of the index for this
    period has at most historical interest. We are
    probably more interested in predicting the future
    of the Dow, and the historical mean has little
    relevance for predicting the future.

28
Time Series Plot of Dow Closing Index
29
Time Series Plot of Dow Returns
30
Return Time Series Plot
  • A time series plot of the returns show no obvious
    trend over the period.
  • The measures we have been discussing are relevant
    in discussing the series of returns, which
    fluctuate around a stable mean.
  • We first calculate the mean and standard
    deviation of the returns by using the Excel
    functions AVERAGE and STDEV in cells B4 and B5.
    See the table on the next slide.

31
Rules of Thumb for Dow Jones Data
32
Returns -- continued
  • The average return is 0.59 and the standard
    deviation of about 3.37.
  • Therefore, the rules of thumb (if they apply)
    imply, for example, that about 2/3 of all returns
    are within the interval 0.59 3.37, that is
    from -2.78 to 3.95.
  • In order to determine if the rules of thumb apply
    to these returns, we can use a frequency table.

33
Creating the Frequency Table
  • We first enter the upper limits of the suitable
    categories in the range A8A15.
  • Any categories can be chosen but it is convenient
    to choose categories in which each breakpoint is
    one standard deviation higher than the previous
    one with the open-ended categories on either end
    are more than 3 standard deviations from the
    mean.
  • Next we use the FREQUENCY function to fill in
    column C. FREQUENCY(Returns,Bins)

34
Frequency Table continued
  • Finally, we use the frequencies in column C to
    calculate the actual percentage of return within
    k standard deviations of the mean for k1, k2
    and k3 and we compare these with the percentages
    from the rules of thumb.
  • The agreement between these percentages is not
    perfect - there are a few more observations
    within one standard deviation of the mean than
    the rule of thumb predicts - but in general the
    rules of thumb work quite well.

35
  • Obtaining Summary Measures with Add-Ins

36
SALARY.XLS
  • Lists starting salaries for 190 graduates from an
    undergraduate school of business.
  • The data is in the range named Salary on a sheet
    called Data.
  • We need to find a set of useful summary measures
    for the salaries.

37
Summary Statistics
  • To find the summary statistics of a set of data
    we can use the Stat-Pro Add-In or Excels
    Analysis ToolPak. In this example we use the
    Stat-Pro Add-In .
  • Begin by placing the cursor anywhere within the
    data range. Then select StatPro/Summary
    Stats/One-Variable Summary Stats menu item.
  • Select all variables you want to summarize, and
    select the summary measures you want to find from
    the Available Summary Measures dialog box shown
    on the next slide.

38
Available Summary Measures
39
About the Measures
  • Four measures are selected by default. These are
    mean, median, Standard Deviation and Count. You
    can override these.
  • A typical output appears here.
  • It includes many of the measures we have
    discussed plus a few more.

40
About the Measures -- continued
  • The mean absolute deviation is similar to the
    variance except that it is an average of the
    absolute (note squared) deviations from the mean.
  • The kurtosis and skewness indicate the relative
    peakedness of the distribution and its skewness.
  • By clicking on any of the cells containing the
    measures (Column B), youll see that StatPro
    provides the formulas for the outputs. (Analysis
    ToolPak does not do so.)

41
About the Measures -- continued
  • The effect of this is that if any of the data
    changes the summary measures we produced change
    automatically.
  • All output is formatted as numerical to three
    decimal places by default. You can reformat them
    in a more appropriate manner if you would like.

42
  • Measures of Association Covariance and
    Correlation

43
EXPENSES.XLS
  • A survey questions members of 100 households
    about their spending habits.
  • The data in this file represent the salary,
    expense for cultural activities, expense for
    sports-related activities, and the expense for
    dining-out for each household over the past year.
  • Do these variables appear to be related linearly?

44
Covariance and Correlation
  • When we need to summarize the relationship
    between two variables we can use the measures
    covariance and correlation. We summarize the type
    of behavior observed in a scatterplot.
  • Each measures the strength (and direction) of a
    linear relationship between two numerical
    variables.
  • The relationship is strong if the points in a
    scatterplot cluster tightly around some straight
    line. If this line rises form left to right then
    the relationship is positive. If it falls from
    left to right then the relationship is negative.

45
Determining Linear Relationships
  • Scatterplots of each variable versus each other
    would provide the answer to the question but six
    scatterplots would be required, one for each
    pair.
  • To get a quick indication of possible linear
    relationships we can use Stat-Proto obtain a
    table of correlations and/or covariances.

46
Table of Correlations and Covariances
  • To get the table, place the cursor anywhere in
    the data set and use the StatPro/Summary
    Stats/Correlations, Covariances menu item and
    proceed in the obvious way.

47
Relationships
  • The only relationships that stand out are the
    positive relationships between salary and
    cultural expenses and between salary and dining
    expenses.
  • The negative relationships are between cultural
    and sports-related expenses.
  • To confirm these graphically we show scatterplots
    of Salary versus Culture and Culture versus
    Sports

48
Scatterplot Indicating Positive Relationship
49
Scatterplot Indicating Negative Relationship
50
Correlation and Covariance Properties
  • In general, the following properties are evident
    from the Table of correlations and covariances.
  • The correlation between a variable and itself is
    1.
  • The correlation between X and Y is the same as
    the correlation between Y and X. Therefore, it is
    sufficient to list the correlations below (or
    above) the diagonal in the table. (The same is
    true for the covariances).
  • The covariance between a variable and itself is
    the variance of the variable. We indicate this in
    the heading of the covariance table.

51
Correlation and Covariance Properties -- continued
  • It is difficult to interpret the magnitudes of
    covariances. These depend on the fact that the
    data are measured in dollars rather than, say,
    thousands of dollars. It is such easier to
    interpret the magnitudes of the correlations
    because they are scaled to be between -1 and 1.

52
  • Describing Data Sets with Boxplots

53
DOW.XLS
  • This file lists the monthly returns on the Dow
    from February 1947 through January 1993.
  • Use a boxplot to summarize the distribution of
    these returns.

54
Boxplots
  • A boxplot is a very useful graphical method for
    summarizing data.
  • Boxplots can be used in two ways either to
    describe a single variable in a data set or to
    compare two (or more) variables.
  • Excel has no boxplot option, but we included this
    option in the StatPro add-in.

55
Creating Boxplots
  • Place the cursor anywhere within the data set,
    use the StatPro/Charts/BoxPlots(s) menu item and
    proceed in the obvious way.
  • Eventually two sheets will be added to your
    workbook. One has a the boxplot chart, while the
    other contains summary measures used to form the
    boxplot.
  • The following slides show the chart and the
    summary measure information.

56
Boxplot Chart
57
Boxplot Summary Measures
58
Keys to Understanding Boxplots
  • The right and left of the box are at the third
    and first quartiles. Therefore, the length of the
    box equals the interquartile range (IQR), and the
    box itself represents the middle 50 of the
    observations. The height of the box has no
    significance.
  • The vertical line inside the box indicates the
    location of the median. The point inside the box
    indicates the location of the mean.

59
Keys to Understanding Boxplots -- continued
  • Horizontal lines are drawn from each side of the
    box. They extend to the most extreme observations
    tat are no farther than 1.5 IQRs from the box.
    They are useful for indicating variability and
    skewness.
  • Observations farther than 1.5 IQRs from the box
    are shown as individual points. If they are
    between 1.5 IRQs and 3 IQRs from the box, they
    are call mild outliers and are hollow. Otherwise,
    they are called extreme outliers and are solid.

60
Dow Returns Boxplot
  • The boxplot for this example summarizes the
    distribution of the returns.
  • It implies that the Dow returns are approximately
    symmetric on each side of the median, although
    the mean is a bit below the median.
  • Also there are a few mild outliers but no extreme
    outliers.

61
  • Describing Data Sets with Boxplots

62
ACTORS.XLS
  • Recall that the salaries of famous actors and
    actresses are listed in this file.
  • Use side-by-side boxplots to compare the salaries
    of male and female actors and actresses.

63
Side-by-Side Boxplots
  • Boxplots are probably most useful for comparing
    two populations graphically. This is done using
    side-by-side boxplots.
  • The data setup for this type of comparison can
    be in one of two forms stacked or unstacked.
  • Data are stacked if there is a code variable
    that designates which category each observation
    is in, and there is a single measurement
    variable that contains the data for both
    categories.

64
Side-by-Side Boxplots -- continued
  • The data are unstacked if there is a separate
    data column for each category.
  • In this example the data are stacked because
    Gender designates the gender associated with each
    observation and Salary is the single measurement
    variable.
  • If the data were unstacked, for example, actors
    salaries would be in one column and actresses
    salaries would be in another.

65
Creating a Side-by-Side Boxplot
  • Since the data are stacked the following must be
    steps must be followed to create the boxplots.
  • Place cursor within the data set and select the
    StatPro/Charts/Boxplot(s) menu item.
  • In the dialog box that opens check the stacked
    option.
  • Then choose Gender as the code variable and
    Salary as the measurement variable.
  • The resulting data in stacked form and the
    side-by-side boxplot appears.

66
Actor Data in Stacked Form
67
Side-by-Side Boxplot Chart
68
Reading the Boxplot
  • It is clear the the female salary box is
    considerably to the left of the male salary box,
    although both have about the same IQR.
  • Each boxplot has three indications that the
    salary distributions are skewed to the right
  • the means are larger than the medians
  • the medians are closer to the left sides of the
    boxes than to the right sides
  • the horizontal lines extend farther to the right
    than to the left of the boxes. However, there are
    no outliers.

69
  • Applying the Tools

70
Background Information
  • The Spring Mills Company produces and distributes
    a wide variety of manufactured goods. Due to its
    variety, it has a a large number of customers.
  • Spring Mills classifies these customers as small,
    medium and large, depending on the volume of
    business each does with them.
  • Recently they have noticed a problem with
    accounts receivable. They are not getting paid by
    their customers in as timely a manner as they
    would like. This obviously costs them money.

71
RECEIVE.XLS
  • Spring Mills has gathered data on 280 customer
    accounts.
  • For each of these accounts the data set lists
    three variables
  • Size, the size of the customer (coded 1 for
    small, 2 for medium, 3 for large)
  • Days, the number of days since the customer was
    billed
  • Amount, the amount the customer owes
  • What information can we obtain from this data?

72
Analysis
  • It is best to start by getting a good sense of
    the data. To do this we
  • Calculate several summary measures for the Days
    and Amount
  • Create a histogram of Amount
  • Create a scatterplot of Amount versus Days
  • From these we determine
  • a positive skewness in the Amount variable (The
    mean is considerably larger than the median and
    the standard deviation of Amount is quite large).
  • The scatterplot suggests some suspicious
    behavior, with two distinct groups of points.

73
Analysis -- continued
  • The next step is to see whether the different
    customer sizes have any effect on either Days,
    Amount, or the relationship between Days and
    Amount.
  • To do this, it is useful to unstack the Days and
    Amount variables - that is to create a new Days
    and Amounts variable for each group of customer
    sizes. For example, the Days and Amount variables
    for customers of size 1 are named Days1 and
    Amount1. This can be accomplished by using
    StatPros Unstack procedure but copying and
    pasting also work.

74
Analysis -- continued
  • Once unstacked we need to calculate summary
    measures and a variety of charts on these
    unstacked variables.
  • The charts include
  • Histograms of Amount for each size customer
  • Boxplots of days owed by different size customers
  • Boxplots of amounts owed by different size
    customers
  • Scatterplots of Amount versus Days for each size
    customer

75
Summary Measures for Combined Data
76
Histogram of All Amounts Owed
77
Scatterplot of Amount versus Days for All
Customers
78
Summary Measures Broken Down by Size
79
Histogram for Small Customers
80
Histogram of Amount for Medium Customers
81
Histogram of Amount for Large Customers
82
Boxplots of Days Owed by Different Size Customers
83
Boxplots of Amounts Owed by Different Size
Customers
84
Scatterplot of Amount versus Days for Small
Companies
85
Scatterplot of Amount versus Days for Medium
Companies
86
Scatterplot of Amount versus Days for Large
Companies
87
Analysis -- continued
  • There is obviously a lot going on here and it is
    evident form the charts. We point out the
    following
  • there are considerably fewer large customers than
    small or medium customers.
  • the large customers tend to owe considerably more
    than small or medium customers.
  • the small customers do not tend to be as long
    overdue as the small and medium customers.
  • there is no relationship between Days and Amount
    for the small customers, but there is a definite
    positive relationship between these variables for
    the medium and large customers.

88
More Analysis
  • We have done the obvious but here is still much
    more we can do.
  • For example, suppose Spring Mills wants a
    breakdown of customers who owe at least 500.
  • We first create a new variable called Large?
    next to the original variables that equals 1 for
    all amounts greater than 500 and equals 0
    otherwise.
  • We do this by entering a formula in D6 and
    copying down.The next slide shows a sample of
    data and the formula to be used.

89
More Analysis - -continued
  • We enter the formula
  • IF(C6gtB3,1,0)
  • in cell D6 and copy it down.
  • We can then use a pivot table to create a count
    of the number of 1s in this new variable for
    each value of the Size variable.

90
Pivot Tables for Counts of Customers Who Owe More
than 500
91
More Analysis -- continued
  • We created the pivot table twice, once showing
    the counts as percentages of each column, and
    once showing them as percentages of each row.
  • One table shows that 73 of all customers with
    amounts less than 500 are small customers.The
    other table shows that 45 of all medium-sized
    customers owe more than 500.
  • This type of analysis is often referred to as
    slicing and dicing the data. They are based on
    the same counts but portray them in different
    ways. Neither way is better they both provide
    useful information.

92
More Analysis -- continued
  • Finally, we investigate the amount of interest
    Spring Mills is losing by the delays in its
    customers payments.
  • We assume that the company can make 12 annual
    interest on excess cash. Then we can create a
    Lost variable for each customer size that
    indicates the amount of interest Spring Mills
    loses on each customer group.

93
More Analysis
  • The formula entered in cell C10 to calculate
    Lost1 is B10A10C7/365. This is the amount
    owed by the number of days owed multiplied by the
    interest rate, divided by the number of days in a
    year. Then we copy this formula down and to the
    columns to calculate Lost2 and Lost3.
  • Then we calculate the sums of these amounts in
    row 5.
  • The next slide shows a sample of data at this
    point.

94
More Analysis -- continued
  • Although Spring Mills is losing more per customer
    from the large customers, it is losing more total
    from the medium-sized customers - because there
    are more of them.

95
More Analysis -- continued
  • This is shown graphically by a pie chart of the
    sums in row 5.

96
Findings
  • If Spring Mills really wants to decrease
    receivables, it might want to target the
    medium-sized customer group, from which it is
    losing the most interest.
  • Or it could target the large customers because
    they owe the most on average.
  • The most appropriate action depends on the cost
    and effectiveness of targeting any particular
    customer group. However, the analysis presented
    here gives the company a much better picture of
    whats currently going on.

97
  • Applying the Tools

98
Background Information
  • The RP Supermarket is open 24 hours a day, 7
    days a week. Lately it has been receiving a lot
    of complaints from customers about excessive
    waiting in line for checking out.
  • RP has decided to investigate this situation by
    gathering data on arrivals, departures, and line
    lengths at the checkout stations.
  • It has collected data in half-hour increments for
    an entire week - 336 observations - starting at 8
    am on Monday and ending at 8 am on the following
    Monday.

99
CHECKOUT.XLS
  • This file includes the data collected on the
    following variables
  • InitialWaiting, the number waiting or being
    checked out at the beginning of a half-hour
    period
  • Arrivals, the number of arrivals to the checkout
    stations during a period
  • Departures, the number of checkout stations open
    during a period
  • Checkers, the number of checkout stations open
    during a period

100
CHECKOUT.XLS -- continued
  • The data set also includes time variables
  • Day, day of week
  • StartTime, clock time at the beginning of each
    half-hour period
  • TimeInterval, a descriptive term for the time of
    day such as Lunch rush for 1130 a.m. to 130
    p.m.
  • Finally the data set includes a calculated
    variable
  • EndWaiting, the number waiting or being checked
    out at the end of a half-hour period This
    variable for any time period equals
    InitialWaiting plus Arrivals minus Departures it
    also equals InitialWaiting for the next period.

101
The Data
102
Analysis Information
  • The manager of RP wants to analyze the data to
    discover any trends, particularly in the pattern
    of arrivals throughout the day or across the
    entire week.
  • Also the store currently uses a
    seat-of-the-pants approach to opening and
    closing checkout stations each half hour. The
    manager would like to see how well the current
    approach is working.
  • Of course, she would love to know the best
    strategy for opening and closing checkout
    stations - but this is beyond her (and our)
    capabilities at this point

103
Analysis
  • Obviously time plays a crucial role in this data,
    so a good place to start is to create one or more
    time series plots.
  • The time series plot of InitialWaiting and
    Arrivals Variables shown on the next slide shows
    that
  • Fridays and Saturdays are the busiest days
  • the time pattern of arrivals is somewhat
    different - more spread out - during the weekends
    than during the weekdays
  • there are fairly regular peak arrival periods
    during the weekdays
  • the number waiting is sometimes as large as 10 or
    20, and the largest of these tend to be around
    peak arrival times

104
Time Series Plot of Initial Waiting and Arrivals
Variables
105
Analysis -- continued
  • A similar time series plot shown below shows
    Arrivals and Departures.

106
Analysis -- continued
  • On this plot it is difficult to separate the two
    time series they are practically on top of each
    other.
  • Perhaps this is not so bad because this tells us
    that the store is checking out customers
    approximately as quickly as they are arriving.
  • A somewhat more efficient way to understand the
    time series behavior is to use pivot tables.

107
Analysis - continued
  • A pivot table for Average InitialWaiting by Hour
    of Day is generated. This table can be seen on
    the next slide.
  • To create this table, we
  • drag the InitialWaiting variable to the Data
    area, express it as an average
  • drag StartTime variable to the Row area
  • drag the Day variable to the Page area
  • Finally, we use the data in the pivot table to
    create a time series plot.

108
Average Initial Waiting by Hour of Day
109
Analysis -- continued
  • Note how the variable in the Page area works.
    Depending on which day we select in the page area
    the corresponding table and time series plot
    changes.
  • Similarly, a pivot table can be created,
    accompanied by a column chart. Average Arrivals
    by TimeInterval of Day can be analyzed with a
    pivot table by dragging Arrival to the Data area,
    expressing it as an average, dragging the
    TimeInterval variable to the Row area, and
    dragging the day variable to the Page area. This
    table is is on the next slide. You can check for
    the patterns on each day.

110
Average Arrivals by Time Interval of Days
111
More Analysis
  • The manager of RP is ultimately interested in
    whether the right number of checkout stations
    are available throughout the day.
  • We can create two scatterplots to provide some
    evidence. The plots follow this slide.
  • The first of these would be a scatterplot of
    Checkers versus TotalCustomers. The
    TotalCustomers variable is calculated as the sum
    of InitialWaiting and the Arrivals to measure the
    total amount of work presented to the checkout
    stations in any half-hour period.

112
Scatterplot of Checkers versus Total Customers
113
Scatterplot of End Waiting versus Checkers
114
More Analysis -- continued
  • From the scatterplot we can see an obvious
    positive relationship between these two
    variables.
  • Evidently management is reacting as they should -
    it is opening more checkout stations when there
    is more traffic.
  • The second scatterplot shows EndWaiting versus
    Checkers. There is again a definite upward trend.
    Periods when more checkout stations are open tend
    to be associated with periods where more
    customers still remain in the checkout process.

115
More Analysis - continued
  • Presumably, management is reacting with more open
    checkout stations in busy periods, but it is not
    reacting strongly enough.
  • Just fiddling with the numbers in the Checker
    column will not solve the manager's problems.

116
Conclusions
  • Two problems exist. First, there is a trade-off
    between the cost of having customers wait in
    line and the cost of paying extra checkout
    people. This is a difficult tradeoff for any
    supermarket manager.
  • Second, the number of departures is clearly
    related to the number of checkout stations open.
    Therefore it doesnt make sense to change the
    numbers in the Checkers column without changing
    the numbers in the Departures (and hence
    InitialWaiting and EndWaiting) column in an
    appropriate way. This is not an easy problem.

117
  • Applying the Tools

118
Background Information
  • The HyTex Company is a direct marketer of
    stereophonic equipment, personal computers and
    other electronic products.
  • HyTex advertises entirely by mailing catalogs to
    its customers, and all of its orders are taken
    over the telephone.
  • The company spends a great deal of money on its
    catalog mailings, and it wants to be sure that
    this is paying off in sales.

119
CATALOGS.XLS
  • This file contains the data that has been
    collected on 1000 customers at the end of the
    current year.
  • For each customer it has data on the following
    variables
  • Age coded as 1 for 30 or younger, 2 for 31 to
    55, 3 for 56 and older.
  • Gender coded as 1 for males, 2 for females
  • OwnHome coded as 1 if customer owns a home, 2
    otherwise
  • Married coded as 1 if customer is currently
    married, 2 otherwise

120
CATALOGS.XLS -- continued
  • Close coded as 1 if customers lives reasonably
    close to a shopping area that sells similar
    merchandise, 2 otherwise
  • Salary combined annual salary of customer and
    spouse (if any)
  • Children number of children living with customer
  • History coded as NA if customer had no
    dealings with the company before this year, 1 if
    customer was a low spending customer last year, 2
    if medium-spending, 3 if high-spending
  • Catalogs Number of catalogs sent to the customer
    this year
  • AmountSpent Total amount of purchases made by
    the customer this year

121
Analysis Desired
  • HyTex wants to analyze these data carefully to
    understand its customers better.
  • They want to see whether they are sending the
    catalogs to the right customers.
  • Each customer either receives 6, 12, 18, or 24
    catalogs through the mail. Currently which
    customer receives which amount is not thought out
    carefully. They want to know if the current
    distribution of catalogs is effective. Is there
    room for improvement?

122
Analysis
  • HyTex is obviously interested in the AmountSpent
    variable. Therefore, it makes sense to create
    scatterplots of AmountSpent versus selected
    explanatory variables.
  • First we create the scatterplot shown on the next
    slide showing AmountSpent versus Salary.
  • It is clear that customers with higher salaries
    tend to spend more, although the variability in
    amounts spent increases significantly as salary
    increases.

123
Scatterplot of Amount Spent versus Salary
124
Analysis -- continued
  • Second, we create a scatterplot of AmountSpent
    versus Catalogs
  • This plot shows that there is some tendency
    toward higher spending among customers who
    receive more catalogs.
  • But do the catalogs cause more spending, or are
    more catalogs sent to customers who would tend to
    spend more anyway? There is no way to answer this
    with this data.
  • Next is the scatterplot of AmountSpent versus
    Children
  • This plot shows the interesting tendency of
    customers with more children to spend less.

125
Scatterplot of Amount Spent versus Catalogs
126
Scatterplot of Amount Spent versus Children
127
More Analysis
  • Pivot tables and accompanying charts are very
    useful in this type of situation. We can begin by
    using this technique to get a better
    understanding of the demographics of the
    customers.
  • The first pivot table and chart shows the
    percentage of an age group who own homes. Using
    the pivot table we can see how these percentages
    change for women, unmarried men and so on.
  • Specifically small percentages of the younger
    people own their own hoe, regardless of marital
    status or gender.

128
Percent of Home Owners versus Age, Married and
Gender
129
More Analysis -- continued
  • The second pivot table shows the percentage of
    each age group who are married, for any
    combination of the Gender and OwnHome variables.
  • For example, we can check that the
    married/unmarried split is quite different for
    women who dont own a home than for male home
    owners.
  • The third pivot table shows the average Salary
    broken down by Age and Gender, with page
    variables for OwnHome and Married.

130
Percent of Married versus Age, OwnHome and Gender
131
Average Salary versus Age, Gender, Married and
OwnHome
132
More Analysis --continued
  • The shape of the resulting charts is practically
    the same for any combination of the page
    variables. However, the heights of the bars
    change appreciably. For example, salaries are
    higher for the married home owners than for
    unmarried customers who are not home owners.
  • Another pivot table and chart break the data down
    in another way. Each column in the pivot table
    shows the percentages in the various History
    categories for a particular number of children.
    Each of these columns corresponds to one of the
    bars in the stacked bar chart. We have also
    used Close as a page variable.

133
Percentages in History Categories versus Children
and Close
134
More Analysis -- continued
  • Two interesting points emerge from this plot.
  • First, customers with more children tend to be
    more heavily represented in the low-spending
    History category.
  • Second, you can check by changing the setting of
    the Close variable from 1 to 2, the percentage of
    high spenders among customers who live far from
    electronic stores is much higher than for those
    who live close to such stores.
  • The next pivot table provide insight into how
    HyTex determined its catalog mailing
    distributions. Each row shows a percentage of a
    particular History category that were sent
    6,12,18 or 24 catalogs.

135
Catalog Distribution versus History
136
More Analysis -- continued
  • The companys distribution policy is still
    unclear but we can see that it definitely sends
    more catalogs to high spending customers and
    fewer to low spending customers.
  • Finally the last pivot table shows AmountSpent
    versus History and Catalogs, with a variety of
    demographic variables in the page area. There are
    so many possible combinations so it can be
    difficult to discover all the existing patterns

137
Average Amount Spent versus History, Catalogs and
Demographic Variables
138
More Analysis -- continued
  • One things stands out loud and clear from the
    graph the more customers receive, the more they
    tend to spend.
  • Also, if they were large spenders last year, they
    tend to be large spenders this year.
  • In a pivot table with this many combinations
    there will almost certainly be some combinations
    with no observations.
Write a Comment
User Comments (0)
About PowerShow.com