Title: Believe it Or Not
1Believe it Or Not
2Statistics Can Be Fun
3Learning Outcomes
- At the end of this talk, I hope that you will
- Realize that statisticians are real people with
real lives
- Agree that Statistics is an important area
- Have an understanding of Data Mining
4Who Am I and Life Beyond Statistics
- Worked for NASA: put the men on the moon,
1964-1969
- Teacher/Researcher
- Businessman
- Civic leader
5Outside of Statistics
- Wife, 2 children and 5 grandchildren
- Wife makes porcelain dolls
- Wife and I make pottery; we specialize in
crystalline glazes
- We fish as often as we can
6Pictures
7Diane's Dolls
8Are You New To Statistics
9But
Nice you asked I am a brain surgeon
I am a statistician
10It is Nice to Be Alone
Just kidding around
11Shakespeare Statistics
- There is a lot in common
- Good Party Talk
12Shakespeare Statistics
- BEATRICE
- He set up his bills here in Messina and
challenged Cupid at the flight; and my uncle's
fool, reading the challenge, subscribed for
Cupid, and challenged him at the bird-bolt. I
pray you, how many hath he killed and eaten in
these wars? But how many hath he killed? for
indeed I promised to eat all of his killing.
And your literature professor would ask: What
are the essential concepts that Shakespeare is
trying to convey? What is Shakespeare saying?
13Statistics
And we want to know: what are the essential
parts of this data? What is the data saying
to us?
14Difference Between Shakespeare and Statistics
- Statistics has a set of rules, and reasonable
people will come to similar conclusions about the
data
- In literature, many different interpretations
15Where did the Shakespeare Quote Come From?
16Data Mining
- Data mining, or knowledge discovery, is the
computer-assisted process of digging through and
analyzing enormous sets of data and then
extracting the meaning of the data.
- Data mining tools predict behaviors and future
trends, allowing businesses to make proactive,
knowledge-driven decisions.
17Example
- One Midwest grocery chain used data mining to
analyze local buying patterns. They discovered
that when men bought diapers on Thursdays and
Saturdays, they also tended to buy beer. Further
analysis showed that these shoppers typically did
their weekly grocery shopping on Saturdays. On
Thursdays, however, they only bought a few items.
The retailer concluded that they purchased the
beer to have it available for the upcoming
weekend. The grocery chain could use this newly
discovered information in various ways to
increase revenue. For example, they could move
the beer display closer to the diaper display.
And, they could make sure beer and diapers were
sold at full price on Thursdays.
18Another Example
- Merck-Medco Managed Care is a mail-order business
which sells drugs to the country's largest health
care providers: Blue Cross and Blue Shield state
organizations, large HMOs, U.S. corporations,
state governments, etc. Merck-Medco is mining its
one terabyte data warehouse to uncover hidden
links between illnesses and known drug
treatments, and spot trends that help pinpoint
which drugs are the most effective for what types
of patients. The results are more effective
treatments that are also less costly.
Merck-Medco's data mining project has helped
customers save an average of 10-15% on
prescription costs.
19Other Uses
- Market segmentation - Identify the common
characteristics of customers who buy the same
products from your company.
- Customer churn - Predict which customers are
likely to leave your company and go to a
competitor.
- Fraud detection - Identify which transactions are
most likely to be fraudulent.
- Direct marketing - Identify which prospects
should be included in a mailing list to obtain
the highest response rate.
20More uses
- Interactive marketing - Predict what each
individual accessing a Web site is most likely
interested in seeing.
- Market basket analysis - Understand what products
or services are commonly purchased together,
e.g., beer and diapers.
- Trend analysis - Reveal the difference between a
typical customer this month and last.
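Market basket analysis, as in the beer and diapers example, is often summarized with three numbers: support, confidence, and lift. The sketch below is only an illustration of those measures. The transactions are invented toy data, not the grocery chain's, and the function names are hypothetical.

```python
# Hypothetical toy baskets, invented for illustration only.
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated share of antecedent baskets that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's overall support.

    A value above 1 suggests the items appear together more often
    than they would if purchases were independent.
    """
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


if __name__ == "__main__":
    print(f"support(diapers, beer)     = {support({'diapers', 'beer'}, transactions):.3f}")
    print(f"confidence(diapers → beer) = {confidence({'diapers'}, {'beer'}, transactions):.3f}")
    print(f"lift(diapers → beer)       = {lift({'diapers'}, {'beer'}, transactions):.3f}")
```

In this toy data, half of the baskets contain both items, and three of the four diaper baskets also contain beer, so the lift comes out slightly above 1.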
21Another definition
- Data mining is the use of automated data analysis
techniques to uncover previously undetected
relationships among data items. Data mining
involves the statistical analysis of data stored
in a data warehouse. Three of the major data
mining techniques are regression, classification
and clustering.
22Misconceptionfrom tu
23Census 2000 Data Set
- The CENSUS2000 data is a postal code-level
summary of the entire 2000 United States Census.
It features seven variables:
- ID: postal code of the region
- LOCX: region longitude
- LOCY: region latitude
- MEANHHSZ: average household size in the region
- MEDHHINC: median household income in the region
- REGDENS: region population density percentile
(1 = lowest density, 100 = highest density)
- REGPOP: number of people in the region
24Census Data: 33,000 observations
25Data, Speak to Me with Thine Numbers
- What is going on here?
- What are the essential parts?
- Let us do a plot.
26Plot of Latitude and Longitude, Color by Density
27Pattern Discovery
The Essence of Data Mining?
"...the discovery of interesting, unexpected, or
valuable structures in large data sets."
David Hand
28Pattern Discovery
The Essence of Data Mining?
"...the discovery of interesting, unexpected, or
valuable structures in large data sets."
David Hand
"If you've got terabytes of data, and you're
relying on data mining to find interesting things
in there for you, you've lost before you've even
begun."
Herb Edelstein
29k-means Clustering Algorithm
Training Data
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to closest center.
4. Update cluster centers.
5. Re-assign cases.
6. Repeat steps 4 and 5 until convergence.
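The k-means steps listed on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the SAS Cluster tool used in the talk. It assumes squared Euclidean distance, initial centers drawn at random from the cases, and a stopping rule of "no case changes cluster". The sample points and the function names are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(cases, k, max_iter=100, seed=0):
    """Return (centers, assignment) for a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(cases, k)  # step 2: select k cluster centers
    assignment = None
    for _ in range(max_iter):
        # steps 3 and 5: assign each case to its closest center
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(case, centers[j]))
            for case in cases
        ]
        if new_assignment == assignment:
            break  # step 6: nothing moved, so the clusters have converged
        assignment = new_assignment
        # step 4: move each center to the mean of its assigned cases
        for j in range(k):
            members = [c for c, a in zip(cases, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean_point(members)
    return centers, assignment


# Step 1: select inputs. These six points are invented for illustration.
cases = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
         (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centers, assignment = k_means(cases, k=2)

if __name__ == "__main__":
    for case, cluster in zip(cases, assignment):
        print(case, "-> cluster", cluster)
```

The result depends on the starting centers in general, which is why practical tools often run the algorithm from several starting points and keep the best solution.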
32Demographic Segmentation Demonstration
Analysis goal
Group geographic regions into segments based on
income, household size, and population density.
Analysis plan
Select and transform segmentation inputs.
Select the number of segments to create.
Create segments with the Cluster tool.
Interpret the segments.
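The "select and transform" step of the plan can be illustrated with a log transform followed by standardization. The sketch below is an assumption about what such preparation might look like, not the SAS program from the talk. The income figures are invented, the function names are hypothetical, and the log transform assumes every value is strictly positive.

```python
import math
from statistics import mean, pstdev


def log_transform(values):
    """Natural log of each value; assumes every value is strictly positive."""
    return [math.log(v) for v in values]


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (z-scores)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    return mean([z ** 3 for z in standardize(values)])


# Hypothetical median household incomes with one very large value,
# standing in for a right-skewed input such as MEDHHINC.
incomes = [18000, 25000, 31000, 40000, 52000, 75000, 250000]

log_incomes = log_transform(incomes)
z_scores = standardize(log_incomes)

if __name__ == "__main__":
    print(f"skewness before log: {skewness(incomes):.2f}")
    print(f"skewness after log:  {skewness(log_incomes):.2f}")
```

Standardizing after the transform puts inputs measured in different units, such as income and household size, on a comparable scale before distances between regions are computed.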
33SAS Program
34Skewed Data
35Transformed Data - log
3610 Clusters
37Select a Cluster
38What Can We Say?
39Where Do They Live
40Summary
- Statistics is a great field of study
- Thank you for being here and wanting to teach AP
statistics
- We want to help you