Title: Prof. Juran's Lecture Note 1 (Columbia University)
1 Data Mining
- References
- U.S. News and World Report, Business & Technology section, 12/21/98, by William J. Holstein
- Prof. Juran's lecture note 1 (at Columbia University)
- J.H. Friedman (1999), Data Mining and Statistics. Technical report, Dept. of Statistics, Stanford University
2 Main Goal
- Study statistical tools useful in managerial decision making.
- Most management problems involve some degree of uncertainty.
- People have poor intuitive judgment of uncertainty.
- The IT revolution: an abundance of available quantitative information
- Data mining of large databases of information on...
- market segmentation and targeting
- stock market data
- almost anything else you may want to know...
- What conclusions can you draw from your data?
- How much data do you need to support your conclusions?
3 Applications in Management
- Operations management
- e.g., modeling uncertainty in demand, production functions...
- Decision models
- portfolio optimization, simulation, simulation-based optimization...
- Capital markets
- understanding risk, hedging, portfolios, betas...
- Derivatives, options, ...
- it is all about modeling uncertainty
- Operations and information technology
- dynamic pricing, revenue management, auction design, ...
- Data mining... many applications
4 Portfolio Selection
- You want to select a stock portfolio of companies A, B, C, ...
- Information: annual returns by year for each stock
- A: 10, 14, 13, 27, ...
- B: 16, 27, 42, 23, ...
- Questions
- How do we measure the volatility of each stock?
- How do we quantify the risk associated with a given portfolio?
- What is the tradeoff between risk and returns? (A sketch follows.)
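A minimal sketch of how the first two questions are usually answered, treating the four listed returns as percentages; the 50/50 portfolio weights are an arbitrary example, and stock C is omitted since no returns are given for it:

```python
import numpy as np

# Annual returns (%) for stocks A and B, as listed on the slide.
returns_a = np.array([10, 14, 13, 27], dtype=float)
returns_b = np.array([16, 27, 42, 23], dtype=float)

# Volatility is commonly measured as the sample standard deviation
# of the returns (ddof=1 divides by n - 1).
vol_a = returns_a.std(ddof=1)
vol_b = returns_b.std(ddof=1)

# For a two-asset portfolio with weight w in A and (1 - w) in B, the
# portfolio variance also depends on the covariance of the two series.
w = 0.5
cov_ab = np.cov(returns_a, returns_b, ddof=1)[0, 1]
port_var = (w**2 * returns_a.var(ddof=1)
            + (1 - w)**2 * returns_b.var(ddof=1)
            + 2 * w * (1 - w) * cov_ab)

print(f"vol(A) = {vol_a:.2f}%, vol(B) = {vol_b:.2f}%")
print(f"50/50 portfolio std dev = {port_var ** 0.5:.2f}%")
```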
6 Currency Value (Relative to Jan 2, 1998) [chart]
7 Introduction
- Premise: All business is becoming information driven.
- The concept of Data Mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions under conditions of limited certainty.
- Competitiveness: How do you collect and exploit information to your advantage?
- The challenges
- Most corporate data systems are not ready. Can they share information?
- What is the quality of the information going in?
- Most data techniques come from the empirical sciences; the world is not a laboratory.
- Cutting through vendor hype ("info-topia").
- Defining good metrics: abandoning gut rules of thumb may be too "risky" for the manager.
- Communicating success, setting the right expectations.
8 Wal-Mart
- U.S. News and World Report, Business & Technology section, 12/21/98, by William J. Holstein
- "Data-Crunching Santa"
- Wal-Mart knows what you bought last Christmas.
- Wal-Mart is expected to finish the year with $135 billion in sales, up from $118 billion last year.
- It hurts department stores such as Sears, J.C. Penney, and Federated's Macy's and Bloomingdale's units, which have been slower to link all their operations from stores directly to manufacturers.
- For example, Sears stocked too many winter coats this season and was surprised by warmer-than-average weather.
- The field of business analytics has improved significantly over the past few years, giving business users better insights, particularly from operational data stored in transactional systems.
- Analytics are now routinely used in sales, marketing, supply chain optimization, and fraud detection.
9 A visualization of a Naive Bayes model for predicting who in the U.S. earns more than $50,000 in yearly salary. The higher the bar, the greater the evidence that a person with this attribute value earns a high salary.
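A minimal sketch of a model of this kind; the data below is fabricated, and the three attributes (age, years of education, weekly hours) merely stand in for the census fields behind the visualization:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for the census data: columns are (age, years of
# education, hours worked per week); label is 1 if yearly salary
# exceeds $50,000. All values are illustrative only.
X = np.array([
    [25, 12, 40], [47, 16, 50], [38, 14, 45], [52, 18, 60],
    [29, 12, 35], [44, 16, 55], [33, 13, 40], [58, 18, 45],
])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

model = GaussianNB().fit(X, y)

# Naive Bayes combines per-attribute evidence under a conditional
# independence assumption; the "height of the bar" in the slide's
# visualization corresponds to how strongly an attribute value
# shifts the posterior toward the high-salary class.
print(model.predict_proba([[40, 16, 50]]))
```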
10 Telecommunications
- Data mining flourishes in telecommunications due to the availability of vast quantities of high-quality data.
- A significant stream of it consists of call records collected at network switches, used primarily for billing; it enables data mining applications in toll-fraud detection and consumer marketing.
- The best-known marketing application of data mining, albeit via unconfirmed anecdote, concerns MCI's Friends & Family promotion, launched in the domestic U.S. market in 1991.
- As the anecdote goes, market researchers observed relatively small subgraphs in this long-distance phone company's large call-graph of network activity.
- This revealed the promising strategy of adding entire calling circles to the company's subscriber base, rather than the traditional and costly approach of seeking individual customers one at a time. Indeed, MCI increased its domestic U.S. market share in the succeeding years by exploiting the viral capabilities of calling circles: one infected member causes others to become infected.
- Interestingly, the plan was abandoned some years later (not available since 1997), possibly because the virus had run its course but more likely due to other competitive forces.
11 Telecommunications
- In toll-fraud detection, data mining has been instrumental in completely changing the landscape of how anomalous behaviors are detected.
- Nearly all fraud-detection systems in the telecommunications industry 10 years ago were based on global threshold models.
- They can be expressed as rule sets of the form: "If a customer makes more than X calls per hour to country Y, then apply treatment Z."
- The placeholders X, Y, and Z are parameters of these rule sets, applied to all customers.
- Given the range of telecommunication customers, blanket application of these rules produces many false positives.
- Data mining methods for customized monitoring of land and mobile phone lines were subsequently developed by leading service providers, including AT&T, MCI, and Verizon, whereby each customer's historic calling patterns are used as a baseline against which all new calls are compared.
- For customers routinely calling country Y more than X times a day, such alerts would be suppressed, but if they ventured to call a different country Y', an alert might be generated (see the sketch below).
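A minimal sketch contrasting a global threshold rule with customized per-customer baselines; all records, baselines, and thresholds below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical call records: (customer, destination_country, calls_today).
records = [
    ("cust1", "Y", 12),   # cust1 routinely calls country Y heavily
    ("cust2", "Y", 12),   # cust2 almost never calls country Y
]

# Global threshold model: one rule applied to all customers.
GLOBAL_X = 10
global_alerts = [r for r in records if r[2] > GLOBAL_X]  # flags both

# Customized monitoring: compare against each customer's own history.
# These baseline rates are made up; in practice they would be estimated
# from months of each customer's call records.
baseline = defaultdict(lambda: 1.0,
                       {("cust1", "Y"): 11.5, ("cust2", "Y"): 0.2})

def customized_alert(cust, country, calls, factor=3.0):
    """Alert only when volume far exceeds this customer's usual pattern."""
    return calls > factor * baseline[(cust, country)]

for cust, country, calls in records:
    print(cust, "alert:", customized_alert(cust, country, calls))
# cust1: False (normal for them), cust2: True (anomalous for them)
```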
12 Risk management and targeted marketing
- Insurance and direct mail are two industries that rely on data analysis to make profitable business decisions.
- Insurers must be able to accurately assess the risks posed by their policyholders to set insurance premiums at competitive levels.
- For example, overcharging low-risk policyholders would motivate them to seek lower premiums elsewhere; undercharging high-risk policyholders would attract more of them due to the lower premiums.
- In either case, costs would increase and profits inevitably decrease.
- Effective data analysis leading to the creation of accurate predictive models is essential for addressing these issues.
- In direct-mail targeted marketing, retailers must be able to identify subsets of the population likely to respond to promotions in order to offset mailing and printing costs.
- Profits are maximized by mailing only to those potential customers most likely to generate net income to a retailer in excess of the retailer's mailing and printing costs, as in the sketch below.
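A minimal sketch of that mailing rule; the cost, income figure, and predicted response rates are invented for illustration:

```python
# Expected-value mailing rule: mail a customer only if the predicted
# response probability times the expected net income per responder
# exceeds the mailing + printing cost. All numbers are assumptions.
MAIL_COST = 0.75              # cost per piece, dollars
NET_INCOME_IF_RESPOND = 30.0  # net income per responder, dollars

customers = {"a": 0.05, "b": 0.01, "c": 0.10}  # predicted response rates

mail_list = [c for c, p in customers.items()
             if p * NET_INCOME_IF_RESPOND > MAIL_COST]
print(mail_list)  # ['a', 'c']; 'b' has expected value 0.30 < 0.75
```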
13 Medical applications (diabetic screening)
- Preprocessing and postprocessing steps are often the most critical elements determining the effectiveness of real-life data-mining applications, as illustrated by the following recent medical application in diabetic patient screening.
- In the 1990s in Singapore, about 10% of the population was diabetic. Diabetes has many side effects, including increased risk of eye disease, kidney failure, and other complications.
- However, early detection and proper care management can make a difference in the health and longevity of individual sufferers.
- To combat the disease, the government of Singapore introduced a regular screening program for diabetic patients in its public hospitals in 1992.
- Patient information, clinical symptoms, eye-disease diagnoses, treatments, and other details were captured in a database maintained by government medical authorities.
- After almost 10 years of collecting data, a wealth of medical information is available. This vast store of historical data leads naturally to the application of data mining techniques to discover interesting patterns.
- The objective is to find rules physicians can use to understand more about diabetes and how it might be associated with different segments of the population.
14 Christmas Season: Georgia Stores
- Store in Decatur (just east of Atlanta), a black middle-income community
- Decoration display: African-American angels and ethnic Santas aplenty
- Music section: promoting seasonal disks like "Christmas on Death Row," which features rapper Snoop Doggy Dogg
- Toy department: a large selection of brown-skinned dolls
- Store in Dunwoody (20 miles away from Decatur), an affluent, mostly white suburb north of Atlanta
- Music section: showcasing Christmas tunes by country superstar Garth Brooks
- Toy department: a few expensive toys that aren't available in the Decatur store. Out of the hundreds of dolls in stock, only two have brown skin.
- How do we determine the kinds of products that are carried by various Wal-Marts across the land?
15 Wal-Mart system
- Every item in the store has a laser bar code, so when customers pay for their purchases a scanner captures information about what is selling on what day of the week and at what price.
- The scanner also records what other products were in each shopper's basket; Wal-Mart analyzes what is in the shopping cart itself.
- The combination of what's in a purchaser's cart gives a good indication of the age of that consumer and of preferences in terms of ethnic background.
- Wal-Mart combines the in-store data with information about the demographics of communities around each store.
- The end result is surprisingly different personalities for Wal-Marts.
- It also helps Wal-Mart figure out how to place goods on the floor to get what retailers call "affinity sales," or sales of related products (see the sketch below).
16 Wal-Mart system (Cont.)
- One big strength of the system is that about 5,000 manufacturers are tied into it through the company's Retail Link program, which they access via the Internet.
- Pepsi, Disney, or Mattel, for example, can tap into Wal-Mart's data warehouse to see how well each product is selling at each Wal-Mart.
- They can look at how things are selling in individual areas and make decisions about categories where there may be an opportunity to expand.
- That tight information link helps Wal-Mart work with its suppliers to replenish stock of products that are selling well and to quickly pull those that aren't.
17 Data Mining and Statistics
- Data Mining is used to discover patterns and relationships in data, with an emphasis on large observational databases.
- It sits at the common frontiers of several fields, including Database Management, Artificial Intelligence, Machine Learning, Pattern Recognition, and Data Visualization.
- From a statistical perspective it can be viewed as computer-automated exploratory data analysis of large, complex data sets.
- Many organizations have large transaction-oriented databases used for inventory, billing, accounting, etc. These databases were very expensive to create and are costly to maintain. For a relatively small additional investment, DM tools offer to discover highly profitable nuggets of information hidden in these data.
- Data, especially large amounts of it, reside in database management systems (DBMS).
- Conventional DBMS are focused on online transaction processing (OLTP), that is, the storage and fast retrieval of individual records for purposes of data organization. They are used to keep track of inventory, payroll records, billing records, invoices, etc.
18 Data Mining Techniques
- Data Mining is an analytic process designed to
- explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then
- validate the findings by applying the detected patterns to new subsets of data.
- The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one with the most direct business applications.
- The process of data mining consists of three stages (sketched below):
- the initial exploration,
- model building or pattern identification with validation and verification, and
- deployment (i.e., the application of the model to new data in order to generate predictions).
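A compact sketch of the three stages on synthetic data; the model choice and data are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # stand-in dataset
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic target

# Stage 1 (exploration): inspect and prepare the data; here we simply
# split off a holdout set that will play the role of "new data".
X_train, X_new, y_train, y_new = train_test_split(X, y, random_state=0)

# Stage 2 (model building + validation): fit a candidate model and
# verify its predictive performance on data it has not seen.
model = LogisticRegression().fit(X_train, y_train)
print("validation accuracy:", model.score(X_new, y_new))

# Stage 3 (deployment): apply the validated model to new records
# to generate predictions.
print(model.predict(X_new[:5]))
```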
19 Stage 1: Exploration
- It usually starts with data preparation, which may involve cleaning data, data transformations, selecting subsets of records and, in the case of data sets with large numbers of variables ("fields"), performing some preliminary feature selection operations to bring the number of variables down to a manageable range (depending on the statistical methods being considered).
- Depending on the nature of the analytic problem, this first stage may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods, in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be considered in the next stage. A minimal sketch follows.
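A minimal data-preparation sketch with a made-up extract; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical raw extract with the problems the slide mentions:
# missing values to clean, a skewed field worth transforming, and
# uninformative fields worth dropping before modeling.
df = pd.DataFrame({
    "income": [30000.0, np.nan, 85000.0, 42000.0],
    "age": [25.0, 47.0, 38.0, np.nan],
    "constant_flag": [1, 1, 1, 1],   # no variance -> no information
})

df = df.dropna()                         # cleaning: drop incomplete rows
df["log_income"] = np.log(df["income"])  # transform a skewed field

# Preliminary feature selection: keep only fields that actually vary.
df = df.loc[:, df.nunique() > 1]
print(df.columns.tolist())               # ['income', 'age', 'log_income']
```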
20 Stage 2: Model building and validation
- This stage involves considering various models and choosing the best one based on their predictive performance, i.e., models that
- explain the variability in question and
- produce stable results across samples.
- How do we achieve these goals?
- This may sound like a simple operation, but it sometimes involves a very elaborate process of "competitive evaluation of models": applying different models to the same data set and then comparing their performance to choose the best (see the sketch below).
- These techniques, often considered the core of predictive data mining, include Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
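A minimal sketch of competitive evaluation on synthetic data, comparing a baseline model against the bagging and boosting techniques named above; the dataset and model settings are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=0)

# "Competitive evaluation of models": apply several candidate models
# to the same data and compare cross-validated performance.
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "bagging": BaggingClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```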
21 Models for Data Mining
- In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization.
- In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements.
- CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining.
- The Six Sigma methodology is a well-structured, data-driven methodology for eliminating defects, waste, or quality-control problems of all kinds in manufacturing, service delivery, management, and other business activities.
22 CRISP
- This general approach postulates the following (perhaps not particularly controversial) general sequence of steps for data mining projects: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
23 Six Sigma
- This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to be gaining favor worldwide. It postulates a sequence of so-called DMAIC steps for data mining projects:
- Define (D) → Measure (M) → Analyze (A) → Improve (I) → Control (C)
- It grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries).
- Define: concerned with the definition of project goals and boundaries, and the identification of issues that need to be addressed to achieve the higher sigma level.
- Measure: the goal of this phase is to gather information about the current situation, to obtain baseline data on current process performance, and to identify problem areas.
- Analyze: the goal of this phase is to identify the root cause(s) of quality problems, and to confirm those causes using the appropriate data analysis tools.
- Improve: the goal of this phase is to implement solutions that address the problems (root causes) identified during the previous (Analyze) phase.
- Control: the goal of this phase is to evaluate and monitor the results of the previous phase (Improve).
24 Six Sigma Process
- A six sigma process is one that can be expected to produce only 3.4 defects per one million opportunities.
- The concept of the six sigma process is important in Six Sigma quality improvement programs.
- The term Six Sigma derives from the goal of achieving process variation such that ±6σ (where σ is the estimate of the population standard deviation) will "fit" inside the lower and upper specification limits for the process.
- In that case, even if the process mean shifts by 1.5σ in one direction (e.g., by +1.5σ in the direction of the upper specification limit), the process will still produce very few defects.
- For example, suppose we expressed the area above the upper specification limit in terms of one million opportunities to produce defects. The 6σ process shifted upwards by 1.5σ will produce only 3.4 defects (i.e., "parts" or "cases" greater than the upper specification limit) per one million opportunities, as the calculation below confirms.
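A quick check of the 3.4-per-million figure using the normal upper tail (scipy assumed available):

```python
from scipy.stats import norm

# With the mean shifted 1.5 sigma toward the upper spec limit, that
# limit sits 6 - 1.5 = 4.5 sigma above the mean; the upper-tail
# probability of the normal distribution gives the defect rate.
defect_rate = norm.sf(4.5)          # P(Z > 4.5)
print(defect_rate * 1_000_000)      # ~3.4 defects per million
```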
25 A statistician's remarks on DM paradigms
- The DM community may have to moderate its romance with "big."
- A prevailing attitude seems to be that unless an analysis involves gigabytes or terabytes of data, it cannot possibly be worthwhile.
- It seems to be a requirement that all of the data that has been collected must be used in every aspect of the analysis.
- Sophisticated procedures that cannot simultaneously handle data sets of such size are not considered relevant to DM.
- Most DM applications routinely require data sets that are considerably larger than those that have been addressed by traditional statistical procedures (kilobytes).
- It is often the case that the questions being asked of the data can be answered to sufficient accuracy with less than the entire giga- or terabyte database.
- Sampling methodology, which has a long tradition in Statistics, can profitably be used to improve accuracy while mitigating computational requirements.
- Also, a powerful, computationally intense procedure operating on a subsample of the data may in fact provide superior accuracy to a less sophisticated one using the entire database.
26 Sampling
- Objective: determine the average amount of money spent in the Central Mall.
- Sampling: a Central City official randomly samples 12 people as they exit the mall.
- He asks them the amount of money spent and records the data.
- Data for the 12 people (amount spent, in dollars):

  Person  Spent($)   Person  Spent($)   Person  Spent($)
  1        132       5        123        9       449
  2        334       6          5       10       133
  3         33       7          6       11        44
  4         10       8         14       12         1

- The official is trying to estimate the mean and variance of the population based on this sample of 12 data points.
27 Population versus Sample
- A population is usually a group we want to know something about:
- all potential customers, all eligible voters, all the products coming off an assembly line, all items in inventory, etc.
- Finite population u1, u2, ..., uN versus infinite population.
- A population parameter is a number (θ) relevant to the population that is of interest to us:
- the proportion (in the population) that would buy a product, the proportion of eligible voters who will vote for a candidate, the average number of M&M's in a pack...
- A sample is a subset of the population that we actually do know about (by taking measurements of some kind):
- a group who fill out a survey, a group of voters that are polled, a number of randomly chosen items off the line...
- x1, x2, ..., xn
- A sample statistic g(x1, x2, ..., xn) is often the only practical estimate of a population parameter.
- We will use g(x1, x2, ..., xn) as a proxy for θ, but remember the difference.
28 Average Amount of Money Spent in the Central Mall
- Given a sample (x1, x2, ..., xn), its mean is the sum of the values divided by the number of observations.
- The sample mean, the sample variance, and the sample standard deviation are 107, 20,854, and 144.40, respectively.
- We claim that, on average, $107 is spent per shopper, with a standard deviation of $144.40.
- Why can we claim so? (See the check below.)
29
- The variance s² of a set of observations is the average of the squared deviations of the observations from their mean; with the divisor n − 1 used on slide 28, s² = Σ(xi − x̄)²/(n − 1).
- The standard deviation s is the square root of the variance s².
- How far are the observations from the mean? s² and s will be
- large if the observations are widely spread about their mean,
- small if they are all close to the mean.
30 Stock Market Indexes
- A stock market index is a statistical measure that shows how the prices of a group of stocks change over time.
- Price-Weighted Index: DJIA
- Market-Value-Weighted Index: Standard & Poor's 500 Composite Index
- Equally Weighted Index: Wilshire 5000 Equity Index
- A price-weighted index shows the change in the average price of the stocks that are included in the index.
- Notation: price per share in the current period P0 and in the next period P1; number of shares outstanding in the current period Q0 and in the next period Q1.
31 DJIA
- Dow Jones Industrial Average (DJIA)
- Charles Dow first concocted his 12-stock industrial average in 1896 (expanding to 30 stocks in 1928).
- Original: an arithmetic average of the thirty stock prices that make up the index:
  DJIA = (P1 + P2 + ... + P30)/30
- Current: adjusted for stock splits and the issuance of stock dividends:
  DJIA = (P1 + P2 + ... + P30)/AD
  where AD is the appropriate divisor.
- How do we adjust AD to account for stock splits, adding new stocks, ...?
- The adjustment process is designed to keep the index value the same as it would have been if the split had not occurred.
- Suppose stock X30 splits 2:1, from 100 to 50. Then change the divisor c to c' such that
  (X1 + X2 + ... + 100)/c = (X1 + X2 + ... + 50)/c'
- The new divisor satisfies c' < c, keeping the index constant before and after the split.
- How about when new stocks are added and others are removed? (The sketch below works through the split case.)
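A sketch of the divisor adjustment with made-up prices for a 3-stock index; the slide's 30-stock case works identically:

```python
# Divisor adjustment after a 2:1 split, using hypothetical prices.
prices = [120.0, 80.0, 100.0]
divisor = 3.0
index_before = sum(prices) / divisor            # 100.0

# Stock 3 splits 2:1: its price drops from 100 to 50.
prices_after = [120.0, 80.0, 50.0]

# Choose the new divisor c' so the index is unchanged by the split:
# sum(new prices) / c' = sum(old prices) / c, which forces c' < c.
new_divisor = sum(prices_after) / index_before  # 2.5
index_after = sum(prices_after) / new_divisor
print(index_before, index_after, new_divisor)   # 100.0 100.0 2.5
```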
32 DJIA
- How each stock in the Dow performed during the period when the Dow rose 100 percent (from its close above 5,000 on Nov. 21, 1995 until it closed above 10,000 on March 29, 1999).
- Notes: companies not in the Dow when it crossed 5,000; adjusted for spinoffs; does not reflect performance of stocks spun off to shareholders.

  Company          Weight in the Dow (%)   Change in Price (%)
  Alcoa            1.9                      52
  AlliedSignal     2.3                     129
  Amer. Express    5.5                     185
  AT&T             3.6                      87
  Boeing           1.5                      -5
  Caterpillar      2.1                      59
  Chevron          4.0                      77
  Citigroup        2.8                     262
  Coca-Cola        3.0                      69
  Du Pont          2.5                      76
  Eastman Kodak    2.9                      -6
33 DJIA

  Company               Weight in the Dow (%)   Change in Price (%)
  Exxon                 3.2                      83
  General Electric      5.3                     232
  General Motors        3.9                      89
  Goodyear              2.2                      23
  Hewlett-Packard       3.1                      66
  I.B.M.                1.9                     276
  International Paper   2.0                      24
  J.P. Morgan           5.0                      63
  Johnson & Johnson     4.2                     120
  McDonald's            2.0                     102
  Merck                 3.6                     175
  Minnesota Mining      3.2                      15
  Philip Morris         1.8                      37
  Procter & Gamble      4.5                     134
  Sears, Roebuck        2.1                      18
  Union Carbide         2.1                      19
  United Technologies   6.0                     196
  Wal-Mart              4.2                     288
34 S&P 500
- The S&P 500, which started in 1957, weights stocks on the basis of their total market value.
- The S&P 500 is computed by
  S&P 500 = (w1X1 + w2X2 + ... + w500X500)/c
  where Xi is the price of the ith stock and wi is the number of shares outstanding of the ith stock.
- What happens when a stock splits? Because the index is a weighted average of market values, a split leaves each term wiXi unchanged, so no divisor adjustment is needed (see the sketch below).
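A sketch with hypothetical share counts and prices showing why a market-value-weighted index needs no divisor adjustment for splits:

```python
# Market-value weighting, illustrated with a 3-stock version of the
# S&P 500 formula: index = sum(shares_i * price_i) / c.
shares = [1000.0, 2000.0, 500.0]
prices = [50.0, 20.0, 100.0]
c = 1000.0

index = sum(w * x for w, x in zip(shares, prices)) / c
print(index)  # 140.0

# After a 2:1 split of stock 3, its price halves and its share count
# doubles, so total market value -- and hence the index -- is
# unchanged, unlike the price-weighted DJIA.
shares[2], prices[2] = shares[2] * 2, prices[2] / 2
print(sum(w * x for w, x in zip(shares, prices)) / c)  # still 140.0
```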
35 Sample vs. Population
- For both problems, we try to infer properties of a large group (the population) by analyzing a small subgroup (the sample).
- The population is the group we are trying to analyze, e.g., all eligible voters, etc.
- A sample is a subset of the total population that we have observed or collected data from, e.g., voters that are actually polled, etc.
- How do we draw a sample that can be used to make statements about the population?
- The sample must be representative of the population.
- Sampling is the way to obtain reliable information in a cost-effective way (why not a census?).
36 Issues in sampling
- Representativeness
- Interviewer discretion
- Respondent discretion: non-response
- Key question: is the reason for non-response related to the attribute you are trying to measure? (Illegal aliens vs. the Census; start-up companies not in the phone book; library exit surveys.)
- Good samples are probability samples: each unit in the population has a known probability of being in the sample.
- Simplest case: an equal probability sample, where each unit has the same chance of being in the sample.
37 Utopian Sample for Analysis
- You have a complete and accurate list of ALL the units in the target population (the sampling frame).
- From this you draw an equal probability sample (generate a list of random numbers).
- Reality check: incomplete frame, impossible frame, practical constraints on the simple random sample (cost and time of sampling).
- Precision considerations: how large a sample do I need?
- Focus on the confidence interval: choose a coverage rate (90%, 95%, 99%) and a margin of error (half the width). Typically we trade off width against coverage rate.
- Simple rule of thumb for a population proportion: for a 95% CI, use n = 1/(margin of error)², as checked below.
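A quick check of the rule of thumb; the 3-point margin is an arbitrary example:

```python
import math

# For a 95% CI on a proportion, the margin of error is about
# 1.96 * sqrt(p(1-p)/n) <= 1.96 * 0.5 / sqrt(n) ~= 1/sqrt(n),
# since p(1-p) is at most 0.25. Inverting gives n ~= 1/margin**2.
margin = 0.03                      # want the estimate within +/- 3 points
print(math.ceil(1 / margin ** 2))  # about 1112 respondents

# Exact worst-case version for comparison: n = 1.96**2 * 0.25 / margin**2.
print(math.ceil(1.96 ** 2 * 0.25 / margin ** 2))  # 1068
```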
38 Data Analysis
- Statistical thinking is understanding variation and how to deal with it.
- Move as far as possible to the right on this continuum:
- Ignorance → Uncertainty → Risk → Certainty
- Information science: learning from data
- Probabilistic inference based on mathematics
- What is Statistics?
- What is the connection, if any, between Statistics and Data Mining?