Introduction to Data Mining

Slides: 39

Provided by: aaa23

1
Introduction to Data Mining
2
Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
Web data, e-commerce
purchases at department/grocery stores
Bank/Credit Card transactions
Twice as much information was created in 2002 as
in 1999 (~30% growth rate)
Other growth rate estimates even higher

3
Largest databases in 2007

Largest database in the world: World Data Centre
for Climate (WDCC), operated by the Max Planck
Institute and German Climate Computing Centre
  220 terabytes of data on climate research and
  climatic trends
  110 terabytes' worth of climate simulation data
  6 petabytes' worth of additional information
  stored on tapes
AT&T
  323 terabytes of information
  1.9 trillion phone call records
Google
  91 million searches per day
  After a year's worth of searches, this figure
  amounts to more than 33 trillion database entries.

4
Why Mine Data? Scientific Viewpoint

Data is collected and stored at enormous speeds
(GB/hour), e.g.:
remote sensors on a satellite
telescopes scanning the skies
scientific simulations generating terabytes of
data
Very little data will ever be looked at by a
human
Knowledge Discovery is NEEDED to make sense and
use of data.

5
Data Mining

Data mining is the process of automatically
discovering useful information in large data
repositories.
Human analysts may take weeks to discover useful
information.
Much of the data is never analyzed at all.

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts.]
6
What is (not) Data Mining?

What is Data Mining?
Certain names are more prevalent in certain
locations (O'Brien, O'Rourke, O'Reilly in the
Boston area)
Discover groups of similar documents on the Web

What is not Data Mining?
Look up phone number in phone directory
Query a Web search engine for information about
Amazon

7
Origins of Data Mining

Draws ideas from machine learning/AI,
statistics, and database systems

[Figure: diagram relating Statistics, Machine Learning, Database systems, and Data Mining.]
8
Data Mining Tasks

Data mining tasks are generally divided into two
major categories:
Predictive tasks: Use some attributes to predict
unknown or future values of other attributes.
  Classification
  Regression
  Deviation Detection
Descriptive tasks: Find human-interpretable
patterns that describe the data.
  Association Discovery
  Clustering

9
Predictive Data Mining or Supervised learning

Given a collection of records (training set):
Each record contains a set of attributes; one of
the attributes is the class.
Find ("learn") a model for the class attribute
as a function of the values of the other
attributes.
Goal: Previously unseen records should be
assigned a class as accurately as possible.

10
Learning

We can think of at least three different problems
being involved in learning
memory,
averaging, and
generalization.

11
Example problem (adapted from Leslie Kaelbling's
example in the MIT courseware)

Imagine that I'm trying to predict whether my
neighbor is going to drive into work, so I can
ask for a ride.
Whether she drives into work seems to depend on
the following attributes of the day:
temperature,
expected precipitation,
day of the week,
what she's wearing.

12
Memory

Okay. Let's say we observe our neighbor on three
days

13
Memory

Now, we find ourselves on a snowy -5 degree
Monday, when the neighbor is wearing casual
clothes and going shopping.
Do you think she's going to drive?

Temp  Precip  Day  Clothes  Outcome
25    None    Sat  Casual   Walk
-5    Snow    Mon  Casual   Drive
15    Snow    Mon  Casual   Walk
-5    Snow    Mon  Casual   ?
14
Memory

The standard answer in this case is "yes".
This day is just like one of the ones we've seen
before, and so it seems like a good bet to
predict "yes."
This is about the most rudimentary form of
learning, which is just to memorize the things
you've seen before.

Temp  Precip  Day  Clothes  Outcome
25    None    Sat  Casual   Walk
-5    Snow    Mon  Casual   Drive
15    Snow    Mon  Casual   Walk
-5    Snow    Mon  Casual   Drive
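
Memorization can be sketched as a plain lookup table. This is only an illustration, assuming each observed day is stored under its four attribute values; the names `observations` and `predict_by_memory` are made up for this sketch and do not come from the slides.

```python
# Observed days from the table: (temp, precip, day, clothes) -> outcome
observations = {
    (25, "None", "Sat", "Casual"): "Walk",
    (-5, "Snow", "Mon", "Casual"): "Drive",
    (15, "Snow", "Mon", "Casual"): "Walk",
}


def predict_by_memory(day):
    """Return the remembered outcome for an exact match, or None if unseen."""
    return observations.get(day)


print(predict_by_memory((-5, "Snow", "Mon", "Casual")))  # Drive
print(predict_by_memory((24, "Rain", "Mon", "Casual")))  # None (never seen)
```

A pure lookup has nothing to say about a day it has not seen before, which is why memory alone is not enough.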
15
Noisy Data

Things aren't always as easy as they were in the
previous case. What if you get this set of noisy
data?

Temp  Precip  Day  Clothes  Outcome
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Drive
25    None    Sat  Casual   Drive
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
25    None    Sat  Casual   Walk
25    None    Sat  Casual   ?

Now, we are asked to predict what's going to
happen.
We have certainly seen this case before.
But the problem is that it has had different
answers. Our neighbor is not entirely reliable.

16
Averaging

One strategy would be to predict the majority
outcome.
The neighbor walked more times than she drove in
this situation, so we might predict "walk".
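
The majority-outcome strategy can be sketched in a few lines. The list below copies the seven outcomes from the noisy table; the helper name `majority_outcome` is an assumption of this sketch.

```python
from collections import Counter

# Outcomes observed for the identical day (25, None, Sat, Casual)
outcomes = ["Walk", "Walk", "Drive", "Drive", "Walk", "Walk", "Walk"]


def majority_outcome(labels):
    """Return the most frequent label in the list."""
    return Counter(labels).most_common(1)[0][0]


prediction = majority_outcome(outcomes)
print(prediction)  # Walk (5 walks versus 2 drives)
```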

17
Dealing with previously unseen cases

Will she walk or drive?

Temp  Precip  Day  Clothes  Outcome
22    None    Fri  Casual   Walk
3     None    Sun  Casual   Walk
10    Rain    Wed  Casual   Walk
30    None    Mon  Casual   Drive
20    None    Sat  Formal   Drive
25    None    Sat  Casual   Drive
-5    Snow    Mon  Casual   Drive
27    None    Tue  Casual   Drive
24    Rain    Mon  Casual   ?

We might plausibly make any of the following
arguments:
She's going to walk because it's raining today
and the only other time it rained, she walked.
She's going to drive because she has always
driven on Mondays.
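
The two competing arguments can be reproduced by voting on one attribute at a time. This is a rough sketch rather than a real learning algorithm; `vote_on` and the column constants are names invented for the illustration.

```python
from collections import Counter

# (temp, precip, day, clothes, outcome) rows from the table above
records = [
    (22, "None", "Fri", "Casual", "Walk"),
    (3, "None", "Sun", "Casual", "Walk"),
    (10, "Rain", "Wed", "Casual", "Walk"),
    (30, "None", "Mon", "Casual", "Drive"),
    (20, "None", "Sat", "Formal", "Drive"),
    (25, "None", "Sat", "Casual", "Drive"),
    (-5, "Snow", "Mon", "Casual", "Drive"),
    (27, "None", "Tue", "Casual", "Drive"),
]
PRECIP, DAY = 1, 2  # column indexes


def vote_on(column, value):
    """Majority outcome among past records sharing one attribute value."""
    votes = Counter(r[-1] for r in records if r[column] == value)
    return votes.most_common(1)[0][0] if votes else None


new_day = (24, "Rain", "Mon", "Casual")
print(vote_on(PRECIP, new_day[PRECIP]))  # Walk  (the one rainy day)
print(vote_on(DAY, new_day[DAY]))        # Drive (both Mondays)
```

The two attributes disagree, so generalizing to an unseen day means deciding which attributes matter.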
18
Classification: Another Example
[Figure: a training set with two categorical attributes, one continuous attribute, and a class attribute is used to learn a classifier.]
19
Example of a Decision Tree
[Figure: training data and the decision tree model learned from it.]
Splitting attributes: Refund, MarSt, TaxInc

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
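
A decision tree like this one amounts to nested if-statements. The sketch below assumes the tree reads Refund, then MarSt, then TaxInc, with the leaf labels giving the value of the Cheat class. The function name, the example records, and the handling of an income of exactly 80K are assumptions; none of them come from the slides.

```python
def classify_cheat(refund, marital_status, taxable_income):
    """Walk the tree Refund -> MarSt -> TaxInc and return "Yes" or "No"."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income.
    # Assumption: exactly 80K is treated like the "< 80K" branch.
    return "Yes" if taxable_income > 80_000 else "No"


# Hypothetical records, not taken from the slides
print(classify_cheat("Yes", "Single", 125_000))   # No
print(classify_cheat("No", "Married", 100_000))   # No
print(classify_cheat("No", "Divorced", 95_000))   # Yes
print(classify_cheat("No", "Single", 70_000))     # No
```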
20
Apply Model to Test Data
Test Data
Start from the root of the tree.
21
Apply Model to Test Data
Test Data
22
Apply Model to Test Data
Test Data
[Figure: the decision tree from slide 19, repeated.]
23
Apply Model to Test Data
Test Data
[Figure: the decision tree from slide 19, repeated.]
24
Apply Model to Test Data
Test Data
[Figure: the decision tree from slide 19, repeated.]
25
Apply Model to Test Data
Test Data
[Figure: the decision tree from slide 19, repeated.]
Assign Cheat to "No"
26
Classification: Direct Marketing

Goal: Reduce cost of mailing by targeting a set
of consumers likely to buy a new cell-phone
product.
Approach:
Use the data for a similar product introduced
before.
We know which customers decided to buy and which
decided otherwise. This {buy, don't buy} decision
forms the class attribute.
Collect various demographic, lifestyle, and other
related information about all such customers,
e.g.:
  type of business,
  where they stay,
  how much they earn, etc.
Use this information as input attributes to learn
a classifier model.

27
Classification: Fraud Detection

Goal: Predict fraudulent cases in credit card
transactions.
Approach:
Use credit card transactions and the information
associated with them as attributes, e.g.:
  when does a customer buy,
  what does he buy,
  where does he buy, etc.
Label some past transactions as fraud or fair
transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing
credit card transactions on an account.

28
Classification: Attrition/Churn

Situation: Attrition rate for mobile phone
customers is around 25-30% a year!
Goal: To predict whether a customer is likely to
be lost to a competitor.
Approach:
Use detailed records of transactions with each of
the past and present customers to find
attributes, e.g.:
  how often the customer calls,
  where he calls,
  what time of day he calls most,
  his financial status,
  marital status, etc.
Label the customers as loyal or disloyal. Find a
model for loyalty.

Success story (reported in 2003):
Verizon Wireless performed this kind of data
mining, reducing attrition rate from over 2% per
month to under 1.5% per month.
Huge impact, with > 30 M subscribers (0.5% is
150,000 customers).
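
The size of that impact is easy to check. The sketch below treats the quoted bounds as exact values (30 million subscribers, 2% and 1.5% per month), which is an assumption, and works in basis points to avoid floating-point rounding.

```python
subscribers = 30_000_000  # lower bound quoted on the slide
before_bp = 200           # 2% per month, in basis points
after_bp = 150            # 1.5% per month, in basis points

# Customers retained each month thanks to the lower attrition rate
retained_per_month = subscribers * (before_bp - after_bp) // 10_000
print(retained_per_month)  # 150000
```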

29
Assessing Credit Risk

Situation: A person applies for a loan.
Task: Should a bank approve the loan?
People who have the best credit don't need the
loans.
People with the worst credit are not likely to repay.
Banks' best customers are in the middle.
Banks develop credit models using a variety of
data mining methods.
Mortgage and credit card proliferation are the
results of being able to "successfully" predict
if a person is likely to default on a loan.
Widely deployed in many countries.

30
Frequent-Itemset Mining (Association Discovery)

The Market-Basket Model:
A large set of items, e.g., things sold in a
supermarket.
A large set of baskets, each of which is a small
set of the items, e.g., the things one customer
buys on one day.
Fundamental problem:
What sets of items are often bought together?
Application:
If a large number of baskets contain both hot
dogs and mustard, we can use this information in
several ways. How?
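
The fundamental problem can be sketched by counting how many baskets contain each pair of items. This brute-force count is only an illustration and is not one of the scalable frequent-itemset algorithms; the baskets, the `frequent_pairs` helper, and the support threshold are all made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, each a small set of items
baskets = [
    {"hot dogs", "mustard", "buns"},
    {"hot dogs", "mustard"},
    {"milk", "bread"},
    {"hot dogs", "mustard", "ketchup"},
    {"bread", "mustard"},
]


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always counted under the same key
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


print(frequent_pairs(baskets, min_support=3))
# {('hot dogs', 'mustard'): 3}
```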

31
Hot Dogs and Mustard

Apparently, many people walk from where the hot
dogs are to where the mustard is.
We can put them close together, and put between
them other foods that might also be bought with
hot dogs and mustard, e.g., ketchup or potato
chips.
Doing so can generate additional "impulse" sales.
The store can run a sale on hot dogs and at the
same time raise the price of mustard.
People will come to the store for the cheap hot
dogs, and many will need mustard too.
It is not worth the trouble to go to another
store for cheaper mustard, so they buy that too.
The store makes back on mustard what it loses on
hot dogs, and also gets more customers into the
store.

32
Beer and Diapers

What's the explanation here?

33
On-Line Purchases

Amazon.com offers several million different items
for sale, and has several tens of millions of
customers.
Basket = customer; item = book, DVD, etc.
Motivation: Find out what items are bought
together.
Basket = book, DVD, etc.; item = customer.
Motivation: Find out similar customers.
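
Swapping the roles of baskets and items is an inversion of the same data. A minimal sketch, with hypothetical customers and products and an invented helper name `invert`:

```python
from collections import defaultdict

# Hypothetical purchase data: basket = customer, items = products bought
purchases = {
    "alice": {"book A", "dvd X"},
    "bob": {"book A", "dvd X", "book B"},
    "carol": {"book B"},
}


def invert(baskets):
    """Swap roles: each product becomes a basket holding its customers."""
    inverted = defaultdict(set)
    for customer, items in baskets.items():
        for item in items:
            inverted[item].add(customer)
    return dict(inverted)


by_item = invert(purchases)
print(sorted(by_item["book A"]))  # ['alice', 'bob']
```

Customers who appear together in many of the inverted baskets have bought many of the same products, which is one way to read "similar customers".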

34
Words and Documents

Baskets = sentences; items = words in those
sentences.
Lets us find words that appear together unusually
frequently, i.e., linked concepts.
Baskets = sentences; items = documents containing
those sentences.
Items that appear together too often could
represent plagiarism.

35
Genes

Baskets = people; items = genes or
blood-chemistry factors.
Has been used to detect combinations of genes
that result in diabetes.

36
Clustering

Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that:
Data points in one cluster are more similar to
one another.
Data points in separate clusters are less similar
to one another.
Similarity measures:
Euclidean distance if attributes are continuous.
Other problem-specific measures.
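
The clustering criterion can be checked numerically with Euclidean distance. The points and their grouping below are hypothetical; the sketch verifies the criterion for a given grouping and is not a clustering algorithm.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_distance(pairs):
    """Average Euclidean distance over an iterable of point pairs."""
    pairs = list(pairs)
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


# Hypothetical 2-D points, already grouped into two clusters
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

within = mean_distance(
    list(combinations(cluster_a, 2)) + list(combinations(cluster_b, 2))
)
between = mean_distance((p, q) for p in cluster_a for q in cluster_b)
print(within < between)  # True
```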

37
Clustering Application 1

Market Segmentation
Goal: Subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
Approach:
Collect different attributes of customers based
on their geographical and lifestyle related
information.
Find clusters of similar customers.

38
Clustering Application 2

Document Clustering
Goal: To find groups of documents that are
similar to each other based on the important
words appearing in them.
Approach:
Identify frequently occurring words in each
document.
Form a similarity measure based on the
frequencies of different terms. Use it to
cluster.
Gain: Information retrieval can utilize the
clusters to relate a new document to clustered
documents.

There are two natural clusters in the data set.
The first cluster consists of the first four
articles, which correspond to news about the
economy. The second cluster contains the last
four articles, which correspond to news about
health care.
Each article is represented as a set of
word-frequency pairs (w, c).
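
One possible similarity measure over word-frequency pairs is cosine similarity. The slides do not name a specific measure, so this choice is an assumption, and the three articles below are hypothetical.

```python
import math

# Hypothetical articles stored as word -> count, i.e. sets of (w, c) pairs
economy_1 = {"dollar": 1, "industry": 4, "labor": 2}
economy_2 = {"industry": 3, "labor": 1, "jobs": 2}
health_1 = {"patient": 4, "health": 2, "cost": 1}


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


same_topic = cosine_similarity(economy_1, economy_2)
cross_topic = cosine_similarity(economy_1, health_1)
print(same_topic > cross_topic)  # True
```

Articles that share frequent words score higher, so a clustering step using this measure would tend to group the two economy articles together.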
