Knowledge Discovery and Data Mining - PowerPoint PPT Presentation

Provided by: qiang5

1
Knowledge Discovery and Data Mining
  • Soongsil University

2
(No Transcript)
3
Customer Relationship Management (CRM)
4
(Korean-language news headline; text lost in transcription)
http://news.media.daum.net/economic/industry/200611/16/joins/v14743974.html?_RIGHT_COMMR4
5
Customer Attrition Case Study
  • Situation: The attrition rate for mobile phone
    customers is around 25-30% a year!
  • With this in mind, what is our task?
  • Assume we have customer information for the past
    N months.

6
Customer Attrition Case Study
  • Task:
  • Predict who is likely to attrite next month.
  • Estimate customer value and determine the most
    cost-effective offer to make to this customer.

7
Customer Attrition Results
  • Verizon Wireless built a customer data warehouse
  • Identified potential attriters
  • Developed multiple, regional models
  • Targeted customers with high propensity to accept
    the offer
  • Reduced attrition rate from over 2%/month to
    under 1.5%/month (huge impact, with >30 M
    subscribers)
  • (Reported in 2003)
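
The size of that impact can be checked with a little arithmetic. The sketch below assumes the slide's figures are percentages per month and treats 30 million subscribers as a lower bound; the function name is illustrative, not from the presentation.

```python
# Assumption: the slide's "2/month" and "1.5/month" are percentages per month,
# and 30 million is used as a lower bound for the subscriber base.
SUBSCRIBERS = 30_000_000
RATE_BEFORE_BP = 200   # 2.0% per month, expressed in basis points
RATE_AFTER_BP = 150    # 1.5% per month, expressed in basis points


def monthly_attriters(subscribers: int, rate_bp: int) -> int:
    """Customers lost in one month at the given rate (in basis points)."""
    return subscribers * rate_bp // 10_000


# Customers kept each month thanks to the lower attrition rate.
retained = (monthly_attriters(SUBSCRIBERS, RATE_BEFORE_BP)
            - monthly_attriters(SUBSCRIBERS, RATE_AFTER_BP))
print(retained)  # 150000
```

Integer basis points are used instead of floats so the result is exact.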

8
Data Mining: An Example
  • You are a marketing manager for a brokerage
    company
  • Problem: Churn (also known as attrition) is too
    high
  • Turnover (after the six-month introductory period
    ends) is 40%
  • Customers receive incentives (average cost: $160)
    when an account is opened
  • Giving new incentives to everyone who might leave
    is very expensive (as well as wasteful)
  • Bringing back a customer after they leave is both
    difficult and costly

9
A Solution
  • One month before the introductory period ends,
    predict which customers will leave
  • If you want to keep a customer who is predicted
    to churn, offer them something based on their
    predicted value
  • The ones that are not predicted to churn need no
    attention
  • If you don't want to keep the customer, do
    nothing
  • How can you predict future behavior?
  • Build models
  • Test models
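
The decision logic on this slide can be sketched as a small function. This is only an illustration: the `Customer` fields, the 0.5 churn threshold, and the rule of offering 10% of predicted value are assumptions made for the example, not figures from the presentation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Customer:
    name: str
    churn_probability: float  # output of some churn model (hypothetical)
    predicted_value: float    # expected future value if retained (hypothetical)


def choose_action(c: Customer,
                  churn_threshold: float = 0.5,
                  offer_fraction: float = 0.1) -> Tuple[str, float]:
    """Return (action, offer_amount) following the three cases on the slide."""
    if c.churn_probability < churn_threshold:
        return ("no attention", 0.0)   # not predicted to churn
    if c.predicted_value <= 0:
        return ("do nothing", 0.0)     # predicted to churn, not worth keeping
    # Predicted to churn and worth keeping: size the offer by predicted value.
    return ("offer", round(c.predicted_value * offer_fraction, 2))


print(choose_action(Customer("c1", 0.9, 400.0)))
```

In practice the threshold and offer size would themselves be tuned against the cost of incentives and the value of retained customers.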

10
Convergence of Three Technologies
11
Why Now? 1. Increasing Computing Power
  • Moore's law: computing power doubles every 18
    months
  • Powerful workstations became common
  • Cost effective servers (SMPs) provide parallel
    processing to the mass market

12
2. Improved Data Collection
  • Data Collection -> Access -> Navigation -> Mining
  • The more data the better (usually)

13
Mining Large Data Sets - Motivation
  • There is often information hidden in the data
    that is not readily evident
  • Human analysts may take weeks to discover useful
    information
  • Much of the data is never analyzed at all

14
Largest databases in 2007
  • Commercial databases
  • AT&T: 312 TB
  • World Data Centre for Climate: 220 TB
  • YouTube: 45 TB of videos
  • Amazon: 42 TB (250,000 full textbooks)
  • Central Intelligence Agency (CIA) ?

15
3. Improved Algorithms (AI & Database)
  • Techniques have often been waiting for computing
    technology to catch up
  • Statisticians were already doing manual data
    mining
  • Good machine learning is the intelligent
    application of statistical processes
  • A lot of data mining research focused on tweaking
    existing techniques to get small percentage gains

16
Definition: Predictive Model
  • A black box that makes predictions about the
    future based on information from the past and
    present
  • Large number of inputs usually available

17
How are Models Built and Used?
  • View from 20,000 feet

18
The Data Mining Process
19
What the Real World Looks Like
20
Why Mine Data?
Motivation: "Necessity is the Mother of Invention"
  • Data explosion problem
  • Automated data collection tools and mature
    database technology lead to tremendous amounts of
    data stored in databases, data warehouses and
    other information repositories
  • We are drowning in data, but starving for
    knowledge!
  • Solution: Data warehousing and data mining
  • Data warehousing and on-line analytical
    processing
  • Extraction of interesting knowledge (rules,
    regularities, patterns, constraints) from data
    in large databases

21
Predictive Models are
  • Decision Trees
  • Nearest Neighbor Classification
  • Neural Networks
  • Rule Induction
  • K-means Clustering

22
Why Data Mining? Potential Applications
  • Database analysis and decision support
  • Market analysis and management
  • target marketing, customer relation management,
    market basket analysis, cross selling, market
    segmentation
  • Risk analysis and management
  • Forecasting, customer retention, improved
    underwriting, quality control, competitive
    analysis
  • Fraud detection and management
  • Other Applications
  • Text mining (news group, email, documents) and
    Web analysis.
  • Intelligent query answering

23
Data Mining is Not ...
  • Data warehousing
  • SQL / Ad Hoc Queries / Reporting
  • Software Agents
  • Online Analytical Processing (OLAP)
  • Data Visualization

24
Common Uses of Data Mining
  • Marketing
  • Direct mail marketing
  • Web site personalization
  • Fraud Detection
  • Credit card fraud detection
  • Science
  • Bioinformatics
  • Gene analysis
  • Web & Text analysis
  • Google

25
Corporate Analysis and Risk Management
  • Finance planning and asset evaluation
  • cash flow analysis and prediction
  • contingent claim analysis to evaluate assets
  • trend analysis, etc.
  • Resource planning
  • summarize and compare the resources and spending
  • Competition
  • monitor competitors and market directions
  • group customers into classes and a class-based
    pricing procedure
  • set pricing strategy in a highly competitive
    market

26
Fraud Detection and Management (1)
  • Applications
  • widely used in health care, retail, credit card
    services, telecommunications (phone card fraud),
    etc.
  • Approach
  • use historical data to build models of fraudulent
    behavior and use data mining to help identify
    similar instances
  • Examples
  • auto insurance: detect a group of people who
    stage accidents to collect on insurance
  • money laundering: detect suspicious money
    transactions (US Treasury's Financial Crimes
    Enforcement Network)
  • medical insurance: detect professional patients
    and rings of doctors and rings of references

27
Fraud Detection and Management (2)
  • Detecting inappropriate medical treatment
  • Australian Health Insurance Commission identified
    that in many cases blanket screening tests were
    requested (saving Australian $1m/yr).
  • Detecting telephone fraud
  • Telephone call model: destination of the call,
    duration, time of day or week. Analyze patterns
    that deviate from an expected norm.
  • British Telecom identified discrete groups of
    callers with frequent intra-group calls,
    especially mobile phones, and broke a
    multimillion dollar fraud.
  • Retail
  • Analysts estimate that 38% of retail shrink is
    due to dishonest employees.

28
Scientific Viewpoint
  • Data collected and stored at enormous speeds
    (GB/hour)
  • remote sensors on a satellite
  • telescopes scanning the skies
  • microarrays generating gene expression data
  • scientific simulations generating terabytes of
    data
  • Traditional techniques infeasible for raw data
  • Data mining may help scientists
  • in classifying and segmenting data
  • in Hypothesis Formation

29
Other Applications
  • Sports
  • IBM Advanced Scout analyzed NBA game statistics
    (shots blocked, assists, and fouls) to gain
    competitive advantage for New York Knicks and
    Miami Heat
  • Astronomy
  • JPL and the Palomar Observatory discovered 22
    quasars with the help of data mining
  • Internet Web Surf-Aid
  • IBM Surf-Aid applies data mining algorithms to
    Web access logs for market-related pages to
    discover customer preference and behavior pages,
    analyzing effectiveness of Web marketing,
    improving Web site organization, etc.

30
What is Data Mining?
  • Many Definitions
  • Non-trivial extraction of implicit, previously
    unknown and potentially useful information from
    data
  • Exploration and analysis, by automatic or
    semi-automatic means, of large quantities of
    data in order to discover meaningful patterns

31
What is (not) Data Mining?
  • What is Data Mining?
  • Certain names are more prevalent in certain US
    locations (O'Brien, O'Rourke, O'Reilly in the
    Boston area)
  • Group together similar documents returned by a
    search engine according to their context (e.g.
    Amazon rainforest, Amazon.com)
  • What is not Data Mining?
  • Look up phone number in phone directory
  • Query a Web search engine for information about
    Amazon

32
Origins of Data Mining
  • Draws ideas from machine learning/AI, pattern
    recognition, statistics, and database systems
  • Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data

33
Data Mining Tasks
  • Prediction Methods
  • Use some variables to predict unknown or future
    values of other variables.
  • Description Methods
  • Find human-interpretable patterns that describe
    the data.

From Fayyad et al., Advances in Knowledge
Discovery and Data Mining, 1996
34
Data Mining Tasks...
  • Exploratory Data Analysis
  • Classification [Predictive]
  • Clustering [Descriptive]
  • Association Rule Discovery [Descriptive]
  • Sequential Pattern Discovery [Descriptive]
  • Regression [Predictive]
  • Deviation Detection [Predictive]

35
Exploratory Data Analysis
  • Exploratory Data Analysis (EDA)
  • Explore the data without any clear ideas of what
    we are looking for
  • EDA techniques are interactive and visual
  • Many effective visualization techniques for small
    and low dimensional data
  • High dimensionality => difficult visualization =>
    requires dimensionality reduction and projection
    techniques
  • Examples of visualization techniques pie charts,
    histograms, scatterplots, contour plots

36
Predictive Data Mining
  • Predictive Modeling: Classification and
    Regression
  • Goal: Build a model that will predict the value
    of one variable from the known values of other
    variables
  • - Classification: the variable to be predicted is
    categorical (i.e. its values belong to a
    pre-specified, finite set of possibilities)
  • - Regression: the variable to be predicted is
    numeric
  • Called supervised learning in machine learning
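
For the regression case, a minimal sketch is an ordinary least-squares line fit, where a numeric target is predicted from one known variable. The data below is made up for illustration, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: month number vs. a customer's monthly spend.
months = [1, 2, 3, 4]
spend = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, spend)


def predict(x):
    """Predict the numeric target for a new value of the input variable."""
    return intercept + slope * x


print(predict(5))
```

A classifier follows the same train-then-predict shape, except that the predicted variable takes values from a finite set of classes.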

37
Classification: Definition
  • Given a collection of records (training set)
  • Each record contains a set of attributes; one of
    the attributes is the class.
  • Find a model for the class attribute as a function
    of the values of the other attributes.
  • Goal: previously unseen records should be
    assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of
    the model. Usually, the given data set is divided
    into training and test sets, with training set
    used to build the model and test set used to
    validate it.
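
The train/test procedure can be sketched with a tiny nearest-neighbor classifier, one of the techniques named earlier in the deck. The records and the particular split below are invented for illustration only.

```python
import math


def nearest_neighbor_predict(train, point):
    """Return the class of the training record closest to `point`."""
    closest = min(train, key=lambda rec: math.dist(rec[0], point))
    return closest[1]


def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(nearest_neighbor_predict(train, x) == label
                  for x, label in test)
    return correct / len(test)


# Each record: (attributes, class). Values are made up.
records = [
    ((1.0, 1.0), "no"), ((1.2, 0.8), "no"), ((0.9, 1.1), "no"), ((1.1, 1.3), "no"),
    ((4.0, 4.2), "yes"), ((4.1, 3.9), "yes"), ((3.8, 4.0), "yes"), ((4.2, 4.1), "yes"),
]

# Hold out one record of each class as the test set; train on the rest.
train = records[0:3] + records[4:7]
test = [records[3], records[7]]

print(accuracy(train, test))
```

The held-out records never influence the model, so the accuracy measured on them estimates how well unseen records would be classified.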

38
Classification: Example
(Figure: a training set with two categorical
attributes, one continuous attribute, and a class
column is used to learn a classifier.)
39
Classification: Application 1
  • Direct Marketing
  • Goal: Reduce cost of mailing by targeting a set
    of consumers likely to buy a new cell-phone
    product.
  • Approach
  • Use the data for a similar product introduced
    before.
  • We know which customers decided to buy and which
    decided otherwise. This {buy, don't buy} decision
    forms the class attribute.
  • Collect various demographic, lifestyle, and
    company-interaction related information about all
    such customers.
  • Type of business, where they stay, how much they
    earn, etc.
  • Use this information as input attributes to learn
    a classifier model.

From Berry & Linoff, Data Mining Techniques, 1997
40
  • Ex. 1: Credit card purchase authorization
  • - Credit card companies must determine
    whether to authorize credit card purchases based
    on past transactions. Four classes have been
    identified:
  • authorize
  • ask for further identification before
    authorization
  • do not authorize
  • do not authorize and call police
  • Ex. 2: Credit card application approval
  • - Predict whether to accept or deny credit card
    applications
  • Historic data

41
Classification: Application 2
  • Fraud Detection
  • Goal: Predict fraudulent cases in credit card
    transactions.
  • Approach
  • Use credit card transactions and the information
    on its account-holder as attributes.
  • When a customer buys, what he buys, how often
    he pays on time, etc.
  • Label past transactions as fraud or fair
    transactions. This forms the class attribute.
  • Learn a model for the class of the transactions.
  • Use this model to detect fraud by observing
    credit card transactions on an account.

42
Classification: Application 3
  • Customer Attrition/Churn
  • Goal: To predict whether a customer is likely to
    be lost to a competitor.
  • Approach
  • Use detailed record of transactions with each of
    the past and present customers, to find
    attributes.
  • How often the customer calls, where he calls,
    what time-of-the day he calls most, his financial
    status, marital status, etc.
  • Label the customers as loyal or disloyal.
  • Find a model for loyalty.

From Berry & Linoff, Data Mining Techniques, 1997
43
Classification: Application 4
  • Sky Survey Cataloging
  • Goal: To predict class (star or galaxy) of sky
    objects, especially visually faint ones, based on
    the telescopic survey images (from Palomar
    Observatory).
  • 3000 images with 23,040 x 23,040 pixels per
    image.
  • Approach
  • Segment the image.
  • Measure image attributes (features) - 40 of them
    per object.
  • Model the class based on these features.
  • Success Story: Found 16 new high red-shift
    quasars, some of the farthest objects, which are
    difficult to find!

From Fayyad et al., Advances in Knowledge
Discovery and Data Mining, 1996
44
Classifying Galaxies
Courtesy: http://aps.umn.edu
  • Attributes
  • Image features,
  • Characteristics of light waves received, etc.
  • Class
  • Stages of Formation: Early, Intermediate, Late
  • Data Size
  • 72 million stars, 20 million galaxies
  • Object Catalog: 9 GB
  • Image Database: 150 GB

45
Descriptive Data Mining
  • Goal: Describe all of the data (or the process
    that generated the data)
  • Density estimation: what is the probability
    distribution?
  • Dependency modeling: what are the relationships
    between variables?
  • Clustering (segmentation): find groups of data
    objects that are
  • - similar to one another within the same group
    (cluster)
  • - dissimilar to the objects in other clusters
  • Called unsupervised learning in machine learning
46
Clustering: More Examples
  • Ex. 3: Redesign of uniforms for female soldiers
    in the US Army
  • Goal: reduce the number of uniform sizes to be
    kept in inventory while still providing good fit
  • Researchers from Cornell University used
    clustering and designed a new set of sizes
  • - Traditional clothing size system: an ordered
    set of graduated sizes where all dimensions
    increase together
  • - The new system: sizes that fit body types
  • e.g. one size for short-legged, small-waisted
    women with wide and long torsos, average arms,
    broad shoulders, and skinny necks

47
Clustering: Definition
  • Given a set of data points, each having a set of
    attributes, and a similarity measure among them,
    find clusters such that
  • Data points in one cluster are more similar to
    one another.
  • Data points in separate clusters are less similar
    to one another.
  • Similarity Measures
  • Euclidean Distance if attributes are continuous.
  • Other Problem-specific Measures.

48
Illustrating Clustering
  • Euclidean Distance Based Clustering in 3-D space.

Intracluster distances are minimized
Intercluster distances are maximized
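
One common way to approximate this objective is k-means, which appears earlier in the deck's list of techniques. The following is a minimal sketch on made-up 3-D points; the starting centroids and the fixed iteration count are assumptions chosen to keep the example short.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means with Euclidean distance; returns (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Two visibly separated groups of made-up 3-D points.
points = [(1.0, 1.0, 1.0), (1.0, 2.0, 1.0), (2.0, 1.0, 1.0),
          (8.0, 8.0, 8.0), (8.0, 9.0, 8.0), (9.0, 8.0, 8.0)]

final_centroids, clusters = kmeans(points, [points[0], points[3]])
print(clusters)
```

Each update step can only reduce the total distance from points to their own centroid, which is how the intracluster distances shrink.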
49
Clustering: Application 1
  • Market Segmentation
  • Goal: Subdivide a market into distinct subsets of
    customers where any subset may conceivably be
    selected as a market target to be reached with a
    distinct marketing mix.
  • Approach
  • Collect different attributes of customers based
    on their geographical and lifestyle related
    information.
  • Find clusters of similar customers.
  • Measure the clustering quality by observing
    buying patterns of customers in same cluster vs.
    those from different clusters.

50
Clustering: Application 2
  • Document Clustering
  • Goal: To find groups of documents that are
    similar to each other based on the important
    terms appearing in them.
  • Approach: Identify frequently occurring terms
    in each document. Form a similarity measure based
    on the frequencies of different terms. Use it to
    cluster.
  • Gain: Information retrieval can utilize the
    clusters to relate a new document or search term
    to clustered documents.
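
A similarity measure built from term frequencies might look like the sketch below. Cosine similarity is one common choice and is an assumption here, since the slide does not name a specific measure; the three short documents are invented.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up documents echoing the "Amazon" example from an earlier slide.
docs = {
    "d1": "amazon rainforest trees river",
    "d2": "rainforest river wildlife trees",
    "d3": "amazon online store books",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_d1_d2 = cosine_similarity(tf["d1"], tf["d2"])
sim_d1_d3 = cosine_similarity(tf["d1"], tf["d3"])
print(sim_d1_d2, sim_d1_d3)
```

Documents d1 and d2 share three terms while d1 and d3 share only one, so a clustering step driven by this measure would group d1 with d2.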

51
(No Transcript)
52
Associative DM
  • Goal: Find relationships among data
  • market-basket analysis: find combinations of
    items that typically occur together
  • sequential analysis: find sequential patterns
    in data
  • Market-basket analysis
  • Uses the information about what customers buy
    to give us insight into who they are and why
    they make certain purchases
  • Ex. 1: A grocery store retailer is trying to
    decide whether to put bread on sale.
  • He generates association rules and finds what
    other products are typically purchased with
    bread. A particular type of cheese is sold 60%
    of the time the bread is sold, and a jelly is
    sold 70% of the time.
  • Based on these findings, he decides
  • 1) to place some cheese and jelly at the end of
    the aisle where the bread is placed and
  • 2) not to put more than one of these 3 items on
    sale at the same time.

53
Market-Basket Analysis: More Examples
  • Where should strawberries be placed to maximize
    their sales?
  • Services purchased together by telecommunication
    customers (e.g. broadband Internet, call
    forwarding, etc.) help determine how to bundle
    these services together to maximize revenue.
  • Unusual combinations of insurance claims can be a
    sign of fraud
  • Medical histories can give indications of
    complications based on combinations of
    treatments
  • Sport: analyzing game statistics (shots blocked,
    assists, and fouls) to gain competitive
    advantage
  • - When player X is on the floor, player Y's
    shot accuracy decreases from 75% to 30%
  • - Bhandari et al. (1997). Advanced Scout: data
    mining and knowledge discovery in NBA data. Data
    Mining and Knowledge Discovery, 1(1), pp. 121-125

54
Association Rule Discovery: Definition
  • Given a set of records, each of which contains
    some number of items from a given collection
  • Produce dependency rules which will predict
    occurrence of an item based on occurrences of
    other items.

Rules Discovered: {Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
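
How strongly the data backs such a rule is usually measured with support and confidence. The sketch below computes both for the two rules; the five baskets are illustrative data invented for this example, since the slide's own transaction table did not survive in the transcript.

```python
# Illustrative baskets (hypothetical data, not recovered from the slide).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)


def support(itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


print(confidence({"Milk"}, {"Coke"}))            # 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))
```

A rule miner would enumerate candidate itemsets and keep only the rules whose support and confidence exceed chosen minimums.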
55
Association Rule Discovery: Application 1
  • Marketing and Sales Promotion
  • Let the rule discovered be
  • {Bagels, ...} --> {Potato Chips}
  • Potato Chips as consequent => Can be used to
    determine what should be done to boost its sales.
  • Bagels in the antecedent => Can be used to see
    which products would be affected if the store
    discontinues selling bagels.
  • Bagels in antecedent and Potato Chips in
    consequent => Can be used to see what products
    should be sold with Bagels to promote sale of
    Potato Chips!

56
Association Rule Discovery: Application 2
  • Supermarket shelf management.
  • Goal: To identify items that are bought together
    by sufficiently many customers.
  • Approach: Process the point-of-sale data
    collected with barcode scanners to find
    dependencies among items.
  • A classic rule:
  • If a customer buys diapers and milk, then he is
    very likely to buy beer.
  • So, don't be surprised if you find six-packs
    stacked next to diapers!

57
Association Rule Discovery: Application 3
  • Inventory Management
  • Goal: A consumer appliance repair company wants
    to anticipate the nature of repairs on its
    consumer products and keep the service vehicles
    equipped with the right parts to reduce the
    number of visits to consumer households.
  • Approach: Process the data on tools and parts
    required in previous repairs at different
    consumer locations and discover the co-occurrence
    patterns.

58
Sequential Pattern Discovery: Definition
  • Given a set of objects, with each object
    associated with its own timeline of events, find
    rules that predict strong sequential dependencies
    among different events.
  • Rules are formed by first discovering patterns.
    Event occurrences in the patterns are governed by
    timing constraints.

59
Sequential Analysis
  • Finds sequential patterns in data
  • - These patterns are similar to market-basket
    analysis, but the relationship is based on time
  • Ex. 1: Most people who purchase CD players
    purchase CDs within 3 days.
  • Ex. 2: The webmaster at company X periodically
    analyzes the web log data to determine how the
    users of X browse its pages. He finds that 70% of
    the users of page A follow one of the following
    patterns:
  • - A->B->C
  • - A->D->B->C
  • - A->E->B->C
  • He then decides to add a link from page A to C
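
The webmaster's measurement can be sketched as a count over browsing sessions. The sessions below are hypothetical and were constructed so that 7 of the 10 sessions visiting page A follow one of the three paths; a real analysis would first have to reconstruct sessions from the web log.

```python
PATTERNS = [("A", "B", "C"), ("A", "D", "B", "C"), ("A", "E", "B", "C")]


def follows(session, pattern):
    """True if `pattern` occurs as a contiguous run of page views in `session`."""
    n = len(pattern)
    return any(session[i:i + n] == pattern
               for i in range(len(session) - n + 1))


# Hypothetical sessions, each a tuple of page names in visit order.
sessions = [
    ("A", "B", "C"), ("A", "B", "C"), ("A", "D", "B", "C"),
    ("A", "E", "B", "C"), ("A", "B", "C", "F"), ("H", "A", "D", "B", "C"),
    ("A", "E", "B", "C"), ("A", "F"), ("A", "B"), ("A", "D", "F"),
    ("H", "F"),
]

visited_a = [s for s in sessions if "A" in s]
matching = [s for s in visited_a if any(follows(s, p) for p in PATTERNS)]
share = len(matching) / len(visited_a)
print(share)
```

Because every matching path ends at C after passing through B, a direct link from A to C shortens the route most of these users already take.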

60
Deviation/Anomaly Detection
  • Detect significant deviations from normal
    behavior
  • Applications
  • Credit Card Fraud Detection
  • Network Intrusion Detection

Typical network traffic at University
level may reach over 100 million connections per
day
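
A very simple form of deviation detection flags values that lie far from the historical mean. The sketch below uses a three-standard-deviation rule, which is an assumption for illustration rather than a method named in the presentation; the daily connection counts are made up.

```python
from statistics import mean, stdev


def flag_deviations(history, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` sample standard
    deviations away from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    return [v for v in new_values if abs(v - mu) > threshold * sigma]


# Made-up history of daily connection counts, in millions.
daily_connections = [98, 101, 100, 99, 102, 100, 97, 103]

flagged = flag_deviations(daily_connections, [101, 140, 99])
print(flagged)  # [140]
```

Real intrusion or fraud detectors model many attributes at once, but the idea of comparing new behavior against a learned norm is the same.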
61
Challenges of Data Mining
  • Scalability
  • Dimensionality
  • Complex and Heterogeneous Data
  • Data Quality
  • Data Ownership and Distribution
  • Privacy Preservation
  • Streaming Data

62
Data Mining: A KDD Process
  • Data mining: the core of knowledge discovery
    process.

(Figure: Databases -> Data Cleaning and Data
Integration -> Data Warehouse -> Selection ->
Task-relevant Data -> Data Mining -> Pattern
Evaluation -> Knowledge)
63
Steps of a KDD Process
  • Learning the application domain
  • relevant prior knowledge and goals of application
  • Creating a target data set: data selection
  • Data cleaning and preprocessing (may take 60% of
    effort!)
  • Data reduction and transformation
  • Find useful features, dimensionality/variable
    reduction, invariant representation.
  • Choosing functions of data mining
  • summarization, classification, regression,
    association, clustering.
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
  • Pattern evaluation and knowledge presentation
  • visualization, transformation, removing redundant
    patterns, etc.
  • Use of discovered knowledge

64
Data Mining: Confluence of Multiple Disciplines
(Figure: Data Mining at the center, drawing on
Database Technology, Statistics, Machine Learning,
Visualization, Information Science, and Other
Disciplines)
65
Components of DM Algorithms
  • DM algorithms have 3 main components
  • Model (structure)
  • - DM algorithms attempt to fit a model to data:
    a tree in Decision Trees (DTs)
  • - layers of non-linear transformations of
    weighted sums of the inputs in backpropagation
    Neural Networks (NNs)
  • Preference (score function): preference criteria
    used to prefer one model over another
  • - Number of misclassifications in DTs
  • - Mean squared error in NNs
  • Search method: how the data is searched by the
    algorithm
  • - Greedy search over structure in DTs
  • - Gradient descent over parameters in NNs
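
The three components can be seen together in a one-level decision tree (a "stump"). In this sketch the model is a single threshold rule, the score function is the number of misclassifications, and the search tries every candidate threshold and keeps the best one, which is the step a greedy tree builder repeats at each node. The data and function names are illustrative assumptions.

```python
def misclassifications(threshold, data):
    """Score function: errors made by the rule 'predict 1 if x > threshold'."""
    return sum((x > threshold) != bool(label) for x, label in data)


def fit_stump(data):
    """Search: try each midpoint between adjacent sorted values and keep the
    threshold with the fewest misclassifications."""
    xs = sorted({x for x, _ in data})
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: misclassifications(t, data))


# Made-up records: (numeric attribute, class label 0 or 1).
data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]

best = fit_stump(data)
print(best, misclassifications(best, data))
```

A neural network keeps the same three-part structure but swaps in a different model, mean squared error as the score, and gradient descent as the search.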