Data Mining - PowerPoint PPT Presentation

Provided by: peg110
Learn more at: http://cs.furman.edu

1
Data Mining
  • Adrian Tuhtan
  • 004757481
  • CS157A
  • Section1

2
Overview
  • Introduction
  • Explanation of Data Mining Techniques
  • Advantages
  • Applications
  • Privacy

3
Data Mining
  • What is Data Mining?
  • The process of semi-automatically analyzing
    large databases to find useful patterns
    (Silberschatz)
  • KDD: Knowledge Discovery in Databases (3)
  • Attempts to discover rules and patterns from
    data
  • Discover Rules → Make Predictions
  • Areas of Use
  • Internet: Discover needs of customers
  • Economics: Predict stock prices
  • Science: Predict environmental change
  • Medicine: Match patients with similar problems →
    cure

4
Example of Data Mining
  • A credit card company wants to discover information
    about its clients from its databases. It wants to find:
  • Clients who respond to promotions in junk mail
  • Clients who are likely to switch to a competitor
  • Clients who are likely not to pay
  • Services that clients use, so that it can promote
    services affiliated with the credit card company
  • Anything else that may help the company provide or
    promote services to help its clients and
    ultimately make more money

5
Data Mining and Data Warehousing
  • A data warehouse is a repository (or archive) of
    information gathered from multiple sources,
    stored under a unified schema, at a single site.
    (Silberschatz)
  • Collect data → Store in single repository
  • Allows for easier query development as a single
    repository can be queried.
  • Data Mining
  • Analyzing databases or Data Warehouses to
    discover patterns about the data to gain
    knowledge.
  • Knowledge is power.

6
Discovery of Knowledge

7
Data Mining Techniques
  • Classification
  • Clustering
  • Regression
  • Association Rules

8
Classification
  • Classification: Given a set of items that each
    belong to one of several classes, and given past
    instances (training instances) with their
    associated classes, classification is the process
    of predicting the class of a new item.
  • That is, the goal is to identify the class to
    which the new item belongs.
  • Example: A bank wants to classify its home loan
    customers into groups according to their response
    to bank advertisements. The bank might use the
    classes Responds Rarely, Responds Sometimes, and
    Responds Frequently.
  • The bank will then attempt to find rules about
    the customers that respond Frequently and
    Sometimes.
  • The rules could be used to predict needs of
    potential customers.

9
Technique for Classification
  • Decision-Tree Classifiers

[Figure: decision tree that splits first on Job (Doctor, Engineer, Carpenter) and then on income thresholds (<30K, <40K, <50K, >50K, >90K, >100K), predicting the credit risk of a person with the jobs specified.]

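A decision tree of this kind can be read as nested tests: first on the job, then on income. The sketch below is only illustrative. The job names come from the slide, but which income threshold belongs to which job, and the risk labels low/medium/high, are assumptions, as is the function name credit_risk.

```python
# Hypothetical decision tree for credit risk.
# Job names are from the slide; the threshold assigned to each job and the
# risk labels are assumptions made for illustration only.
def credit_risk(job: str, income: int) -> str:
    """Classify credit risk by testing the job first, then annual income."""
    if job == "Doctor":
        return "low" if income > 100_000 else "medium"
    if job == "Engineer":
        if income > 90_000:
            return "low"
        return "medium" if income > 50_000 else "high"
    if job == "Carpenter":
        return "medium" if income > 40_000 else "high"
    return "unknown"  # job not covered by this tree


print(credit_risk("Engineer", 95_000))
print(credit_risk("Carpenter", 25_000))
```

Each path from the root to a leaf corresponds to one rule, for example "Engineer and income above 90K implies low risk" under the assumed thresholds.
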
10
Clustering
  • Clustering algorithms find groups of items that
    are similar. They divide a data set so that
    records with similar content are in the same
    group, and groups are as different as possible
    from each other. (2)
  • Example: An insurance company could use clustering
    to group clients by their age, location, and types
    of insurance purchased.
  • The categories are not specified in advance; this
    is referred to as unsupervised learning.
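As a rough illustration of grouping without predefined categories, the sketch below clusters clients by age alone using a simple agglomerative (bottom-up) procedure. The ages, the function name cluster_ages, and the choice of three clusters are assumptions for the example; a real application would also use location and insurance type.

```python
def cluster_ages(ages, k):
    """Agglomerative clustering of 1-D values.

    Start with one cluster per value, then repeatedly merge the two
    neighbouring clusters separated by the smallest gap until k remain.
    Assumes 1 <= k <= len(ages).
    """
    clusters = [[a] for a in sorted(ages)]
    while len(clusters) > k:
        # Gap between the last member of one cluster and the first of the next.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


# Hypothetical client ages.
ages = [22, 25, 27, 41, 44, 68, 70, 73]
print(cluster_ages(ages, 3))
```

No class labels are supplied anywhere; the groups emerge only from how close the values are to each other.
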

11
Clustering
  • Group Data into Clusters
  • Similar data is grouped in the same cluster
  • Dissimilar data is grouped in different clusters
  • How is this achieved?
  • K-Nearest Neighbor
  • A classification method that classifies a point
    by calculating the distances between the point
    and points in the training data set. Then it
    assigns the point to the class that is most
    common among its k-nearest neighbors (where k is
    an integer).(2)
  • Hierarchical
  • Group data into t-trees
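The k-nearest-neighbor description above can be sketched in a few lines. The training points, the labels "A" and "B", and the function name knn_classify are assumptions chosen for illustration; distance is plain Euclidean distance.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Return the most common class among the k training points nearest to point."""
    nearest = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled training data: (coordinates, class).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_classify((2.0, 2.0), training))
print(knn_classify((6.0, 5.0), training))
```

An odd k is a common choice for two classes because it avoids tied votes.
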

12
Regression
  • Regression deals with the prediction of a value,
    rather than a class. (1, p747)
  • Example: Find out whether there is a relationship
    between smoking and cancer-related illness in
    patients.
  • Given values X1, X2, ..., Xn
  • Objective: predict variable Y
  • One way is to estimate coefficients a0, a1, ..., an
    such that
  • Y = a0 + a1 X1 + a2 X2 + ... + an Xn
  • Linear Regression
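For the simplest case of a single predictor (Y = a0 + a1 X1), the coefficients can be estimated by ordinary least squares. The sketch below assumes that method; the data points and the function name fit_line are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for the model y = a0 + a1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a0, a1 = fit_line(xs, ys)
print(a0, a1)

# Predict Y for a new value of X1.
print(a0 + a1 * 6)
```

With more than one predictor the same idea applies, but the coefficients are usually found with a linear algebra library rather than by hand.
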

13
Regression
  • Example graph
  • Line of Best Fit
  • Curve Fitting

14
Association Rules
  • An association algorithm creates rules that
    describe how often events have occurred
    together. (2)
  • Example: When a customer buys a hammer, they also
    buy nails 90% of the time.

15
Association Rules
  • Support is a measure of what fraction of the
    population satisfies both the antecedent and the
    consequent of the rule. (1, p748)
  • Example:
  • People who buy hotdog buns also buy hotdog
    sausages in 99% of cases. High support
  • People who buy hotdog buns buy hangers in 0.005%
    of cases. Low support
  • Situations where there is high support for the
    antecedent are worth careful attention
  • E.g. Hotdog sausages should be placed near
    hotdog buns in supermarkets if there is also high
    confidence.

16
Association Rules
  • Confidence is a measure of how often the
    consequent is true when the antecedent is true.
    (1, p748)
  • Example:
  • 90% of hotdog bun purchases are accompanied by
    hotdog sausages.
  • High confidence is meaningful as we can derive
    rules.
  • Hotdog bun → Hotdog sausage
  • Two rules may have different confidence levels yet
    have the same support.
  • E.g. Hotdog sausage → Hotdog bun may have a much
    lower confidence than Hotdog bun → Hotdog sausage,
    yet they both can have the same support.
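The definitions of support and confidence can be checked on a small set of transactions. The ten baskets below are invented for illustration, and the function names are assumptions; the point is that the two rules share the same support but differ in confidence.

```python
def count(transactions, items):
    """Number of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of antecedent transactions that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


# Hypothetical shopping baskets.
transactions = [
    {"buns", "sausages"}, {"buns", "sausages", "mustard"}, {"buns", "sausages"},
    {"sausages"}, {"sausages", "mustard"}, {"sausages"}, {"sausages", "milk"},
    {"sausages"}, {"milk"}, {"milk", "mustard"},
]

# Both rules cover the same baskets, so their support is identical.
print(support(transactions, {"buns"}, {"sausages"}))
print(support(transactions, {"sausages"}, {"buns"}))

# Confidence depends on the direction of the rule.
print(confidence(transactions, {"buns"}, {"sausages"}))
print(confidence(transactions, {"sausages"}, {"buns"}))
```

In this made-up data every bun purchase includes sausages, but most sausage purchases do not include buns, which is why the second rule has much lower confidence.
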

17
Advantages of Data Mining
  • Provides new knowledge from existing data
  • Public databases
  • Government sources
  • Company Databases
  • Old data can be used to develop new knowledge
  • New knowledge can be used to improve services or
    products
  • Improvements lead to
  • Bigger profits
  • More efficient service

18
Uses of Data Mining
  • Sales/Marketing
  • Diversify target market
  • Identify clients' needs to increase response rates
  • Risk Assessment
  • Identify customers that pose high credit risk
  • Fraud Detection
  • Identify people misusing the system. E.g. People
    who have two Social Security Numbers
  • Customer Care
  • Identify customers likely to change providers
  • Identify customer needs

19
Applications of Data Mining
  • (4)

Source: IDC 1998

20
Privacy Concerns
  • Effective Data Mining requires large sources of
    data
  • To achieve a wide spectrum of data, link multiple
    data sources
  • Linking sources can be problematic for privacy.
    For example, suppose the following histories of
    a customer were linked:
  • Shopping History
  • Credit History
  • Bank History
  • Employment History
  • The user's life story can be pieced together from
    the collected data

21
References
  • Silberschatz, Korth, Sudarshan, Database System
    Concepts, 5th Edition, McGraw-Hill, 2005
  • Two Crows, Data Mining Glossary,
    http://www.twocrows.com/glossary.htm
  • Wikipedia, Data mining,
    http://en.wikipedia.org/wiki/Data_mining
  • http://phoenix.phys.clemson.edu/tutorials/excel/regression.html
  • http://wwwmaths.anu.edu.au/steve/pdcn.pdf