CS4333 Data Mining, Lecture 1: Introduction

Provided by: jiaw230

1
CS4333 Data Mining, Lecture 1: Introduction (1/3)
  • Course materials, including slides and
    assignments, draw heavily on
  • Han, KDnuggets/weka, MSU

2
Three related courses
  • CS4323 Information Retrieval
  • CS4333 Data Mining
  • CS4353 Data Warehousing
  • How they relate, seen from the data mining side
  • DM-IR: text mining
  • DM-DW: mining multidimensional data

3
CS4322, semester 2, 2004/05
  • Web site
  • Mailing list, list of student email addresses
  • Main references, additional references
  • Topics
  • Introduction (3x)
  • Data and data preprocessing (4x)
  • Classification and Prediction (7x)
  • Frequent Pattern and Association (5x)
  • Cluster Analysis (7x)
  • Lab: tools and applications built with the tools
  • Schedule
  • Assignments
  • Assignments in preparation for the final project (TA)
  • Consultation hours
  • Student contact person / class representative
  • Copying softcopies of supporting course materials

4
Introduction
  • Motivation: Why data mining?
  • What is data mining?
  • Data mining: on what kind of data?

5
Nowadays
Data, Data Everywhere
  • Lots of data
  • Lack of theory
  • Data mining can help generate new hypotheses
    or help analysts make sense of the data

6
Necessity Is the Mother of Invention
  • Data explosion problem
  • Automated data collection tools and mature
    database technology lead to tremendous amounts of
    data accumulated and/or to be analyzed in
    databases, data warehouses, and other information
    repositories
  • We are drowning in data, but starving for
    knowledge!
  • Solution: data warehousing and data mining
  • Data warehousing and on-line analytical
    processing
  • Mining interesting knowledge (rules,
    regularities, patterns, constraints) from data in
    large databases

7
Mining Large Data Sets - Motivation
  • There is often information hidden in the data
    that is not readily evident
  • Human analysts may take weeks to discover useful
    information
  • Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 vs. number of analysts]
8
Knowledge Discovery Definition
  • Knowledge Discovery in Data is the non-trivial
    process of identifying valid, novel, potentially
    useful, and ultimately understandable patterns
    in data.
  • From Advances in Knowledge Discovery and Data
    Mining, Fayyad, Piatetsky-Shapiro, Smyth, and
    Uthurusamy, (Chapter 1), AAAI/MIT Press 1996

9
Evolution of Database Technology
  • 1960s
  • Data collection, database creation, IMS and
    network DBMS
  • 1970s
  • Relational data model, relational DBMS
    implementation
  • 1980s
  • RDBMS, advanced data models (extended-relational,
    OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific,
    engineering, etc.)
  • 1990s
  • Data mining, data warehousing, multimedia
    databases, and Web databases
  • 2000s
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global
    information systems

10
What Is Data Mining?
  • Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful)
    patterns or knowledge from huge amounts of data
  • Data mining: a misnomer?
  • Alternative names
  • Knowledge discovery (mining) in databases (KDD),
    knowledge extraction, data/pattern analysis, data
    archeology, data dredging, information
    harvesting, business intelligence, etc.
  • Watch out: is everything "data mining"?
  • (Deductive) query processing.
  • Expert systems or small ML/statistical programs

11
Historical Note: Many Names of Data Mining
  • Data Fishing, Data Dredging (1960-)
  • used by statisticians (as a bad name)
  • Data Mining (1990-)
  • used by the database and business communities
  • in 2003, got a bad image because of TIA
  • Knowledge Discovery in Databases (1989-)
  • used by the AI and machine learning communities
  • also Data Archaeology, Information Harvesting,
    Information Discovery, Knowledge Extraction, ...

Currently, Data Mining and Knowledge Discovery
are used interchangeably
12
Patterns: contact lenses data
(witten & eibe)
13
Patterns: contact lenses rule set
(witten & eibe)
14
Patterns: contact lenses decision tree
(witten & eibe)
15
Classifying iris flowers
(witten & eibe)
16
Why Data Mining? Potential Applications
  • Science data
  • Sky Survey Cataloging
  • Finding Volcanoes on Venus
  • Biosequence Databases
  • Earth Geophysics-Earthquake Photography from
    Space
  • Atmospheric Science
  • Personalization / recommender (e.g. in Amazon)
  • DBMS buffer replacement
  • Political analysis
  • Education
  • Traffic
  • Help desk
  • Business marketing, sales data
  • Intrusion detection
  • Social network
  • DM applications in WWW
  • Email classification, filtering (e.g. spam
    filtering)
  • Telecom applications: network alarm data
  • Detection of cancer, diabetes, etc.

17
Market Analysis and Management
  • Where does the data come from?
  • Credit card transactions, loyalty cards, discount
    coupons, customer complaint calls, plus (public)
    lifestyle studies
  • Target marketing
  • Find clusters of model customers who share the
    same characteristics: interest, income level,
    spending habits, etc.
  • Determine customer purchasing patterns over time
  • Cross-market analysis
  • Associations/co-relations between product sales,
    prediction based on such association
  • Customer profiling
  • What types of customers buy what products
    (clustering or classification)
  • Customer requirement analysis
  • identifying the best products for different
    customers
  • predict what factors will attract new customers
  • Provision of summary information
  • multidimensional summary reports
  • statistical summary information (data central
    tendency and variation)
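
The "associations between product sales" item under cross-market analysis can be made concrete with the two usual measures, support and confidence. The following is only a minimal sketch: the baskets and item names are hypothetical toy data, and the helper names (`count`, `support`, `confidence`) are made up for illustration.

```python
# Hypothetical toy baskets; the item names are illustrative only.
baskets = [
    {"socks", "tights", "lotion"},
    {"socks", "tights"},
    {"socks", "lotion"},
    {"tights"},
    {"socks", "tights", "lotion"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)

print(support({"socks", "tights"}, baskets))       # 3 of 5 baskets
print(confidence({"socks"}, {"tights"}, baskets))  # 3 of the 4 sock baskets
```

A rule such as "socks => tights" is usually kept only if both numbers clear thresholds chosen by the analyst.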

18
Corporate Analysis and Risk Management
  • Finance planning and asset evaluation
  • cash flow analysis and prediction
  • contingent claim analysis to evaluate assets
  • cross-sectional and time series analysis
    (financial-ratio, trend analysis, etc.)
  • Resource planning
  • summarize and compare the resources and spending
  • Competition
  • monitor competitors and market directions
  • group customers into classes and set up a
    class-based pricing procedure
  • set pricing strategy in a highly competitive
    market

19
Fraud Detection and Mining Unusual Patterns
  • Approaches: clustering and model construction for
    frauds, outlier analysis
  • Applications: health care, retail, credit card
    service, telecomm.
  • Auto insurance: ring of collisions
  • Money laundering: suspicious monetary
    transactions
  • Medical insurance
  • Professional patients, ring of doctors, and ring
    of references
  • Unnecessary or correlated screening tests
  • Telecommunications: phone-call fraud
  • Phone call model: destination of the call,
    duration, time of day or week. Analyze patterns
    that deviate from an expected norm
  • Retail industry
  • Analysts estimate that 38% of retail shrink is
    due to dishonest employees
  • Anti-terrorism
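
One simple way to read "analyze patterns that deviate from an expected norm" is a z-score check against a customer's own call history. This is a sketch under stated assumptions, not a production fraud model: the durations are made-up toy values, the threshold of 3 standard deviations is an arbitrary choice, and `fit_norm` and `deviates` are hypothetical helper names.

```python
from statistics import mean, stdev

def fit_norm(history):
    """Summarize past behaviour as (mean, sample standard deviation)."""
    return mean(history), stdev(history)

def deviates(value, norm, threshold=3.0):
    """True if value lies more than `threshold` std devs from the mean."""
    mu, sigma = norm
    return sigma > 0 and abs(value - mu) / sigma > threshold

# Hypothetical call durations in minutes for one customer.
history = [3, 4, 5, 4, 3, 5, 4, 4, 3, 5]
norm = fit_norm(history)

new_calls = [4, 6, 95]
flags = [deviates(c, norm) for c in new_calls]
print(flags)  # only the 95-minute call is flagged
```

A real phone-call model would also use the other attributes named on the slide (destination, time of day or week), not duration alone.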

20
Successful e-commerce Case Study
  • A person buys a book (product) at Amazon.com.
  • Task: recommend other books (products) this
    person is likely to buy
  • Amazon does clustering based on books bought
  • Customers who bought "Advances in Knowledge
    Discovery and Data Mining" also bought "Data
    Mining: Practical Machine Learning Tools and
    Techniques with Java Implementations"
  • Recommendation program is quite successful
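
The "customers who bought X also bought Y" idea can be sketched with plain co-purchase counting. This is not Amazon's actual method (the slide only says clustering based on books bought); it is a simpler stand-in technique, and the order data, book identifiers, and function names below are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_co_purchase(orders):
    """Count, for each item, how often every other item shares an order."""
    co = defaultdict(Counter)
    for items in orders:
        for a, b in permutations(set(items), 2):
            co[a][b] += 1
    return co

def recommend(co, item, k=3):
    """Top-k co-purchased items; ties broken alphabetically."""
    ranked = sorted(co.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:k]]

# Hypothetical orders; identifiers are placeholders, not real catalog IDs.
orders = [
    ["kdd_advances", "dm_practical"],
    ["kdd_advances", "dm_practical", "other_book"],
    ["kdd_advances", "dm_practical"],
    ["other_book"],
]
co = build_co_purchase(orders)
print(recommend(co, "kdd_advances", k=1))
```

Raw counts favor bestsellers, so practical systems usually normalize them, for example by each item's overall popularity.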

21
Unsuccessful e-commerce case study (KDD-Cup 2000)
  • Data: clickstream and purchase data from
    Gazelle.com, legwear and legcare e-tailer
  • Q: Characterize visitors who spend more than $12
    on an average order at the site
  • Dataset of 3,465 purchases, 1,831 customers
  • Very interesting analysis by Cup participants
  • thousands of hours, $X,000,000 (millions) of
    consulting
  • Total sales: $Y,000
  • Obituary: Gazelle.com out of business, Aug 2000

22
Genomic Microarrays Case Study
  • Given microarray data for a number of samples
    (patients), can we
  • Accurately diagnose the disease?
  • Predict outcome for given treatment?
  • Recommend best treatment?

23
Example: ALL/AML data
  • 38 training cases, 34 test, 7,000 genes
  • 2 classes: Acute Lymphoblastic Leukemia (ALL) vs
    Acute Myeloid Leukemia (AML)
  • Use train data to build diagnostic model

[Figure: ALL vs. AML samples]
Results on test data: 33/34 correct; the 1 error may
be a mislabeled sample
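
The train/test workflow on this slide can be illustrated with a nearest-centroid classifier. This is a toy sketch only: the slide does not say which model produced the 33/34 result (about 97% accuracy), the two-feature "expression" values below are invented, and the real data has roughly 7,000 genes per sample.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(train):
    """Build one centroid per class label from (features, label) pairs."""
    by_class = {}
    for features, label in train:
        by_class.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_class.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the sample."""
    return min(model, key=lambda label: sq_dist(model[label], features))

def accuracy(model, test):
    """Fraction of test samples classified correctly."""
    return sum(predict(model, f) == y for f, y in test) / len(test)

# Invented two-gene expression values, for illustration only.
train = [([5.0, 1.0], "ALL"), ([4.5, 1.2], "ALL"),
         ([1.0, 4.8], "AML"), ([1.3, 5.1], "AML")]
test = [([4.8, 0.9], "ALL"), ([0.9, 5.0], "AML")]

model = fit(train)
print(accuracy(model, test))
```

The model is fitted on the training cases only and scored on held-out test cases, which is what makes a figure like 33/34 meaningful.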
24
Security and Fraud Detection - Case Study
  • Credit Card Fraud Detection
  • Detection of Money laundering
  • FAIS (US Treasury)
  • Securities Fraud
  • NASDAQ KDD system
  • Phone fraud
  • AT&T, Bell Atlantic, British Telecom/MCI
  • Bio-terrorism detection at Salt Lake Olympics 2002

25
Customer Attrition Case Study
  • Situation: the attrition rate for mobile phone
    customers is around 25-30% a year!
  • Task:
  • Given customer information for the past N months,
    predict who is likely to attrite next month.
  • Also, estimate customer value and what is the
    cost-effective offer to be made to this customer.

26
Customer Attrition Results
  • Verizon Wireless built a customer data warehouse
  • Identified potential attriters
  • Developed multiple, regional models
  • Targeted customers with high propensity to accept
    the offer
  • Reduced attrition rate from over 2%/month to
    under 1.5%/month (huge impact, with >30 M
    subscribers)
  • (Reported in 2003)
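
To see why the slide calls this a huge impact, the reported figures can be turned into a rough count of customers retained per month. The sketch treats the slide's bounds (2%, 1.5%, 30 M) as exact values, which is an assumption; the true numbers are "over", "under", and "more than" these. Rates are kept in basis points so the arithmetic stays in integers.

```python
SUBSCRIBERS = 30_000_000   # slide says ">30 M"; lower bound assumed
RATE_BEFORE_BP = 200       # 2%/month, in basis points (1 bp = 0.01%)
RATE_AFTER_BP = 150        # 1.5%/month

def monthly_attriters(subscribers, rate_bp):
    """Customers lost per month at the given attrition rate."""
    return subscribers * rate_bp // 10_000

before = monthly_attriters(SUBSCRIBERS, RATE_BEFORE_BP)
after = monthly_attriters(SUBSCRIBERS, RATE_AFTER_BP)
print(before, after, before - after)  # 600000 450000 150000
```

Under these assumptions, half a percentage point corresponds to about 150,000 customers kept each month.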

27
Oil Spills
  • Figure: an example of a radar image of the sea
    surface

28
Data Mining Steps
Knowledge Discovery in Databases (KDD)
29
Steps of a KDD Process
  • Learning the application domain
  • relevant prior knowledge and goals of application
  • Creating a target data set: data selection
  • Data cleaning and preprocessing (may take 60% of
    effort!)
  • Data reduction and transformation
  • Find useful features, dimensionality/variable
    reduction, invariant representation.
  • Choosing functions of data mining
  • summarization, classification, regression,
    association, clustering.
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
  • Pattern evaluation and knowledge presentation
  • visualization, transformation, removing redundant
    patterns, etc.
  • Use of discovered knowledge
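
The steps above can be sketched as a chain of small functions, one per stage. This is a deliberately tiny illustration, not a real KDD system: the records, field names, the 20.0 cutoff, and the 0.5 minimum share are all hypothetical, and the "mining" step is just a per-region summary.

```python
from collections import Counter

# Hypothetical raw records, including one missing and one malformed amount.
raw = [
    {"customer": "c1", "region": "east", "amount": "12.5"},
    {"customer": "c2", "region": "east", "amount": ""},
    {"customer": "c3", "region": "west", "amount": "40.0"},
    {"customer": "c4", "region": "west", "amount": "35.0"},
    {"customer": "c5", "region": "north", "amount": "abc"},
]

def select(records, fields):
    """Data selection: keep only the fields relevant to the goal."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records whose amount is not a number."""
    out = []
    for r in records:
        try:
            out.append({**r, "amount": float(r["amount"])})
        except ValueError:
            continue
    return out

def transform(records, cutoff=20.0):
    """Transformation: reduce the amount to a 'big order' flag."""
    return [{"region": r["region"], "big": r["amount"] >= cutoff} for r in records]

def mine(records):
    """Mining (summarization): share of big orders per region."""
    totals, bigs = Counter(), Counter()
    for r in records:
        totals[r["region"]] += 1
        bigs[r["region"]] += r["big"]
    return {region: bigs[region] / totals[region] for region in totals}

def evaluate(patterns, min_share=0.5):
    """Pattern evaluation: keep only regions above the minimum share."""
    return {k: v for k, v in patterns.items() if v >= min_share}

patterns = evaluate(mine(transform(clean(select(raw, ["region", "amount"])))))
print(patterns)  # {'west': 1.0}
```

Even in this toy run, two of five records are lost during cleaning, which hints at why preprocessing takes such a large share of the effort.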

30
Knowledge Discovery Process Flow, according to
CRISP-DM
See www.crisp-dm.org for more information
31
Data Mining On What Kinds of Data?
  • Relational database
  • Data warehouse
  • Transactional database
  • Advanced database and advanced applications
  • Object-relational databases
  • Temporal databases and time-series databases
  • Spatial databases and spatiotemporal databases
  • Text databases and multimedia databases
  • Heterogeneous databases and legacy databases
  • Data streams
  • The World-Wide Web

32
Data Mining: Confluence of Multiple Disciplines
[Diagram: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
33
Statistics, Machine Learning and Data Mining
  • Statistics
  • more theory-based
  • more focused on testing hypotheses
  • Machine learning
  • more heuristic
  • focused on improving performance of a learning
    agent
  • also looks at real-time learning and robotics,
    areas that are not part of data mining
  • Data Mining and Knowledge Discovery
  • integrates theory and heuristics
  • focus on the entire process of knowledge
    discovery, including data cleaning, learning, and
    integration and visualization of results
  • Distinctions are fuzzy

(witten & eibe)
34
Origins of Data Mining
  • Draws ideas from machine learning/AI, pattern
    recognition, statistics, and database systems
  • Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data

[Diagram: Data Mining at the intersection of Statistics, Machine Learning/AI/Pattern Recognition, and Database systems]
35
Weka
36
Assignment 0
  • 1. Data mining in the news (about 1/2 page,
    excluding figures)
  • 2. Summarize an article on an application of
    data mining (about 1/2 page, excluding figures).
    A list of articles is available.
  • Due Wednesday, 2 March, in class (softcopy)
  • After I add to and edit them, the plan is to
    compile the submissions into one small booklet
    (still a draft, to be refined after the DM
    application assignment). This should make the
    DM application assignment easier to do.

37
Summary
  • Data mining: discovering interesting patterns
    from large amounts of data
  • A natural evolution of database technology, in
    great demand, with wide applications
  • A KDD process includes data cleaning, data
    integration, data selection, transformation, data
    mining, pattern evaluation, and knowledge
    presentation
  • Mining can be performed in a variety of
    information repositories

38
Conferences and Journals on Data Mining
  • Other related conferences
  • ACM SIGMOD
  • VLDB
  • (IEEE) ICDE
  • WWW, SIGIR
  • ICML, CVPR, NIPS
  • Journals
  • Data Mining and Knowledge Discovery (DAMI or
    DMKD)
  • IEEE Trans. On Knowledge and Data Eng. (TKDE)
  • KDD Explorations
  • KDD Conferences
  • ACM SIGKDD Int. Conf. on Knowledge Discovery in
    Databases and Data Mining (KDD)
  • SIAM Data Mining Conf. (SDM)
  • (IEEE) Int. Conf. on Data Mining (ICDM)
  • Conf. on Principles and practices of Knowledge
    Discovery and Data Mining (PKDD)
  • Pacific-Asia Conf. on Knowledge Discovery and
    Data Mining (PAKDD)

39
Where to Find References? DBLP, CiteSeer, Google
  • Data mining and KDD (SIGKDD CD-ROM)
  • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM,
    PKDD, PAKDD, etc.
  • Journals: Data Mining and Knowledge Discovery, KDD
    Explorations
  • Database systems (SIGMOD: ACM SIGMOD Anthology
    CD-ROM)
  • Conferences: ACM-SIGMOD, ACM-PODS, VLDB,
    IEEE-ICDE, EDBT, ICDT, DASFAA
  • Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM,
    VLDB J., Info. Sys., etc.
  • AI & Machine Learning
  • Conferences: Machine learning (ML), AAAI, IJCAI,
    COLT (Learning Theory), CVPR, NIPS, etc.
  • Journals: Machine Learning, Artificial
    Intelligence, Knowledge and Information Systems,
    IEEE-PAMI, etc.
  • Web and IR
  • Conferences: SIGIR, WWW, CIKM, etc.
  • Journals: WWW: Internet and Web Information
    Systems
  • Statistics
  • Conferences: Joint Stat. Meeting, etc.
  • Journals: Annals of Statistics, etc.
  • Visualization
  • Conference proceedings: CHI, ACM-SIGGraph, etc.
  • Journals: IEEE Trans. visualization and computer
    graphics, etc.