Title: CS4333 Data Mining Lecture 1 Introduction
1CS4333 Data Mining Lecture 1 Introduction (1/3)
- Course materials, including slides and assignments, draw heavily on Han, KDnuggets/Weka, and MSU
2Three related courses
- Three related courses
- CS4323 Information Retrieval
- CS4333 Data Mining
- CS4353 Data Warehousing
- How they connect, seen from the data mining side
- DM-IR: text mining
- DM-DW: mining multidimensional data
3CS4322 semester 2, 2004/05
- Web site,
- Mailing list, student email list
- Main references, supplementary references
- Topics
- Introduction (3x)
- Data and data preprocessing (4x)
- Classification and Prediction (7x)
- Frequent Pattern and Association (5x)
- Cluster Analysis (7x)
- Lab: tools and applications using the tools
- Schedule
- Assignments
- Assignments to prepare for the final project (TA)
- Consultation hours
- Student contact person / class representative
- Copy the softcopy of the supporting course materials
4Introduction
- Motivation: Why data mining?
- What is data mining?
- Data mining: On what kind of data?
5Nowadays
Data, Data Everywhere
- Lots of data
- Lack of theory
- Data mining can help generate new hypotheses or help analysts make sense of the data
6Necessity Is the Mother of Invention
- Data explosion problem
- Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories
- We are drowning in data, but starving for knowledge!
- Solution: data warehousing and data mining
- Data warehousing and on-line analytical processing
- Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
7Mining Large Data Sets - Motivation
- There is often information hidden in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts]
8Knowledge Discovery Definition
- Knowledge Discovery in Data is the
- non-trivial process of identifying
- valid
- novel
- potentially useful
- and ultimately understandable patterns in data.
- from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
9Evolution of Database Technology
- 1960s
- Data collection, database creation, IMS and network DBMS
- 1970s
- Relational data model, relational DBMS implementation
- 1980s
- RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
- Application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s
- Data mining, data warehousing, multimedia databases, and Web databases
- 2000s
- Stream data management and mining
- Data mining and its applications
- Web technology (XML, data integration) and global information systems
10What Is Data Mining?
- Data mining (knowledge discovery from data)
- Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Data mining: a misnomer?
- Alternative names
- Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: Is everything data mining?
- (Deductive) query processing
- Expert systems or small ML/statistical programs
11Historical Note: Many Names of Data Mining
- Data Fishing, Data Dredging (1960-)
- used by statisticians (as a bad name)
- Data Mining (1990-)
- used by the database and business communities
- in 2003 got a bad image because of TIA (Total Information Awareness)
- Knowledge Discovery in Databases (1989-)
- used by the AI and machine learning community
- also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Currently, Data Mining and Knowledge Discovery are used interchangeably
12Patterns: contact lenses data
(source: Witten & Eibe)
13Patterns: contact lenses rule set
(source: Witten & Eibe)
14Patterns: contact lenses decision tree
(source: Witten & Eibe)
15Classifying iris flowers
(source: Witten & Eibe)
16Why Data Mining? Potential Applications
- Science data
- Sky Survey Cataloging
- Finding Volcanoes on Venus
- Biosequence Databases
- Earth Geophysics - Earthquake Photography from Space
- Atmospheric Science
- Personalization / recommender (e.g. in Amazon)
- DBMS buffer replacement
- Political analysis
- Education
- Traffic
- Help desk
- Business: marketing, sales data
- Intrusion detection
- Social network
- DM applications in WWW
- Email classification, filtering (e.g. spam filtering)
- Application on telco: Network Alarm Data
- Detection of cancer, diabetes, etc.
17Market Analysis and Management
- Where does the data come from?
- Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
- Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
- Cross-market analysis
- Associations/correlations between product sales, prediction based on such associations
- Customer profiling
- What types of customers buy what products (clustering or classification)
- Customer requirement analysis
- identifying the best products for different customers
- predicting what factors will attract new customers
- Provision of summary information
- multidimensional summary reports
- statistical summary information (data central tendency and variation)
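The cross-market analysis bullet above mentions associations between product sales. A minimal sketch of how such an association could be quantified, using the usual support and confidence measures, follows. The baskets, item names, and the helper functions `support` and `confidence` are hypothetical and only illustrate the idea; they are not taken from the lecture.

```python
# Hypothetical market baskets (illustrative only, not data from the lecture).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transaction counts."""
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / count_containing(antecedent, transactions)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support(bread, milk) = {rule_support:.2f}")
print(f"confidence(bread -> milk) = {rule_confidence:.2f}")
```

In this toy data, bread and milk appear together in 3 of 5 baskets, and 3 of the 4 baskets with bread also contain milk.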
18Corporate Analysis and Risk Management
- Finance planning and asset evaluation
- cash flow analysis and prediction
- contingent claim analysis to evaluate assets
- cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning
- summarize and compare the resources and spending
- Competition
- monitor competitors and market directions
- group customers into classes and a class-based pricing procedure
- set pricing strategy in a highly competitive market
19Fraud Detection: Mining Unusual Patterns
- Approaches: clustering and model construction for frauds, outlier analysis
- Applications: health care, retail, credit card service, telecomm.
- Auto insurance: ring of collisions
- Money laundering: suspicious monetary transactions
- Medical insurance
- Professional patients, ring of doctors, and ring of references
- Unnecessary or correlated screening tests
- Telecommunications: phone-call fraud
- Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
- Retail industry
- Analysts estimate that 38% of retail shrink is due to dishonest employees
- Anti-terrorism
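The phone-call model above looks for patterns that deviate from an expected norm. One simple way to sketch that idea is a z-score test against a subscriber's call history, shown below. The durations, the cutoff of 3 standard deviations, and the names `history`, `new_calls`, and `z_score` are all assumptions made for illustration; real fraud systems use far richer models.

```python
import statistics

# Hypothetical call durations in minutes for one subscriber (illustrative only).
history = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0]
new_calls = [3.2, 4.8, 47.0]

# The "expected norm" here is simply the mean and sample standard
# deviation of the subscriber's past calls.
mean = statistics.mean(history)
stdev = statistics.stdev(history)


def z_score(duration):
    """How many standard deviations a call lies from the historical mean."""
    return (duration - mean) / stdev


THRESHOLD = 3.0  # assumed cutoff, not a value from the lecture
flagged = [d for d in new_calls if abs(z_score(d)) > THRESHOLD]
print("calls flagged as unusual:", flagged)
```

Estimating the norm from past calls only, rather than from the calls being scored, keeps a single extreme call from inflating the standard deviation and hiding itself.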
20Successful e-commerce Case Study
- A person buys a book (product) at Amazon.com.
- Task: Recommend other books (products) this person is likely to buy
- Amazon does clustering based on books bought
- customers who bought "Advances in Knowledge Discovery and Data Mining" also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations"
- Recommendation program is quite successful
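The "customers who bought X also bought Y" idea can be sketched with plain co-purchase counting. Note that this is a simpler stand-in, not the clustering approach the slide attributes to Amazon. The purchase histories, the placeholder titles, and the function `also_bought` are hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories: customer id -> set of titles bought.
purchases = {
    "c1": {"book_A", "book_B"},
    "c2": {"book_A", "book_B", "book_C"},
    "c3": {"book_A", "book_C"},
    "c4": {"book_B", "book_D"},
    "c5": {"book_A", "book_B"},
}


def also_bought(title, purchases, top_n=3):
    """Rank other titles by how many buyers of `title` also bought them."""
    counts = Counter()
    for basket in purchases.values():
        if title in basket:
            # Count every other title in the same customer's history.
            counts.update(basket - {title})
    return counts.most_common(top_n)


recommendations = also_bought("book_A", purchases)
print(recommendations)
```

Here four customers bought `book_A`; three of them also bought `book_B` and two also bought `book_C`, so `book_B` is ranked first.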
21Unsuccessful e-commerce case study (KDD-Cup 2000)
- Data: clickstream and purchase data from Gazelle.com, legwear and legcare e-tailer
- Q: Characterize visitors who spend more than $12 on an average order at the site
- Dataset of 3,465 purchases, 1,831 customers
- Very interesting analysis by Cup participants
- thousands of hours - $X,000,000 (millions) of consulting
- Total sales -- $Y,000
- Obituary: Gazelle.com out of business, Aug 2000
22Genomic Microarrays Case Study
- Given microarray data for a number of samples (patients), can we
- Accurately diagnose the disease?
- Predict outcome for given treatment?
- Recommend best treatment?
23Example: ALL/AML data
- 38 training cases, 34 test, 7,000 genes
- 2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
- Use train data to build diagnostic model
[Figure: samples labeled ALL and AML]
Results on test data: 33/34 correct, 1 error may be mislabeled
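The figures on this slide translate directly into a test accuracy and a feature-to-sample ratio. The short calculation below uses only the numbers quoted above (38 training cases, 34 test cases, 7,000 genes, 33 correct); the variable names are illustrative.

```python
# Numbers quoted on the slide.
n_train = 38
n_test = 34
n_genes = 7_000
n_correct = 33

accuracy = n_correct / n_test
error_rate = 1 - accuracy

# With far more features (genes) than training cases, this is a
# high-dimensional problem: roughly 184 genes per training sample.
genes_per_training_case = n_genes / n_train

print(f"test accuracy = {accuracy:.1%}")
print(f"test error    = {error_rate:.1%}")
print(f"genes per training case = {genes_per_training_case:.0f}")
```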
24Security and Fraud Detection - Case Study
- Credit Card Fraud Detection
- Detection of Money laundering
- FAIS (US Treasury)
- Securities Fraud
- NASDAQ KDD system
- Phone fraud
- AT&T, Bell Atlantic, British Telecom/MCI
- Bio-terrorism detection at Salt Lake Olympics 2002
25Customer Attrition Case Study
- Situation: Attrition rate for mobile phone customers is around 25-30% a year!
- Task
- Given customer information for the past N months, predict who is likely to attrite next month.
- Also, estimate customer value and what is the cost-effective offer to be made to this customer.
26Customer Attrition Results
- Verizon Wireless built a customer data warehouse
- Identified potential attriters
- Developed multiple, regional models
- Targeted customers with high propensity to accept the offer
- Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact, with >30 M subscribers)
- (Reported in 2003)
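To see why the slide calls this a huge impact, the reduction can be turned into a rough subscriber count. The sketch below reads the slide's figures as monthly percentages and rounds them to 2%, 1.5%, and 30 million subscribers; these round values are assumptions, since the slide only gives bounds.

```python
# Assumed round figures: "over 2" -> 2% per month, "under 1.5" -> 1.5% per month,
# and more than 30 M subscribers -> 30 million. The exact values are not given.
subscribers = 30_000_000
monthly_rate_before = 0.02
monthly_rate_after = 0.015

lost_before = round(subscribers * monthly_rate_before)
lost_after = round(subscribers * monthly_rate_after)
retained_per_month = lost_before - lost_after

print(f"lost per month before: {lost_before:,}")
print(f"lost per month after:  {lost_after:,}")
print(f"additional customers retained per month: {retained_per_month:,}")
```

Under these assumptions, half a percentage point of attrition corresponds to about 150,000 subscribers kept each month.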
27Oil Spills
- Figure: An example of a radar image of the sea surface
28Data Mining Steps
Knowledge Discovery in Databases (KDD)
29Steps of a KDD Process
- Learning the application domain
- relevant prior knowledge and goals of application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of effort!)
- Data reduction and transformation
- Find useful features, dimensionality/variable reduction, invariant representation.
- Choosing functions of data mining
- summarization, classification, regression, association, clustering.
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
- visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
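The KDD steps listed above can be sketched end to end on a toy data set. The records, field names, and the deliberately trivial midpoint split standing in for the mining step are all hypothetical; the point is only to show how selection, cleaning, transformation, mining, and presentation follow one another.

```python
# Hypothetical raw records: (customer_id, age, monthly_spend).
# None marks a missing value; the last row is an exact duplicate.
raw = [
    ("a", 25, 120.0),
    ("b", 31, None),
    ("c", 47, 560.0),
    ("d", 52, 610.0),
    ("e", 29, 150.0),
    ("e", 29, 150.0),
]

# 1. Selection: keep only the fields relevant to the question.
selected = [(cid, spend) for cid, _age, spend in raw]

# 2. Cleaning: drop rows with missing values and remove duplicates.
cleaned = sorted({row for row in selected if row[1] is not None})

# 3. Transformation: rescale spend to the range [0, 1].
spends = [spend for _cid, spend in cleaned]
low, high = min(spends), max(spends)
scaled = {cid: (spend - low) / (high - low) for cid, spend in cleaned}

# 4. Data mining: a trivial stand-in model that splits customers
#    at the midpoint of the scaled range.
groups = {cid: ("high" if value >= 0.5 else "low") for cid, value in scaled.items()}

# 5. Evaluation and presentation: summarize the members of each group.
summary = {
    label: sorted(cid for cid, group in groups.items() if group == label)
    for label in ("low", "high")
}
print(summary)
```

Even in this tiny example, most of the code is selection, cleaning, and transformation rather than the mining step itself, which matches the slide's remark about preprocessing effort.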
30Knowledge Discovery Process Flow, according to CRISP-DM
see www.crisp-dm.org for more information
31Data Mining On What Kinds of Data?
- Relational database
- Data warehouse
- Transactional database
- Advanced database and advanced applications
- Object-relational databases
- Temporal databases and time-series databases
- Spatial databases and spatiotemporal databases
- Text databases and multimedia database
- Heterogeneous databases and legacy databases
- Data streams
- The World-Wide Web
32Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
33Statistics, Machine Learning and Data Mining
- Statistics
- more theory-based
- more focused on testing hypotheses
- Machine learning
- more heuristic
- focused on improving performance of a learning agent
- also looks at real-time learning and robotics, areas not part of data mining
- Data Mining and Knowledge Discovery
- integrates theory and heuristics
- focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
- Distinctions are fuzzy
(source: Witten & Eibe)
34Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics, Machine Learning/AI/Pattern Recognition, and Database systems]
35Weka
36Assignment 0
- 1. Data mining in the news (about 1/2 page, excluding figures)
- 2. Summarize an article on an application of data mining (about 1/2 page, excluding figures). A list of articles is available.
- Due Wednesday, 2 March, in class (softcopy)
- After I add to and edit the submissions, the plan is to combine them into a small booklet (still a draft, to be refined after the DM application assignment). This is meant to make the DM application assignment easier.
37Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
38Conferences and Journals on Data Mining
- Other related conferences
- ACM SIGMOD
- VLDB
- (IEEE) ICDE
- WWW, SIGIR
- ICML, CVPR, NIPS
- Journals
- Data Mining and Knowledge Discovery (DAMI or DMKD)
- IEEE Trans. on Knowledge and Data Eng. (TKDE)
- KDD Explorations
- KDD Conferences
- ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
- SIAM Data Mining Conf. (SDM)
- (IEEE) Int. Conf. on Data Mining (ICDM)
- Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
- Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
39Where to Find References? DBLP, CiteSeer, Google
- Data mining and KDD (SIGKDD CDROM)
- Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
- Journals: Data Mining and Knowledge Discovery, KDD Explorations
- Database systems (SIGMOD: ACM SIGMOD Anthology CD ROM)
- Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
- Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
- AI and Machine Learning
- Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
- Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
- Web and IR
- Conferences: SIGIR, WWW, CIKM, etc.
- Journals: WWW: Internet and Web Information Systems,
- Statistics
- Conferences: Joint Stat. Meeting, etc.
- Journals: Annals of Statistics, etc.
- Visualization
- Conference proceedings: CHI, ACM-SIGGraph, etc.
- Journals: IEEE Trans. Visualization and Computer Graphics, etc.