MIS 467 Data Mining Chapter 1 introduction Fall 2003

1
MIS 467 Data Mining
Chapter 1: Introduction
Fall 2003
2
Personal Information
  • Instructor: Bertan Badur, Ph.D.
  • Office: HKA 226
  • Phone: 0 212 358 15 40 ext. 2027
  • E-mail: badur_at_boun.edu.tr
  • Office Hours: Mondays 14.00-15.00, Tuesdays
    14.00-15.00, or by appointment

3
Course Information
  • Course Hours: Wednesdays 5 (13:00-13:50),
    Thursdays 1, 2 (9:00-10:50)
  • Lab Hours: Wednesdays 4 (12:00-12:50)
  • Place: HKA303
  • Course Assistant: Gülçin Buruncuk
  • Web page: www.mis.boun.edu.tr/badur/MIS467

4
Course Description
  • This course aims at introducing basic
    methodologies and techniques of data mining.
    Basic data mining functionalities such as
    association, concept description, classification,
    prediction and clustering are introduced and
    various algorithms to achieve them are presented.
    Applications of these concepts and techniques to
    real world problems are discussed. Data mining
    software programs are introduced in lab hours.

5
Text Book
  • Main
  • Data Mining: Concepts and Techniques, by Jiawei
    Han and Micheline Kamber, Morgan Kaufmann
    Publishers, 2001
  • Supplementary Text Books
  • Data Mining: Practical Machine Learning Tools
    and Techniques with Java Implementations, by Ian
    H. Witten and Eibe Frank, Morgan Kaufmann
    Publishers, 2000
  • Machine Learning, by Tom M. Mitchell, McGraw-Hill
    International Editions, 1997
  • Mastering Data Mining: The Art and Science of
    Customer Relationship Management, by Michael J.
    A. Berry and Gordon Linoff, Wiley Computer
    Publishing, 2000
  • Predictive Data Mining, by S. M. Weiss and
    N. Indurkhya, Morgan Kaufmann Publishers, 1998

6
Han's CS397 course slides
  • http://www-courses.cs.uiuc.edu/cs397han/index.htm

7
Course Outline (1)
  • Introduction: 1 W, Ch. 1
  • An Overview of Data Warehouses and OLAP: 1 W, Ch. 2
  • Data Preprocessing: 2 W, Ch. 3
  • Concept Description: 1 W, Ch. 5

8
Course Outline (2)
  • Association Rule Mining: 2 W, Ch. 6
  • Classification and Prediction: 4 W, Ch. 7
  • Decision Trees
  • Bayesian Classification
  • Classification by Backpropagation
  • Regression for Classification and Prediction
  • Classification Accuracy
  • Cluster Analysis: 2 W, Ch. 8

9
Prerequisites
  • Basic notion of probability
  • Elementary calculus
  • Elementary knowledge of matrices
  • Basic concepts of database systems

10
Grading
  • Midterm: 25%, due 19.11.2003
  • Homework: 50%
  • Project: optional
  • Final Exam: 25%

11
Project
  • Each student or group of students is required to
    prepare a term project. Implementation of
    selected data mining algorithms, application of
    studied techniques to a small-scale real-world
    problem, or a literature survey can be accepted
    as a term project

12
Software
  • DBMiner: DBMiner 2.0 Educational Version,
    developed by J. Han (author of the book Data
    Mining: Concepts and Techniques) and his team;
    compatible with the text book; performs
    association, classification and cluster analysis
  • Microsoft SQL Server Analysis Services
  • SPSS
  • Neural Connection: performs neural network
    modelling for classification and prediction
  • Answer Tree: decision tree analysis

13
Data Sources
  • FoodMart database coming with Analysis Services
  • WareMart database
  • Data sources from internet
  • UCI KDD Archive
  • UCI Machine Learning Repository
  • Financial data from IMKB
  • A database about disabled people in Turkey

14
Where to Find the Set of Slides for the Text Book?
  • Tutorial sections (MS PowerPoint files)
  • http://www.cs.sfu.ca/han/dmbook
  • Other conference presentation slides (.ppt)
  • http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/han
  • Research papers, DBMiner system, and other
    related information
  • http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/han

15
Chapter 1. Introduction
  • Motivation: Why data mining?
  • What is data mining?
  • Business applications of data mining
  • Data mining: On what kind of data?
  • Data mining functionality
  • Are all the patterns interesting?
  • Classification of data mining systems
  • Major issues in data mining

16
Motivation: Necessity is the Mother of Invention
  • Data explosion problem
  • Automated data collection tools and mature
    database technology lead to tremendous amounts of
    data stored in databases, data warehouses and
    other information repositories
  • Need to convert such data into knowledge and
    information
  • Applications
  • Business management
  • Production control
  • market analysis
  • Engineering design
  • Science exploration

17
Evolution of Database Technology (1)
  • Data collection, database creation
  • Data management
  • data storage and retrieval
  • database transaction processing
  • Data analysis and understanding
  • Data mining and data warehousing

18
Evolution of Database Technology (2) (See Fig. 1.1)
  • 1960s
  • Data collection, database creation, IMS and
    network DBMS
  • 1970s
  • Relational data model, relational DBMS
    implementation
  • 1980s
  • RDBMS, advanced data models (extended-relational,
    OO, deductive, etc.) and application-oriented
    DBMS (spatial, scientific, engineering, etc.)
  • 1990s-2000s
  • Data mining and data warehousing, multimedia
    databases, and Web databases

19
Developments in computer hardware
  • Powerful and affordable computers
  • data collection equipment
  • storage media

20
Data Warehouse
  • Data cleaning
  • Data integration
  • OLAP: On-Line Analytical Processing
  • summarization
  • consolidation
  • aggregation
  • view information from different angles
  • but additional data analysis tools are needed for
  • classification
  • clustering
  • characterization of data changing over time

21
Data-rich, information-poor situation
  • Abundance of data
  • need for powerful data analysis tools
  • data tombs - data archives
  • seldom visited
  • Important decisions are made
  • not on the information-rich data stored in
    databases
  • but on a decision maker's intuition
  • no tool to extract knowledge embedded in vast
    amounts of data
  • Expert system technology
  • domain experts to input knowledge
  • time consuming and costly

22
What Is Data Mining?
  • Data mining (knowledge discovery in databases)
  • Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful)
    information or patterns from data in large
    databases
  • Alternative names and their inside stories
  • Data mining: a misnomer?
  • Knowledge discovery (mining) in databases (KDD),
    knowledge extraction, data/pattern analysis, data
    archeology, data dredging, information
    harvesting, business intelligence, etc.
  • What is not data mining?
  • query processing.
  • Expert systems or small ML/statistical programs

23
Potential Business Applications
  • Market analysis and management
  • target marketing, customer relation management,
    market basket analysis, cross selling, market
    segmentation
  • Risk analysis and management
  • Banks assume a financial risk when they grant
    loans
  • risk models attempt to predict the probability of
    default, that is, failure to pay back the
    borrowed amount
  • Credit cards
  • Insurance companies
  • Fraud detection and management
  • Other Applications
  • Text mining (news group, email, documents) and
    Web analysis.
  • Intelligent query answering

24
Market Analysis and Management (1)
  • Where are the data sources for analysis?
  • Credit card transactions, loyalty cards, discount
    coupons, customer complaint calls, plus (public)
    lifestyle studies, clickstreams
  • Customer profiling and segmentation
  • data mining can tell you what types of customers
    buy what products (clustering or classification)
  • Target marketing
  • Find clusters of model customers who share the
    same characteristics: interest, income level,
    spending habits, etc.

25
Market Analysis and Management (2)
  • Effectiveness of sales campaigns
  • Advertisements, coupons, discounts, bonuses
  • promote products and attract customers
  • can help improve profits
  • Compare amount of sales and number of
    transactions
  • during the sales period versus before or after
    the sales campaign
  • Association analysis
  • which items are likely to be purchased together
    with the items on sale

26
Market Analysis and Management (3)
  • Customer retention: analysis of customer loyalty
  • sequences of purchases of particular customers
  • goods purchased at different periods by the same
    customers can be grouped into sequences
  • changes in customer consumption or loyalty
  • suggest adjustments to the pricing and variety
    of goods
  • to retain old customers and attract new customers
  • Cross-selling and up-selling
  • associations from sales records
  • a customer who buys a PC is likely to buy a
    printer
  • purchase recommendations

27
Fraud Detection and Management
  • Applications
  • widely used in health care, retail, credit card
    services, telecommunications (phone card fraud),
    etc.
  • Approach
  • use historical data to build models of fraudulent
    behavior and use data mining to help identify
    similar instances
  • Examples
  • auto insurance: detect a group of people who
    stage accidents to collect on insurance
  • money laundering: detect suspicious money
    transactions (US Treasury's Financial Crimes
    Enforcement Network)
  • Detecting telephone fraud
  • Telephone call model: destination of the call,
    duration, time of day or week. Analyze patterns
    that deviate from an expected norm.

28
Financial Data Analysis
  • Financial data
  • complete, reliable, high quality
  • Loan payment prediction and customer credit
    policy analysis

29
Loan payment prediction and customer credit policy analysis
  • Factors influencing loan payment performance
  • loan-to-value ratio
  • term of the loan
  • debt ratio (total monthly debt/total monthly
    income)
  • payment-to-income ratio
  • income level
  • education level
  • residence region
  • credit history
  • analysis may find that
  • payment-to-income ratio is a dominant factor while
  • education level and debt ratio are not

30
Data Mining for the Telecommunication Industry
  • Telecommunication data are multidimensional
  • calling time, duration
  • location of caller, location of callee
  • type of call
  • used to identify and compare
  • data traffic, system workload
  • resource usage, user group behavior
  • profit
  • fraudulent pattern analysis and identification of
    unusual patterns
  • to achieve customer loyalty
  • characteristics of customers affecting line usage

31
Other Applications
  • Sports
  • IBM Advanced Scout analyzed NBA game statistics
    (shots blocked, assists, and fouls) to gain
    competitive advantage for New York Knicks and
    Miami Heat
  • Astronomy
  • JPL and the Palomar Observatory discovered 22
    quasars with the help of data mining
  • Internet: Web Surf-Aid
  • IBM Surf-Aid applies data mining algorithms to
    Web access logs for market-related pages to
    discover customer preference and behavior pages,
    analyzing effectiveness of Web marketing,
    improving Web site organization, etc.

32
Steps of a KDD Process (1)
  • 1. Learning the application domain
  • relevant prior knowledge and goals of application
  • 2. Creating a target data set: data selection
  • 3. Data cleaning and preprocessing (may take
    60-80% of effort!)
  • removal of noise or outliers
  • strategies for missing data fields
  • accounting for time sequence information
  • 4. Data reduction and transformation
  • Find useful features, dimensionality/variable
    reduction, invariant representation.

33
Steps of a KDD Process (2)
  • 5. Choosing functions of data mining
  • summarization, classification, regression,
    association, clustering.
  • 6. Choosing the mining algorithm(s)
  • which models or parameters
  • 7. Data mining: search for patterns of interest
  • 8. Pattern evaluation and knowledge presentation
  • visualization, transformation, removing redundant
    patterns, etc.
  • 9. Use of discovered knowledge
  • incorporating into the performance system
  • documenting
  • reporting to interested parties

34
An example: customer segmentation
  • 1. Marketing department wants to perform a
    segmentation study on the customers of AE Company
  • 2. Decide on relevant variables from a data
    warehouse on customers, sales, promotions
  • Customers: name, ID, income, age, education, ...
  • Sales: history of sales
  • Promotion: promotion types, durations, ...
  • 3. Handle missing income, addresses, ...
  • determine outliers if any
  • 4. Generate new index variables representing
    wealth of customers
  • Wealth = a*income + b*houses + c*cars + ...
  • Make necessary transformations (e.g., z-scores) so
    that some data mining algorithms work more
    efficiently

35
  • 5. Choose clustering as the data mining
    functionality, as it is the natural one for a
    segmentation study: the aim is to find groups of
    customers with similar characteristics
  • 6. Choose a clustering algorithm
  • K-means or k-medoids or any suitable one for that
    problem
  • 7. Apply the algorithm
  • Find clusters or segments
  • 8. Make reverse transformations, visualize the
    customer segments
  • 9. Present the results in the form of a report to
    the marketing department
  • Implement the segmentation as part of a DSS so
    that it can be applied repeatedly at certain
    intervals as new customers arrive
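
The segmentation steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the customer records, the wealth weights A, B, C, and the function names are hypothetical, and a plain one-dimensional k-means loop stands in for whichever clustering algorithm is chosen in step 6.

```python
import statistics

# Hypothetical customers: (income, houses, cars). Values are illustrative only.
customers = [
    (20, 0, 0), (22, 0, 1), (25, 1, 0),      # lower-wealth customers
    (80, 2, 2), (85, 3, 2), (90, 2, 3),      # higher-wealth customers
]

# Step 4: wealth index = a*income + b*houses + c*cars (weights are assumptions).
A, B, C = 1.0, 10.0, 5.0
wealth = [A * inc + B * houses + C * cars for inc, houses, cars in customers]

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values], mean, std

def k_means_1d(points, k, iterations=20):
    """Plain k-means on one-dimensional data; assumes k >= 2."""
    ordered = sorted(points)
    # Seed the centers with evenly spaced points from the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(p - centers[j])) for p in points]
        # Update step: each center moves to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = statistics.mean(members)
    return labels, centers

z, mean, std = z_scores(wealth)           # step 4: transformation
labels, centers = k_means_1d(z, k=2)      # steps 6-7: choose and apply clustering
# Step 8: reverse transformation, so segment centers are back in wealth units.
segment_centers = [c * std + mean for c in centers]
```

Standardizing before clustering matters more when several variables with different scales are clustered together; with a single index it mainly illustrates the transform and its reversal.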

36
Data Mining: A KDD Process
  • Data mining: the core of knowledge discovery
    process.
  • [Figure: the KDD process. Databases → Data
    Cleaning and Data Integration → Data Warehouse →
    Selection → Task-relevant Data → Data Mining →
    Pattern Evaluation → Knowledge]
37
Architecture of a Typical Data Mining System
  • [Figure: layered architecture, top to bottom:
    Graphical user interface; Pattern evaluation;
    Data mining engine; Database or data warehouse
    server; Databases and Data Warehouse (reached
    through data cleaning, data integration, and
    filtering); plus a Knowledge-base]
38
Architecture of a Typical Data Mining System
  • Database, data warehouse
  • Database or data warehouse server
  • Knowledge base
  • concept hierarchies
  • user beliefs
  • used to assess pattern interestingness
  • other thresholds
  • Data mining engine
  • functional modules
  • characterization, association, classification,
    cluster analysis, evolution and deviation
    analysis
  • Pattern evaluation module
  • Graphical user interface

39
Data Mining: Confluence of Multiple Disciplines
  • [Figure: Data Mining at the center, drawing on
    Database Technology, Statistics, Machine
    Learning, Visualization, Information Science,
    and Other Disciplines]
40
Efficient and Scalable Techniques
  • For an algorithm to be scalable
  • its running time should grow linearly in
    proportion to the size of the database

41
Data Mining On What Kind of Data?
  • Relational databases
  • Data warehouses
  • Transactional databases
  • Advanced DB and information repositories
  • Object-oriented and object-relational databases
  • Spatial databases
  • Time-series data and temporal data
  • Text databases and multimedia databases
  • Heterogeneous and legacy databases
  • WWW

42
An Example problem
  • AllElectronics is a multi-branch retail company
  • relational tables include
  • customer
  • ID, name, address, age, income, education, sex,
    marital status
  • items
  • ID, name, brand, category, type, price,
    place_made, supplier, cost
  • employee
  • ID, name, department, education, salary
  • branch
  • purchases
  • transID, item_sold, customer ID, emp_ID, date,
    time, method_paid, amount

43
Two styles of data mining
  • Descriptive data mining
  • Characterize the general properties of the data
    in the database
  • finds patterns in data and
  • user determines which ones are important
  • Predictive data mining
  • perform inference on the current data to make
    predictions
  • we know what to predict
  • Not mutually exclusive
  • used together

44
Descriptive Data Mining (1)
  • Discovering new patterns inside the data
  • Used during the data exploration steps
  • Typical questions answered by descriptive data
    mining
  • what is in the data
  • what does it look like
  • are there any unusual patterns
  • what does the data suggest for customer
    segmentation
  • users may have no idea
  • which kind of patterns may be interesting

45
Descriptive Data Mining (2)
  • patterns at various granularities
  • geography
  • country - city - region - street
  • student
  • university - faculty - department - minor
  • Functionalities of descriptive data mining
  • Clustering
  • Ex: customer segmentation
  • summarization
  • visualization
  • Association
  • Ex: market basket analysis

46
A model is a black box
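  • X: vector of independent variables (inputs X1,
    X2, ...)
  • Y = f(X): the output, where f is an unknown
    function
  • [Figure: inputs X1, X2 enter the Model box, which
    produces output Y]
  • The user does not care what the model is doing;
    it is a black box. The user is interested in the
    accuracy of its predictions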
47
Predictive Data Mining (1)
  • Using known examples the model is trained
  • the unknown function is learned from data
  • the more data with known outcomes is available
  • the better the predictive power of the model
  • Used to predict outcomes whose inputs are known
    but the output values are not realized yet
  • Never 100% accurate

48
Predictive Data Mining (2)
  • The performance of a model on past data, that is,
    on predicting outcomes that are already known, is
    not important
  • Its performance on unknown data is much more
    important

49
Typical questions answered by predictive models
  • Who is likely to respond to our next offer
  • based on history of previous marketing campaigns
  • Which customers are likely to leave in the next
    six months
  • What transactions are likely to be fraudulent
  • based on known examples of fraud
  • What will the total spending of a customer be
    in the next month

50
Data Mining Functionalities (1)
  • Concept description: Characterization and
    discrimination
  • Generalize, summarize, and contrast data
    characteristics, e.g., dry vs. wet regions
  • Association (correlation and causality)
  • Multi-dimensional vs. single-dimensional
    association
  • age(X, "20..29") ∧ income(X, "20..29K") ⇒ buys(X,
    "PC") [support = 2%, confidence = 60%]
  • contains(T, "computer") ⇒ contains(T, "software")
    [support = 1%, confidence = 75%]

51
Data Mining Functionalities (2)
  • Classification and Prediction
  • Finding models (functions) that describe and
    distinguish classes or concepts for future
    prediction
  • E.g., classify countries based on climate, or
    classify cars based on gas mileage
  • Presentation: decision tree, classification rule,
    neural network
  • Prediction: predict some unknown or missing
    numerical values
  • Cluster analysis
  • Class label is unknown: group data to form new
    classes, e.g., cluster houses to find
    distribution patterns
  • Clustering is based on the principle of maximizing
    the intra-class similarity and minimizing the
    interclass similarity

52
Data Mining Functionalities (3)
  • Outlier analysis
  • Outlier: a data object that does not comply with
    the general behavior of the data
  • It can be considered as noise or exception but is
    quite useful in fraud detection, rare events
    analysis
  • Trend and evolution analysis
  • Trend and deviation: regression analysis
  • Sequential pattern mining, periodicity analysis
  • Similarity-based analysis
  • Other pattern-directed or statistical analyses

53
Concept Description
  • Characterization
  • Discrimination
  • Data
  • classes or
  • concepts
  • classes of items for sale
  • computers, printers
  • concepts of customers
  • bigSpenders
  • budgetSpenders

54
Data Characterization
  • Summarizing the data of the class under study
    (target class)
  • Methods
  • SQL queries
  • OLAP roll-up operation
  • user-controlled data summarization
  • along a specified dimension
  • attribute-oriented induction
  • without step-by-step user interaction
  • the output of characterization
  • pie charts, bar charts, curves, multidimensional
    data cubes, or cross tabs
  • in rule form as characteristic rules

55
Characterization example
  • Description summarizing the characteristics of
    customers who spend more than 1000 a year at
    AllElectronics
  • age, employment, income
  • drill down on any dimension
  • e.g., on occupation: view these customers
    according to their type of employment
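
A characterization of this kind can be sketched in plain Python. The customer records and function names here are hypothetical, made up only to show a summary of the target class followed by a drill-down on occupation; in practice the same result would more likely come from SQL queries or an OLAP tool.

```python
from collections import defaultdict

# Hypothetical customer records: (age, occupation, yearly_income, yearly_spending).
customers = [
    (34, "engineer", 52000, 1500),
    (45, "teacher", 38000, 1200),
    (29, "engineer", 48000, 800),
    (51, "manager", 90000, 2500),
    (38, "teacher", 40000, 1100),
]

# Target class: customers who spend more than 1000 a year.
target = [c for c in customers if c[3] > 1000]

def summarize(rows):
    """Describe a group of customers with a few aggregate values."""
    n = len(rows)
    return {
        "count": n,
        "avg_age": sum(r[0] for r in rows) / n,
        "avg_income": sum(r[2] for r in rows) / n,
    }

def drill_down_by_occupation(rows):
    """Drill-down: the same summary, computed once per occupation."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[1]].append(r)
    return {occupation: summarize(group) for occupation, group in groups.items()}

overall = summarize(target)
by_occupation = drill_down_by_occupation(target)
```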

56
Data Discrimination
  • Comparing the target class with one or a set of
    comparative classes (contrasting classes)
  • these classes can be specified by the user
  • database queries
  • methods and output
  • similar to those used for characterization
  • include comparative measures to distinguish
    between the target and contrasting classes

57
Discrimination examples
  • Example 1: Compare the general features of
    software products
  • whose sales increased by 10% in the last year
  • whose sales decreased by at least 30% during the
    same period
  • Example 2: Compare two groups of AE customers
  • I) who shop for computer products regularly
  • more than two times a month
  • II) who rarely shop for such products
  • less than three times a year
  • The resulting description
  • 80% of group I customers
  • university education
  • ages 20-40
  • 60% of group II customers
  • seniors or young
  • no university degree

58
Multidimensional Data
  • sales according to region, month and product type
  • Dimensions: Product, Location, Time
  • Hierarchical summarization paths
  • Product: Industry, Category, Product
  • Location: Region, Country, City, Office
  • Time: Year, Quarter, Month or Week, Day
  • [Figure: data cube with axes Region, Product, and
    Month]
59
Association Analysis
  • Discovery of association rules showing
    attribute-value conditions that occur frequently
    together in a given set of data
  • widely used
  • market basket
  • transaction data analysis
  • more formally
  • X ⇒ Y, that is
  • A1 ∧ A2 ∧ ... ∧ Ak ⇒ B1 ∧ B2 ∧ ... ∧ Bl
  • Ai, Bj are attribute-value pairs

60
Example: association analysis
  • From the AllElectronics (AE) database
  • age(X, "20..29") ∧ income(X, "1 billion..2
    billion") ⇒ buy(X, "CD player")
  • (support = 2%,
  • confidence = 60%)
  • X is a variable representing a customer
  • 2% of the AE customers are
  • between 20 and 29 years of age
  • with incomes ranging from 1 to 2 billion TL
  • and buy a CD player
  • there is a 60% probability that customers in those
    age and income groups will buy a CD player
  • a multidimensional association rule
  • contains more than one attribute or predicate

61
Market basket analysis
  • customers' buying behavior is investigated
  • Based only on the transaction data
  • no information about customer properties such as
    age or income
  • Managers
  • are interested in which products or product
    groups are sold together

62
Example: basket analysis rule
  • buy(computer) ⇒ buy(printer)
  • (support = 1%, confidence = 60%)
  • 1% of all transactions contain
  • both a computer and a printer
  • if a transaction contains a computer
  • there is a 60% chance that it contains a printer
    as well
  • a single-dimensional association rule
  • contains a single predicate
  • an association rule is interesting if
  • its support exceeds a minimum threshold and
  • its confidence exceeds a minimum threshold
  • These minimum values are set by specialists
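
Support and confidence for a rule such as buy(computer) ⇒ buy(printer) can be computed directly from transaction data. The following is a minimal sketch; the transactions, function names, and threshold values are hypothetical and chosen only for illustration.

```python
# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"computer", "printer"},
    {"computer", "software"},
    {"computer", "printer", "software"},
    {"printer"},
    {"scanner"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

def is_interesting(antecedent, consequent, transactions, min_support, min_confidence):
    """A rule passes if both measures reach the thresholds set by the analyst."""
    return (support(antecedent | consequent, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)

rule_support = support({"computer", "printer"}, transactions)
rule_confidence = confidence({"computer"}, {"printer"}, transactions)
```

In this toy data, two of five transactions contain both items and three contain a computer, so the rule has support 2/5 and confidence 2/3.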

63
Classification and Prediction
  • Classification: finding models (functions) that
    describe and distinguish classes or concepts for
    future prediction
  • The derived model is based on the analysis of a
    set of training data
  • E.g., classify countries based on climate, or
    classify cars based on gas mileage
  • Presentation: decision tree, classification rule,
    neural network
  • Prediction: predict some unknown or missing
    numerical values
  • may need to be preceded by relevance analysis
    which attempts to identify attributes that do not
    contribute to the classification or prediction
    process
  • these attributes can be excluded

64
Steps of classification process
  • Train the model
  • using a training set
  • objects whose class labels are known
  • Test the model
  • on a test sample
  • whose class labels are known but not used for
    training the model
  • Use the model for classification
  • on new data whose class labels are unknown
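
The three steps (train, test, use) can be sketched as follows. The data are hypothetical, and a 1-nearest-neighbor classifier is used only as a stand-in because it is short to write; any classification model would follow the same train/test/use pattern.

```python
import random

# Hypothetical labeled data: (yearly_income, wealth) -> class label.
labeled = [((i, i + 5), "OK") for i in range(50, 70)] + \
          [((i, i - 5), "DEFAULT") for i in range(10, 30)]

def train(training_set):
    """1-nearest-neighbor 'training' simply stores the labeled examples."""
    return list(training_set)

def classify(model, x):
    """Return the label of the stored example closest to x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(model, key=lambda example: sq_dist(example[0], x))
    return nearest[1]

def accuracy(model, test_set):
    """Share of test objects whose known label the model reproduces."""
    correct = sum(1 for x, label in test_set if classify(model, x) == label)
    return correct / len(test_set)

# Step 1: train the model on part of the labeled data.
rng = random.Random(0)              # fixed seed so the split is repeatable
shuffled = labeled[:]
rng.shuffle(shuffled)
split = int(0.7 * len(shuffled))
training_set, test_set = shuffled[:split], shuffled[split:]
model = train(training_set)

# Step 2: test on labeled objects that were not used for training.
test_accuracy = accuracy(model, test_set)

# Step 3: use the model on a new object whose class label is unknown.
new_customer_label = classify(model, (60, 62))
```

Holding the test set out of training is what makes the measured accuracy an estimate of performance on unseen data rather than on data the model has already memorized.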

65
A hypothetical example
  • Historical data: each customer's type is known;
    each customer has a label
  • Testing set: labels are also known but are not
    used in training the model
  • New customers: their type has to be estimated
  • Each new customer has to be classified as risky,
    normal or good

66
A hypothetical example (cont.)
  • Based on historical data, develop a classification
    model
  • Decision tree, neural network, regression ...
  • Test the performance of the model on a portion of
    the historical data
  • If the accuracy of the model is satisfactory
  • Use the model on the new customers
  • 11 and 27, to assign a type to these new
    customers

67
Example
  • [Figure: scatter plot of customers labeled OK or
    DEFAULT, with axes yearly income and wealth]
68
Decision Trees
  • x1: yearly income, x2: wealth
  • y = 0: DEFAULT, y = 1: OK
  • Numerical values of q1 and q2 are estimated by
    the algorithm

69
Solution
  • [Figure: the income-wealth plane split by
    thresholds q1 and q2 into OK and DEFAULT regions]
  • Rule: IF yearly income > q1 AND wealth > q2
    THEN OK ELSE DEFAULT
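
The rule and the estimation of its two thresholds can be sketched in Python. The training data and function names are hypothetical, and the exhaustive search over observed values is only a simple stand-in: real decision tree algorithms usually choose split points with an impurity measure rather than by trying every pair.

```python
from itertools import product

# Hypothetical training data: (yearly_income, wealth, label).
training = [
    (30, 10, "DEFAULT"), (35, 60, "DEFAULT"), (80, 15, "DEFAULT"),
    (70, 50, "OK"), (90, 70, "OK"), (75, 65, "OK"),
]

def classify_customer(yearly_income, wealth, q1, q2):
    """Apply the rule: OK only if both inputs exceed their thresholds."""
    if yearly_income > q1 and wealth > q2:
        return "OK"
    return "DEFAULT"

def errors(q1, q2, data):
    """Number of training objects the rule misclassifies for given thresholds."""
    return sum(1 for inc, w, label in data
               if classify_customer(inc, w, q1, q2) != label)

def estimate_thresholds(data):
    """Try every observed income/wealth pair as (q1, q2); keep the one with fewest errors."""
    incomes = sorted({inc for inc, _, _ in data})
    wealths = sorted({w for _, w, _ in data})
    return min(product(incomes, wealths), key=lambda q: errors(q[0], q[1], data))

q1, q2 = estimate_thresholds(training)
```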
70
Artificial Neural Nets: Perceptron
  • [Figure: perceptron with inputs x0 = 1, x1, x2,
    ..., xd, weights w0, w1, w2, ..., wd, activation
    g, and output y]
71
Training ANNs
Learning set
Find w which minimizes the error on X
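
One simple way to search for such a weight vector w is the classic perceptron learning rule, sketched below. The learning set (logical AND), the learning rate, and the number of epochs are assumptions made for illustration; the slide does not say which training procedure is intended, and multi-layer networks are normally trained with backpropagation instead.

```python
def predict(w, x):
    """Threshold activation applied to the weighted sum; x[0] is the constant input 1."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else 0

def train_perceptron(learning_set, epochs=20, rate=1.0):
    """Perceptron learning rule: shift w toward each misclassified example."""
    d = len(learning_set[0][0])
    w = [0.0] * d
    for _ in range(epochs):
        for x, y in learning_set:
            error = y - predict(w, x)      # 0 when the example is already correct
            w = [wi + rate * error * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical learning set for logical AND; each x starts with x0 = 1.
learning_set = [
    ((1, 0, 0), 0),
    ((1, 0, 1), 0),
    ((1, 1, 0), 0),
    ((1, 1, 1), 1),
]

w = train_perceptron(learning_set)
training_errors = sum(1 for x, y in learning_set if predict(w, x) != y)
```

Because AND is linearly separable, the rule reaches a weight vector with no errors on the learning set; for data that are not separable it would keep oscillating.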
72
ANN for classification
73
Prediction methods
  • linear regression
  • $Y_i = a_0 + a_1 X_{1,i} + a_2 X_{2,i} + \dots + a_k X_{k,i} + u_i$
  • non-linear regression
  • $Y_i = f(X_{1,i}, X_{2,i}, \dots, X_{k,i}; a_1, a_2, \dots, a_k, u_i)$
  • generalized linear regression
  • logistic
  • logit, probit
  • when the dependent variable is categorical
  • good customer/bad customer or employed/unemployed
  • Poisson regression
  • for count variables
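
For the linear case with a single independent variable, the coefficients can be estimated by ordinary least squares. This is a minimal sketch with hypothetical data and names; the data are generated without noise so the fitted values are easy to check by hand.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates of a0 and a1 in Y = a0 + a1*X + u."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx                  # slope
    a0 = mean_y - a1 * mean_x       # intercept
    return a0, a1

# Hypothetical data: income (X) and spending (Y), generated as Y = 3 + 0.4 * X.
incomes = [10, 20, 30, 40, 50]
spending = [7, 11, 15, 19, 23]

a0, a1 = fit_simple_regression(incomes, spending)
predicted = a0 + a1 * 60            # prediction for a new customer with income 60
```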

74
Example: Prediction and Classification
  • Classification is used to classify customers
    applying for credit cards
  • known class labels: risky, reliable
  • when a new customer applies, looking at her
    characteristics
  • income, age, education, wealth, region, ...
  • Customer class is predicted
  • Prediction: The monthly expense of a new customer
    (a real continuous variable) is predicted based
    on personal information
  • independent variables
  • income, education, wealth, profession, ...
  • Some are numeric, some categorical

75
Cluster Analysis
  • Class label is unknown: group data to form new
    classes,
  • assign class labels to each data object
  • e.g., cluster houses to find distribution
    patterns
  • Clustering is based on the principle of maximizing
    the intra-class similarity and minimizing the
    interclass similarity
  • Objects within a cluster have high similarity in
    comparison to one another
  • but are very dissimilar to objects in other
    clusters
  • there may be hierarchy of classes

76
Example: Clustering
  • Can be performed on AE customer data
  • to identify homogeneous subpopulations of
    customers
  • that represent individual target groups for
    marketing

77

  • [Figure: scatter plot of customers by income and
    distance to store, with groups labeled Type 1,
    Type 2, and Type 3]
  • Clustering according to income and distance to
    store: three clusters of data points are evident
78
Outlier Analysis
  • Outlier: a data object that does not comply with
    the general behavior of the data
  • It can be considered as noise or exception but is
    quite useful in fraud detection, rare events
    analysis
  • Detected using
  • statistical tests
  • distance measures
  • visually inspecting the data
  • Examples

79
Reasons for outliers
  • Measurement errors
  • coding errors
  • age is entered as 999
  • nature of data
  • salary of the general manager is much higher
    than those of the other employees
  • in a crisis the interest rate was on the order of
    1000s
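
A coding error such as an age entered as 999 can be caught with a simple statistical test. The sketch below uses a modified z-score built on the median and the median absolute deviation (MAD), a commonly used robust variant, rather than the ordinary z-score, because in a small sample one extreme value inflates the standard deviation and can hide itself. The ages, the function name, and the 3.5 cut-off are assumptions for illustration.

```python
import statistics

def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []                   # all values (nearly) identical: nothing to flag
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical customer ages, including one coding error.
ages = [23, 35, 41, 29, 52, 999, 38, 47, 31, 44]
outliers = find_outliers(ages)
```

Whether a flagged value is an error (age 999) or a genuine extreme (the general manager's salary) still has to be decided by someone who knows the data.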

80
Evolution Analysis
  • Describes and models regularities or trends for
    objects whose behavior changes over time
  • Distinct features include
  • Trend and deviation: time-series data analysis
  • Sequential pattern mining, periodicity analysis
  • Similarity-based analysis
  • Example
  • Stock market predictions: future stock prices
  • for overall stock indexes or individual company
    stocks
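
A very simple form of trend and deviation analysis is a moving average over a time series. The sketch below is only an illustration with hypothetical index values and function names; it describes past behavior and is not a method for predicting stock prices.

```python
def moving_average(series, window):
    """Trend estimate: mean of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical daily closing values of a stock index.
prices = [100, 102, 101, 105, 107, 106, 110]

trend = moving_average(prices, window=3)
# Deviation of each day's value from the trend of the window ending on that day.
deviations = [p - t for p, t in zip(prices[2:], trend)]
```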

81
Are All the Discovered Patterns Interesting?
  • A data mining system/query may generate thousands
    of patterns, not all of them are interesting.
  • Are all patterns interesting?
  • Typically not: only a small fraction of patterns
    are interesting to any given user
  • Interestingness measures: a pattern is
    interesting if
  • it is easily understood by humans,
  • valid on new or test data with some degree of
    certainty,
  • potentially useful,
  • novel, or
  • validates some hypothesis that a user seeks to
    confirm

82
Objective vs. subjective interestingness measures
  • Objective: based on statistics and structures of
    patterns, e.g.,
  • support of X ⇒ Y: P(X ∪ Y), the probability that a
    transaction contains both X and Y
  • confidence: degree of certainty of the detected
    association
  • P(Y | X): the conditional probability that a
    transaction containing X also contains Y
  • thresholds: controlled by the user
  • e.g., rules that do not satisfy a confidence
    threshold of 50% are uninteresting
  • Subjective: based on the user's beliefs about the
    data, e.g., unexpectedness, novelty,
    actionability, etc.

83
Can We Find All and Only Interesting Patterns?
  • Find all the interesting patterns: Completeness
  • Can a data mining system find all the interesting
    patterns?
  • Association vs. classification vs. clustering
  • Search for only interesting patterns:
    Optimization
  • Can a data mining system find only the
    interesting patterns?
  • Approaches
  • First generate all the patterns and then filter
    out the uninteresting ones.
  • Generate only the interesting patterns: mining
    query optimization

84
Data Mining Classification Schemes
  • General functionality
  • Descriptive data mining
  • Predictive data mining
  • Different views, different classifications
  • Kinds of databases to be mined
  • Kinds of knowledge to be discovered
  • Kinds of techniques utilized
  • Kinds of applications adapted

85
A Multi-Dimensional View of Data Mining Classification
  • Databases to be mined
  • Relational, transactional, object-oriented,
    object-relational, active, spatial, time-series,
    text, multi-media, heterogeneous, legacy, WWW,
    etc.
  • Knowledge to be mined
  • Characterization, discrimination, association,
    classification, clustering, trend, deviation and
    outlier analysis, etc.
  • Multiple/integrated functions and mining at
    multiple levels
  • Techniques utilized
  • Database-oriented, data warehouse (OLAP), machine
    learning, statistics, visualization, neural
    network, etc.
  • Applications adapted
  • Retail, telecommunication, banking, fraud
    analysis, DNA mining, stock market analysis, Web
    mining, Weblog analysis, etc.

86
Major Issues in Data Mining (1)
  • Mining methodology and user interaction
  • Mining different kinds of knowledge in databases
  • Interactive mining of knowledge at multiple
    levels of abstraction
  • Incorporation of background knowledge
  • Data mining query languages and ad-hoc data
    mining
  • Expression and visualization of data mining
    results
  • Handling noise and incomplete data
  • Pattern evaluation: the interestingness problem
  • Performance and scalability
  • Efficiency and scalability of data mining
    algorithms
  • Parallel, distributed and incremental mining
    methods

87
Major Issues in Data Mining (2)
  • Issues relating to the diversity of data types
  • Handling relational and complex types of data
  • Mining information from heterogeneous databases
    and global information systems (WWW)
  • Issues related to applications and social impacts
  • Application of discovered knowledge
  • Domain-specific data mining tools
  • Intelligent query answering
  • Process control and decision making
  • Integration of the discovered knowledge with
    existing knowledge: a knowledge fusion problem
  • Protection of data security, integrity, and
    privacy

88
Summary
  • Data mining: discovering interesting patterns
    from large amounts of data
  • A natural evolution of database technology, in
    great demand, with wide applications
  • A KDD process includes data cleaning, data
    integration, data selection, transformation, data
    mining, pattern evaluation, and knowledge
    presentation
  • Mining can be performed in a variety of
    information repositories
  • Data mining functionalities: characterization,
    discrimination, association, classification,
    clustering, outlier and trend analysis, etc.
  • Classification of data mining systems
  • Major issues in data mining

89
A Brief History of Data Mining Society
  • 1989: IJCAI Workshop on Knowledge Discovery in
    Databases (Piatetsky-Shapiro)
  • Knowledge Discovery in Databases (G.
    Piatetsky-Shapiro and W. Frawley, 1991)
  • 1991-1994: Workshops on Knowledge Discovery in
    Databases
  • Advances in Knowledge Discovery and Data Mining
    (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
    R. Uthurusamy, 1996)
  • 1995-1998: International Conferences on Knowledge
    Discovery in Databases and Data Mining
    (KDD'95-98)
  • Journal of Data Mining and Knowledge Discovery
    (1997)
  • 1998: ACM SIGKDD, SIGKDD'1999-2001 conferences,
    and SIGKDD Explorations
  • More conferences on data mining
  • PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.

90
Where to Find References?
  • Data mining and KDD (SIGKDD member CDROM)
  • Conference proceedings: KDD, and others, such as
    PKDD, PAKDD, etc.
  • Journal: Data Mining and Knowledge Discovery
  • Database field (SIGMOD member CD ROM)
  • Conference proceedings: ACM-SIGMOD, ACM-PODS,
    VLDB, ICDE, EDBT, DASFAA
  • Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
  • AI and Machine Learning
  • Conference proceedings: Machine learning, AAAI,
    IJCAI, etc.
  • Journals: Machine Learning, Artificial
    Intelligence, etc.
  • Statistics
  • Conference proceedings: Joint Stat. Meeting, etc.
  • Journals: Annals of Statistics, etc.
  • Visualization
  • Conference proceedings: CHI, etc.
  • Journals: IEEE Trans. Visualization and Computer
    Graphics, etc.

91
References
  • U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
    R. Uthurusamy. Advances in Knowledge Discovery
    and Data Mining. AAAI/MIT Press, 1996.
  • J. Han and M. Kamber. Data Mining: Concepts and
    Techniques. Morgan Kaufmann, 2000.
  • T. Imielinski and H. Mannila. A database
    perspective on knowledge discovery.
    Communications of ACM, 39:58-64, 1996.
  • G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth.
    From data mining to knowledge discovery: An
    overview. In U. M. Fayyad, et al. (eds.), Advances
    in Knowledge Discovery and Data Mining, 1-35.
    AAAI/MIT Press, 1996.
  • G. Piatetsky-Shapiro and W. J. Frawley. Knowledge
    Discovery in Databases. AAAI/MIT Press, 1991.