Data Mining: Current Status and Research Directions


1
Data Mining: Current Status and Research
Directions
  • Jiawei Han
  • Intelligent Database Systems Research Lab
  • School of Computing Science
  • Simon Fraser University, Canada
  • http://www.cs.sfu.ca/han

2
Outline
  • Why is data mining hot?
  • Current status: Major technical progress
  • Is data mining flying high, or not?
  • How to fly data mining high? Research directions
    on data mining

3
Why Is Data Mining Hot?
  • Data mining (knowledge discovery in databases)
  • Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful)
    information (knowledge) or patterns from data in
    large databases or other information repositories
  • Necessity is the mother of invention
  • Data is everywhere; data mining should be
    everywhere, too!
  • Understand and use data: an imminent task!

4
Data, Data, Everywhere!!
  • Relational database: A commodity of every
    enterprise
  • Huge data warehouses are under construction
  • POS (Point of Sales): Transactional DBs in
    terabytes
  • Object-relational databases, distributed,
    heterogeneous, and legacy databases
  • Spatial databases (GIS), remote sensing database
    (EOS), and scientific/engineering databases
  • Time-series data (e.g., stock trading) and
    temporal data
  • Text (documents, emails) and multimedia databases
  • WWW: A huge, hyper-linked, dynamic, global
    information system

5
Data Mining Is Everywhere, Too! A
Multi-Dimensional View of Data Mining
  • Databases to be mined
  • Relational, transactional, object-relational,
    active, spatial, time-series, text, multi-media,
    heterogeneous, legacy, WWW, etc.
  • Knowledge to be mined
  • Characterization, discrimination, association,
    classification, clustering, trend, deviation and
    outlier analysis, etc.
  • Techniques utilized
  • Database-oriented, data warehouse (OLAP), machine
    learning, statistics, visualization, neural
    network, etc.
  • Applications adapted
  • Retail, telecommunication, banking, fraud
    analysis, DNA mining, stock market analysis, Web
    mining, Weblog analysis, etc.

6
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining as a confluence of database
technology, statistics, machine learning (AI),
visualization, information science, and other
disciplines]

7
Data Mining: One Can Trace Back to Early
Civilization
  • Most scientific discoveries involve data mining
  • Kepler's laws, Newton's laws, periodic table of
    chemical elements, ..., from big bang to DNA
  • Statistics: A discipline dedicated to data
    analysis
  • Then why data mining? What are the differences?
  • Huge amount of data: in gigabytes to terabytes
  • Fast computers: quick response, interactive
    analysis
  • Multi-dimensional, powerful, thorough analysis
  • High-level, declarative: users' ease and control
  • Automated or semi-automated: mining functions
    hidden or built into many systems

8
A Brief History of Data Mining Activities
  • 1989: IJCAI Workshop on Knowledge Discovery in
    Databases
  • Knowledge Discovery in Databases (G.
    Piatetsky-Shapiro and W. Frawley, 1991)
  • 1991-1994: Workshops on Knowledge Discovery in
    Databases
  • Advances in Knowledge Discovery and Data Mining
    (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
    R. Uthurusamy, 1996)
  • 1995-1998: International Conferences on Knowledge
    Discovery in Databases and Data Mining
    (KDD'95-98)
  • Journal of Data Mining and Knowledge Discovery
    (1997)
  • 1998: ACM SIGKDD, SIGKDD'1999-2001 conferences,
    and SIGKDD Explorations
  • More conferences on data mining
  • PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM,
    DaWaK, SPIE-DM, etc.

9
Research Progress in the Last Decade
  • Multi-dimensional data analysis: Data warehouse
    and OLAP (on-line analytical processing)
  • Association, correlation, and causality analysis
  • Classification: scalability and new approaches
  • Clustering and outlier analysis
  • Sequential patterns and time-series analysis
  • Similarity analysis: curves, trends, images,
    texts, etc.
  • Text mining, Web mining and Weblog analysis
  • Spatial, multimedia, scientific data analysis
  • Data preprocessing and database compression
  • Data visualization and visual data mining
  • Many others, e.g., collaborative filtering

10
Multi-Dimensional Data Analysis
  • Data warehousing: integration from heterogeneous
    or semi-structured databases
  • Multi-dimensional modeling of data: star and
    snowflake schemas
  • Efficient and scalable computation of data cubes
    or iceberg cubes
  • OLAP (on-line analytical processing): drilling,
    dicing, slicing, etc.
  • Discovery-driven exploration of data cubes
  • From OLAP to OLAM: A multi-dimensional view for
    on-line analytical mining
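
The cube operations named on this slide can be illustrated with a minimal sketch. It is not taken from the presentation: the fact table, the dimension names, and the helper functions (compute_cube, iceberg, slice_cuboid) are hypothetical, and the full cube is computed naively rather than with a scalable cubing algorithm.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, product, quarter, sales amount).
SALES = [
    ("West", "TV", "Q1", 100),
    ("West", "TV", "Q2", 150),
    ("West", "PC", "Q1", 200),
    ("East", "TV", "Q1", 50),
    ("East", "PC", "Q2", 300),
]
DIMS = ("region", "product", "quarter")


def compute_cube(rows, dims=DIMS):
    """Aggregate sales for every subset of dimensions (one cuboid each)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in group)
                cuboid[key] += row[-1]  # the measure is the last field
            cube[tuple(dims[i] for i in group)] = dict(cuboid)
    return cube


def iceberg(cuboid, min_total):
    """Iceberg condition: keep only cells whose aggregate meets the threshold."""
    return {cell: total for cell, total in cuboid.items() if total >= min_total}


def slice_cuboid(cuboid, position, value):
    """Slice: fix one dimension of a cuboid to a single value."""
    return {cell: total for cell, total in cuboid.items() if cell[position] == value}


cube = compute_cube(SALES)
by_region = cube[("region",)]                    # roll-up to region only
by_region_product = cube[("region", "product")]  # drill-down by product
west_only = slice_cuboid(by_region_product, 0, "West")
big_cells = iceberg(by_region_product, 250)
```

With n dimensions this naive approach materializes 2^n cuboids, which is why the slide stresses efficient computation and iceberg cubes that store only cells above a threshold.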

11
Association and Frequent Pattern Analysis
  • Efficient mining of frequent patterns and
    association rules
  • Apriori and FP-growth algorithms
  • Multi-level, multi-dimensional, quantitative
    association mining
  • From association to correlation, sequential
    patterns, partial periodicity, cyclic rules,
    ratio rules, etc.
  • Query and constraint-based association analysis
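
As an illustration of the Apriori idea mentioned above, here is a minimal level-wise sketch. The basket data, the function names (apriori, rules), and the thresholds are assumptions made for the example; FP-growth, which avoids candidate generation, is not shown.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate against the transaction list.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """Derive rules lhs -> rhs with confidence = count(lhs + rhs) / count(lhs)."""
    found = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, confidence))
    return found


# Hypothetical market baskets.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(BASKETS, min_support=3)
strong = rules(frequent, min_confidence=0.8)
```

The prune step relies on the fact that any superset of an infrequent itemset is itself infrequent, so such candidates are never counted.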

12
Classification: Scalable Methods and Handling of
Complex Types of Data
  • Classification has been an essential theme in
    machine learning and statistics research
  • Decision trees, Bayesian classification, neural
    networks, k-nearest neighbors, etc.
  • Tree-pruning, boosting, bagging techniques
  • Efficient and scalable classification methods
  • Exploration of attribute-class pairs
  • SLIQ, SPRINT, RainForest, BOAT, etc.
  • Classification of semi-structured and
    non-structured data
  • Classification by clustering association rules
    (ARCS)
  • Association-based classification
  • Web document classification

13
Clustering and Outlier Analysis
  • Partitioning methods
  • k-means, k-medoids, CLARANS
  • Hierarchical methods and micro-clusters
  • BIRCH, CURE, Chameleon
  • Density-based methods
  • DBSCAN, OPTICS, DENCLUE
  • Grid-based methods
  • STING, CLIQUE, WaveCluster
  • Outlier analysis
  • statistics-based, distance-based, deviation-based
  • Constraint-based clustering
  • COD (Clustering with Obstructed Distance)
  • User-specified constraints
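
To make the partitioning family concrete, the following is a minimal k-means sketch. The points are invented, and the initialization (taking the first k points as centroids) is a simplifying assumption for reproducibility, not a recommended practice.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means: assign points to the nearest centroid, then recompute."""
    centroids = [points[i] for i in range(k)]  # naive init (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old one.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two well-separated groups.
POINTS = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, clusters = kmeans(POINTS, k=2)
```

k-medoids differs in that each cluster is represented by an actual data point rather than a mean, which makes it less sensitive to outliers.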

14
Sequential Patterns and Time-Series Analysis
  • Trend analysis
  • Trend movement vs. cyclic variations, seasonal
    variations and random fluctuations
  • Similarity search in time-series database
  • Handling gaps, scaling, etc.
  • Indexing methods and query languages for
    time-series
  • Sequential pattern mining
  • Various kinds of sequences, various methods
  • From GSP to PrefixSpan
  • Periodicity analysis
  • Full periodicity, partial periodicity, cyclic
    association rules

15
Similarity Search: Similar Curves, Trends,
Images, and Texts
  • Various kinds of data, various similarity mining
    methods
  • Discovery of similar trends in time-series data
  • Data transformation and high-dimensional structures
  • Finding similar images based on color, texture,
    etc.
  • Content-based vs. keyword-based retrieval
  • Color histogram-based signature
  • Multi-feature composed signature
  • Finding documents with similar texts
  • Similar keywords (synonymy and polysemy)
  • Term frequency matrix
  • Latent semantic indexing
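
A small sketch of the term-frequency approach to finding documents with similar texts. The three documents and the helper names (tf_vector, cosine) are hypothetical; tokenization is a bare whitespace split, and latent semantic indexing, which would additionally reduce the term space, is not shown.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector: word -> number of occurrences."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
DOCS = [
    "data mining finds patterns in data",
    "web mining finds patterns in web logs",
    "the cat sat on the mat",
]
vectors = [tf_vector(d) for d in DOCS]
sim_01 = cosine(vectors[0], vectors[1])  # share several terms
sim_02 = cosine(vectors[0], vectors[2])  # share no terms
```

Because matching is on exact terms, two documents that use different words for the same concept score as dissimilar, which is the synonymy problem the slide raises.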

16
Spatial, Multimedia, Scientific Data Analysis
  • Multi-dimensional analysis of spatial, multimedia
    and scientific data
  • Geo-spatial data cube and spatial OLAP
  • The curse of dimensionality problem
  • Association analysis
  • A progressive refinement methodology
  • Micro-clustering can be used for preprocessing in
    the analysis of complex types of data
  • Classification
  • Association-based for handling high-dimensionality
    and sparse data

17
Data Mining Industry and Applications
  • From research prototypes to data mining products,
    languages, and standards
  • IBM Intelligent Miner, SAS Enterprise Miner, SGI
    MineSet, Clementine, MS/SQLServer 2000, DBMiner,
    BlueMartini, MineIt, DigiMine, etc.
  • A few data mining languages and standards (esp.
    MS OLEDB for Data Mining).
  • Application achievements in many domains
  • Market analysis, trend analysis, fraud detection,
    outlier analysis, Web mining, etc.

18
Is Data Mining Flying? Or Not??
  • Data mining is flying
  • R&D has been striding forward greatly
  • Applications have been broadened substantially
  • But not as high as some may have hoped. Why not?
  • Hope to see billions of dollars within years?
  • A young and coming technology, not a hype!
  • Not bread-and-butter but value-added service
  • DBMS, WWW, and other information systems will
    still be a data mining aircraft-carrier
  • Not off-the-shelf in nature
  • Need training, understanding, and customizing
    (re-develop.)
  • Young technology: needs much R&D to fly high
  • Much research, development, and real problem
    solving!

19
How to Fly Data Mining High? Research Directions
  • Web mining
  • Towards integrated data mining environments and
    tools
  • Vertical (or application-specific) data mining
  • Invisible data mining
  • Towards intelligent, efficient, and scalable data
    mining methods

20
Web Mining: A Fast Expanding Frontier in Data
Mining
  • Mine what Web search engine finds
  • Automatic classification of Web documents
  • Discovery of authoritative Web pages, Web
    structures and Web communities
  • Meta-Web warehousing: Web yellow page service
  • Web usage mining

21
Mine What Web Search Engine Finds
  • Current Web search engines: A convenient source
    for mining
  • keyword-based, return too many, often low quality
    answers, still missing a lot, not customized,
    etc.
  • Data mining will help
  • coverage: enlarge and then shrink, using
    synonyms and conceptual hierarchies
  • better search primitives: user preferences/hints
  • linkage analysis: authoritative pages and
    clusters
  • Web-based languages: XML, WebSQL, WebML
  • customization: home page, Weblog, user profiles

22
Discovery of Authoritative Pages in WWW
  • Page-rank method (Brin and Page, 1998)
  • Rank the "importance" of Web pages, based on a
    model of a "random browser."
  • Hub/authority method (Kleinberg, 1998)
  • Prominent authorities often do not endorse one
    another directly on the Web.
  • Hub pages have a large number of links to many
    relevant authorities.
  • Thus hubs and authorities exhibit a mutually
    reinforcing relationship
  • Both the page-rank and hub/authority
    methodologies have been shown to provide
    qualitatively good search results for broad query
    topics on the WWW.
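
Both ranking ideas can be sketched as simple power iterations. This is an illustrative reading of the two methods, not the published implementations: the four-page link graph, the damping value of 0.85, the treatment of pages without out-links, and the fixed iteration count are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Random-browser model: follow a link with prob. damping, else jump anywhere."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p in pages:
            targets = links[p] or pages  # page without out-links: jump anywhere
            share = damping * rank[p] / len(targets)
            for q in targets:
                new_rank[q] += share
        rank = new_rank
    return rank


def normalize(scores):
    """Scale scores to unit Euclidean length."""
    norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
    return {p: v / norm for p, v in scores.items()}


def hits(links, iterations=50):
    """Hubs point to good authorities; authorities are pointed to by good hubs."""
    pages = sorted(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = normalize(
            {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        )
        hub = normalize({p: sum(auth[q] for q in links[p]) for p in pages})
    return hub, auth


# Hypothetical link graph: page -> pages it links to.
LINKS = {"hub1": ["a", "b"], "hub2": ["a", "b"], "a": [], "b": ["a"]}
rank = pagerank(LINKS)
hub, auth = hits(LINKS)
```

In this toy graph page "a" ends up with the top page-rank and the top authority score, while "hub1" and "hub2" get the highest hub scores even though nothing links to them.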

23
Automatic Classification of Web Documents
  • Web document classification
  • Good human classification: Yahoo!, CS term
    hierarchies
  • These classifications can be used as training
    sets to build up a learning model
  • Keyword-based classification is different from
    multi-dimensional classification
  • Association or clustering-based classification is
    often more effective
  • Multi-level classification is important

24
A Multiple Layered Meta-Web Architecture
[Figure: stack of layers from Layer 0 at the
bottom, through Layer 1 (generalized
descriptions), up to Layer n (more generalized
descriptions)]

25
Web Yellow Page Service: A Multi-Layer, Meta-Web
Approach
  • XML facilitates structured and meta-information
    extraction
  • Automatic classification of Web documents
  • based on Yahoo!, etc. as training set, plus
    keyword-based correlation/classification analysis
    (IR/AI assistance)
  • Automatic ranking of important Web pages
  • authoritative site recognition and clustering Web
    pages
  • Generalization-based multi-layer meta-Web
    construction
  • With the assistance of clustering and
    classification analysis
  • Meta-Web can be warehoused and incrementally
    updated
  • Querying and mining can be performed on or
    assisted by meta-Web

26
Importance of Constructing Multi-Layer Meta Web
  • Benefits of Multi-Layer Meta-Web
  • Multi-dimensional Web info summary analysis
  • Approximate and intelligent query answering
  • Web high-level query answering (WebSQL, WebML)
  • Web content and structure mining
  • Observing the dynamics/evolution of the Web
  • Is it realistic to construct such a meta-Web?
  • It benefits even if it is partially constructed
  • The benefit may justify the cost of tool
    development, standardization, and partial
    restructuring

27
Web Usage (Click-Stream) Mining
  • Weblog provides rich information about Web
    dynamics
  • Multidimensional Weblog analysis
  • disclose potential customers, users, markets,
    etc.
  • Plan mining (mining general Web accessing
    regularities)
  • Web linkage adjustment, performance improvements
  • Web accessing association/sequential pattern
    analysis
  • Web caching, prefetching, swapping
  • Trend analysis
  • Dynamics of the Web: what has been changing?
  • Customized to individual users

28
Towards Integrated Data Mining Environments and
Tools
  • OLAP Mining: Integration of Data Warehousing and
    Data Mining
  • Querying and Mining: An Integrated Information
    Analysis Environment
  • Basic Mining Operations and Mining Query
    Optimization
  • Vertical (or application-specific) data mining
  • Invisible data mining

29
OLAP Mining: An Integration of Data Mining and
Data Warehousing
  • Data mining systems, DBMS, Data warehouse systems
    coupling
  • No coupling, loose-coupling, semi-tight-coupling,
    tight-coupling
  • On-line analytical mining data
  • integration of mining and OLAP technologies
  • Interactive mining of multi-level knowledge
  • Necessity of mining knowledge and patterns at
    different levels of abstraction by
    drilling/rolling, pivoting, slicing/dicing, etc.
  • Integration of multiple mining functions
  • Characterized classification, first clustering
    and then association

30
An OLAM Architecture
[Figure: layered OLAM architecture, top to bottom]
  • Layer 4, User Interface: User GUI API (mining
    query in, mining result out)
  • Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine,
    over the Data Cube API
  • Layer 2, MDDB: MDDB and Meta Data, over the
    Database API (filtering and integration)
  • Layer 1, Data Repository: Databases and Data
    Warehouse (data cleaning, data integration)

31
Querying and Mining: An Integrated Information
Analysis Environment
  • Data mining as a component of DBMS, data
    warehouse, or Web information system
  • Integrated information processing environment
  • MS/SQLServer-2000 (Analysis service)
  • IBM IntelligentMiner on DB2
  • SAS EnterpriseMiner: data warehousing and mining
  • Query-based mining
  • Querying database/DW/Web knowledge
  • Efficiency and flexibility: preprocessing,
    on-line processing, optimization, integration,
    etc.

32
Basic Mining Operations and Mining Query
Optimization
  • Relational databases: There is a set of basic
    relational operations and a standard query
    language, SQL
  • E.g., selection, projection, join, set
    difference, intersection, Cartesian product, etc.
  • Is there a set of standard data mining
    operations, on which optimizations can be done?
  • Difficulty: different definitions of operations
  • Importance: optimization can be performed on them
    systematically, standardization to facilitate
    information exchange and system interoperability

33
Vertical Data Mining
  • Generic data mining tools? Too simple to match
    domain-specific, sophisticated applications
  • Expert knowledge and business logic represent
    many years of work in their own fields!
  • Data mining, business logic, and domain experts
  • A multi-dimensional view of data miners
  • Complexity of data: Web, sequence, spatial,
    multimedia, ...
  • Complexity of domains: DNA, astronomy, market,
    telecom, ...
  • Domain-specific data mining tools
  • Provide concrete, killer solutions to specific
    problems
  • Feedback to build more powerful tools

34
Invisible Data Mining
  • Build mining functions into daily information
    services
  • Web search engine (link analysis, authoritative
    pages, user profiles), adaptive web sites, etc.
  • Improvement of query processing: history and data
  • Making service smart and efficient
  • Benefits from/to data mining research
  • Data mining research has produced many scalable,
    efficient, novel mining solutions
  • Applications feed new challenge problems to
    research

35
Towards Intelligent Tools for Data Mining
  • Integration paves the way to intelligent mining
  • Smart interface brings intelligence
  • Easy to use, understand and manipulate
  • One picture may be worth 1,000 words
  • Visual and audio data mining
  • Human-Centered Data Mining
  • Towards self-tuning, self-managing,
    self-triggering data mining

36
Integrated Mining: A Booster for Intelligent
Mining
  • Integration paves the way to intelligent mining
  • Data mining integrates with DBMS, DW, WebDB, etc.
  • Integration inherits the power of up-to-date
    information technology: querying, MD analysis,
    similarity search, etc.
  • Mining can be viewed as querying database
    knowledge
  • Integration leads to standard interface/language,
    function/process standardization, utility, and
    reachability
  • Efficiency and scalability bring intelligent
    mining to reality

37
One Picture May Be Worth 1000 Words!
  • Visual Data Mining
  • Visualization of data
  • Visualization of data mining results
  • Visualization of data mining processes
  • Interactive data mining: visual classification
  • One melody may be worth 1000 words too!
  • Audio data mining: turn data into music and
    melody!
  • Uses audio signals to indicate the patterns of
    data or the features of data mining results

38
Visualization of data mining results in SAS
Enterprise Miner: scatter plots

39
Visualization of association rules in MineSet 3.0

40
Visualization of a decision tree in MineSet 3.0

41
Visualization of Data Mining Processes by
Clementine

42
Interactive Visual Mining by Perception-Based
Classification (PBC)

43
Human-Centered Data Mining
  • Finding all the patterns autonomously in a
    database? Unrealistic, because the patterns
    could be too many and mostly uninteresting
  • Data mining should be an interactive process
  • User directs what is to be mined
  • Users must be provided with a set of primitives
    to be used to communicate with the data mining
    system using a data mining query language
  • User should provide constraints on what is to be
    mined
  • System should use such constraints to guide the
    mining process (constraint-based mining or mining
    query optimization)

44
Constraint-Based Mining
  • What kinds of constraints can be used in mining?
  • Knowledge type constraint: classification,
    association, etc.
  • Data constraint: SQL-like queries
  • Find products sold together in Vancouver in
    Feb. '01.
  • Dimension/level constraints
  • in relevance to region, price, brand, customer
    category.
  • Rule constraints
  • small sales (price < 10) trigger big sales
    (sum > 200).
  • Interestingness constraints
  • E.g., strong rules (min_support ≥ 3%,
    min_confidence ≥ 60%, min_lift > 3.0).
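
The interestingness and rule constraints above can be sketched as follows. The transactions, prices, budget, and function names (measures, itemsets_within_budget) are hypothetical. The second function illustrates one anti-monotone rule constraint, a cap on total price assuming non-negative prices, pushed into level-wise itemset generation.

```python
def measures(transactions, lhs, rhs):
    """Support, confidence, and lift of the rule lhs -> rhs."""
    n = len(transactions)
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_rhs = sum(1 for t in transactions if rhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / n
    confidence = n_both / n_lhs if n_lhs else 0.0
    lift = confidence / (n_rhs / n) if n_rhs else 0.0
    return support, confidence, lift


def itemsets_within_budget(prices, max_total):
    """Enumerate itemsets with total price <= max_total, level by level.

    The constraint is anti-monotone for non-negative prices: once an
    itemset exceeds the budget, every superset does too, so it is never
    extended.
    """
    items = sorted(prices)
    level = [frozenset([i]) for i in items if prices[i] <= max_total]
    result = []
    while level:
        result.extend(level)
        next_level = set()
        for itemset in level:
            for item in items:
                if item not in itemset:
                    candidate = itemset | {item}
                    if sum(prices[x] for x in candidate) <= max_total:
                        next_level.add(candidate)
        level = list(next_level)
    return result


# Hypothetical transactions and price list.
TRANSACTIONS = [
    {"pen", "ink"},
    {"pen", "ink", "paper"},
    {"pen", "paper"},
    {"ink"},
    {"pen", "ink"},
]
PRICES = {"pen": 2, "ink": 5, "paper": 4, "printer": 120}

support, confidence, lift = measures(TRANSACTIONS, {"pen"}, {"ink"})
cheap_sets = itemsets_within_budget(PRICES, max_total=10)
```

Here the rule pen -> ink has a lift below 1, so it would fail a minimum-lift threshold even though its support and confidence look acceptable; that is the kind of pattern interestingness constraints are meant to filter out.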

45
Rule Constraints: A Classification
[Figure: classes of rule constraints: succinctness,
anti-monotonicity, monotonicity, convertible
constraints, inconvertible constraints]

46
Constraint-Based Clustering Analysis
  • User-specified constraints: e.g., no cluster has
    fewer than 1000 gold customers
  • Resource allocation (clustering) with obstacles

47
Towards Automated Data Mining?
  • It is not realistic to automatically find all the
    knowledge in a large database
  • Thus we promote human-centered, constraint-based
    mining
  • However, to achieve genuinely intelligent data
    mining, the data mining process should be
    self-tuning, self-managing, self-triggering
  • Functions should be developed to achieve such
    performance

48
Conclusions
  • Data mining: A promising research frontier
  • Data mining research has been striding forward
    greatly in the last decade
  • However, data mining, as an industry, has not
    been flying as high as expected
  • Much research and application exploration are
    needed
  • Web mining
  • Towards integrated data mining environments and
    tools
  • Towards intelligent, efficient, and scalable data
    mining methods

49
http://www.cs.sfu.ca/han  http://db.cs.sfu.ca
  • Thank you!!!

50
References
  • J. Han and M. Kamber, Data Mining: Concepts and
    Techniques, Morgan Kaufmann, 2001.
  • J. Han, L. V. S. Lakshmanan, and R. T. Ng,
    "Constraint-Based, Multidimensional Data Mining",
    COMPUTER (special issue on Data Mining), 32(8):
    46-50, 1999.