Three Challenges in Data Mining

Slides: 30
Provided by: dongm7
Transcript and Presenter's Notes

Title: Three Challenges in Data Mining


1
Three Challenges in Data Mining
  • Anne Denton
  • Department of Computer Science NDSU

2
Why Data Mining?
  • Parkinson's Law of Data
  • Data expands to fill the space available for
    storage
  • Disk-storage version of Moore's law
  • Capacity ∝ 2^(t / 18 months)
  • Available data grows exponentially!

3
Outline
  • Motivation of 3 challenges
  • More records (rows)
  • More attributes (columns)
  • New subject domains
  • Some answers to the challenges
  • Thesis work
  • Generalized P-Tree structure
  • Kernel-based semi-naïve Bayes classification
  • KDD-cup 02/03 and with Csci 366 students
  • Data with graph relationship
  • Outlook: Data with time dependence

4
Examples
  • More records
  • Many stores save each transaction
  • Data warehouses keep historic data
  • Monitoring network traffic
  • Micro sensors / sensor networks
  • More attributes
  • Items in a shopping cart
  • Keywords in text
  • Properties of a protein (multi-valued
    categorical)
  • New subject domains
  • Data mining hype increases audience

5
Algorithmic Perspective
  • More records
  • Standard scaling problem
  • More attributes
  • Different algorithms needed for 1000 vs. 10
    attributes
  • New subject domains
  • New techniques needed
  • Joining of separate fields
  • Algorithms should be domain-independent
  • Need for experts does not scale well
  • Twice as many data sets
  • Twice as many domain experts??
  • Ignore domain knowledge?
  • No! Formulate it systematically

6
Some Answers to Challenges
  • Large data quantity (Thesis)
  • Many records
  • P-Tree concept and its generalization to
    non-spatial data
  • Many attributes
  • Algorithm that defies curse of dimensionality
  • New techniques / Joining separate fields
  • Mining data on a graph
  • Outlook: Mining data with time dependence

7
Challenge 1: Many Records
  • Typical question
  • How many records satisfy given conditions on
    attributes?
  • Typical answer
  • In record-oriented database systems
  • Database scan O(N)
  • Sorting / indexes?
  • Unsuitable for most problems
  • P-Trees
  • Compressed bit-column-wise storage
  • Bit-wise AND replaces database scan
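
Illustrative sketch (not part of the original slides): the idea of answering a counting query with bit-wise AND on bit columns can be shown with plain Python integers used as bit vectors. The function names, the record layout, and the sample data are assumptions made for this example, and the compression that P-Trees add on top of the bit columns is left out.

```python
from typing import Dict, List


def build_bit_columns(records: List[Dict[str, int]],
                      widths: Dict[str, int]) -> Dict[str, int]:
    """Store every bit position of every attribute as one bit vector.

    Each bit vector is a Python int; bit i of the int belongs to record i.
    """
    columns: Dict[str, int] = {}
    for attr, width in widths.items():
        for bit in range(width):
            col = 0
            for i, rec in enumerate(records):
                if (rec[attr] >> bit) & 1:
                    col |= 1 << i
            columns[f"{attr}[{bit}]"] = col
    return columns


def count_matching(columns: Dict[str, int], n_records: int,
                   conditions: Dict[str, bool]) -> int:
    """Count records that satisfy all bit conditions using AND / NOT only."""
    mask = (1 << n_records) - 1          # one bit per record, all set
    result = mask
    for name, required in conditions.items():
        col = columns[name]
        result &= col if required else (~col & mask)
    return bin(result).count("1")        # number of records still set


# Hypothetical data: 3-bit "age" and 2-bit "income" codes.
records = [
    {"age": 5, "income": 2},
    {"age": 7, "income": 3},
    {"age": 4, "income": 1},
    {"age": 1, "income": 3},
]
columns = build_bit_columns(records, {"age": 3, "income": 2})

# "age >= 4 and income >= 2" becomes a condition on the two high-order bits.
print(count_matching(columns, len(records), {"age[2]": True, "income[1]": True}))  # 2
```

The query never visits individual records once the columns exist; the cost sits in the AND over bit vectors, which is the step that compression is meant to shorten.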

8
P-Trees: Compression Aspect

9
P-Trees: Ordering Aspect
  • Compression relies on long sequences of 0 or 1
  • Images
  • Neighboring pixels are probably similar
  • Peano-ordering
  • Other data?
  • Peano-ordering can be generalized
  • Peano-order sorting
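
Illustrative sketch (not part of the original slides): one way to sort non-spatial records so that similar records end up next to each other is to interleave the bits of their attributes and sort by the resulting key. This assumes that the Peano ordering meant here is the bit-interleaved (Z-order) scheme; the function names and sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def interleave_bits(values: Sequence[int], width: int) -> int:
    """Build a sort key from the highest bit of every attribute,
    then the next-highest bit of every attribute, and so on."""
    key = 0
    for bit in range(width - 1, -1, -1):
        for v in values:
            key = (key << 1) | ((v >> bit) & 1)
    return key


def peano_order_sort(records: List[Tuple[int, ...]],
                     width: int) -> List[Tuple[int, ...]]:
    """Sort records by their interleaved-bit key."""
    return sorted(records, key=lambda r: interleave_bits(r, width))


# Hypothetical records with two 2-bit attributes.
data = [(3, 3), (0, 0), (2, 1), (0, 3), (1, 0)]
print(peano_order_sort(data, width=2))
# [(0, 0), (1, 0), (0, 3), (2, 1), (3, 3)]
```

Records that agree in their high-order bits become neighbors after sorting, which tends to produce the long runs of 0 or 1 in each bit column that the compression relies on.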

10
Peano-Order Sorting

11
Impact of Peano-Order Sorting
  • Speed improvement especially for large data sets
  • Less than O(N) scaling for all algorithms

12
So Far
  • Answer to challenge 1: Many records
  • P-Tree concept allows scaling better than O(N)
    for AND (equivalent to database scan)
  • Introduced effective generalization to
    non-spatial data (thesis)
  • Challenge 2: Many attributes
  • Focus: Classification
  • Curse of dimensionality
  • Some algorithms suffer more than others

13
Curse of Dimensionality
  • Many standard classification algorithms
  • E.g., decision trees, rule-based classification
  • For each attribute: 2 halves, relevant vs.
    irrelevant
  • How often can we divide by 2 before the small size
    of the relevant part makes results insignificant?
  • Inverse of doubling the number of rice grains for
    each square of the chess board
  • Many domains have hundreds of attributes
  • Occurrence of terms in text mining
  • Properties of genes
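
Illustrative sketch (not part of the original slides): the "divide by 2" argument can be put into numbers. The helper name and the threshold of 30 records per partition are assumptions chosen only to make the point concrete.

```python
def max_binary_splits(n_records: int, min_partition_size: int) -> int:
    """Number of times the data can be halved while the remaining part
    still holds at least min_partition_size records."""
    splits = 0
    size = float(n_records)
    while size / 2 >= min_partition_size:
        size /= 2
        splits += 1
    return splits


# One million records, at least 30 records wanted in the relevant part.
print(max_binary_splits(1_000_000, 30))  # 15
```

Under these assumptions a method that halves the data for each attribute it uses runs out of records after about 15 attributes, far short of the hundreds of attributes found in text or gene data.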

14
Possible Solution
  • Additive models
  • Each attribute contributes to a sum
  • Techniques exist (statistics)
  • Computationally intensive
  • Simplest: Naïve Bayes
  • x(k) is the value of the kth attribute
  • Considered an additive model
  • Logarithm of probability is additive
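
Illustrative sketch (not part of the original slides): the additive reading of Naïve Bayes can be seen in a small categorical classifier in which each attribute adds one log-probability term to the class score. The function names, the add-one smoothing, and the toy data are assumptions for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (k, class) -> Counter of values
    domains = defaultdict(set)            # k -> set of values seen
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            value_counts[(k, c)][v] += 1
            domains[k].add(v)
    return class_counts, value_counts, domains


def log_score(model, row, c):
    """log P(c) + sum over k of log P(x(k) | c): one additive term per attribute."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    score = math.log(class_counts[c] / total)
    for k, v in enumerate(row):
        # Add-one (Laplace) smoothing, an assumption of this sketch.
        numerator = value_counts[(k, c)][v] + 1
        denominator = class_counts[c] + len(domains[k])
        score += math.log(numerator / denominator)
    return score


def predict(model, row):
    """Return the class with the largest additive log score."""
    return max(model[0], key=lambda c: log_score(model, row, c))


# Hypothetical training data with two categorical attributes.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("sunny", "hot")))   # no
```

Adding an attribute adds one more term to the sum instead of splitting the data again, which is why additive models are less exposed to the curse of dimensionality.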

15
Semi-Naïve Bayes Classifier
  • Correlated attributes are joined
  • Has been done for categorical data
  • Kononenko 91, Pazzani 96
  • Previously: continuous data discretized
  • New (thesis)
  • Kernel-based evaluation of correlation
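
Illustrative sketch (not part of the original slides): the joining step of a semi-naïve Bayes classifier can be shown on categorical data. This sketch does not reproduce the kernel-based evaluation of correlation from the thesis; it swaps in a simple dependence score (the distance between the joint distribution of two attributes and the product of their marginals). All function names and the sample data are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def dependence(rows: List[Tuple], j: int, k: int) -> float:
    """Total variation distance between the joint distribution of
    attributes j and k and the product of their marginals.
    The score is 0 when the two attributes look independent in the sample."""
    n = len(rows)
    joint = Counter((r[j], r[k]) for r in rows)
    marg_j = Counter(r[j] for r in rows)
    marg_k = Counter(r[k] for r in rows)
    return 0.5 * sum(
        abs(joint[(a, b)] / n - marg_j[a] * marg_k[b] / n ** 2)
        for a in marg_j for b in marg_k
    )


def join_attributes(rows: List[Tuple], j: int, k: int) -> List[Tuple]:
    """Replace attributes j and k by one compound attribute (their value pair)."""
    joined = []
    for r in rows:
        rest = tuple(v for i, v in enumerate(r) if i not in (j, k))
        joined.append(rest + ((r[j], r[k]),))
    return joined


# Hypothetical data: attributes 0 and 1 always change together,
# attribute 2 varies independently of attribute 0.
rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")]
print(dependence(rows, 0, 1))          # 0.5
print(dependence(rows, 0, 2))          # 0.0
print(join_attributes(rows, 0, 1)[0])  # ('p', ('a', 'x'))
```

After joining, the compound attribute is treated as a single attribute by the Naïve Bayes model, so the dependence between its parts no longer gets counted twice.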

16
Results
  • Error decrease in units of standard deviation for
    different parameter sets
  • Improvement for wide range of correlation
    thresholds 0.05 (white) to 1 (blue)

17
So Far
  • Answer to challenge 1: More records
  • Generalized P-tree structure
  • Answer to challenge 2: More attributes
  • Additive algorithms
  • Example: Kernel-based semi-naïve Bayes
  • Challenge 3: New subject domains
  • Data on a graph
  • Outlook: Data with time dependence

18
Standard Approach to Data Mining
  • Conversion to a relation (table)
  • Domain knowledge goes into table creation
  • Standard table can be mined with standard tools
  • Does that solve the problem?
  • To some degree, yes
  • But we can do better

19
  • Everything should be made as simple as
    possible, but not simpler
  • Albert Einstein

20
Claim: Representation as a single relation is not
rich enough
  • Example: Contribution of a graph structure to
    standard mining problems
  • Genomics
  • Protein-protein interactions
  • WWW
  • Link structure
  • Scientific publications
  • Citations

Scientific American 05/03
21
Data on a Graph: Old Hat?
  • Common Topics
  • Analyze edge structure
  • Google
  • Biological Networks
  • Sub-graph matching
  • Chemistry
  • Visualization
  • Focus on graph structure
  • Our work
  • Focus on mining node data
  • Graph structure provides connectivity

22
Protein-Protein Interactions
  • Protein data
  • From Munich Information Center for Protein
    Sequences (also KDD-cup 02)
  • Hierarchical attributes
  • Function
  • Localization
  • Pathways
  • Gene-related properties
  • Interactions
  • From experiments
  • Undirected graph

23
Questions
  • Prediction of a property
  • (KDD-cup 02: AHR)
  • Which properties in neighbors are relevant?
  • How should we integrate neighbor knowledge?
  • What are interesting patterns?
  • Which properties say more about neighboring nodes
    than about the node itself?

But not
AHR: Aryl Hydrocarbon Receptor Signaling Pathway
24
Possible Representations
  • OR-based
  • At least one neighbor has property
  • Example: Neighbor essential true
  • AND-based
  • All neighbors have property
  • Example: Neighbor essential false
  • Path-based
  • (depends on maximum hops)
  • One record for each path
  • Classification weighting?
  • Association Rule Mining
  • Record base changes
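
Illustrative sketch (not part of the original slides): the OR-based and AND-based representations can be computed per node from an undirected edge list. The function name, the treatment of nodes without neighbors (AND is reported as false), and the sample graph are assumptions; the path-based representation is not covered here.

```python
from typing import Dict, Iterable, Tuple


def neighbor_features(edges: Iterable[Tuple[str, str]],
                      has_property: Dict[str, bool]) -> Dict[str, Dict[str, bool]]:
    """For each node, summarize one boolean property over its neighbors.

    "or"  : at least one neighbor has the property
    "and" : all neighbors have the property (false if there are no neighbors)
    """
    neighbors = {node: set() for node in has_property}
    for u, v in edges:                    # undirected: store both directions
        neighbors[u].add(v)
        neighbors[v].add(u)

    features = {}
    for node, nbrs in neighbors.items():
        values = [has_property[n] for n in nbrs]
        features[node] = {
            "or": any(values),
            "and": bool(values) and all(values),
        }
    return features


# Hypothetical interaction graph; the property is "essential".
edges = [("A", "B"), ("A", "C")]
essential = {"A": False, "B": True, "C": False, "D": True}
print(neighbor_features(edges, essential)["A"])  # {'or': True, 'and': False}
```

Both representations keep one record per node, so standard table-based tools still apply; the path-based variant changes the record base, which is the complication the slide points to.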

[Figure: example nodes labeled AHR / essential, AHR / essential, AHR / not essential]
25
Association Rule Mining
  • OR-based representation
  • Conditions
  • Association rule involves AHR
  • Support across a link greater than within a node
  • Conditions on minimum confidence and support
  • Top 3 with respect to support
  • (Results by Christopher Besemann, project CSci
    366)
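
Illustrative sketch (not part of the original slides): support and confidence of a rule that spans a link can be computed once each node record is extended with OR-based neighbor items. The "neighbor." prefix, the function names, and the tiny graph are assumptions for this example and are not the data or results from the project.

```python
from typing import Dict, Iterable, Set, Tuple


def or_based_records(edges: Iterable[Tuple[str, str]],
                     items: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """One record per node: its own items plus "neighbor.<item>" for every
    item that at least one neighbor has."""
    neighbors = {node: set() for node in items}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    records = {}
    for node, own in items.items():
        neighbor_items = set()
        for n in neighbors[node]:
            neighbor_items |= items[n]
        records[node] = set(own) | {f"neighbor.{i}" for i in neighbor_items}
    return records


def support(records: Dict[str, Set[str]], itemset: Set[str]) -> float:
    """Fraction of records that contain every item of the itemset."""
    return sum(1 for r in records.values() if itemset <= r) / len(records)


def confidence(records: Dict[str, Set[str]],
               antecedent: Set[str], consequent: Set[str]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    return support(records, antecedent | consequent) / support(records, antecedent)


# Hypothetical proteins and interactions.
items = {
    "p1": {"AHR"},
    "p2": {"essential", "nucleus"},
    "p3": {"AHR", "essential"},
    "p4": {"nucleus"},
}
edges = [("p1", "p2"), ("p3", "p4"), ("p1", "p3")]
records = or_based_records(edges, items)

print(support(records, {"AHR", "neighbor.essential"}))       # 0.25 (across a link)
print(support(records, {"AHR", "essential"}))                # 0.25 (within a node)
print(confidence(records, {"AHR"}, {"neighbor.nucleus"}))    # 1.0
```

Comparing the support of an across-link itemset with its within-node counterpart is how the condition "support across a link greater than within a node" could be checked for a candidate rule.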

AHR → essential
AHR → nucleus (localization)
AHR → transcription (function)
26
Classification Results
  • Problem
  • (especially path-based representation)
  • Varying amount of information per record
  • Many algorithms unsuitable in principle
  • E.g., algorithms that divide domain space
  • KDD-cup 02
  • Very simple additive model
  • Based on visually identifying relationship
  • Number of interacting essential genes adds to
    probability of predicting protein as AHR

27
KDD-Cup 02: Honorable Mention
NDSU Team
28
Outlook: Time-Dependent Data
  • KDD-cup 03
  • Prediction of citations of scientific papers
  • Old: Time-series prediction
  • New: Combination with similarity-based prediction

29
Conclusions and Outlook
  • Many exciting problems in data mining
  • Various challenges
  • Scaling of existing algorithms (more records)
  • Different types of algorithms gain importance
    (more attributes)
  • Identifying and solving new challenges in a
    domain-independent way (new subject areas)
  • Examples of general structural components that
    apply to many domains
  • Graph-structure
  • Time-dependence
  • Relationships between attributes
  • Software engineering aspects
  • Software design of scientific applications
  • Rows vs. columns