Title: Data Mining: Potentials and Challenges
1 Data Mining: Potentials and Challenges
- Rakesh Agrawal
- IBM Almaden Research Center
2 Thesis
- Data mining has started to live up to its promise
in the commercial world, particularly in
applications involving structured data
- Promising data mining applications in
non-conventional domains are beginning to emerge,
involving a combination of structured and
unstructured data
- Investment in data mining research can have a large
payoff
3 Outline
- Examples of some promising non-conventional data
mining applications and technologies
- Some hurdles we need to cross
4 Identifying Social Links Using Association Rules
Input: Crawl of about 1 million pages
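As a rough illustration of how association rules could surface social links, the sketch below treats each crawled page as the set of person names it mentions and keeps rules "a implies b" that clear a support and a confidence threshold. The function name `pair_rules`, the thresholds, and the toy pages are assumptions made for this example; the slide does not say how names were extracted or which thresholds were used.

```python
from collections import Counter
from itertools import combinations

def pair_rules(pages, min_support, min_confidence):
    """Return rules (lhs, rhs, support, confidence): 'lhs on a page implies rhs'."""
    n = len(pages)
    item_count = Counter()   # number of pages mentioning each name
    pair_count = Counter()   # number of pages mentioning both names of a pair
    for names in pages:
        unique = sorted(set(names))
        item_count.update(unique)
        pair_count.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules, one per direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Hypothetical pages, each reduced to the names found on it.
pages = [
    ["alice", "bob"],
    ["alice", "bob", "carol"],
    ["alice", "dave"],
    ["bob", "alice"],
    ["carol", "dave"],
]
rules = pair_rules(pages, min_support=0.4, min_confidence=0.7)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

Only pairs are considered here to keep the sketch short; a crawl of a million pages would call for a level-wise miner that prunes candidates by support.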
5 Website Profiling using Classification
Input: Example pages for each category during
training
6 Discovering Trends Using Sequential Patterns and
Shape Queries
Input: i) patent database ii) shape of interest
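One way to picture a shape query is to reduce each phrase's yearly frequency series to a string of up, down, and stable steps and then match that string against a pattern. The sketch below does this with a regular expression. The encoding, the `RECENT_UPSWING` pattern, and the counts are hypothetical and only illustrate the idea; they are not the shape language used in the talk.

```python
import re

def encode(series, tolerance=0.0):
    """Map each step between consecutive values to 'u' (up), 'd' (down) or 's' (stable)."""
    symbols = []
    for prev, cur in zip(series, series[1:]):
        delta = cur - prev
        if delta > tolerance:
            symbols.append("u")
        elif delta < -tolerance:
            symbols.append("d")
        else:
            symbols.append("s")
    return "".join(symbols)

def matches_shape(series, shape, tolerance=0.0):
    """True if the whole encoded series matches the shape (a regular expression)."""
    return re.fullmatch(shape, encode(series, tolerance)) is not None

# Hypothetical yearly counts of patents mentioning each phrase.
counts = {
    "phrase_a": [2, 3, 5, 9, 14],
    "phrase_b": [9, 7, 6, 6, 8],
    "phrase_c": [4, 4, 3, 5, 7],
}

# Shape of interest: any history that ends with two consecutive rises.
RECENT_UPSWING = r"[uds]*uu"
rising = sorted(p for p, s in counts.items() if matches_shape(s, RECENT_UPSWING))
print(rising)
```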
7 Discovering Micro-communities
Frequently co-cited pages are related. Pages
with large bibliographic overlap are related.
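The two relatedness signals above can be sketched on a small link graph: co-citation counts how many pages cite both members of a pair, and bibliographic overlap counts how many references two citing pages share. The graph and function names below are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(links):
    """links maps a page to the set of pages it cites.
    Returns, for each pair of cited pages, how many pages cite both."""
    counts = Counter()
    for cited in links.values():
        counts.update(combinations(sorted(cited), 2))
    return counts

def coupling_counts(links):
    """Returns, for each pair of citing pages, how many references they share."""
    counts = Counter()
    for a, b in combinations(sorted(links), 2):
        overlap = len(links[a] & links[b])
        if overlap:
            counts[(a, b)] = overlap
    return counts

# Hypothetical link graph: p1..p4 are citing pages, w..z are cited pages.
links = {
    "p1": {"x", "y", "z"},
    "p2": {"x", "y"},
    "p3": {"y", "z"},
    "p4": {"w"},
}
cocited = cocitation_counts(links)
coupled = coupling_counts(links)
print(cocited.most_common(2))
print(coupled.most_common(2))
```

Pairs with high counts under either measure are candidates for the same micro-community; the threshold for "frequently" or "large" is a tuning choice the slide leaves open.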
8 Technical Chasms
- Privacy concerns?
  - Privacy-preserving data mining
- Data for data mining?
  - Data mining over compartmentalized databases
9 Inducing Classifiers over Privacy-Preserved
Numeric Data
Alice's age
Alice's salary
John's age
30 becomes 65 (30 + 35)
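The randomization step in the example (30 disclosed as 65 after adding 35) can be sketched as adding independent noise to every value before it leaves the client. The uniform noise range of plus or minus 35, the seed, and the sample ages are assumptions for this sketch; the slide does not state the noise distribution.

```python
import random

def randomize(values, spread, rng):
    """Add independent uniform noise from [-spread, +spread] to each value."""
    return [v + rng.uniform(-spread, spread) for v in values]

rng = random.Random(0)            # fixed seed so the sketch is repeatable

ages = [30, 50, 41, 27]           # hypothetical true ages
disclosed = randomize(ages, spread=35, rng=rng)
print(disclosed)                  # only these perturbed values are shared

# Individual values are hidden, but aggregates survive: with many records
# the zero-mean noise largely averages out.
many_ages = [40] * 10000
noisy_mean = sum(randomize(many_ages, 35, rng)) / len(many_ages)
print(round(noisy_mean, 2))
```

The wider the noise range, the less a single disclosed value reveals, at the cost of needing more records to recover the distribution accurately.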
10 Reconstruction Algorithm
- f_X^0 := Uniform distribution
- j := 0
- repeat
  - f_X^{j+1}(a) := posterior estimate obtained from f_X^j via Bayes' Rule
  - j := j + 1
- until (stopping criterion met)
- Converges to maximum likelihood estimate.
  - D. Agrawal & C.C. Aggarwal, PODS 2001.
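The loop above can be sketched over a discretized domain. This version assumes the usual form of the update for additive noise, f_X^{j+1}(a) = (1/n) * sum_i [ f_Y(w_i - a) f_X^j(a) / sum_z f_Y(w_i - z) f_X^j(z) ], where the w_i are the randomized values and f_Y is the known noise density. The bin midpoints, the uniform noise, the stopping rule, and the data are choices made for this example and are not taken from the slide.

```python
import random

def uniform_noise_density(spread):
    """Density of noise drawn uniformly from [-spread, +spread]."""
    def f_y(d):
        return 1.0 / (2 * spread) if -spread <= d <= spread else 0.0
    return f_y

def reconstruct(randomized, midpoints, f_y, max_iters=200, tol=1e-6):
    """Estimate the probability mass of the original values over bins,
    given only the randomized values and the noise density f_y."""
    k = len(midpoints)
    est = [1.0 / k] * k                          # f_X^0 := uniform
    for _ in range(max_iters):
        new = [0.0] * k
        for w in randomized:
            # Bayes' rule: posterior over bins for this single observation.
            post = [f_y(w - a) * p for a, p in zip(midpoints, est)]
            total = sum(post)
            if total == 0.0:
                continue                         # no bin can explain w; skip it
            for i in range(k):
                new[i] += post[i] / total
        norm = sum(new)
        new = [v / norm for v in new]            # average of the posteriors
        change = sum(abs(a - b) for a, b in zip(new, est))
        est = new
        if change < tol:                         # stopping criterion (assumed)
            break
    return est

rng = random.Random(1)
midpoints = [25, 35, 45, 55]
original = [25] * 800 + [55] * 200               # hypothetical true ages
spread = 35
randomized = [x + rng.uniform(-spread, spread) for x in original]

estimate = reconstruct(randomized, midpoints, uniform_noise_density(spread))
print([round(p, 2) for p in estimate])
```

The estimate is a distribution over bins, not a recovery of any individual's value, which is what lets a classifier be trained on it without seeing the original records.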
11 Works Well
12 Accuracy vs. Randomization
13 Discovering frequent itemsets
Breach level: 50.
Soccer: s_min = 0.2
Mailorder: s_min = 0.2
14 Computation over Compartmentalized Databases
15 Some Hard Problems
- Past may be a poor predictor of the future
- Abrupt changes
- Wrong training examples
- Reliability and quality of data
- Actionable patterns (principled use of domain
knowledge?)
- Over-fitting vs. not missing the rare nuggets
- Richer patterns
- Simultaneous mining over multiple data types
- When to use which algorithm?
- Automatic, data-dependent selection of algorithm
parameters
16 Summary
- Data mining has shown promise but we need further
research to realize its full potential
"We stand on the brink of great new answers, but
even more, of great new questions" -- Matt Ridley