Title: Data Mining: Crossing the Chasm
1 Data Mining: Crossing the Chasm
- Rakesh Agrawal
- IBM Almaden Research Center
2 Thesis
- The greatest challenge facing data mining is to make the transition from an early-market technology to a mainstream technology
- We have the opportunity to make this transition successful
3 Outline
- Chasm in the technology adoption life cycle, à la Geoffrey Moore
- Experience with Quest/Intelligent Miner
- Ideas for successful chasm crossing
- Geoffrey A. Moore. Crossing the Chasm. Harper Business. http://www.chasmgroup.com
4 Technology Adoption Life Cycle
- Innovators (Techies): Try it!
- Early Adopters (Visionaries): Get ahead of the herd!
- Early Majority (Pragmatists): Stick with the herd!
- Late Majority (Conservatives): Hold on!
- Laggards (Skeptics): No way!
The psychographic profile of each group is different
5 Innovators: Technology Enthusiasts
- Intrigued by any fundamental advance in technology
- Like to alpha test new products
- Can ignore the missing elements
- Want access to top technologists
- Want no-profit pricing (preferably free)
Gatekeepers to early adopters
6 Early Adopters: Visionaries
- Driven by vision of dramatic competitive advantage via revolutionary breakthroughs
- Great imagination for strategic applications
- Not so price-sensitive
- Want rapid time to market
- Demand high degree of customization
Fund the development of the early market
7 Early Majority: Pragmatists
- Want sustainable productivity improvement through evolutionary change
- Astute managers of mission-critical apps
- Understand real-world issues and tradeoffs
- Focus on proven applications; want to see the solution in production
Bulwark of the mainstream market
8 Late Majority: Conservatives
- Want to stay even with the competition
- Risk averse
- Price sensitive
- Need completely pre-assembled solutions
Extend technology life cycles
9 Laggards: Skeptics
- Driven to maintain status quo
- Good at debunking marketing hype
- Disbelieve productivity-improvement arguments
- Can be formidable opposition to early adoption of a technology
Retard the development of high-tech markets
10 Crack in the curve
[Figure: adoption curve with a chasm separating the Early Market from the Mainstream Market]
The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists.
11 Visionaries vs. Pragmatists
Visionaries:
- Adventurous
- First strike capability
- Early buy-in
- State of the art
- Think big
- Spend big
Pragmatists:
- Prudent
- Staying power
- Wait-and-see
- Industry standard
- Manage expectations
- Spend to budget
12 Is data mining following this curve?
- Yes!!!
- My personal viewpoint based on Quest/Intelligent Miner experience
13 Quest
- Started as a skunkworks project in the early nineties
- Inspired by needs articulated by industry visionaries
- Transaction data collected over a long period
- Current tools/SQL don't cut it
- About ready to throw the data away
14 Approach
- Examine real applications
- Identify operations that cut across applications
- Design fast, scalable algorithms for each operation
- Develop applications by composing operations
15 Operations
New operations (completeness, scalability):
- Associations
- Sequential patterns
- Similar time series
Adopted from statistics/learning (scalability):
- Classification
- Clustering
- Deviations
http://www.almaden.ibm.com/cs/quest
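To make the Associations operation and the notion of completeness concrete, here is a minimal sketch. It is not the Quest algorithm: it enumerates itemsets by brute force, which is complete (every rule meeting the thresholds is found) but not scalable. The function names, the `max_size` cap, and the sample transactions are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for every itemset up to max_size items
    whose support (fraction of transactions containing it) is at least
    min_support. Brute-force counting: complete, but not scalable."""
    n = len(transactions)
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for every rule lhs -> rhs
    derived from a frequent itemset with confidence >= min_confidence."""
    rules = []
    for itemset, support in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(itemset, k):
                rhs = tuple(x for x in itemset if x not in lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so the lookup of lhs always succeeds.
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical market-basket data.
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
frequent = frequent_itemsets(baskets, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.9)
```

With these four baskets, the only rule that clears both thresholds is butter -> bread: butter appears in half the baskets, and always together with bread.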
16 Bringing Quest to market
- Visionaries who inspired Quest did not become first customers
- Wanted evidence that the technology worked
- Frustrating attempts to interest major IBM customers
- Integration with existing applications
- Too-far-out technology
- Resistance from in-house analytic groups
17 First hits
- Small information-based companies who provided data in exchange for free results
- CIO who wanted to be seen as the technology pioneer in his industry
- CIO who wanted the success story to feature in the company's annual report
Led to the formation of a group offering services using Quest
18 Characteristics of engagements
- Mostly associations and sequential patterns
- Completeness a big plus
- Unanticipated uses
- Feedback for further development
19 Into the product land
- Formation of a small out-of-plan product group to productize Quest
- Facilitated by a closet mathematician
- Successes of the services group used for market validation
- Continued development and infusion of technology
20 Intelligent Miner
- Serious product
- Integrates technologies from various groups
- Fast, scalable, runs on multiple platforms
- Several early market success stories
http://www.software.ibm.com/data/iminer/
21 Are we in the chasm?
- Perceived to be sophisticated technology, usable only by specialists
- Long, expensive projects
- Stand-alone, loosely coupled with data infrastructures
- Difficult to infuse into existing mission-critical applications
22 Chasm Crossing
- Personal speculations on some technical challenges
- Do not imply IBM research/product directions
23 XML-based Data Mining Standard (1)
- Model Building
- A pair of standard DTDs for each operation
- Interchangeable library of operator implementations
[Diagram: Data Specs and Parameters (Standard DTD) -> Operator (Library) -> Model (Standard DTD)]
Ack: Mattos, Pirahesh, Schwenkries
24 XML-based Data Mining Standard (2)
- Model Deployment
- Mapping XML object provides the mapping between the names and formats in the model object and those in the data record
- Model could have been developed on a different system
[Diagram: Data Record, Mapping, and Model (Standard DTDs) -> Application (Library) -> Result (Standard DTD)]
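The deployment flow above can be sketched as follows, under loud assumptions: the slides do not define the DTDs, so the element and attribute names (`Model`, `Rule`, `Default`, `Mapping`, `Field`), the one-rule classifier, and the record field `ANNUAL_INCOME` are all hypothetical. The point is only that the mapping object lets a model built elsewhere be applied to a record with different field names and formats.

```python
import xml.etree.ElementTree as ET

# Hypothetical model document, possibly produced on a different system.
MODEL_XML = """
<Model operation="classification">
  <Rule field="income" greaterThan="50000" label="accept"/>
  <Default label="review"/>
</Model>
"""

# Hypothetical mapping document: model field name -> record field name
# and the type to which the record value must be converted.
MAPPING_XML = """
<Mapping>
  <Field model="income" record="ANNUAL_INCOME" type="float"/>
</Mapping>
"""

CASTS = {"float": float, "int": int, "string": str}


def load_mapping(mapping_xml):
    """Parse the mapping into {model_field: (record_field, cast)}."""
    root = ET.fromstring(mapping_xml.strip())
    return {
        field.get("model"): (field.get("record"), CASTS[field.get("type")])
        for field in root.findall("Field")
    }


def apply_model(model_xml, mapping_xml, record):
    """Apply the first matching rule of the model to one data record."""
    model = ET.fromstring(model_xml.strip())
    mapping = load_mapping(mapping_xml)
    for rule in model.findall("Rule"):
        record_field, cast = mapping[rule.get("field")]
        value = cast(record[record_field])
        if value > float(rule.get("greaterThan")):
            return rule.get("label")
    return model.find("Default").get("label")


high = apply_model(MODEL_XML, MAPPING_XML, {"ANNUAL_INCOME": "72000"})
low = apply_model(MODEL_XML, MAPPING_XML, {"ANNUAL_INCOME": "30000.5"})
```

Because the application reads only the model and the mapping, swapping in a model from another vendor's operator library would, under this sketch, require a new mapping document but no code change.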
25 Implications
- Standard interfaces for application developers to incorporate data mining
- Coupling with relational databases
- Mappings from DTDs to relational schemas
- Implementation using existing infrastructure
26 Data Mining Benchmarks
- UC Irvine repository
- Generating synthetic benchmarks modeled after real data sets is a hard problem
- How to map names into meaningful literals
- How to preserve empirical distributions
Ack: Srikant, Ullman
27 Auto-focus data mining
- Automatic parameter tuning
- Automatic algorithm selection (à la join method selection in database query optimization)
Ack: Andreas Arning
28 Web: Greatest opportunity
- Huge collection of data (e.g., Yahoo collecting 50 GB every day)
- Universal digital distribution medium makes data mining results actionable in fundamentally new ways
- But watch for the privacy pitfall
29 Privacy-preserving data mining
- Technical vs. legislated solutions
- Implication for data mining algorithms when some fields of a data record have been fudged according to the user's privacy sensitivity
Ack: R. Srikant
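One way to picture the implication for algorithms is the sketch below. It is an illustration under stated assumptions, not a method taken from the slides: a numeric field is fudged by adding zero-mean uniform noise whose half-width stands in for the user's privacy sensitivity, and the miner, knowing only the noise distribution, corrects aggregate statistics instead of trusting individual values. The function names and the synthetic age data are hypothetical.

```python
import random
import statistics


def fudge(values, half_width, rng):
    """Add independent uniform noise in [-half_width, +half_width]
    to each value; individual records are no longer trustworthy."""
    return [v + rng.uniform(-half_width, half_width) for v in values]


def estimate_mean_and_variance(fudged, half_width):
    """Estimate the mean and variance of the original values.

    The noise has zero mean, so the sample mean needs no correction.
    Its variance, half_width**2 / 3 for a uniform distribution, is
    subtracted from the variance of the fudged values."""
    noise_variance = half_width ** 2 / 3
    mean = statistics.fmean(fudged)
    variance = statistics.pvariance(fudged) - noise_variance
    return mean, max(variance, 0.0)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_ages = [rng.gauss(40, 10) for _ in range(20000)]  # synthetic data
fudged_ages = fudge(true_ages, half_width=20, rng=rng)
est_mean, est_var = estimate_mean_and_variance(fudged_ages, half_width=20)
```

An algorithm that only needs such aggregates can still work on fudged data; one that depends on exact per-record values cannot, which is the kind of implication the slide raises.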
30 Personalization
- Internet might provide, for the first time, the tools necessary for users to capture information about themselves and to selectively release this information
- Will we be providing these tools?
- John Hagel, Marc Singer. Net Worth. Harvard Business School Press.
31 What about Association Rules?
- Very long patterns
- Separating wheat from chaff
- Principled introduction of domain knowledge
32 What else?
- Formal foundations of data mining
33 Summary
- Closely couple data mining with database systems
- Embed data mining into applications
- Focus on web
- Standard interfaces
- Benchmarks
- Auto-focusing
- Personalization
- Privacy
34 Concluding remarks
- Data mining, a great technology
- Combination of intriguing theoretical questions with large commercial interest in the technology
- Poised for transitioning into mainstream technology
- Will we rise to the challenge as a community?
35 Acknowledgments