Title: MASS COLLABORATION AND DATA MINING
1 MASS COLLABORATION AND DATA MINING
- Raghu Ramakrishnan
- Founder and CTO, QUIQ
- Professor, University of Wisconsin-Madison
- Keynote Talk, KDD 2001, San Francisco
2 DATA MINING
Extracting actionable intelligence from large datasets
- Is it a creative process requiring a unique combination of tools for each application?
- Or is there a set of operations that can be composed using well-understood principles to solve most target problems?
- Or perhaps there is a framework for addressing large classes of problems that allows us to systematically leverage the results of mining.
3 MINING APPLICATION CONTEXT
- Scalability is important.
- But when is 2x speed-up or scale-up important? When is 10x unimportant?
- What is the appropriate measure, model?
- Recall, precision
- MT for search vs. MT for content conversion
Answers to these questions come from the context of the application.
4 TALK OUTLINE
- A New Approach to Customer Support
- Mass Collaboration
- Technical challenges
- A framework and infrastructure for P2P knowledge capture and delivery
- Role of data mining
- Confluence of DB, IR, and mining
5 TYPICAL CUSTOMER SUPPORT
Web Support KB
Customer
Support Center
6 TRADITIONAL KNOWLEDGE MANAGEMENT
QUESTION
KNOWLEDGE BASE
ANSWER
EXPERTS
Knowledge created and structured by trained
experts using a rigorous process.
CONSUMERS
7 MASS COLLABORATION
QUESTION
KNOWLEDGE BASE
People using the web to share knowledge and help
each other find solutions
SELF SERVICE
Answer added to power self service
MASS COLLABORATION
ANSWER
-Experts -Partners -Customers -Employees
8 TIMELY ANSWERS
77% of answers are provided within 24h
- No effort to answer each question
- No added experts
- No monetary incentives for enthusiasts
[Chart: 6,845 questions, 74% answered]
- Answers provided in 3h: 40% (2,057)
- Answers provided in 12h: 65% (3,247)
- Answers provided in 24h: 77% (3,862)
- Answers provided in 48h: 86% (4,328)
9 MASS CONTRIBUTION
Users who on average provide only 2 answers provide 50% of all answers
[Chart: answers vs. contributing users]
- Answers: 100% (6,718), of which 50% (3,329) are contributed by the mass of users
- Contributing users: top users 7% (120), mass of users 93% (1,503)
10 POWER OF KNOWLEDGE CREATION
[Figure: support incidents pass through SHIELD 1 and SHIELD 2 (knowledge creation) before becoming agent cases]
- Self-Service: -85%
- Customer Mass Collaboration: -64%
- Support Incidents to Agent Cases: 5%
Averages from QUIQ implementations
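The two reductions on this slide compound. A minimal sketch, assuming the figures are percentage reductions applied in sequence to 100 incoming support incidents; the function name is illustrative and not part of the talk:

```python
def remaining_after_shields(incidents, reductions):
    """Apply each fractional reduction in sequence and return what is left."""
    remaining = float(incidents)
    for r in reductions:
        remaining *= (1.0 - r)
    return remaining

# Assumption: self-service removes 85% of incidents, and mass
# collaboration removes 64% of whatever is left after that.
agent_cases = remaining_after_shields(100, [0.85, 0.64])
print(round(agent_cases, 1))
```

Under these assumptions roughly 5.4 of every 100 incidents reach an agent, which is consistent with the 5 shown on the slide.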
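11 TYPICAL SERVICE CHAIN
[Figure: typical service chain, split 40% / 50% / 10%. Channels shown: Self Service Knowledge base, FAQ, Auto Email, Manual Email, Chat, Call Center, 2nd Tier Support]
QUIQ SERVICE CHAIN
[Figure: QUIQ service chain, split 80% / 15% / 5%. Channels shown: Self Service, Mass Collaboration, Manual Email, Chat, Call Center, 2nd Tier Support]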
12 CASE STUDIES: COMPAQ
"In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That's how we found QUIQ. It's exactly the philosophy we're looking for."
"Tech support people can't keep up with generating content and are not experts on how to effectively utilize the product. Mass Collaboration is the next step in Customer Service."
Steve Young, VP of Customer Care, Compaq
13 ASP 2001 Top Ten Support Site
"Austin-based National Instruments deployed a Network to capture the specialized knowledge of its clients and take the burden off its costly support engineers, and is pleased with the results. QUIQ increased customers' participation, flattened call volume and continues to do the work of 50 support engineers."
David Daniels, Jupiter Media Metrix
14 MASS COLLABORATION
Internet-scale P2P knowledge sharing
Communities Knowledge Management Service
Workflows
Mass Collaboration
Many Experts
Support Newsgroups
Few Experts
Support Knowledge Base
Call Center
Solutions
Interactions
15 CORPORATE MEMORY
Untapped Knowledge in Extended Business Community
16 User-to-User Exchange
User-to-Enthusiast
Structured User Forum
User-to-Expert
Self-Organizing
Incentive to Participate
User Acquisition
Web Site
Areas of Interest
17 GOALS AND ISSUES
- Interactions must be structured to encourage creation of solutions
- Resolve issue; escalate if necessary
- Capture knowledge from interactions
- Encourage participation
- Sociology
- Privacy, security
- Credibility, authority, history
- Accountability, incentives
18 REQUIRED CAPABILITIES
- Roles: Credibility, administration
- Moderators, experts, editors, enthusiasts
- Groups: Privacy, security, entitlements
- Departments, gold customers
- Workflow: QoS, validation, escalation
19 TECHNICAL CHALLENGES
20 SEARCHING PEOPLE-BASES
ROUTING, NOTIFICATION
?
SEARCH
If it's not there, find someone who knows, and get it there (knowledge creation)!
21 QUIQ, the Best in Class Support Channel
[Figure: support incidents reduced to agent cases, compared across Email Support / Call Center, Web Self-Service, and Mass Collaboration (Knowledge Creation). Labeled reductions: Automated Emails 1) -20% (100 to 80); Self-Service 2) -42%; Self-Service -85%; Customer Mass Collaboration -64%. Remaining values shown: 68 and 5.]
1) Source: QUIQ Client Information
2) Source: Association of Support Professionals
22 SEARCH AND INDEXING
- User types in "How can I configure the IP address on my Presario?"
- Need to find the most relevant content that is of high quality, is approved for external viewing, and that this user is entitled to see based on her roles, groups, and service levels.
- User decides to post the question because no good answer was found in the KB.
- Search controls when experts and other users will see this new question; need to make this real-time.
- Concurrency, recovery issues!
23 SEARCH AND INDEXING
- Data is organized into tabular channels
- Questions, responses, users, ...
- Each item has several fields, e.g., a question:
- Author id, author status, service level, item
popularity metrics, rating metrics, answer
status, approval status, visibility group, update
timestamp, notification timestamp, usage
signature, category, relevant products, relevant
problems, subject, body, responses
Which 5 items should be returned?
24 RUNTIME ARCHITECTURE
Web server
Web server
Hive Manager
Email
Real-time Indexing, Caching, Alerts
Cache
Alerts
Indexer
Files, Logs
DBMS
Warehouse
RAID STORAGE
25 LEARNING FROM ACTIVITY DATA TO KNOWLEDGE
Periodic offline activity
Miner
Indexer
Large R/W
Small reads
Files, Logs
DBMS
Warehouse
RAID STORAGE
26 SEARCH AND INDEXING
Which 5 items should be returned?
- Question text, user attributes, system policies
- IR-style ranked output
- Search constraints
- Show matches; subject match twice as important
- Show only approved answers to non-editors
- Give preference to category Laptop
- Give preference to recent solutions
- Weight quality of solution
27 VECTOR SPACE MODEL
- Documents, queries are vectors in term space
- Vector distance from the query is used to rank retrieved documents
$Q_1 = (w_{11}, w_{12}, \ldots, w_{1t})$
$D_2 = (w_{21}, w_{22}, \ldots, w_{2t})$
$sim(Q_1, D_2) = \sum_{i=1}^{t} w_{1i} \, w_{2i}$ (unnormalized)
- The ith term in the summation can be seen as the relevance contribution of term i
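A minimal sketch of the unnormalized similarity described on this slide, computed as the inner product of two term-weight vectors. The function names and the example weights are illustrative assumptions, not values from the talk:

```python
def term_contributions(q, d):
    """Per-term relevance contributions w_1i * w_2i."""
    if len(q) != len(d):
        raise ValueError("vectors must share the same term space")
    return [wq * wd for wq, wd in zip(q, d)]

def unnormalized_sim(q, d):
    """Inner product of a query vector and a document vector."""
    return sum(term_contributions(q, d))

# Hypothetical weights over a three-term space.
query = [1.0, 0.0, 2.0]
docs = {
    "doc_a": [0.5, 3.0, 1.0],
    "doc_b": [0.0, 1.0, 0.2],
}

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda name: unnormalized_sim(query, docs[name]),
                reverse=True)
print(ranked)
```

Because the score is a plain sum, each entry of `term_contributions` shows how much a single term adds to the ranking.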
28 TF-IDF DOCUMENT VECTOR
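As a hedged illustration of a TF-IDF document vector, the sketch below uses one common weighting, raw term frequency times log(N / document frequency). This variant is an assumption; the talk may have used a different normalization. The function name and sample documents are hypothetical:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, weight vectors)."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: tf * log(N / df); terms absent from doc get 0.
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

docs = [
    "configure ip address presario".split(),
    "presario battery replacement".split(),
    "configure printer driver".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Terms that occur in fewer documents, such as "ip" here, receive a larger weight than terms shared by several documents, such as "presario".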
29 A HYBRID DB-IR SYSTEM
- Searches are queries with three parts
- Filter
- DB-style yes/no criteria
- Match
- TF-IDF relevance based on a combination of fields
- Quality
- Relevance boost based on a policy
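The three parts can be sketched as three functions applied in turn. This is an illustration under stated assumptions, not the QUIQ implementation: the `Search` class, the field names, the term-overlap score standing in for TF-IDF, and the multiplicative boost are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Item = Dict[str, object]

@dataclass
class Search:
    filter: Callable[[Item], bool]    # DB-style yes/no criteria
    match: Callable[[Item], float]    # relevance over a combination of fields
    quality: Callable[[Item], float]  # policy-based relevance boost

def run(search: Search, items: List[Item], k: int = 5) -> List[object]:
    """Filter, score, boost, and return the ids of the top k items."""
    scored = [(search.match(it) * search.quality(it), it["id"])
              for it in items if search.filter(it)]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]

query_terms = {"configure", "ip", "presario"}

def overlap_match(it: Item) -> float:
    # Stand-in for TF-IDF: count shared terms, subject counted twice.
    subject = set(str(it["subject"]).split())
    body = set(str(it["body"]).split())
    return 2 * len(query_terms & subject) + len(query_terms & body)

search = Search(
    filter=lambda it: bool(it["approved"]),
    match=overlap_match,
    quality=lambda it: 1.5 if it["category"] == "Laptop" else 1.0,
)

items = [
    {"id": 1, "subject": "configure ip address", "body": "presario network setup",
     "approved": True, "category": "Laptop"},
    {"id": 2, "subject": "ip address", "body": "configure ip on a desktop",
     "approved": True, "category": "Desktop"},
    {"id": 3, "subject": "configure ip address presario", "body": "",
     "approved": False, "category": "Laptop"},
]
top = run(search, items)
```

Keeping the filter separate from the scoring means an unapproved item is excluded outright, however well it matches.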
30 A HYBRID DB-IR SYSTEM
- A query is built up from atomic constraints using Boolean operators.
- Atomic constraint
- value op term, constraint-type
- Terms are drawn from discrete domains and are of two types: hierarchy and scalar
- Constraint-type is exact or approximate
31 A HYBRID DB-IR SYSTEM
- Applying an atomic constraint to a set of items returns a tagged result set
- The result inherits the constraint-type
- Each result item has a (TF-IDF) relevance score (0 for exact)
- Combining two tagged item sets using Boolean operators yields a tagged set
- The result type is exact if both inputs are exact, and approximate otherwise
- Result contains intersection of input item sets if either input is exact; union otherwise
- Each result item is tagged with a combined relevance
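The combination rule above can be sketched for a single binary step as follows. The `TaggedSet` class is hypothetical, and summing the two scores is an assumption, since the slide does not say how the combined relevance is computed:

```python
from dataclasses import dataclass
from typing import Dict

EXACT = "exact"
APPROX = "approximate"

@dataclass
class TaggedSet:
    kind: str                 # EXACT or APPROX
    scores: Dict[int, float]  # item id -> relevance (0 for exact matches)

def combine(a: TaggedSet, b: TaggedSet) -> TaggedSet:
    """Combine two tagged item sets following the rule on the slide."""
    # Result type: exact only if both inputs are exact.
    kind = EXACT if a.kind == EXACT and b.kind == EXACT else APPROX
    # Item set: intersection if either input is exact, union otherwise.
    if a.kind == EXACT or b.kind == EXACT:
        ids = a.scores.keys() & b.scores.keys()
    else:
        ids = a.scores.keys() | b.scores.keys()
    # Assumption: combined relevance is the sum of the input scores.
    scores = {i: a.scores.get(i, 0.0) + b.scores.get(i, 0.0) for i in ids}
    return TaggedSet(kind, scores)

approved = TaggedSet(EXACT, {1: 0.0, 2: 0.0, 4: 0.0})
subject = TaggedSet(APPROX, {1: 0.8, 3: 0.5})
body = TaggedSet(APPROX, {2: 0.3, 3: 0.4})

text = combine(subject, body)     # both approximate: union
result = combine(approved, text)  # one exact input: intersection
```

In this example the exact constraint acts as a hard filter on the approximate text matches, while leaving their relevance scores unchanged.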
32 A HYBRID DB-IR SYSTEM
- Semantics of Boolean expressions over constraints are associative and commutative
- Evaluating exact constraints and approximate constraints separately (in DB and IR subsystems) is a special case. Additionally:
- Uniform handling of relevance contributions of categories, popularity metrics, recency, etc.
- Absolute and relative relevance modifiers can be introduced for greater flexibility.
33 CONCURRENCY, RECOVERY, PARALLELISM
- Concurrency
- Index is updated in real-time
- Automatic partitioning, two-step locking protocol result in very low overhead
- Relies upon post-processing to address some anomalies
- Recovery
- Partitioning is again the key
- Leverages recovery guarantees of DBMS
- Approach also supports efficient refresh of global statistics
- Parallelism
- Hash based partitioning
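A minimal sketch of hash-based partitioning of index items. The talk does not describe the actual scheme; the choice of SHA-1, the string item ids, and the function names are assumptions made for illustration:

```python
import hashlib

def partition_of(item_id: str, num_partitions: int) -> int:
    """Map an item id to a partition using a stable hash."""
    digest = hashlib.sha1(item_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def partition_items(item_ids, num_partitions):
    """Group item ids by partition so each group can be indexed independently."""
    parts = [[] for _ in range(num_partitions)]
    for item_id in item_ids:
        parts[partition_of(item_id, num_partitions)].append(item_id)
    return parts

parts = partition_items([f"q{i}" for i in range(100)], 4)
print([len(p) for p in parts])
```

A stable hash is used instead of Python's built-in `hash()`, whose values for strings vary between processes by default, so that an item maps to the same partition after a restart.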
34 NOTIFICATION
- Extension of search: Each user can define one or more standing searches, and request instant or periodic notification.
- Boolean combinations of atomic constraints.
- Major challenges
- Scaling with number of standing searches.
- Requires multiple timestamps, indexing searches.
- Exactly-once delivery property.
- Many subtleties center around notifiability of updates!
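A simplified sketch of a standing search that uses timestamps to avoid notifying a user twice about the same update. The classes, fields, and the single per-search watermark are assumptions for illustration. The sketch ignores crashes, late-arriving updates, and the indexing of searches, which are among the subtleties the slide alludes to:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Item:
    id: int
    category: str
    updated_at: int  # logical update timestamp

@dataclass
class StandingSearch:
    user: str
    predicate: Callable[[Item], bool]
    last_notified_at: int = 0  # updates at or before this were already handled

def notify(search: StandingSearch, items: List[Item]) -> List[int]:
    """Return ids of matching items updated since the last notification."""
    due = [it for it in items
           if it.updated_at > search.last_notified_at and search.predicate(it)]
    if due:
        # Advance the watermark so the same updates are not delivered again.
        search.last_notified_at = max(it.updated_at for it in due)
    return [it.id for it in due]

items = [Item(1, "Laptop", 10), Item(2, "Printer", 11), Item(3, "Laptop", 12)]
laptops = StandingSearch("alice", lambda it: it.category == "Laptop")

first = notify(laptops, items)   # new matches are delivered
second = notify(laptops, items)  # nothing new, so nothing is delivered
items.append(Item(4, "Laptop", 13))
third = notify(laptops, items)   # only the newly added item
```

Calling `notify` on a schedule gives periodic notification; calling it on every update gives instant notification.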
35 ROLE OF DATA MINING
36 DATA MINING TASKS
- There is a lot of insight to be gained by analyzing the data.
- What will help the user with her problem?
- Who does a given user trust?
- Characteristic metrics for high-quality content.
- Identify helpful content in similar, past queries.
- Summarize content.
- Who can answer this question?
37 LEVERAGING DATA MINING
- How do we get at the data?
- Relevant information is distributed across several sources, not just the DBMS.
- Aggregated in a warehouse.
- How do we incorporate the insights obtained by mining into the search phase?
- Need to constantly update info about every piece of content (Qs, As, users)
38 LEVERAGING DATA MINING
- Three-step approach
- Off-line analysis to gather new insight
- Periodically refresh indexes
- Use insight (from KB/index) to improve search using the extended DB/IR query framework
Use mining to create useful metadata
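The three steps can be sketched end to end as below. Everything here is a hypothetical stand-in: the activity log of (item id, helpful vote) pairs, the helpful-vote ratio as the mined quality metric, and the multiplicative boost at search time are assumptions, not details from the talk.

```python
from collections import defaultdict

def mine_quality(activity_log):
    """Step 1 (off-line): derive a quality score per item from rating events."""
    totals = defaultdict(lambda: [0, 0])  # item id -> [helpful votes, all votes]
    for item_id, helpful in activity_log:
        totals[item_id][1] += 1
        if helpful:
            totals[item_id][0] += 1
    return {item_id: h / n for item_id, (h, n) in totals.items()}

def refresh_index(index, quality):
    """Step 2 (periodic): copy the mined metadata into the search index."""
    for item_id, score in quality.items():
        if item_id in index:
            index[item_id]["quality"] = score

def search(index, text_scores):
    """Step 3 (online): boost text relevance by the mined quality metadata."""
    def boosted(item_id):
        return text_scores[item_id] * (1.0 + index[item_id].get("quality", 0.0))
    return sorted(text_scores, key=boosted, reverse=True)

index = {1: {}, 2: {}}
text_scores = {1: 1.0, 2: 0.8}
log = [(1, True), (1, False), (2, True), (2, True)]

before = search(index, text_scores)  # ranking by text relevance alone
refresh_index(index, mine_quality(log))
after = search(index, text_scores)   # ranking after the metadata refresh
```

The online search only reads metadata already stored in the index, so the expensive analysis stays in the periodic off-line step.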
39 SOME UNIQUE TWISTS
- Identify the kinds of feedback that would be helpful in refining a search.
- I.e., not just specific terms, but the types of concepts that would be useful discriminators (e.g., a good hierarchy of feedback concepts)
- Metrics of quality
- Link-analysis is a good example, but what are the links here?
- Self-tuning searches
- The more the knobs, the more the choices
- Next step: self-personalizing searches?
40 CONCLUSIONS
41 CONFLUENCES
IR SEARCH
?
DB QUERIES
P2P KM