Transcript and Presenter's Notes

Title: MASS COLLABORATION AND DATA MINING


1
MASS COLLABORATION AND DATA MINING
  • Raghu Ramakrishnan
  • Founder and CTO, QUIQ
  • Professor, University of Wisconsin-Madison
  • Keynote Talk, KDD 2001, San Francisco

2
DATA MINING
Extracting actionable intelligence from large
datasets
  • Is it a creative process requiring a unique
    combination of tools for each application?
  • Or is there a set of operations that can be
    composed using well-understood principles to
    solve most target problems?
  • Or perhaps there is a framework for addressing
    large classes of problems that allows us to
    systematically leverage the results of mining.

3
MINING APPLICATION CONTEXT
  • Scalability is important.
  • But when is 2x speed-up or scale-up important?
    When is 10x unimportant?
  • What is the appropriate measure, model?
  • Recall, precision
  • MT for search vs. MT for content conversion

Answers to these questions come from the context
of the application.
4
TALK OUTLINE
  • A New Approach to Customer Support
  • Mass Collaboration
  • Technical challenges
  • A framework and infrastructure for P2P knowledge
    capture and delivery
  • Role of data mining
  • Confluence of DB, IR, and mining

5
TYPICAL CUSTOMER SUPPORT
Web Support KB
Customer
Support Center
6
TRADITIONAL KNOWLEDGE MANAGEMENT
QUESTION
KNOWLEDGE BASE
ANSWER
EXPERTS
Knowledge created and structured by trained
experts using a rigorous process.
CONSUMERS
7
MASS COLLABORATION
[Diagram: a QUESTION goes to the community (experts, partners, customers, employees); the ANSWER is added to the KNOWLEDGE BASE to power SELF SERVICE]
People using the web to share knowledge and help each other find solutions.
8
TIMELY ANSWERS
77% of answers are provided within 24h
  • No effort to answer each question
  • No added experts
  • No monetary incentives for enthusiasts

[Chart: of 6,845 questions, 74% were answered; of those answers, 40% (2,057) arrived within 3h, 65% (3,247) within 12h, 77% (3,862) within 24h, and 86% (4,328) within 48h]
9
MASS CONTRIBUTION
Users who on average provide only 2 answers provide 50% of all answers
[Chart: of 6,718 answers, half (3,329) came from the top 7% of contributing users (120); the mass of users, the other 93% (1,503), contributed the remaining 50%]
10
POWER OF KNOWLEDGE CREATION
[Chart: knowledge creation shields support; self-service deflects 85%* of support incidents and customer mass collaboration a further 64%*, leaving about 5% as agent cases]
*) Averages from QUIQ implementations
11
TYPICAL SERVICE CHAIN
[Chart: roughly 40% of incidents are handled by self-service (knowledge base, FAQ, auto email), 50% by manual email, chat, and the call center, and 10% by 2nd tier support]

QUIQ SERVICE CHAIN
[Chart: roughly 80% of incidents are handled by QUIQ self-service and mass collaboration, 15% by manual email, chat, and the call center, and 5% by 2nd tier support]
12
CASE STUDIES: COMPAQ
"In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That's how we found QUIQ. It's exactly the philosophy we're looking for."
"Tech support people can't keep up with generating content and are not experts on how to effectively utilize the product."
"Mass Collaboration is the next step in Customer Service."
Steve Young, VP of Customer Care, Compaq
13
ASP 2001 Top Ten Support Site
Austin-based National Instruments deployed a Network to capture the specialized knowledge of its clients and take the burden off its costly support engineers, and is pleased with the results. "QUIQ increased customer participation, flattened call volume, and continues to do the work of 50 support engineers."

David Daniels, Jupiter Media Metrix
14
MASS COLLABORATION
Internet-scale P2P knowledge sharing
[Diagram: channels positioned by number of experts (few vs. many) and output (interactions vs. solutions): the call center and support knowledge base draw on few experts; support newsgroups and mass collaboration draw on many; mass collaboration pairs many experts with solutions, supported by communities, knowledge management, and service workflows]
15
CORPORATE MEMORY: Untapped Knowledge in the Extended Business Community
16
[Diagram: a structured user forum evolves from user-to-expert to user-to-enthusiast to user-to-user exchange; it is self-organizing, with incentives to participate, user acquisition through the web site, and defined areas of interest]
17
GOALS & ISSUES
  • Interactions must be structured to encourage
    creation of solutions
  • Resolve the issue; escalate if necessary
  • Capture knowledge from interactions
  • Encourage participation
  • Sociology
  • Privacy, security
  • Credibility, authority, history
  • Accountability, incentives

18
REQUIRED CAPABILITIES
  • Roles: credibility, administration
  • Moderators, experts, editors, enthusiasts
  • Groups: privacy, security, entitlements
  • Departments, gold customers
  • Workflow: QoS, validation, escalation

19
TECHNICAL CHALLENGES
20
SEARCHING PEOPLE-BASES
ROUTING, NOTIFICATION
SEARCH
If it's not there, find someone who knows, and get it there (knowledge creation)!
21
QUIQ, the Best-in-Class Support Channel
[Chart: four support configurations compared by the share of support incidents that become agent cases: email support and the call center alone (100%); adding automated emails (-20%, leaving 80%)¹; adding web self-service (-42%)²; and QUIQ knowledge creation, where self-service (-85%) and customer mass collaboration (-64%) leave about 5% as agent cases]
1) Source: QUIQ client information. 2) Source: Association of Support Professionals
22
SEARCH AND INDEXING
  • User types in "How can I configure the IP address
    on my Presario?"
  • Need to find the most relevant content that is of
    high quality, is approved for external viewing,
    and that this user is entitled to see based on
    her roles, groups, and service levels.
  • User decides to post the question because no good
    answer was found in the KB.
  • Search controls when experts and other users will
    see this new question; we need to make this
    real-time.
  • Concurrency, recovery issues!

23
SEARCH AND INDEXING
  • Data is organized into tabular channels
  • Questions, responses, users, ...
  • Each item has several fields; e.g., a question has:
  • Author id, author status, service level, item
    popularity metrics, rating metrics, answer
    status, approval status, visibility group, update
    timestamp, notification timestamp, usage
    signature, category, relevant products, relevant
    problems, subject, body, responses

Which 5 items should be returned?
24
RUNTIME ARCHITECTURE
[Diagram: web servers and a Hive Manager handle requests and email; real-time indexing, caching, and alerts are driven by an indexer with a cache and an alerts component; data is persisted in files/logs, a DBMS, and a warehouse on RAID storage]
25
LEARNING FROM ACTIVITY: DATA TO KNOWLEDGE
Periodic offline activity
[Diagram: a miner performs large reads and writes against the files/logs, DBMS, and warehouse on RAID storage; the indexer performs small reads]
26
SEARCH AND INDEXING
Which 5 items should be returned?
  • Question text, user attributes, system policies
  • IR-style ranked output
  • Search constraints:
  • Show matches; a subject match is twice as important
  • Show only approved answers to non-editors
  • Give preference to category "Laptop"
  • Give preference to recent solutions
  • Weight the quality of the solution
27
VECTOR SPACE MODEL
  • Documents, queries are vectors in term space
  • Vector distance from the query is used to rank
    retrieved documents

Q = (w11, w12, ..., w1t)
D = (w21, w22, ..., w2t)

sim(Q, D) = Σ (i = 1 to t) w1i · w2i   (unnormalized)

The ith term in the summation can be seen as the relevance contribution of term i.
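As a concrete illustration, the unnormalized similarity above can be sketched in a few lines of Python over sparse term-weight vectors (the terms and weights below are made up for illustration):

```python
def similarity(query_vec, doc_vec):
    """Unnormalized vector-space score: sum over terms of w1i * w2i.

    Vectors are sparse dicts mapping term -> weight; terms missing
    from either vector contribute 0 to the sum.
    """
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

# Hypothetical query and document vectors
q = {"configure": 1.0, "ip": 2.0, "address": 1.5}
d = {"ip": 0.5, "address": 1.0, "presario": 0.8}
print(similarity(q, d))  # 2.0*0.5 + 1.5*1.0 = 2.5
```

Each product in the sum is exactly the per-term relevance contribution the slide mentions.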
28
TF-IDF DOCUMENT VECTOR
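The slide's figure is not in the transcript; as a hedged sketch, the standard TF-IDF weighting the title refers to (term frequency times log inverse document frequency) can be computed as follows. The toy corpus is invented for illustration:

```python
import math

def tf_idf_vectors(docs):
    """Weight each term t in document d as tf(t, d) * log(N / df(t)),
    where N is the number of documents and df(t) is the number of
    documents containing t."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1  # raw term frequency
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors

# Toy corpus: "laptop" occurs in 2 of 3 docs, so its idf is
# log(3/2) ≈ 0.405; rarer terms like "battery" get higher weight
docs = [["laptop", "ip", "address"],
        ["laptop", "battery"],
        ["ip", "address", "router"]]
vecs = tf_idf_vectors(docs)
```

Terms that appear in many documents get a small idf and thus a small weight, which is what makes common words poor discriminators in the ranked search described above.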
29
A HYBRID DB-IR SYSTEM
  • Searches are queries with three parts
  • Filter
  • DB-style yes/no criteria
  • Match
  • TF-IDF relevance based on a combination of fields
  • Quality
  • Relevance boost based on a policy

30
A HYBRID DB-IR SYSTEM
  • A query is built up from atomic constraints using
    Boolean operators.
  • Atomic constraint:
  • value op term, constraint-type
  • Terms are drawn from discrete domains and are of
    two types: hierarchy and scalar
  • Constraint-type is exact or approximate

31
A HYBRID DB-IR SYSTEM
  • Applying an atomic constraint to a set of items
    returns a tagged result set
  • The result inherits the constraint-type
  • Each result item has a (TF-IDF) relevance score
    (0 for exact)
  • Combining two tagged item sets using Boolean
    operators yields a tagged set
  • The result type is exact if both inputs are
    exact, and approximate otherwise
  • The result contains the intersection of the input
    item sets if either input is exact; the union
    otherwise
  • Each result item is tagged with a combined
    relevance
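The combination rules above can be sketched as follows. The representation (a type tag plus a dict of item relevances) and the additive merging of relevance scores are assumptions; the slides state the type and membership rules but not how combined relevance is computed:

```python
def combine(a, b):
    """Combine two tagged result sets.

    A tagged set is (constraint_type, {item_id: relevance}).
    Result type: exact iff both inputs are exact.
    Result items: intersection if either input is exact, union otherwise.
    Relevance: sum of the inputs' scores (assumed merge rule).
    """
    type_a, items_a = a
    type_b, items_b = b
    if type_a == "exact" or type_b == "exact":
        keys = items_a.keys() & items_b.keys()  # intersection
    else:
        keys = items_a.keys() | items_b.keys()  # union
    result_type = "exact" if type_a == "exact" and type_b == "exact" else "approximate"
    merged = {k: items_a.get(k, 0.0) + items_b.get(k, 0.0) for k in keys}
    return (result_type, merged)

# An exact filter (approved items, score 0) combined with an
# approximate TF-IDF match keeps only approved items, each
# tagged with its relevance:
approved = ("exact", {"q1": 0.0, "q2": 0.0})
relevant = ("approximate", {"q2": 0.7, "q3": 0.4})
print(combine(approved, relevant))  # ('approximate', {'q2': 0.7})
```

This shows why exact constraints act like DB-style filters while approximate ones widen the candidate set and contribute ranking signal.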

32
A HYBRID DB-IR SYSTEM
  • The semantics of Boolean expressions over
    constraints is associative and commutative
  • Evaluating exact constraints and approximate
    constraints separately (in the DB and IR
    subsystems) is a special case. Additionally:
  • Uniform handling of the relevance contributions
    of categories, popularity metrics, recency, etc.
  • Absolute and relative relevance modifiers can be
    introduced for greater flexibility.

33
CONCURRENCY, RECOVERY, PARALLELISM
  • Concurrency
  • Index is updated in real-time
  • Automatic partitioning and a two-step locking
    protocol result in very low overhead
  • Relies upon post-processing to address some
    anomalies
  • Recovery
  • Partitioning is again the key
  • Leverages the recovery guarantees of the DBMS
  • The approach also supports efficient refresh of
    global statistics
  • Parallelism
  • Hash-based partitioning

34
NOTIFICATION
  • Extension of search: each user can define one or
    more standing searches and request instant or
    periodic notification.
  • Boolean combinations of atomic constraints.
  • Major challenges
  • Scaling with number of standing searches.
  • Requires multiple timestamps, indexing searches.
  • Exactly-once delivery property.
  • Many subtleties center around notifiability of
    updates!
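One way to picture standing searches with exactly-once delivery is below; all names and the dedup-by-key scheme are illustrative assumptions, since the slides only state the requirements (scaling, timestamps, exactly-once):

```python
def notify(standing_searches, item, delivered):
    """Match a new item against standing searches; deliver each
    (user, item) notification at most once.

    standing_searches: {user: predicate over items}
    delivered: set of (user, item_id) pairs already sent
    """
    sent = []
    for user, predicate in standing_searches.items():
        key = (user, item["id"])
        if key not in delivered and predicate(item):
            delivered.add(key)  # record delivery so repeats are suppressed
            sent.append(user)
    return sent

searches = {
    "alice": lambda it: it["category"] == "Laptop",
    "bob": lambda it: "ip" in it["subject"].split(),
}
delivered = set()
item = {"id": 17, "category": "Laptop", "subject": "ip address question"}
print(notify(searches, item, delivered))  # ['alice', 'bob']
print(notify(searches, item, delivered))  # [] -- already delivered
```

Scaling this beyond a toy means indexing the standing searches themselves rather than scanning them per update, which is exactly the challenge the slide calls out.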

35
ROLE OF DATA MINING
36
DATA MINING TASKS
  • There is a lot of insight to be gained by
    analyzing the data.
  • What will help the user with her problem?
  • Who does a given user trust?
  • Characteristic metrics for high-quality content.
  • Identify helpful content in similar, past
    queries.
  • Summarize content.
  • Who can answer this question?

37
LEVERAGING DATA MINING
  • How do we get at the data?
  • Relevant information is distributed across
    several sources, not just the DBMS.
  • Aggregated in a warehouse.
  • How do we incorporate the insights obtained by
    mining into the search phase?
  • Need to constantly update info about every piece
    of content (Qs, As, users, ...)

38
LEVERAGING DATA MINING
  • Three-step approach:
  • Off-line analysis to gather new insight
  • Periodically refresh indexes
  • Use insight (from KB/index) to improve search
    using the extended DB/IR query framework

Use mining to create useful metadata
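The three-step loop can be pictured as mined quality metadata, refreshed periodically into the index, then folded into the search score as a relevance boost. The linear boost formula, the weights, and all names here are assumptions for illustration:

```python
# Steps 1-2: offline mining produces per-item quality metadata,
# periodically written into the index alongside text statistics.
quality = {"a1": 0.9, "a2": 0.2}  # hypothetical mined scores in [0, 1]

# Step 3: at search time, mined quality boosts the text relevance.
def boosted_score(item_id, text_relevance, alpha=0.5):
    """Final score = IR relevance + alpha * mined quality."""
    return text_relevance + alpha * quality.get(item_id, 0.0)

# Two items with equal text relevance rank differently once the
# mined quality boost is applied (a1 outranks a2).
print(boosted_score("a1", 1.0))
print(boosted_score("a2", 1.0))
```

Keeping the boost inside the extended DB/IR query framework means mined signals are handled uniformly with categories, recency, and popularity, as the earlier slides describe.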
39
SOME UNIQUE TWISTS
  • Identify the kinds of feedback that would be
    helpful in refining a search.
  • I.e., not just specific terms, but the types of
    concepts that would be useful discriminators
    (e.g., a good hierarchy of feedback concepts)
  • Metrics of quality
  • Link analysis is a good example, but what are the
    links here?
  • Self-tuning searches
  • The more the knobs, the more the choices
  • Next step: self-personalizing searches?

40
CONCLUSIONS
41
CONFLUENCES
[Diagram: the confluence of IR search, DB queries, and P2P knowledge management]