Analysing and Modelling Large-Scale Enterprise Data

Transcript and Presenter's Notes

1
Analysing and Modelling Large-Scale Enterprise Data
  • Thore Graepel
  • Online Services and Advertising Group
  • Microsoft Research Cambridge

2
Overview
  • Complex large-scale data in the enterprise
  • What kind of data is available?
  • What technologies are used?
  • Tasks and enterprise-specific challenges?
  • Methodology
  • Bayesian Inference in Factor Graph Models
  • PQL: Using SQL to describe probability models
  • Applications
  • Gamer Rating and Matchmaking: TrueSkill
  • Click-Through Rate Prediction: AdPredictor
  • Large-Scale Recommendations: Matchbox

3
Complex Data
Joint work with Tom Minka and Phillip Trelford
4
Data Sources at Microsoft (External)
  • Online Services Division
  • Web index
  • Search and Ad click logs (12-15 TB / day)
  • Hotmail, Instant messaging, Internet Explorer
    (hundreds of millions of users)
  • MSN portal and Bing maps
  • Xbox Live Gaming Service
  • User transaction log data
  • Ranking and matchmaking data
  • Game instrumentation for user testing

5
Data Sources at Microsoft (Internal)
  • Development and Software Instrumentation
  • Watson (customer feedback data)
  • Source depot (MS source code, e.g., Office,
    Windows)
  • Multilingual technical documentation
  • Business
  • Customer databases
  • Sales and Marketing

6
Data-Intensive Tasks at Microsoft
  • Prediction of user behaviour and preferences
  • Improve web search
  • Improve targeting for advertising
  • Spam filtering and content prioritisation
  • Improve user experience
  • Matchmaking for games
  • Multi-modal user interfaces (Natal, speech)
  • Improve software development process
  • Improve productivity of developers
  • Analyse software for defects

7
Technical Infrastructure
  • Relational Databases/SQL
  • Great agility for analysis and reliability for
    business
  • Limited scalability
  • Need to import data into SQL
  • Windows HPC
  • Complex computations / fine-grained parallelism
  • Need to move data to HPC cluster
  • Cosmos
  • Take the computation to the data
  • Super-efficient stream-based computations

8
Cosmos Architecture
[Architecture diagram: components SCOPE, DryadLINQ, Sputnik, Dryad and Cosmos, with data held in streams]
9
Enterprise/Online specific challenges
  • Privacy
  • Privacy constraints limit the ways in which data can be used
  • Interesting trade-offs (differential privacy)
  • Incentives
  • Data produced by self-interested agents
  • Need to design incentive-compatible mechanisms
  • Exploration/Exploitation
  • Results of inference feed back into business
    process and determine future observations.
  • Need to aim at long-term benefits

10
Factor Graphs
11
Factor Graphs / Trees
  • Definition: Graphical representation of the product
    structure of a function (Wiberg, 1996)
  • Nodes: Factors and variables
  • Edges: Dependencies of factors on variables
  • Questions
  • What are the marginals of the function (all but
    one variable are summed out)?
  • What is the mode of the function?

12
Factor Graphs and Bayesian Inference
  • Bayes law
  • Factorising prior
  • Factorising likelihood
  • Sum out latent variables

[Figure: factor graph with variable nodes s1, s2, s, t1, t2, d, y]
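The four steps listed on this slide can be written out generically as follows. The slide's own equations are not preserved in the transcript, so the symbols here ($\theta$ for parameters, $h_j$ for latent variables, $y_j$ for observations) are assumed notation rather than the presenter's.

```latex
\begin{align}
p(\theta \mid y) &= \frac{p(y \mid \theta)\, p(\theta)}{p(y)}
  && \text{(Bayes' law)} \\
p(\theta) &= \prod_i p(\theta_i)
  && \text{(factorising prior)} \\
p(y \mid \theta) &= \prod_j p(y_j \mid \theta)
  && \text{(factorising likelihood)} \\
p(y_j \mid \theta) &= \sum_{h_j} p(y_j, h_j \mid \theta)
  && \text{(sum out latent variables)}
\end{align}
```

Each product on the right-hand side corresponds to a set of factor nodes, which is what makes the posterior a natural fit for a factor graph.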
13
Factor Trees Separation
[Figure: factor tree with variables v, w, x, y, z and factors f1(v,w), f2(w,x), f3(x,y), f4(x,z)]
  • Observation: The sum of products becomes a product of
    sums of all messages from neighbouring factors to
    the variable!
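The observation can be checked numerically on the tree from this slide. The sketch below assumes binary variables and made-up factor tables (the slide gives no values), and compares the brute-force marginal of x with the product of one message per neighbouring factor.

```python
from itertools import product

# Hypothetical factor tables over binary variables; the values are arbitrary.
f1 = {(v, w): 1.0 + v + 2 * w for v, w in product((0, 1), repeat=2)}
f2 = {(w, x): 2.0 if w == x else 0.5 for w, x in product((0, 1), repeat=2)}
f3 = {(x, y): 1.0 + 3 * x * y for x, y in product((0, 1), repeat=2)}
f4 = {(x, z): 0.5 + x + z for x, z in product((0, 1), repeat=2)}


def marginal_brute_force(x):
    """Sum the full product f1*f2*f3*f4 over every variable except x."""
    return sum(
        f1[v, w] * f2[w, x] * f3[x, y] * f4[x, z]
        for v, w, y, z in product((0, 1), repeat=4)
    )


def marginal_by_messages(x):
    """Multiply one message per factor that neighbours x."""
    # Message from f1 to w: sum out v.
    m_f1_w = {w: sum(f1[v, w] for v in (0, 1)) for w in (0, 1)}
    # Message from f2 to x: sum out w, weighted by the message arriving at w.
    m_f2_x = sum(f2[w, x] * m_f1_w[w] for w in (0, 1))
    # Messages from the leaf-side factors f3 and f4: sum out y and z.
    m_f3_x = sum(f3[x, y] for y in (0, 1))
    m_f4_x = sum(f4[x, z] for z in (0, 1))
    return m_f2_x * m_f3_x * m_f4_x


for x in (0, 1):
    print(x, marginal_brute_force(x), marginal_by_messages(x))
```

The brute-force version touches every joint configuration, while the message version only ever sums over one variable at a time.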

14
Messages From Factors To Variables
[Figure: factor tree fragment with variables w, x, y, z and factors f2(w,x), f3(x,y), f4(x,z)]
  • Observation: Factors only need to sum out all
    their local variables!

15
Messages From Variables To Factors
[Figure: factor tree fragment with variables x, y, z and factors f2(w,x), f3(x,y), f4(x,z)]
  • Observation: Variables pass on the product of all
    incoming messages!

16
The Sum-Product Algorithm
  • Three update equations (Aji and McEliece, 1997)
  • Update equations can be directly derived from the
    distributive law.
  • Efficient for messages in the exponential family.
  • Calculate all marginals at the same time.
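The three update equations themselves did not survive in the transcript. A standard way of writing them is given here as a sketch; the notation ($F_x$ for the set of factors neighbouring variable $x$, $m_{a \to b}$ for a message from node $a$ to node $b$) is assumed, not taken from the slide.

```latex
\begin{align}
p(x) &= \prod_{f \in F_x} m_{f \to x}(x) \\
m_{f \to x}(x) &= \sum_{y_1, \dots, y_k} f(x, y_1, \dots, y_k)
  \prod_{j=1}^{k} m_{y_j \to f}(y_j) \\
m_{y \to f}(y) &= \prod_{g \in F_y \setminus \{f\}} m_{g \to y}(y)
\end{align}
```

The first line gives the (unnormalised) marginal, the second the factor-to-variable message, and the third the variable-to-factor message.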

17
Approximate Message Passing
  • Problem: The exact messages from factors to
    variables may not be closed under products.
  • Solution: Approximate the marginal as well as
    possible in the sense of minimal KL divergence.
  • Expectation Propagation (Minka, 2001):
    Approximate the marginal by moment matching.
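The resulting message update is missing from the transcript. In the usual presentation of Expectation Propagation it takes the form sketched here, where $\mathcal{Q}$ is the chosen exponential family and the superscript "exact" marks the true sum-product message; both names are assumed notation.

```latex
\begin{align}
\hat{p}(x) &= \operatorname*{arg\,min}_{q \in \mathcal{Q}}
  \mathrm{KL}\bigl( m_{x \to f}(x)\, m^{\text{exact}}_{f \to x}(x)
  \,\big\|\, q(x) \bigr) \\
m_{f \to x}(x) &= \frac{\hat{p}(x)}{m_{x \to f}(x)}
\end{align}
```

The argument of the KL divergence is understood up to normalisation. For an exponential family, the minimiser is the member whose expected sufficient statistics (for a Gaussian, the mean and variance) match those of the exact marginal, which is why the step is called moment matching.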

18
Distributed Message Passing
  • Map-Reduce for IID data
  • Map: Nodes compute messages m_{f_i→s} from data y_i
    and m_{f_i→s}
  • Reduce: Combine messages m_{f_i→s} into p(s) by
    multiplication
  • Caveats
  • All approximate data factors need the incoming
    message m_{s→f_i}!
  • All messages m_{f_i→s} need to be stored if the
    same data point is considered multiple times

[Figure: factor graph with one shared variable s and IID observations y1, y2, y3]
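A minimal sketch of the map and reduce steps, assuming the simplest possible model: a Gaussian prior on s and Gaussian observation noise with known variance. In that case every data factor sends an exact Gaussian message, so the caveats about approximate factors do not arise. The function names and default values are illustrative, not from the talk.

```python
from functools import reduce


def map_message(y, noise_var=1.0):
    """Map step: message from data factor f_i to s, stored as the natural
    parameters (precision, precision * mean) of a Gaussian in s."""
    precision = 1.0 / noise_var
    return (precision, precision * y)


def multiply(a, b):
    """Multiplying two Gaussians adds their natural parameters."""
    return (a[0] + b[0], a[1] + b[1])


def posterior(data, prior_mean=0.0, prior_var=100.0):
    """Reduce step: fold all messages into the prior by multiplication."""
    prior = (1.0 / prior_var, prior_mean / prior_var)
    messages = map(map_message, data)  # could run on separate nodes
    precision, precision_mean = reduce(multiply, messages, prior)
    return precision_mean / precision, 1.0 / precision  # (mean, variance)


mean, var = posterior([1.0, 2.0, 3.0])
print(mean, var)
```

Because multiplication of messages is commutative and associative, the reduce step can combine partial products from different nodes in any order.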
19
PQL
Joint work with Ralf Herbrich and Jurgen Van Gael
20
PQL as a Platform
21
PQL I Augmenting Schemas
People AUGMENT DB.People ADD weight FLOAT
22
PQL II Factor Types
Single Relation, Cross Relation, Cross Entity

23
PQL III Single Relation Factors
FACTOR Normal(p.weight,75.0,25.0) FROM People p
24
PQL IV Cross Relation Factors
FACTOR Normal(g.weight, p.weight, 1.0) FROM
People p, DrVisit g WHERE p.PersonID = g.PersonID
25
PQL as a Unifying Platform
26
TrueSkill
Joint work with Tom Minka and Phillip Trelford
27
TrueSkill
  • Given
  • Match outcomes: Orderings among k teams
    consisting of n_1, n_2, ..., n_k players,
    respectively
  • Questions
  • Skill s_i for each player such that
  • Global ranking among all players
  • Fair matches between teams of players

28
Efficient Approximate Inference
[Figure: factor graph with Gaussian prior factors on skills s1, s2, s3, s4, team performances t1, t2, t3, and ranking likelihood factors on outcomes y12, y23]
Fast and efficient approximate message passing
using Expectation Propagation
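For the special case of two players and no draws, the Expectation Propagation update has a closed form. The sketch below follows the update equations from the published TrueSkill description as best recalled; the defaults (prior mean 25, prior standard deviation 25/3, performance noise beta = 25/6) are the commonly quoted ones and are assumptions here, and the skill dynamics step is left out.

```python
import math


def pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def cdf(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def update(winner, loser, beta=25.0 / 6.0):
    """winner and loser are (mu, sigma) beliefs; returns the updated pair."""
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = math.sqrt(2.0 * beta ** 2 + sig_w ** 2 + sig_l ** 2)
    t = (mu_w - mu_l) / c
    v = pdf(t) / cdf(t)   # mean correction of the truncated Gaussian
    w = v * (v + t)       # variance correction, between 0 and 1
    new_winner = (mu_w + sig_w ** 2 / c * v,
                  sig_w * math.sqrt(1.0 - sig_w ** 2 / c ** 2 * w))
    new_loser = (mu_l - sig_l ** 2 / c * v,
                 sig_l * math.sqrt(1.0 - sig_l ** 2 / c ** 2 * w))
    return new_winner, new_loser


prior = (25.0, 25.0 / 3.0)
print(update(prior, prior))
```

The size of the mean shift scales with each player's own variance, so uncertain players move quickly and well-established players move little.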
29
TrueSkill: Superfast convergence to true skills
[Plot: level (0 to 40) against games played (0 to 400) for players "char" and "SQLwildman", comparing TrueSkill with the Halo 2 Beta ranking]
30
Applications to Online Gaming
  • Leaderboard
  • Global ranking of all players
  • Matchmaking
  • For gamers: Most uncertain outcome
  • For inference: Most informative
  • Both are equivalent!
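One way to make "most uncertain outcome" concrete is to pick the opponent whose predicted win probability is closest to one half. This is a sketch under assumptions: skills are Gaussian beliefs (mu, sigma), draws are ignored, and beta = 25/6 is an illustrative performance-noise value. The deployed system may use a different match-quality criterion.

```python
import math


def win_probability(a, b, beta=25.0 / 6.0):
    """P(a beats b) for (mu, sigma) skill beliefs, ignoring draws."""
    (mu_a, sig_a), (mu_b, sig_b) = a, b
    c = math.sqrt(2.0 * beta ** 2 + sig_a ** 2 + sig_b ** 2)
    return 0.5 * (1.0 + math.erf((mu_a - mu_b) / (c * math.sqrt(2.0))))


def best_opponent(player, candidates):
    """Return the candidate whose match outcome is hardest to predict."""
    return min(candidates,
               key=lambda cand: abs(win_probability(player, cand) - 0.5))


player = (25.0, 2.0)
candidates = [(10.0, 2.0), (24.0, 2.0), (40.0, 2.0)]
print(best_opponent(player, candidates))
```

An outcome with probability near one half carries the most information per game, which is the sense in which the gamer's and the inference engine's preferences coincide.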

31
TrueSkill in Xbox 360 and Halo 3
32
AdPredictor
Joint work with Joaquin Quiñonero Candela, Onno
Zoeter, Tom Borchert, Phillip Trelford
33
Why Predict Probability-of-Click?
  • Display (according to expected revenue)
  • Charge (per click)

Bid     pClick (%)   Expected revenue   Charge per click
1.00    10           0.10               0.80
2.00    4            0.08               1.25
0.10    50           0.05               0.05

  • Advantages of improved probability estimates
  • Increase user satisfaction by better targeting
  • Fairer charges to advertisers
  • Increase revenue by showing ads with high
    click-thru rate
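The figures on this slide are consistent with ranking ads by bid × pClick and charging each ad, per click, the smallest bid that would still keep it above the next ad. That reading is inferred from the numbers rather than stated on the slide, and the function below is a hypothetical sketch of it.

```python
def rank_and_price(ads):
    """ads: list of (bid, p_click) pairs.

    Returns rows sorted by expected revenue, each row being
    (bid, p_click, expected_revenue, charge_per_click).
    """
    ranked = sorted(ads, key=lambda ad: ad[0] * ad[1], reverse=True)
    rows = []
    for i, (bid, p_click) in enumerate(ranked):
        expected = bid * p_click
        if i + 1 < len(ranked):
            next_bid, next_p = ranked[i + 1]
            # Smallest bid that keeps this ad ahead of the next one.
            charge = next_bid * next_p / p_click
        else:
            # No competitor below; the pricing rule for the last slot
            # is not recoverable from the slide, so leave it unset.
            charge = None
        rows.append((bid, p_click, expected, charge))
    return rows


for row in rank_and_price([(0.10, 0.50), (1.00, 0.10), (2.00, 0.04)]):
    print(row)
```

Both the ranking and the charge depend on pClick, which is why better probability estimates change what is shown and what advertisers pay.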
34
adPredictor Details
[Figure: sparse input features with example values, feeding the prediction P(pClick)]
Client IP: 102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86
Match Type: Exact Match, Broad Match
Position: ML-1, SB-1, SB-2
35
Training Algorithm in Action
[Figure: weights w1 and w2, score s and outcome c, with regions labelled No Click and Click]
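A minimal sketch of an online update of this kind, assuming a Bayesian probit model with an independent Gaussian belief (mean, variance) per feature weight, as in the published adPredictor description. The feature names, the noise parameter beta and the unit-variance priors are illustrative assumptions.

```python
import math


def pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def cdf(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def predict(weights, features, beta=1.0):
    """Click probability for an impression with the given active features."""
    total_mean = sum(weights[f][0] for f in features)
    total_var = beta ** 2 + sum(weights[f][1] for f in features)
    return cdf(total_mean / math.sqrt(total_var))


def update(weights, features, clicked, beta=1.0):
    """Update the beliefs of the active features in place after one outcome."""
    y = 1.0 if clicked else -1.0
    total_mean = sum(weights[f][0] for f in features)
    total_var = beta ** 2 + sum(weights[f][1] for f in features)
    total_std = math.sqrt(total_var)
    t = y * total_mean / total_std
    v = pdf(t) / cdf(t)   # mean correction
    w = v * (v + t)       # variance correction
    for f in features:
        mean, var = weights[f]
        weights[f] = (mean + y * var / total_std * v,
                      var * (1.0 - var / total_var * w))


weights = {"ip:102.34.12.201": (0.0, 1.0), "position:ML-1": (0.0, 1.0)}
active = list(weights)
print(predict(weights, active))
update(weights, active, clicked=True)
print(predict(weights, active))
```

Only the weights of features present in the impression are touched, so one update costs time proportional to the number of active features.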
36
Client IP: Mean and Variance
[Plot annotations: Low clickers, High clickers]
37
UserAgent Mean Posterior Effects
38
AdPredictor in Bing Search Engine
  • AdPredictor is now running on 100% of Paid Search
    traffic in Microsoft's Bing search engine
  • Relevance and Click-Through Rate of Ads improved
  • Calibrated CTR prediction provides solid
    foundation for further improvements
  • AdPredictor explored for other tasks such as
    contextual and display advertising

39
Matchbox
Joint work with David Stern and Ralf Herbrich
40
Collaborative Filtering
[Figure: rating matrix of users A to D against items 1 to 6, with unknown ratings marked "?" and a note asking whether metadata is available]
41
Map Sparse Features To Trait Space
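The mapping can be sketched as two linear maps followed by an inner product. This assumes the bilinear form from the published Matchbox description (user traits s = Ux, item traits t = Vy, rating driven by the inner product of s and t) and uses point-estimate matrices instead of Gaussian beliefs. The matrices, feature indices and bias are made up for illustration.

```python
def to_traits(matrix, active_features):
    """Multiply a K x N trait matrix by a sparse binary feature vector,
    given as the list of indices of its active features."""
    return [sum(row[i] for i in active_features) for row in matrix]


def predicted_rating(U, V, user_features, item_features, bias=0.0):
    """Inner product of user and item trait vectors, plus a bias."""
    s = to_traits(U, user_features)   # user position in trait space
    t = to_traits(V, item_features)   # item position in trait space
    return sum(a * b for a, b in zip(s, t)) + bias


# Two traits; three hypothetical user features and two item features.
U = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 2.0]]
V = [[2.0, 0.0],
     [1.0, 1.0]]
print(predicted_rating(U, V, user_features=[0, 1], item_features=[1]))
```

Because users and items are described by features such as ID, gender, country or genre, a new user or item with known metadata still lands somewhere sensible in trait space.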
42
Message Passing For Matchbox
[Figure: factor graph with variables u11, u21, u01, u12, u22, u02, v11, v21, v12, v22, traits s1, s2 and t1, t2, and rating r]
Message update functions powered by Infer.NET
43
User/Item Taste Space
44
Applications
45
Conclusions
46
Conclusions
  • Great variety of data sources and tasks
  • Challenges: privacy, incentives, exploration
  • Tools: SQL, NoSQL, HPC
  • Modelling platform (Factor Graphs, PQL)
  • Represent uncertainty
  • Composable models
  • Distributed, data-centric computation
  • Applications: TrueSkill, AdPredictor, Matchbox
  • Thanks!