Title: Analysing and Modelling Large-Scale Enterprise Data
1 Analysing and Modelling Large-Scale Enterprise Data
- Thore Graepel
- Online Services and Advertising Group
- Microsoft Research Cambridge
2 Overview
- Complex large-scale data in the enterprise
- What kind of data is available?
- What technologies are used?
- Tasks and enterprise-specific challenges?
- Methodology
- Bayesian Inference in Factor Graph Models
- PQL: Using SQL to describe probability models
- Applications
- Gamer Rating and Matchmaking: TrueSkill
- Click-Through Rate Prediction: AdPredictor
- Large-Scale Recommendations: Matchbox
3 Complex Data
Joint work with Tom Minka and Phillip Trelford
4 Data Sources at Microsoft (External)
- Online Services Division
- Web index
- Search and Ad click logs (12-15 TB / day)
- Hotmail, instant messaging, Internet Explorer (hundreds of millions of users)
- MSN portal and Bing Maps
- Xbox Live Gaming Service
- User transaction log data
- Ranking and matchmaking data
- Game instrumentation for user testing
5 Data Sources at Microsoft (Internal)
- Development and Software Instrumentation
- Watson (customer feedback data)
- Source depot (MS source code, e.g., Office, Windows)
- Multilingual technical documentation
- Business
- Customer databases
- Sales and Marketing
6 Data-Intensive Tasks at Microsoft
- Prediction of user behaviour and preferences
- Improve web search
- Improve targeting for advertising
- Spam filtering and content prioritisation
- Improve user experience
- Matchmaking for games
- Multi-modal user interfaces (Natal, speech)
- Improve software development process
- Improve productivity of developers
- Analyse software for defects
7 Technical Infrastructure
- Relational Databases/SQL
- Great agility for analysis and reliability for business
- Limited scalability
- Need to import data into SQL
- Windows HPC
- Complex computations / fine-grained parallelism
- Need to move data to HPC cluster
- Cosmos
- Take the computation to the data
- Super-efficient stream-based computations
8 Cosmos Architecture
[Figure: architecture diagram with the components SCOPE, DryadLINQ, Sputnik, Dryad, and Cosmos, and four data streams]
9 Enterprise/Online-Specific Challenges
- Privacy
- Privacy constraints limit the ways in which data can be used
- Interesting trade-offs (differential privacy)
- Incentives
- Data produced by self-interested agents
- Need to design incentive-compatible mechanisms
- Exploration/Exploitation
- Results of inference feed back into the business process and determine future observations.
- Need to aim at long-term benefits
10 Factor Graphs
11 Factor Graphs / Trees
- Definition: Graphical representation of the product structure of a function (Wiberg, 1996)
- Nodes: Factors and variables
- Edges: Dependencies of factors on variables.
- Question
- What are the marginals of the function (all but one variable are summed out)?
- What is the mode of the function?
12 Factor Graphs and Bayesian Inference
- Bayes' law
- Factorising prior
- Factorising likelihood
- Sum out latent variables
[Figure: factor graph with the variables s1, s2, s, t1, t2, d, and y]
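In generic notation, the four steps listed on this slide can be written as follows. The symbols are assumptions chosen for illustration: $s_i$ for latent variables, $y_j$ for observations, and $\mathrm{pa}(j)$ for the latent variables that observation $j$ depends on. For continuous variables the sum in the last line becomes an integral.

```latex
\begin{align}
p(\mathbf{s} \mid \mathbf{y}) &= \frac{p(\mathbf{y} \mid \mathbf{s})\, p(\mathbf{s})}{p(\mathbf{y})}
  && \text{(Bayes' law)} \\
p(\mathbf{s}) &= \prod_i p(s_i)
  && \text{(factorising prior)} \\
p(\mathbf{y} \mid \mathbf{s}) &= \prod_j p(y_j \mid \mathbf{s}_{\mathrm{pa}(j)})
  && \text{(factorising likelihood)} \\
p(s_k \mid \mathbf{y}) &\propto \sum_{\mathbf{s} \setminus s_k} \prod_i p(s_i) \prod_j p(y_j \mid \mathbf{s}_{\mathrm{pa}(j)})
  && \text{(sum out latent variables)}
\end{align}
```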
13 Factor Trees: Separation
[Figure: factor tree with the variables v, w, x, y, z and the factors f1(v,w), f2(w,x), f3(x,y), f4(x,z)]
- Observation: The sum of products becomes a product of sums of all messages from neighbouring factors to the variable!
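The observation can be checked numerically on the factor tree of this slide. The sketch below is a minimal illustration, assuming binary variables and arbitrary made-up factor tables (neither is given on the slide). It computes the marginal of w once by summing the full product and once as a product of two messages.

```python
from itertools import product

VALUES = (0, 1)  # assumption: all variables are binary

# Hypothetical factor tables; the numbers are arbitrary positive values.
f1 = {(v, w): 1.0 + v + 2 * w for v, w in product(VALUES, repeat=2)}
f2 = {(w, x): 2.0 + w * x for w, x in product(VALUES, repeat=2)}
f3 = {(x, y): 1.0 + 3 * x * y for x, y in product(VALUES, repeat=2)}
f4 = {(x, z): 0.5 + x + z for x, z in product(VALUES, repeat=2)}


def brute_force_marginal_w():
    """Sum of products: sum f1*f2*f3*f4 over v, x, y, z for each value of w."""
    return [
        sum(f1[v, w] * f2[w, x] * f3[x, y] * f4[x, z]
            for v, x, y, z in product(VALUES, repeat=4))
        for w in VALUES
    ]


def message_passing_marginal_w():
    """Product of sums: multiply the messages arriving at w."""
    m_f1_w = [sum(f1[v, w] for v in VALUES) for w in VALUES]
    m_f3_x = [sum(f3[x, y] for y in VALUES) for x in VALUES]
    m_f4_x = [sum(f4[x, z] for z in VALUES) for x in VALUES]
    # Variable x passes on the product of its other incoming messages.
    m_x_f2 = [m_f3_x[x] * m_f4_x[x] for x in VALUES]
    # Factor f2 sums out its local variable x.
    m_f2_w = [sum(f2[w, x] * m_x_f2[x] for x in VALUES) for w in VALUES]
    return [m_f1_w[w] * m_f2_w[w] for w in VALUES]
```

Both functions return the same unnormalised marginal. The message-passing version touches each factor table once, whereas the brute-force sum grows exponentially with the number of variables.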
14 Messages From Factors To Variables
[Figure: factor tree fragment with the variables w, x, y, z and the factors f2(w,x), f3(x,y), f4(x,z)]
- Observation: Factors only need to sum out all their local variables!
15 Messages From Variables To Factors
[Figure: factor tree fragment with the variables x, y, z and the factors f2(w,x), f3(x,y), f4(x,z)]
- Observation: Variables pass on the product of all incoming messages!
16 The Sum-Product Algorithm
- Three update equations (Aji & McEliece, 1997)
- Update equations can be directly derived from the distributive law.
- Efficient for messages in the exponential family.
- Calculate all marginals at the same time.
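In their standard textbook form, the three update equations of the sum-product algorithm read as follows. The notation is an assumption made for this sketch: $F_x$ is the set of factors neighbouring variable $x$, and $\mathbf{x}_f$ is the set of variables neighbouring factor $f$.

```latex
\begin{align}
p(x) &= \prod_{f \in F_x} m_{f \to x}(x) \\
m_{f \to x}(x) &= \sum_{\mathbf{x}_f \setminus x} f(\mathbf{x}_f)
  \prod_{x' \in \mathbf{x}_f \setminus x} m_{x' \to f}(x') \\
m_{x \to f}(x) &= \prod_{g \in F_x \setminus \{f\}} m_{g \to x}(x)
\end{align}
```

The first line gives the marginal as the product of all incoming messages, the second lets a factor sum out its other local variables, and the third lets a variable pass on the product of its other incoming messages.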
17 Approximate Message Passing
- Problem: The exact messages from factors to variables may not be closed under products.
- Solution: Approximate the marginal as well as possible in the sense of minimal KL divergence.
- Expectation Propagation (Minka, 2001): Approximate the marginal by moment matching (MM), resulting in $\hat{m}_{f \to x}(x) = \mathrm{MM}[m_{f \to x}(x)\, m_{x \to f}(x)] / m_{x \to f}(x)$
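A minimal numerical sketch of one Expectation Propagation step for a single Gaussian variable is given below. It assumes a Gaussian incoming message and uses a hypothetical step-function factor as the example. Moments are computed by simple grid summation, not in closed form. The function names and grid settings are illustrative choices.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def moment_match(factor, mean_in, var_in, lo=-20.0, hi=20.0, n=20001):
    """Mean and variance of factor(x) * N(x; mean_in, var_in) by grid summation."""
    step = (hi - lo) / (n - 1)
    z = m1 = m2 = 0.0
    for i in range(n):
        x = lo + i * step
        weight = factor(x) * gaussian_pdf(x, mean_in, var_in)
        z += weight
        m1 += weight * x
        m2 += weight * x * x
    mean = m1 / z
    return mean, m2 / z - mean ** 2


def ep_message(factor, mean_in, var_in):
    """Approximate factor-to-variable message as (mean, variance).

    The moment-matched Gaussian marginal is divided by the incoming message,
    which is a subtraction in natural parameters. The resulting precision can
    be non-positive for some factors; this sketch does not handle that case.
    """
    mean_q, var_q = moment_match(factor, mean_in, var_in)
    precision = 1.0 / var_q - 1.0 / var_in
    precision_mean = mean_q / var_q - mean_in / var_in
    return precision_mean / precision, 1.0 / precision


def positive(x):
    """Hypothetical example factor: indicator that x is positive."""
    return 1.0 if x > 0 else 0.0
```

For example, `moment_match(positive, 0.0, 1.0)` approximates the mean and variance of a standard Gaussian truncated to positive values, and `ep_message(positive, 0.0, 1.0)` turns that into a Gaussian message that can be multiplied with other messages.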
18 Distributed Message Passing
- Map-Reduce for IID data
- Map: Nodes compute messages $m_{f_i \to s}$ from data $y_i$ and $m_{f_i \to s}$
- Reduce: Combine messages $m_{f_i \to s}$ into $p_s$ by multiplication
- Caveats
- All approximate data factors need the incoming message $m_{s \to f_i}$!
- All messages $m_{f_i \to s}$ need to be stored if the same data point is considered multiple times
[Figure: factor graph with the shared variable s and the observations y1, y2, y3]
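The map and reduce steps can be sketched for the simplest case: a Gaussian variable s observed through IID Gaussian noise. This is an illustrative assumption, not the model on the slide. With exact Gaussian factors each message depends only on its own data point, so the caveats above do not apply. They matter once the factors are approximate.

```python
from functools import reduce

NOISE_VAR = 1.0  # assumption: known observation noise variance


def map_message(y_i):
    """Message m_{f_i -> s} for the factor N(y_i; s, NOISE_VAR).

    Returned in natural parameters: (precision * mean, precision).
    """
    return (y_i / NOISE_VAR, 1.0 / NOISE_VAR)


def multiply(a, b):
    """Multiplying two Gaussians adds their natural parameters."""
    return (a[0] + b[0], a[1] + b[1])


def posterior(prior_mean, prior_var, data):
    """Posterior mean and variance of s given all data points."""
    prior = (prior_mean / prior_var, 1.0 / prior_var)
    messages = map(map_message, data)  # map step: independent per data point
    precision_mean, precision = reduce(multiply, messages, prior)  # reduce step
    return precision_mean / precision, 1.0 / precision
```

Because multiplication of messages is associative and commutative, the reduce step can be carried out in any order and across any partition of the data.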
19 PQL
Joint work with Ralf Herbrich and Jurgen Van Gael
20 PQL as a Platform
21 PQL I: Augmenting Schemas
People = AUGMENT DB.People ADD weight FLOAT
22 PQL II: Factor Types
Single Relation, Cross Relation, Cross Entity
23 PQL III: Single Relation Factors
FACTOR Normal(p.weight,75.0,25.0) FROM People p
24 PQL IV: Cross Relation Factors
FACTOR Normal(g.weight, p.weight, 1.0)
FROM People p, DrVisit g WHERE p.PersonID = g.PersonID
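To make the meaning of the two FACTOR statements concrete, the following Python sketch (not PQL) works through the inference they might describe. Several things here are assumptions: the table contents are made up, the third argument of `Normal` is read as a variance, and each `DrVisit` row is treated as an observed measurement of the person's latent weight. The join on `PersonID` yields one measurement factor per matching row pair.

```python
from collections import defaultdict

# Hypothetical table contents for illustration only.
people = [{"PersonID": 1}, {"PersonID": 2}]
dr_visits = [
    {"PersonID": 1, "weight": 80.0},
    {"PersonID": 1, "weight": 82.0},
]

PRIOR_MEAN, PRIOR_VAR = 75.0, 25.0  # from FACTOR Normal(p.weight, 75.0, 25.0)
NOISE_VAR = 1.0                     # from FACTOR Normal(g.weight, p.weight, 1.0)


def infer_weights(people, dr_visits):
    """Posterior (mean, variance) of each person's latent weight."""
    # Equivalent of the join: group measurements by PersonID.
    measurements = defaultdict(list)
    for visit in dr_visits:
        measurements[visit["PersonID"]].append(visit["weight"])

    posteriors = {}
    for person in people:
        observed = measurements[person["PersonID"]]
        # Gaussian prior times Gaussian measurement factors: add natural parameters.
        precision = 1.0 / PRIOR_VAR + len(observed) / NOISE_VAR
        precision_mean = PRIOR_MEAN / PRIOR_VAR + sum(observed) / NOISE_VAR
        posteriors[person["PersonID"]] = (precision_mean / precision, 1.0 / precision)
    return posteriors
```

A person without any visits keeps the prior, while each additional visit tightens the posterior around the measured values.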
25 PQL as a Unifying Platform
26 TrueSkill
Joint work with Tom Minka and Phillip Trelford
27 TrueSkill
- Given
- Match outcomes: Orderings among k teams consisting of n1, n2, ..., nk players, respectively
- Questions
- Skill si for each player such that
- Global ranking among all players
- Fair matches between teams of players
28 Efficient Approximate Inference
[Figure: factor graph with Gaussian prior factors on s1, s2, s3, s4, the variables t1, t2, t3, and ranking likelihood factors on y12, y23]
Fast and efficient approximate message passing using Expectation Propagation
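For the special case of two players and no draws, the Expectation Propagation update has a well-known closed form, sketched below. This is an illustration and not the production implementation. The value of `beta` and the prior used in the usage note are the commonly published defaults and are assumptions here, since the slide does not state them.

```python
import math


def pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)


def cdf(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2)))


def update(winner, loser, beta=25.0 / 6.0):
    """Update (mean, variance) skill beliefs after a two-player game.

    beta is the assumed standard deviation of performance around skill.
    Very lopsided upsets can make cdf(t) underflow; this sketch ignores that.
    """
    (mu_w, var_w), (mu_l, var_l) = winner, loser
    c2 = 2 * beta ** 2 + var_w + var_l
    c = math.sqrt(c2)
    t = (mu_w - mu_l) / c
    v = pdf(t) / cdf(t)   # additive correction to the means
    w = v * (v + t)       # multiplicative correction to the variances
    new_winner = (mu_w + var_w / c * v, var_w * (1 - var_w / c2 * w))
    new_loser = (mu_l - var_l / c * v, var_l * (1 - var_l / c2 * w))
    return new_winner, new_loser
```

Starting both players from the same prior, for instance `(25.0, (25.0 / 3.0) ** 2)`, the winner's mean moves up, the loser's mean moves down by the same amount, and both variances shrink. An unexpected result produces a larger shift than an expected one.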
29 TrueSkill: Superfast Convergence to True Skills
[Figure: level (0 to 40) against games played (0 to 400) for the players char and SQLwildman, each shown under TrueSkill and under Halo 2 Beta]
30 Applications to Online Gaming
- Leaderboard
- Global ranking of all players
- Matchmaking
- For gamers: Most uncertain outcome
- For inference: Most informative
- Both are equivalent!
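The equivalence can be illustrated with a small sketch: the match whose outcome is most uncertain is also the one whose result carries the most information, measured here as the entropy of the predicted outcome. The win-probability formula assumes Gaussian skill beliefs and a performance noise `beta`. The function names and the default value of `beta` are assumptions for illustration.

```python
import math


def win_probability(a, b, beta=25.0 / 6.0):
    """Probability that player a beats player b, given (mean, variance) beliefs."""
    (mu_a, var_a), (mu_b, var_b) = a, b
    c2 = 2 * beta ** 2 + var_a + var_b
    return 0.5 * (1.0 + math.erf((mu_a - mu_b) / math.sqrt(2 * c2)))


def outcome_entropy(p):
    """Entropy in bits of a win/lose outcome with win probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_opponent(player, candidates):
    """Name of the candidate whose match against player is most uncertain."""
    return max(
        candidates,
        key=lambda name: outcome_entropy(win_probability(player, candidates[name])),
    )
```

The entropy peaks at a win probability of one half, so maximising it selects the opponent against whom the outcome is closest to a coin flip.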
31 TrueSkill in Xbox 360 and Halo 3
32 AdPredictor
Joint work with Joaquin Quiñonero Candela, Onno Zoeter, Tom Borchert, and Phillip Trelford
33 Why Predict Probability-of-Click?
- Display (according to expected revenue)
- Charge (per click)

Bid     P(click)   Expected revenue   Charge per click
1.00    10%        0.10               0.80
2.00    4%         0.08               1.25
0.10    50%        0.05               0.05

- Advantages of improved probability estimates
- Increase user satisfaction by better targeting
- Fairer charges to advertisers
- Increase revenue by showing ads with high click-through rate
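The figures on this slide are consistent with ranking ads by bid times click probability and charging each ad the smallest price per click that keeps it ahead of the next ad (a generalised second-price rule). That reading is an inference from the numbers, not something the slide states. The charge for the last ad is treated as a hypothetical `floor_price` parameter.

```python
def rank_and_charge(ads, floor_price=0.05):
    """Rank ads by expected revenue and compute per-click charges.

    ads: list of (bid, p_click) pairs.
    Returns a list of (bid, p_click, expected_revenue, charge) in display order.
    floor_price is an assumed charge for the last-ranked ad.
    """
    ranked = sorted(ads, key=lambda ad: ad[0] * ad[1], reverse=True)
    rows = []
    for i, (bid, p_click) in enumerate(ranked):
        if i + 1 < len(ranked):
            next_bid, next_p = ranked[i + 1]
            # Smallest price at which this ad still matches the next ad's score.
            charge = next_bid * next_p / p_click
        else:
            charge = floor_price
        rows.append((bid, p_click, bid * p_click, charge))
    return rows
```

Calling `rank_and_charge([(1.00, 0.10), (2.00, 0.04), (0.10, 0.50)])` yields charges of approximately 0.80, 1.25, and 0.05. Because the charge divides by the ad's own click probability, a miscalibrated estimate directly distorts what the advertiser pays.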
34 AdPredictor Details
[Figure: input features with example values used to predict P(pClick): Client IP (102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86), Match Type (Exact Match, Broad Match), Position (ML-1, SB-1, SB-2)]
35 Training Algorithm in Action
[Figure: factor graph with the variables w1, w2, s, and c, and the outcomes No Click and Click]
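A minimal sketch of an online Bayesian probit update of the kind described in the AdPredictor literature is shown below. Each active binary feature has a Gaussian weight belief, and one impression updates only the weights of its active features. The feature strings, the standard normal prior, and the value of `beta` are assumptions for illustration.

```python
import math

PRIOR = (0.0, 1.0)  # assumed prior (mean, variance) for every weight


def pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)


def cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2)))


def _totals(weights, active, beta):
    mean = sum(weights.get(f, PRIOR)[0] for f in active)
    var = beta ** 2 + sum(weights.get(f, PRIOR)[1] for f in active)
    return mean, var


def predict(weights, active, beta=1.0):
    """Predicted click probability for an impression with the given active features."""
    mean, var = _totals(weights, active, beta)
    return cdf(mean / math.sqrt(var))


def update(weights, active, clicked, beta=1.0):
    """Update the weight beliefs in place after observing click or no click."""
    y = 1.0 if clicked else -1.0
    mean, var = _totals(weights, active, beta)
    sd = math.sqrt(var)
    t = y * mean / sd
    v = pdf(t) / cdf(t)
    w = v * (v + t)
    for f in active:
        mu_f, var_f = weights.get(f, PRIOR)
        weights[f] = (mu_f + y * var_f / sd * v, var_f * (1 - var_f / var * w))
```

With an empty `weights` dictionary and the active features `["ClientIP=15.70.165.9", "MatchType=Exact Match", "Position=ML-1"]`, the prediction starts at 0.5. A single observed click raises it and reduces the variance of the three weights involved.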
36 Client IP: Mean and Variance
[Figure: mean and variance for client IPs, with regions labelled Low clickers and High clickers]
37 UserAgent: Mean Posterior Effects
38 AdPredictor in Bing Search Engine
- AdPredictor is now running on 100% of paid search traffic in Microsoft's Bing search engine
- Relevance and click-through rate of ads improved
- Calibrated CTR prediction provides a solid foundation for further improvements
- AdPredictor is being explored for other tasks such as contextual and display advertising
39 Matchbox
Joint work with David Stern and Ralf Herbrich
40 Collaborative Filtering
[Figure: user-item matrix with items 1 to 6 and users A to D, metadata on users and items, and unknown entries marked "?"]
41 Map Sparse Features To Trait Space
42 Message Passing For Matchbox
[Figure: factor graph with the variables u11, u21, u01, v11, v21, s1, t1, u12, u22, u02, v12, v22, s2, t2, and r]
Message update functions powered by Infer.NET
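The forward computation suggested by the variable names in this graph can be sketched as a bilinear model: sparse user and item features are mapped linearly into a shared trait space, and the rating is the inner product of the two trait vectors. The sketch uses point values only. It omits the Gaussian beliefs, the message passing, and any bias terms, and the matrix layout and function names are assumptions.

```python
def project(matrix, sparse_features):
    """Map sparse features {index: value} to a trait vector.

    matrix has one row per trait dimension and one column per feature.
    """
    return [
        sum(row[i] * value for i, value in sparse_features.items())
        for row in matrix
    ]


def predict_rating(U, V, user_features, item_features):
    """Rating r as the inner product of user traits s and item traits t."""
    s = project(U, user_features)
    t = project(V, item_features)
    return sum(s_k * t_k for s_k, t_k in zip(s, t))


# Hypothetical two-dimensional trait space with three user and two item features.
U = [[1.0, 0.0, 0.5],
     [0.0, 1.0, -0.5]]
V = [[0.5, 1.0],
     [2.0, 0.0]]
```

For example, `predict_rating(U, V, {0: 1.0, 2: 1.0}, {1: 1.0})` gives user traits `[1.5, -0.5]`, item traits `[1.0, 0.0]`, and a rating of 1.5. Because only the columns of active features are touched, the cost per prediction scales with the number of non-zero features.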
43 User/Item Taste Space
44 Applications
45 Conclusions
46 Conclusions
- Great variety of data sources and tasks
- Challenges: privacy, incentives, exploration
- Tools: SQL, NoSQL, HPC
- Modelling platform (Factor Graphs, PQL)
- Represent uncertainty
- Composable models
- Distributed, data-centric computation
- Applications: TrueSkill, AdPredictor, Matchbox
- Thanks!