Title: Big Data and Predictive Analytics
Slide 1: Big Data and Predictive Analytics
"In God we trust, all others must bring data."
Antarip Biswas, Sept 26th 2013
Slide 2: Agenda / Table of Contents
Slide 3: Use Cases and Success Stories
Slide 4: Success Stories - FareCast
- Air fare prediction
- For an online airfare, predicts whether the fare will go UP, go DOWN or STAY THE SAME in the future
- Acquired by Microsoft for roughly $100M
- Employed machine learning technologies over big data
Slide 5: Tesco Loyalty Program
- Done by Dunnhumby
- Data
- Data for Loyalty Program
- Basic demographic information such as address, age, gender, the number of members in a household and their ages, and dietary habits
- Purchase history appended
- Summary attributes
- Cluster analysis
- Crucible: a massive database of not only applicant information and purchase history, but also information purchased and collected elsewhere about participating consumers. Credit reports, loan applications, magazine subscription lists, the Office for National Statistics, and the Land Registry are all sources of additional information that is stored in Crucible.
Slide 6: Tesco Loyalty Program - Benefits
1. Loyalty
2. Cross-sells
3. Inventory, distribution and store network planning
4. Optimal targeting and use of manufacturer promotions
5. Consumer insight generation and marketing those insights
Tesco has achieved a 3.6-fold increase in coupon redemption rates by using big-data predictive analytics to predict which consumers are more likely to redeem which coupons.
Slide 7: Big Data Success Story
Slide 8: Netflix Recommendations
- Existing recommendation system: Cinematch
- Winning team: KorBell
- 107 algorithms explored
- Machine learning and data mining
- Employed SVD (singular value decomposition) and RBM (restricted Boltzmann machines)
- Achieved an 8.43% improvement in recommendations over the existing system
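To make the SVD bullet concrete, here is a minimal sketch of a latent-factor model trained by stochastic gradient descent, the style of "SVD" commonly used for rating prediction. It is an illustration only, not the winning team's method: the ratings, user and movie names, and all hyperparameters are hypothetical.

```python
import random

def train_factorization(ratings, n_factors=2, epochs=1000, lr=0.01, reg=0.02, seed=0):
    """Fit user and item factor vectors to (user, item, rating) triples by SGD."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random starting factors; sorting keeps the run reproducible.
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mean + sum(p * q for p, q in zip(P[u], Q[i])))
            for f in range(n_factors):
                p, q = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return mean, P, Q

def predict(model, user, item):
    """Predicted rating = global mean + dot product of the two factor vectors."""
    mean, P, Q = model
    return mean + sum(p * q for p, q in zip(P[user], Q[item]))

def rmse(model, ratings):
    """Root mean squared error of the model on the given triples."""
    squared = [(r - predict(model, u, i)) ** 2 for u, i, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5

# Hypothetical ratings on a 1-5 scale.
RATINGS = [
    ("ann", "action_1", 5), ("ann", "action_2", 4), ("ann", "drama_1", 1),
    ("bob", "action_1", 4), ("bob", "action_2", 5), ("bob", "drama_2", 2),
    ("cat", "drama_1", 5), ("cat", "drama_2", 4), ("cat", "action_1", 1),
    ("dan", "drama_1", 4), ("dan", "drama_2", 5), ("dan", "action_2", 2),
]

model = train_factorization(RATINGS)
print("training RMSE:", round(rmse(model, RATINGS), 3))
print("ann on drama_2:", round(predict(model, "ann", "drama_2"), 2))
```

The Netflix Prize measured improvement as a reduction in RMSE on held-out ratings; this sketch only reports training RMSE to keep the example short.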
Slide 9: Google Flu Spread Prediction
- Prediction of the spread of flu in real time during the 2009 H1N1 outbreak
- Google tested a mammoth 450 million different mathematical models of the search terms, comparing their predictions against the actual flu cases; 45 important parameters were found
- The model was tested when the H1N1 crisis struck in 2009 and gave more meaningful and valuable real-time information than any official public health system
Slide 10: Prediction - High Frequency Trading
- Objective: predict the impact of earnings announcements on stock prices
- Use historical financial data to get a time series of quarterly expected and actual earnings announcements
- Use historical financial data of stock price movements after the announcement
- Approach
- Categorize stocks based on market capitalization so that similar-sized companies are grouped together
- Split the historical data into an in-sample (training) set and an out-of-sample (validation) set
- Fit a linear regression model on the in-sample data, where the independent variable (feature) is the difference between the actual and estimated earnings and the dependent variable is the impact on stock price
Achieved a return of 1%, or 100 basis points
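The approach above can be sketched in a few lines. This is a minimal illustration, assuming a single market-cap bucket and a chronological 75/25 split; the observations are hypothetical and are not the data behind the reported return.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(slope, intercept, xs, ys):
    """Average squared gap between observed and fitted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical observations for one market-cap bucket, oldest first:
# (actual EPS minus estimated EPS, % price move after the announcement)
observations = [
    (-0.10, -2.1), (0.05, 0.9), (0.02, 0.3), (-0.04, -1.0),
    (0.08, 1.8), (0.00, 0.1), (0.06, 1.1), (-0.07, -1.6),
]

# Chronological split: earlier quarters train the model, later ones validate it.
split = int(len(observations) * 0.75)
in_sample, out_sample = observations[:split], observations[split:]

train_x = [surprise for surprise, _ in in_sample]
train_y = [move for _, move in in_sample]
slope, intercept = fit_line(train_x, train_y)

test_x = [surprise for surprise, _ in out_sample]
test_y = [move for _, move in out_sample]
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
print("out-of-sample MSE:", round(mean_squared_error(slope, intercept, test_x, test_y), 3))
```

Splitting by time rather than at random keeps later announcements out of the training set, which matters when the model is meant to trade on future announcements.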
Slide 11: Predictive Analytics for Couponing
Run the same campaign on both lists
Test Group: list of households from the analytic engine
Control Group: list of households getting the same offer
Evaluate impact: Control Group vs. Test Group
Measure results: Redemption (primary), Clips (secondary)
Verify efficacy of household recommendation by demonstrating significant variance from the Control Group
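One common way to check whether the Test Group's redemption rate differs significantly from the Control Group's is a two-proportion z-test. The slide does not say which test was used, so the sketch below is an assumption, and the household counts are hypothetical.

```python
import math

def two_proportion_z_test(redeemed_test, n_test, redeemed_control, n_control):
    """Return (z, two-sided p-value) for the difference in redemption rates."""
    rate_test = redeemed_test / n_test
    rate_control = redeemed_control / n_control
    # Pooled rate under the null hypothesis that both groups redeem equally.
    pooled = (redeemed_test + redeemed_control) / (n_test + n_control)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_test + 1 / n_control))
    if std_err == 0:
        raise ValueError("redemption rate is 0% or 100% in both groups")
    z = (rate_test - rate_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical campaign: 1,000 households per list.
z, p = two_proportion_z_test(redeemed_test=120, n_test=1000,
                             redeemed_control=80, n_control=1000)
print("z =", round(z, 2), "p =", round(p, 4))
```

The same function can be applied to clips, the secondary measure, by passing clip counts instead of redemptions.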
Slide 12: Improve Recommendations/Allocations
Customer deviation in buying behavior refined by customer profile changes
- Taxonomy-based approach to identify business semantics
- Major events that determine a change in buying pattern: location change, change in marital status, change in income group, birth of a child
- Sources for this information: social channels, purchase deviation
- Identify specific product categories relevant to the major event
- Association of product categories with various customer classifications
- For instance, customers with kids buy candies and customers with pets buy pet food
[Diagram labels: Customer Transactions; Customer 360; Association Clustering; Time Series; Exploratory techniques; Cluster assignments; Products eligible for recommendation; Refine classifiers; Time-specific products and associated products; Customer groups based on classifications; Product classification and customer segment association; Products list for target customer cluster; Campaign results; Matching / Filtering; Probabilistic product affinities based on segment behavior; Personalized recommendation list; Target recommendation]
Slide 13: Improve Recommendations/Allocations
Products bought by similar customers, but not by the current customer
- Identify similar customers more accurately using the extensive profile information available
- Classification of customers by predetermined attributes
- Usage of exploratory techniques to identify clusters of similar customers
- Identify product propensity for specific segments
- Determined by clustering and classification techniques
[Diagram labels: Customer Transactions; Customer 360 - NoSQL; Association Clustering; Exploratory techniques; Cluster assignments; Products eligible for recommendation; Refine classifiers; Segment-specific product lists; Customer groups based on classifications; Products list for target customer cluster; Campaign results; Matching / Filtering; Probabilistic product affinities based on segment behavior; Personalized recommendation list; Target recommendation]
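As a rough illustration of recommending products bought by similar customers but not by the current customer, the sketch below finds a customer's nearest neighbours by Jaccard similarity over profile attributes and then counts what those neighbours bought. The similarity measure, the profile tags and the purchase data are all assumptions made for the example; the slide itself names only clustering and classification.

```python
from collections import Counter

def jaccard(a, b):
    """Share of attributes two customers have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, profiles, purchases, k=2):
    """Products bought by the k most similar customers but not by the target."""
    neighbours = sorted(
        (c for c in profiles if c != target),
        key=lambda c: (-jaccard(profiles[target], profiles[c]), c),
    )[:k]
    counts = Counter(
        product
        for c in neighbours
        for product in purchases[c] - purchases[target]
    )
    # Most frequently bought first; ties broken alphabetically.
    return [p for p, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

# Hypothetical Customer 360 profiles and purchase histories.
profiles = {
    "c1": {"has_kids", "owns_home", "urban"},
    "c2": {"has_kids", "owns_home", "suburban"},
    "c3": {"has_kids", "urban", "rents"},
    "c4": {"has_pets", "rents", "rural"},
}
purchases = {
    "c1": {"milk", "bread"},
    "c2": {"milk", "candy", "cereal"},
    "c3": {"bread", "candy"},
    "c4": {"pet_food", "bread"},
}

print(recommend("c1", profiles, purchases))
```

For customer c1 the two closest profiles are c2 and c3, so candy (bought by both) ranks ahead of cereal (bought by one).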
Slide 14: Improve Recommendations/Allocations
Determine correlated items not bought by the current customer
Association rules
- Link association to determine products that are bought together: bread and butter, wine and cheese
- Identify products bought by the customer, but not the correlated item
- Recommendation based on absence of product
[Diagram labels: Customer Transactions; Customer 360 - NoSQL; Association Clustering; Exploratory techniques; Cluster assignments; Products eligible for recommendation; Refine classifiers; Segment-specific products and associated products; Customer groups based on classifications; Products list for target customer cluster; Campaign results; Matching / Filtering; Probabilistic product affinities based on segment behavior; Personalized recommendation list; Target recommendation]
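The association-rule idea can be sketched with pairwise rules scored by support and confidence. The baskets and thresholds below are hypothetical, and the sketch covers only single-item rules of the form "customers who buy A also buy B".

```python
from collections import Counter
from itertools import permutations

def pair_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Return {(a, b): confidence} for rules 'a -> b' that pass both thresholds."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in permutations(sorted(set(basket)), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n                 # share of baskets containing both items
        confidence = count / item_counts[a] # share of a-baskets that also contain b
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules

def missing_items(customer_items, rules):
    """Items the rules predict for this customer that the customer has not bought."""
    best = {}
    for (a, b), confidence in rules.items():
        if a in customer_items and b not in customer_items:
            best[b] = max(confidence, best.get(b, 0.0))
    return sorted(best, key=lambda item: (-best[item], item))

# Hypothetical transaction baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"wine", "cheese"},
    {"wine", "cheese", "bread"},
    {"bread", "milk"},
]

rules = pair_rules(baskets)
print(missing_items({"wine", "milk"}, rules))
```

Note that rules are directional: in this data butter implies bread with full confidence, but bread implies butter only half the time, so only the first rule survives the confidence threshold.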
Slide 15: Identify what customers want and when
Sample technique
Cross-tabulated data
- Salary
- Zip code
- No. of kids
- House owner
- Gender
- Brand1, Brand2, ..., BrandN
- Weight, Size, Volume
- Brand
- Category1, Category2, ...
- Offer clipped: Category1, ...
Transaction details merged with customer data to provide contextual information as required for inference
Transaction details for filtered customer list: buyers of cat food / cat food Generic 4 oz
Affinity models: models generated from historical data by the analytic engine to identify the affinity of specific variables
Associated variables: single or multiple variables by different segments, using a multi-model approach
Prediction models: application of variable affinity to the customer list to identify the probability that non-purchasers will purchase cat food / cat food Generic 4 oz
Customer list by probability
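A very simple stand-in for the affinity and prediction steps is to measure the historical purchase rate within each customer segment and then rank non-purchasers by their segment's rate. The slide does not specify the models used by the analytic engine, so this is only a sketch under that assumption; the customer records and the choice of segment variables are hypothetical.

```python
from collections import defaultdict

def segment_purchase_rates(customers, segment_keys):
    """Historical purchase rate for each combination of segment variables."""
    totals = defaultdict(int)
    buyers = defaultdict(int)
    for c in customers:
        segment = tuple(c[key] for key in segment_keys)
        totals[segment] += 1
        buyers[segment] += 1 if c["bought"] else 0
    return {segment: buyers[segment] / totals[segment] for segment in totals}

def rank_non_purchasers(customers, segment_keys):
    """Non-purchasers with their segment's purchase rate, most likely first."""
    rates = segment_purchase_rates(customers, segment_keys)
    scored = [
        (c["id"], rates[tuple(c[key] for key in segment_keys)])
        for c in customers
        if not c["bought"]
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical merged customer and transaction data; "bought" marks cat food buyers.
customers = [
    {"id": 1, "kids": 0, "house_owner": True, "bought": True},
    {"id": 2, "kids": 0, "house_owner": True, "bought": True},
    {"id": 3, "kids": 0, "house_owner": True, "bought": False},
    {"id": 4, "kids": 2, "house_owner": False, "bought": False},
    {"id": 5, "kids": 2, "house_owner": False, "bought": True},
    {"id": 6, "kids": 2, "house_owner": False, "bought": False},
]

for customer_id, probability in rank_non_purchasers(customers, ["kids", "house_owner"]):
    print(customer_id, round(probability, 2))
```

A production model would more likely fit a classifier over many variables at once; segment rates are used here because they are easy to inspect by hand.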
Slide 16: Contextualize information, correlate facts, predict and improve
Information from multiple operational and data warehousing systems that contain customer data and purchase details
Information from social channels that provides supporting information to create a detailed customer profile
Rule sets from the knowledge base accumulated over the years
Advanced Analytics - Product association
Filter
Customer list, probability
Buyer of Cat Food / Generic Cat food 4 ounce
Transaction details for this customer list
Filtered high-volume categories
Associated products by affinity confidence
Inferred rules
Slide 17: Obama for America Campaign 2012
Canvassing from older generation
Canvassing from youth
Slide 18: Obama for America Campaign 2012
- The Obama for America data science team used social media as a tool to efficiently recruit the human resources it needed heading into the election's home stretch
- Primary objective: determine who the best messengers were, whom they might be able to persuade, and what actions they might be willing to take
- Reason to harness social media:
- The majority of young voters were unreachable by phone calls or neighborhood canvassing, but were always connected to some form of social media
- Optimize resources by transforming voter intelligence into actionable intelligence
Slide 19: Traffic Congestion Control
- Big data analytics used for traffic congestion control
- Enables travellers to plan their routes to their destinations
- Enables traffic controllers to route cars effectively in order to avoid as much congestion as possible
- Implemented in LA as a joint initiative of Xerox and the LA transport department
Slide 20: DNA Sequencing and Cancer Therapies
- Previously, only small portions of people's genes were sequenced
- Big data technology enables the entire DNA to be sequenced, which is especially helpful for cancer patients
- Enables selecting therapies based on genetic markers and person-specific genetic makeup
- If one treatment becomes ineffective due to cancer mutation, use different therapies based on other gene markers
- Steve Jobs was one of the first people in the world to have his entire DNA sequenced
Slide 21: Thank You