Title: Analytics on Microsoft Excel
1Analytics on Microsoft Excel
Overview of solutions around the platform from a
real world perspective
- Alberto Guillén
- 03. June 2008
2- Capgemini is a leading company with long
experience in technology services - We are one of the biggest actors in Business
Intelligence in Norway - A major demand from our clients is delivering
solutions in Microsoft Excel, we have
continuously updated our efforts to adapt
clients needs -
Alberto Guillén Consultant Risk Management
Compliance MSc. Mathematics MSc.
Statistics Phone 47 46444721 E-Mail alberto.gu
illen_at_capgemini.com
3Excel is more than worksheet functions and tables
Basic statistical environment
Data Visualization and Data Mining
Desktop for analytical solutions
EXCEL
Front for Data Warehouse Collection (SQL Server)
GUI for in-house coded programs (VBA)
Reporting and Monitoring Tool (Performance Point)
Excel is the Industry standard for end-user
calculations, and also as front interface
4Analytical solutions can be created on different
complexity layers beyond basic Excel
Statistical Programming Languages
Analysis Tool pack
Third- party Add-ins
Data Mining Add-in
Excel
Solver
VBA
5Analytical solutions can be created on different
complexity layers beyond basic Excel
Statistical Programming Languages
Analysis Tool pack
Third- party Add-ins
Data Mining Add-in
Excel
Solver
VBA
6Analytical solutions can be created on different
complexity layers beyond basic Excel
Statistical Programming Languages
Analysis Tool pack
Third- party Add-ins
Data Mining Add-in
Excel
Solver
VBA
7Analytical solutions can be created on different
complexity layers beyond basic Excel
Statistical Programming Languages
Analysis Tool pack
Third- party Add-ins
Data Mining Add-in
Excel
Solver
VBA
8Analytical solutions can be created on different
complexity layers beyond basic Excel
Statistical Programming Languages
Analysis Tool pack
Third- party Add-ins
Data Mining Add-in
Excel
Solver
VBA
9Analytical solutions can be created on different
complexity layers beyond basic Excel
Statistical Programming Languages
Analysis Tool pack
Third- party Add-ins
Data Mining Add-in
Excel
Solver
VBA
10The average user masters the standard Excel tools
BASIC EXCEL
Tables and Filters
Worksheet functions
Charts
What If Analysis
Pivot Tables
Standard Excel allows direct interaction with the
raw data
11The Analysis Toolpack makes rigorous analysis
possible
ANALYSIS TOOLPACK
- Free Microsoft download
- Created for statisticians
- ANOVA
- Correlation
- Covariance
- Descriptive Statistics
- Exponential Smoothing
- F-Test Two-Sample for Variances
- Fourier Analysis
- Histogram
- Moving Average
- Random Number Generation
- Rank and Percentile
- Regression
- Sampling
- t-Test
Basic statistical analyses are available through
Analysis Toolpack
12Solver leverages computational abilities
SOLVER
- Free Microsoft download
- Optimizing and root finding set of algorithms
- Can be called on the background from VBA
- Practical but slow in heavy calculations
- Not exact convergence sometimes!
- Can be tuned
- Needs good seeds
Solver implements standard algorithms for
mathematical optimization problems
13Solver leverages computational abilities
SOLVER
- Free Microsoft download
- Optimizing and root finding set of algorithms
- Can be called on the background from VBA
- Practical but slow in heavy calculations
- Not exact convergence sometimes!
- Can be tuned
- Needs good seeds
Solver implements standard algorithms for
mathematical optimization problems
14Third party add-ins provide easily new
functionalities
ADD-INS
- Cheap
- Simple
- Easy to use
- No development efforts
Many small software developers use Excel as their
GUI
15There are several third party add-ins offering
solutions on quantitative analysis and Monte
Carlo simulation
ADD-INS
- Monte Carlo add-ins
- Crystal Ball
- Quantechs _at_Risk 5.0
- RiskAMP
- Lumenaut
- Simtools
-
- Also free .xla on the web
Hundreds of free or cheap add-ins offer various
solutions on fields like Risk Management
16The Table Analysis Tools add-in brings data
mining capabilities
DATA MINING
- Included in the data mining add-in
- Scenario analysis differs from the standard tool
- Brings the power of data mining to end-users
Data Mining is embedded into table functionality
17The Data Mining add-in easies data mining to
business analysts
DATA MINING
- What is Data Mining?
- Data mining is frequently described as "the
process of extracting valid, authentic, and
actionable information from large databases. - Microsofts approach to Data Mining
- Business Intelligence with a user-friendly
- interface, accessible to end-users and
- developers
- Software
- SQL Server 2005/2008 (Visual Studio BI)
- Excel/Visio add-ins
- DMX
- ADOMD.Net / AMO
Microsoft brings Data Mining to business users
for the first time
18Microsoft takes a different approach to Data
Mining
DATA MINING
-
- Donald Farmer - Principal Program Manager for
Microsofts Data Mining - "We don't have all the functionality of something
like a SAS or an SPSS, because that's just not
our market. Our market just has to be a much
larger market - We have a huge database marketing team who do
classic customer analysis. These guys were all
SAS users, but when they joined Microsoft, they
started using our tools. , they actually use
the Excel data mining add-ins to do it. It's not
that there's nothing they don't miss, it's that
they are able to achieve the same business
results using our tools. - "For a function such as 'Detect Categories,' what
the add-in is doing is building a clustering
model in the background , but we don't expect
the Excel user to understand that. We just call
it 'Build Categories,' - "We're seeing a lot of interest in the Excel-side
data mining,for one thing, but we're also seeing
interest in the embed-ability, too. The people
who are actually pushing this are from the
developer side.
Microsoft will not compete with traditional DM
vendors, Microsoft targets other users
19Data Mining assists in various business processes
DATA MINING
- Top Business Scenarios for DM
- Cross-sell and up-sell
- Campaign management
- Customer acquisition
- Budget and forecasting
- Customer retention
- New fields manufacturing, retail and entertaiment
- Main DM tasks
- Classification
- Estimation
- Prediction
- Association
- Clustering
Data Mining is being used in several business
areas
20Data Mining is performed in SQL Server 2005 / 2008
DATA MINING
SQL Server Business Development Studio and DMX
code is the natural environment
21Data Mining is also accessible through Excel 2007
DATA MINING
- The Excel add-in acts as a client to an instance
of Analysis Services - Both Excel and SQL Server Analysis Services
support the full DM Cycle
Data understanding
Data preparation
Modeling
Validation
Deployment
Excel sends DM queries and data directly to SQL
Server Analysis Services
22Data Mining is an iterative process
DATA MINING
- A mining model is part of a larger process that
includes everything -
- This process can be defined by using the
following six basic steps - Defining the problem
- Preparing Data
- Exploring Data
- Building Models
- Exploring and Validating Models
- Deploying and Updating
Problem
Working environment
?
Deployment
Although the process is illustrated as circular,
creating a data mining model is a dynamic and
iterative process
23There are 9 available Data Mining algorithms on
Excel
DATA MINING
- Decision/Regression Trees
- Clustering
- Naïve Bayes
- Association rules
- Sequence clustering
- Time series
- Neural Networks
- Logistic regression
- Linear regression
- Plug-in algorithms
- Third-party or self programmed implementing a set
of COM interfaces
9 built-in algorithms can be tuned to obtain new
ones
24Decision and Regression trees find natural splits
DATA MINING
- Decision trees classify and find associations
- Regression trees build segmented regressions
- Example
- Identify potential buyers
Decision trees give decision rules that are
suitable to business understanding
25Clustering finds homogeneous groups
DATA MINING
- Example Find segments of similar clients
age
income
Clustering can find hidden classes and identify
outliers
26Clustering finds homogeneous groups
DATA MINING
- Example Find segments of similar clients
Middle age Many cars and children
older age 2 cars no children
age
Young people No children
income
Clustering can find hidden classes and identify
outliers
27Naïve Bayes provides probabilities of group
membership
DATA MINING
- Example marketing campaign
?
?
Naîve Bayes is an efficient method to asses
probability of classification
28Association rules unveils hidden logic
DATA MINING
Association rules visualizes logical rules that
underly your business
29Sequence clustering finds event patterns in time
DATA MINING
Sequence clustering identifies clusters of
similarly ordered events in a sequence
30Time series forecasts processes in time
DATA MINING
- ARTx Microsoft proprietary algorithm
- ARIMA available in SQL Server 2008
- Example forecast seasonal sales to keep suitable
stock
historical
predicted
The past patterns that it discovers can be used
to predict values for future time steps.
31Time series forecasts processes in time
DATA MINING
- ARTx Microsoft proprietary algorithm
- ARIMA available in SQL Server 2008
- Example forecast seasonal sales to keep suitable
stock
historical
predicted
The past patterns that it discovers can be used
to predict values for future time steps.
32Neural networks discovers predictive patterns by
learning
DATA MINING
Neural networks learns in an uncontrolled manner
33Logistic regression predicts binary responses
DATA MINING
- Microsoft Logistic Regression is implemented as a
trivial neural network - Example Probability of credit default based on
personal information
Logistic regression gives probabilities of
YES/NO given some attributes
34Linear regression is of course also available
DATA MINING
- It is however extended by Regression Trees
- (Linear Regression is implemented as a
particular case) - Example extrapolate the influence of oil price
on house prices
The classical linear regression is also
integrated in the add-in
35Chosen examples vs. real life problems
DATA MINING
- So far, we have seen chosen examples
- Shopping basket
- Web navigation
- Market segmentation
-
- Unfortunately, it is not that easy data Mining
is a creative and unclear process. Sometimes
there is no answer with data mining. - Books dont show examples on when not to use the
algorithms - Time series long forecasts
- Classification trees credit scoring
- Sequence clustering non-markovian processes
-
Bottom line understand the statistical models
behind the icons
36Data Mining using the Data Mining add-in to
forecast Credit Default
DATA MINING
- Logistic regression
- The algorithm uses Z-scores transformations
- Cut-offs should be calculated when deploying
Ranked classes
Number of payments
A
score
Income
B
Probability of default
C
Age
D
Civil status
After training the algorithm, probabilities of
default can be predicted for new applicants
37Data Mining using the Data Mining add-in to
forecast Debt Recovery
DATA MINING
- Problem need the algorithm k-nearest neighbours
- Can be implemented as a plug-in algorithm
recovered
t
Some problems require creative approaches
38Data Mining using the Data Mining add-in to
forecast Debt Recovery
DATA MINING
- Hybrid between time series and regression trees
-
- Problem need the algorithm k-nearest neighbours
- Can be implemented as a plug-in algorithm
recovered
Training
period
t
Some problems require creative approaches
39Data Mining using the Data Mining add-in to
forecast Debt Recovery
DATA MINING
- Hybrid between time series and regression trees
-
- Problem need the algorithm k-nearest neighbours
- Can be implemented as a plug-in algorithm
recovered
Training
period
t
Some problems require creative approaches
40Data Mining using the Data Mining add-in to
forecast Debt Recovery
DATA MINING
- Hybrid between time series and regression trees
-
- Problem need the algorithm k-nearest neighbours
- Can be implemented as a plug-in algorithm
recovered
Training
period
t
Some problems require creative approaches
41Data Mining using the Data Mining add-in to
forecast Debt Recovery
DATA MINING
- Hybrid between time series and regression trees
-
- Problem need the algorithm k-nearest neighbours
- Can be implemented as a plug-in algorithm
recovered
Training
period
t
Some problems require creative approaches
42Data Mining using the Data Mining add-in to
forecast Debt Recovery
DATA MINING
- Hybrid between time series and regression trees
-
- Problem need the algorithm k-nearest neighbours
- Can be implemented as a plug-in algorithm
recovered
Training
period
t
Some problems require creative approaches
43Data Mining using the Data Mining add-in to
forecast Debt Recovery
DATA MINING
- Hybrid between time series and regression trees
-
- Problem need the algorithm k-nearest neighbours
- Can be implemented as a plug-in algorithm
recovered
Predicting
One period older Period Period 1 Age Age
period
t
Some problems require creative approaches
44Data Mining using the Data Mining add-in to
forecast Debt Recovery
DATA MINING
- Hybrid between time series and regression trees
-
- Problem need the algorithm k-nearest neighbours
- Can be implemented as a plug-in algorithm
recovered
x
Training
period
t
Some problems require creative approaches
45Data Mining using the Data Mining add-in to
forecast Debt Recovery
DATA MINING
- Hybrid between time series and regression trees
-
- Problem need the algorithm k-nearest neighbours
- Can be implemented as a plug-in algorithm
recovered
x
Predicting
period
t
Some problems require creative approaches
46VBA for Excel is the main tool for automated
solutions
VBA
- Communication with other software (COM Server)
- Build algorithms not available in Excel
- Automation of processes (Macro programming)
- Easy and quick interaction with the solution
through ActiveX Buttons and Userforms - Possibility to embed analytical solutions in a
simple user-front to end users without the right
competence
VBA allows in-house built solutions
47VBA Building a statistical tool for analyzing
and forecasting Debt Collection
VBA
- With VBA it is possible to deliver customized
solutions to end users - Problems a lot of work to implement statistical
algorithms, Solver can get slow
VBA is the tool to use to provide end-users with
an interactive work station
48There are no limits with statistical programming
languages
R
Statistical Programming Languages (COM Server)
Analysis Tool pack
Third- party Add-ins
Data Mining Add-in
Excel
Solver
VBA
49There are no limits with statistical programming
languages
R
Statistical Programming Languages (COM Server)
Analysis Tool pack
DDE
Third- party Add-ins
Data Mining Add-in
OLE
1991
Excel
ActiveX
Solver
1996
DCOM
COM
1999
.Net
VBA
WCF
50R is becoming the standard in the scientific
community
R
- R is a statistical programming language with
syntax similar to S-plus - R is free (under GNU license)
- R uses statistical libraries created by
statisticians all over the world - R communicates with Excel through a COM server
- COM is a set of interfaces that covers OLE,
ActiveX, DCOM, ... - R Excel add-in
- Background mode
- Small Ribbon toolbar
- Fast
- Code embedded in
- Worksheet functions
- VBA
- Cells
R allows analysts to implement the most advanced
mathematical models
51Histograms are a dangerous tool to approximate
empirical pdfs
R
With standard Excel
52Histograms are a dangerous tool to approximate
empirical pdfs
R
With standard Excel
53Histograms are a dangerous tool to approximate
empirical pdfs
R
With standard Excel
54With the R add-in, advanced semiparametric
methods are available
R
With the R add-in for Excel
Empirical probability distribution functions are
easily approximated with the R add-in
55With the R add-in, advanced semiparametric
methods are available
R
With the R add-in for Excel
Empirical probability distribution functions are
easily approximated with the R add-in
56With the R add-in, advanced semiparametric
methods are available
R
With the R add-in for Excel
Empirical probability distribution functions are
easily approximated with the R add-in
57With the R add-in, advanced semiparametric
methods are available
R
With the R add-in for Excel
Empirical probability distribution functions are
easily approximated with the R add-in
58With the R add-in, advanced semiparametric
methods are available
R
With the R add-in for Excel
Empirical probability distribution functions are
easily approximated with the R add-in
59Statistical programming languages problem case
R
- Forecasting Oslo Børs Hovedindeks
?
Some problems demand advanced statistical
approaches
60Statistical programming languages problem case
R
- Using Monte Carlo simulation to predict default
in Specialized Lending
NIBOR
Oil price
Complex multivariate Monte Carlo models are
developed fast in R
61Industrial solutions are another alternative
Industry vendors
Industrialized Vendors
Statistical Programming Languages (COM Server)
Analysis Tool pack
Third- party Add-ins
Data Mining Add-in
Excel
Solver
VBA
62Industrial solutions should be chosen only if the
area requires it
Industry vendors
- When to consider Industrial Solutions
- Big companies (important deployment)
- Special industrial subject area
- Data warehouse integration
- Highly competent staff within Analytics
- Expensive investment study worthiness
- Whatever vendor, check Excel compatibilities
(reporting, platform migrations, )
The previously presented alternatives for Excel
can do the their job at end-user level
63Further references
- Capgemini
- www.no.capgemini.com (alberto.guillen_at_capgemini.co
m) - Microsoft Data Mining
- http//www.sqlserverdatamining.com
- http//www.microsoft.com/sqlserver/2008/en/us/data
-mining-addins.aspx - R
- http//www.r-project.org
- http//sunsite.univie.ac.at/rcom/