Title: MD 240 Data Management: Warehousing, Analyzing, Mining and Visualization
1 MD 240 Data Management: Warehousing, Analyzing, Mining and Visualization
2 Agenda
- Background
- Data Management
- Data Collection
- Data Cleaning, Preparation & Warehousing
- Data Analysis
- Visual Methods for Discovery & Presentation
- Marketing Transaction Databases
3 Background
- Until recently, it was difficult for analysts and managers to perform analyses related to their business activities
- With the spread of PCs and networked devices
  - it has become easier than ever to collect data about activities in an organization
  - it has become more feasible to shift analysis from a task for the statistician in the back office to salespeople, managers, and analysts closer to the front office
4 Background
- Difficulties with data analysis for business intelligence
  - Data amount increasing exponentially
  - Multiple sources of data increasing all the time
  - Only a small portion of the total data collected are usually useful for making a decision
  - Increasing need for external data
  - Differing legal requirements about data collection in different countries
  - Selection of data management tool from the many available tools
  - Data security, quality, integrity, etc.
5 Data Management
6 Data Management: Data Management Process
- Data Life Cycle Process
  - Data collection
  - Data stored in databases
  - Pre-process databases
    - Clean out junk
    - Get data close to what decision-makers need
  - Transformation of data
    - Make it ready for analysis
  - Store in data warehouse
  - Use data mining tools to discover patterns
    - Create knowledge
  - Presentation of results
7 Data Management: Data Management Process
[Figure: seven-step data management process]
- Step 1 Raw Data Collection
- Step 2 Data Selection
- Step 3 Pre-Process Data
- Step 4 Transform Data
- Step 5 Store in Data Warehouse
- Step 6 Discover Patterns w/ Data Mining
- Step 7 Interpret, Present, Use Results
- Figure labels: Raw Data; Interesting Data; Clean, Usable Data; Data Warehouse; Data Transform; KDD Analysis; Act on Results
- Transformation example in figure: X1, X2; V1 = X1/X2; V1, V2, V3
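The selection, pre-processing, and transformation steps can be sketched in a few lines of Python. This is only an illustration: it assumes the figure's "V1 X1/X2" label means the derived variable V1 = X1/X2, and the field names, sample records, and function names are hypothetical.

```python
def select(records, fields):
    """Step 2: keep only the fields of interest."""
    return [{f: r.get(f) for f in fields} for r in records]

def preprocess(records):
    """Step 3: drop records with missing values or a zero divisor."""
    return [
        r for r in records
        if all(isinstance(v, (int, float)) for v in r.values()) and r["x2"] != 0
    ]

def transform(records):
    """Step 4: derive V1 = X1 / X2 (assumed meaning of the figure label)."""
    return [{**r, "v1": r["x1"] / r["x2"]} for r in records]

# Step 1: hypothetical raw data, including junk that should be cleaned out.
raw = [
    {"x1": 10, "x2": 2, "note": "ok"},
    {"x1": None, "x2": 5, "note": "missing x1"},
    {"x1": 9, "x2": 0, "note": "zero divisor"},
    {"x1": 6, "x2": 3, "note": "ok"},
]

# Step 5: the cleaned, transformed records are what would be stored.
warehouse = transform(preprocess(select(raw, ["x1", "x2"])))
print(warehouse)
```

Only the two usable records survive, each with its derived V1 value, which is the "clean, usable data" that later mining steps would read.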
8 Data Management: Data Management Process
9 Data Collection
10 Step 1 Data Collection: Data Sources
11 Step 1 Data Collection: Data Strategy
- Fundamental philosophy guiding data collection
  - GIGO: garbage in, garbage out
12 Step 1 Data Collection: Data Sources
- Internal data
  - data/info. about organizational activities
- Personal data
  - data/info. documenting employees' activities
- External data
  - government, competitors, suppliers
- The Internet
  - screen scraping data out of the browser
- Commercial database services
  - Online databases
13 Step 1 Data Collection: Data Capture and Input
- Past
  - Type in by hand
    - time consuming
    - costly
    - many typing errors
- Now
  - Objective is to automate
    - save paper storage costs of leasing warehouses
    - faster access to documents and information in documents
  - Document Management Systems
    - scanners for digitizing archived paper documents
    - databases for archiving, search, retrieval
14 Step 1 Data Collection: Data Quality (DQ)
- Intrinsic DQ
  - Accuracy, objectivity, believability, and reputation
- Accessibility DQ
  - Accessibility and access security
- Contextual DQ
  - Relevance, value added, timeliness, completeness
- Representation DQ
  - Interpretability, ease of understanding, concise representation, and consistent representation
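Two of the DQ dimensions above, completeness (contextual DQ) and consistent representation (representation DQ), lend themselves to simple automated checks. The sketch below is one possible way to measure them. The customer records, field names, and the choice of YYYY-MM-DD as the expected date format are assumptions made for illustration.

```python
import re

# Assumed target representation for dates: YYYY-MM-DD.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def completeness(records, field):
    """Share of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def consistent_dates(records, field):
    """True if every non-empty value uses the YYYY-MM-DD representation."""
    values = [r[field] for r in records if r.get(field)]
    return all(ISO_DATE.match(v) for v in values)

# Hypothetical records: one missing email, one date in a different format.
customers = [
    {"id": 1, "email": "a@example.com", "joined": "2002-01-15"},
    {"id": 2, "email": "", "joined": "2002-02-01"},
    {"id": 3, "email": "c@example.com", "joined": "03/07/2002"},
    {"id": 4, "email": "d@example.com", "joined": "2002-04-20"},
]

email_completeness = completeness(customers, "email")
dates_consistent = consistent_dates(customers, "joined")
print(email_completeness, dates_consistent)
```

Dimensions such as believability or reputation are judgments about the source and are not captured by checks of this kind.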
15 Data Cleaning, Preparation & Warehousing
16 Steps 2-5 Data Warehousing: Transactional Processing
- Store data in databases
- Objectives of TPS
  - Standardized transactions
  - Simple computations
    - non-complex
    - not very mathematical or statistically oriented
  - High volume
  - Low cost
17 Steps 2-5 Data Warehousing: Transaction vs. Analytical Processing
- Task objectives for a useful analytical data delivery system
  - Easy data access by end users
  - Quicker decision making
  - Accurate and effective decision making
  - Flexible decision making
18 Steps 2-5 Data Warehousing: Transaction vs. Analytical Processing
- Characteristics of a useful analytical data delivery system
  - Business representation of data for end users
  - Client-server or Web-based environment that provides end users with query and reporting capability
  - Server-based repository (data warehouse)
19 Steps 2-5 Data Warehousing: Data Warehouse and Data Marts
- Data Warehouse
  - establishes a data repository, that ...
  - makes operational data accessible in a form readily acceptable for analytical processing activities
- Metadata
  - data summaries for faster indexing and searching within data warehouse
  - information on how the data have been organized
- Data Mart
  - dedicated to a functional area, or ...
  - dedicated to a regional area
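The relationship between a warehouse and its data marts can be illustrated with Python's built-in sqlite3 module. In this sketch each mart is modeled as a view over one warehouse table, which is an assumption made to keep the example small; in practice marts are often separate physical stores. The table, column, and view names and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The warehouse: one repository holding data from all areas and regions.
conn.execute(
    "CREATE TABLE warehouse_sales ("
    "region TEXT, function_area TEXT, sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)",
    [
        ("Northeast", "marketing", "2002-01-05", 120.0),
        ("Northeast", "finance", "2002-01-06", 80.0),
        ("West", "marketing", "2002-01-07", 200.0),
        ("West", "finance", "2002-01-08", 50.0),
    ],
)

# A data mart dedicated to a functional area.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT * FROM warehouse_sales WHERE function_area = 'marketing'"
)
# A data mart dedicated to a regional area.
conn.execute(
    "CREATE VIEW northeast_mart AS "
    "SELECT * FROM warehouse_sales WHERE region = 'Northeast'"
)

marketing_total = conn.execute(
    "SELECT SUM(amount) FROM marketing_mart"
).fetchone()[0]
northeast_total = conn.execute(
    "SELECT SUM(amount) FROM northeast_mart"
).fetchone()[0]
print(marketing_total, northeast_total)
```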
20 Steps 2-5 Data Warehousing: Data Warehouse and Data Marts
21 Steps 2-5 Data Warehousing: Characteristics of Data Warehousing
- Desirable Characteristics for a Data Warehouse
  - Organization
    - organized by subject; extraneous items removed
  - Consistency
    - identical measurement and representation of same data
  - Time variant
    - varies over time; time-series data
  - Nonvolatile
    - data are not updated once entered
  - Relational
    - table-based structure (RDBMS)
22 Steps 2-5 Data Warehousing: Characteristics of Data Warehousing
- Data Warehousing is most suitable for organizations in which
  - End users need to access large amounts of data
  - Operational data are stored in several different systems
  - Different systems represent the same data in different formats
  - Management relies on information for decision making
  - There is a large, diverse customer base
  - Extensive end-user computing is performed
23 Data Analysis
24 Step 6 Data Analysis: Knowledge Discovery in Databases (KDD)
- Foundations of KDD
  - Massive data collection
  - Powerful multiprocessor computers
  - Intelligent data mining algorithms
- Analyst/manager activities
  - Ad-Hoc Queries
  - OLAP Queries
  - Data Mining
25 Step 6 Data Analysis: Ad Hoc Queries
- Ad Hoc Queries
  - Let users access, navigate, and explore data in real time to make business decisions
- Ad hoc query tool requirements
  - Query creation is easy
  - Customized query creation
  - Easy to use interfaces for performing queries
  - Many data sources are supported
  - Seamless integration between analysis and reporting
26 Step 6 Data Analysis: OLAP Queries
- OLAP
  - An approach by which important queries and calculations are turned into online tools that managers can use over and over again
  - Decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies
  - MOLAP: multidimensional OLAP
  - ROLAP: OLAP using relational databases
  - WOLAP: web-based OLAP
27 Step 6 Data Analysis: OLAP Queries
- Capabilities of Online Analytical Processing (OLAP)
  - Access very large amounts of data
  - Analyze the relationships between many types of business elements
  - Involve aggregated data
  - Compare aggregated data over hierarchical time periods
  - Present data in different perspectives
  - Involve complex calculations between data elements
  - Able to respond quickly to user requests
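Aggregating over hierarchical time periods is the capability that is easiest to show concretely. The sketch below rolls hypothetical sales records up to the year level and then drills down to quarters. The data, the three-level hierarchy (month, quarter, year), and the function names are assumptions for illustration, not the behavior of any particular OLAP product.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact records: (sale date, region, amount).
sales = [
    (date(2002, 1, 15), "East", 100),
    (date(2002, 2, 10), "East", 150),
    (date(2002, 4, 5), "East", 200),
    (date(2002, 1, 20), "West", 80),
    (date(2002, 5, 2), "West", 120),
]

def time_key(day, level):
    """Map a date to a bucket at the requested level of the time hierarchy."""
    if level == "year":
        return (day.year,)
    if level == "quarter":
        return (day.year, (day.month - 1) // 3 + 1)
    if level == "month":
        return (day.year, day.month)
    raise ValueError(f"unknown level: {level}")

def roll_up(rows, level):
    """Sum amounts per (region, time bucket) at the given level."""
    totals = defaultdict(int)
    for day, region, amount in rows:
        totals[(region, time_key(day, level))] += amount
    return dict(totals)

by_year = roll_up(sales, "year")        # coarse, aggregated view
by_quarter = roll_up(sales, "quarter")  # drill-down one level
print(by_year)
print(by_quarter)
```

Moving from `by_year` to `by_quarter` is a drill-down; the reverse direction is a roll-up. A real OLAP server would typically precompute such aggregates so it can respond quickly.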
28 Step 6 Data Analysis: OLAP Queries
- OLAP Advantages
  - Adapt existing decision making tools to the WWW, integrate them with distributed data stores
  - Facilitates drill-down
- OLAP Shortcomings
  - Retrospective in nature
  - More of a reporting-oriented tool
  - A discovery-oriented tool for flexible data analysis of data already known to have importance
  - Less of a prediction-oriented tool
29 Step 6 Data Analysis: Data Mining
- Objectives of Data Mining
  - Automate discovery of previously unknown patterns
  - Automate prediction of
    - trends
    - behaviors
    - events
30 Step 6 Data Analysis: Data Mining
- Nature and Characteristics
  - Data often buried deep within large databases
  - Data wants to be Free!
  - Data may be consolidated in a data warehouse or kept in internet and intranet servers
  - Usually client-server architecture
31 Step 6 Data Analysis: Data Mining
- Nature and Characteristics (cont'd)
  - Data mining tools extract information buried in corporate files or archived public records
  - The miner is often an end user
  - Striking it rich usually involves finding unexpected, valuable results
  - Parallel processing computers are often needed to make this analysis fast enough to be useful to managers
32 Step 6 Data Analysis: Data Mining
- Common types of data mining
  - Mining of numerical data
  - Text mining: group documents or identify themes or information within documents
    - Documents
    - Web pages
  - Web site clickstream/event mining
33 Step 6 Data Analysis: Data Mining
- Data Mining yields five types of information
  - Association
    - e.g., correlation = 0.5; slope between X and Y = 0.73
  - Sequences
    - e.g., biggest, second biggest, etc.
  - Classifications
    - e.g., There are 3 types of competitors; use data mining to classify Firm X as a Type 1 competitor
  - Clusters
    - e.g., We don't know how many types of customers there are; let's try to discover if we can identify some similar customer groups
  - Forecasting
34 Step 6 Data Analysis: Data Mining Techniques/Tools
- Computer Science
  - Case-based reasoning
  - Neural computing
  - Intelligent agents
  - Others: decision trees, genetic algorithms, nearest neighbor method, and rule reduction
- Statistics
  - Cluster analysis
  - Most standard statistical tools (SAS, SPSS)
- Optimization
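Of the techniques listed, the nearest neighbor method is the simplest to sketch, and it fits the earlier classification example of assigning Firm X to a competitor type. The version below is a one-nearest-neighbor classifier using Euclidean distance. The two numeric features per firm, their values, and the labels are all hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Hypothetical labeled firms: (feature vector, competitor type).
labeled_firms = [
    ((0.9, 0.2), "Type 1"),
    ((0.8, 0.3), "Type 1"),
    ((0.2, 0.9), "Type 2"),
    ((0.5, 0.5), "Type 3"),
]

def nearest_neighbor(point, examples):
    """Return the label of the example closest to `point`."""
    _, label = min(examples, key=lambda ex: dist(point, ex[0]))
    return label

firm_x = (0.85, 0.25)
firm_x_type = nearest_neighbor(firm_x, labeled_firms)
print(firm_x_type)
```

Practical versions usually scale the features first and vote among several neighbors rather than relying on a single closest example.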
35 Step 6 Data Analysis: Data Mining Techniques/Tools
36 Step 6 Data Analysis: Data Mining Vendors
- Vendors
  - SAS Enterprise Miner
  - SPSS Business Intelligence
  - Insightful (www.insightful.com)
  - Microsoft Research
  - IBM
  - Blue Martini
  - Amdocs
  - DBMiner (www.dbminer.com)
  - PrudSys (www.prudsys.de)
  - Boston Area: Torrent (www.torrent.com), ThinkAnalytics (www.thinkanalytics.com)
- Learning Resources
  - Association for Computing Machinery (ACM) SIGKDD
  - KDD2002 conference (July 2002)
37 Visual Methods for Discovery & Presentation
38 Steps 6-7 Data Visualization: Multidimensionality
- Multidimensionality
  - real-world data typically have more than 2 or 3 dimensions
  - managerial analyses may require presentation of up to 7 or 8 dimensions to fully communicate discoveries
- Three factors
  - dimensions
  - measures
  - time
- Solution
  - technology that is flexible enough so that data can be organized the way managers prefer to see the data
39 Steps 6-7 Data Visualization: Examples of Variables
- Dimensions
  - Products, salespeople, market segments, business units, geographical locations
- Measures
  - Money, sales volume, head count, inventory, profit, actual versus forecasted
- Time
  - Daily, weekly, monthly, quarterly, yearly
40 Steps 6-7 Data Visualization: Presenting Multidimensional Data
- Data visualization involves presentation of data by digital technology
  - graphical user interfaces
  - digital images
  - geographical information systems
  - multidimensional tables and graphs
  - virtual reality
  - three-dimensional presentations
  - animation
41 Steps 6-7 Data Visualization: Presenting Multidimensional Data
- Low Tech Solutions for a few dimensions
  - Multidimensional Tables
    - reduce many dimensions down to 2D table format
  - Slicing and Dicing
  - Data rotation
    - ability to easily switch the 3 variables being analyzed and rotate 3D graphs on a computer screen
- High Tech Solutions for many dimensions
  - See Edward Tufte's books
    - The Visual Display of Quantitative Information
    - Envisioning Information
    - Visual Explanations
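The low-tech solutions above (multidimensional tables, slicing, and rotation) can be sketched with plain Python dictionaries. The facts, dimension names, and helper functions below are hypothetical, and the sales measure is simply summed over whichever dimensions are not shown in the 2D table.

```python
from collections import defaultdict

# Hypothetical facts with three dimensions and one measure:
# (product, region, quarter, sales).
facts = [
    ("Widget", "East", "Q1", 10),
    ("Widget", "West", "Q1", 7),
    ("Widget", "East", "Q2", 12),
    ("Gadget", "East", "Q1", 5),
    ("Gadget", "West", "Q2", 9),
]
DIMS = {"product": 0, "region": 1, "quarter": 2}

def pivot(rows, row_dim, col_dim):
    """Reduce the facts to a 2D table, summing over the other dimensions."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[DIMS[row_dim]]][row[DIMS[col_dim]]] += row[3]
    return {r: dict(cols) for r, cols in table.items()}

def slice_(rows, dim, value):
    """Keep only the facts where one dimension has a fixed value."""
    return [row for row in rows if row[DIMS[dim]] == value]

by_product_region = pivot(facts, "product", "region")                 # 2D table
q1_only = pivot(slice_(facts, "quarter", "Q1"), "product", "region")  # slice
rotated = pivot(facts, "region", "quarter")                           # rotation
print(by_product_region)
print(q1_only)
print(rotated)
```

Rotation here just means choosing a different pair of dimensions for the rows and columns of the same underlying facts.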
42 Steps 6-7 Data Visualization: Geographical Information Systems (GIS)
- GIS
  - A computer-based system for capturing, storing, checking, integrating, manipulating, and displaying data using digitized maps.
  - Plot data or present data analysis findings by
    - latitude and longitude
    - cities, major metropolitan areas
    - counties
    - states
    - nations
43 Steps 6-7 Data Visualization: Geographical Information Systems (GIS)
- Emerging GIS Applications
  - Sophisticated user interfaces
    - Multimedia, 3D graphics, animated and interactive maps
  - Integration of GIS and GPS
    - Reengineer aviation and shipping industries
  - Intelligent GIS (integration of GIS and ES)
  - Hand-held applications
    - Deploy mapping tools to PDAs and Java-based cell phones
  - Web applications
    - ESRI's ArcData GIS
44 Steps 6-7 Data Visualization: Geographical Information Systems (GIS)
- Vendors
  - ESRI (www.esri.com)
    - Arc/Info
    - ArcData Online (www.esri.com/data/online/index.html)
- Resources
  - www.gis.com
  - www.gisday.com
  - www.state.ma.us/mgis/
  - www.northeastarc.org
45 Steps 6-7 Data Visualization: Other Visualization Tools
- Visual Interactive Modeling
  - visual modeling of a system
- Visual Interactive Simulation
  - a visual front end to a simulation program
  - presents animation of system activities and statistical results during a simulation run
  - Real-time simulation: users can interact with the simulation model (prototyping, training, entertainment, video games)
- Virtual Reality
  - Fake environments that attempt to fool the viewer into perceiving that they are within a 3D world
  - Usually involves a headset, gloves, and other forms of sensory input/output devices
46 Marketing Transaction Databases
47 Application Area Marketing: Marketing Transaction Database (MTD)
- A new kind of database, oriented toward targeting and personalizing marketing messages in real time.
48 Application Area Marketing: Marketing Transaction Database (MTD)
- Purpose: targeting and personalization
- Structure: liquid, driven by real-time marketing
- Updates: real-time
- Data level: individual detail
- Data type: demographic (descriptive), behavioral, derivative
- Advantages: allows real-time analysis and decision-making, CRM
- Issues: emerging, no standards, not integrated with other systems
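The three MTD data types and the real-time update requirement can be illustrated with a small individual-level record. This is a sketch under stated assumptions, not a description of any actual MTD product: the class, its fields, the message-selection rule, and the sample events are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CustomerRecord:
    customer_id: str
    demographic: dict                               # descriptive data
    behavioral: list = field(default_factory=list)  # raw events as they arrive
    derivative: dict = field(default_factory=dict)  # values computed from events

    def record_event(self, category, amount):
        """Update the record as soon as a transaction arrives."""
        self.behavioral.append((category, amount))
        spend = self.derivative.setdefault("spend_by_category", {})
        spend[category] = spend.get(category, 0) + amount
        self.derivative["top_category"] = max(spend, key=spend.get)

def pick_message(record):
    """Personalize the marketing message from the derived data."""
    top = record.derivative.get("top_category")
    return f"New offers in {top}" if top else "Welcome"

customer = CustomerRecord("c1", {"age_band": "25-34"})
first_message = pick_message(customer)   # no behavior recorded yet
customer.record_event("books", 30)
customer.record_event("music", 50)
customer.record_event("books", 40)
latest_message = pick_message(customer)  # reflects the events so far
print(first_message, "|", latest_message)
```

Because the derived values are refreshed on every event, the message chosen always reflects the customer's most recent behavior, which is the sense in which the structure is driven by real-time marketing.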