Title: Data Warehousing Version 6.0 - 04/18/2000
1Data WarehousingVersion 6.0 - 04/18/2000
2Can your database answer questions like these?
- What is the cost of staff to break into a new
line of business? - What are the travel routes of my competitions
inventory? - At what velocity is my competitor moving toward a
common goal? - How will a transaction on a certain date be
affected by currency exchange rates? - Is a foreign labor source likely to produce a
higher quality product? - Which 20 of the problem creates 80 of the
problems?
3Can your database answer questions like these?
- By product and location, how can we regain a lost
customer base? - Which skill and staff levels are most likely to
accept the voluntary layoff package?
4Data is Difficult to Manage
- Amount of data increases exponentially past data
must be kept for long times new data are added
rapidly. - Data are scattered throughout organizations
collected by many individuals using different
methods and devices. - Only small portions of an organizations data are
relevant for specific decisions. - An ever-increasing amount of external data needs
to be considered in making organizational
decisions.
5Data is Difficult to Manage - contd.
- Raw data may be stored in different computing
systems, formats, and human and computer
languages. - Legal requirements relating to data differ among
countries and change frequently. - Selecting data management tools can be difficult
because of the large number of tools available. - Data security, quality, and integrity are
critical, yet easily jeopardized.
6Data and Knowledge Managementment
- Businesses do not run on data. They run on
information and their knowledge of how to put
that information to use successfully.
7Some Information Concepts
- Data Unorganized facts and figures. (raw
material) - Information Data that has been processed into a
form that is meaningful to the recipient and is
of real of perceived value in current or
prospective actions or decisions. - Information
- adds to a representation
- corrects or confirms previous information
- has surprise value in that it tells us
something we did not know, or could not predict. - What is a finished product to one, may be raw
materials to someone else.
8Definitions Information vs. Knowledge
- Knowledge a combination of instincts, ideas,
rules, and procedures that guide actions and
decisions. - Helping to provide the best available knowledge
to decision-making is another role of information
systems
9Relationship Between Data, Information, and
Knowledge
- The difference between data and information is
easy to remember. - It is often cited as the reason why systems that
collect large amounts of information fail to meet
managements information needs. - There are many methods of converting data into
information for decision making. - Managers take action based on information about a
current situation plus their accumulated
knowledge. Actions taken feed the process of
accumulating more knowledge (experience). - Example How do medical students become competent
physicians?
10Relationship Between Data, Information, and
Knowledge
11Attributes of Quality Information
- Timeliness
- Completeness
- Conciseness
- Relevance
- Accuracy
- Precision
- Appropriateness of Form
12Special Characteristics of Information
- Usefulness - depends on combination of
quality,accessibility,and presentation. - One persons information may be another persons
noise. - Soft data may be as important as hard data.
- Ownership of information may be hard to maintain.
- More information is not always better
(information overload). - Politics can often hide or distort information.
13Sources of Data
- Internal Data
- Data about people, products, services, processes.
- Often stored in corporate data bases (e.g. Sales
or HR). - Some data may be disparate in different regions,
but accessible by networks. - Personal Data
- Individuals document expertise by creating
personal data - subjective estimates. Some is
kept in heads, or mental models. - Can be store on PCs, or available on the Web.
- External Data
- Many sources.
14Some Sources of Business External Data
- Federal Publications
- Survey of Current Business
- Monthly Labor Review
- Federal Reserve Bulletin
- Employment and Earnings
- Commerce Business Daily
- Census Bureau
- Other
- International Monetary Fund
- Moodys
- Standard Poors
- Advertising Age
- Dialog and Lexis/Nexis
- Other-contd.
- ABI/Inform
- Annual Editor Publisher Market Guide
- Thomas Register On-line
- Indexes
- Encyclopedia of Business Information Sources
15What is a data warehouse?
- A data warehouse is a pool of data organized in a
format that enables users to interpret data and
convert it into useful information to gain
knowledge from this interpretation. - It is a single place that contains complete and
consistent data from multiple sources. - Data warehousing is the act of a business person
extracting business value from the data stored in
the data warehouse.
16Collecting Raw Data
- Is not Easy
- collect in the field
- elicit from people
- collect manually, electronically, or by sensors.
- Data collection technology has not kept pace with
advances of data storage technology. - Data collection from external sources is not easy
either. - Bottom Line Garbage IN, Garbage OUT - GIGO.
- Data Quality is an Important Issue.
17Data Quality Issues
- Intrinsic DQ
- accuracy, objectivity, believability, reputation.
- Accessibility DQ
- Accessibility and access security
- Contextual DQ
- relevancy, value-added, timeliness,completeness,
amount of data. - Representation DQ
- Interpretability, ease of understanding,concise
representation, consistent representation
18Why Data Warehousing?
- Managers do not make decisions that are good or
bad, they make decisions on the basis of good
or bad information. - Management information
- a. the right information
- b. in the right form
- c. at the right time.
- Most transaction-based information systems have
difficulty delivering this information.
19Why Data Warehousing - 2
- Not the right information
- data not easily accessible
- meaning is subtly (or significantly different)
from the question context. - Information is presented with too much or too
little detail, covers the wrong time spans, or is
in the wrong intervals.
20Why Data Warehousing - 3
- Not the right time
- Getting this information may require the efforts
of highly skilled professionals who are not
generally available at the whim of business
managers. - Data comes from a variety of different systems
which are resident on a variety of different
technology platforms.
21Why Data Warehousing - 4
- Not the right format
- If data is extracted, merged, and converted into
a meaningful information, often it is not in a
usable format. - Users will want it loadable into a particular PC
tool or spreadsheet with which he/she is
familiar. - Printouts weighing 10 pounds are not in the right
format. - a diskette with a COBOL file description is not
in the right format.
22The Dilemma for Corporate IT
- How to control scarce IT resources consumed by
insatiable user demand for ad-hoc reports. - Each ad-hoc report generated by IT and analyzed
by the user generates three more reports to
further illuminate the insights gleamed in the
first. - Often the extract programs have few reusable
components. - The user is on a voyage of discovery in a sea of
data.
23The Response of Corporate IT
- New methodologies Align the IT systems with the
business goals and requirements. - These techniques concentrate on business process
requirements, not decision support requirements. - Transaction systems must be rigorously specified
in advance. The are an intersection between the
organization and the customer. - These systems should not be a voyage of
discovery for either.
24Transaction Systems vs. Analytical Support Systems
- Transaction Systems
- Insert an order for 300 baseballs
- Update this passengers airline reservation.
- close-out accounts payable records for this
vendor. - What is the current checking account balance for
this customer? - Analytical Support Systems
- Did the sales promotion last quarter do better
than the same promotion last year? - Is the five-day moving average for this security
leading or trailing actual prices? - Which product line sells best in middle-America
and how does this correlate to demographic data.
25Analytical Processing
- Analytical Processing today includes what in the
past have been called - DSS (Decision Support Systems)
- EIS (Executive Information Systems)
- ESS (Executive Support Systems)
- It is an evolution of End-User Computing
- Placing strategic data access in the hands of
decision makers aids productivity and enables
them to be better decision makers.
26Key Difference OLTP vs. OLAP
- OLTP Processing specific functions
- OLAP providing flexibility for undetermined
analysis.
27A Multidimensional database
28Data for Decision Support
- The data must be integrated - requires data from
many separate internal corporate databases. - The data must be enriched - through integration
with other external data. - The data must be available - and not constrained
by machine resources.
29Sources of Data
- Internal Data
- Financial Systems
- Logistics Systems
- Sales Systems
- Production Systems
- Personnel Systems
- Billing Systems
- Information Systems
- External Data Needs
- to recognize opportunities
- to detect threats
- to identify synergies
30Sources of Data - 2
- External Data Categories
- Competitor Data
- Economic Data
- Industry Data
- Credit Data
- Commodity Data
- Econometric Data
- Psychometric Data
- Meteorological Data
- Demographic Data
- Sales Marketing Data
31Operational Control vs. Operational Strategy
- Data is a source not just of operational control,
but of operational strategy. - Operational strategy is an attempt to describe
the need, in a competitive and turbulent market,
to continually innovate and re-align strategy
with time scales too short to be comprehended by
strategic planning in the conventional corporate
sense.
32Comparison of Control and Strategy Data
- Operational Data
- short-lived, rapidly changing
- requires record-level access
- repetitive standard transactions and access
patterns - updated in real-time
- event-driven process generates data
- Strategic Data
- long-living, static
- data aggregated into sets (which is why warehouse
data is friendly to RDBMS). - ad-hoc queries with some periodic reporting
- updated periodically with mass loads
- data-driven data governs process.
33Information Requirements by Management
Level(Source Gorry and Scott Morton)
34Dimensional Modeling
- Dimensional Modeling gives us a way to visualize
data. - The CEOs perspective
- We sell products in various markets, and we
measure our performance over time. - From the data warehouse designers perspective,
we hear three dimensions - We sell Products
- in various Markets
- and measure performance over time.
35Dimensional Modeling - contd.
- Management may be interested in examining sales
figures in a certain city by product, by time
period, by salesperson, and by store. - The more dimensions involved, the more difficult
it is to represent in a single table or graph. - The ability to add and modify the dimensions used
in a table or graph is often known as slicing
and dicing the data.
36Dimensional Model of the Business
e
m
i
T
M a r k e t
Product
37Data Dependencies Model of a Business
Product
Ship To
Shipper
Ship Type
Product Line
District Credit
Contact Location
Order Item
Sales Order
Customer Location
Product Group
Contact
Contract
Customer
Contract Type
Sales Rep
Sales District
Sales Region
Sales Division
38Transaction Processing
- The Relation Model was full of promises for equal
access to data. - In the early 1980s the relational model was a
dream. Typical transaction rates were one per
second. - Today the SABRE system typically processes 4,000
transactions per second, with peak bursts of
13,000 per second. - OLTP - (On-line Transaction Processing) The point
is to get data in to the database.
39Segregating Operational and Warehouse Data
- In the past, data administrators were constantly
told to build data sharing, normalization, and
non-redundant corporate databases. - Early attempts at data warehousing tried to share
the data with transaction-based systems. This
resulted in LONG response times for complex
queries. - The idea today is to keep the two separate.
- Separate databases, and perhaps separate DBMS
products and processor platforms are used. - Controlled and practical redundancy is better
than out-of-control theoretical purity.
40Fundamental Obstacles With Traditional Systems
- Systems Integration - Disintegration grew slowly
from islands of automation. - ownership, planning, economic, organizational
development issues all contribute. - Hardware Architecture
- Inconsistent Data
- Data Pollution
- Bad Application Design (semantic and syntactical
differences). - Ownership
- Data Entry Conventions
41The Data Warehouse
- Active, tactical, and current events flow from
the operational systems to the data warehouse to
become static, strategic, and historical data. - The data warehouse becomes a middle ground
where a large number of disparate and
incompatible legacy systems are tied to an
equally diverse collection of end-user
workstations. - Legacy systems usually comprise a hodge-podge of
assorted hardware, software, and operational
systems accumulated over many decades, are by
nature, incompatible with one another and unique
to each organization.
42Practical Facts About the Warehouse
- The chances are remote that any single vendor
will be able to develop a product that can
interface with all legacy systems painlessly
and seamlessly and at the same time, combine
data from modern platforms and data external to
the organization. - Instead warehouse product vendors develop
specialized capabilities to work with various
environments.
43Typical Dimensional Model
Product Dimension
Sales Fact
Product_key description brand category
Time_key product_key store_key dollars_sold units_
sold dollars_cost
Time Dimension
Time_key day-of-week month quarter year holiday_fl
ag
Store Dimension
Store_key store_name address floor_plan_type
44Fact Table
- Fact Table is where numerical measurements of the
business are stored. - Each measurement is taken at the intersection of
all the dimensions. - The best facts are numeric, continuously valued
and additive. - For every query made against the fact table may
use hundreds of thousands of individual records
to construct an answer set.
45Dimension Tables
- Dimension tables are where textual descriptions
of the business are stored. - Each textual description helps to describe a
member of the dimension. - Example each member in the product dimension is
a specific product. The product dimension
database has many attributes to describe the
product. A key role of the dimension table
attribute is to serve as the source of
constraints in a query.
46Example
47Example Query
- Find all product brands that were sold in the
first quarter of 1995 and present the total
dollar sales as well as the number of units. - Brand is a collection of individual products.
- To construct
- A. Drag attribute brand from product dimension.
Place as Row Header. - B. Drag Dollar Sales and Units Sold from the
Fact Table, and place to the right of the Brand
row header. - C. Specify row constraint 1st Q 1995 on the
quarter attribute in the Time Dimension Table.
48Multidimensionality
- Examples of dimensions
- products, salespeople, market segments, business
units, geographical locations, distribution
channels, country, industry - Examples of Facts or Measures
- money, sales volume, head count, inventory,
profit, actual vs. forecast. - Examples of Time
- daily, weekly, monthly, quarterly, yearly
49Components of a Data Warehouse - 1
- Acquisition - The first component handles
acquisition of data from legacy systems and
outside sources. - Data is identified, copied, formatted, and
prepared for loading into a warehouse. - Vendors provide tools for extraction and
preparation.
50Components of a Data Warehouse - 2
- Storage Area - The second component is the
storage area managed by relational databases,
multi-dimensional databases, specialized hardware
- symmetric multiprocessor (SMP) or massively
parallel processors (MPP) machines - or by
software. - The storage component hold the data so that many
different data mining, executive information and
decision support systems can make use of it
effectively.
51Components of a Data Warehouse - 3
- Access - The third component of the warehouse is
the access area. - Different end-user PCs and workstations draw data
from the warehouse with the help of
multi-dimensional analysis tools, neural
networks, data discover tools, or analysis tools. - These smart data-mining tools are the driving
force behind the data warehouse concept. - What good is it to store all the information
without some way to understand it in new and
different ways.
52Data Warehouse Access Tools
- Intelligent Agents and Agencies - tools work and
think for user. - Query Facilities and Managed Query environments.
- Statistical Analysis - One of the biggest
surprises in the data warehousing marketplace is
the resurgence of interest in traditional
statistical analysis, and the concomitant
resurrection of the popularity to products like
SAS and SPSS.
53Data Warehouse Access Tools - 2
- Data Discovery - A large class of tools formerly
classified as decision support, artificial
intelligence and expert systems. They now make
use of neural networks, fuzzy logic, decision
trees, and other tools from advanced mathematics
to allow a user to sift through massive amounts
of raw data to discover new, interesting,
insightful, and in many cases useful things about
the organization, its operations, and its
markets. - Currently there are nearly 60 different data
discovery tools/products on the market.
54Data Warehouse Access Tools - 3
- OLAP - On-line Analytical Processing often uses
multi-dimensional spreadsheet tools allowing
users to look at information from many different
angles. - Users are able to slice and dice reports and to
look at the same kinds of information at
different levels at the same time. - Typical OLAP application might allow a product
manager to view sales figures for a given product
at the national level, see them broken down by
division, drill down to see territories within a
division, check sales numbers for each store
within a territory, and then compare them against
sales of stores from another territory.
55Data Warehouse Access Tools - 4
- Data Visualization - These tools turn ugly,
boring numbers into exciting visual
presentations. - These tools bring graphical representation to new
heights. Example Geographical information
systems turn data about stores, individuals, or
anything else into compelling, easy to
understand, dynamic maps. - PC-based Geographic Information systems have the
ability to display spatial occurrences and the
relationship between and among geographically
specific variables.
56Developing the Data Warehouse
- The most expensive warehousing ventures involve
major new hardware acquisitions and significant
investments in training, analysis, and systems
development costs. - Typical startup projects allocate 60 of budget
for hardware and software for creation of a
powerful storage component. 30 on data mining
and acquisition tools. - Budgeting for Systems Analysis and Development
has 50 of budget on acquisition capabilities,
30 fund user solutions, 20 creation of
databases in the storage component.
57Developing the Data Warehouse
- Clarify what you want to do with the Warehouse -
How Will It be Used. - Scrutinize the offerings of vendors and systems
integrators. Make sure you understand which
functions they provide, and which you must build. - Most successful projects start as small, tightly
defined tactical systems to solve pressing
business needs, and develop into larger systems
over time.
58DW Summary Key Concepts
- The DW is a collection of integrated,
subject-oriented databases designed to support
the decision support function where each unit of
data is non-volatile and relevant to some moment
in time (W.H. Inmon, 1992). - Implicit Assumptions
- physically separate from operational systems
- hold aggregated data and transactional (atomic)
data for management separate from those used for
OLTP.
59DW Summary Characteristics
- Subject-orientation
- integrated
- non-volatile (i.e. not updated)
- time variant (kept for long periods, for
forecasting and trend analysis) - summarized
- large volume
- not normalized
- metadata
- data sources
60Data Mining