Title: ITEC 423 Data Warehousing and Data Mining
1ITEC 423 Data Warehousing and Data Mining
DATA WAREHOUSEBUILDING BLOCKS
2(No Transcript)
3What is a data warehouse?
- A data warehouse is a
- subject-oriented,
- Integrated (consolidated)
- time-variant, and
- nonvolatile
- collection of data in support of managements
decision-making process. - W. H. Inmon
4Subject-Oriented Data
- Warehouse is organized around major subjects of
the enterprise rather than major application
areas - Contains decision-support data rather than
application-oriented data. - The focus of the design is
- providing users easy access to data so that
current and future questions can be answered
5Application-Oriented vs Subject-Oriented
6Integrated or Consolidated Data
- Integrates corporate level application-oriented
data from different source systems - data is often inconsistent or missing
- Integrated data source must be made consistent to
present a unified view of the data to the users.
7Integrated Data
8Time-Variant Data
This is necessary to support trending,
forecasting, and time-based performance
reporting, such as current year versus previous
year.
- Data in the warehouse is only accurate and valid
at some point in time or over some time interval.
- contains slices of data across different periods
of time. - With these data slices, the user can view current
and past reports. - data represents a series of snapshots.
- Time-variance is also shown in the extended time
that data is stored - contains several years worth of data
- implicit or explicit association of time with all
data
9Non-Volatile Data
- Data in the warehouse is not updated in real-time
but is refreshed from operational systems on a
regular basis - New data is always added as a supplement to the
database, rather than a replacement.
10Data Granularity
- operational database
- data is usually kept at the lowest level of
detail - the units of sale are captured at the level of
units of a product per transaction at the check
-out counter. - the quantity ordered is captured and stored at
the level of units of a product per order
received from the customer. - If summary data is needed, the individual
transactions are grouped. - data warehouse
- Initial requests are for summary data to use for
analysis. - total sale units of a product in an entire
region. - Progessively more details may be required
- breakdown by states in the region.
- Examine sale units at individual stores.
11Data Granularity
- Is the level of detail
- keep data summarized at different levels of
detail
THREE DATA LEVELS IN A BANKING DATA WAREHOUSE
12Typical Properties of a data warehouse
- A data warehouse is housed on an enterprise
mainframe server. - Data from various online transaction processing
(OLTP) applications and other sources is
selectively extracted and organized - Read-only Copy
- re-structured
- Data warehouse database is used for processing
analytical applications and user queries.
13OLTP vs. Warehousing
- Organized by transactions vs. Organized by
particular subject - More number of users vs. less
- Accesses few records vs. entire table
- Smaller database vs. Large database
- Normalized data structure vs. Unnormalized
- Continuous update vs. periodic update (load)
14Data Warehouse Compared to OLTP
PROPERTY OLTP DATA WAREHOUSE
ACTIVITIES Processes Analysis
RESPONSE TIME Subseconds to seconds Seconds to hours
OPERATIONS DML Primarily read-only
NATURE OFDATA 30-60 days Snapshots over time
DATA ORGANIZE. By application By subject,time
SIZE Small to large Large to very large
DATA SOURCES Operational Internal Operational, Internal, External
15Data Warehouse Compared to OLTP
PROPERTY
OLTP
DATA WAREHOUSE
ACTIVITIES
Processes
Analysis
RESPONSE TIME
Subseconds
to
Seconds to hours
seconds
OPERATIONS
DML
Primarily read
-
only
NATURE OF
30
-
60 days
Snapshots over time
DATA
DATA ORGANIZE.
By application
By
subject,time
SIZE
Small to large
Large to very large
DATA SOURCES
Operational
Operational,
Internal
Internal, External
Predictable
Unpredictable
USAGE CURVEE
16Data Warehouse or Data Mart
17Data Warehouse Compared to Data Mart
Property Data Warehouse Data Mart
Scope Enterprise Department
Subjcts Multiple Single-subject, line of business (LOB)
Data Source Many Few
Size (typical) 100 GB to gt 1 TB lt 100 GB
Implementationtime Months to years Months
Data Warehouse
Data Mart
18Data Warehouse Compared to Data Mart
Property
Data Warehouse
Data Mart
Scope
Enterprise
Department
Subjcts
Multiple
Single
-
subject, line
of business (LOB)
Data Source
Many
Few
Size (typical)
100 GB to gt 1 TB
lt 100 GB
Implementation
Months to years
Months
time
Data Warehouse
Data Mart
19Which one to build first? Data warehouse or Data
Mart?
20Top Down Approach
21Top Down Approach
- Build the overall, big, enterprise- wide data
warehouse. - Instead of collection of fragmented islands of
information. - Data warehouse is large and integrated.
- would take longer to build and has a high risk
of failure. - If you do not have experienced professionals on
your team, this approach could be hazardous. - Difficult to sell this approach to senior
management and sponsors - They are not likely to see results soon enough.
22Pros and Cons of Top Down Approach
- A truly corporate effort, an enterprise view of
data - Inherently architected, not a union of disparate
data marts - Single, central storage of data about the content
- Centralized rules and control
- May see quick results if implemented with
iterations
- Takes longer to build even with an iterative
method - High exposure to risk of failure
- Needs high level of cross-functional skills
- High outlay without proof of concept
23Bottom Up Approach
24Bottom Up Approach
- Build departmental data marts one by one based on
priority - Collection of data marts make up the data
warehouse - Beware of data fragmentation
- Independent data marts are blind to the overall
requirements of the entire organization. - Data marts contain
- data at the lowest level of granularity
- summaries depending on the needs for analysis
- Data marts are joined or unioned together by
conforming the dimensions
25Pros and Cons of Bottom Up Approach
- Faster and easier implementation of manageable
pieces - Favorable return on investment and proof of
concept - Less risk of failure
- Inherently incremental
- can schedule important data marts first
- Allows project team to learn and grow
- Each data mart has its own narrow view of data
- Permeates redundant data in every data mart
- Perpetuates inconsistent and irreconcilable data
- Proliferates unmanageable interfaces
26Architectural Types
27Architectural Types
- Centralized Data Warehouse
- Takes into account the enterprise-level
information requirements - Atomic level data at the lowest level of
granularity is stored - Some summarized data may be included
- Queries and applications access the central data
warehouse. - No separate data marts
- Independent Data Marts
- Evolves in companies where the organizational
units develop their own data marts for their own
specific purposes - Each data mart serves a particular organizational
unit - More than one version of the truth may be found
- Data marts are independent of one another
- Different data marts may have inconsistent data
definitions and standards - Such variances hinder analysis of data across
data marts.
28Architectural Types
- Federated
- An existing legacy of an assortment of DSS in the
form of operational systems, extracted datasets,
primitive data marts, - May not be possible to discard investment and
start from scratch - Practical solution is a federated architectural
type - data may be physically or logically integrated
through shared key fields, overall global
metadata , distributed queries, and such other
methods - No one overall data warehouse
- Data-Mart Bus
- Conformed supermarts approach
- Analyzing requirements for a specific business
subject such as orders, shipments, billings,
insurance claims, car rentals, and ... - Build the first data mart (supermart) using
business dimensions and metrics - These business dimensions will be shared in the
future data marts. - Conform dimensions among the various data marts
- Result would be logically integrated supermarts
that will provide an enterprise view of the data - Data marts contain atomic data organized as a
dimensional data model - Results from adopting an enhanced bottom-up
approach to data warehouse development
29Architectural Types
- Hub-and-S poke
- Similar to the centralized data warehouse
architecture - Overall enterprise-wide data warehouse
- Atomic data is stored in the centralized data
warehouse - Major and useful difference is the presence of
dependent data marts in this architectural type - Dependent data marts obtain data from the
centralized data warehouse - The centralized data warehouse forms the hub to
feed data to the data marts on the spokes - Dependent data marts may be developed for a
variety of purposes - departmental analytical needs, specialized
queries, data mining, and ... - Dependent data mart may have normalized,
denormalized, summarized, or dimensional data
structures based on individual requirements - Most queries are directed to the dependent data
marts although the centralized data warehouse may
itself be used for querying - Result s from adopting a top-down approach to
data warehouse development.
30Building Blocks of Data Warehouses
31What is OLAP?
- online analytical processing
- Approach to answer multi-dimensional analytical
queries. - part of the broader category of
- business intelligence reporting, data mining.
- Applications include
- business reporting for sales,
- management reporting,
- budgeting
- forecasting
32Typical Data Warehousing Process
Phase I - STRATEGY Identify business
requirements Define objectives purpose of DW
Phase II - DEFINITION Project scoping and
planning Using building block approach
Phase III - ANALYSIS Information requirements are
defined
Phase IV - DESIGN Database structures to hold
base data and summaries are created Translation
mechanisms are designed
Phase V - BUILD DOCUMENT The warehouse is built
and documentation is developed
Phase VI - POPULATE, TEST TRAIN The warehouse
is populated and tested the users are trained on
system and tools
Iterative
Phase VII - DISCOVERY EVOLUTION The warehouse
is monitored and adjustments are applied, or
future extensions are planned
33What Does All This Mean?
- On a daily basis, organizations turn to their
data warehouses to answer a limitless variety of
questions. - Nothing is free
- these benefits do come with a cost.
- The value of a data warehouse is a result of the
new and changed business processes it enables. - There are limitations
- A DW cannot correct problems with the data,
although it may help to clearly identify them.
34Comparison of Typical DW Costs and Benefits
- Costs
- Hardware, software, development personnel and
consultant costs. - Operational costs like ongoing systems
maintenance. - Benefits
- Added Revenue
- Will the new (business objective) process
generate new customers (what is the estimated
value?) - Will the new (business objective) process
increase the buying propensity of existing
customers (by how much?) - Is the new process necessary to ensure that the
competition doesn't offer a demanded service that
you can't match? - Reduced costs
- What costs of current systems will be eliminated?
- Is the new process intended to make some
operation more efficient? If so, how and what is
the dollar value?
35The Cost of Warehousing Data
- Expenditures can be categorized as one-time
initial costs or as recurring, ongoing costs. - The initial costs can further be identified as
for hardware or software. - Expenditures can also be categorized as capital
costs (associated with acquisition of the
warehouse) or as operational costs (associated
with running and maintaining the warehouse)
36Expenditures Associated with Building a DW
Recurring Costs One-Time Costs
Capital Hardware maintenance Software maintenance Terminal analysis Middleware Hardware Software Disk DBMS CPU Terminal analysis Network Middleware Terminal analysis Network Log utility Processing Metadata Infrastructure
Operational Ongoing refreshment Integration transformation Data model maintenance Record identification maintenance Metadata infrastructure maintenance Archival of data Data aging within the DW Integration/transformation processing specification Metadata infrastructure population System of record definition Data dictionary language definition Network transfer definition CASE/Repository interface Initial data warehouse population Data model definition Database design definition
37Cost is Highly Variable
- A company that spends less money for their data
warehouse is often happier with it. - The main justification for the development
expense is that a DW reduces the cost of
accessing the information owned by the
organization. - Since information has to be retrieved just once
(when it is placed in the warehouse), DW users
see a lower cost on each report generated.
38Typical Multidatabase Report and Screen Generation
Data download and transformation contribute to
retrieval costs for every report or screen
generated
39Typical DW Report and Screen Generation
Data upload and transformation costs occur just
once. Retrieval costs are lower.
40Farmers and Explorers
- Every corporation has two types of DW users.
- Farmers know what they want before they set out
to find it. They submit small queries and
retrieve small nuggets of information. - Explorers are quite unpredictable. They often
submit large queries. Sometimes they find
nothing, sometimes they find priceless nuggets. - Cost justification for the DW is usually done on
the basis of the results obtained by farmers
since explorers are unpredictable.
41Data Marts and the Data Warehouse
Legacy systems feed data to the warehouse. The
warehouse feeds specialized information to
departments.
42The Data Mart is More Specialized
The data mart serves the needs of one business
unit, not the organization.
43Foundations of Data Mining
- Data mining is the process of using raw data to
infer important business relationships. - Despite a consensus on the value of data mining,
a great deal of confusion exists about what it
is. - It is a collection of powerful techniques
intended for analyzing large datasets. - There is no single data mining approach, but
rather a set of techniques that can be used in
combination with each other.
44The Roots of Data Mining
- The approach has roots in practice dating back
over 30 years. - In the early 1960s, data mining was called
statistical analysis, and the pioneers were
statistical software companies such as SAS and
SPSS. - By the 1980s, the traditional techniques had been
augmented by new methods such as fuzzy logic,
heuristics and neural networks.
45A General Approach
- Although all data mining endeavors are unique,
they possess a common set of process steps - Infrastructure preparation choice of hardware
platform, the database system and one or more
mining tools - Exploration looking at summary data, sampling
and applying intuition - Analysis each discovered pattern is analyzed
for significance and trends
46A General Approach (continued)
- Interpretation Once patterns have been
discovered and analyzed, the next step is to
interpret them. Considerations include business
cycles, seasonality and the population the
pattern applies to. - Exploitation this is both a business and a
technical activity. One way to exploit a pattern
is to use it for prediction. Others are to
package, price or advertise the product in a
different way.
47Review Vocabulary
- Data warehouse
- Data mart
- OLTP
- OLAP
- Dimensional Model
- Subject Oriented
- Time variant
- Non volatile
- Integrated/consolidate