Title: Data Warehouse and Design: Team 3A
1. Data Warehouse and Design: Team 3A
- Shane Brown
- Angela Daniels
- Charl Welman
2. Outline of Team 3A
- Beginning with Operational Data, p. 82-87
- Data/Process Models and the Architected Environment, p. 89-101
- The Data Model and Iterative Development, p. 102
- The Data Warehouse and Data Models, p. 89-101
- Normalization/Denormalization, p. 102-112
3. Beginning with Operational Data
- Some integration issues:
  - Data that is not encoded consistently
  - Field transformation
  - Legacy data existing in many different formats on many different DBMSs
Figure 3.2: Data across the different applications (SAVINGS, DDA, LOANS, TRUST) is severely unintegrated.
4. Beginning with Operational Data (cont.)
- Three types of loads are made into the data warehouse from the operational environment (OE):
  - Archival data
  - Data currently contained in the OE
  - Ongoing changes, reflecting the updates that have occurred in the OE since the last refresh
5. Beginning with Operational Data (cont.)
- Five common techniques for scanning data at the point of refresh:
  - Scan data that has been time-stamped in the OE
  - Scan a delta file, which limits the data that must be scanned
  - Scan a log file or audit file created as a by-product of transaction processing
  - Modify application code so that it manages the amount of data scanned
  - "Rub" a before image and an after image of the operational file together, comparing the two to find what changed
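The last technique above, rubbing a before image and an after image together, can be sketched in Python; the record layout and the `acct_no` key used here are hypothetical, not from the book.

```python
# Sketch: find changed records by comparing a before image and an
# after image of an operational file. Records are keyed dicts;
# the layout and field names are hypothetical.

def diff_images(before, after, key="acct_no"):
    """Return (inserts, updates, deletes) between two file images."""
    before_by_key = {rec[key]: rec for rec in before}
    after_by_key = {rec[key]: rec for rec in after}

    inserts = [rec for k, rec in after_by_key.items() if k not in before_by_key]
    deletes = [rec for k, rec in before_by_key.items() if k not in after_by_key]
    updates = [rec for k, rec in after_by_key.items()
               if k in before_by_key and rec != before_by_key[k]]
    return inserts, updates, deletes

before = [{"acct_no": 1, "balance": 100}, {"acct_no": 2, "balance": 50}]
after = [{"acct_no": 1, "balance": 120}, {"acct_no": 3, "balance": 75}]

ins, upd, dele = diff_images(before, after)
# Only the changes, not the whole file, flow into the refresh.
```

The same comparison is what a delta file gives you for free: the delta already holds only the changed records, so no rubbing pass is needed.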
6. Data/Process Models and the Architected Environment
- A process model consists of:
- Functional decomposition
- Context-level zero diagram
- Data flow diagram
- Structure chart
- State transition diagram
- HIPO chart
- Pseudocode
7. The Data Model and Iterative Development
- Why iterative development is important:
  - The industry track record of success strongly suggests it
  - The end user is unable to articulate requirements until the first iteration is done
  - Management will not make a full commitment until actual results are tangible and obvious
  - Visible results must be seen quickly
- Development efforts:
  - The data model serves as a roadmap for each of the development efforts
  - Each developer can be confident that his or her effort will intersect with the other development efforts
8. The Data Warehouse and Data Models
9. Corporate Data Models
- The corporate data model focuses on and represents only primitive data
- Performance factors are added to the corporate data model as the model is transported to the existing systems environment
- See Figure 3.8, page 92
10. Data Warehouse and Data Models
11. Corporate Data Model Changes
- Data that is used purely in the operational environment is removed
- Key structures are enhanced with an element of time
- Derived data is added to the corporate data model
- Data relationships in the operational environment are turned into artifacts in the data warehouse
12. Corporate Data Model Changes (cont.)
- Stability analysis involves grouping attributes of data together based on their propensity for change
- See Figure 3.9, page 91
13. Data Warehouse Data Modeling
- Three levels of Data Modeling
- High-Level Modeling (ERD)
- Midlevel Modeling (Data item set or DIS)
- Low-Level Modeling (Physical Model)
14. High-Level Modeling
- Features entities and relationships
- The entities shown at the ERD level are at the highest level of abstraction
- Figure legend: the symbols represent an entity (or major subject), a 1:n relationship, an m:n relationship, and a 1:1 relationship
15. Midlevel Data Model
- Four constructs make up the midlevel data model:
  - A primary grouping, which exists once for each major subject area
  - A secondary grouping, which holds data attributes that can exist multiple times
  - A connector, which relates data from one grouping to another
  - "Type of" data, indicated by a line leading to the right of a grouping
16. Physical Data Model
- Created from the midlevel data model by including keys and physical characteristics
- The last design step is to factor in performance characteristics:
  - Granularity and partitioning of the data
  - Physical input and output (I/O)
17. Normalization/Denormalization
18. Normalization/Denormalization (cont.)
- The output of the data model is a number of tables, each holding a modest amount of data
- Performance suffers when programs must perform many I/Os against these small tables, and many small tables become costly to operate
- A rational approach is to merge the smaller tables, reducing I/O and therefore cost
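The merge described above can be sketched as pre-joining two small normalized tables into one physical row set, so a single read returns everything; the customer/address tables and all column names here are hypothetical.

```python
# Sketch: merge two small normalized tables into one denormalized
# row set so one I/O retrieves both. Table and column names are
# hypothetical.

customers = [{"cust_id": 1, "name": "Jones"}, {"cust_id": 2, "name": "Smith"}]
addresses = [{"cust_id": 1, "city": "Austin"}, {"cust_id": 2, "city": "Boston"}]

def merge_tables(left, right, key="cust_id"):
    """Join matching rows from two tables into single physical rows."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key.get(l[key], {})} for l in left]

merged = merge_tables(customers, addresses)
# Each merged row now answers both "who" and "where" in one read.
```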
19. What Strategy?
- Merging tables is one strategy
- Creating arrays of data is another:
  - Normalized data resides in different physical locations, so each retrieval requires a physical I/O
  - If the data were placed in a single row as an array, one I/O would retrieve it all
20. When to Use Arrays
- Arrays pay off when:
  - The number of occurrences is stable
  - The data is accessed in sequence
  - The data is created and/or updated in a statistically well-behaved sequence
- These conditions occur regularly in the data warehouse because of its time-based orientation; creating arrays by month is frequent
21. Deliberate Redundancy
- In a fully normalized design, access is usually done through the base table
- When the table is large, access becomes expensive
- If the data is widely used and stable (no worries about updates), denormalizing it (introducing deliberate redundancy) is worthwhile
22. Separation/Calculation of Data
- When different parts of the data are accessed with very different frequencies, separate the data (e.g., a bank balance from the rest of the account record)
- Normalizing into two separate tables allows compact storage of the heavily accessed part
- Storing derived (calculated) data can reduce I/O: the data is calculated only once and is always available without further calculation
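Storing derived data at load time can be sketched as follows; the field names and the 0.062 tax rate are hypothetical, chosen only to make the example concrete.

```python
# Sketch: compute derived data (annual tax paid) once at load time
# and store it alongside the base data, instead of recalculating it
# on every query. Field names and the 0.062 rate are hypothetical.

payments = [1000.0, 1200.0, 900.0]

def load_employee_row(emp_id, payments, tax_rate=0.062):
    """Build a warehouse row with the derived value precomputed."""
    gross = sum(payments)
    return {"emp_id": emp_id,
            "gross_pay": gross,
            "tax_paid": round(gross * tax_rate, 2)}  # derived, stored once

row = load_employee_row(42, payments)
# Later queries read row["tax_paid"] directly; no recomputation.
```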
23. Creative Indexes
- Also called a creative profile
- Built as operational data is moved to the data warehouse, so it requires little overhead to create
- Profiles items of interest, such as the largest purchases or the latest shipments
- If management anticipates interest in an item, a creative index should be built for it
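A creative index built during the load pass can be sketched as a running top-N profile; the record layout and the "largest purchases" choice follow the slide, but the field names are hypothetical.

```python
# Sketch: build a creative index (profile) while records stream
# into the warehouse, tracking the n largest purchases seen so far.
# Record layout is hypothetical.
import heapq

def build_top_purchases(records, n=10):
    """Keep the n largest purchase amounts seen during the load pass."""
    top = []
    for rec in records:
        heapq.heappush(top, (rec["amount"], rec["order_id"]))
        if len(top) > n:
            heapq.heappop(top)  # discard the smallest kept so far
    return sorted(top, reverse=True)

records = [{"order_id": i, "amount": a}
           for i, a in enumerate([50, 9000, 120, 7500, 30, 8800])]
profile = build_top_purchases(records, n=3)
# profile holds the three largest purchases, found in the same pass
# that loads the data, which is why the overhead is so small.
```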
24. Referential Integrity
- Referential integrity appears as artifacts of relationships in the data warehouse environment
- Artifacts can be managed independently, are efficient to access, and do not require update
- Don't replicate operational referential integrity in the data warehouse environment
25. Snapshots
- Each type of data warehouse data centers on a structure called a snapshot
- Basic components:
  - A key
  - A unit of time
  - Primary data related only to the key
  - Secondary data captured as part of the snapshot
26. Snapshots (continued)
- A snapshot is created as the result of some event occurring
- Two types of triggers:
  - A discrete activity (some business occurrence)
  - A time event (end of day, week, or month)
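The four snapshot components and the event trigger above can be sketched as a simple record; all field names and sample values are hypothetical.

```python
# Sketch: a snapshot as a record with the four components from the
# slides. Field names and sample values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    key: str        # identifies the subject (e.g., an account number)
    moment: str     # the unit of time the snapshot refers to
    primary: dict   # data related directly to the key
    secondary: dict = field(default_factory=dict)  # circumstantial data

def on_event(key, moment, primary, secondary=None):
    """Trigger: create a snapshot when a business or time event fires."""
    return Snapshot(key, moment, primary, secondary or {})

# A month-end time event fires and produces one snapshot:
snap = on_event("acct-001", "2004-01-31", {"balance": 250.0},
                {"branch": "north"})
```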
27. Shane's Question
- Discuss some integration issues and give
examples.
28. Angela's Question
- What are the four constructs that make up the midlevel data model? Explain.
29. Charl's Question
- When do you use arrays to reduce I/O?