Title: Data Warehouse
1Data Warehouse
2Multiple Files Problem
- Added complexity of multiple source files
- Start simple
Logic to detect Correct source
Multiple Source files
Extracted data
3Transforming Data from Multiple files
File
File
File
File
File
File
File
File
File
4Missing Values Problem
- Solution
- Ignore
- Wait
- Mark rows
- Extract when time-stamped
A
If NULL then FieldA
5Duplicate Value Problem
- Solution
- SQL self-join techniques
- RDMBS constrains utilities
SELECT FROM table_a, table_b WHERE
table_a.key()table_b.key UNION SELECT FROM
table_a, table_b WHERE table_a.keytable_b.key()
ACME Inc
ACME Inc
ACME Inc
ACME Inc
ACME Inc
6Element Names Problem
Customer
Customer
Client
Contact
Name
7Element Meaning Problem
All customer details
All details Except name
Customers name
- Avoid misinterpretation
- Complex solution
- Document meaning in metadata
8Input Format Problem
EBCDIC
ASCII
12373
123-73
9Referential Integrity Problem
- Solution
- SQL anti-join
- Server constraints
- Dedicated tools
- Emp Name Department
- Smith 10
- Jones 20
- Doe 50
- 6787 Harris 60
Department 10 20 30 40
10Name and Address Problem
- No unique key
- Missing values
- Personal and commercial names mixed
- Different addresses for same member
- Different names and spelling for same number
- Many names on one line
- One name on two lines
11Name and Address Problem
- Single-field format
- Multiple-field format
Mr.J.Smith, 100 Main St., Bigtown, County Luth,
23565
Name Street Town County Code Mr.J.Smith 100 Main St. Bigtown County Luth 23565
12Clean and Organize
- Create atomic values.
- Standardize formats.
- Verify data accuracy.
- Match with other records.
- Identify private and commercial addresses and
inhabitants. - Document in metadata.
- Requires sophisticated tools and techniques
13Merging Data
- Operational transactions do not usually map
one-to-one with warehouse data - Data for the warehouse is merged to provide
information for analysis
Pizza sales/return by day, hour, seconds
Sale 1/2/98 120001 Ham Pizza
10.00
Sale 1/2/98 120002 Cheese Pizza
15.00
Sale 1/2/98 120002 Anchovy Pizza
12.00
Return 1/2/98 120003 Ham Pizza
-12.00
Sale 1/2/98 120004 Sausage Pizza
11.00
14Merging Data
Sale 1/2/98 120001 Ham Pizza
10.00
Sale 1/2/98 120002 Cheese Pizza
15.00
Sale 1/2/98 120002 Anchovy Pizza
12.00
Return 1/2/98 120003 Ham Pizza
-12.00
Sale 1/2/98 120004 Sausage Pizza
11.00
Sale 1/2/98 120001 Ham Pizza
10.00
Sale 1/2/98 120002 Cheese Pizza
10.00
Sale 1/2/98 120004 Sausage Pizza
11.00
15Adding a Date Stamp
- Enables time analysis
- Label loaded data with a date stamp
- Add time to fact and dimension data
16Adding a Date Stamp
Store Table Store_id District_id Time_key
Product Table Product_id Time_key Product_desc
Sales Fact Table Item_id Store_id Time_key Sales_d
ollars Sales_units
Item_Table Item_id Dept_id Time_key
Time Table Week_id Period_id Year_id Time_key
17Adding a Date Stamp
- Fact table
- - Add triggers
- - Recode applications
- - Compare tables
- Dimension table
- Time representation
- - Point in time
- - Time span
18Adding Keys to Data
1 Sale 1/2/98 120001 Ham Pizza
10.00
2 Sale 1/2/98 120002 Cheese Pizza
15.00
3 Sale 1/2/98 120002 Anchovy Pizza
12.00
4 Sale 1/2/98 120003 Ham Pizza
-12.00
5 Sale 1/2/98 120004 Sausage Pizza
11.00
Data values or artificial keys
dw1 Sale 1/2/98 120001 Ham Pizza
10.00
dw2 Sale 1/2/98 120002 Cheese Pizza
10.00
dw3 Sale 1/2/98 120004 Sausage Pizza
11.00
19Summarizing Data
- During extraction on staging area
- After loading onto the warehouse server
Warehouse database
Operational databases
Staging area
20Maintaining Transformation Metadata
Contains transformation rules, algorithms, and
routines
Sources Stages Rules
Publish Extract Transform
Load Query
21Transformation Timing and Location
- Transformation is performed
- - Before load
- - In parallel
- May be initiated at different points
Unlikely
Probable
Possible
22Choosing a Transformation Point
- Workload Network bandwidth
- Environment Parallel execution
- CPU use Load window time
- Disk space User information
- needs
23Monitoring and Tracking
- Transformations should
- Be self-documenting
- Provides summary statistics
- Handle process exceptions
24Designing Transformation Processes
- Analysis
- - Sources and target mappings, business
- rules
- - Key users, metadata, grain
- Design options PL/SQL, replication, custom,
third-party tools - Design issues
- - Performance
- - Size of the staging area
- - Exception handling, integrity maintenance
25Transformation Tools
- Purchased
- SQLLoader
- In-house developed
26Data Management, Quality, and Auditing Tools
- Data management
- - Innovative Systems
- - Postalsoft
- - Vality Technology
- Data quality and auditing
- - Innovative Systems
- - Vality Technology
27Summary
- This lesson discussed the following topics
- Importance of data quality
- Transformation processes
- Data transformation issuess
- Data anomalies
- Name and address management
- Tools