Title: Data Quality Issues ... by Example
1 Data Quality Issues ... by Example
Wolfgang Lehner
Dresden University of Technology
Faculty of Computer Science
Database Technology Group (SyA)
Dagstuhl Seminar on Data Quality
2 my personal background
- Statistical and Scientific Databases
  - very long tradition (1st Berkeley Workshop in 1981)
  - modeling perspective
    - come up with statistical data models
    - capture semantics of micro and macro data
  - processing perspective
    - "workflow": collection, preprocessing, analysis
    - provide database technology for efficient data analysis
  - a small subset of SSDB techniques is used by Data Warehouse systems
3 Example 1: Processing Perspective of DQ
Non-Food Tracking: retail panel-based marketing information for manufacturers and retailers in consumer technology industries
GfK Group
- Monitoring the global markets for consumer goods
  - Which are the top-ranking brands of mobile phones at present?
  - How large is the market share of digital television sets?
  - How much do consumers spend on computers on average?
- periodical monitoring
4 Working Areas
Data - IN
Data - Preparation
MDM
IDAS
Data Warehouse (Extrapolation, Reports)
DWH
Creating value through knowledge
5 Data and Control Flow from 10,000 feet
[Figure: data flow and control flow between the components DWH2Tools, GIMWinCosSeparation, WebTAS, MDM (items, shops), DWH Suite, DWH Explorer, DWH Extrapolation/Projection, DWH Builder, IDAS InfoSystem, Preprocessing, Fact Tool, IDAS Output Pool, DWH Loader, IDAS2DWH, IDAS2MF]
6 Data and Control Flow: more detailed
7 The SIMPLE business problem
local terms per shop
PRODUCTS
shops
data delivery
SHOPS
global reporting terms
product groups
8 Issues for Data Quality Discussion
- Observation
  - subsequent production steps depend on the success and DQ of the result set produced by preceding steps (simplest form: count() > x; in most cases: expert knowledge)
  - data context (more technically: primary key) changes (very hard to trace outliers at the end back to the incoming data object)
  - DQ determines the production process, e.g. TODO lists of > 100 workers
- DQ is key factor for Production Optimization
  - impact on demand-driven production
    - Example: need report of cell phone sales by end of next week
    - report quality depends on type of customer: premium customer → higher data quality
    - → identify important data providers, prioritize single jobs/data orders/
    - → propagate individual deadline to participating working steps
  - impact on error correction
    - more raw data material: manual article identification (extremely expensive)
    - better raw data material: manually "correct" data, re-order tracking data
- → data lineage is a critical issue (fine-granular data quality causes data explosion!!!)
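
The two observations on this slide (a step may only run if the preceding result set passes a check such as count() > x, and outliers are hard to trace once the key changes) can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the GfK production system: `Record`, `run_step`, `QualityGateError`, the `shop/group/item` key layout and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    key: str      # technical primary key at the current step
    value: float
    # Lineage: keys of the incoming raw objects this record was derived from.
    sources: frozenset = field(default_factory=frozenset)


class QualityGateError(Exception):
    """Raised when a step's result set fails its data quality check."""


def run_step(name, records, transform, min_count):
    """Apply `transform`, then stop the pipeline unless count > min_count."""
    result = transform(records)
    if not len(result) > min_count:
        raise QualityGateError(f"{name}: {len(result)} records, need > {min_count}")
    return result


def ingest(raw):
    # Each delivered line becomes a record that remembers its own origin.
    return [Record(key=k, value=v, sources=frozenset([k])) for k, v in raw]


def aggregate(records):
    # The key changes here (shop-level item -> product group), so the
    # carried lineage is the only way back to the incoming objects.
    groups = {}
    for r in records:
        group = r.key.split("/")[1]  # assumed key layout: shop/group/item
        total, srcs = groups.get(group, (0.0, frozenset()))
        groups[group] = (total + r.value, srcs | r.sources)
    return [Record(key=g, value=t, sources=s) for g, (t, s) in sorted(groups.items())]


raw = [("shopA/phones/1", 10.0), ("shopB/phones/7", 5.0), ("shopA/tv/3", 2.0)]
step1 = run_step("ingest", raw, ingest, min_count=2)
step2 = run_step("aggregate", step1, aggregate, min_count=1)
for rec in step2:
    print(rec.key, rec.value, sorted(rec.sources))
```

Carrying the full source set on every record is exactly the "data explosion" the slide warns about: lineage grows with every aggregation, so a real system would have to decide at which granularity to keep it.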
9 Example 2: Integration Perspective of DQ
- Perform Analysis across different Data Sources
  - What are similar sub-sequences of amino acid residues?
  - What are stable/typical conformations?
- Current Situation
  - independent (mostly non-relational) data sources
  - no integrity constraints (within/across) different data sources
- DQ is key factor for integrated data access
  - problems beyond "regular" integration issues
  - data sets are growing
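
Because no integrity constraints hold across the independent sources, any integrated access has to check references itself. The following Python sketch shows one way such a check could look. The function `check_cross_source`, the field names and the sample identifiers are assumptions for illustration and do not correspond to the actual formats of any protein database.

```python
def check_cross_source(sequences, structures):
    """Report cross-source violations that neither source enforces on its own.

    sequences:  dict mapping protein id -> amino acid sequence (source 1)
    structures: list of dicts with a 'protein_id' reference (source 2)
    """
    seq_ids = set(sequences)
    referenced = {s["protein_id"] for s in structures}
    return {
        # dangling references: structure points to an unknown protein
        "structures_without_sequence": sorted(referenced - seq_ids),
        # coverage gaps: protein has no structure entry at all
        "sequences_without_structure": sorted(seq_ids - referenced),
    }


# Hypothetical sample data for two independent sources.
sequences = {"P1": "MKTAYIAK", "P2": "GSHMLEDP"}
structures = [
    {"structure_id": "S1", "protein_id": "P1"},
    {"structure_id": "S2", "protein_id": "P9"},
]
report = check_cross_source(sequences, structures)
print(report)
```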
10 Example: Protein Structure
→ the happy day scenario
Protein Code 1:N Atom Positions
→ the real world scenario
11 Issues for Data Quality Discussion
- Observation
  - in theory: "nice" entities and relationships; in practice: many exceptions due to the experimental nature of the data collection process
  - have to relax the schema constraints → bad "data schema quality"
- Data Schema Quality
  - impact on data analysis: the statistical analysis process requires constraints as a guideline for data exploration (e.g. dimensional structures in OLAP)
  - need outlier management at the conceptual level → schema exceptions
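
One possible reading of "schema exceptions" is to keep the strict constraints for analysis and record violating objects separately, with the constraints they break, instead of relaxing the schema for everyone. The Python sketch below illustrates that idea only. The constraint names, the `alt_loc` field and the sample atoms are hypothetical and not taken from the talk.

```python
def partition_by_constraints(records, constraints):
    """Split records into conforming ones and explicit schema exceptions."""
    conforming, exceptions = [], []
    for rec in records:
        violated = [name for name, check in constraints.items() if not check(rec)]
        if violated:
            exceptions.append((rec, violated))  # keep the object and the reason
        else:
            conforming.append(rec)
    return conforming, exceptions


# "Happy day" assumptions about atom positions, written as named constraints.
constraints = {
    "has_position": lambda a: a["xyz"] is not None and len(a["xyz"]) == 3,
    "single_location": lambda a: a["alt_loc"] is None,
}

# Hypothetical atoms: one clean, one without coordinates, one with an
# alternate location recorded by the experiment.
atoms = [
    {"protein": "X1", "serial": 1, "xyz": (1.0, 2.0, 3.0), "alt_loc": None},
    {"protein": "X1", "serial": 2, "xyz": None, "alt_loc": None},
    {"protein": "X1", "serial": 3, "xyz": (0.0, 0.5, 1.5), "alt_loc": "B"},
]
conforming, exceptions = partition_by_constraints(atoms, constraints)
for rec, violated in exceptions:
    print(rec["serial"], violated)
```

Analysis tools can then rely on the constraints for the conforming set, while the exceptions stay available for inspection rather than being silently dropped.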
12 Cooperative Conceptual Database Design
- simple integration scenario of SwissPROT and Protein Data Bank (PDB)
- SwissPROT as WikiWiki core
13 Cooperative Database: Voting Concept
- Data Quality
- key issue in statistically analyzing huge data
sets - data analysis meanscomplex often DQ driven -
process of transforming micro into macro data - Solution
- a) no size fits it all ...
- b) need a general framework considering multiple
aspects - statistical metadata
- data lineage (most critical from my perspective)
- users with expert knowledge (voting concept?)