Data Quality Issues ... by Example

1
Data Quality Issues... by Example
Wolfgang Lehner
Dresden University of Technology
Faculty of Computer Science
Database Technology Group (SyA)
Dagstuhl Seminar on Data Quality
2
my personal background
  • Statistical and Scientific Databases
  • very long tradition (1st Berkeley Workshop in
    1981)
  • modeling perspective
  • come up with statistical data models
  • capture semantics of micro and macro data
  • processing perspective
  • "workflow": collection, preprocessing, analysis
  • provide database technology for efficient data
    analysis
  • a small subset of SSDB techniques is used by
    data warehouse systems

3
Example 1: Processing Perspective of DQ
Non-Food Tracking: retail panel-based marketing
information for manufacturers and retailers in
consumer technology industries
GfK Group
  • Monitoring the global markets for consumer goods
  • Which are the top-ranking brands of mobile phones
    at present?
  • How large is the market share of digital
    television sets?
  • How much do consumers spend on computers on average?

periodical monitoring
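
The monitoring questions above (brand ranking, market share, average spend) all turn single sales records into aggregated figures. The following is a minimal sketch of that step, not GfK's actual method. The record layout, the brand names, and all numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical panel records: (brand, units_sold, revenue). All values are made up.
sales = [
    ("BrandA", 120, 24000.0),
    ("BrandB", 80, 20000.0),
    ("BrandA", 30, 6000.0),
    ("BrandC", 20, 10000.0),
]

def brand_ranking(records):
    """Aggregate single sales records into per-brand units and market share."""
    units = defaultdict(int)
    for brand, n, _revenue in records:
        units[brand] += n
    total = sum(units.values())
    ranked = sorted(units.items(), key=lambda kv: kv[1], reverse=True)
    # Each entry: (brand, units sold, share of all units sold)
    return [(brand, n, n / total) for brand, n in ranked]

def average_price(records):
    """Average revenue per unit sold over all records."""
    total_units = sum(n for _, n, _ in records)
    total_revenue = sum(r for _, _, r in records)
    return total_revenue / total_units

ranking = brand_ranking(sales)
avg = average_price(sales)
```

Running the same aggregation for every reporting period is what makes the monitoring "periodical".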
4
Working Areas
Data - IN
Data - Preparation
MDM
IDAS
Data Warehouse (Extrapolation, Reports)
DWH
Creating value through knowledge
5
Data and Control Flow from 10,000 feet
[Diagram labels: DWH2Tools; GIMWinCosSeparation; WebTAS; MDM (items, shops); DWH Suite; DWH Explorer; DWH Extrapolation/Projection; DWH Builder; IDAS InfoSystem; Preprocessing; Fact Tool; IDAS Output Pool; DWHLoader; IDAS2DWH; IDAS2MF. Legend: data flow, control flow.]
6
Data and Control Flow, more detailed
7
The SIMPLE business problem
local terms per shop
PRODUCTS
shops
data delivery
SHOPS
global reporting terms
product groups
8
Issues for Data Quality Discussion
  • Observation
  • subsequent production steps depend on the success
    and DQ of the result set produced by preceding
    steps (simplest form: count() > x; in most cases:
    expert knowledge)
  • data context (more technically: primary key)
    changes (very hard to trace outliers at the end
    back to the incoming data object)
  • DQ determines production process, e.g. TODO lists
    of > 100 workers
  • DQ is key factor for Production Optimization
  • impact on demand-driven production
  • Example: need report of cell phone sales by end
    of next week
  • report quality depends on type of customer:
    premium customer -> higher data quality
  • -> identify important data providers, prioritize
    single jobs/data orders/
  • -> propagate individual deadline to participating
    working steps
  • impact on error correction
  • more raw data material: manual article
    identification (extremely expensive)
  • better raw data material: manually "correct"
    data, re-order tracking data
  • -> data lineage is a critical issue (fine-granular
    data quality causes data explosion!!!)
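
The first observation, that a step may only start once the preceding result set passes a check such as count() > x, can be sketched as a gate between pipeline steps. This is a minimal sketch under stated assumptions: the step names, the record fields, and the thresholds are hypothetical, and a real gate would usually encode expert knowledge rather than a row count.

```python
class DataQualityError(Exception):
    """Raised when the result of a step fails its quality gate."""

def count_gate(step_name, rows, min_rows):
    # Simplest form of a gate: the step must deliver more than min_rows rows.
    if len(rows) <= min_rows:
        raise DataQualityError(
            f"{step_name}: expected more than {min_rows} rows, got {len(rows)}"
        )
    return rows

def run_pipeline(rows, steps):
    """Run steps in order; each step is (name, function, min_rows)."""
    for name, func, min_rows in steps:
        rows = count_gate(name, func(rows), min_rows)
    return rows

# Hypothetical steps, loosely named after the working areas on the slides.
def drop_incomplete(rows):
    return [r for r in rows if r.get("price") is not None]

def to_product_group(rows):
    return [{**r, "group": r["item"].split("-")[0]} for r in rows]

STEPS = [
    ("preprocessing", drop_incomplete, 1),
    ("identification", to_product_group, 1),
]

delivery = [
    {"item": "phone-1", "price": 100.0},
    {"item": "phone-2", "price": 120.0},
    {"item": "tv-1", "price": None},
]
result = run_pipeline(delivery, STEPS)
```

A failed gate stops the pipeline before later steps consume a poor result set, which is where the TODO lists for manual correction would be generated.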

9
Example 2: Integration Perspective of DQ
  • Perform analysis across different data sources
  • What are similar subsequences of amino acid
    residues?
  • What are stable/typical conformations?
  • Current Situation
  • independent (mostly non-relational) data sources
  • no integrity constraints (within/across
    different data sources)
  • DQ is key factor for integrated data access
  • problems beyond "regular" integration issues

data sets are growing
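
When independent sources declare no integrity constraints, such constraints have to be checked after the fact. The sketch below illustrates a within-source check (duplicate keys) and an across-source check (dangling references). The field names `accession` and `structure_id` and all identifiers are hypothetical, not taken from any real database schema.

```python
from collections import Counter

def duplicate_keys(records, key):
    """Within one source: key values that occur more than once."""
    counts = Counter(r[key] for r in records)
    return [value for value, n in counts.items() if n > 1]

def dangling_references(structures, sequences):
    """Across sources: structure entries whose accession is unknown in the sequence source."""
    known = {s["accession"] for s in sequences}
    return [st["structure_id"] for st in structures if st.get("accession") not in known]

# Hypothetical entries of a sequence source and a structure source.
sequences = [{"accession": "P1"}, {"accession": "P2"}, {"accession": "P2"}]
structures = [
    {"structure_id": "S1", "accession": "P1"},
    {"structure_id": "S2", "accession": "P9"},   # refers to an unknown sequence
    {"structure_id": "S3", "accession": None},   # no reference at all
]

dups = duplicate_keys(sequences, "accession")
dangling = dangling_references(structures, sequences)
```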
10
Example: Protein Structure
-> the happy day scenario
1:N
Protein Code
Atom Positions
-> the real world scenario
11
Issues for Data Quality Discussion
  • Observation
  • in theory "nice" entities and relationships; in
    practice many exceptions due to the experimental
    nature of the data collection process
  • have to relax the schema constraints -> bad "data
    schema quality"
  • Data Schema Quality
  • impact on data analysis: the statistical analysis
    process requires constraints as a guideline for
    data exploration (e.g. dimensional structures in
    OLAP)
  • need outlier management at the conceptual level ->
    schema exceptions
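
One way to read "schema exceptions" is to keep the strict schema for analysis and record violating entries separately, with a reason, instead of relaxing the constraints for everyone. The following is only a sketch of that idea. The record layout and the three exception kinds are assumptions made for illustration, not the design presented in the talk.

```python
def partition_by_schema(atoms):
    """Split atom records into schema-conforming ones and explicit schema exceptions.

    Assumed constraints: every atom has a protein code, has a position,
    and has only one position per (protein, atom_id).
    """
    seen = set()
    conforming, exceptions = [], []
    for atom in atoms:
        key = (atom.get("protein"), atom.get("atom_id"))
        if atom.get("protein") is None:
            exceptions.append((atom, "missing protein code"))
        elif atom.get("xyz") is None:
            exceptions.append((atom, "missing position"))
        elif key in seen:
            exceptions.append((atom, "additional position for atom"))
        else:
            seen.add(key)
            conforming.append(atom)
    return conforming, exceptions

# Hypothetical records mixing the "happy day" case with real-world deviations.
atoms = [
    {"protein": "X1", "atom_id": 1, "xyz": (0.0, 0.0, 0.0)},
    {"protein": "X1", "atom_id": 1, "xyz": (0.1, 0.0, 0.0)},
    {"protein": "X1", "atom_id": 2, "xyz": None},
    {"protein": None, "atom_id": 3, "xyz": (1.0, 1.0, 1.0)},
]
conforming, exceptions = partition_by_schema(atoms)
reasons = [reason for _, reason in exceptions]
```

The conforming part can then feed constraint-dependent analysis, while the exception list stays available for inspection.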

12
Cooperative Conceptual Database Design
  • simple integration scenario of SwissPROT and
    Protein Data Bank (PDB)
  • SwissPROT as WikiWiki core

13
Cooperative Database Voting Concept
14
Summary and Conclusion
  • Data Quality
  • key issue in statistically analyzing huge data
    sets
  • data analysis means a complex, often DQ-driven,
    process of transforming micro into macro data
  • Solution
  • a) one size does not fit all ...
  • b) need a general framework considering multiple
    aspects
  • statistical metadata
  • data lineage (most critical from my perspective)
  • users with expert knowledge (voting concept?)
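
To illustrate why data lineage matters when micro data is transformed into macro data, the sketch below keeps, for every aggregated value, the identifiers of the records that contributed to it. The record fields and values are hypothetical. This is one simple way to carry lineage, not the approach proposed in the talk.

```python
from collections import defaultdict

def aggregate_with_lineage(records):
    """Sum units per group and remember which input records fed each total."""
    totals = defaultdict(float)
    lineage = defaultdict(list)
    for rec in records:
        totals[rec["group"]] += rec["units"]
        lineage[rec["group"]].append(rec["id"])
    return dict(totals), dict(lineage)

# Hypothetical micro data; the last record is a suspicious outlier.
records = [
    {"id": "r1", "group": "phones", "units": 10},
    {"id": "r2", "group": "phones", "units": 5},
    {"id": "r3", "group": "tv", "units": 900},
]
totals, lineage = aggregate_with_lineage(records)
suspect_sources = lineage["tv"]  # trace the suspicious total back to its inputs
```

The lineage structure holds one entry per input record, so tracking it at this granularity grows with the raw data, which is the "data explosion" concern raised earlier.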