Title: 615644 Data Warehousing
615-644 Data Warehousing
Week 5 Understanding Data Quality
Data Quality
- Data quality problems are widespread in practice and have significant economic impacts
- Databases have significant error rates
- Between 1 and 10 percent of data in critical organisational databases are estimated to be inaccurate (Klein, Goodhue and Davis, 1996)
- About 500,000 dead people were found to be registered for Medicare benefits because of inaccurate data (The Australian, March 15, 2005)
Data Quality: Spot the Errors
Customer
Name          Suburb     Postcode  Gender  Credit  Valid Until Date
Bill Smith    Caulfield  3145      M       10,000  31/12/99
ACME Pty Ltd  Malvern    3286      D       1f,000  31/10/99
Mary Whiete   Caulfield  5000      F       3145    30/12/99
              Richmondd  2467      F       2,000   28/2/99
Bill Smyth
Understanding Data Quality
- List of desirable dimensions
- Accurate, complete, timely (eg Redman)
- Expert opinion
- trust me! (eg English)
- Empirical
- Survey data practitioners (eg Wang and Strong)
- Theoretical
- Rigorous (eg Wand and Wang)
English (1999) Framework
- Inherent
- Definition conformance, validity of business rule conformance, completeness (of values), accuracy (to surrogate source) ...
- Pragmatic
- Timeliness, contextual clarity, derivation integrity, usability ...
Wang and Strong (1996) Framework
- Surveyed data consumers about their opinions on data quality factors
- Four groups of data quality factors resulted
- intrinsic (accuracy, reputation ...)
- contextual (timeliness, completeness ...)
- representational (concise, understandable ...)
- accessibility (accessible, security)
Kahn et al. (1997) Framework
Product Quality
- Conforms to Specifications: Sound Information (complete, timely, believable, consistent)
- Meets or Exceeds Customer Expectations: Useful Information (objective, relevant, understandable, reputation)
Service Quality
- Conforms to Specifications: Usable Information (secure, timely, concise, accessible, consistent)
- Meets or Exceeds Customer Expectations: Effective Information (value-added, appropriate amount)
Wand and Wang (1996) Framework
- Theoretical: based on Bunge's ontology
Complete, Unambiguous, Meaningful, Non-redundant, Correct
[Figure: Database and Real World]
Limitations of Existing Approaches
- Quality dimensions are vaguely defined, overlapping, ambiguous
- Limited rigor
- Varying scope
Semiotic Information Quality Framework
- Must be soundly based in theory (rigorous)
- Semiotics, ontology, service quality
- Must be usable and useful (relevant)
- Must include different perspectives (scope)
- Product view (of stored data)
- Service view (of received information)
- Must be clearly structured
- Coherent categories and criteria
Steps in Developing the Framework
- Use semiotic theory to
- Define data quality categories
- Determine criteria derivation method per category
- Integrate theoretical and empirical research approaches
- Integrate objective and subjective IQ views
- IT practitioner and academic focus groups
- To refine the framework
Semiotics (Peirce and Morris)
- Three Components
- sign: actual representation (perceivable)
- referent: intended meaning, represented phenomenon
- interpretation: received meaning, use of sign
Defining Data Quality Categories
Theory (semiotic level)
Application (IS equivalent)
Data Quality - Syntactic level (rule conformance)
- Conforming to metadata
- Data must conform to data integrity rules
- Data quality can be checked using computer-based tools
- Use integrity theory (e.g. relational)
- Domain integrity, Entity integrity, Referential integrity, Application specific integrity rules
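Because syntactic quality is rule conformance, the integrity rules listed above can be checked mechanically. The following is a minimal Python sketch, not a definitive tool. The records, the field names, the allowed gender codes and the four-digit postcode rule are all illustrative assumptions rather than rules stated in the lecture.

```python
import re

# Hypothetical records; field names and values are illustrative only.
customers = [
    {"id": 1, "name": "Bill Smith", "postcode": "3145", "gender": "M"},
    {"id": 2, "name": "ACME Pty Ltd", "postcode": "3286", "gender": "D"},
    {"id": 2, "name": "Mary White", "postcode": "31A5", "gender": "F"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 7},
]

ALLOWED_GENDERS = {"M", "F"}        # assumed domain for the gender attribute
POSTCODE_RE = re.compile(r"\d{4}")  # assumed rule: exactly four digits


def domain_violations(rows):
    """Domain integrity: each value must come from its attribute's domain."""
    bad = []
    for i, row in enumerate(rows):
        if row["gender"] not in ALLOWED_GENDERS:
            bad.append((i, "gender"))
        if not POSTCODE_RE.fullmatch(row["postcode"]):
            bad.append((i, "postcode"))
    return bad


def entity_violations(rows, key="id"):
    """Entity integrity: the primary key must be present and unique."""
    seen, bad = set(), []
    for i, row in enumerate(rows):
        value = row.get(key)
        if value is None or value in seen:
            bad.append(i)
        seen.add(value)
    return bad


def referential_violations(children, parents, fk="customer_id", pk="id"):
    """Referential integrity: every foreign key must match a parent key."""
    parent_keys = {p[pk] for p in parents}
    return [i for i, c in enumerate(children) if c[fk] not in parent_keys]


print(domain_violations(customers))
print(entity_violations(customers))
print(referential_violations(orders, customers))
```

Each function returns the positions of the offending rows, so the same checks could feed a report or a simple error-rate count. Application specific rules would be added as further functions of the same shape.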
Data Quality - Semantic level (external mapping)
- Mapping from real world to database
- Mapped Completely
- Mapped Unambiguously
- Mapped Correctly (Phenomena)
- Mapped Correctly (Properties)
- Mapped consistently
- Mapped meaningfully
- Use Bunge-Wand-Weber ontology theory
Data Quality - Semantic level (external mapping)
- Mapped completely
- Every external phenomenon is represented
[Figure: "Incomplete" mapping between DB and RW]
Data Quality - Semantic level (external mapping)
- Mapped unambiguously
- Each identifiable data unit represents at most one specific external phenomenon
[Figure: "Ambiguous" mapping between DB and RW]
Data Quality - Semantic level (external mapping)
- Mapped correctly (Phenomena)
- Each identifiable data unit maps to the correct external phenomenon
[Figure: "Incorrect" mapping between DB and RW]
Data Quality - Semantic level (external mapping)
- Mapped correctly (Properties)
- Non-key attribute values in an identifiable data unit match the property values for the represented external phenomenon
[Figure: "Correct" mapping; the values xxxx and yyyy appear in both DB and RW]
Data Quality - Semantic level (external mapping)
- Mapped consistently
- Each external phenomenon is represented either by one identifiable data unit or by multiple but consistent identifiable units
[Figure: "Consistent" mapping between DB and RW]
Data Quality - Semantic level (external mapping)
- Mapped meaningfully
- Each identifiable data unit represents at least one specific external phenomenon
[Figure: "Non-meaningful" mapping between DB and RW]
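The semantic criteria can be read as properties of a mapping between data units and real-world phenomena. The Python sketch below illustrates five of them under a strong assumption: that we already know which phenomena each data unit represents (the hypothetical `represents` sets) and the true property values. In practice that knowledge has to come from checking the real world, for example by sampling. All names and values are invented for illustration. Correct mapping of phenomena is not shown, because it cannot be tested when the mapping itself is taken as given.

```python
# Hypothetical real-world phenomena with their true property values.
real_world = {
    "alice": {"suburb": "Caulfield"},
    "bob": {"suburb": "Malvern"},
    "carol": {"suburb": "Richmond"},
    "dave": {"suburb": "Caulfield"},
}

# Hypothetical data units; "represents" is assumed to be known.
data_units = {
    "row1": {"represents": {"alice"}, "suburb": "Caulfield"},
    "row2": {"represents": {"bob", "carol"}, "suburb": "Malvern"},
    "row3": {"represents": set(), "suburb": "Kew"},
    "row4": {"represents": {"alice"}, "suburb": "Malvern"},
}


def unrepresented(rw, units):
    """Completeness: phenomena that no data unit represents."""
    covered = set().union(*(u["represents"] for u in units.values()))
    return sorted(set(rw) - covered)


def ambiguous(units):
    """Unambiguity: units that represent more than one phenomenon."""
    return sorted(k for k, u in units.items() if len(u["represents"]) > 1)


def non_meaningful(units):
    """Meaningfulness: units that represent no phenomenon at all."""
    return sorted(k for k, u in units.items() if not u["represents"])


def incorrect_properties(rw, units, attr="suburb"):
    """Property correctness, checked only for units with exactly one referent."""
    bad = []
    for k, u in units.items():
        if len(u["represents"]) == 1:
            (phenomenon,) = u["represents"]
            if rw[phenomenon][attr] != u[attr]:
                bad.append(k)
    return sorted(bad)


def inconsistent(units, attr="suburb"):
    """Consistency: phenomena whose data units disagree on a value."""
    values = {}
    for u in units.values():
        for phenomenon in u["represents"]:
            values.setdefault(phenomenon, set()).add(u[attr])
    return sorted(p for p, v in values.items() if len(v) > 1)


print(unrepresented(real_world, data_units))
print(ambiguous(data_units))
print(non_meaningful(data_units))
print(incorrect_properties(real_world, data_units))
print(inconsistent(data_units))
```

Unlike the syntactic checks, none of these can be run from the database alone, since each one needs some reference to the real world.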
Data Quality - Pragmatic level (user perspective)
- User perceptions of usefulness
- Subjective set of quality criteria
- Derive from previous work and focus groups
- Use service quality theory
- Compare expected with perceived actual quality assessments
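Comparing expected with perceived quality can be sketched as a gap score per criterion. The Python example below is only an illustration: the 1 to 5 rating scale, the criteria chosen and the responses are hypothetical, and the slides do not prescribe a particular scoring formula.

```python
from statistics import mean

# Hypothetical survey responses on a 1-5 scale.
# Each pair is (expected, perceived) for one respondent.
responses = {
    "accessible": [(5, 4), (4, 4), (5, 3)],
    "timely": [(4, 2), (5, 3), (4, 3)],
    "understandable": [(3, 4), (4, 4), (3, 5)],
}


def quality_gaps(survey):
    """Mean gap (perceived minus expected) per criterion.

    A negative gap means perceived quality falls short of expectations.
    """
    return {
        criterion: mean([perceived - expected for expected, perceived in pairs])
        for criterion, pairs in survey.items()
    }


gaps = quality_gaps(responses)
worst = min(gaps, key=gaps.get)  # criterion with the largest shortfall
print(gaps)
print(worst)
```

Ranking criteria by gap rather than by raw perceived score keeps the users' expectations in the assessment, which is the point of using service quality theory at this level.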
Data Quality - Pragmatic level (user perspective)
- Accessible
- Data is easy and quick to retrieve
- Timely
- The currency (age) of data is appropriate for its use
- Understandable
- Data is presented in an intelligible manner
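The timeliness criterion is relative to use: the same data can be current enough for one purpose and stale for another. A minimal Python sketch follows, assuming each use has a hypothetical maximum acceptable age. The dates and thresholds are invented for illustration.

```python
from datetime import datetime, timedelta


def is_timely(last_updated, max_age, now):
    """True if the data's age at `now` is within the age allowed for its use."""
    return now - last_updated <= max_age


# Hypothetical example: one data set, two uses with different tolerances.
now = datetime(2005, 3, 15)
last_updated = datetime(2005, 3, 1)

ok_for_monthly_report = is_timely(last_updated, timedelta(days=31), now)
ok_for_daily_dashboard = is_timely(last_updated, timedelta(days=1), now)
print(ok_for_monthly_report, ok_for_daily_dashboard)
```

The age threshold here stands in for the user's judgement, so in an assessment it would be elicited from the people who use the data rather than fixed by the developer.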
Data Quality - Pragmatic level (user perspective)
- Secure
- Data is appropriately protected from damage or abuse
- Suitably presented
- Data is presented in a manner appropriate for its use
- Flexibly presented
- Data can be easily manipulated and the presentation customised as needed
Data Quality - Pragmatic level (user perspective)
- Allowing access to relevant metadata
- Appropriate metadata is available to define, constrain and document data
- Perceptions of syntactic and semantic criteria
Use of the Framework
- Developing data quality assessment instrument
- Assessing data quality
- Defining strategies for improving data quality