Title: Advanced Database Systems F24DS2 / F29AT2
Data Quality and Data Cleaning 1
Acknowledgements
- I adapted this material from various sources, most notably:
  - a presentation called "Data Quality and Data Cleaning: An Overview" by Tamraparni Dasu and Theodore Johnson, of AT&T Labs;
  - a paper called "Data Cleaning: Problems and Current Approaches", by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany.
- My thanks to these researchers for making their materials freely available online.
On Data Quality
- Suppose you have a database sitting in front of you, and I ask: "Is it a good-quality database?"
- What is your answer? What does quality depend on?
- Note: this is about the data themselves, not the system in use to access them.
A Conventional Definition of Data Quality
- Good-quality data are:
  - Accurate, Complete, Unique,
  - Up-to-date, and Consistent
- meaning...
A Conventional Definition of Data Quality, continued
- Accurate: this refers to how the data were recorded in the first place. What might be the inaccurately recorded datum in the following table?

Surname   First name   Age   Subject   Degree   Gender
Barratt   John         22    Maths     BSc      Male
Burns     Robert       24    CS        BSc      Male
Carter    Laura        20    Physics   MSc      Female
Davies    Michael      12    CS        BSc      Male
A Conventional Definition of Data Quality, continued
- Complete: this refers to whether or not the database really contains everything it is supposed to contain.
- E.g. a patient's medical records should contain references to all medication prescribed to date for that patient.
- The BBC Licensing DB should contain an entry for every address in the country. Does it?
A Conventional Definition of Data Quality, continued
- Unique: every separate datum appears only once. How many data quality errors can you find in the following table, and what types are they? (A rough automated check is sketched after the table.)

Surname   First name   DoB        Driving test passed
Smith     J.           17/12/85   17/12/05
Smith     Jack         17/12/85   17/12/2005
Smith     Jock         17/12/95   17/12/2005
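The first two rows are almost certainly the same person recorded twice. As a rough illustration of how such near-duplicates can be flagged automatically, here is a minimal Python sketch using only the standard library; the records mirror the table above and the similarity threshold is an arbitrary illustrative choice, not a standard:

```python
from difflib import SequenceMatcher

# Illustrative records mirroring the table above (assumed data).
records = [
    "Smith J. 17/12/85 17/12/05",
    "Smith Jack 17/12/85 17/12/2005",
    "Smith Jock 17/12/95 17/12/2005",
]

# Compare every pair of rows; a high similarity score suggests a
# possible duplicate that a human should then inspect.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = SequenceMatcher(None, records[i].lower(),
                                records[j].lower()).ratio()
        if score > 0.8:  # threshold is an arbitrary illustrative choice
            print(f"Rows {i} and {j} may be duplicates (score {score:.2f})")
```

Note that this flags all three pairs, including Jack/Jock, who may be twins rather than a duplicate entry. That is exactly why uniqueness is hard to measure, a point we return to below.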
A Conventional Definition of Data Quality, continued
- Up-to-date: the data are kept up to date.
- The Post Office has just changed my postcode from EX6 8RA to EX6 8NU. Why does this make it difficult for me to get a sensible quote for home insurance or car insurance?
- Can you think of a DB where it doesn't matter whether or not the data are kept up to date?
A Conventional Definition of Data Quality, continued
- Consistent: the data contain no logical errors or impossibilities; they make sense in and of themselves.
- Why is the following mini DB inconsistent? (See the check sketched below.)

Date       Sales    Returns   Net income
23rd Nov   25,609   1,003     24,506
24th Nov   26,202   1,601     24,601
25th Nov   28,936   1,178     25,758
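If Net income is meant to equal Sales minus Returns, the first and third rows fail: 25,609 - 1,003 = 24,606, not 24,506, and 28,936 - 1,178 = 27,758, not 25,758. A minimal Python sketch of such a cross-field consistency check, assuming that business rule:

```python
# Rows from the mini DB above: (date, sales, returns, net_income).
rows = [
    ("23rd Nov", 25_609, 1_003, 24_506),
    ("24th Nov", 26_202, 1_601, 24_601),
    ("25th Nov", 28_936, 1_178, 25_758),
]

# Assumed business rule: net income = sales - returns.
for date, sales, returns, net in rows:
    expected = sales - returns
    if net != expected:
        print(f"{date}: net income {net:,} but sales - returns = {expected:,}")
```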
The Several Problems with the Conventional (or any?) Definition
- Fine, we can define Data Quality (DQ) in a way that makes it clear what kinds of issues we are thinking about. But is the definition any use beyond that? Can it be used:
  - to measure DQ, so that we can say whether or not one DB is better than another, or so we can gauge improvements in DQ over time?
  - on a wide range of different DBs, of widely different purposes, sizes, etc.?
No: four problems
- There are at least four specific problems with this definition if we are going to use it for any purpose other than just discussing DQ issues:
  - measurability,
  - rigidity,
  - context independence,
  - unhelpfulness.
DQ Def problems: measurability
- It is not clear how we might measure quality according to some of these items.
- Completeness: how will we know? (A partial measure is sketched below.)
- Uniqueness: it is hard to tell whether two entries are merely similar, or duplicates!
- Up-to-date-ness: how do we know?
- Consistency: consistency errors can be very hard to find, especially in a very large DB.
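One thing that can at least be computed is field-level completeness: the fraction of values that are present at all. A minimal sketch, with invented records and field names; note that a high score only bounds quality, since fully populated fields tell us nothing about records missing from the DB entirely:

```python
# Illustrative records with missing values (None), invented for this example.
people = [
    {"surname": "Smith", "dob": "17/12/85", "postcode": None},
    {"surname": "Jones", "dob": None,       "postcode": "EX6 8NU"},
    {"surname": "Patel", "dob": "03/04/90", "postcode": "EH1 2AB"},
]

# Completeness per field: fraction of records where the field is present.
for field in ("surname", "dob", "postcode"):
    present = sum(1 for rec in people if rec[field] is not None)
    print(f"{field}: {present / len(people):.0%} complete")
```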
DQ Def problems: rigidity
- The conventional definition is rigid and static.
- It is easy to think of two DBs that are both high quality according to this definition, but are nevertheless of entirely different quality.
- E.g. one may contain much more potentially relevant information; one may have better metadata; etc.
DQ Def problems: context independence
- The conventional definition does not take into account the context of the DB. Different things may be important for quality in different circumstances.
- E.g. a DB for public access must be highly accessible and interpretable; in a DB of experimental results, accuracy is more important than consistency (why?); etc.
DQ Def problems: unhelpfulness/vagueness
- Partly because we need different DQ measures for different DBs, depending on context, the conventional definition alone is vague and unhelpful when it comes to using it to provide clues for how to improve data quality.
- How do we improve accuracy? It depends on how the data are recorded.
- How do we improve consistency? It depends immensely on the kind of data we have, etc.
Towards modern definitions of DQ
- An ideal definition of data quality:
  - is backed up by a good understanding of how and why errors occur;
  - relates to how the data are used;
  - is helpful in leading to improvements in processes;
  - is helpful in leading to usable measures/metrics.
The Data Quality Continuum
- It's rare that a datum is entered once into a DB and then left alone. Usually, a datum has a long and varied life, in which errors can arise at each and every stage. The continuum is:
  - data gathering
  - data delivery
  - data storage
  - data integration
  - data retrieval
  - data analysis
- So, if we want to monitor DQ, we need to monitor it at each of these stages.
DQ Continuum Example
- This is an example I am familiar with, which helps to illustrate the DQ continuum.
- The International Seismological Centre (ISC) is in Thatcham, in Berkshire. It's a charity funded by various governments. Its role is to be the repository for recording all earthquake events on the planet.
DQ Continuum ISC example: gathering

[Diagram: local data gathering centres feeding readings to the ISC]
DQ Continuum ISC example: delivery
- Raw seismograph data travel from local collection points to data gathering centres.
- Data reach the ISC by email or ftp. Some centres provide raw data, some provide interpreted data (e.g. a centre may not send some data if it believes them to be in error in the first place).
DQ Continuum ISC example: integration
- The ISC's role is actually to figure out where and when the Earth tremors occurred (there are hundreds per month) by reverse engineering from seismograph readings. They integrate the raw data and attempt to do this, largely by hand and brain, and record their findings on archival CDs.
DQ Continuum ISC example: retrieval/analysis
- You can get a CD from the ISC anytime, for the earth tremor activity on any particular day.
- I'm not sure whether you can get the raw data from them.
- Naturally, you can analyse the data and see if you can find inconsistencies or errors.
The ISC DQ Continuum
- Where might errors occur: of accuracy? Completeness? Uniqueness? Timeliness? Consistency? What else is important in this case?
Where DQ problems occur (gathering)
- Manual data entry (how can we improve this? One common answer is to validate at the point of entry; see the sketch after this list.)
- Lack of uniform standards for format and content
- Duplicates arising from parallel entry
- Approximations, alternatives, and entries altered in order to cope with s/w and/or h/w constraints
- Measurement errors
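A routine defence against manual-entry errors is to validate each record as it is typed in. A minimal sketch, with made-up field rules; the ranges, formats, and field names here are illustrative assumptions, not a standard:

```python
import re

# Illustrative validation rules for a student record, assumed for
# demonstration; real rules come from the domain and the schema.
def validate(record):
    errors = []
    if not 16 <= record.get("age", -1) <= 100:
        errors.append(f"age {record.get('age')} outside plausible range 16-100")
    if not re.fullmatch(r"\d{2}/\d{2}/\d{4}", record.get("dob", "")):
        errors.append("dob not in DD/MM/YYYY format")
    if record.get("degree") not in {"BSc", "MSc", "PhD"}:
        errors.append(f"unknown degree {record.get('degree')!r}")
    return errors

# The implausible age 12 would have been caught at entry time
# (cf. the 'Davies' row in the accuracy example earlier).
print(validate({"age": 12, "dob": "17/12/85", "degree": "BSc"}))
```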
Where DQ problems occur (delivery)
- Multiple hops from source to DB: problems can happen anywhere
- Inappropriate pre-processing (e.g. removing some small seismograph readings before sending on to the ISC; rounding up or down when the destination needs more accurate data)
- Transmission problems: buffer overflows, missing checks (did all files arrive, and all correctly? One standard answer is sketched below.)
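A standard way to answer "did all files arrive, and all correctly?" is to ship a checksum alongside each file and recompute it on arrival. A minimal sketch using Python's standard hashlib; the file names and digests in the manifest are placeholders:

```python
import hashlib

def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading in chunks so
    large seismograph files do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# 'manifest' would be sent by the source alongside the files
# (placeholder names and digests for illustration).
manifest = {"readings_jan.dat": "ab12...", "readings_feb.dat": "cd34..."}

for name, expected in manifest.items():
    try:
        ok = sha256_of(name) == expected
    except FileNotFoundError:
        ok = False  # the file never arrived at all
    print(name, "OK" if ok else "MISSING OR CORRUPTED")
```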
Where DQ problems occur (storage)
- Poor, out-of-date, or inappropriate metadata
- Missing timestamps
- Conversion to storage format (e.g. to Excel files, or to higher/lower precision; see the sketch below)
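Converting values to a lower-precision storage format silently changes them. A small illustration of what storing a 64-bit float in a 32-bit field does, using only Python's standard struct module:

```python
import struct

original = 25609.000001  # a 64-bit Python float
# Round-trip through a 32-bit float, as a narrower storage format would do.
stored = struct.unpack("f", struct.pack("f", original))[0]

print(original)             # 25609.000001
print(stored)               # 25609.0 -- the fractional part is silently lost
print(original == stored)   # False: the stored value no longer matches
```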
Where DQ problems occur (integration)
- This is the business of combining datasets: e.g. from different parts of a company, from (previously) different companies following an acquisition, from different government agencies, etc.
- Different keys, different fields, different formats (see the sketch below)
- Different definitions (customer, income, ...)
- Sociological factors: reluctance to share!
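To make "different keys, different fields, different formats" concrete, here is a minimal sketch of mapping two sources onto one shared schema before merging; the field names, values, and format rules are invented for illustration:

```python
# Two sources describing the same entities with different field
# names, key formats, and date formats (invented examples).
source_a = [{"cust_id": 17, "name": "L. Carter", "joined": "2005-12-17"}]
source_b = [{"CustomerNo": "0017", "FullName": "Laura Carter",
             "start_date": "17/12/2005"}]

def from_a(rec):
    return {"id": int(rec["cust_id"]), "name": rec["name"],
            "joined": rec["joined"]}                 # already ISO format

def from_b(rec):
    d, m, y = rec["start_date"].split("/")
    return {"id": int(rec["CustomerNo"]), "name": rec["FullName"],
            "joined": f"{y}-{m}-{d}"}                # convert to ISO format

merged = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
print(merged)
# Both records now share a key and a format -- but they may well be the
# *same* customer, so a duplicate-detection pass must follow the merge.
```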
Where DQ problems occur (retrieval/analysis)
- The problem here is usually the quality of DBs that store the retrieved data, or the use of the retrieved data in general. Problems arise because:
  - the source DB is not properly understood!
  - of straightforward mistakes in the queries that retrieve the relevant data.
- E.g. a database of genes contains entries that indicate whether or not each gene has a known or suspected link with cancer. A retrieval/analysis task leads to publishing a list of genes that are not relevant to cancer. What is the problem here? (The sketch below illustrates one reading.)
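One plausible reading of the problem: a gene with no *known* link is not the same as a gene known to have *no* link, so a naive "not linked" query conflates "unknown" with "no". Sketched in Python with invented data:

```python
# Invented gene records: cancer_link is True (known/suspected link),
# False (studied, no link found), or None (never studied at all).
genes = [
    {"name": "GENE_A", "cancer_link": True},
    {"name": "GENE_B", "cancer_link": False},
    {"name": "GENE_C", "cancer_link": None},   # simply never investigated
]

# Naive query: "genes not relevant to cancer".
naive = [g["name"] for g in genes if not g["cancer_link"]]
print(naive)    # ['GENE_B', 'GENE_C'] -- wrongly includes the unstudied gene

# Careful query: only genes positively known to have no link.
careful = [g["name"] for g in genes if g["cancer_link"] is False]
print(careful)  # ['GENE_B']
```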
What this lecture was about
- A conventional definition of DQ
- Problems with that definition
- The DQ Continuum
- Where errors might occur along that continuum