1
Advanced Database Systems F24DS2 / F29AT2
  • Data Quality and Data Cleaning 1

2
Acknowledgements
  • I adapted this material from various sources,
    most notably
  • A PPT presentation called "Data Quality and Data
    Cleaning: An Overview" by Tamraparni Dasu and
    Theodore Johnson, at AT&T Labs
  • A paper called "Data Cleaning: Problems and
    Current Approaches" by Erhard Rahm and Hong Hai
    Do, University of Leipzig, Germany.
  • My thanks to these researchers for making their
    materials freely available online.

3
On Data Quality
  • Suppose you have a database sitting in front of
    you, and I ask, "Is it a good-quality database?"
  • What is your answer? What does quality depend
    on?
  • Note this is about the data themselves, not the
    system used to access them.

4
A Conventional Definition of Data Quality
  • Good quality data are
  • Accurate, Complete, Unique,
  • Up-to-date, and Consistent
  • meaning

5
A Conventional Definition of Data Quality,
continued
  • Accurate: This refers to how the data were
    recorded in the first place. What might be the
    inaccurately recorded datum in the following
    table?

Surname  Firstname  Age  Subject  Degree  Gender
Barratt  John       22   Maths    BSc     Male
Burns    Robert     24   CS       BSc     Male
Carter   Laura      20   Physics  MSc     Female
Davies   Michael    12   CS       BSc     Male
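A minimal sketch of how such an inaccuracy could be caught automatically, assuming a plausible-range rule for student ages (the 16-60 bounds are my illustrative assumption, not part of the lecture):

# Flag records whose 'age' falls outside a plausible
# range for university students.
students = [
    ("Barratt", "John", 22, "Maths", "BSc", "Male"),
    ("Burns", "Robert", 24, "CS", "BSc", "Male"),
    ("Carter", "Laura", 20, "Physics", "MSc", "Female"),
    ("Davies", "Michael", 12, "CS", "BSc", "Male"),
]

for surname, first, age, subject, degree, gender in students:
    if not 16 <= age <= 60:   # illustrative plausibility bounds
        print(f"Suspicious age {age} for {first} {surname}")
# -> Suspicious age 12 for Michael Davies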
6
A Conventional Definition of Data Quality,
continued
  • Complete: This refers to whether or not the
    database really contains everything it is
    supposed to contain.
  • E.g. a patient's medical records should contain
    references to all medication prescribed to date
    for that patient.
  • The BBC Licensing DB should contain an entry for
    every address in the country. Does it?

7
A Conventional Definition of Data Quality,
continued
  • Unique: Every separate datum appears only once.
    How many data quality errors can you find in
    the following table, and what types are they?

Surname  Firstname  DoB       Driving test passed
Smith    J.         17/12/85  17/12/05
Smith    Jack       17/12/85  17/12/2005
Smith    Jock       17/12/95  17/12/2005
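A minimal sketch of one way to hunt for such duplicates: normalise the inconsistent date formats, then flag rows that agree on surname, date of birth, and first initial. The matching rule is an illustrative assumption; real record linkage is considerably subtler:

from datetime import datetime

def norm_date(s):
    # Try two-digit years first ("17/12/85"), then four-digit
    # ("17/12/2005"); the mixed formats are themselves a DQ error.
    for fmt in ("%d/%m/%y", "%d/%m/%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    raise ValueError(f"unparseable date: {s}")

rows = [
    ("Smith", "J.",   "17/12/85", "17/12/05"),
    ("Smith", "Jack", "17/12/85", "17/12/2005"),
    ("Smith", "Jock", "17/12/95", "17/12/2005"),
]

for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        a, b = rows[i], rows[j]
        if (a[0] == b[0]                            # same surname
                and norm_date(a[2]) == norm_date(b[2])  # same DoB
                and a[1][0] == b[1][0]):            # same first initial
            print("Likely duplicates:", a, "/", b)
# -> flags the "J." and "Jack" rows; "Jock" differs in DoB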
8
A Conventional Definition of Data Quality,
continued
  • Up-to-date: The data are kept up to date.
  • The Post Office has just changed my postcode
    from EX6 8RA to EX6 8NU. Why does this make it
    difficult for me to get a sensible quote for
    home insurance or car insurance?
  • Can you think of a DB where it doesn't matter
    whether or not the data are kept up to date?

9
A Conventional Definition of Data Quality,
continued
  • Consistent: The data contain no logical errors
    or impossibilities; the DB makes sense in and of
    itself.
  • Why is the following mini DB inconsistent?

Date      Sales   Returns  Net income
23rd Nov  25,609  1,003    24,506
24th Nov  26,202  1,601    24,601
25th Nov  28,936  1,178    25,758
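A minimal sketch of an automated consistency check, assuming the intended integrity rule is Net income = Sales - Returns (my reading of the table):

rows = [
    ("23rd Nov", 25_609, 1_003, 24_506),
    ("24th Nov", 26_202, 1_601, 24_601),
    ("25th Nov", 28_936, 1_178, 25_758),
]

for date, sales, returns, net in rows:
    if sales - returns != net:   # the integrity rule under test
        print(f"{date}: expected {sales - returns:,}, recorded {net:,}")
# -> flags 23rd Nov (24,606 expected) and 25th Nov (27,758 expected)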
10
The Several Problems with the Conventional (or
any?) Definition
  • Fine, we can define Data Quality (DQ) in a way
    that makes it clear what kinds of issues we are
    thinking about. But is the definition any use
    beyond that? Can it be used
  • to measure DQ, so that we can say whether or not
    one DB is better than another, or so we can gauge
    improvements in DQ over time?
  • on a wide range of different DBs, of widely
    different purposes, sizes, etc.?

11
No: four problems
  • There are at least four specific problems with
    this definition if we are going to use it for any
    purpose other than just discussing DQ issues
  • Measurability,
  • rigidity,
  • context independence,
  • unhelpfulness

12
DQ Def problems: measurability
  • It is not clear how we might measure quality
    according to some of these items (one crude,
    measurable proxy is sketched after this list).
  • Completeness: how will we know?
  • Uniqueness: it is hard to tell whether two
    entries are merely similar, or duplicates!
  • Up-to-date-ness: how do we know?
  • Consistency: consistency errors can be very
    hard to find, especially in a very large DB.
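A minimal sketch of one such crude proxy: field-level completeness, measured as the fraction of non-missing values. The records are illustrative; None marks a missing field:

records = [
    {"name": "Smith", "dob": "1985-12-17", "postcode": None},
    {"name": "Jones", "dob": None,         "postcode": "EX6 8NU"},
]
fields = ["name", "dob", "postcode"]

# Count filled cells over all cells.
filled = sum(r[f] is not None for r in records for f in fields)
completeness = filled / (len(records) * len(fields))
print(f"Completeness: {completeness:.0%}")   # -> Completeness: 67%

Note this measures only presence, not correctness: a field can be filled and still wrong, which is exactly the measurability problem the slide describes.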

13
DQ Def problems: rigidity
  • The conventional definition is rigid and static.
  • It is easy to think of two DBs that are both high
    quality according to this definition, but are
    nevertheless of entirely different quality.
  • E.g. one may contain much more potentially
    relevant information; one may have better
    metadata; etc.

14
DQ Def problems: context independence
  • The conventional definition does not take into
    account the context of the DB. Different things
    may be important for quality in different
    circumstances.
  • E.g. a DB for public access must be highly
    accessible and interpretable; in a DB of
    experimental results, accuracy is more important
    than consistency (why?); etc.

15
DQ Def problems: unhelpfulness/vagueness
  • Partly because we need different DQ measures for
    different DBs, depending on context, the
    conventional definition alone is vague and
    unhelpful when it comes to providing clues for
    how to improve data quality.
  • How do we improve accuracy? It depends on how
    the data are recorded.
  • How do we improve consistency? It depends
    immensely on the kind of data we have, etc.

16
Towards modern definitions of DQ
  • An ideal definition of data quality
  • Is backed up by a good understanding of how and
    why errors occur.
  • Relates to how the data are used.
  • Is helpful in leading to improvements in
    processes.
  • Is helpful in leading to usable measures/metrics.

17
The Data Quality Continuum
  • It's rare that a datum is entered once into a DB
    and then left alone. Usually, a datum has a long
    and varied life, and errors can arise at each and
    every stage. The continuum is
  • Data gathering
  • Data delivery
  • Data storage
  • Data integration
  • Data retrieval
  • Data analysis
  • So, if we want to monitor DQ, we need to monitor
    it at each of these stages.

18
DQ Continuum Example
  • This is an example I am familiar with, which
    helps to illustrate the DQ continuum.
  • The International Seismological Centre (ISC) is
    in Thatcham, in Berkshire. It's a charity funded
    by various governments. Its role is to be the
    repository recording all earthquake events on
    the planet.

19
DQ Continuum ISC example: gathering
[Diagram: local data gathering centres feeding seismograph data to the ISC]

20
DQ Continuum ISC example: delivery
  • Raw seismograph data travel from local collection
    points to DG centres.
  • Email or FTP to the ISC; some centres provide raw
    data, some provide interpreted data (e.g. they
    may not send some data at all if they believe it
    to be in error in the first place).

21
DQ Continuum ISC example: integration
  • The ISC's role is actually to figure out where
    and when the Earth tremors occurred (there are
    hundreds per month), based on reverse engineering
    from seismograph readings. They integrate the raw
    data and attempt to do this, largely by hand and
    brain, and record their findings on archival CDs.

22
DQ Continuum ISC example: retrieval/analysis
  • You can get a CD from the ISC anytime, for the
    Earth-tremor activity on any particular day.
  • I'm not sure whether you can get the raw data
    from them.
  • Naturally, you can analyse the data and see if
    you can find inconsistencies or errors.

23
The ISC DQ Continuum

Where might errors of accuracy occur?
Completeness? Uniqueness? Timeliness?
Consistency? What else is important in this case?
24
Where DQ problems occur (gathering)
  • Manual data entry (how can we improve this? one
    answer is sketched after this list)
  • Lack of uniform standards for format and content.
  • Duplicates arising from parallel entry
  • Approximations, alternatives, entries altered in
    order to cope with s/w and/or h/w constraints.
  • Measurement errors.
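A minimal sketch of that answer: validate fields at the point of entry instead of cleaning them afterwards. The field names, the simplified UK-postcode pattern, and the rules are illustrative assumptions:

import re

# Entry standards, enforced before a record is accepted
# (both patterns are simplified, illustrative rules).
RULES = {
    "postcode": re.compile(r"[A-Z]{1,2}\d{1,2}[A-Z]? \d[A-Z]{2}"),
    "gender":   re.compile(r"Male|Female"),
}

def validate(field, value):
    # Reject a bad value immediately rather than storing it.
    if not RULES[field].fullmatch(value):
        raise ValueError(f"{field}: {value!r} violates entry standard")
    return value

print(validate("postcode", "EX6 8NU"))   # accepted
try:
    validate("postcode", "ex68nu")       # rejected at entry time
except ValueError as e:
    print(e)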

25
Where DQ problems occur (delivery)
  • Multiple hops from source to DB: problems can
    happen anywhere along the way.
  • Inappropriate pre-processing (e.g. removing some
    small seismograph readings before sending on to
    the ISC; rounding up or down when the destination
    needs more accurate data).
  • Transmission problems: buffer overflows, checks
    (did all files arrive, and all correctly? see the
    sketch below).
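A minimal sketch of such an arrival check: compare incoming files against a sender-supplied manifest of SHA-256 checksums. The manifest format, paths, and function names are illustrative assumptions:

import hashlib
from pathlib import Path

def sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def verify_delivery(manifest, incoming_dir):
    # manifest: {filename: expected_digest}, produced by the sender.
    # Report missing or corrupted files instead of silently loading them.
    problems = []
    for name, expected in manifest.items():
        f = Path(incoming_dir) / name
        if not f.exists():
            problems.append(f"missing: {name}")
        elif sha256(f) != expected:
            problems.append(f"corrupted: {name}")
    return problems

An empty return value means every expected file arrived byte-for-byte intact; anything else is a delivery-stage DQ incident.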

26
Where DQ problems occur (storage)
  • Poor, out-of-date, or inappropriate metadata
  • Missing timestamps
  • Conversion to storage format (e.g. to Excel
    files, or to higher/lower precision; a
    precision-loss sketch follows)
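A minimal sketch of the precision hazard: once a reading is stored at lower precision, the discarded digits are gone for good. The reading is a made-up illustrative value:

reading = 4.637219             # raw amplitude, full precision
stored = round(reading, 2)     # storage format keeps only 2 d.p.
print(stored)                  # 4.64
print(abs(reading - stored))   # ~0.0028, irrecoverably lost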

27
Where DQ problems occur (integration)
  • This is the business of combining datasets, e.g.
    from different parts of a company, from
    (previously) different companies following an
    acquisition, from different government agencies,
    etc.
  • Different keys, different fields, different
    formats (see the sketch after this list)
  • Different definitions (customer, income, ...)
  • Sociological factors: reluctance to share!
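A minimal sketch of the keys/formats headache: the same customer keyed as "CUST-0042" in one source and "42" in another. The data and the normalisation rule are illustrative assumptions:

source_a = {"CUST-0042": {"name": "J. Smith",   "income": 31_000}}
source_b = {"42":        {"name": "Jack Smith", "turnover": 31_000}}

def norm_key(k):
    # Map both key styles onto a shared integer id.
    return int(k.replace("CUST-", ""))

merged = {}
for source in (source_a, source_b):
    for key, record in source.items():
        merged.setdefault(norm_key(key), {}).update(record)

print(merged[42])
# {'name': 'Jack Smith', 'income': 31000, 'turnover': 31000}

Note that 'income' and 'turnover' survive as separate fields: key normalisation fixes formats, but reconciling different definitions still needs human judgement.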

28
Where DQ problems occur (retrieval/analysis)
  • The problem here is usually the quality of DBs
    that store the retrieved data, or the use of the
    retrieved data in general. Problems arise
    because
  • The source DB is not properly understood!
  • Straightforward mistakes in the queries that
    retrieve the relevant data.
  • E.g. a database of genes contains entries that
    indicate whether or not each gene has a known or
    suspected link with cancer. A retrieval/analysis
    task leads to publishing a list of genes that are
    not relevant to cancer. What is the problem here?
    (One reading is sketched below.)
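A minimal sketch of that reading: the flag records a known or suspected link, so its absence means "no known link", not "no link". Treating missing evidence as a negative result produces the bogus published list. The gene names and values are illustrative:

genes = {
    "BRCA1":  True,   # known cancer link
    "GENE_X": None,   # never studied: no evidence either way
}

# Flawed query: absence of a link flag is read as "not relevant".
wrong = [g for g, link in genes.items() if not link]

# Careful query: only genes positively recorded as unlinked.
right = [g for g, link in genes.items() if link is False]

print(wrong)   # ['GENE_X'] -- wrongly published as cancer-irrelevant
print(right)   # [] -- the DB cannot honestly support the claim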

29
What this lecture was about
  • A conventional definition of DQ
  • Problems with that definition
  • The DQ Continuum
  • Where errors might occur along that continuum