Data Quality - PowerPoint PPT Presentation

About This Presentation
Title:

Data Quality

Slides: 31
Provided by: davidl116
Learn more at: https://cs.nyu.edu

Transcript and Presenter's Notes


1
Data Quality
  • Class 2
  • David Loshin

2
Goals
  • Cost of low data quality
  • Mapping the information chain
  • Data Quality impacts
  • Economic measures
  • Impact domains
  • Building the Data Quality ROI Model

3
Goals 2
  • Data Cleansing Project
  • Goal of the application
  • Components of the application

4
Cost of Low Data Quality
  • Data quality is measured using anecdotes
  • Hazy feeling of wrongness
  • Desire to gauge the true cost of poor data quality

5
5 Steps
  • Map the Information Chain
  • Categorize costs associated with low data quality
  • Identify and estimate actual effect
  • Determine cost of fixing problem
  • Calculate Return on Investment (ROI)

6
Evidence of Economic Impact
  • Frequent service interruptions and system
    failures
  • Drop in productivity vs. volume
  • High employee turnover
  • High new business/continued business ratio
  • Increased customer service requirements
  • Customer Attrition

7
The Information Chain
  • Data flow model
  • Processing stages
  • Communication/data transfer

8
The Information Chain 2
  • Data Supply
  • Data Acquisition
  • Data Creation
  • Data Processing
  • Data Packaging
  • Decision Making
  • Decision Implementation
  • Data Delivery
  • Data Consumption

9
Information Chain 3
  • Information chain as a data flow graph
  • Processing stages are vertices in the graph
  • Directed message-passing channels are directed
    edges
  • Examples
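
A minimal sketch of the graph view described above, assuming a simple linear chain. The stage names come from the stage list on the earlier slide; the particular edges and the `downstream` helper are illustrative assumptions, not part of the presentation.

```python
# Processing stages are vertices; directed channels are (source, target) edges.
# The edges below are a hypothetical example chain.
edges = [
    ("Data Supply", "Data Acquisition"),
    ("Data Acquisition", "Data Processing"),
    ("Data Processing", "Data Packaging"),
    ("Data Packaging", "Data Delivery"),
    ("Data Delivery", "Data Consumption"),
]

# Adjacency list: stage -> list of stages it sends data to.
graph = {}
for src, dst in edges:
    graph.setdefault(src, []).append(dst)


def downstream(stage):
    """Return every stage reachable from `stage` along directed edges."""
    seen = []
    stack = [stage]
    while stack:
        current = stack.pop()
        for nxt in graph.get(current, []):
            if nxt not in seen:
                seen.append(nxt)
                stack.append(nxt)
    return seen


affected = downstream("Data Processing")
```

A data quality problem introduced at one stage can affect every stage in `downstream(stage)`, which is one reason to map the chain before estimating costs.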

10
Impacts of Low Data Quality
  • Hard impacts can be estimated and/or measured
  • Soft impacts are hard to measure, but clearly
    evident

11
Hard Impacts
  • Customer attrition
  • Costs attributed to error detection
  • Costs attributed to error rework
  • Costs attributed to prevention of errors
  • Costs associated with customer service
  • Costs associated with fixing customer problems
  • Costs associated with enterprisewide data
    inconsistency
  • Costs attributable to delays in processing

12
Soft Impacts
  • Difficulty in decision making
  • Time delays in operation
  • Organizational mistrust
  • Lowered ability to effectively compete
  • Data ownership conflicts
  • Lowered employee satisfaction

13
Economic Measures
  • Cost Increase
  • Revenue Decrease
  • Cost Decrease
  • Revenue Increase
  • Delay
  • Speedup
  • Increased Satisfaction
  • Decreased Satisfaction

14
Impact Domains
  • Operational
  • Tactical/Strategic

15
Operational Impacts
  • Detection
  • Correction
  • Rollback
  • Rework
  • Prevention
  • Warranty
  • Reduction
  • Attrition
  • Blockading

16
Tactical/Strategic Impacts
  • Delays
  • Preemption
  • Idling
  • Increased Difficulty
  • Lost opportunities
  • Organizational mistrust
  • Alignment
  • Acquisition overhead
  • Decay
  • Infrastructure

17
Putting it Together
  • Map the information chain
  • Conduct interviews to locate data quality
    problems
  • Annotate information chain with location of data
    quality problems
  • Identify impact domains for each problem
  • Characterize economic impact (cost!)
  • Aggregate totals
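
The last two steps above can be sketched as a small aggregation, assuming each problem found in the interviews is recorded with its location in the chain, its impact category, and an estimated annual cost. The records, field names, and figures below are hypothetical.

```python
from collections import Counter

# Hypothetical annotated problems; all costs are made-up illustration values.
problems = [
    {"stage": "Data Acquisition", "impact": "Detection", "annual_cost": 12000},
    {"stage": "Data Processing", "impact": "Rework", "annual_cost": 30000},
    {"stage": "Data Processing", "impact": "Detection", "annual_cost": 8000},
]


def aggregate(records, key):
    """Sum the estimated annual cost of the records, grouped by `key`."""
    totals = Counter()
    for record in records:
        totals[record[key]] += record["annual_cost"]
    return dict(totals)


by_stage = aggregate(problems, "stage")
by_impact = aggregate(problems, "impact")
total_cost = sum(p["annual_cost"] for p in problems)
```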

18
ROI Model
  • Create a spreadsheet with assigned costs
  • Add in costs of improvements
  • Determine best return on investment
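
A minimal sketch of the spreadsheet comparison, assuming ROI is computed with the common definition (cost removed minus cost of the fix, divided by cost of the fix). The slides do not give a formula, so this definition, the improvement names, and the figures are all assumptions.

```python
# Hypothetical candidate improvements; all figures are illustration values.
improvements = [
    {"name": "validate at entry", "fix_cost": 10000, "cost_removed": 25000},
    {"name": "dedupe customer file", "fix_cost": 20000, "cost_removed": 30000},
]


def roi(item):
    """Net gain per unit spent on the improvement (assumed ROI definition)."""
    return (item["cost_removed"] - item["fix_cost"]) / item["fix_cost"]


# Rank candidates from best to worst return.
ranked = sorted(improvements, key=roi, reverse=True)
best = ranked[0]
```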

19
Data Cleansing Project
  • Write an application to cleanse data
  • Record Parsing
  • Metadata cleansing
  • Data standardization
  • Data correction
  • Data enhancement

20
Record Parsing
  • Data element types
  • first names
  • last names
  • honorifics
  • titles
  • street names
  • directions
  • business words
  • etc.

21
Data Domains
  • Data types
  • Subclassed data types: domains
  • Mappings between domains

22
Data Domains 2
  • Data type: char(2)
  • 676 possible non-punctuation members
  • Data domain: US state abbreviations
  • 62 possible members
  • Subclassed data domain: New England
  • ME, NH, VT, MA, CT, RI
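
One way to picture the subclassing above is with sets, assuming "non-punctuation" means pairs of uppercase letters (26 x 26 = 676). The variable names are hypothetical, and the full state-abbreviation domain is left out rather than guessed.

```python
import string

# Underlying data type char(2), restricted to pairs of uppercase letters.
CHAR2_ALPHA = {
    a + b for a in string.ascii_uppercase for b in string.ascii_uppercase
}

# Subclassed domain, with the members listed on the slide.
NEW_ENGLAND = {"ME", "NH", "VT", "MA", "CT", "RI"}

type_size = len(CHAR2_ALPHA)
is_subdomain = NEW_ENGLAND <= CHAR2_ALPHA  # every member is a valid char(2)
```

Each level narrows the one above it: a value can be a legal char(2) without being a state abbreviation, and a state abbreviation without being in New England.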

23
Data Domains 3
  • Enumerated domains
  • All values are explicit
  • Rule-based domains
  • Domain definition is generative
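
A short sketch contrasting the two kinds of domain. The enumerated domain reuses the New England members from the previous slide; the rule-based example (a five-digit ZIP code with an optional four-digit extension) is an assumed illustration that does not appear in the slides.

```python
import re

# Enumerated domain: every legal value is listed explicitly.
NEW_ENGLAND = frozenset({"ME", "NH", "VT", "MA", "CT", "RI"})


def in_enumerated(value):
    return value in NEW_ENGLAND


# Rule-based domain: membership is defined by a rule, not a list.
# Hypothetical rule: five digits, optionally followed by "-" and four digits.
ZIP_RULE = re.compile(r"\d{5}(-\d{4})?")


def in_rule_based(value):
    return ZIP_RULE.fullmatch(value) is not None
```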

24
Record Parsing
  • Tokenizing data elements within an attribute
  • Assign meaning to tokens
  • Domain membership
  • Patterns
  • Context

25
Tokenizing
  • Straightforward
  • white-space separated
  • punctuation important or not?
  • Result: stream of tokens
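
A minimal tokenizer sketch for the points above. The `keep_punctuation` flag is a hypothetical way to express the "punctuation important or not?" choice; it splits punctuation marks into their own tokens, which goes slightly beyond plain white-space splitting.

```python
import re


def tokenize(text, keep_punctuation=True):
    """Split an attribute value into a list of tokens."""
    if keep_punctuation:
        # Words, plus each punctuation mark as its own token.
        return re.findall(r"\w+|[^\w\s]", text)
    # Words only; punctuation is discarded.
    return re.findall(r"\w+", text)


with_punct = tokenize("Loshin, David Mr.")
without_punct = tokenize("Loshin, David Mr.", keep_punctuation=False)
```

Keeping the comma matters for name formats such as "last, first", where the comma is the only signal that the order is reversed.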

26
Domain Membership
  • Can each token be assigned to a domain?
  • Based strictly on token value
  • Based on patterns
  • Based on context

27
Domain Membership 2
  • Domains can be maintained in memory using hash
    tables
  • Search for domain membership is the same as hash
    table lookups
  • What if a token belongs to more than one domain?
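
A sketch of hash-table lookup for domain membership, using Python sets (which are hash-based). The domain names and members are hypothetical; "jordan" is placed in two domains on purpose to show the ambiguity raised in the last bullet.

```python
# Each domain is a hash set of lowercase members (illustrative values only).
DOMAINS = {
    "first_name": {"david", "jordan", "mary"},
    "last_name": {"loshin", "jordan", "smith"},
    "honorific": {"mr", "ms", "dr"},
}


def domains_of(token):
    """Return the names of all domains that contain the token."""
    key = token.lower()
    return [name for name, members in DOMAINS.items() if key in members]


unambiguous = domains_of("Dr")
ambiguous = domains_of("Jordan")
unknown = domains_of("xyz")
```

Returning a list rather than a single domain leaves the ambiguous case open for patterns or context to resolve.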

28
Patterns
  • Certain kinds of data attributes are organized
    around token patterns
  • Example: names can appear using these kinds of
    patterns
  • (title) (first) (middle) (last)
  • (title) (first) (initial) (last)
  • (first) (middle) (last)
  • (last) (comma) (first) (middle)
  • etc.
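
A small sketch of pattern matching over token labels, assuming each token has already been assigned a label such as "title" or "first". The pattern set mirrors the name patterns listed above; the function name is hypothetical.

```python
# Known name patterns, each a sequence of token labels.
NAME_PATTERNS = {
    ("title", "first", "middle", "last"),
    ("title", "first", "initial", "last"),
    ("first", "middle", "last"),
    ("last", "comma", "first", "middle"),
}


def matches_name_pattern(labels):
    """True if the label sequence is exactly one of the known patterns."""
    return tuple(labels) in NAME_PATTERNS


ok = matches_name_pattern(["last", "comma", "first", "middle"])
not_ok = matches_name_pattern(["first", "title", "last"])
```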

29
Context
  • What happens when a token belongs to more than
    one domain?
  • We can use context to infer which domain applies
  • Build weights based on frequency training
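
One possible reading of "weights based on frequency training", sketched under stated assumptions: count how often each label follows each other label in hand-labeled examples, then pick the candidate domain seen most often after the previous label. The training sequences, the "START" marker, and the `resolve` helper are hypothetical.

```python
from collections import Counter

# Hypothetical hand-labeled training sequences.
training = [
    ["honorific", "first_name", "last_name"],
    ["first_name", "last_name"],
    ["honorific", "last_name"],
    ["first_name", "first_name", "last_name"],
]

# weights[(previous_label, label)] = number of times the pair was observed.
weights = Counter()
for sequence in training:
    prev = "START"
    for label in sequence:
        weights[(prev, label)] += 1
        prev = label


def resolve(prev_label, candidates):
    """Pick the candidate domain most often seen after `prev_label`."""
    return max(candidates, key=lambda c: weights[(prev_label, c)])


after_first = resolve("first_name", ["first_name", "last_name"])
at_start = resolve("START", ["first_name", "last_name"])
```

With this training data, a token that could be either a first or a last name is read as a last name when it follows a first name, and as a first name at the start of a record.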

30
Next Week
  • Dimensions of Data Quality
  • Project specification