1
Data Quality
  • Class 9

2
Agenda
  • Project: 4 weeks left
  • Exam
  • Rule Discovery
  • Data Linkage

3
Project
  • Please email me a schema by Friday
  • I will review the schema
  • By next Friday: an application to parse a file
    and generate table entries

4
Exam
  • Answers to questions

5
Rule Discovery
  • Decision and Classification Trees
  • Association Rules

6
Decision and Classification Trees
  • Each node in tree represents a question
  • Decision as to which path to take from that node
    is dependent on the answer to the question
  • At each step along the path from the root of the
    tree to the leaves, the set of records that
    conform to the answers along the way continues to
    grow smaller

7
Decision and Classification 2
  • At each node in the tree, we have a
    representative set of records that conform to the
    answers to the questions along that path
  • Each node in the tree represents a segmenting
    question, which subdivides the current
    representative set into two smaller segments
  • Every path is unique
  • Each node in the tree also represents the
    expression of a rule

8
Example
9
Decision and Classification 3
10
CART
  • Classification and Regression Tree
  • Grab a training set
  • Subselect some records that we know already share
    some attribute properties
  • All other data attributes become independent
    variables
  • The results of the decision tree process are to
    be applied to the entire data set at a later date

11
CART 2
  • Decide which of the independent variables is the
    best for splitting the records
  • The choice for the next split is based on
    choosing the criteria that divide the records
    into sets where, in each set, a single
    characteristic predominates
  • Evaluate the possible ways to split based on each
    independent variable, measuring how good that
    split will be

12
Selection Heuristics
  • Gini: maximizes the differentiation achieved by a
    split, with the goal of isolating records of one
    class from the other records (sketched in code
    below)
  • Twoing: tries to distribute the records evenly at
    each split opportunity
  • There are other heuristics
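The Gini heuristic can be made concrete with a little code. Below is a minimal Python sketch, not the deck's own implementation, of scoring a candidate binary split by weighted Gini impurity (lower impurity after the split means better class isolation); the record layout and the churn attribute are invented for illustration:

    from collections import Counter

    def gini(records, class_attr):
        # Gini impurity: 1 minus the sum of squared class proportions
        counts = Counter(r[class_attr] for r in records)
        total = sum(counts.values())
        return 1.0 - sum((n / total) ** 2 for n in counts.values())

    def split_score(records, attr, value, class_attr):
        # Weighted impurity after splitting on (attr <= value); lower is better
        left = [r for r in records if r[attr] <= value]
        right = [r for r in records if r[attr] > value]
        if not left or not right:
            return float("inf")  # degenerate split, nothing separated
        n = len(records)
        return (len(left) / n) * gini(left, class_attr) \
             + (len(right) / n) * gini(right, class_attr)

    # Toy data; the churn attribute is invented for illustration
    data = [{"monthly_bill": 120, "churn": "no"},
            {"monthly_bill": 40,  "churn": "yes"},
            {"monthly_bill": 95,  "churn": "yes"},
            {"monthly_bill": 150, "churn": "no"}]
    print(split_score(data, "monthly_bill", 100, "churn"))  # 0.0, a clean split
    print(split_score(data, "monthly_bill", 50, "churn"))   # about 0.33, mixed

Evaluating this score for each independent variable and each candidate value is exactly the "measure how good that split will be" step described on the previous slide.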

13
CART 3
  • The complete tree is built by recursively
    splitting the data at each decision point in the
    tree
  • At each step, if we find that for a certain
    attribute all values are the same, we eliminate
    that attribute from future consideration
  • When we reach a point where no appropriate split
    can be found, we determine that node to be a leaf
    node
  • When the tree is complete, the splitting
    properties at each internal node can be evaluated
    and assigned some meaning

14
Example 2
15
Rules
  • If (monthly_bill > 100) AND (PayPerViews < 2)
  • If (monthly_bill > 100) AND (PayPerViews > 2) AND
    (PayPerViews < 5)
  • If (monthly_bill > 100) AND (PayPerViews > 5)
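Read as code, these tree paths translate directly into a rule function. A minimal Python sketch follows; the returned segment labels are placeholders, since the transcript does not show the consequent of each rule:

    def segment(record):
        # Tree paths from the slide as rules; the labels are placeholders,
        # since the rule consequents are not shown in the transcript
        bill, ppv = record["monthly_bill"], record["PayPerViews"]
        if bill > 100 and ppv < 2:
            return "segment A"
        if bill > 100 and 2 < ppv < 5:
            return "segment B"
        if bill > 100 and ppv > 5:
            return "segment C"
        return None  # as written, PayPerViews of exactly 2 or 5 falls through

    print(segment({"monthly_bill": 130, "PayPerViews": 1}))  # segment A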

16
Association Rules
  • Rules of the form X → Y, where X is a set of
    (attribute, value) pairs and Y is a set of
    (attribute, value) pairs
  • An example: 94% of the customers that purchase
    tortilla chips and cheese also purchase salsa
  • This can be used for many application domains,
    such as market basket analysis
  • Can also be used to discover data quality rules

17
Association Rules 2
  • Formally
  • Let D be a database of records
  • Each record R in D contains a set of (attribute,
    value) pairs (each pair is also called an item)
  • An itemset X is a subset of the (attribute,
    value) pairs of a record R (i.e., X ⊆ R)
  • An association rule is an implication of the form
    X → Y, where X and Y are both itemsets, and share
    no attributes
  • The rule holds with confidence c if c% of the
    records that contain X also contain Y
  • The rule has support s if s% of the records in D
    contain X ∪ Y (see the sketch below)
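Under these definitions, support and confidence are straightforward to compute. A minimal Python sketch, with records modeled as sets of (attribute, value) pairs and a toy tortilla-chips database invented for illustration:

    def support_confidence(records, X, Y):
        # Support: fraction of records containing every item of X and Y
        # Confidence: fraction of X-containing records that also contain Y
        both = sum(1 for r in records if X | Y <= r)
        with_x = sum(1 for r in records if X <= r)
        support = both / len(records)
        confidence = both / with_x if with_x else 0.0
        return support, confidence

    # Toy database; attribute names invented for illustration
    records = [{("chips", "yes"), ("cheese", "yes"), ("salsa", "yes")},
               {("chips", "yes"), ("cheese", "yes"), ("salsa", "no")},
               {("chips", "no"), ("cheese", "no"), ("salsa", "no")}]
    X = {("chips", "yes"), ("cheese", "yes")}
    Y = {("salsa", "yes")}
    print(support_confidence(records, X, Y))  # support 1/3, confidence 1/2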

18
Association Rules 3
  • Confidence is the percentage of time that the
    rule holds when X is in the record
  • Support is the percentage of time that the rule
    could hold
  • Association rules describe a relation imposed on
    individual values that appear in the data
  • Association rules with high confidence are likely
    to imply generalities about the data
  • We can infer data quality rules from the
    discovery of association rules

19
Association Rules 4
  • Example
  • (CustomerType = Business) AND (total > 1000) →
    (managerApproval = required), with confidence 85%
    and support 25%
  • This means that 25% of the time, the record had
    all of those attributes set with the indicated
    values
  • Of the records with (CustomerType = Business) AND
    (total > 1000), 85% of the time the attribute
    managerApproval had the value required
  • We might infer from this a more general rule:
    business orders greater than 1000 require manager
    approval
  • This calls into question the 15% of the time it
    didn't hold true: is it a data quality problem,
    or is it not a general rule?

20
Association Rules 5
  • We can set some minimum support and minimum
    confidence levels
  • Definitions
  • Lk is the set of large itemsets having k items
  • Ck is the set of candidate itemsets having k items

21
Association Rules Algorithm
  • L1 = {large itemsets with 1 item}
  • for (k = 2; Lk-1 not empty; k++) do
  •   Ck = generate_new_candidates(Lk-1)
  •   forall records R in D do
  •     CR = subset(Ck, R)
  •     forall candidates c in CR do
  •       c.count++
  •     end
  •   end
  •   Lk = {c in Ck | c.count ≥ minimum support}
  • end

22
Candidate Generation and Subset
  • Takes the set of all large itemsets of size (k − 1)
  • First, it joins Lk-1 with Lk-1, pairing itemsets
    that share (k − 2) items, to get a superset of the
    set of candidates
  • A candidate is pruned if some subset of its items
    does not have minimum support (i.e., a subset of
    size (k − 1) is not in Lk-1)
  • The subset operation takes a record and finds all
    candidates of iteration k contained in that record
    (see the sketch below)
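Putting the last two slides together, here is a minimal runnable Python sketch of the algorithm, not the deck's own code. Records are modeled as frozensets of items, the join, prune, and subset steps follow the description above, and minimum support is given as a record count:

    from itertools import combinations

    def apriori(D, min_support_count):
        # D is a list of records, each a frozenset of hashable items,
        # e.g. (attribute, value) pairs
        counts = {}
        for R in D:
            for item in R:
                counts[item] = counts.get(item, 0) + 1
        # L1: single items meeting minimum support
        Lk = {frozenset([item]) for item, n in counts.items()
              if n >= min_support_count}
        large = list(Lk)
        k = 2
        while Lk:
            # Join: unite (k-1)-itemsets that share (k-2) items
            Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
            # Prune: drop candidates with a (k-1)-subset not in Lk-1
            Ck = {c for c in Ck
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
            # Subset: count, per record, the candidates it contains
            counts = dict.fromkeys(Ck, 0)
            for R in D:
                for c in Ck:
                    if c <= R:  # every item of the candidate appears in R
                        counts[c] += 1
            Lk = {c for c, n in counts.items() if n >= min_support_count}
            large.extend(Lk)
            k += 1
        return large

    # Toy market-basket run; items are plain strings for brevity
    D = [frozenset({"chips", "cheese", "salsa"}),
         frozenset({"chips", "cheese"}),
         frozenset({"chips", "salsa"})]
    print(apriori(D, 2))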

23
More on Association Rules
  • We can adjust our goals for finding rules by
    quantizing the values in each attribute
  • In other words, we can assign values of
    attributes that span large ranges into quantized
    components, making the rule process less
    cumbersome
  • We can also use clustering to enhance the
    association rule algorithm
  • If we don't know how to quantize to begin with,
    use clustering for the values
  • Association rules can uncover interesting data
    quality and business rules

24
Record Linkage
  • Critical component of data quality applications
  • Linkage involves finding a link between a pair of
    records, either through an exact match or through
    an approximate match
  • Linkage is useful for
  • data cleansing
  • data correction
  • enhancement
  • householding

25
Record Linkage 2
  • Two records are linked when they match with
    enough weighted similarity
  • Matching can range from exact matching on
    particular fields to approximate matching with
    some degree of similarity
  • For example, two customer records can be linked
    via an account number, if account numbers are
    uniquely assigned to customers

26
Record Linkage 3
  • Example
  • David Loshin
  • 633 Evergreen Terrace
  • Montclair, NJ
  • 201-765-8293
  • vs.
  • H. David Loshin
  • 633 Evergreen
  • Montclair, NJ
  • 201-765-8293
  • In this case, we can establish a link based
    solely on telephone number

27
Record Linkage 4
  • Frequently, pivot attributes exist and can be
    used for exact matching (such as social security
    number, account number, telephone number, student
    ID); a minimal sketch follows below
  • Often, there is no pivot attribute, and therefore
    approximate matching techniques must be used
  • Approximate matching is a process of looking for
    similarity
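A minimal Python sketch of pivot-based exact matching, using the telephone number from the earlier example as the pivot attribute; the record layout and field names are invented for illustration:

    def link_by_pivot(records_a, records_b, pivot="phone"):
        # Index one side by the pivot value, then probe with the other side
        index = {}
        for r in records_a:
            index.setdefault(r.get(pivot), []).append(r)
        links = []
        for r in records_b:
            for match in index.get(r.get(pivot), []):
                links.append((match, r))
        return links

    a = [{"name": "David Loshin", "phone": "201-765-8293"}]
    b = [{"name": "H. David Loshin", "phone": "201-765-8293"}]
    print(link_by_pivot(a, b))  # one linked pair, via the telephone number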

28
Similarity
  • As with clustering, we use measures of similarity
    to establish scores and thresholds for linkage
  • The most interesting areas for similarity are in
    string matching

29
String Similarity
  • How do we characterize similarity between
    strings?
  • We can see it with our eyes, or hear it inside
    our heads
  • Example: Smith, Smythe
  • Example: John and Mary Jones, John Jones
  • How do we transfer this to automation?

30
Edit Distance
  • Edit distance operations
  • Insertion, where an extra character is inserted
    into the string
  • Deletion, where a character has been removed from
    the string
  • Transposition, in which two characters are
    reversed in their sequence
  • Substitution, which is an insertion followed by a
    deletion

31
Edit Distance 2
  • Strings with a small edit distance are likely to
    be similar
  • Edit distance is measured as the count of edit
    operations needed to transform one string into
    another
  • internatianl
  • transposition
  • internatinal
  • insertion
  • international
  • internatianl to international has an edit
    distance of 2

32
Computing Edit Distance
  • Use dynamic programming
  • Given two strings, x = x1x2..xn and y = y1y2..ym,
  • the edit distance f(i, j) is computed as the best
    match of the two prefixes x1x2..xi and y1y2..yj,
    where
  • f(0, 0) = 0, f(i, 0) = i, f(0, j) = j
  • f(i, j) = min{f(i-1, j) + 1, f(i, j-1) + 1,
    f(i-1, j-1) + d(xi, yj)}
  • where d(xi, yj) = 0 if xi = yj, and 1 otherwise
    (implemented below)
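A direct Python translation of the recurrence follows. Note this recurrence covers insertions, deletions, and substitutions (Levenshtein distance); counting a transposition as a single operation, as listed on the earlier Edit Distance slide, would need the Damerau extension, but the internatianl example still scores 2 here (one substitution plus one insertion):

    def edit_distance(x, y):
        # f[i][j] = edit distance between prefixes x[:i] and y[:j]
        n, m = len(x), len(y)
        f = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            f[i][0] = i          # delete i characters
        for j in range(1, m + 1):
            f[0][j] = j          # insert j characters
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = 0 if x[i - 1] == y[j - 1] else 1
                f[i][j] = min(f[i - 1][j] + 1,      # deletion
                              f[i][j - 1] + 1,      # insertion
                              f[i - 1][j - 1] + d)  # match / substitution
        return f[n][m]

    print(edit_distance("internatianl", "international"))  # 2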

33
Phonetic Similarity
  • Words that sound the same may be misspelled
  • Phonetic similarity reduces the complexity of the
    strings
  • Effectively, it compresses the strings, with some
    loss, then performs a similarity test

34
Soundex
  • The first character of the name string is
    retained, and then numbers are assigned to
    following characters
  • The numbers are assigned using this breakdown
  • 1: B P F V
  • 2: C S K G J Q X Z
  • 3: D T
  • 4: L
  • 5: M N
  • 6: R
  • Vowels are ignored (a sketch follows below)
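A minimal Python sketch of this encoding. It reproduces the examples on the next slide (e.g. Fitzgerald → F322); note that classic Soundex additionally collapses adjacent duplicate codes, which would give F326 instead, so treat this as the simplified scheme described here:

    SOUNDEX_CODES = {
        **dict.fromkeys("BPFV", "1"),
        **dict.fromkeys("CSKGJQXZ", "2"),
        **dict.fromkeys("DT", "3"),
        "L": "4",
        **dict.fromkeys("MN", "5"),
        "R": "6",
    }

    def soundex(name):
        # Keep the first letter, encode the remaining consonants as digits,
        # ignore vowels and any letter without a code (H, W, Y),
        # then truncate or zero-pad to four characters
        name = name.upper()
        digits = [SOUNDEX_CODES[c] for c in name[1:] if c in SOUNDEX_CODES]
        return (name[0] + "".join(digits))[:4].ljust(4, "0")

    for n in ("Fitzgerald", "Smith", "Smythe", "Loshin"):
        print(n, soundex(n))  # F322, S530, S530, L250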

35
Soundex 2
  • Examples
  • Fitzgerald F322
  • Smith S530
  • Smythe S530
  • Loshin L250

36
Soundex 3
  • Regular soundex is flawed
  • geared towards English names
  • can't account for an incorrect first letter
  • longer names are truncated
  • Options
  • encode entire string, not just the first 4
    consonant sounds
  • reverse the words, then encode both forward and
    backward
  • Use different phonetic encodings

37
Other Phonetic Encoding
  • NYSIIS
  • Similar to soundex, but
  • does not use numeric encoding; instead maps to a
    smaller set of consonants
  • replaces all vowels with A
  • Metaphone
  • Tries to be more exact with multiple-letter
    sounds (sh, tio, th, etc.)

38
N-gramming
  • Another means of representing a compressed form
    of a string
  • An n-gram is a chunk of text of length n
  • We slide a window of size n across a string to
    generate the set of n-grams for that string

39
N-gram Example
  • INTERNATIONAL comprises these 2-grams
  • IN, NT, TE, ER, RN, NA, AT, TI, IO, ON, NA, AL
  • Compare this with INTERNATIANL
  • IN, NT, TE, ER, RN, NA, AT, TI, IA, AN, NL
  • These two strings share 8 2-grams

40
N-gram Measures
  • Absolute overlap: the ratio of matching n-grams
    to the total number of n-grams, equal to
    (2 × |ngram(X) ∩ ngram(Y)|) ÷ (|ngram(X)| +
    |ngram(Y)|)
  • Source overlap: the number of matching n-grams
    divided by the number of n-grams in the source
    string X: |ngram(X) ∩ ngram(Y)| ÷ |ngram(X)|
  • Search overlap: the number of matching n-grams
    divided by the number of n-grams in the search
    string Y: |ngram(X) ∩ ngram(Y)| ÷ |ngram(Y)|
    (computed in the sketch below)
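A minimal Python sketch of n-gram generation and the three overlap measures. N-grams are kept as multisets (collections.Counter) so that repeats such as the second NA in INTERNATIONAL are counted, reproducing the 8 shared 2-grams from the example:

    from collections import Counter

    def ngrams(s, n=2):
        # Slide a window of size n across s; keep counts (a multiset)
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    def overlap_measures(x, y, n=2):
        gx, gy = ngrams(x, n), ngrams(y, n)
        match = sum((gx & gy).values())             # matching n-grams
        total_x, total_y = sum(gx.values()), sum(gy.values())
        absolute = 2 * match / (total_x + total_y)  # absolute overlap
        source = match / total_x                    # source overlap
        search = match / total_y                    # search overlap
        return match, absolute, source, search

    print(overlap_measures("INTERNATIONAL", "INTERNATIANL"))
    # (8, 0.695..., 8/12, 8/11): 8 shared 2-grams, as in the example above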