1
Data Quality Assurance in Telecommunications
Databases
C.-M. Chen, M. Cochinwala
{chungmin, munir}@research.telcordia.com
Applied Research, Telcordia Technologies
Morristown, NJ 07960
2
Outline
  • Operation Support Systems
  • Data Quality Issues
  • Record Matching Problem & Techniques
  • Example Applications
  • References

3
Operation Support Systems
  • Telecom carriers or service providers use
    multiple Operation Support Systems (OSS) for
  • network configuration
  • network engineering
  • service provisioning
  • network performance monitoring
  • customer care & billing, etc.
  • Each OSS may maintain its own database using a
    DBMS

4
OSS Databases
  • The databases may overlap on the network entities
    they describe
  • The OSSs may use different models/schemas to
    describe the same entity

5
Data Quality Issues
  • Corrupted data: data do not correctly reflect the
    properties/specifications of the modeled entities
  • Inaccurate data: discrepancy between the data and
    the real-world entities they model (e.g. outdated
    information)
  • Inconsistent data:
  • records in different OSS databases referencing
    the same real-world entity do not agree on the
    attribute values of the entity
  • an entity is represented by more than one record
    in a single OSS database

6
Example: DSL Deployment
  • corrupted data: many of the Central Offices (COs)
    do not have correct location information
    (longitude, latitude)
  • inaccurate data: the database says a port or
    bandwidth is available at the CO when in fact it
    is not (or vice versa)
  • inconsistent data: the Service Provisioning database
    says a line is available while the Network
    Configuration database says otherwise (which one
    to believe?)

7
Why is data quality an issue?
  • corrupted data disrupt/delay business deployment
  • inaccurate data impede decision making
  • inconsistent data degrade operation efficiency
  • bottom line:
  • increased cost
  • decreased revenue
  • customer dissatisfaction

8
Cost of Data Quality Assurance
  • To date, the practice of data quality assurance is
    mostly manual and labor-intensive
  • in the mid-1990s it cost a typical telephone
    company about $2 per line per year to keep the
    network database reasonably accurate
  • Now imagine the scale and complexity of the
    networks brought by the IP and wireless
    technologies and their effects on the cost of
    data quality
  • ⇒ (semi-)automatic tools are needed

9
Data Quality Assurance Techniques
  • data correctness and accuracy
  • database validation and correction via
    autodiscovery tools
  • data consistency (reconciliation)
  • record matching (record linkage): identify
    records that correspond to the same real-world
    entity

10
Data Accuracy
  • to assess and correct the discrepancy between
    data in the database and the modeled entities in
    the real world
  • autodiscovery tools
  • automatically probe network elements for
    configuration parameters
  • reconstruct the network connection topology at
    different layers (from physical to application
    layers)
  • how to efficiently validate the data stored in
    the databases against the auto-discovered
    information?
  • sampling
  • graph algorithms

11
Record Matching Techniques [Cochinwala01b]
  • Problem Definition
  • Record Matching Phases
  • Quality Metrics
  • Searching Techniques (to reduce search space)
  • Matching Decision Rules (to determine matched
    pairs)

12
The Record Matching Problem
  • Goal To identify records in the same or
    different databases that correspond to the same
    real-world entity.
  • two records that correspond to the same entity
    are called a matched pair
  • in a relational table, no two records can hold
    identical values on all attributes ⇒ a matched
    pair consists of two different records that refer
    to the same entity
  • the same entity may be represented differently in
    different tables/databases, with no agreement on
    key value(s)

13
Three Phases of Reconciliation
  • Data Preparation scrubbing and cleansing
  • Searching reducing search space
  • Matching finding matched pairs

14
Data Preparation
  • Parsing
  • Data transformation
  • Standardization

15
Searching Problem
  • Consider reconciling tables A and B
  • search space: P = A × B (Cartesian product)
  • M: matched pairs
  • U: unmatched pairs
  • P = M ∪ U
  • Problem: searching P requires |A| × |B| comparisons
  • Goal: reduce the search space to a smaller P′ ⊆
    P, such that M ⊆ P′.

16
Searching Techniques
  • Heuristic potentially matched records should
    fall into the same cluster
  • Relational join operators nest-loop join,
    merge-sort join, hash join, band join
    Hernadez96
  • Blocking Soundex Code, NYSIIS Newcombe88
  • Windowing (sorted neighborhood) Hernadez98
  • Priority queue MongeElkan97
  • Multi-pass merge-sort joins Hernadez98
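As a concrete sketch of blocking, the snippet below implements standard American Soundex (the first of the two codes the slide names; NYSIIS works similarly) and uses it to group names so that only pairs inside the same block need be compared:

```python
from collections import defaultdict

# Letter-to-digit table for standard American Soundex.
SOUNDEX_CODES = {c: d for d, letters in
                 {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                  "4": "L", "5": "MN", "6": "R"}.items()
                 for c in letters}

def soundex(name: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    result, prev = name[0], SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":                       # H/W do not break a run of equal codes
            continue
        code = SOUNDEX_CODES.get(ch, "")     # vowels map to "" and break runs
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

def block_by_soundex(names):
    """Group names into blocks keyed by Soundex code; candidate matched
    pairs are only formed within a block, shrinking the search space."""
    blocks = defaultdict(list)
    for n in names:
        blocks[soundex(n)].append(n)
    return dict(blocks)

blocks = block_by_soundex(["Robert", "Rupert", "Ashcraft", "Tymczak"])
```

Here "Robert" and "Rupert" both code to R163 and land in the same block, so they become a candidate pair without comparing every record against every other.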

17
String Matching Techniques
  • Edit distance: counts of inserts, deletes, replaces
    [Manber89]
  • Smith-Waterman: dynamic programming to find
    minimum edit distance [SmithWaterman81]
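A minimal dynamic-programming implementation of the edit (Levenshtein) distance named above, using a rolling one-row table:

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of insert, delete, and
    replace operations turning s into t, computed row by row."""
    prev = list(range(len(t) + 1))            # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        cur = [i]                             # distance from s[:i] to empty t
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                    # delete cs
                           cur[j - 1] + 1,                 # insert ct
                           prev[j - 1] + (cs != ct)))      # replace or keep
        prev = cur
    return prev[-1]
```

For example, edit_distance("kitten", "sitting") is 3 (two replacements and one insertion).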

18
Record Matching Techniques
  • Given a (reduced) space of pairs, how to
    determine if a pair (x, y) is a match, i.e.,
    records x and y refer to the same real-world
    entity?
  • Probabilistic approaches [Newcombe59]
  • Bayes test for minimum error
  • Bayes test for minimum cost
  • Non-probabilistic approaches
  • Supervised learning (to generate decision rules)
    [Cochinwala01a]
  • Equational theory model [Hernandez98]
  • Distance-based techniques [DeySarkarDe98]

19
Bayes Test for Minimum Error
  • Cartesian product: P = A × B = M ∪ U
  • For each record pair (a1, a2, ..., an, b1, b2, ..., bn)
    ∈ P, define
  • x = (x1, x2, ..., xn), where
  • xi = 1 if ai = bi
  •      0 if ai ≠ bi
  • a priori prob. × conditional density function:
  • Class M: πM p(x|M)
  • Class U: πU p(x|U)
  • Unconditional density function:
  • p(x) = πM p(x|M) + πU p(x|U)

20
Bayes Test for Minimum Error (cont.)
  • Assume ?M, ?U, p(xM), and p(xU) are knowm
  • Basyes theorem the posteriori probability
  • p(Mx) ?M p(xM) / (?M p(xM) ?U p(xU))
  • Likelihood Ratio Decision Rule
  • x is decided to belong to M iff
  • p(Mx) ? p(Ux) iff
  • ?M p(xM) ? ?U p(xU) iff
  • l(x) (p(xM) / p(xU)) ? (?U / ?M )
  • The test gives min probability of error
    (misclassification)
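The minimum-error test can be sketched directly; the priors and conditional densities below are toy numbers assumed purely for illustration (independent attributes, agreement probability 0.9 for matched pairs and 0.2 for unmatched pairs):

```python
def likelihood_ratio_decision(x, p_M, p_U, pi_M, pi_U):
    """Minimum-error Bayes test: decide M iff
    l(x) = p(x|M) / p(x|U) >= pi_U / pi_M."""
    return "M" if p_M(x) / p_U(x) >= pi_U / pi_M else "U"

# Assumed toy densities: conditionally independent attributes; a matched
# pair agrees on an attribute with prob. 0.9, an unmatched pair with 0.2.
def p_M(x):
    out = 1.0
    for xi in x:
        out *= 0.9 if xi else 0.1
    return out

def p_U(x):
    out = 1.0
    for xi in x:
        out *= 0.2 if xi else 0.8
    return out

pi_M, pi_U = 0.01, 0.99        # matches are rare in A x B

verdict_all_agree = likelihood_ratio_decision((1, 1, 1, 1), p_M, p_U, pi_M, pi_U)
verdict_one_agree = likelihood_ratio_decision((1, 0, 0, 0), p_M, p_U, pi_M, pi_U)
```

A pair agreeing on all four attributes clears the πU/πM threshold and is declared a match; a pair agreeing on only one does not.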

21
Bayes Test for Minimum Cost
  • C i,j cost of a class j record being
    (mis)-classified to class i
  • Conditional cost of x being classified to class M
    given x is
  • cM(x) cMM p(Mx) cMU p(Ux)
  • Similarly,
  • cU(x) cUM p(Mx) cUU p(Ux)
  • Likelihood Ratio Decision Rule
  • x is decided to belong to M iff
  • cM(x) ? cU(x) iff
  • l(x) (p(xM) / p(xU)) ? ((cMU - cUU )?U /
    (cUM - cMM )?M )
  • The test gives min cost
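The minimum-cost rule only changes the threshold on the likelihood ratio l(x). A small sketch, with illustrative (assumed) cost values:

```python
def min_cost_threshold(C, pi_M, pi_U):
    """Decide M iff l(x) = p(x|M)/p(x|U) >= this threshold, where
    C[(i, j)] is the cost of classifying a class-j pair as class i."""
    return ((C[("M", "U")] - C[("U", "U")]) * pi_U /
            ((C[("U", "M")] - C[("M", "M")]) * pi_M))

# Illustrative costs: correct decisions are free, a false match costs 1,
# and a missed true match costs 10 (all numbers assumed for the example).
costs = {("M", "M"): 0.0, ("U", "U"): 0.0,
         ("M", "U"): 1.0, ("U", "M"): 10.0}
t = min_cost_threshold(costs, pi_M=0.01, pi_U=0.99)
```

Making missed matches ten times as costly as false matches lowers the threshold from πU/πM = 99 to 9.9, so more borderline pairs are declared matches; with equal unit costs the threshold reduces to πU/πM, i.e. the minimum-error test.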

22
Supervised Learning
  • Take a small sample S from A × B
  • For every record pair s ∈ S, label it as M (matched)
    or U (unmatched)
  • Select a predictive model, along with associated
    decision rules, that best discriminates between
    classes M and U for the pairs in S
  • Apply the selected model to classify all pairs
    in A × B
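A minimal sketch of this learning step: a single-split decision stump (the building block of a CART tree, which the later slides select) fitted to a small labeled sample. The feature values and labels below are hypothetical:

```python
def learn_stump(sample):
    """Learn the single (feature, threshold) split that best separates
    labeled pairs into M and U.  `sample` is a list of
    (features, label) tuples with label "M" or "U"; the induced rule is
    "feature f <= threshold => M, else U"."""
    n_features = len(sample[0][0])
    best = None                      # (errors, feature index, threshold)
    for f in range(n_features):
        for thr in sorted({feats[f] for feats, _ in sample}):
            errors = sum((feats[f] <= thr) != (label == "M")
                         for feats, label in sample)
            if best is None or errors < best[0]:
                best = (errors, f, thr)
    return best

# Hypothetical labeled pairs: (name edit distance, address edit distance).
sample = [((0, 1), "M"), ((1, 2), "M"), ((2, 1), "M"),
          ((7, 5), "U"), ((9, 8), "U"), ((6, 9), "U")]
errors, feature, threshold = learn_stump(sample)
```

On this toy sample the stump separates the classes perfectly; a full CART builder applies the same split search recursively to each branch.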

23
Arroyo: a data reconciliation tool [Cochinwala01a]
  • matches customer records across two databases
  • wireline database: 860,000 records
  • wireless database: 1,300,000 records
  • methodology:
  • pre-processing (application dependent)
  • parameter space definition (application
    dependent)
  • matching rule generation & pruning (learning,
    model selection)
  • matching

24
Preprocessing (Application Dependent)
  • elimination of stop-words
  • blanks, special characters, ...
  • "Chung-Min Chen" becomes "ChungMin Chen"
  • word re-ordering
  • "20 PO BOX" becomes "PO BOX 20"
  • word substitution
  • "St." becomes "Street"
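The three normalization steps above can be sketched as one function; the abbreviation table and the box-number heuristic are illustrative assumptions, not Arroyo's actual rules:

```python
import re

# Assumed abbreviation table; extend as needed for the application.
ABBREVIATIONS = {"ST.": "Street", "AVE.": "Avenue"}

def preprocess(text: str) -> str:
    """Normalize a name/address field before matching: expand
    abbreviations, strip special characters, reorder box numbers."""
    # word substitution, e.g. "St." -> "Street"
    words = [ABBREVIATIONS.get(w.upper(), w) for w in text.split()]
    # strip blanks and special characters from each word
    words = [re.sub(r"[^A-Za-z0-9]", "", w) for w in words]
    words = [w for w in words if w]
    # word re-ordering, e.g. "20 PO BOX" -> "PO BOX 20"
    if (len(words) >= 3 and words[0].isdigit()
            and [w.upper() for w in words[1:3]] == ["PO", "BOX"]):
        words = words[1:3] + [words[0]] + words[3:]
    return " ".join(words)
```

This reproduces the slide's three examples: "Chung-Min Chen" → "ChungMin Chen", "20 PO BOX" → "PO BOX 20", and "St." → "Street".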

25
Parameter Space Definition (Application Dependent)
  • Six parameters used
  • edit distance between the Name fields
  • edit distance between the Address fields
  • length of Name field in Wireline
  • length of Name field in Wireless
  • length of Address field in Wireline
  • length of Address field in Wireless

26
Matching Rule Generation
  • 1. Learning Set Generation
  • select records that contain the word "Falcon" in
    the Name field
  • Wireless: 241 records
  • Wireline: 883 records
  • Choose the sample first from the database that is
    the image of the higher-degree onto-mapping. For
    example, Wireline → Wireless has a higher onto
    degree than Wireless → Wireline

27
Matching Rule Generation
  • 2. Learning Set Labeling
  • identify prospective matches
  • match if the edit distances on the Name and Address
    fields are both less than 3
  • (30 false matches but no missed true matches)
  • verdict assignment
  • each pair that is not a prospective match is
    assigned "unmatch"
  • each prospective-match pair is manually examined
    and labeled as match, unmatch, or ambiguous
    (labor intensive)

28
Matching Rule Generation
  • 3. Model Selection
  • Input: 241 × 883 = 212,803 pairs with 7 fields
    (parameters):
  • Label (verdict)
  • Edit distance between Name fields; Edit distance
    between Address fields
  • Length of Name field in Wireless; Length of Name
    field in Wireline
  • Length of Address in Wireless; Length of Address
    in Wireline
  • Three models were tried, with cross-validation (half
    the sample to build the model, half to
    estimate the error rate)
  • Model / Error Rate (average of 50 runs)
  • CART: 5.1%
  • Linear Discriminant Analysis: 5.3%
  • Vector Quantization: 9.4%

29
Matching Rule Generation
  • 3. Model Selection (cont.)
  • Model / Error Rate (average of 50 runs)
  • CART: 5.1%
  • Linear Discriminant Analysis: 5.3%
  • Vector Quantization: 9.4%
  • LDA can yield rules that are inefficient for database
    evaluation, e.g.
  • 2·(Address edit distance) + 1.3·(Name length on
    Wireless) < 3 ⇒ match
  • an index on Name is useless for such a rule
  • CART rules can be evaluated efficiently using
    existing database indices

30
Matching Rule Pruning
  • determines which parameters (rules) to drop to
    reduce tree complexity while retaining tree
    quality
  • maximize the delta of a cost function
  • dynamic programming with a threshold
  • complexity of tree T:
  • C(T) = Σp∈T C(p) / Σq∈S C(q),
  • C(p): complexity to compute the value of
    parameter p (S: the full parameter set)
  • e.g.
  • C(edit distance between fields A and B) =
    avg_length(A) × avg_length(B)
  • Note: 0 < C(T) ≤ 1

31
Matching Rule Pruning (cont.)
  • C(edit distance between fields A and B) =
    avg_length(A) × avg_length(B)
  • Parameter / Avg length / Complexity
  • Address length (wireless): 6.88 / 6.88
  • Address length (wireline): 6.72 / 6.72
  • Name length (wireless): 10.87 / 10.87
  • Name length (wireline): 10.70 / 10.70
  • Edit distance on Addresses: 6.88 × 6.72 = 46.23
  • Edit distance on Names: 10.87 × 10.70 = 116.31

32
Matching Rule Pruning (cont.)
  • misclassification rate of T:
  • M(T) = |misclassified pairs| / |pairs in the
    test set|
  • Note: 0 ≤ M(T) ≤ 1
  • Cost of tree T:
  • J(T) = w1·C(T) + w2·M(T),
  • where w1 and w2 are weights (we used w1 =
    w2 = 1)
  • Let T′ = T − (all rules involving parameter p),
    and define
  • ΔJ(T) = J(T) − J(T′)
  •       = (C(T) − C(T′)) + (M(T) − M(T′))
  •       = ΔC(T) + ΔM(T)
  • Goal: find a set of parameters to drop that
    maximizes ΔJ(T)

33
Matching Rule Pruning (cont.)
  • Dynamic programming:
  • find a parameter p1 that maximizes ΔJ(T)
  • fix p1 and find a second parameter p2 such that
    {p1, p2} maximizes ΔJ(T)
  • repeat until we reach a set of
    parameters {p1, p2, ..., pk} such that ΔJ(T) is less
    than a threshold value
  • Parameter / ΔC(T) / ΔM(T) / ΔJ(T)
  • Edit dist on Name: 0.5882 / -0.0140 / 0.5742
  • Edit dist on Address: 0.2338 / -0.1840 / 0.0498
  • Len. on Name (Wireless): 0.0566 / -0.0610 / -0.0044
  • Len. on Name (Wireline): 0.0541 / -0.0160 / 0.0381
  • Len. on Address (Wireless): 0.0347 / -0.0081 / 0.0266
  • Len. on Address (Wireline): 0.0339 / -0.0070 / 0.0269
  • Single-parameter pruning
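One step of this pruning loop can be sketched with the single-parameter figures from the table above; note that in the full procedure the deltas are recomputed after each drop (as the dual-parameter table on the next slide shows), so this only illustrates the argmax step:

```python
def best_parameter_to_drop(deltas):
    """One pruning step: pick the parameter whose removal maximizes
    dJ = dC + dM.  The full procedure recomputes the deltas after
    each drop before choosing the next parameter."""
    best = max(deltas, key=lambda p: sum(deltas[p]))
    return best, sum(deltas[best])

# Single-parameter (dC, dM) values from the slide.
deltas = {
    "edit_dist_name":      (0.5882, -0.0140),
    "edit_dist_address":   (0.2338, -0.1840),
    "name_len_wireless":   (0.0566, -0.0610),
    "name_len_wireline":   (0.0541, -0.0160),
    "addr_len_wireless":   (0.0347, -0.0081),
    "addr_len_wireline":   (0.0339, -0.0070),
}
param, dJ = best_parameter_to_drop(deltas)
```

The edit distance on Name is the first parameter dropped (ΔJ = 0.5742), matching the table: it is by far the most expensive parameter to compute, yet dropping it barely hurts accuracy.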

34
Matching Rule Pruning (cont.)
  • 2nd Parameter / ΔC(T) / ΔM(T) / ΔJ(T)
  • Len. on Name (Wireline): 0.0541 / -0.0088 / 0.0453
  • Len. on Address (Wireless): 0.0347 / -0.0072 / 0.0275
  • Len. on Address (Wireline): 0.0339 / -0.0091 / 0.0248
  • ...
  • Dual-parameter pruning

35
Matching Rule Pruning (cont.)
  • Reduced from 6 parameters (original tree) to 2
    parameters (final tree)
  • an index on the Name field can be used to evaluate
    the tree

[Final decision tree: the root splits on Address edit
distance at 1.5; pairs with distance < 1.5 are matched,
while the other branch splits on Wireless Name length
at 11.5, leading to the ambiguous and unmatched leaves.]
36
CART Matching Result
  • Prob(predicted as x, given actual class y):
  • x \ y: Matched / Unmatched / Ambiguous
  • Matched: 0.965 / 0.014 / 0.03
  • Unmatched: 0.020 / 0.956 / 0.18
  • Ambiguous: 0.015 / 0.030 / 0.79
  • approximately 10% error rate
  • reduced computation cost for matching by 50%
  • about 47% more correctly matched records than a
    leading commercial product

37
Major Telephone Operating Company
  • Problem
  • Mergers and acquisitions of wireless companies
    resulted in the RBOC's inability to determine
    common customers among its wireline and wireless
    businesses
  • Customer databases for each business unit use
    different schemas and contain many quality
    problems
  • The RBOC's experience with a commercial vendor's
    data reconciliation tool was unsatisfactory
  • Solution
  • Use small manually verified data samples (100
    records) to determine appropriate matching rules
  • Use machine learning to prune the rules for
    efficient analysis of the large dataset
  • Resulted in 30% more correct matches than the
    commercial tool

38
Large Media Conglomerate
  • Problem
  • The company provides magazines to wholesalers who
    in turn provide magazines to retailers for
    distribution
  • Reconciliation of wholesaler and retailer
    databases would make it easier to track where
    gaps in reporting are occurring
  • Identify bad retailers
  • Solution
  • Group by primary keys
  • Match by secondary keys
  • E.g. 3000 C.V.S. pharmacies are grouped and
    compared by zip code and street address to
    identify bad pharmacies
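The group-then-match strategy can be sketched as follows; the record layout and field names are hypothetical, chosen to mirror the pharmacy example:

```python
from collections import defaultdict

def reconcile(wholesaler, retailer):
    """Group retailer records by a primary key (the chain name), then
    match within each group on secondary keys (zip + street address).
    Grouping avoids comparing records across different chains."""
    groups = defaultdict(list)
    for rec in retailer:
        groups[rec["chain"]].append(rec)
    matches = []
    for w in wholesaler:
        for r in groups.get(w["chain"], []):
            if (w["zip"], w["street"]) == (r["zip"], r["street"]):
                matches.append((w["id"], r["id"]))
    return matches

wholesaler = [{"id": "w1", "chain": "CVS", "zip": "07960", "street": "10 Main"},
              {"id": "w2", "chain": "CVS", "zip": "07001", "street": "5 Oak"}]
retailer = [{"id": "r1", "chain": "CVS", "zip": "07960", "street": "10 Main"},
            {"id": "r2", "chain": "CVS", "zip": "08540", "street": "2 Elm"}]
pairs = reconcile(wholesaler, retailer)
```

Wholesaler records with no secondary-key match inside their group (like "w2" above) point to reporting gaps worth investigating.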

39
International Government
  • Problem
  • Reconcile vital taxpayer data from several
    different sources
  • Known problems include record duplication,
    address mismatches, address obsolescence,
    distributed responsibility for database accuracy
    and updates
  • Identify causes for mistakes
  • Solution
  • Improve the process flows and architecture to
    allow for rapid modification of pre-processing
    rules and matching rules
  • Detection and classification of likely causes of
    duplication
  • Analysis and improvements reduced number of
    records that required manual verification

40
ILEC-CLEC Billing Reconciliation
  • Problem
  • ILECs charge CLECs for use of network resources
  • Verification of actual usage vs. charging
  • E.g. a customer changing providers
  • Identify actual usage and send verification to
    ILEC
  • Resource identification in ILEC and CLEC is
    different
  • Solution
  • Check charges in bill against actual usage
  • Common identification of resources (matching
    table)
  • Solution has only been implemented for access
    line charge

41
Future Directions
  • Reduction of manual work
  • Type based standardization, rules
  • Rules and Parameter pruning?
  • Database and tool
  • Rule execution plan
  • E.g. string length before edit distance
  • Sharing/maintenance of index across steps in
    process
  • Extending matching to structures (trees, graphs,
    circuits)
  • Matching across layers in networks (OSI layers)
  • How often to discover/audit network? sampling
    techniques

42
References
  • [Cochinwala01a] M. Cochinwala, V. Kurien, G. Lalk
    and D. Shasha, "Efficient data reconciliation",
    Information Sciences, 137 (1-4), 2001.
  • [Cochinwala01b] M. Cochinwala, S. Dalal, A.
    Elmagarmid and V. Verykios, "Record matching:
    past, present and future", submitted for
    publication.
  • [DeySarkarDe98] D. Dey, S. Sarkar and P. De, "Entity
    matching in heterogeneous databases: a
    distance-based decision model", 31st Hawaii Int.
    Conf. on System Sciences, 1998.
  • [Hernandez96] M. Hernandez, "A generalization of
    band joins and the merge/purge problem", Ph.D.
    thesis, CS Dept., Columbia University, 1996.
  • [Hernandez98] M. Hernandez and S. Stolfo,
    "Real-world data is dirty: data cleansing and the
    merge/purge problem", Data Mining and Knowledge
    Discovery, 2 (1), 1998.
  • [Manber89] U. Manber, Introduction to
    Algorithms, Addison-Wesley, 1989.
  • [MongeElkan97] A. Monge and C. Elkan, "An
    efficient domain-independent algorithm for
    detecting approximately duplicate database
    records", SIGMOD DMKD Workshop, 1997.
  • [Newcombe59] H. Newcombe, J. Kennedy, S. Axford
    and A. James, "Automatic linkage of vital
    records", Science, 130 (3381), 1959.
  • [Newcombe88] H. Newcombe, Handbook of Record
    Linkage, Oxford University Press, 1988.
  • [SmithWaterman81] T. Smith and M. Waterman,
    "Identification of common molecular
    subsequences", J. Mol. Biol., 147, 1981.