Title: Data Quality Assurance in Telecommunications Databases
1. Data Quality Assurance in Telecommunications Databases
C.-M. Chen, M. Cochinwala
{chungmin, munir}_at_research.telcordia.com
Applied Research, Telcordia Technologies, Morristown, NJ 07960
2. Outline
- Operation Support Systems
- Data Quality Issues
- Record Matching: Problem and Techniques
- Example Applications
- References
3. Operation Support Systems
- Telecom carriers and service providers use multiple Operation Support Systems (OSSs) for:
  - network configuration
  - network engineering
  - service provisioning
  - network performance monitoring
  - customer care and billing, etc.
- Each OSS may maintain its own database using a DBMS
4. OSS Databases
- The databases may overlap in the network entities they describe
- The OSSs may use different models/schemas to describe the same entity
5. Data Quality Issues
- Corrupted data: data that do not correctly reflect the properties/specifications of the modeled entities
- Inaccurate data: a discrepancy between the data and the real-world entity they model (e.g., outdated information)
- Inconsistent data:
  - records in different OSS databases referencing the same real-world entity do not agree on the attribute values of the entity
  - an entity is represented by more than one record in a single OSS database
6. Example: DSL Deployment
- Corrupted data: many of the Central Offices (COs) do not have correct location information (longitude, latitude)
- Inaccurate data: the database says a port or bandwidth is available at the CO when in fact it is not (or vice versa)
- Inconsistent data: the Service Provisioning database says a line is available while the Network Configuration database says otherwise (which one to believe?)
7. Why is data quality an issue?
- corrupted data disrupt/delay business deployment
- inaccurate data impede decision making
- inconsistent data degrade operational efficiency
- bottom line:
  - increased cost
  - decreased revenue
  - customer dissatisfaction
8. Cost of Data Quality Assurance
- To date, the practice of data quality assurance is mostly manual and labor-intensive
  - in the mid-1990s it cost a typical telephone company $2 per line per year to keep the network database reasonably accurate
- Now imagine the scale and complexity of the networks brought by IP and wireless technologies, and their effect on the cost of data quality
- ⇒ (Semi-)automatic tools are needed
9. Data Quality Assurance Techniques
- Data correctness and accuracy
  - database validation and correction via autodiscovery tools
- Data consistency (reconciliation)
  - record matching (record linkage): identify records that correspond to the same real-world entity
10. Data Accuracy
- Goal: assess and correct the discrepancy between data in the database and the modeled entities in the real world
- Autodiscovery tools
  - automatically probe network elements for configuration parameters
  - reconstruct the network connection topology at different layers (from the physical to the application layer)
- How can the data stored in the databases be efficiently validated against the auto-discovered information?
  - sampling
  - graph algorithms
11. Record Matching Techniques [Cochinwala01b]
- Problem Definition
- Record Matching Phases
- Quality Metrics
- Searching Techniques (to reduce search space)
- Matching Decision Rules (to determine matched
pairs)
12. The Record Matching Problem
- Goal: identify records in the same or different databases that correspond to the same real-world entity
- Two records that correspond to the same entity are called a matched pair
- In a relational table, no two records can hold identical values on all attributes ⇒ a matched pair consists of two different records that refer to the same entity
- The same entity may be represented differently in different tables/databases, with no agreement on key value(s)
13. Three Phases of Reconciliation
- Data Preparation: scrubbing and cleansing
- Searching: reducing the search space
- Matching: finding matched pairs
14. Data Preparation
- Parsing
- Data transformation
- Standardization
15. Searching Problem
- Consider reconciling tables A and B
  - search space P = A × B (Cartesian product)
  - M = matched pairs
  - U = unmatched pairs
  - P = M ∪ U
- Problem: searching P requires |A| × |B| comparisons
- Goal: reduce the search space to a smaller P′ ⊂ P, such that M ⊆ P′
16. Searching Techniques
- Heuristic: potentially matched records should fall into the same cluster
- Relational join operators: nested-loop join, merge-sort join, hash join, band join [Hernadez96]
- Blocking: Soundex code, NYSIIS [Newcombe88]
- Windowing (sorted neighborhood) [Hernadez98]
- Priority queue [MongeElkan97]
- Multi-pass merge-sort joins [Hernadez98]
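The windowing (sorted-neighborhood) technique above can be sketched in a few lines of Python. This is a generic illustration, not the implementation from [Hernadez98]; the record layout, key function, and window size are assumptions for the example:

```python
from typing import Callable, Dict, List, Tuple

def sorted_neighborhood(records: List[dict],
                        key: Callable[[dict], str],
                        window: int = 3) -> List[Tuple[dict, dict]]:
    """Sorted-neighborhood search: sort records on a blocking key and
    emit only candidate pairs that fall within a sliding window of the
    given size, reducing the O(|A| * |B|) search space to O(n * window)."""
    ordered = sorted(records, key=key)
    pairs = []
    for i in range(len(ordered)):
        # each record is compared only with the next (window - 1) records
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((ordered[i], ordered[j]))
    return pairs
```

Sorting on a well-chosen key (e.g., a Soundex code or name prefix) makes it likely that true matches land in the same window, so few matched pairs are lost despite the much smaller candidate set.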
17. String Matching Techniques
- Edit distance: counts of insert, delete, and replace operations [Manber89]
- Smith-Waterman: dynamic programming to find the minimum edit distance [SmithWaterman81]
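The edit-distance measure above can be computed with the standard dynamic-programming recurrence; this generic sketch (the function name and rolling-row layout are ours, not from the slides) counts the minimum number of insert, delete, and replace operations:

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein edit distance: minimum number of insert, delete,
    and replace operations needed to turn s into t."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between s[:i-1] and t[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # delete s[i-1]
                         cur[j - 1] + 1,     # insert t[j-1]
                         prev[j - 1] + cost) # replace (or keep)
        prev = cur
    return prev[n]
```

For example, edit_distance("kitten", "sitting") is 3 (two replacements and one insertion).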
18. Record Matching Techniques
- Given a (reduced) space of pairs, how do we determine whether a pair (x, y) is a match, i.e., whether records x and y refer to the same real-world entity?
- Probabilistic approaches [Newcombe59]
  - Bayes test for minimum error
  - Bayes test for minimum cost
- Non-probabilistic approaches
  - supervised learning (to generate decision rules) [Cochinwala01a]
  - equational theory model [Hernadez98]
  - distance-based techniques [DeySarkarDe98]
19. Bayes Test for Minimum Error
- Cartesian product P = A × B = M ∪ U
- For each record pair (a1, a2, ..., an, b1, b2, ..., bn) ∈ P, define x = (x1, x2, ..., xn), where
  - xi = 1 if ai = bi
  - xi = 0 if ai ≠ bi
- A priori probabilities and conditional density functions:
  - Class M: πM, p(x|M)
  - Class U: πU, p(x|U)
- Unconditional density function:
  - p(x) = πM p(x|M) + πU p(x|U)
20. Bayes Test for Minimum Error (cont.)
- Assume πM, πU, p(x|M), and p(x|U) are known
- Bayes' theorem gives the a posteriori probability:
  - p(M|x) = πM p(x|M) / (πM p(x|M) + πU p(x|U))
- Likelihood ratio decision rule: x is decided to belong to M iff
  - p(M|x) ≥ p(U|x) iff
  - πM p(x|M) ≥ πU p(x|U) iff
  - ℓ(x) = p(x|M) / p(x|U) ≥ πU / πM
- The test gives the minimum probability of error (misclassification)
21. Bayes Test for Minimum Cost
- cij = cost of a class-j record being (mis)classified to class i
- Conditional cost of x being classified to class M given x:
  - cM(x) = cMM p(M|x) + cMU p(U|x)
- Similarly,
  - cU(x) = cUM p(M|x) + cUU p(U|x)
- Likelihood ratio decision rule: x is decided to belong to M iff
  - cM(x) ≤ cU(x) iff
  - ℓ(x) = p(x|M) / p(x|U) ≥ (cMU − cUU) πU / ((cUM − cMM) πM)
- The test gives the minimum cost
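The cost-weighted rule differs from the minimum-error rule only in the threshold. The sketch below reuses the same independent-Bernoulli density model as before; the default cost values are illustrative assumptions (cUM > cMU encodes that missing a true match is costlier than accepting a false one):

```python
from math import prod

def min_cost_match(x, p_match, p_unmatch, pi_m, pi_u,
                   c_mm=0.0, c_uu=0.0, c_mu=1.0, c_um=5.0):
    """Minimum-cost Bayes test: decide 'match' iff
    l(x) = p(x|M)/p(x|U) >= (c_MU - c_UU) * pi_U / ((c_UM - c_MM) * pi_M).
    c_ij is the cost of classifying a class-j pair to class i; the
    default values are illustrative, not taken from the slides."""
    p_xm = prod((p if xi else 1 - p) for xi, p in zip(x, p_match))
    p_xu = prod((p if xi else 1 - p) for xi, p in zip(x, p_unmatch))
    threshold = (c_mu - c_uu) * pi_u / ((c_um - c_mm) * pi_m)
    return p_xm / p_xu >= threshold
```

Raising cUM lowers the threshold, so borderline pairs that the minimum-error test would reject can be accepted when missed matches are expensive.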
22. Supervised Learning
- Take a small sample S from A × B
- For every pair s ∈ S, label it as M (matched) or U (unmatched)
- Select a predictive model, along with associated decision rules, that best discriminates between classes M and U on S
- Apply the selected model to classify all pairs in A × B
23. Arroyo: A Data Reconciliation Tool [Cochinwala01a]
- Matches customer records across two databases
  - wireline database: 860,000 records
  - wireless database: 1,300,000 records
- Methodology
  - pre-processing (application dependent)
  - parameter space definition (application dependent)
  - matching rule generation and pruning (learning, model selection)
  - matching
24. Preprocessing (Application Dependent)
- Elimination of stop-words
  - blanks, special characters, ...
  - "Chung-Min Chen" becomes "ChungMin Chen"
- Word re-ordering
  - "20 PO BOX" becomes "PO BOX 20"
- Word substitution
  - "St." becomes "Street"
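The three preprocessing steps above can be sketched as one normalization pass. The substitution table and re-ordering rule below are tiny illustrative stand-ins; a real deployment would carry much larger, application-specific rule sets:

```python
import re

# Illustrative substitution table (assumption; real tables are far larger)
SUBSTITUTIONS = {"St.": "Street", "Ave.": "Avenue"}

def preprocess(field: str) -> str:
    """Normalize a name/address field: word substitution, word
    re-ordering for '<number> PO BOX' patterns, then removal of
    special characters such as '-' inside words."""
    words = [SUBSTITUTIONS.get(w, w) for w in field.split()]
    # word re-ordering: "20 PO BOX" -> "PO BOX 20"
    if len(words) >= 3 and words[0].isdigit() and words[1:3] == ["PO", "BOX"]:
        words = words[1:3] + [words[0]] + words[3:]
    # drop special characters, keeping letters and digits only
    cleaned = [re.sub(r"[^A-Za-z0-9]", "", w) for w in words]
    return " ".join(w for w in cleaned if w)
```

Substitution runs before character stripping so that "St." is expanded to "Street" rather than being mangled to "St" first.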
25. Parameter Space Definition (Application Dependent)
- Six parameters are used:
- edit distance between the Name fields
- edit distance between the Address fields
- length of Name field in Wireline
- length of Name field in Wireless
- length of Address field in Wireline
- length of Address field in Wireless
26. Matching Rule Generation
- 1. Learning Set Generation
  - select records that contain the word "Falcon" in the Name field
    - Wireless: 241 records
    - Wireline: 883 records
  - choose the sample first from the database that is the image of the higher-degree onto-mapping; for example, Wireline → Wireless has a higher onto degree than Wireless → Wireline
27. Matching Rule Generation
- 2. Learning Set Labeling
  - identify prospective matches
    - match if the edit distances on the Name and Address fields are both less than 3
    - (30 false matches, but no misses of true matches)
  - verdict assignment
    - each pair that is not a prospective match is assigned "unmatch"
    - each prospective-match pair is manually examined and labeled as match, unmatch, or ambiguous (labor intensive)
28. Matching Rule Generation
- 3. Model Selection
  - Input: 241 × 883 = 212,803 pairs, each with 7 fields (parameters):
    - label (verdict)
    - edit distance between Name fields; edit distance between Address fields
    - length of Name field in Wireless; length of Name field in Wireline
    - length of Address field in Wireless; length of Address field in Wireline
  - Three models were tried, with cross-validation (half the sample to build the model, half to estimate the error rate)

  Model                         Error rate (avg of 50 runs)
  CART                          5.1%
  Linear Discriminant Analysis  5.3%
  Vector Quantization           9.4%
29. Matching Rule Generation
- 3. Model Selection (cont.)

  Model                         Error rate (avg of 50 runs)
  CART                          5.1%
  Linear Discriminant Analysis  5.3%
  Vector Quantization           9.4%

- LDA can yield rules that are inefficient for database evaluation, e.g.
  - 2 × (Address edit distance) + 1.3 × (Name length on Wireless) < 3 ⇒ match
  - an index on Name is useless for such a rule
- CART rules can be evaluated efficiently using existing database indices
30. Matching Rule Pruning
- Determines which parameters (rules) to drop to reduce tree complexity while retaining tree quality
  - maximize the delta of a cost function
  - dynamic programming with a threshold
- Complexity of tree T:
  - C(T) = Σ_{p∈T} C(p) / Σ_{q∈S} C(q), where S is the full set of parameters
  - C(p) = complexity to compute the value of parameter p
  - e.g., C(edit distance between fields A and B) = avg_length(A) × avg_length(B)
  - Note: 0 < C(T) ≤ 1
31. Matching Rule Pruning (cont.)
- C(edit distance between fields A and B) = avg_length(A) × avg_length(B)

  Parameter                   Avg. length    Complexity
  Address length (wireless)   6.88           6.88
  Address length (wireline)   6.72           6.72
  Name length (wireless)      10.87          10.87
  Name length (wireline)      10.70          10.70
  Edit distance on Addresses  6.88 × 6.72    46.23
  Edit distance on Names      10.87 × 10.70  116.31
32. Matching Rule Pruning (cont.)
- Misclassification rate of T:
  - M(T) = (# misclassified pairs) / (# pairs in the test set)
  - Note: 0 ≤ M(T) ≤ 1
- Cost of tree T:
  - J(T) = w1 C(T) + w2 M(T), where w1 and w2 are weights (we used w1 = w2 = 1)
- Let T′ = T − (all rules involving parameter p), and define
  - ΔJ(T′) = J(T) − J(T′)
           = (C(T) − C(T′)) + (M(T) − M(T′))
           = ΔC(T′) + ΔM(T′)
- Goal: find a set of parameters to drop that maximizes ΔJ(T′)
33. Matching Rule Pruning (cont.)
- Dynamic programming:
  - find a parameter p1 that maximizes ΔJ(T′)
  - fix p1 and find a second parameter p2 such that {p1, p2} maximizes ΔJ(T′)
  - repeat this step until we reach a set of parameters {p1, p2, ..., pk} such that ΔJ(T′) is less than a threshold value
- Single-parameter pruning:

  Parameter                   ΔC(T′)    ΔM(T′)     ΔJ(T′)
  Edit dist. on Name          0.5882    −0.0140     0.5742
  Edit dist. on Address       0.2338    −0.1840     0.0498
  Len. of Name (Wireless)     0.0566    −0.0610    −0.0044
  Len. of Name (Wireline)     0.0541    −0.0160     0.0381
  Len. of Address (Wireless)  0.0347    −0.0081     0.0266
  Len. of Address (Wireline)  0.0339    −0.0070     0.0269
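The selection loop above can be sketched as a greedy pass over per-parameter deltas. This toy version (our own simplification, not the Arroyo code) treats the deltas as fixed, whereas the real procedure re-evaluates ΔC and ΔM against the pruned tree each round; the sample values come from the single-parameter table above:

```python
def greedy_prune(delta_c, delta_m, w1=1.0, w2=1.0, threshold=0.05):
    """Greedily drop the parameter whose removal maximizes
    Delta-J = w1*Delta-C + w2*Delta-M, stopping once the best
    remaining gain falls below the threshold. delta_c / delta_m
    map each parameter to its complexity / misclassification delta."""
    kept = set(delta_c)
    dropped = []
    while kept:
        best = max(kept, key=lambda p: w1 * delta_c[p] + w2 * delta_m[p])
        if w1 * delta_c[best] + w2 * delta_m[best] < threshold:
            break
        dropped.append(best)
        kept.remove(best)
    return dropped, kept
```

The threshold trades complexity against accuracy: a lower threshold prunes more aggressively, accepting a larger rise in misclassification for a cheaper tree.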
34. Matching Rule Pruning (cont.)
- Dual-parameter pruning:

  2nd Parameter               ΔC(T′)    ΔM(T′)     ΔJ(T′)
  Len. of Name (Wireline)     0.0541    −0.0088     0.0453
  Len. of Address (Wireless)  0.0347    −0.0072     0.0275
  Len. of Address (Wireline)  0.0339    −0.0091     0.0248
  ...
35. Matching Rule Pruning (cont.)
- Reduced from 6 parameters (original tree) to 2 parameters (final tree)
- An index on the Name field can be used to evaluate the tree
- Final tree:

  root
  ├─ Address edit distance < 1.5  →  matched
  └─ Address edit distance > 1.5
      ├─ Wireless Name length < 11.5  →  ambiguous
      └─ Wireless Name length > 11.5  →  unmatched
36. CART Matching Result
- Confusion matrix: entry (x, y) = Prob(predicted as x | actual class y)

  x \ y      Matched   Unmatched   Ambiguous
  Matched    0.965     0.014       0.03
  Unmatched  0.020     0.956       0.18
  Ambiguous  0.015     0.030       0.79

- Approximately 10% error rate
- Reduced the computation cost of matching by 50%
- About 47% more correctly matched records than a leading commercial product
37. Major Telephone Operating Company
- Problem
  - mergers and acquisitions of wireless companies resulted in the RBOC's inability to determine common customers among its wireline and wireless businesses
  - customer databases for each business unit use different schemas and contain many quality problems
  - the RBOC's experience with a commercial vendor's data reconciliation tool was unsatisfactory
- Solution
  - use small, manually verified data samples (100 records) to determine appropriate matching rules
  - use machine learning to prune the rules for efficient analysis of the large dataset
  - resulted in 30% more correct matches than the commercial tool
38. Large Media Conglomerate
- Problem
  - the company provides magazines to wholesalers, who in turn provide magazines to retailers for distribution
  - reconciliation of wholesaler and retailer databases would make it easier to track where gaps in reporting occur
  - identify bad retailers
- Solution
  - group by primary keys
  - match by secondary keys
  - e.g., 3,000 C.V.S. Pharmacies are grouped and compared by zip code and street address to identify bad pharmacies
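The group-by-primary-key / match-by-secondary-key scheme above can be sketched as follows; the record fields and the store data in the usage example are illustrative assumptions, not the conglomerate's actual schema:

```python
from collections import defaultdict
from itertools import combinations

def duplicates_by_group(records, primary_key, secondary_keys):
    """Group records on the primary key (e.g. retailer chain name),
    then compare only within each group: pairs that agree on all
    secondary keys (e.g. zip code + street address) are flagged as
    likely duplicates. Avoids a full cross-product comparison."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[primary_key]].append(rec)
    dupes = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            if all(a[k] == b[k] for k in secondary_keys):
                dupes.append((a, b))
    return dupes
```

Because comparisons happen only inside a group, the cost scales with the squared group sizes rather than the squared table size.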
39. International Government
- Problem
  - reconcile vital taxpayer data from several different sources
  - known problems include record duplication, address mismatches, address obsolescence, and distributed responsibility for database accuracy and updates
  - identify causes of mistakes
- Solution
  - improve the process flows and architecture to allow rapid modification of pre-processing rules and matching rules
  - detect and classify likely causes of duplication
  - analysis and improvements reduced the number of records that required manual verification
40. ILEC-CLEC Billing Reconciliation
- Problem
  - ILECs charge CLECs for use of network resources
  - verification of actual usage vs. charges (e.g., a customer changing providers)
  - identify actual usage and send verification to the ILEC
  - resource identification differs between the ILEC and the CLEC
- Solution
  - check charges in the bill against actual usage
  - establish a common identification of resources (matching table)
  - the solution has so far been implemented only for access-line charges
41. Future Directions
- Reduction of manual work
  - type-based standardization rules
  - rule and parameter pruning?
- Database and tool
  - rule execution plans (e.g., compute string length before edit distance)
  - sharing/maintenance of indices across steps in the process
- Extending matching to structures (trees, graphs, circuits)
- Matching across layers in networks (OSI layers)
- How often to discover/audit the network? Sampling techniques
42. References
- [Cochinwala01a] M. Cochinwala, V. Kurien, G. Lalk and D. Shasha, "Efficient data reconciliation," Information Sciences, 137(1-4), 2001.
- [Cochinwala01b] M. Cochinwala, S. Dalal, A. Elmagarmid and V. Verykios, "Record matching: past, present and future," submitted for publication.
- [DeySarkarDe98] D. Dey, S. Sarkar and P. De, "Entity matching in heterogeneous databases: a distance-based decision model," 31st Hawaii Int. Conf. on System Sciences, 1998.
- [Hernadez96] M. Hernadez, "A generalization of band joins and the merge/purge problem," Ph.D. thesis, CS Dept., Columbia University, 1996.
- [Hernadez98] M. Hernadez and S. Stolfo, "Real-world data is dirty: data cleansing and the merge/purge problem," Data Mining and Knowledge Discovery, 2(1), 1998.
- [Manber89] U. Manber, Introduction to Algorithms: A Creative Approach, Addison-Wesley, 1989.
- [MongeElkan97] A. Monge and C. Elkan, "An efficient domain-independent algorithm for detecting approximately duplicate database records," SIGMOD DMKD Workshop, 1997.
- [Newcombe59] H. Newcombe, J. Kennedy, S. Axford and A. James, "Automatic linkage of vital records," Science, 130(3381), 1959.
- [Newcombe88] H. Newcombe, Handbook of Record Linkage, Oxford University Press, 1988.
- [SmithWaterman81] T. Smith and M. Waterman, "Identification of common molecular subsequences," J. Mol. Biol., 147, 1981.