Title: Data Quality Assurance in Telecommunications Databases
1. Data Quality Assurance in Telecommunications Databases
C.-M. Chen, M. Cochinwala
{chungmin, munir}_at_research.telcordia.com
Applied Research, Telcordia Technologies, Morristown, NJ 07960
2. Outline
- Operation Support Systems
- Data Quality Issues
- Record Matching: Problem and Techniques
- Example Applications
- References
3. Operation Support Systems
- Telecom carriers and service providers use multiple Operation Support Systems (OSSs) for:
  - network configuration
  - network engineering
  - service provisioning
  - network performance monitoring
  - customer care and billing, etc.
- Each OSS may maintain its own database using a DBMS
4. OSS Databases
- The databases may overlap in the network entities they describe
- The OSSs may use different models/schemas to describe the same entity
5. Data Quality Issues
- Corrupted data: data that do not correctly reflect the properties/specifications of the modeled entities
- Inaccurate data: a discrepancy between the data and the real-world entity they model (e.g., outdated information)
- Inconsistent data:
  - records in different OSS databases referencing the same real-world entity do not agree on the attribute values of the entity
  - an entity is represented by more than one record in a single OSS database
6. Example: DSL Deployment
- Corrupted data: many of the Central Offices (COs) do not have correct location information (longitude, latitude)
- Inaccurate data: the database says a port or bandwidth is available at the CO when in fact it is not (or vice versa)
- Inconsistent data: the Service Provisioning database says a line is available while the Network Configuration database says otherwise (which one to believe?)
7. Why is data quality an issue?
- corrupted data disrupt/delay business deployment
- inaccurate data impede decision making
- inconsistent data degrade operational efficiency
- bottom line:
  - increased cost
  - decreased revenue
  - customer dissatisfaction
8. Cost of Data Quality Assurance
- To date, the practice of data quality assurance is mostly manual and labor-intensive
  - in the mid-1990s it cost a typical telephone company $2 per line per year to keep the network database reasonably accurate
- Now imagine the scale and complexity of the networks brought by IP and wireless technologies, and their effect on the cost of data quality
- ⇒ (Semi-)automatic tools are needed
9. Data Quality Assurance Techniques
- Data correctness and accuracy
  - database validation and correction via autodiscovery tools
- Data consistency (reconciliation)
  - record matching (record linkage): identify records that correspond to the same real-world entity
10. Data Accuracy
- Goal: assess and correct the discrepancy between data in the database and the modeled entities in the real world
- Autodiscovery tools
  - automatically probe network elements for configuration parameters
  - reconstruct the network connection topology at different layers (from the physical to the application layer)
- How can the data stored in the databases be efficiently validated against the auto-discovered information?
  - sampling
  - graph algorithms
11. Record Matching Techniques [Cochinwala01b]
- Problem Definition
- Record Matching Phases
- Quality Metrics
- Searching Techniques (to reduce search space)
- Matching Decision Rules (to determine matched
pairs)
12. The Record Matching Problem
- Goal: identify records in the same or different databases that correspond to the same real-world entity
- Two records that correspond to the same entity are called a matched pair
- In a relational table, no two records can hold identical values on all attributes ⇒ a matched pair consists of two different records that refer to the same entity
- The same entity may be represented differently in different tables/databases, with no agreement on key value(s)
13. Three Phases of Reconciliation
- Data Preparation: scrubbing and cleansing
- Searching: reducing the search space
- Matching: finding matched pairs
14. Data Preparation
- Parsing
- Data transformation
- Standardization
15. Searching Problem
- Consider reconciling tables A and B
  - search space P = A × B (Cartesian product)
  - M = matched pairs
  - U = unmatched pairs
  - P = M ∪ U
- Problem: searching P requires |A| × |B| comparisons
- Goal: reduce the search space to a smaller P′ ⊂ P, such that M ⊆ P′
16. Searching Techniques
- Heuristic: potentially matched records should fall into the same cluster
- Relational join operators: nested-loop join, merge-sort join, hash join, band join [Hernadez96]
- Blocking: Soundex code, NYSIIS [Newcombe88]
- Windowing (sorted neighborhood) [Hernadez98]
- Priority queue [MongeElkan97]
- Multi-pass merge-sort joins [Hernadez98]
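The windowing (sorted-neighborhood) technique above can be sketched in a few lines of Python. This is a generic illustration, not the implementation from [Hernadez98]; the record layout, key function, and window size are assumptions for the example:

```python
from typing import Callable, Dict, List, Tuple

def sorted_neighborhood(records: List[dict],
                        key: Callable[[dict], str],
                        window: int = 3) -> List[Tuple[dict, dict]]:
    """Sorted-neighborhood search: sort records on a blocking key and
    emit only candidate pairs that fall within a sliding window of the
    given size, reducing the O(|A| * |B|) search space to O(n * window)."""
    ordered = sorted(records, key=key)
    pairs = []
    for i in range(len(ordered)):
        # each record is compared only with the next (window - 1) records
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((ordered[i], ordered[j]))
    return pairs
```

Sorting on a well-chosen key (e.g., a Soundex code or name prefix) makes it likely that true matches land in the same window, so few matched pairs are lost despite the much smaller candidate set.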
17. String Matching Techniques
- Edit distance: counts of insert, delete, and replace operations [Manber89]
- Smith-Waterman: dynamic programming to find the minimum edit distance [SmithWaterman81]
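The edit-distance measure above can be computed with the standard dynamic-programming recurrence; this generic sketch (the function name and rolling-row layout are ours, not from the slides) counts the minimum number of insert, delete, and replace operations:

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein edit distance: minimum number of insert, delete,
    and replace operations needed to turn s into t."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between s[:i-1] and t[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # delete s[i-1]
                         cur[j - 1] + 1,     # insert t[j-1]
                         prev[j - 1] + cost) # replace (or keep)
        prev = cur
    return prev[n]
```

For example, edit_distance("kitten", "sitting") is 3 (two replacements and one insertion).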
18. Record Matching Techniques
- Given a (reduced) space of pairs, how do we determine whether a pair (x, y) is a match, i.e., whether records x and y refer to the same real-world entity?
- Probabilistic approaches [Newcombe59]
  - Bayes test for minimum error
  - Bayes test for minimum cost
- Non-probabilistic approaches
  - supervised learning (to generate decision rules) [Cochinwala01a]
  - equational theory model [Hernadez98]
  - distance-based techniques [DeySarkarDe98]
19. Bayes Test for Minimum Error
- Cartesian product P = A × B = M ∪ U
- For each record pair (a1, a2, ..., an, b1, b2, ..., bn) ∈ P, define x = (x1, x2, ..., xn), where
  - xi = 1 if ai = bi
  - xi = 0 if ai ≠ bi
- A priori probabilities and conditional density functions:
  - Class M: πM, p(x|M)
  - Class U: πU, p(x|U)
- Unconditional density function:
  - p(x) = πM p(x|M) + πU p(x|U)
20. Bayes Test for Minimum Error (cont.)
- Assume πM, πU, p(x|M), and p(x|U) are known
- Bayes' theorem gives the a posteriori probability:
  - p(M|x) = πM p(x|M) / (πM p(x|M) + πU p(x|U))
- Likelihood ratio decision rule: x is decided to belong to M iff
  - p(M|x) ≥ p(U|x) iff
  - πM p(x|M) ≥ πU p(x|U) iff
  - ℓ(x) = p(x|M) / p(x|U) ≥ πU / πM
- The test gives the minimum probability of error (misclassification)
21. Bayes Test for Minimum Cost
- cij = cost of a class-j record being (mis)classified to class i
- Conditional cost of x being classified to class M given x:
  - cM(x) = cMM p(M|x) + cMU p(U|x)
- Similarly,
  - cU(x) = cUM p(M|x) + cUU p(U|x)
- Likelihood ratio decision rule: x is decided to belong to M iff
  - cM(x) ≤ cU(x) iff
  - ℓ(x) = p(x|M) / p(x|U) ≥ (cMU − cUU) πU / ((cUM − cMM) πM)
- The test gives the minimum cost
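The cost-weighted rule differs from the minimum-error rule only in the threshold. The sketch below reuses the same independent-Bernoulli density model as before; the default cost values are illustrative assumptions (cUM > cMU encodes that missing a true match is costlier than accepting a false one):

```python
from math import prod

def min_cost_match(x, p_match, p_unmatch, pi_m, pi_u,
                   c_mm=0.0, c_uu=0.0, c_mu=1.0, c_um=5.0):
    """Minimum-cost Bayes test: decide 'match' iff
    l(x) = p(x|M)/p(x|U) >= (c_MU - c_UU) * pi_U / ((c_UM - c_MM) * pi_M).
    c_ij is the cost of classifying a class-j pair to class i; the
    default values are illustrative, not taken from the slides."""
    p_xm = prod((p if xi else 1 - p) for xi, p in zip(x, p_match))
    p_xu = prod((p if xi else 1 - p) for xi, p in zip(x, p_unmatch))
    threshold = (c_mu - c_uu) * pi_u / ((c_um - c_mm) * pi_m)
    return p_xm / p_xu >= threshold
```

Raising cUM lowers the threshold, so borderline pairs that the minimum-error test would reject can be accepted when missed matches are expensive.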
22. Supervised Learning
- Take a small sample S from A × B
- For every pair s ∈ S, label it as M (matched) or U (unmatched)
- Select a predictive model, along with associated decision rules, that best discriminates between classes M and U on S
- Apply the selected model to classify all pairs in A × B
23. Arroyo: A Data Reconciliation Tool [Cochinwala01a]
- Matches customer records across two databases
  - wireline database: 860,000 records
  - wireless database: 1,300,000 records
- Methodology
  - pre-processing (application dependent)
  - parameter space definition (application dependent)
  - matching rule generation and pruning (learning, model selection)
  - matching
24. Preprocessing (Application Dependent)
- Elimination of stop-words
  - blanks, special characters, ...
  - "Chung-Min Chen" becomes "ChungMin Chen"
- Word re-ordering
  - "20 PO BOX" becomes "PO BOX 20"
- Word substitution
  - "St." becomes "Street"
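The three preprocessing steps above can be sketched as one normalization pass. The substitution table and re-ordering rule below are tiny illustrative stand-ins; a real deployment would carry much larger, application-specific rule sets:

```python
import re

# Illustrative substitution table (assumption; real tables are far larger)
SUBSTITUTIONS = {"St.": "Street", "Ave.": "Avenue"}

def preprocess(field: str) -> str:
    """Normalize a name/address field: word substitution, word
    re-ordering for '<number> PO BOX' patterns, then removal of
    special characters such as '-' inside words."""
    words = [SUBSTITUTIONS.get(w, w) for w in field.split()]
    # word re-ordering: "20 PO BOX" -> "PO BOX 20"
    if len(words) >= 3 and words[0].isdigit() and words[1:3] == ["PO", "BOX"]:
        words = words[1:3] + [words[0]] + words[3:]
    # drop special characters, keeping letters and digits only
    cleaned = [re.sub(r"[^A-Za-z0-9]", "", w) for w in words]
    return " ".join(w for w in cleaned if w)
```

Substitution runs before character stripping so that "St." is expanded to "Street" rather than being mangled to "St" first.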
25. Parameter Space Definition (Application Dependent)
- Six parameters are used:
- edit distance between the Name fields
- edit distance between the Address fields
- length of Name field in Wireline
- length of Name field in Wireless
- length of Address field in Wireline
- length of Address field in Wireless
26. Matching Rule Generation
- 1. Learning Set Generation
  - select records that contain the word "Falcon" in the Name field
    - Wireless: 241 records
    - Wireline: 883 records
  - choose the sample first from the database that is the image of the higher-degree onto-mapping; for example, Wireline → Wireless has a higher onto degree than Wireless → Wireline
27. Matching Rule Generation
- 2. Learning Set Labeling
  - identify prospective matches
    - match if the edit distances on the Name and Address fields are both less than 3
    - (30 false matches, but no misses of true matches)
  - verdict assignment
    - each pair that is not a prospective match is assigned "unmatch"
    - each prospective-match pair is manually examined and labeled as match, unmatch, or ambiguous (labor intensive)
28. Matching Rule Generation
- 3. Model Selection
  - Input: 241 × 883 = 212,803 pairs, each with 7 fields (parameters):
    - label (verdict)
    - edit distance between Name fields; edit distance between Address fields
    - length of Name field in Wireless; length of Name field in Wireline
    - length of Address field in Wireless; length of Address field in Wireline
  - Three models were tried, with cross-validation (half the sample to build the model, half to estimate the error rate)

  Model                         Error rate (avg of 50 runs)
  CART                          5.1%
  Linear Discriminant Analysis  5.3%
  Vector Quantization           9.4%
29. Matching Rule Generation
- 3. Model Selection (cont.)

  Model                         Error rate (avg of 50 runs)
  CART                          5.1%
  Linear Discriminant Analysis  5.3%
  Vector Quantization           9.4%

- LDA can yield rules that are inefficient for database evaluation, e.g.
  - 2 × (Address edit distance) + 1.3 × (Name length on Wireless) < 3 ⇒ match
  - an index on Name is useless for such a rule
- CART rules can be evaluated efficiently using existing database indices
30. Matching Rule Pruning
- Determines which parameters (rules) to drop to reduce tree complexity while retaining tree quality
  - maximize the delta of a cost function
  - dynamic programming with a threshold
- Complexity of tree T:
  - C(T) = Σ_{p∈T} C(p) / Σ_{q∈S} C(q), where S is the full set of parameters
  - C(p) = complexity to compute the value of parameter p
  - e.g., C(edit distance between fields A and B) = avg_length(A) × avg_length(B)
  - Note: 0 < C(T) ≤ 1
31. Matching Rule Pruning (cont.)
- C(edit distance between fields A and B) = avg_length(A) × avg_length(B)

  Parameter                   Avg. length    Complexity
  Address length (wireless)   6.88           6.88
  Address length (wireline)   6.72           6.72
  Name length (wireless)      10.87          10.87
  Name length (wireline)      10.70          10.70
  Edit distance on Addresses  6.88 × 6.72    46.23
  Edit distance on Names      10.87 × 10.70  116.31
32. Matching Rule Pruning (cont.)
- Misclassification rate of T:
  - M(T) = (# misclassified pairs) / (# pairs in the test set)
  - Note: 0 ≤ M(T) ≤ 1
- Cost of tree T:
  - J(T) = w1 C(T) + w2 M(T), where w1 and w2 are weights (we used w1 = w2 = 1)
- Let T′ = T − (all rules involving parameter p), and define
  - ΔJ(T′) = J(T) − J(T′)
           = (C(T) − C(T′)) + (M(T) − M(T′))
           = ΔC(T′) + ΔM(T′)
- Goal: find a set of parameters to drop that maximizes ΔJ(T′)
33. Matching Rule Pruning (cont.)
- Dynamic programming:
  - find a parameter p1 that maximizes ΔJ(T′)
  - fix p1 and find a second parameter p2 such that {p1, p2} maximizes ΔJ(T′)
  - repeat this step until we reach a set of parameters {p1, p2, ..., pk} such that ΔJ(T′) is less than a threshold value
- Single-parameter pruning:

  Parameter                   ΔC(T′)    ΔM(T′)     ΔJ(T′)
  Edit dist. on Name          0.5882    −0.0140     0.5742
  Edit dist. on Address       0.2338    −0.1840     0.0498
  Len. of Name (Wireless)     0.0566    −0.0610    −0.0044
  Len. of Name (Wireline)     0.0541    −0.0160     0.0381
  Len. of Address (Wireless)  0.0347    −0.0081     0.0266
  Len. of Address (Wireline)  0.0339    −0.0070     0.0269
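The selection loop above can be sketched as a greedy pass over per-parameter deltas. This toy version (our own simplification, not the Arroyo code) treats the deltas as fixed, whereas the real procedure re-evaluates ΔC and ΔM against the pruned tree each round; the sample values come from the single-parameter table above:

```python
def greedy_prune(delta_c, delta_m, w1=1.0, w2=1.0, threshold=0.05):
    """Greedily drop the parameter whose removal maximizes
    Delta-J = w1*Delta-C + w2*Delta-M, stopping once the best
    remaining gain falls below the threshold. delta_c / delta_m
    map each parameter to its complexity / misclassification delta."""
    kept = set(delta_c)
    dropped = []
    while kept:
        best = max(kept, key=lambda p: w1 * delta_c[p] + w2 * delta_m[p])
        if w1 * delta_c[best] + w2 * delta_m[best] < threshold:
            break
        dropped.append(best)
        kept.remove(best)
    return dropped, kept
```

The threshold trades complexity against accuracy: a lower threshold prunes more aggressively, accepting a larger rise in misclassification for a cheaper tree.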
34. Matching Rule Pruning (cont.)
- Dual-parameter pruning:

  2nd Parameter               ΔC(T′)    ΔM(T′)     ΔJ(T′)
  Len. of Name (Wireline)     0.0541    −0.0088     0.0453
  Len. of Address (Wireless)  0.0347    −0.0072     0.0275
  Len. of Address (Wireline)  0.0339    −0.0091     0.0248
  ...
35. Matching Rule Pruning (cont.)
- Reduced from 6 parameters (original tree) to 2 parameters (final tree)
- An index on the Name field can be used to evaluate the tree
- Final tree:

  root
  ├─ Address edit distance < 1.5  →  matched
  └─ Address edit distance > 1.5
      ├─ Wireless Name length < 11.5  →  ambiguous
      └─ Wireless Name length > 11.5  →  unmatched
36. CART Matching Result
- Confusion matrix: entry (x, y) = Prob(predicted as x | actual class y)

  x \ y      Matched   Unmatched   Ambiguous
  Matched    0.965     0.014       0.03
  Unmatched  0.020     0.956       0.18
  Ambiguous  0.015     0.030       0.79

- Approximately 10% error rate
- Reduced the computation cost of matching by 50%
- About 47% more correctly matched records than a leading commercial product
37. Major Telephone Operating Company
- Problem
  - mergers and acquisitions of wireless companies resulted in the RBOC's inability to determine common customers among its wireline and wireless businesses
  - customer databases for each business unit use different schemas and contain many quality problems
  - the RBOC's experience with a commercial vendor's data reconciliation tool was unsatisfactory
- Solution
  - use small, manually verified data samples (100 records) to determine appropriate matching rules
  - use machine learning to prune the rules for efficient analysis of the large dataset
  - resulted in 30% more correct matches than the commercial tool
38. Large Media Conglomerate
- Problem
  - the company provides magazines to wholesalers, who in turn provide magazines to retailers for distribution
  - reconciliation of wholesaler and retailer databases would make it easier to track where gaps in reporting occur
  - identify bad retailers
- Solution
  - group by primary keys
  - match by secondary keys
  - e.g., 3,000 C.V.S. Pharmacies are grouped and compared by zip code and street address to identify bad pharmacies
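The group-by-primary-key / match-by-secondary-key scheme above can be sketched as follows; the record fields and the store data in the usage example are illustrative assumptions, not the conglomerate's actual schema:

```python
from collections import defaultdict
from itertools import combinations

def duplicates_by_group(records, primary_key, secondary_keys):
    """Group records on the primary key (e.g. retailer chain name),
    then compare only within each group: pairs that agree on all
    secondary keys (e.g. zip code + street address) are flagged as
    likely duplicates. Avoids a full cross-product comparison."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[primary_key]].append(rec)
    dupes = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            if all(a[k] == b[k] for k in secondary_keys):
                dupes.append((a, b))
    return dupes
```

Because comparisons happen only inside a group, the cost scales with the squared group sizes rather than the squared table size.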
39. International Government
- Problem
  - reconcile vital taxpayer data from several different sources
  - known problems include record duplication, address mismatches, address obsolescence, and distributed responsibility for database accuracy and updates
  - identify causes of mistakes
- Solution
  - improve the process flows and architecture to allow rapid modification of pre-processing rules and matching rules
  - detect and classify likely causes of duplication
  - analysis and improvements reduced the number of records that required manual verification
40. ILEC-CLEC Billing Reconciliation
- Problem
  - ILECs charge CLECs for use of network resources
  - verification of actual usage vs. charges (e.g., a customer changing providers)
  - identify actual usage and send verification to the ILEC
  - resource identification differs between the ILEC and the CLEC
- Solution
  - check charges in the bill against actual usage
  - establish a common identification of resources (matching table)
  - the solution has so far been implemented only for access-line charges
41. Future Directions
- Reduction of manual work
  - type-based standardization rules
  - rule and parameter pruning?
- Database and tool
  - rule execution plans (e.g., compute string length before edit distance)
  - sharing/maintenance of indices across steps in the process
- Extending matching to structures (trees, graphs, circuits)
- Matching across layers in networks (OSI layers)
- How often to discover/audit the network? Sampling techniques
42. References
- [Cochinwala01a] M. Cochinwala, V. Kurien, G. Lalk and D. Shasha, "Efficient data reconciliation," Information Sciences, 137(1-4), 2001.
- [Cochinwala01b] M. Cochinwala, S. Dalal, A. Elmagarmid and V. Verykios, "Record matching: past, present and future," submitted for publication.
- [DeySarkarDe98] D. Dey, S. Sarkar and P. De, "Entity matching in heterogeneous databases: a distance-based decision model," 31st Hawaii Int. Conf. on System Sciences, 1998.
- [Hernadez96] M. Hernadez, "A generalization of band joins and the merge/purge problem," Ph.D. thesis, CS Dept., Columbia University, 1996.
- [Hernadez98] M. Hernadez and S. Stolfo, "Real-world data is dirty: data cleansing and the merge/purge problem," Data Mining and Knowledge Discovery, 2(1), 1998.
- [Manber89] U. Manber, Introduction to Algorithms: A Creative Approach, Addison-Wesley, 1989.
- [MongeElkan97] A. Monge and C. Elkan, "An efficient domain-independent algorithm for detecting approximately duplicate database records," SIGMOD DMKD Workshop, 1997.
- [Newcombe59] H. Newcombe, J. Kennedy, S. Axford and A. James, "Automatic linkage of vital records," Science, 130(3381), 1959.
- [Newcombe88] H. Newcombe, Handbook of Record Linkage, Oxford University Press, 1988.
- [SmithWaterman81] T. Smith and M. Waterman, "Identification of common molecular subsequences," J. Mol. Biol., 147, 1981.