Title: Relational Data Mining with Inductive Logic Programming for Link Discovery
1 Relational Data Mining with Inductive Logic
Programming for Link Discovery
- Ray Mooney, Prem Melville, Rupert Tang
- University of Texas at Austin
- Jude Shavlik, Inês de Castro Dutra, David Page,
Vítor Santos Costa
- University of Wisconsin at Madison
2 EELD Program
- Evidence Extraction
- Link Discovery
- Pattern Learning
3 Link Discovery Task (from Jim Antonisse, GITI)
[Figure: link discovery task diagram. Labels: Vetted hyp cases; Evidence request(s); Link Discovery Core Pattern Matching; Alerts based on Hypothesized cases; Pattern(s) of Interest; Domain Patterns. Legend: pre-run-time processing, run-time processing.]
4 Link Discovery
- Data is multi-relational, with many people, places, objects, and actions and numerous types of relations between them.
- Link analysis in intelligence and criminology explores and visualizes such data as a graph with many nodes and edges of various types.
- Link discovery entails finding new links and recognizing threatening patterns in such highly relational data.
5 EELD Program
- Evidence Extraction
- Link Discovery
- Pattern Learning
6 Pattern Learning for Link Discovery
- Automated discovery of patterns of interest that indicate potentially threatening activities in large amounts of heterogeneous, multi-relational data.
- Requires inducing multi-relational patterns that characterize multiple entities and multiple links between them.
7 Limitations of Traditional Data Mining
- Traditional KDD methods assume the data to be mined is in a single relational table and that examples are flat tuples of attribute values.
- This assumption stems from
- 1) Properties of the typical data mining tasks like market basket analysis.
- 2) Focus in machine learning and statistics on classification or regression using feature vectors as inputs.
8 Relational Data Mining
- Data contains multiple relations.
- Patterns to be discovered contain multiple relations.
- Knowledge to be discovered may be the definition of another relation rather than a classification or regression function.
9 Relational Data Mining Example
[Figure: family graph. Legend: Male, Female. Edge types: Married, Parent. People: Alice, Bob, Mary, Jack, Jane, Tom, Carol, John, Fred, Sue.]
Uncle(X,Y) :- Parent(Z,X), Parent(Z,W),
Parent(W,Y), Male(X), not(X=W).
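The Uncle rule on this slide can be read as a three-way join over the Parent relation, filtered by Male and an inequality. A minimal Python sketch of that reading follows. The facts are hypothetical: the figure's actual Parent and Married edges are not recoverable from the text, so the names are borrowed from the figure but the relationships are invented for illustration.

```python
from itertools import product

# Hypothetical facts (not taken from the figure); ("Alice", "Bob") means
# Alice is a parent of Bob.
PARENT = {("Alice", "Bob"), ("Alice", "Mary"), ("Mary", "Tom")}
MALE = {"Bob", "Tom"}


def uncles(parent, male):
    """Uncle(X,Y) :- Parent(Z,X), Parent(Z,W), Parent(W,Y), Male(X), not(X=W)."""
    result = set()
    # Try every combination of three Parent facts and keep those whose
    # shared variables (Z and W) agree.
    for (z1, x), (z2, w), (w2, y) in product(parent, repeat=3):
        if z1 == z2 and w == w2 and x in male and x != w:
            result.add((x, y))
    return result


print(uncles(PARENT, MALE))
```

The point of the sketch is that the pattern spans several tuples of the same relation, which is exactly what a single flat feature vector per example cannot express directly.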
10 Relational Data Mining Example (cont)
[Figure: family graph. Legend: Male, Female. Edge types: Married, Parent. People: Alice, Bob, Mary, Jack, Jane, Tom, Carol, John, Fred, Sue.]
..., Male(X), not(Z=V).
11 Most KDD Ignores RDM
- KDD textbooks barely mention RDM
- Han & Kamber, 2001
- Hand, Mannila, & Smyth, 2001
- Witten & Frank, 1999
- But there is a recent edited collection on RDM
- S. Džeroski & N. Lavrač, eds. Relational Data Mining, Springer-Verlag, 2001.
12 Inductive Logic Programming (ILP)
- Standard formal language for representing relational knowledge is first-order predicate logic.
- ILP studies the induction of hypotheses in first-order predicate logic.
- Logic programs (e.g. Prolog) or function-free logic programs (e.g. Datalog) are a useful, reasonably tractable subset of first-order predicate logic.
- ILP is the most well-studied approach to relational data mining.
13 ILP Problem Definition
- Given
- Positive Example Set P
- Negative Example Set N
- Background Knowledge B
- Find
- Hypothesis, H, such that
- $\forall p \in P:\ H \cup B \models p$
- $\forall n \in N:\ H \cup B \not\models n$
P, N, B and H are all sets of rules in
first-order logic (i.e. Horn clauses, logic
programs)
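The problem definition can be made concrete with a small check. The sketch below assumes the usual ILP reading of the two conditions: the hypothesis together with the background knowledge must cover every positive example and no negative example. Everything in it is hypothetical and chosen for illustration: the `daughter` target relation, the facts, and the representation of a hypothesis as a Python function over ground facts rather than as a set of Horn clauses.

```python
# Hypothetical background knowledge B, stored as ground facts.
BACKGROUND = {
    "parent": {("Ann", "Mary"), ("Ann", "Tom")},  # parent(Parent, Child)
    "female": {"Ann", "Mary"},
}
POSITIVES = [("Mary", "Ann")]                  # daughter(Mary, Ann)
NEGATIVES = [("Tom", "Ann"), ("Ann", "Mary")]  # must not be covered


def daughter(example, bk):
    """Candidate H: daughter(X,Y) :- female(X), parent(Y,X)."""
    x, y = example
    return x in bk["female"] and (y, x) in bk["parent"]


def too_general(example, bk):
    """Over-general candidate: daughter(X,Y) :- parent(Y,X)."""
    x, y = example
    return (y, x) in bk["parent"]


def is_solution(hypothesis, bk, positives, negatives):
    """True if the hypothesis covers all positives and none of the negatives."""
    covers_all_pos = all(hypothesis(p, bk) for p in positives)
    covers_no_neg = not any(hypothesis(n, bk) for n in negatives)
    return covers_all_pos and covers_no_neg


print(is_solution(daughter, BACKGROUND, POSITIVES, NEGATIVES))
print(is_solution(too_general, BACKGROUND, POSITIVES, NEGATIVES))
```

An ILP system searches the space of such clauses for one that passes this test; in practice the requirement is usually relaxed to tolerate noise.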
14 ILP Algorithms
- We have utilized two ILP systems for EELD problems in link discovery.
- Aleph (Srinivasan, 2001): a variant of the popular Progol algorithm (Muggleton, 1995)
- mFoil (Tang and Mooney, 2002): a variant of the popular Foil algorithm (Quinlan, 1990)
15 EELD Russian Nuclear Smuggling Data
- Data manually extracted from news sources about events related to nuclear smuggling (developed by Veridian Inc.).
- Size of data set
- 40 relational tables
- 2 to 800 tuples per relation
- Translated Access database to Prolog, mapping each relational table to a predicate.
- Used Aleph to learn rules for the relation Linked(A,B), which determines whether or not two events are part of the same incident.
- 143 positive examples
- 517 negative examples
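Mapping each relational table to a predicate amounts to emitting one ground fact per row. A minimal sketch of that translation step, under stated assumptions: the table name `event_person`, its three columns, and the sample rows are hypothetical, and the actual conversion used for the Access database is not described in the slides.

```python
def prolog_term(value):
    """Render a value as a Prolog term: numbers bare, anything else a quoted atom."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    # Escape backslashes and single quotes so the atom stays well-formed.
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + text + "'"


def table_to_facts(table_name, rows):
    """Return one ground fact per row, using the table name as the predicate."""
    facts = []
    for row in rows:
        args = ", ".join(prolog_term(v) for v in row)
        facts.append(f"{table_name}({args}).")
    return facts


# Hypothetical rows: (row id, event id, person id).
rows = [(1, "ev12", "p7"), (2, "ev12", "O'Brien")]
for fact in table_to_facts("event_person", rows):
    print(fact)
```

With every table exported this way, a relation with 7 columns becomes a 7-argument predicate, which is why the learned rules contain many anonymous `_` arguments.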
16 Illustration of Linked Relation
[Figure: a New Event shown alongside Partial Incident N and Partial Incident M.]
17 Find Correct Incident for New Event
[Figure: Partial Incident M and Expanded Incident N.]
18 Sample Rule
linked(EventA,EventB) :-
    lk_event_material(_,EventA,_,_,_,ConcealmentG,DescH),
    lk_event_person(_,EventB,PersonD,_,C,C,_),
    lk_person_material(_,PersonD,MatF,EvE,_,_,_,_,_),
    lk_event_material(_,EvE,MatF,I,_,ConcealmentG,DescH),
    l_relations(I,_,"Stolen").
If A is linked to a specific type of material <G,H>, and B is linked to a person linked to the same specific type of material, through an event in which that material was stolen, then events A and B are linked.
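19 Linked(A,B)
[Figure: nodes A and B. Legend: Event, Material, Person.]
20 Linked(A,B)
[Figure: nodes A and B; adds Material Type GH. Legend: Event, Material, Person.]
21 Linked(A,B)
[Figure: nodes A, B, E, and D; Material Type GH appears twice. Legend: Event, Material, Person.]
22 Linked(A,B)
[Figure: nodes A, B, E, and D; Material Type GH appears twice; adds the label Stolen. Legend: Event, Material, Person.]
23 Linked(A,B)
[Figure: nodes A, B, E, and D; Material Type GH appears twice; label Stolen. Legend: Event, Material, Person.]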
24 Accuracy Results for Learning Linked for Nuclear Smuggling Data
- Experimental method: 5-fold cross-validation.
- Also tried bagging Aleph to produce an ensemble of 25 hypotheses.
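The evaluation setup named on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the experimental code: `majority_learner` is a hypothetical stand-in for an ILP system such as Aleph, and combining the ensemble by majority vote is an assumption, since the slide does not say how the 25 hypotheses were combined.

```python
import random
from collections import Counter


def k_fold_splits(n_examples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def bagged_predict(learn, train_set, example, n_models=25, seed=0):
    """Train n_models learners on bootstrap resamples and return the majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        # Bootstrap: sample with replacement, same size as the training set.
        sample = [rng.choice(train_set) for _ in train_set]
        model = learn(sample)
        votes[model(example)] += 1
    return votes.most_common(1)[0][0]


def majority_learner(sample):
    """Hypothetical placeholder learner: always predicts the most common label."""
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda example: label


# 143 positive and 517 negative examples, as in the Linked data set.
data = [(i, True) for i in range(143)] + [(i, False) for i in range(517)]
for train_idx, test_idx in k_fold_splits(len(data)):
    print(len(train_idx), len(test_idx))
```

Bagging tends to help most with learners whose output changes substantially under small changes to the training set, which is one motivation for trying it with a rule learner.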
25 Synthetic Contract Killing Data
- Data generated by a plan-based simulator that generates evidence emulating contract killings and other types of murders (developed by IET Inc.).
- Simulator used to generate evidence from 200 murder events of three types
- Murder for Hire (71 exs)
- First Degree (75 exs)
- Second Degree (54 exs)
- Use mFoil to classify events into one of these three categories.
26 Sample Rules
- Murder For Hire(A) :- groupMemberMaleficiary(A, B), subEvents(A, C), crimeMotive(C, economic).
- First Degree Murder(A) :- subEvents(A, B), performedBy(B, C), loves(C, D).
- Second Degree Murder(A) :- subEvents(A, B), eventOccursAtLocationType(B, publicProperty), crimeMotive(B, rival), occurrentSubeventType(B, stealing_Generic).
27 Results on Synthetic Contract Killing Data
28 Recent Result from EELD Challenge Problem
- murder_for_hire(A) :- eventOccursAt(A,B), perpetrator(A,C), agentPhoneNumber(C,D), callerNumber(E,D), accountHolder(F,C), to_Generic(G,F), from_Generic(G,H), to_Generic(I,H).
- Says an event is a murder for hire if it has a recorded location and perpetrator, we have a recorded phone call to the perpetrator, and there was a chain of bank transfers resulting in money reaching the perpetrator's account.
- 100% accuracy on a held-out test set.
- Similar pattern found manually by LD researchers working on this challenge problem.
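To see how a rule like this is matched against evidence, the body can be evaluated as a conjunctive query over ground facts. The sketch below is a naive evaluator written for illustration, not the matcher used in the EELD systems. The predicate names and argument order follow the rule on this slide; the facts (`ev1`, `hitman`, the account and transfer identifiers) are hypothetical.

```python
def unify(arg, value, binding):
    """Variables are strings starting with an uppercase letter; others are constants."""
    if isinstance(arg, str) and arg[:1].isupper():
        if arg in binding:
            return binding[arg] == value
        binding[arg] = value
        return True
    return arg == value


def solve(goals, facts, binding=None):
    """Yield every variable binding that satisfies all goals, left to right."""
    binding = binding or {}
    if not goals:
        yield binding
        return
    (pred, args), rest = goals[0], goals[1:]
    for row in facts.get(pred, ()):
        new = dict(binding)  # copy so a failed match leaves the caller's binding intact
        if all(unify(a, v, new) for a, v in zip(args, row)):
            yield from solve(rest, facts, new)


BODY = [
    ("eventOccursAt", ("A", "B")), ("perpetrator", ("A", "C")),
    ("agentPhoneNumber", ("C", "D")), ("callerNumber", ("E", "D")),
    ("accountHolder", ("F", "C")), ("to_Generic", ("G", "F")),
    ("from_Generic", ("G", "H")), ("to_Generic", ("I", "H")),
]

# Hypothetical evidence: money reaches acct_m, then moves from acct_m to acct_h.
FACTS = {
    "eventOccursAt": {("ev1", "loc1")},
    "perpetrator": {("ev1", "hitman")},
    "agentPhoneNumber": {("hitman", "555-0100")},
    "callerNumber": {("call1", "555-0100")},
    "accountHolder": {("acct_h", "hitman")},
    "to_Generic": {("xfer1", "acct_m"), ("xfer2", "acct_h")},
    "from_Generic": {("xfer2", "acct_m")},
}


def murder_for_hire(event, facts):
    """True if the rule body has at least one solution with A bound to the event."""
    return any(True for _ in solve(BODY, facts, {"A": event}))


print(murder_for_hire("ev1", FACTS))
```

The shared variables F, G, H, and I are what encode the two-step transfer chain: the account receiving transfer G is the perpetrator's, and the account G came from itself received an earlier transfer I.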
29 Future Research
- Scaling to larger datasets
- Stochastic search
- Logic program optimization
- Integration with relational and deductive database technology.
- Integrating probabilistic reasoning
- Logic programs with Bayes-net constraints
- Active Learning
- Theory Refinement
30 Related Research
- Graph-based Relational Data Mining
- Subdue (Cook & Holder, UT Arlington)
- Probabilistic Relational Models
- PRMs (Koller, Stanford)
- Relational Feature Construction
- PROXIMITY (Jensen, UMass)
31 Record Linkage
- Identify and merge duplicate field values and duplicate records in a database.
- Applications
- Duplicates in mailing lists
- Merging multiple databases of stores, restaurants, etc.
- Matching bibliographic references in research papers (Cora/ResearchIndex)
- Identifying individuals who are trying to hide their identity by providing slightly erroneous personal information.
32 Record Linkage Examples
[Table: example records with fields Author, Title, Venue, Address, Year.]
[Table: example records with fields Name, Address, City, Cuisine.]
33 Trainable Record Linkage
- MARLIN (Multiply Adaptive Record Linkage using INduction)
- Learn parameterized similarity metrics for comparing each field.
- Trainable edit-distance
- Use EM to set edit-operation costs
- Learn to combine multiple similarity metrics for each field to determine equivalence.
- Use SVM to decide on duplicates
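The idea of a trainable edit distance rests on making the edit-operation costs parameters rather than constants. The sketch below shows only that parameterized distance, using the standard dynamic-programming recurrence with one cost per operation type. It is a simplification for illustration: the cost values are hypothetical, the EM step that would fit them from labeled duplicate pairs is not shown, and MARLIN's actual model is not reproduced here.

```python
def weighted_edit_distance(s, t, ins=1.0, dele=1.0, sub=1.0):
    """Edit distance between s and t with adjustable insert/delete/substitute costs."""
    # prev[j] is the cost of turning the processed prefix of s into t[:j].
    prev = [j * ins for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [i * dele]
        for j, ct in enumerate(t, start=1):
            match_or_sub = prev[j - 1] + (0.0 if cs == ct else sub)
            curr.append(min(prev[j] + dele, curr[j - 1] + ins, match_or_sub))
        prev = curr
    return prev[-1]


# With unit costs this is the ordinary Levenshtein distance.
print(weighted_edit_distance("kitten", "sitting"))
# A cheaper insertion makes a dropped letter count for less.
print(weighted_edit_distance("Cusine", "Cuisine", ins=0.5))
```

Per-field distances of this kind would then serve as input features to the duplicate classifier, which is where the second level of learning comes in.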
34 MARLIN Record Linkage Framework
[Figure: framework diagram with two components: trainable duplicate detector and trainable similarity metrics.]
35 Conclusions
- Pattern Learning for Link Discovery is an important application of data mining for counter-terrorism.
- Learning for Link Discovery requires Relational Data Mining (RDM).
- Other problem domains require RDM
- Bioinformatics
- Web
- Natural Language Understanding
- RDM is an important next-generation KDD capability.