Title: Relational Learning
1. Relational Learning and Link Analysis
David Jensen
Knowledge Discovery Laboratory
Computer Science Department
University of Massachusetts
Amherst, Massachusetts, USA
2. Example Findings: Predicting Cell Phone Fraud
- Data: Phone calls from a single US city over three months
  - 1.5M objects (phones)
  - 7M links (call volume)
- Task: Predict high-probability instances of identity-theft fraud for a future month
[Figure: data schema. Phone objects (Number, Type, Fraud?) connected by Called links (Weight, Distance, Source)]
- Result: Fraud rings exist where multiple fraudulent numbers call the same number
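The fraud-ring pattern in this result can be illustrated with a minimal sketch. It assumes calls are available as (caller, callee) pairs and that some numbers are already known to be fraudulent. The function name `shared_callees`, the `min_callers` threshold, and the toy phone numbers are hypothetical and do not come from the study.

```python
from collections import defaultdict


def shared_callees(calls, fraud_numbers, min_callers=2):
    """Map each called number to the known-fraudulent phones that call it.

    Only called numbers reached by at least `min_callers` distinct
    fraudulent callers are returned.
    """
    callers_by_callee = defaultdict(set)
    for caller, callee in calls:
        if caller in fraud_numbers:
            callers_by_callee[callee].add(caller)
    return {
        callee: sorted(callers)
        for callee, callers in callers_by_callee.items()
        if len(callers) >= min_callers
    }


# Hypothetical toy data: (caller, callee) pairs.
calls = [
    ("555-0101", "555-0900"),
    ("555-0102", "555-0900"),
    ("555-0103", "555-0900"),
    ("555-0101", "555-0777"),
    ("555-0200", "555-0777"),
]
fraud_numbers = {"555-0101", "555-0102"}

rings = shared_callees(calls, fraud_numbers)
print(rings)
```

A called number that appears in the output is a candidate hub of a ring. Other phones calling that hub (here "555-0103") could then be scored as higher risk.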
3. Relational knowledge discovery
- Knowledge discovery in large sets of interrelated
entities with variables on both entities and
relations.
[Figure labels: Statistics, Relational databases, Graph drawing, Artificial intelligence]
- Drawing on social network analysis, graph theory, inductive logic programming, citation analysis, web analysis, link analysis, sequence analysis, spatial and temporal statistics, and others
- Knowledge discovery for the "New Science of Networks"
4. What's unique about relational data?
- Traditional work in statistics and knowledge
discovery assumes that data instances form a single
table.
- Traditional statistical models assume
independence among instances (rows) and find
associations among the values of multiple
variables within a single instance.
- Relational models assume dependence among
instances in different rows and tables and find
associations among the values of variables across
those related instances.
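The contrast between the two kinds of model can be sketched as two ways of building a feature row for one instance. This is an illustrative assumption, not a method from the talk: the helper names, the `neighbor_positive_rate` aggregate, and the toy data are all hypothetical.

```python
def neighbor_positive_rate(node, links, labels):
    """Fraction of a node's labeled neighbors whose label is positive (1)."""
    known = [labels[n] for n in links.get(node, []) if n in labels]
    return sum(known) / len(known) if known else 0.0


def propositional_row(node, attributes):
    """Single-table view: only the instance's own variables."""
    return dict(attributes[node])


def relational_row(node, attributes, links, labels):
    """Relational view: own variables plus an aggregate over linked instances."""
    row = dict(attributes[node])
    row["neighbor_positive_rate"] = neighbor_positive_rate(node, links, labels)
    return row


# Hypothetical toy data: four instances, their links, and some known labels.
attributes = {
    "a": {"type": "mobile"},
    "b": {"type": "landline"},
    "c": {"type": "mobile"},
    "d": {"type": "mobile"},
}
links = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
labels = {"b": 1, "c": 0, "d": 1}

flat = propositional_row("a", attributes)
rel = relational_row("a", attributes, links, labels)
print(flat)
print(rel)
```

The propositional row treats "a" as independent of every other row. The relational row lets the labels of linked instances inform the prediction for "a".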
5. Example tasks
- Identifying fraudulent securities brokers
  - Partner: National Association of Securities Dealers
  - Data: 650,000 brokers, 5,000 firms, 90,000 offices
- Predicting peer-to-peer downloads
  - Partners: UMass Office of Information Technologies; UMass CS Secure Internet and Group-Networking Lab
  - Data: 2,000 students, 1 million files
- Catching identity theft in cellphone networks
  - Partner: Large wireless service provider
  - Data: 2 million numbers, 5 million call aggregates
6. Why is relational learning useful?
- Integrate learning from multiple information sources
  - Sources generate many interrelated records with heterogeneous structure
- Use context to understand information
  - Sources may generate many interrelated
- Integrate time, space, and other relations
  - Potential to produce integrated view of many types of relations
- Learning methods to match complexity of current methods for knowledge representation and reasoning
7. Post-hoc analysis of 9/11 hijackers
(Krebs 2001)
8. What's new?
- Direct analysis
  - No need to preprocess the data to form propositional instances
- Relational inference
  - Inferences for one object can inform inferences about other objects (Neville and Jensen 2000; Taskar et al. 2002)
- Data instances are dependent
  - The assumptions of many standard statistical approaches are violated (Jensen and Neville 2002; Perlich and Provost 2003)
- Structure and attribute values of data can be correlated
  - Algorithms need to separate the effects of structure from attribute values (Jensen, Neville, and Hay 2003)
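The idea that inferences for one object can inform inferences about others can be shown with a generic neighbor-averaging sketch. This is not the algorithm from any of the cited papers. It is a simple score-propagation scheme chosen for illustration, and the function name, the `prior` value, and the toy graph are assumptions.

```python
def propagate_scores(links, known, rounds=50, prior=0.5):
    """Iteratively set each unknown node's score to the mean of its neighbors.

    `links` maps every node to a list of neighboring nodes, and `known`
    maps some nodes to fixed scores in [0, 1] that are never overwritten.
    """
    scores = {node: known.get(node, prior) for node in links}
    for _ in range(rounds):
        updated = {}
        for node, neighbors in links.items():
            if node in known or not neighbors:
                updated[node] = scores[node]
            else:
                updated[node] = sum(scores[n] for n in neighbors) / len(neighbors)
        scores = updated
    return scores


# Hypothetical chain a - b - c - d with known scores only at the two ends.
links = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
known = {"a": 1.0, "d": 0.0}

scores = propagate_scores(links, known)
print(scores)
```

Nodes "b" and "c" have no evidence of their own. Their scores come entirely from what is known about linked objects, with "b" pulled toward "a" and "c" pulled toward "d".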
9. Example: Relational Probability Tree
(Neville, Jensen, Friedland, and Hay 2003)
CV accuracy 91, AUC 85
10. Autocorrelation and effective sample size
- The reliability of a statistical association varies with sample size (N)
- When evaluating the association between characteristics of groups and their members, what is the effective sample size?
  - N = members
  - N = groups
  - members > N > groups
(Jensen and Neville 2002)
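One way to make the effective-sample-size question concrete is the standard design-effect approximation from survey statistics, N_eff = N / (1 + (m - 1) * rho), for equal-sized groups of m members with intraclass correlation rho. The sketch below uses that textbook formula as an assumption. It is not necessarily the correction used in the cited paper, and the function name and numbers are hypothetical.

```python
def effective_sample_size(n_groups, members_per_group, rho):
    """Approximate effective sample size for equal-sized groups.

    Uses the design effect 1 + (m - 1) * rho, where m is the group size
    and rho is the intraclass correlation among members of a group.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    n_members = n_groups * members_per_group
    design_effect = 1.0 + (members_per_group - 1) * rho
    return n_members / design_effect


# Hypothetical example: 100 groups with 10 members each.
independent = effective_sample_size(100, 10, rho=0.0)  # members are independent
identical = effective_sample_size(100, 10, rho=1.0)    # members are copies of each other
partial = effective_sample_size(100, 10, rho=0.5)
print(independent, identical, partial)
```

With no autocorrelation the effective sample size equals the number of members. With perfect autocorrelation it collapses to the number of groups, and intermediate values of rho fall between the two.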
11. What's difficult?
- Relational learning and inference
  - Accurate models must consider at least the relational neighborhood of a record, rather than the record alone
- Non-independence
  - Data instances are non-independent, greatly complicating the statistics of both learning and inference
- Semi-structured data
  - Good analysis requires frequent restructuring and reinterpretation of the underlying structure of data
- Preserving privacy in relational and distributed data mining
  - Record linkage vs. aggregation
12. Technical approaches
- New learning algorithms
  - Representations and learning techniques that consider relational structure and attributes when constructing models
- New inference algorithms
  - Methods for applying learned models that leverage relational structure
- Relational statistics
  - Statistical tests that correctly adjust for characteristics of relational data such as linkage and autocorrelation
- Semi-structured databases and transformation techniques
  - Databases and techniques that allow rapid restructuring of large databases by end users
13. Where to go for more information
- 1998 AAAI Fall Symposium on AI and Link Analysis
  - Web-accessible papers
- AAAI 2000 and IJCAI 2003 Workshops on Learning Statistical Models from Relational Data
  - Web-accessible papers
  - Consider attending IJCAI 2003 workshop (send email)
- KDD 2002 and KDD 2003 Workshops on Multi-Relational Data Mining
  - Special issue of SIGKDD Explorations (forthcoming)
  - Consider attending KDD 2003 workshop
- DARPA's Evidence Extraction and Link Discovery Program
  - Pattern learning area
  - Work published at ICML and SIGKDD in past two years