Title: ARO Workshop on Abductive Systems
1 Statistical Relational Learning for Abductive Reasoning in Heterogeneous Environments
- Lise Getoor
- University of Maryland, College Park
2 What is SRL?
- Traditional statistical machine learning approaches assume
  - A random sample of homogeneous objects from a single relation
- Traditional relational learning approaches assume
  - No noise or uncertainty in data
- Real-world data sets
  - Multi-relational and heterogeneous
  - Noisy and uncertain
- Statistical Relational Learning (SRL)
  - Newly emerging research area at the intersection of statistical models and relational learning/inductive logic programming
- Sample domains
  - Web data, social networks, biological data, communication data, customer networks, sensor networks, natural language, vision
3 SRL Theory
- Methods that combine expressive knowledge representation formalisms, such as relational and first-order logic, with principled probabilistic and statistical approaches to inference and learning
- Directed approaches
  - Semantics based on Bayesian networks
  - Frame-based directed models
  - Rule-based directed models
- Undirected approaches
  - Semantics based on Markov networks
  - Frame-based undirected models
  - Rule-based undirected models
4 SRL Tasks
- Entity Resolution
- Link Prediction
- Collective Classification
- Information Diffusion
- Community Discovery/Group Detection
- Ontology Alignment
5 Entity Resolution
James Smith
John Smith
John Smith
Jim Smith
J Smith
James Smith
Jon Smith
Jonathan Smith
J Smith
Jonthan Smith
- Issues
- Identification
- Disambiguation
6 Collective Entity Resolution
- Relational resolution: references are not observed independently; use relations to improve identification and disambiguation
  - Links between references indicate relations between the entities
  - Co-author relations for bibliographic data
  - To, cc lists for email
- Collective resolution: jointly determining the entities and mappings
Pasula et al. 03, Ananthakrishna et al. 02, Bhattacharya & Getoor 04, 06, 07, McCallum & Wellner 04, Li, Morie & Roth 05, Culotta & McCallum 05, Kalashnikov et al. 05, Chen, Li & Doan 05, Singla & Domingos 05, Dong et al. 05
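The relational idea above can be illustrated with a minimal sketch. It assumes a similarity that linearly mixes a name-string score with the Jaccard overlap of co-author sets; the toy references, the weight `alpha`, and all function names are hypothetical, and this is not the algorithm of any of the cited papers. A collective method would go further and compare resolved co-author entities, iterating as merges are made.

```python
from difflib import SequenceMatcher

# Hypothetical toy references: name as written, plus co-author names on the same paper.
REFS = {
    "r1": ("J. Smith", {"A. Aho"}),
    "r2": ("John Smith", {"A. Aho"}),
    "r3": ("J. Smith", {"K. McManus"}),
}

def name_sim(a, b):
    """Attribute similarity: character-level match ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(s, t):
    """Relational similarity: overlap of the two co-author sets."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0

def combined_sim(ref_a, ref_b, alpha=0.5):
    """Weighted mix of attribute and relational similarity (alpha is an assumed weight)."""
    (name_a, co_a), (name_b, co_b) = ref_a, ref_b
    return (1 - alpha) * name_sim(name_a, name_b) + alpha * jaccard(co_a, co_b)

# Compare r1 against the other two references.
score_r1_r2 = combined_sim(REFS["r1"], REFS["r2"])
score_r1_r3 = combined_sim(REFS["r1"], REFS["r3"])
```

On names alone, r1 and r3 look identical; once the shared co-author is taken into account, r1 scores higher against r2 than against r3.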
7 Link Prediction
Link type   Node 1               Node 2
Email       chris_at_enron.com   liz_at_enron.com
IM          chris37              lizs22
TXT         555-450-0981         555-901-8812
8 → Links in Information Graph
Link type   Node 1   Node 2
Manager     Chris    Elizabeth
Father      Tim      Steve
9 Collective Classification
- Relational classification: predicting the category of an object based on its attributes, its links, and the attributes of linked objects
- Collective classification: jointly predicting the categories for a collection of connected, unlabelled objects
Neville & Jensen 00, Taskar, Abbeel & Koller 02, Lu & Getoor 03, Neville, Jensen & Gallagher 04, Sen & Getoor TR07, Macskassy & Provost 07, Gupta, Diwan & Sarawagi 07, Macskassy AAAI07, McDowell, Gupta & Aha AAAI07
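A minimal sketch of the collective idea, assuming a very simple scheme in which each unlabelled node repeatedly takes the majority label of its currently labelled neighbours. The graph, the labels, and the function name are hypothetical; a local attribute-based classifier is omitted, and this is not the method of any cited paper.

```python
from collections import Counter

def collective_classify(nodes, edges, known, max_iters=10):
    """Iteratively relabel unlabelled nodes from their neighbours' current labels."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    labels = dict(known)  # start from the observed labels only
    for _ in range(max_iters):
        changed = False
        for n in nodes:
            if n in known:
                continue  # observed labels stay fixed
            votes = Counter(labels[m] for m in neighbors[n] if m in labels)
            if not votes:
                continue  # no labelled neighbour yet
            # Majority vote; ties broken alphabetically so the result is deterministic.
            best = max(sorted(votes), key=lambda lab: votes[lab])
            if labels.get(n) != best:
                labels[n] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical graph: c has no labelled neighbour at first and is reached via b.
NODES = ["a", "b", "c", "x", "y"]
EDGES = [("a", "b"), ("b", "c"), ("x", "y")]
result = collective_classify(NODES, EDGES, known={"a": "A", "x": "B"})
```

Node c has no observed neighbour of its own; it receives a label only because b's inferred label is fed back into the inference, which is the sense in which the predictions are made jointly.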
10 Graph Identification
Data Graph → Information Graph
- Entity resolution: mapping email addresses to people
- Link prediction: predicting social relationships based on communication
- Collective classification: labeling nodes in the constructed social network
HP Labs, Huberman & Adamic
11 Putting it all together
- Requires collective inference
- Data is not IID
- Entity resolution, link prediction and classification decisions cannot be made independently!
- Much interesting research within the machine learning community currently addresses how to put these together effectively
12 Abductive SRL
- Need to be able to use the query and observations to guide the construction of the SRL model
- Need to reason about relevance, ambiguity and costs in order to decide what information to acquire
  - Using both relational background knowledge
  - And statistical/probabilistic models
- Need computational mechanisms that make the value of information computation in these rich domains tractable
13 Some first steps
- Query-time Entity Resolution
  - Bhattacharya & Getoor, KDD06, AAAI06, JAIR to appear
- Cost-sensitive Markov Networks
  - Sen & Getoor, ICML06, DMKD to appear
- VOILA: Efficient Feature-value Acquisition for Classification
  - Bilgic & Getoor, AAAI07
14 Query-time ER
- Simple approach for resolving queries
  - Use attributes
  - Quick but not accurate
- Use best techniques available
  - Collective resolution using relationships
  - How can we localize collective resolution?
- Two-phase collective resolution for query
  - Extract minimal set of relevant records
  - Collective resolution on extracted records
15 Extracting Relevant Records
[Figure: relevant-record extraction for a query, expansion levels 0 to 2]
- Query: S. Johnson
- Name expansion: P4 Stephen C. Johnson, P5 S. Johnson, P2 S. Johnson, P1 S. Johnson
- Hyper-edge expansion: P4 Alfred V. Aho, P5 A. Aho, P4 Jefferey D. Ullman, P5 J. Ullman, P2 K. McManus, P2 C. Walshaw, P1 C. Walshaw
- Name expansion: P A. Aho, P Alfred V. Aho, P J. Ullman, P Jefferey D. Ullman, P K. McManus, P K. McManus, P C. Walshaw, P C. Walshaw
Start with query name or record
- Alternate between
  - Name expansion: for any relevant record, include other records with that name
  - Hyper-edge expansion: for any relevant record, include other related records
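The alternation between the two expansions can be sketched as follows. This is a minimal illustration under stated assumptions: each paper is treated as a hyper-edge over its author references, two names are considered "the same name" when they share first initial and last name (the helper `name_key` is a hypothetical stand-in for a real name-compatibility test), and papers P6 and P7 are invented to show where the expansion stops.

```python
# (paper id, author name) pairs; each paper is a hyper-edge over its references.
RECORDS = [
    ("P1", "S. Johnson"), ("P1", "C. Walshaw"),
    ("P2", "S. Johnson"), ("P2", "K. McManus"), ("P2", "C. Walshaw"),
    ("P4", "Stephen C. Johnson"), ("P4", "Alfred V. Aho"), ("P4", "Jefferey D. Ullman"),
    ("P5", "S. Johnson"), ("P5", "A. Aho"), ("P5", "J. Ullman"),
    ("P6", "A. Aho"), ("P6", "R. Sethi"),  # hypothetical extra paper
    ("P7", "M. Jones"),                    # hypothetical unrelated paper
]

def name_key(name):
    """Assumed compatibility key: (first initial, last name)."""
    parts = name.replace(".", "").split()
    return (parts[0][0], parts[-1])

def name_expansion(names, records):
    """All records whose name is compatible with any of the given names."""
    keys = {name_key(n) for n in names}
    return {r for r in records if name_key(r[1]) in keys}

def hyperedge_expansion(relevant, records):
    """All records that share a paper (hyper-edge) with a relevant record."""
    papers = {paper for paper, _ in relevant}
    return {r for r in records if r[0] in papers}

def extract_relevant(query_name, records, depth=1):
    """Start from the query name, then alternate the two expansions `depth` times."""
    relevant = name_expansion({query_name}, records)
    for _ in range(depth):
        relevant |= hyperedge_expansion(relevant, records)
        relevant |= name_expansion({name for _, name in relevant}, records)
    return relevant

level0 = extract_relevant("S. Johnson", RECORDS, depth=0)
level2 = extract_relevant("S. Johnson", RECORDS, depth=1)
```

With `depth=0` only the four Johnson records are returned. One round of alternation pulls in their co-authors and then the extra "A. Aho" record on P6, but not that record's own co-author, which would need a further hyper-edge step.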
16 Adaptive Expansion for a Query
- Too many records with unconstrained expansion
- Adaptively select records based on ambiguity
  - "Smith" is more ambiguous than "McManus"
- Adaptive name expansion
  - Expand the more ambiguous records
  - They need extra evidence
- Adaptive hyper-edge expansion
  - Add fewer ambiguous records
  - They lead to imprecision
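A minimal sketch of ambiguity-driven selection, assuming that ambiguity is estimated as the share of all references carrying the same (first initial, last name) key. That estimate, the threshold, the cut-off `k`, and the function names are all hypothetical choices made for illustration, not the measure used in the cited work.

```python
from collections import Counter

def name_key(name):
    """Assumed name key: (first initial, last name)."""
    parts = name.replace(".", "").split()
    return (parts[0][0], parts[-1])

def ambiguity(all_names):
    """Hypothetical ambiguity score per name key: its share of all references."""
    keys = [name_key(n) for n in all_names]
    counts = Counter(keys)
    return {key: count / len(keys) for key, count in counts.items()}

def select_for_name_expansion(candidates, amb, threshold):
    """Expand only the more ambiguous names: they need extra evidence."""
    return [n for n in candidates if amb[name_key(n)] >= threshold]

def select_for_hyperedge_expansion(neighbors, amb, k):
    """Add only the k least ambiguous neighbours: ambiguous ones reduce precision."""
    return sorted(neighbors, key=lambda n: amb[name_key(n)])[:k]

# Hypothetical collection: five references share the key (J, Smith).
ALL_NAMES = ["J. Smith"] * 4 + ["John Smith", "K. McManus", "C. Walshaw", "A. Aho"]
AMB = ambiguity(ALL_NAMES)

to_expand = select_for_name_expansion(["J. Smith", "K. McManus"], AMB, threshold=0.5)
to_add = select_for_hyperedge_expansion(["J. Smith", "K. McManus", "A. Aho"], AMB, k=2)
```

Here "J. Smith" is selected for name expansion while "K. McManus" is not, and the hyper-edge step keeps the two rarer names and leaves out the "J. Smith" neighbour.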
17 Query-time ER Results
- Unconstrained expansion
  - Collective resolution more accurate
  - Accuracy improves beyond depth 1
A: pair-wise attribute similarity; AN: also neighbors' attributes; *: transitive closure
- Adaptive expansion
  - Minimal loss in accuracy
  - Dramatic reduction in query processing time
AX-2: adaptive expansion at depths 2 and beyond; AX-1: adaptive expansion even at depth 1
18 Cost-sensitive Markov Networks
- Need for cost-sensitive classification for structured domains
- Developed a framework for a cost-sensitive maximum entropy classifier
- Evaluated on synthetic and real sensor network data
19 Sensor net data
- Used Intel Lab Dataset
  - 2M records describing temperature, humidity, light and sensor voltage
- Task: predict light values
- Misclassification costs based on
  - If light is insufficient but predicted to be sufficient: incur occupant discomfort
  - If light is sufficient but predicted to be insufficient: incur excess electricity costs
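The effect of asymmetric costs can be shown with the standard minimum-expected-cost decision rule. This is a sketch of that general rule only, not of the cost-sensitive Markov network framework itself; the probabilities and the cost values (discomfort weighted 5, electricity weighted 1) are hypothetical.

```python
def min_expected_cost_label(probs, cost):
    """Pick the prediction with the lowest expected misclassification cost.

    probs: {true_label: P(true_label | observations)}
    cost:  {(true_label, predicted_label): cost}, with zero cost on the diagonal
    """
    labels = list(probs)

    def expected_cost(predicted):
        return sum(probs[true] * cost[(true, predicted)] for true in labels)

    return min(labels, key=expected_cost)

# Hypothetical posterior for one sensor reading.
PROBS = {"sufficient": 0.7, "insufficient": 0.3}

# Assumed costs: occupant discomfort is weighted 5x the excess electricity cost.
ASYMMETRIC = {
    ("sufficient", "sufficient"): 0.0,
    ("insufficient", "insufficient"): 0.0,
    ("insufficient", "sufficient"): 5.0,  # predicted sufficient, light was insufficient
    ("sufficient", "insufficient"): 1.0,  # predicted insufficient, light was sufficient
}
UNIFORM = {key: (0.0 if key[0] == key[1] else 1.0) for key in ASYMMETRIC}

cost_sensitive = min_expected_cost_label(PROBS, ASYMMETRIC)
cost_blind = min_expected_cost_label(PROBS, UNIFORM)
```

With uniform costs the more probable label "sufficient" is chosen. With the assumed asymmetric costs, predicting "sufficient" has expected cost 0.3 × 5 = 1.5 against 0.7 × 1 = 0.7 for "insufficient", so the less probable label is preferred.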
20 CSMN Results
21 Efficient Feature Acquisition
- Problem: selecting the best attributes to acquire, given rich cost and probabilistic dependence structure
  - Requires many expected value of information calculations
- Value Of Information LAttice (VOILA) is a directed graph whose
  - nodes correspond to only the relevant subsets
    - exploiting constraints on the feature sets
  - edges represent subset relationships between its nodes
    - exploiting subset relationship for EVI computation sharing
- Different acquisition strategies: FF, SS, SF
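A single expected value of information (EVI) calculation can be sketched as the drop in expected misclassification cost from observing a feature subset before deciding. The sketch below assumes a small explicit joint distribution over a binary class and two binary features, with feature acquisition costs left out; the distribution, the 0/1 cost table, and the function names are hypothetical, and the lattice construction itself is not shown.

```python
from collections import defaultdict

def expected_cost(joint, cost, observed):
    """Expected cost of the best decision after observing features at indices `observed`.

    joint: {(y, feature_tuple): probability}; cost: {(true_y, predicted_y): cost}.
    """
    labels = sorted({y for y, _ in joint})
    # Marginalise the joint onto (observed feature values, y).
    marg = defaultdict(float)
    for (y, feats), p in joint.items():
        marg[(tuple(feats[i] for i in observed), y)] += p
    total = 0.0
    for obs in {o for o, _ in marg}:
        # For each observation, take the cheapest prediction.
        total += min(sum(marg[(obs, y)] * cost[(y, a)] for y in labels) for a in labels)
    return total

def evi(joint, cost, subset):
    """Reduction in expected cost from observing `subset` versus observing nothing."""
    return expected_cost(joint, cost, ()) - expected_cost(joint, cost, subset)

# Hypothetical joint P(y, (f1, f2)): f1 is informative about y, f2 is independent noise.
JOINT = {
    (1, (1, 0)): 0.20, (1, (1, 1)): 0.20, (1, (0, 0)): 0.05, (1, (0, 1)): 0.05,
    (0, (1, 0)): 0.05, (0, (1, 1)): 0.05, (0, (0, 0)): 0.20, (0, (0, 1)): 0.20,
}
COST = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 1.0, (1, 0): 1.0}

evi_f1 = evi(JOINT, COST, (0,))
evi_f2 = evi(JOINT, COST, (1,))
evi_both = evi(JOINT, COST, (0, 1))
```

In this toy distribution the subsets {f1} and {f1, f2} have the same EVI because f2 carries no information, a small instance of the kind of repeated value that sharing computation along subset edges can exploit.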
22 Datasets
On average, 1/3 of VOILA nodes shared the same EVI.
23 Results - Heart
24 Next Steps
- Cost-sensitive query-time adaptive information gathering
  - Complexity of the integrated SRL tasks requires flexible, adaptive algorithms which retrieve relevant information in real time
  - Inference and learning need to be scalable and real time
  - Methods need to take complex cost models into account
- Some related areas to keep in mind
  - Visual analytics: complexity of the integrated SRL tasks requires sophisticated user interfaces which allow user feedback and support explanation
  - Probabilistic databases: currently a resurgence of work in this area in the DB community
25 Thanks
http://www.cs.umd.edu/getoor
Work sponsored by the National Science Foundation, Google, Microsoft, KDD program and National Geospatial Agency
26 ILIADS
- Goal
  - Produce high-quality integration via a flexible method able to adapt to a wide variety of ontology sizes and structures
- Method
  - Combining statistical and logical inference
  - Use schema (structure) and data (instances) effectively
- Solution
  - Integrated Learning In Alignment of Data and Schema (ILIADS)
- Datasets and code available at http://www.cs.umd.edu/linqs/projects/iliads