Feature Generation and Selection in SRL

1
Feature Generation and Selection in SRL
  • Alexandrin Popescul & Lyle H. Ungar
  • Presented By Stef Schoenmackers

2
Overview
  • Structural Generalized Linear Regression (SGLR)
    Overview
  • Design Motivations
  • Experiments
  • Conclusions

3
SGLR Overview
  • Adds statistical methods to ILP
  • SQL as the logical language
  • Generalized Linear Regression as statistical
    method
  • Uses clustering to generate new relations
  • Builds discriminative models
  • Targeted at large problems where generative
    models are infeasible
  • Integrates feature generation and problem modeling

4
SGLR Loop
  [Figure not preserved in the transcript]

5
SGLR Method
  • Clusters data and adds clusters as new relations
  • Searches the space of SQL query refinements
  • Features are numerical SQL aggregates
  • Test each feature with a statistical measure
    (e.g. AIC, BIC)
  • Add only significantly predictive features
  • Examine each feature only once
  • Use current set of features to guide search
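
The feature test above can be sketched as follows. This is a minimal illustration, assuming BIC is the chosen measure and that the model's log-likelihood with and without the candidate feature is already available. The function names `bic` and `accept_feature` are hypothetical and do not come from the presentation.

```python
import math

def bic(log_likelihood, num_params, num_examples):
    """Bayesian Information Criterion; lower values are better."""
    return num_params * math.log(num_examples) - 2.0 * log_likelihood

def accept_feature(ll_without, ll_with, k_without, num_examples):
    """Keep a candidate feature only if it lowers BIC.

    Assumption: the candidate adds exactly one parameter to the model.
    """
    old = bic(ll_without, k_without, num_examples)
    new = bic(ll_with, k_without + 1, num_examples)
    return new < old
```

With 1000 examples, a feature must raise the log-likelihood by more than ln(1000)/2 (about 3.45) to be accepted under this rule, so weakly predictive features are rejected.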

6
Overview
  • Structural Generalized Linear Regression (SGLR)
    Overview
  • Design Motivations
  • Experiments
  • Conclusions

7
SQL Motivation
  • Most of the world's data is in relational
    databases
  • Can exploit schema and meta-information
  • SQL is a fairly expressive language
  • Non-recursive first-order logic formulas
  • Relational DBs have been studied and optimized
    for decades, so should be more scalable than
    other alternatives

8
Clustering Motivation
  • Dimensionality reduction
  • Clusters are added as relations (new first-class
    concepts)
  • Increases expressivity of the language describing
    patterns in the data
  • Can lead to a more rapid discovery of predictive
    features
  • Done as a pre-processing step
  • cost(clustering) << cost(feature search)

9
Aggregation Motivation
  • Summarizes the information in a table into scalar
    values usable by a statistical model
  • average, max, min, count, empty/exists
    (0/1)
  • Exploits the work databases have put into
    making aggregates efficient
  • Provides a richer space of features to choose
    from
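
As a rough illustration of aggregates as features, the sketch below turns a query over a small citation table into scalar values. The table layout, the sample rows, and the function name `aggregate_features` are assumptions made for this example and are not taken from the presentation.

```python
import sqlite3

# Toy in-memory database; schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cites(doc TEXT, cited TEXT);
    INSERT INTO Cites VALUES ('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3');
""")

def aggregate_features(conn, doc):
    """Summarize the rows matching one document as scalar features."""
    count, distinct_cited = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT cited) FROM Cites WHERE doc = ?",
        (doc,),
    ).fetchone()
    return {
        "count": count,                    # number of matching rows
        "exists": int(count > 0),          # 0/1 indicator
        "distinct_cited": distinct_cited,  # unique values in one column
    }
```

The database does the aggregation, so the statistical model only ever sees one number per feature per example.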

10
Dynamic Feature Generation
  • Most features do not provide useful information
  • In large domains, feature generation is
    expensive, and precomputing all possible features
    is far too time consuming
  • Solution: use a smarter search strategy and
    dynamically generate features. Let the features
    already selected influence which features are
    added
  • Focuses only on the promising areas in the search
    space

11
Feature Streams
  • Put features into different evaluation queues
  • Choose next feature from the best stream
  • If a feature appears in multiple streams,
    evaluate it only once
  • Stream design can use prior knowledge/bias

12
Refinement Graphs (in ILP)
  • Starts with the most general rule and refines it
    to produce more specific clauses
  • Single variable substitution
  • Add a predicate involving at least one existing
    variable
  • Uses top-down breadth-first search to find the
    most general rule that covers only positive
    examples
  • Performs poorly in noisy domains

13
Refinement Graphs (in SGLR)
  • Adds one relation to a query and expands it into
    all possible configurations of equality
    conditions of new attributes with a new or old
    attribute
  • Contains at least one equality condition between
    a new and old attribute
  • Any attribute can be set to a constant
  • High-level variable typing/classes are enforced
  • Not all refinements are most general, but this
    simplifies pruning of equivalent subspaces
    (accounts only for the type and number of
    relations joined in a query)
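
One way to picture this refinement step is the sketch below. It adds one relation to a query and enumerates the ways its attributes can be equated with existing variables of the same type, requiring at least one such equality. The schema, the variable naming, and the function `refinements` are hypothetical, and setting attributes to constants is left out to keep the example short.

```python
from itertools import product

# Hypothetical typed schema: relation name -> attribute types.
SCHEMA = {"Cites": ("doc", "doc"), "Author_of": ("doc", "person")}

def refinements(query, var_types):
    """Return all queries obtained by adding one relation.

    query: list of (relation, variables) literals.
    var_types: dict mapping each existing variable to its type.
    """
    results = []
    for relation, attr_types in SCHEMA.items():
        # Each attribute reuses an old variable of its type, or None (fresh).
        choices = [
            [v for v, t in var_types.items() if t == attr_type] + [None]
            for attr_type in attr_types
        ]
        for combo in product(*choices):
            if all(c is None for c in combo):
                continue  # need at least one equality with an old variable
            fresh = len(var_types)
            args = []
            for c in combo:
                if c is None:
                    args.append(f"v{fresh}")  # new variable
                    fresh += 1
                else:
                    args.append(c)
            results.append(query + [(relation, tuple(args))])
    return results
```

Starting from `[("Query", ("d",))]` with `{"d": "doc"}`, this yields three `Cites` configurations and one `Author_of` configuration. Typing is what keeps `d` from being equated with a `person` attribute.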

14
Example Refinement Graph
  [Figure: refinement graph over the DB tables; node labels recovered from the figure]
  • Query(d)
  • Cites(d, d1)
  • Author_of(d, a)
  • Word_count(d, w, int)
  • Author_of(d, a=Smith)
  • Cites(d, d1), Cites(d1, d2)
  • Cites(d, d1), Author_of(d1, a)
  • Cites(d, d1), Author_of(d1, a=Domingos)

15
Overview
  • Structural Generalized Linear Regression (SGLR)
    Overview
  • Design Motivations
  • Experiments
  • Conclusions

16
Experiments
  • Used CiteSeer data
  • Citation(doc1, doc2), Author(doc, person),
    PublishedIn(doc, venue), HasWord(doc,word)
  • 60k Docs, 131k Authors, 173k Citations, 6.8M
    Words
  • Two Tasks
  • Predict the publication venue
  • Predict existence of a citation

17
Experiments
  • Cluster all many-to-many relations
  • K-means
  • Added 6 new relations
  • Use logistic regression for prediction
  • BFS of search space
  • 5k positive / 5k negative examples for venue
    prediction
  • 2.5k positive / 2.5k negative examples for
    citation prediction

18
Results
  • Venue (87.2)
  • Citation (93.1)

19
Dynamic Feature Generation
  • Query expressions are generated breadth-first
  • Baseline puts all queries into one queue
  • Dynamic strategy enqueues queries into separate
    streams
  • Stream 1: exists and count over the table
  • Stream 2: other aggregates (counts of unique
    elements in individual columns)
  • Chooses next feature from the stream where
    (featuresAdded + 1)/(featuresTried + 1) is max
  • Stop when a stream is empty
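
The stream selection rule above can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the `Stream` class, the `run_streams` function, and the `is_predictive` callback (standing in for the statistical test) are all hypothetical names.

```python
from collections import deque

class Stream:
    """A queue of candidate features with success statistics."""

    def __init__(self, name, features):
        self.name = name
        self.queue = deque(features)
        self.added = 0  # features accepted into the model
        self.tried = 0  # features evaluated so far

    def score(self):
        # Ratio from the slide: (featuresAdded + 1) / (featuresTried + 1)
        return (self.added + 1) / (self.tried + 1)

def run_streams(streams, is_predictive):
    """Pull features from the best-scoring stream until one stream is empty."""
    selected, seen = [], set()
    while all(s.queue for s in streams):
        best = max(streams, key=Stream.score)  # ties go to the earliest stream
        feature = best.queue.popleft()
        if feature in seen:
            continue  # a feature in several streams is evaluated only once
        seen.add(feature)
        best.tried += 1
        if is_predictive(feature):
            best.added += 1
            selected.append(feature)
    return selected
```

The +1 terms give every stream a score of 1 before anything has been tried, so no stream is starved at the start, and a stream that keeps producing accepted features keeps being chosen.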

20
Results
  [Figure: results for the Venue and Citation tasks, with panels for No Clusters and Clusters; plots not preserved in the transcript]

21
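Time Results
  [Figure: time results for the Venue and Citation tasks, with panels for No Clusters and Clusters; plots not preserved in the transcript]

22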
Domain Independent Learning
  • Most citation prediction features are
    research-area generic
  • Can we train a model for one area and test on
    another?

23
Domain Independent Results
  • Used KDD-Cup 2003 data (High Energy Physics
    papers in arXiv)

  Train On   Test On    Accuracy
  CiteSeer   arXiv      92.9
  CiteSeer   CiteSeer   92.6
  arXiv      arXiv      96.0

24
Conclusions
  • Cluster-based features add expressivity, and
    apply to any domain or SRL method
  • Generating queries dynamically can reduce search
    time and increase accuracy

25
Questions?