Learning on the Test Data: Leveraging Unseen Features

Slides: 22
Provided by: valleyl

1
Learning on the Test Data: Leveraging Unseen
Features
  • Ben Taskar, Ming Fai Wong,
    Daphne Koller

2
Introduction
  • Most statistical learning models make the
    assumption that data instances are IID samples
    from some fixed distribution.
  • In many cases, the data are collected from
    different sources, at different times, locations
    and under different circumstances.
  • We usually build a statistical model of features
    under the assumption that future data will
    exhibit the same regularities as the training
    data.
  • In many data sets, however, there are
    scope-limited features whose predictive power is
    only applicable to a certain subset of the data.

3
Examples
  • 1. Classifying news articles chronologically
  • Suppose the task is to classify news articles
    chronologically. New events, people and places
    appear (and disappear) in bursts over time.
  • The training data might consist of articles
    taken over some time period; these are only
    somewhat representative of future articles.
  • The test data may contain some features that
    are not observed in the training data.
  • 2. Classifying customers into categories
  • Our training data might be collected from one
    geographical region, which may not represent
    the distribution in other regions.

4
We can get around this difficulty by
mixing all the examples and selecting the
training and test sets randomly. But this
homogeneity cannot be ensured in real-world tasks,
where only the non-representative training data
is actually available for training. The
test data may contain many features that were
never or only rarely observed in the training data.
These features may still be useful for classification.
For example, in the news article task these local
features might include the names of places or
people currently in the news. In the
customer example, these local features might include
purchases of products that are specific to a
region.
5
Scoped Learning
  • Suppose we want to classify news articles
    chronologically. The phrase "XXX said today"
    might appear in many places in the data, for
    different values of XXX.
  • Such features are called scope-limited
    features or local features.
  • Another example
  • Suppose there are two labels, grain and trade.
    Words like corn or wheat often appear in
    phrases such as "tons of wheat". So we can learn
    that if a word appears in the context "tons of
    xxx", it is likely to be associated with grain.
    If we then find a phrase like "tons of rye" in
    the test data, we can infer that rye has some
    positive interaction with the label grain.
  • Scoped learning is a probabilistic framework
    that combines the traditional IID features with
    scope-limited features.

6
The intuitive procedure for
using the local features is to use the
information from the global (IID) features to
infer the rules that govern the local information
for a particular subset of data. When data
exhibits scope, the authors found significant gains
in performance over traditional models that use
only IID features. All the data instances within
a particular scope exhibit some structural
regularity, and we assume that all future data
will exhibit the same structural regularity.
7
General Framework
  • Notion of scope
  • We assume that data instances are sampled from
    some set of scopes, each of which is associated
    with some data distribution.
  • Different distributions share a probabilistic
    model for some set of global features, but can
    contain a different probabilistic model for a
    scope-specific set of local features.
  • These local features may be rarely or never seen
    in the scopes comprising the training data.

8
  • Let X denote global features, Z denote local
    features, and Y the class variable. For each
    global feature Xi, there is a parameter γi.
    Additionally, for each scope S and each local
    feature Zi, there is a parameter λiS.
  • Then the distribution of Y given all the
    features and weights is
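
The formula itself is missing from this slide. A form consistent with the exponential potentials listed on slide 12 is sketched below. It assumes a binary label Y in {-1, +1}, and it writes \gamma_i for the global weights and \lambda_i^S for the scope-specific local weights; both the label encoding and the symbol names are assumptions, not taken from the slide.

```latex
P(Y = y \mid \mathbf{x}, \mathbf{z}, \gamma, \lambda^{S})
  = \frac{\exp\!\Big( y \big( \sum_i \gamma_i x_i + \sum_i \lambda_i^{S} z_i \big) \Big)}
         {\sum_{y' \in \{-1,+1\}} \exp\!\Big( y' \big( \sum_i \gamma_i x_i + \sum_i \lambda_i^{S} z_i \big) \Big)}
```

Under this encoding the expression is a logistic function of the combined global and local score.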

9
Probabilistic model
  • We assume that the global weights can be learned
    from training data, so their values are fixed
    when we encounter a new scope. The local feature
    weights are unknown and can be treated as hidden
    variables in the graphical model.
  • Idea
  • We use the evidence from global features about
    the labels of some of the instances to modify
    our beliefs about the role of the local features
    present in these instances, so that they are
    consistent with the labels. By learning about
    the roles of these features, we can then
    propagate this information to improve accuracy
    on instances that are harder to classify using
    global features alone.

10
  • To implement this idea, we define a joint
    distribution over λS and y1, . . . , ym.
  • Why use Markov random fields?
  • Here the associations between the variables are
    correlational rather than causal. Markov random
    fields are used to model spatial interactions or
    interacting features.

11
Markov Network
  • Let V = (Vd, Vc) denote a set of random
    variables, where Vd are discrete and Vc are
    continuous variables.
  • A Markov network over V defines a joint
    distribution over V, assigning a density over Vc
    for each possible assignment vd to Vd.
  • A Markov network M is an undirected graph whose
    nodes correspond to V.
  • It is parameterized by a set of potential
    functions f1(C1), . . . , fl(Cl) such that each
    C ⊆ V is a fully connected subgraph, or clique,
    in M, i.e., each pair Vi, Vj ∈ C is connected by
    an edge in M.
  • Here we assume that each f(C) is a
    log-quadratic function.
  • The Markov network then represents the
    distribution
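
The equation did not survive on this slide. The standard Markov network factorization, written with the potentials f_1, ..., f_l defined above, is given here as a reconstruction; the normalizer Z sums over the discrete variables and integrates over the continuous ones.

```latex
P(\mathbf{v}) = \frac{1}{Z} \prod_{k=1}^{l} f_k(\mathbf{c}_k),
\qquad
Z = \sum_{\mathbf{v}_d} \int \prod_{k=1}^{l} f_k(\mathbf{c}_k) \, d\mathbf{v}_c
```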

12
  • In our case the log-quadratic model consists of
    three types of potentials.
  • 1) f(γi, Yj, Xij) = exp(γi Yj Xij)
  • relates each global feature Xij in instance j
    to its weight γi and the class variable Yj of
    that instance.
  • 2) f(λi, Yj, Zij) = exp(λi Yj Zij)
  • relates the local feature Zij to its weight λi
    and the label Yj.
  • 3) Finally, as the local feature weights are
    assumed to be hidden, we introduce a prior over
    their values: a Gaussian potential f(λi)
    centered at a prior mean µi.
  • Overall, our model specifies a joint distribution
    as follows
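
The prior and the joint distribution are missing from the transcript. The sketch below is one reading that agrees with the two potentials above and with the prior mean \mu_i used later in the talk. The Gaussian shape follows from the log-quadratic assumption; the shared variance \sigma^2 and the symbols \gamma and \lambda are assumptions.

```latex
f(\lambda_i) = \exp\!\left( -\frac{(\lambda_i - \mu_i)^2}{2\sigma^2} \right)

P(\lambda, y_1, \dots, y_m \mid \mathbf{x}, \mathbf{z}, \gamma)
  \;\propto\; \prod_i f(\lambda_i)
  \prod_{j=1}^{m} \Big[ \prod_i \exp(\gamma_i \, y_j \, x_{ij})
                        \prod_i \exp(\lambda_i \, y_j \, z_{ij}) \Big]
```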

13
Markov network for two instances, two global
features and three local features
14
  • The graph can be simplified further when we
    account for variables whose values are fixed.
  • The global feature weights are learned from the
    training data, so their values are fixed, and
    we also know all the feature values.
  • The resulting Markov network is shown below
    (assuming that the instance (x1, z1, y1) contains
    the features Z1 and Z2, and the instance (x2, z2,
    y2) contains the features Z2 and Z3).
  • [Figure: Markov network over the labels Y1, Y2
    and the local feature weights λ1, λ2, λ3]

15
  • This can be reduced further. When Zij = 0, there
    is no interaction between Yj and the weight λi.
  • In this case we can simply omit the edge between
    λi and Yj.
  • The resulting Markov network is shown below.
  • [Figure: reduced Markov network with edges
    Y1-λ1, Y1-λ2, Y2-λ2 and Y2-λ3]

16
  • In this model, we can see that the labels of all
    of the instances are correlated with the local
    feature weights of the features they contain,
    and thereby with each other. Thus, for example,
    if we obtain evidence (from global features)
    about the label Y1, it would change our posterior
    beliefs about the local feature weight λ2, which
    in turn would change our beliefs about the label
    Y2. Thus, by running probabilistic inference
    over this graphical model, we obtain updated
    beliefs both about the local feature weights and
    about the instance labels.
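
The propagation described above can be checked numerically on the two-instance example. The following is a minimal sketch, not the authors' implementation. It assumes labels in {-1, +1}, a Gaussian prior with mean `mu[i]` and shared variance `sigma2` on each local weight, and it replaces approximate inference with exact enumeration, which is only feasible for a handful of instances. With a Gaussian prior the local weights integrate out in closed form, leaving a term `mu[i] * s + 0.5 * sigma2 * s**2` per local feature, where `s` is the label-weighted sum of that feature's values. The function name and arguments are hypothetical.

```python
import itertools
import math


def label_marginals(global_scores, Z, mu, sigma2):
    """Return P(Y_j = +1) for every instance j, by exact enumeration.

    global_scores[j] is the fixed global evidence sum_i gamma_i * x_ij.
    Z[j][i] is the value of local feature i in instance j.
    mu[i] and sigma2 define the assumed Gaussian prior on local weight i.
    """
    m = len(global_scores)
    weights = {}
    for y in itertools.product((-1, 1), repeat=m):
        # Contribution of the global potentials.
        log_w = sum(y[j] * global_scores[j] for j in range(m))
        # Contribution of each local weight after integrating it out.
        for i in range(len(mu)):
            s = sum(y[j] * Z[j][i] for j in range(m))
            log_w += mu[i] * s + 0.5 * sigma2 * s * s
        weights[y] = math.exp(log_w)
    total = sum(weights.values())
    return [
        sum(w for y, w in weights.items() if y[j] == 1) / total
        for j in range(m)
    ]


# Instance 1 contains Z1 and Z2; instance 2 contains Z2 and Z3.
Z = [[1, 1, 0], [0, 1, 1]]
mu = [0.0, 0.0, 0.0]

# No global evidence for either label.
p_no_evidence = label_marginals([0.0, 0.0], Z, mu, sigma2=1.0)
# Strong global evidence that Y1 = +1, still none for Y2.
p_with_evidence = label_marginals([3.0, 0.0], Z, mu, sigma2=1.0)
```

Without evidence both marginals sit at 0.5. Once the global features favor Y1 = +1, the marginal for Y2 also moves above 0.5, even though instance 2 has no global evidence of its own: the shared weight of Z2 carries the information.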

17
Learning the Model
  • Learning Global Feature Weights
  • In this case we simply learn their parameters
    from the training data, using standard logistic
    regression. Maximum-likelihood (ML) estimation
    finds the weights γ that maximize the conditional
    likelihood of the labels given the global
    features.
  • Learning Local Feature Distributions
  • We can exploit patterns such as the "tons of
    xxx" context by learning a model that predicts
    the prior of the local feature weights using
    meta-features (features of features). More
    precisely, we learn a model that predicts the
    prior mean µi for λi from some set of
    meta-features mi. As our predictive model for
    the mean µi we choose a linear regression
    model, setting
  • µi = w · mi.
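
Both learning steps can be sketched with NumPy. This is an illustration under assumptions, not the authors' code: labels are encoded as -1/+1 so that P(y | x) is a logistic function of 2 * y * (γ · x), the data are synthetic, and the regression targets for the meta-feature model are taken to be weights already fitted on training features. The function names and the toy meta-feature ("appears after 'tons of'") are hypothetical.

```python
import numpy as np


def fit_global_weights(X, y, steps=2000, lr=0.5):
    """Maximum-likelihood logistic regression by gradient ascent (y in {-1, +1})."""
    gamma = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = 2.0 * y * (X @ gamma)
        # Gradient of sum_j log sigmoid(margins_j) with respect to gamma.
        grad = X.T @ (2.0 * y / (1.0 + np.exp(margins)))
        gamma += lr * grad / len(y)
    return gamma


def fit_meta_regression(M, fitted_weights):
    """Least-squares fit of w in mu_i = w . m_i (rows of M are meta-feature vectors)."""
    w, *_ = np.linalg.lstsq(M, fitted_weights, rcond=None)
    return w


# Synthetic training data for the global weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_gamma = np.array([1.0, -1.0, 0.5])
p_pos = 1.0 / (1.0 + np.exp(-2.0 * (X @ true_gamma)))
y = np.where(rng.random(500) < p_pos, 1.0, -1.0)
gamma_hat = fit_global_weights(X, y)

# Toy meta-features: [bias, appears after "tons of"].
M = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
fitted = np.array([1.2, 0.8, -0.1, 0.1])  # weights of four training features
w = fit_meta_regression(M, fitted)

# Prior mean for an unseen word that appears after "tons of".
mu_new = np.array([1.0, 1.0]) @ w
```

In the toy data the two features seen after "tons of" have an average weight of 1.0 and the other two average 0.0, so the unseen word receives a prior mean of 1.0.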

18
Using the model
  • Step 1
  • Given a training set, we first learn the
    model. In the training set, the local and
    global features are treated identically. When
    applying the model to the test set, however, our
    first decision is to determine the set of local
    and global features.
  • Step 2
  • Our next step is to generate the Markov
    network for the test set. Probabilistic inference
    over this model infers the effect of local
    features.
  • Step 3
  • We use Expectation Propagation for
    inference. It maintains approximate beliefs
    (marginals) over nodes of the Markov network and
    iteratively adjusts them to achieve local
    consistency.
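
The slides do not say how the split in Step 1 is made. One simple possibility, shown here purely as an assumption, is to call a feature global when it occurs in at least `min_train_count` training documents, and local when it occurs in the test set but is rare or absent in training. The function name and the threshold are hypothetical.

```python
from collections import Counter


def split_features(train_docs, test_docs, min_train_count=2):
    """Split features into global (frequent in training) and local (test-only or rare)."""
    # Count the number of training documents that contain each feature.
    train_counts = Counter(f for doc in train_docs for f in set(doc))
    global_features = {f for f, c in train_counts.items() if c >= min_train_count}
    test_features = {f for doc in test_docs for f in doc}
    local_features = test_features - global_features
    return global_features, local_features


train_docs = [
    {"tons", "wheat"},
    {"tons", "corn"},
    {"trade", "tariff"},
    {"trade", "tons"},
]
test_docs = [{"tons", "rye"}, {"trade", "rye"}]

global_features, local_features = split_features(train_docs, test_docs)
```

Here "tons" and "trade" are frequent enough in training to count as global, while "rye" appears only in the test documents and is treated as local.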

19
Experimental Results
  • Reuters
  • The Reuters news articles data set contains a
    substantial number of documents hand-labeled
    into grain, crude, trade, and money-fx.
  • Using this data set, six experimental setups are
    created, by using all possible pairings of
    categories from the four categories chosen.
  • The resulting sequence is divided into nine time
    segments with roughly the same number of
    documents in each segment.
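
The counts in this setup are easy to verify: four categories yield six unordered pairs. The sketch below enumerates the pairings and cuts an ordered list into nine near-equal segments. The helper name and the placeholder document list are hypothetical, and the real experiments may have split the data differently.

```python
from itertools import combinations

categories = ["grain", "crude", "trade", "money-fx"]
setups = list(combinations(categories, 2))  # all unordered category pairs


def split_into_segments(docs, n_segments=9):
    """Cut an ordered list into n_segments contiguous, near-equal pieces."""
    base, extra = divmod(len(docs), n_segments)
    segments, start = [], 0
    for k in range(n_segments):
        # The first `extra` segments get one additional document.
        end = start + base + (1 if k < extra else 0)
        segments.append(docs[start:end])
        start = end
    return segments


docs_in_time_order = list(range(100))  # placeholder for chronologically ordered documents
segments = split_into_segments(docs_in_time_order)
```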

21
  • WebKB2
  • This data set consists of hand-labeled web
    pages from the Computer Science department web
    sites of four schools: Berkeley, CMU, MIT and
    Stanford. The pages are categorized into
    faculty, student, course and organization.
  • Six experimental setups are created by using
    all possible pairings of the four categories.