1
Boosting Unsupervised Language Acquisition with ADIOS
Eytan Ruppin
Schools of Computer Science & Medicine, Tel-Aviv University
December 2006
http://www.tau.ac.il/ruppin
2
Overview
  • An overview of ADIOS
  • The ConText algorithm
  • Monty Python ("something completely different"): MEX in the service of mankind..

3
Unsupervised Learning of Natural Languages - ADIOS
Zach Solan
School of Physics and Astronomy, Tel-Aviv University
http://www.tau.ac.il/zsolan
Zach Solan, David Horn, Eytan Ruppin (Tel Aviv University); Shimon Edelman (Cornell)
4
Previous work
  • Probabilistic Context-Free Grammars
  • Supervised induction methods
  • Little work on raw data
  • Mostly work on artificial CFGs
  • ABL, EMILE, etc. (unsupervised parsing): do not examine the generative capacity of the grammar

5
Our goal
  • Given a corpus of raw text separated into
    sentences, we want to derive a specification of
    the underlying grammar
  • This means we want to be able to:
  • Create new unseen grammatically correct sentences
    (Precision)
  • Accept new unseen grammatically correct sentences
    and reject ungrammatical ones (Recall)

6
A broad scope..
  • Natural language sentence
  • A musical piece
  • A biological sequence (DNA, proteins, . . .)
  • Successive actions of a WEB user
  • Time series

7
ADIOS in outline
  • Composed of three main elements
  • A representational data structure
  • A segmentation criterion (MEX)
  • A generalization ability
  • We will consider each of these in turn

8
The Model: a graph representation with words as vertices and sentences as paths.
Is that a dog?
Is that a cat?
Where is the dog?
And is that a horse?
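The graph on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the BEGIN/END boundary vertices are an assumption.

```python
from collections import defaultdict

# A directed pseudograph: vertices are words, sentences are paths.
# Edge counts record how many sentence paths traverse each transition.
def build_graph(sentences):
    edges = defaultdict(int)          # (word, next_word) -> traversal count
    paths = []
    for s in sentences:
        path = ["BEGIN"] + s.split() + ["END"]   # hypothetical boundary markers
        paths.append(path)
        for a, b in zip(path, path[1:]):
            edges[(a, b)] += 1
    return edges, paths

corpus = [
    "is that a dog",
    "is that a cat",
    "where is the dog",
    "and is that a horse",
]
edges, paths = build_graph(corpus)
# The shared sub-path "is that a" is traversed by three of the four sentences:
print(edges[("is", "that")])  # 3
```

Shared sub-paths like "is that a" become heavily traversed edge runs, which is exactly what the pattern-detection step exploits.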
9
ADIOS in outline
  • Composed of three main elements
  • A representational data structure
  • A segmentation criterion (MEX)
  • A generalization ability

10
Detecting significant patterns
  • Identifying patterns becomes easier on a graph
  • Sub-paths are automatically aligned

11
Motif EXtraction
12
Rewiring the graph
Once a pattern is identified as significant, the sub-paths it subsumes are merged into a new vertex and the graph is rewired accordingly. Repeating this process leads to the formation of complex, hierarchically structured patterns.
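The rewiring step can be sketched as follows. This is an illustrative Python fragment assuming sentences are stored as word tuples; it is not the authors' code.

```python
# Collapse every occurrence of a significant pattern into a single new
# vertex, so subsequent passes treat the pattern as one unit.
def rewire(paths, pattern):
    new_vertex = "_".join(pattern)    # hypothetical naming scheme
    n = len(pattern)
    rewired = []
    for path in paths:
        out, i = [], 0
        while i < len(path):
            if tuple(path[i:i + n]) == pattern:
                out.append(new_vertex)   # merge the matched sub-path
                i += n
            else:
                out.append(path[i])
                i += 1
        rewired.append(tuple(out))
    return rewired

paths = [("is", "that", "a", "dog"), ("where", "is", "the", "dog")]
print(rewire(paths, ("is", "that", "a")))
# [('is_that_a', 'dog'), ('where', 'is', 'the', 'dog')]
```

Running this repeatedly on newly found patterns yields the hierarchical structure described above, since new vertices can themselves participate in later patterns.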
13
ADIOS in outline
  • Composed of three main elements
  • A representational data structure
  • A segmentation criterion (MEX)
  • A generalization ability

14
Generalization: defining an equivalence class
show me flights from philadelphia to san francisco on wednesdays
list all flights from boston to san francisco with the maximum number of stops
may i see the flights from denver to san francisco please
show flights from dallas to san francisco
15
Generalization
16
Context-sensitive generalization
  • Slide a context window of size L across the current search path
  • For each 1 ≤ i ≤ L:
  • look at all paths that are identical with the search path for 1 ≤ k ≤ L, except for k = i
  • Define an equivalence class containing the nodes at index i of these paths
  • Replace the ith node with the equivalence class
  • Find significant patterns using the MEX criterion
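The window-based generalization above can be sketched like this. It is an illustrative Python fragment; the tuple representation of paths is an assumption.

```python
# For a context window over the search path, position i is generalized
# by collecting the words of all paths that match the window everywhere
# except at position i.
def equivalence_class(paths, window, i):
    cls = set()
    for path in paths:
        for start in range(len(path) - len(window) + 1):
            seg = path[start:start + len(window)]
            # match the window at every index k except the slot k == i
            if all(a == b for k, (a, b) in enumerate(zip(seg, window)) if k != i):
                cls.add(seg[i])
    return cls

paths = [
    ("is", "that", "a", "dog"),
    ("is", "that", "a", "cat"),
    ("is", "that", "a", "horse"),
]
# Generalize position 3 of the window ("is", "that", "a", "dog"):
print(equivalence_class(paths, ("is", "that", "a", "dog"), 3))
# -> the equivalence class {dog, cat, horse}
```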

17
Bootstrapping
Generalized search path II:
What are the cheapest {flight | flights | airfare | fare} from {boston | philadelphia | denver} to {denver | philadelphia | dallas} that stop in atlanta
18
Bootstrapping
19
The ADIOS algorithm
  • Initialization: load all data into a pseudograph
  • Until no more patterns are found:
  • For each path P:
  • Create equivalence-class candidates
  • Detect significant patterns using MEX
  • If found, add the best new pattern and its associated equivalence classes, and rewire the graph

20
The ATIS experiments
  • ATIS-NL is a 13,043-sentence corpus of natural language
  • Transcribed phone calls to an airline reservation
    service
  • ADIOS was trained on 12,700 sentences of ATIS-NL
  • The remaining 343 sentences were used to assess
    recall
  • Precision was determined with the help of 8
    graduate students from Cornell University

21
The ATIS experiments
  • ADIOS performance scores:
  • Single learner: Recall 12%, approaching 40% with multiple learners (70% with semantic bootstrapping)
  • Precision: 50%
  • For comparison, the human-crafted ATIS-CFG reached:
  • Recall: 45%
  • Precision: < 1% (!)

22
English as a Second Language test
  • A single instance of ADIOS was trained on the CHILDES corpus
  • 120,000 sentences of transcribed child-directed speech
  • Subjected to the Göteborg multiple-choice ESL test
  • 100 sentences, each with an open slot
  • Pick the correct word out of three choices
  • ADIOS answered 60% of the questions correctly
  • The performance of an average ninth-grader

23
Language dendrogram
24
Language modeling: Perplexity
ADIOS achieves lower perplexity than the standard benchmarks (McCandless & Glass 1993; Chelba 2001; Kermorvant 2004).
25
So far so good..
26
Shortcomings of ADIOS
  • Problem I: Equivalence classes may be declared without sufficient evidence
  • That two words are contained in an identical 5-word window is often enough for them to be declared interchangeable
  • Negatively impacts the learned grammar's precision:
    "Brian likes to go fishing" / "John and Brian like to go fishing"

27
Shortcomings of ADIOS
  • Problem II: The segmentation criterion is decoupled from the search for equivalence classes, and from the potential contribution of patterns to generalization.
  • (Figure: overlapping paths containing "from baltimore", "san francisco to", and "atlanta to")
28
The ConText Algorithm
Ben Sandbank
School of Computer Science, Tel-Aviv University
http://www.cs.tau.ac.il/sandban
Eytan Ruppin (Tel Aviv University), Shimon Edelman (Cornell)
29
The ConText Algorithm
  • Main concepts:
  • Maintain the ADIOS principle of context-dependent generalization, but:
  • Utilize distributional clustering concepts to
    safely detect interchangeability (Problem I)
  • Declare constituents only when necessary for
    generalization (Problem II)

30
Algorithm outline
  • Calculate the context distribution of all word sequences that occur more than K times
  • the context of a sequence is the l-word neighborhood on both of its sides
  • Cluster the sequences according to their distance in context space
  • Find the optimal alignment within each cluster
  • yielding interchangeable words and expressions (note that this process obviates ADIOS's segmentation process)
  • Merge the sequences within each cluster to yield a single generalized sequence
  • Load the resulting equivalence classes back into the corpus in a context-sensitive way, using the remaining terminal words
  • The collection of sentences and the hierarchy of equivalence classes obtained forms a context-free grammar

31
Clustering procedure
  • Each word sequence is represented as a high-dimensional vector constructed in the following way:
  • For each sequence, create a context vector on its left and one on its right
  • The ith entry in the left (right) context vector contains the number of times the corresponding l-word sequence appeared to the immediate left (right) of the represented sequence
  • The two vectors are concatenated to form one long vector
  • The distance metric used between two sequences is the angle between their corresponding vectors
  • No smoothing, TF-IDF or the like is performed
  • A parameter determines the minimum distance between two sequences that allows them to be clustered together
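The representation and distance described above can be sketched like this. It is an illustrative Python fragment assuming l = 1 (single-word contexts), not the ConText implementation; the tagged-key encoding of the concatenated vector is an assumption.

```python
import math
from collections import Counter

# Left- and right-context counts of a sequence, concatenated into one
# sparse vector (keys tagged "L"/"R" stand in for the two halves).
def context_vector(corpus, seq):
    vec = Counter()
    n = len(seq)
    for sent in corpus:
        for i in range(len(sent) - n + 1):
            if tuple(sent[i:i + n]) == seq:
                if i > 0:
                    vec[("L", sent[i - 1])] += 1   # left neighbor
                if i + n < len(sent):
                    vec[("R", sent[i + n])] += 1   # right neighbor
    return vec

# Distance between two sequences: the angle between their vectors.
def angle(u, v):
    dot = sum(u[k] * v[k] for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return math.acos(dot / (nu * nv))

corpus = [
    ("flights", "from", "boston", "to", "denver"),
    ("flights", "from", "denver", "to", "boston"),
    ("fares", "from", "boston", "to", "dallas"),
]
vb = context_vector(corpus, ("boston",))
vd = context_vector(corpus, ("denver",))
vt = context_vector(corpus, ("to",))
# City names share contexts, so their angle is smaller than to an
# unrelated word:
print(angle(vb, vd) < angle(vb, vt))  # True
```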

32
Alignment procedure
  • Aims at finding interchangeable words and word sequences between every pair of alignable sequences
  • A straightforward extension of dynamic programming, where in each step we look for the best alignment of the (i, j+1) prefixes, given the best alignments up to the (i, j) prefixes (yielding O(n^4) complexity)
  • Input: clusters of word sequences
  • Output:
  • A substitution-cost matrix between subsequences
  • which may each contain more than one word
  • Insertions and deletions are currently not supported
  • The optimal pairwise alignments within each cluster

33
Alignment procedure
  • Initialize the substitution-cost matrix:
  • The cost of substituting a subsequence of i words with a subsequence of j words is min(i, j)
  • unless they're identical, in which case the cost is zero
  • Iterate until the alignments converge:
  • Find the optimal pairwise alignment between each pair of sequences contained in the same cluster
  • For each substitution used in an optimal alignment, multiply its cost in the substitution matrix by alpha < 1
  • This allows sharing information between clusters
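The cost initialization and decay can be sketched as follows. This is illustrative Python; the dictionary-backed cost table and the alpha value are assumptions.

```python
# Substitution costs between word-tuple subsequences: min(i, j) for
# lengths i and j, zero if identical; costs are stored sparsely and
# decayed once a substitution is used in an optimal alignment.
costs = {}

def cost(a, b):
    if a == b:
        return 0.0
    return costs.get((a, b), float(min(len(a), len(b))))

def reinforce(a, b, alpha=0.9):
    # called for each substitution used in an optimal alignment;
    # repeated use makes the substitution progressively cheaper
    costs[(a, b)] = cost(a, b) * alpha

print(cost(("san", "francisco"), ("dallas",)))  # 1.0
reinforce(("san", "francisco"), ("dallas",))
print(cost(("san", "francisco"), ("dallas",)))  # 0.9
```

Because the cost table is shared by all pairwise alignments, a substitution discovered in one cluster lowers its cost everywhere, which is how information is shared between clusters.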

34
Toy Sample - Resulting Sentence
I'd like to order a flight
35
Results
  • Tested on ATIS-2
  • A natural language corpus of transcribed
    spontaneous speech
  • Contains 11,670 sentences
  • 11,370 were used for training, 300 for recall
    testing
  • Two main parameters affect the quality of the results:
  • D - the minimum distance for two sequences to become clustered
  • K - the minimum number of times a sequence must appear in the corpus to be considered for clustering

36
Results (K = 25)
ADIOS results: Recall 0.12, Precision 0.5
37
Results (D = 0.7)
  • Surprisingly, there is no clear connection between K and the recall/precision rates
  • Although a general precision trend is discernible

38
Conclusions
  • On ATIS-2, ConText is superior in performance to ADIOS
  • The use of a distributional clustering method provides significantly higher precision
  • Recall is marginally improved
  • Additional iterations (after rewiring) are expected to improve recall further, but when naively implemented they hamper precision
  • Introducing asynchronous dynamics in the alignment procedure may enable higher recall through multiple learners
  • The quality of the results seems to depend more on D than on K

39
Functional representation of enzymes by specific
peptides
  • David Horn
  • Tel Aviv University
  • http://horn.tau.ac.il

in collaboration with Vered Kunik, Yasmine Meroz,
Zach Solan, Ben Sandbank, Eytan Ruppin
40
Application to Biology
  • Vertices of the graph: 4 or 20 letters
  • Paths: gene sequences, protein sequences
  • Transition probabilities on the graph are proportional to the number of paths
  • Trial-path testing of transition probabilities to extract motifs
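The path-based transition probabilities can be sketched as follows. This is illustrative Python, not the MEX implementation; MEX uses sharp drops in these probabilities along a trial path to delimit motifs.

```python
from collections import defaultdict

# P(b | a): the fraction of path transitions out of vertex a that go
# to vertex b, estimated from the number of paths using each edge.
def transition_probs(paths):
    count = defaultdict(int)   # (a, b) -> number of paths using edge a->b
    out = defaultdict(int)     # a -> total transitions out of a
    for path in paths:
        for a, b in zip(path, path[1:]):
            count[(a, b)] += 1
            out[a] += 1
    return {(a, b): c / out[a] for (a, b), c in count.items()}

paths = [
    ("is", "that", "a", "dog"),
    ("is", "that", "a", "cat"),
    ("where", "is", "the", "dog"),
]
p = transition_probs(paths)
print(p[("that", "a")])  # 1.0 -- "that" is always followed by "a"
```

For letter alphabets of size 4 (DNA) or 20 (proteins), the same estimate applies with sequences in place of sentences.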

41
Enzyme Function
  • The functionality of an enzyme is determined according to its EC number
  • Classification hierarchy (Webb, 1992)
  • EC number: n1.n2.n3.n4 (a unique identifier)

n1.n2 - sub-class / 2nd level
n1.n2.n3 - sub-subclass / 3rd level
n1.n2.n3.n4 - precise enzymatic activity
42
Methods
  • MEX motifs are extracted from the enzyme data
  • A linear SVM is applied to evaluate the predictive power of the MEX motifs
  • Enzyme sequences are randomly partitioned into a training set and a test set (75-25)
  • 16 2nd-level classification tasks
  • 32 3rd-level classification tasks
  • Performance measurement: Jaccard score
  • The train-test procedure was repeated 40 times to gather sufficient statistics
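The Jaccard score used here is the intersection-over-union of the predicted and actual positive sets, sketched for a single classification task (the enzyme identifiers are made up for illustration):

```python
# Jaccard score: TP / (TP + FP + FN), i.e. |intersection| / |union|
# of the predicted and true positive sets.
def jaccard(predicted, actual):
    predicted, actual = set(predicted), set(actual)
    union = len(predicted | actual)
    return len(predicted & actual) / union if union else 1.0

print(jaccard({"e1", "e2", "e3"}, {"e2", "e3", "e4"}))  # 0.5
```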

43
Assessing Classification Results
  • Classification performance is compared with that of two other methods:
  • Smith-Waterman algorithm
  • Identification of Common Molecular Subsequences, JMB, 1981
  • p-value of the pairwise similarity score
  • SVMProt
  • Cai et al., Nucleic Acids Research, 2003
  • physicochemical properties of amino acids
  • Both methods require additional information apart from the raw sequential data

44
2nd Level Classification
Jaccard Score
α < 0.01
EC Subclass
45
3rd Level Classification
Jaccard Score
α < 0.01
EC Sub-subclass
46
EC hierarchy and specific peptides
47
Questions
  • Do SPs have biological relevance?
  • Are they accounted for in the literature?
  • Generalization of the classification
  • The question of bias
  • Remote homology

48
Examples of two level 4 classes
5.1.3.20 - proteobacteria; 5.1.3.2 - bacteria and eukaryotes
49
P67910
1, 2, 3 are active sites (S, Y, K); 4: RYFNV
50
(No Transcript)
51
(No Transcript)
52
Generalization: double annotation
  • Doubly annotated enzymes were not included in the
    training set (the data)
  • 260 such enzymes exist.
  • SPs have 421 hits on 69 of them; most agree with the annotations, while 30 disagree, leading to new predictions

53
Remote homology
Examples taken from Rost (2002), pointing out the disparity between sequence identity and function. Here, function is verified by SPs.
54
Summary
  • Specific peptides (average length 8.4), extracted by MEX, are useful for the functional representation of most enzymes (average length 380)
  • SPs are deterministic motifs conserved by evolution, and are hence expected to have biological significance
  • SP4s are observed to cover active and binding sites
  • Some other SPs seem to be important, judging by their spatial locations
  • SPs are important for remote homologies