1
Automatic Derivation of Thesauri (Thesauri
Construction)
  • Source
  • "Use of Syntactic Context to Produce Term Association Lists for Text Retrieval" by Gregory Grefenstette (Part I)
  • "Building a Lexical Domain Map From Text Corpora" by Tomek Strzalkowski (Part II)

2
  • With the current availability of machine-readable dictionaries, large corpora of natural language text, and robust syntactic processors, there is renewed interest in extracting knowledge automatically from large quantities of text.
  • As an operational definition, two words can be said to be related if they appear in the same context.
  • The document co-occurrence hypothesis is that two words appearing in the same document share some semantic relatedness. While this is certainly true, document co-occurrence is only one rough measure of a word's context.
  • Document co-occurrence suffers from two problems:
  • granularity: every word in the document is considered potentially related to every other word, no matter what the distance between them.
  • co-occurrence: for two words to be seen as similar they must physically appear in the same document.

3
  • New Approach
  • A technique for extracting similarity lists for
    words in a corpus for which no manual indexing
    or relevance measures might exist
  • The extraction uses only coarse syntactic
    analysis and no domain knowledge
  • Has finer granularity, recognizes words across
    different documents
  • Allows for query expansion

4
Contexts
  • Basic premise: words found in the same context tend to share semantic similarity
  • Use of syntactic analysis opens up a much wider range of contexts than simple document co-occurrence or co-occurrence within a window of words (nouns)
  • ADJ, NN: when a word is modified by an adjective or another noun
  • most valuable player → <player, valuable> ADJ
  • NNPREP: when a word is modified by a noun via a preposition
  • till the end of the year → <end, year> NNPREP
  • SUBJ, DOBJ, IOBJ: when a word appears as the subject, direct or indirect object of a verb. We take a simplified view of indirect objects, retaining the first prepositional phrase after a verb as its indirect object.
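As an illustration of these syntactic contexts, here is a minimal Python sketch (not the actual extraction pipeline) that pulls ADJ and NN pairs out of a POS-tagged noun phrase; the tag names and the head-noun-last assumption are illustrative:

```python
def extract_np_contexts(tagged_np):
    """Given a noun phrase as (word, tag) pairs with the head noun last,
    emit (head, modifier, relation) triples for ADJ and NN contexts."""
    *modifiers, (head, _) = tagged_np
    pairs = []
    for word, tag in modifiers:
        if tag == "ADJ":                     # adjective modifying the head
            pairs.append((head, word, "ADJ"))
        elif tag == "NOUN":                  # noun adjunct modifying the head
            pairs.append((head, word, "NN"))
    return pairs

print(extract_np_contexts([("most", "ADV"), ("valuable", "ADJ"), ("player", "NOUN")]))
# -> [('player', 'valuable', 'ADJ')]
```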

5
Morphological Analysis-Step I
  • (continued)
  • if the giants had won the nl west → <giant, win> SUBJ
  • someone could suggest a better formula → <suggest, formula> DOBJ
  • reaching on an error → <reach, error> IOBJ
  • The analysis provides the grammatical categories that every word may play (dictionary look-up, algorithms); the CLARIT package is employed for this purpose.
  • Original: correlation between maternal and fetal plasma levels of glucose and free fatty acids.
  • After MA: correlation | sn correlation, between | prep between, maternal | adj maternal, and | cnj and, fetal | adj fetal, ..., levels | pn level, vt-pres-sg3 level

6
Syntactic Disambiguation-Step II
  • Each word needs to be grammatically disambiguated
  • Assign a single grammatical category to each word
  • A number of robust grammar-based or statistical methods exist
  • A disambiguator using a time-linear stochastic grammar based on Brown corpus frequencies is used
  • between | prep between
  • maternal | adj maternal
  • levels | pn level

7
Noun and Verb Phrases-Step III
  • Take the disambiguated text and divide it into verb and noun phrases
  • The method employs lists of grammatical values which can start and end a verb or noun phrase, and precedence matrices describing legal continuations of verb and noun phrases.
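The chunking idea above can be sketched as follows; the category sets and the precedence table here are toy stand-ins for the actual start/end lists and matrices:

```python
# Categories that may start a noun phrase, and a precedence table of
# legal continuations (previous tag -> allowed next tags). Both are
# illustrative, not the lists used in the original system.
NP_START = {"DET", "ADJ", "NOUN"}
NP_CONTINUE = {
    "DET": {"ADJ", "NOUN"},
    "ADJ": {"ADJ", "NOUN"},
    "NOUN": {"NOUN"},
}

def chunk_noun_phrases(tagged):
    """Scan (word, tag) pairs left to right, growing a noun phrase while
    the precedence table allows the next tag."""
    phrases, current = [], []
    for word, tag in tagged:
        if current and tag in NP_CONTINUE.get(current[-1][1], set()):
            current.append((word, tag))
        else:
            if current:
                phrases.append(current)
            current = [(word, tag)] if tag in NP_START else []
    if current:
        phrases.append(current)
    return phrases

sent = [("the", "DET"), ("maternal", "ADJ"), ("plasma", "NOUN"),
        ("levels", "NOUN"), ("of", "PREP"), ("glucose", "NOUN")]
print(chunk_noun_phrases(sent))
```

The prepositional attachment (connecting "levels" to "glucose" over "of") would then be handled by the right-to-left scan of the next step.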

8
Extracting Structural Relations-Step IV
  • Once each sentence is divided into phrases, intra- and inter-phrase structural relations are extracted
  • Noun phrases are scanned from left to right, hooking up articles, adjectives and modifier nouns to head nouns
  • Noun phrases are scanned from right to left, connecting nouns over prepositions
  • Starting from verb phrases, the text is scanned to the left for an unconnected head noun, which becomes the subject, and likewise to the right of the verb for the object
  • This technique does not address the finer points of syntactic analysis, such as anaphora resolution, multi-word verbs, garden paths, etc.

9
Similarity Calculation
  • Each word is considered as an object and its collection of context features as attributes
  • The Tanimoto measure using log-entropy weightings gave the best intuitive results. Each relation pair is given a local weighting of log(frequency + 1), which is multiplied by a global weighting of the attribute involved, using
  • global(att_j) = 1 - Σ_i (P_ij · log(P_ij)) / log(nbrels)
  • where P_ij is the frequency of attribute_j with object_i divided by the number of attributes for object_i, and nbrels is the total number of non-unique term-attribute relations extracted from the corpus.
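A minimal sketch of this log-entropy weighting, assuming relations are given as (object, attribute, frequency) triples and reading "number of attributes for object_i" as the object's total attribute count (an assumption about the normalization):

```python
import math
from collections import defaultdict

def log_entropy_weights(pairs):
    """pairs: list of (object, attribute, frequency) relations.
    Returns the weight log(f + 1) * global(att) for each relation."""
    nbrels = sum(f for _, _, f in pairs)       # total non-unique relations
    total_atts = defaultdict(int)              # total attribute count per object
    for obj, _, f in pairs:
        total_atts[obj] += f
    # global(att) = 1 - sum_i P_ij log(P_ij) / log(nbrels)
    global_w = defaultdict(lambda: 1.0)
    for obj, att, f in pairs:
        p = f / total_atts[obj]                # P_ij as defined above
        if p < 1:                              # p * log(p) = 0 when p == 1
            global_w[att] -= (p * math.log(p)) / math.log(nbrels)
    # local weighting log(f + 1) times the attribute's global weighting
    return {(obj, att): math.log(f + 1) * global_w[att]
            for obj, att, f in pairs}

pairs = [("cell", "ADJ:red", 3), ("cell", "SUBJ:divide", 1), ("blood", "ADJ:red", 2)]
weights = log_entropy_weights(pairs)
```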

10
Similarity Calculation (contd.)
  • Our formula for the weighted Tanimoto similarity measure between two objects obj_m and obj_n, where the sums are over all the unique attributes att, is
  • SIM(obj_m, obj_n) = Σ_att min(weight(obj_m, att), weight(obj_n, att)) / Σ_att max(weight(obj_m, att), weight(obj_n, att))
  • Note that when the weights are restricted to 0 and 1 this formula is equivalent to a binary Tanimoto formula, though this is by no means the only way to generalize the binary formula to the weighted case. This method is used in the evaluation phase (next).
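The weighted Tanimoto measure itself is straightforward to compute; a sketch, assuming each object's weights are stored in an attribute-to-weight map:

```python
def tanimoto(wm, wn):
    """Weighted Tanimoto similarity between two objects given their
    attribute -> weight dictionaries. Missing attributes weigh 0."""
    atts = set(wm) | set(wn)                       # union of unique attributes
    num = sum(min(wm.get(a, 0.0), wn.get(a, 0.0)) for a in atts)
    den = sum(max(wm.get(a, 0.0), wn.get(a, 0.0)) for a in atts)
    return num / den if den else 0.0

a = {"ADJ:red": 1.0, "SUBJ:divide": 0.5}
b = {"ADJ:red": 1.0, "NN:blood": 0.5}
print(tanimoto(a, b))   # 1.0 / 2.0 = 0.5
```

With 0/1 weights this reduces to the binary Tanimoto (Jaccard) ratio |A ∩ B| / |A ∪ B|, as the note above observes.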

11
Evaluation
  • Testing the similarity of words against MED (a database of medical abstracts): 160,000 words → extracting the syntactic contexts of nouns produced 71,000 pairs of words. Of these, there were 53,300 unique pairs, composed of 5,289 unique words using 9,997 unique attributes. The closest terms to each term were calculated from these term-attribute pairs; 684 terms appeared sufficiently frequently. We iteratively expanded the queries by adding in the words closest to any of these 684 terms. By closest, we accepted the word with the smallest distance (0 to 1.0) to the term, or any other word within 0.01 of this distance.

12
Lexical Domain Map
  • A representation R such that, for any text items D1 and D2, R(D1) = R(D2) iff meaning(D1) = meaning(D2)
  • Normalization via:
  • Morphological stemming: retrieving → retriev
  • Lexicon-based normalization: retrieval → retrieve
  • Operator-argument representation of phrases: information retrieval, retrieving of information → retrieve+information
  • Content-based term clustering into synonymy classes and subsumption hierarchies: a take-over is an acquisition
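The first two normalization stages can be sketched with a toy suffix stripper and stem lexicon; both the suffix rules and the lexicon entries are illustrative stand-ins, not the actual system's resources:

```python
# Maps a morphological stem to its canonical lexical form (illustrative).
LEXICON = {"retriev": "retrieve"}

def stem(word):
    """Crude morphological stemming: strip the first matching suffix."""
    for suffix in ("ing", "al", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def normalize(word):
    """Stem, then map the stem to a lexical form if the lexicon knows it."""
    s = stem(word)
    return LEXICON.get(s, s)

print(normalize("retrieving"))   # -> 'retrieve'
print(normalize("retrieval"))    # -> 'retrieve'
```

Both "retrieving" and "retrieval" thus land on the same representation, which is exactly the R(D1) = R(D2) property the slide asks for.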

13
TTP (tagged text parser)
  • Text → NLP → representation → database search
  • NLP: tagger → parser → terms
  • Head-Modifier Structures
  • TTP parse structures are passed to the phrase extraction module, where head-modifier (including predicate-argument) pairs are extracted and collected into occurrence patterns
  • A head noun and its left adjective or noun adjunct
  • A head noun and the head of its right adjunct
  • The main verb of a clause and the head of its object phrase

14
Term Correlations from Text
  • SIM(x1, x2) = Σ_att MIN(W(x1, att), W(x2, att)) / Σ_att MAX(W(x1, att), W(x2, att))
  • with
  • W(x, y) = GEW(x) · log(f_x,y)
  • GEW(x) = 1 + Σ_y ((f_x,y / n_y) · log(f_x,y / n_y)) / log(N)
  • f_x,y stands for the absolute frequency of the pair x,y in the corpus, n_y is the frequency of term y, and N is the number of single-word terms
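A sketch of the GEW weighting under these definitions, with the pairs given as a frequency table; deriving N as the number of distinct terms seen in the table is an assumption made for the example:

```python
import math
from collections import defaultdict

def gew_weights(pairs):
    """pairs: dict mapping head-modifier pairs (x, y) to frequency f_xy.
    Returns W(x, y) = GEW(x) * log(f_xy) for each pair."""
    n = defaultdict(int)                          # n_y: frequency of term y
    for (x, y), f in pairs.items():
        n[y] += f
    N = len({t for xy in pairs for t in xy})      # number of single-word terms
    # GEW(x) = 1 + sum_y (f_xy/n_y) log(f_xy/n_y) / log(N)
    gew = defaultdict(lambda: 1.0)
    for (x, y), f in pairs.items():
        p = f / n[y]
        if 0 < p < 1:                             # p * log(p) = 0 when p == 1
            gew[x] += (p * math.log(p)) / math.log(N)
    return {(x, y): gew[x] * math.log(f) for (x, y), f in pairs.items()}

pairs = {("retrieve", "information"): 4,
         ("store", "information"): 2,
         ("retrieve", "document"): 3}
weights = gew_weights(pairs)
```

SIM(x1, x2) is then the same min/max (Tanimoto-style) ratio over these W values as in Part I.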

15
Global Term Specificity Measure
  • GTS(w) = IC_L(w) · IC_R(w)  if both exist
  •        = IC_R(w)            if only IC_R exists
  •        = IC_L(w)            otherwise
  • where, with n_w, d_w > 0:
  • IC_L(w) = IC(w, _) = n_w / (d_w · (n_w + d_w - 1))
  • IC_R(w) = IC(_, w) = n_w / (d_w · (n_w + d_w - 1))
  • d_w is the dispersion of term w, understood as the number of distinct contexts in which w is found
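Under these definitions, GTS can be sketched directly; the per-side (n_w, d_w) statistics are assumed precomputed elsewhere, and the argument names are illustrative:

```python
def ic(nw, dw):
    """IC = n_w / (d_w * (n_w + d_w - 1)), per the definition above."""
    return nw / (dw * (nw + dw - 1))

def gts(left=None, right=None):
    """left / right are (n_w, d_w) tuples for IC_L and IC_R, or None
    when w does not occur on that side of head-modifier pairs."""
    icl = ic(*left) if left else None
    icr = ic(*right) if right else None
    if icl is not None and icr is not None:
        return icl * icr          # both exist
    return icr if icr is not None else icl

print(gts(left=(10, 4), right=(6, 3)))
```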

16
Conclusion-I
  • The future of Information Retrieval lies in knowledge-based techniques. We have presented here a technique, and mentioned others, which provide a portion of the needed knowledge automatically, without using prior domain knowledge. Our tests show that despite using
  • imperfect morphological analysis
  • imperfect syntactic disambiguation
  • imperfect structural analysis
  • limited contexts
  • imperfectly understood similarity measures
  • useful term similarity lists can still be extracted.

17
Conclusion-II
  • Lexical relations between terms are calculated
    directly from the database and stored in the form
    of a domain map
  • Determines successful matches between documents
    and queries
  • Does not include semantic analysis