1
Automatic Derivation of Thesauri (Thesauri
Construction)
  • Source
  • "Use of Syntactic Context to Produce Term Association Lists for Text Retrieval" by Gregory Grefenstette (Part I)
  • "Building a Lexical Domain Map From Text Corpora" by Tomek Strzalkowski (Part II)

2
  • With the current availability of machine-readable dictionaries, large corpora of natural language text, and robust syntactic processors, there is renewed interest in extracting knowledge automatically from large quantities of text.
  • As an operational definition, two words can be said to be related if they appear in the same context.
  • The document co-occurrence hypothesis is that two words appearing in the same document share some semantic relatedness. While this is certainly true, document co-occurrence is only one rough measure of a word's context.
  • Document co-occurrence suffers from two problems:
  • granularity: every word in the document is considered potentially related to every other word, no matter what the distance between them.
  • co-occurrence: for two words to be seen as similar they must physically appear in the same document.

3
  • New Approach
  • A technique for extracting similarity lists for
    words in a corpus for which no manual indexing
    or relevance measures might exist
  • The extraction uses only coarse syntactic
    analysis and no domain knowledge
  • Has finer granularity, recognizes words across
    different documents
  • Allows for query expansion

4
Contexts
  • Basic premise: words found in the same context tend to share semantic similarity
  • Use of syntactic analysis opens up a much wider range of contexts than simple document co-occurrence or co-occurrence within a window of words (nouns)
  • ADJ, NN: when a word is modified by an adjective or another noun
  • most valuable player → <player, valuable> ADJ
  • NNPREP: when a word is modified by a noun via a preposition
  • till the end of the year → <end, year> NNPREP
  • SUBJ, DOBJ, IOBJ: when a word appears as the subject, direct or indirect object of a verb. We take a simplified view of indirect objects, retaining the first prepositional phrase after a verb as its indirect object.
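As an illustration of these syntactic contexts, here is a minimal Python sketch (not the actual extraction pipeline) that pulls ADJ and NN pairs out of a POS-tagged noun phrase; the tag names and the head-noun-last assumption are illustrative:

```python
def extract_np_contexts(tagged_np):
    """Given a noun phrase as (word, tag) pairs with the head noun last,
    emit (head, modifier, relation) triples for ADJ and NN contexts."""
    *modifiers, (head, _) = tagged_np
    pairs = []
    for word, tag in modifiers:
        if tag == "ADJ":                     # adjective modifying the head
            pairs.append((head, word, "ADJ"))
        elif tag == "NOUN":                  # noun adjunct modifying the head
            pairs.append((head, word, "NN"))
    return pairs

print(extract_np_contexts([("most", "ADV"), ("valuable", "ADJ"), ("player", "NOUN")]))
# -> [('player', 'valuable', 'ADJ')]
```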

5
Morphological Analysis-Step I
  • (continued)
  • if the giants had won the nl west → <giant, win> SUBJ
  • someone could suggest a better formula → <suggest, formula> DOBJ
  • reaching on an error → <reach, error> IOBJ
  • The analysis provides the grammatical categories that every word may play (dictionary look-up, algorithms); the CLARIT package is employed for this purpose.
  • Original: correlation between maternal and fetal plasma levels of glucose and free fatty acids.
  • After MA: correlation | sn correlation, between | prep between, maternal | adj maternal, and | cnj and, fetal | adj fetal, ..., levels | pn level, vt-pres-sg3 level

6
Syntactic Disambiguation-Step II
  • Each word needs to be grammatically disambiguated
  • Assign a single grammatical category to each word
  • A number of robust grammar-based or statistical methods exist
  • A disambiguator using a time-linear stochastic grammar based on Brown corpus frequencies is used
  • between | prep between
  • maternal | adj maternal
  • levels | pn level

7
Noun and Verb Phrases-Step III
  • Take the disambiguated text and divide it into verb and noun phrases
  • The method employs lists of grammatical values which can start and end a verb or noun phrase, and precedence matrices describing legal continuations of verb and noun phrases.
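The chunking idea above can be sketched as follows; the category sets and the precedence table here are toy stand-ins for the actual start/end lists and matrices:

```python
# Categories that may start a noun phrase, and a precedence table of
# legal continuations (previous tag -> allowed next tags). Both are
# illustrative, not the lists used in the original system.
NP_START = {"DET", "ADJ", "NOUN"}
NP_CONTINUE = {
    "DET": {"ADJ", "NOUN"},
    "ADJ": {"ADJ", "NOUN"},
    "NOUN": {"NOUN"},
}

def chunk_noun_phrases(tagged):
    """Scan (word, tag) pairs left to right, growing a noun phrase while
    the precedence table allows the next tag."""
    phrases, current = [], []
    for word, tag in tagged:
        if current and tag in NP_CONTINUE.get(current[-1][1], set()):
            current.append((word, tag))
        else:
            if current:
                phrases.append(current)
            current = [(word, tag)] if tag in NP_START else []
    if current:
        phrases.append(current)
    return phrases

sent = [("the", "DET"), ("maternal", "ADJ"), ("plasma", "NOUN"),
        ("levels", "NOUN"), ("of", "PREP"), ("glucose", "NOUN")]
print(chunk_noun_phrases(sent))
```

The prepositional attachment (connecting "levels" to "glucose" over "of") would then be handled by the right-to-left scan of the next step.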

8
Extracting Structural Relations-Step IV
  • Once each sentence is divided into phrases, intra- and inter-phrase structural relations are extracted
  • Noun phrases are scanned from left to right, hooking up articles, adjectives and modifier nouns to head nouns
  • Noun phrases are scanned from right to left, connecting nouns over prepositions
  • Starting from verb phrases, the text is scanned to the left for an unconnected head noun, which becomes the subject, and likewise to the right of the verb for the object
  • This technique does not address the finer points of syntactic analysis, such as anaphora resolution, multi-word verbs, garden paths, etc.

9
Similarity Calculation
  • Each word is considered as an object and its collection of context features as attributes
  • The Tanimoto measure using log-entropy weightings gave the best intuitive results. Each relation pair is given a local weighting of log(frequency + 1), which is multiplied by a global weighting of the attribute involved, using
  • global(att_j) = 1 - Σ_i (P_ij · log(P_ij)) / log(nbrels)
  • where P_ij is the frequency of attribute_j with object_i divided by the number of attributes for object_i, and nbrels is the total number of non-unique term-attribute relations extracted from the corpus.
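A minimal sketch of this log-entropy weighting, assuming relations are given as (object, attribute, frequency) triples and reading "number of attributes for object_i" as the object's total attribute count (an assumption about the normalization):

```python
import math
from collections import defaultdict

def log_entropy_weights(pairs):
    """pairs: list of (object, attribute, frequency) relations.
    Returns the weight log(f + 1) * global(att) for each relation."""
    nbrels = sum(f for _, _, f in pairs)       # total non-unique relations
    total_atts = defaultdict(int)              # total attribute count per object
    for obj, _, f in pairs:
        total_atts[obj] += f
    # global(att) = 1 - sum_i P_ij log(P_ij) / log(nbrels)
    global_w = defaultdict(lambda: 1.0)
    for obj, att, f in pairs:
        p = f / total_atts[obj]                # P_ij as defined above
        if p < 1:                              # p * log(p) = 0 when p == 1
            global_w[att] -= (p * math.log(p)) / math.log(nbrels)
    # local weighting log(f + 1) times the attribute's global weighting
    return {(obj, att): math.log(f + 1) * global_w[att]
            for obj, att, f in pairs}

pairs = [("cell", "ADJ:red", 3), ("cell", "SUBJ:divide", 1), ("blood", "ADJ:red", 2)]
weights = log_entropy_weights(pairs)
```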

10
Similarity Calculation (contd.)
  • Our formula for the weighted Tanimoto similarity measure between two objects obj_m and obj_n, where the sums are over all the unique attributes att, is
  • SIM(obj_m, obj_n) = Σ_att min(weight(obj_m, att), weight(obj_n, att)) / Σ_att max(weight(obj_m, att), weight(obj_n, att))
  • Note that when the weights are restricted to 0 and 1 this formula is equivalent to a binary Tanimoto formula, though this is by no means the only way to generalize the binary formula to the weighted case. This method is used in the evaluation phase (next).
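The weighted Tanimoto measure itself is straightforward to compute; a sketch, assuming each object's weights are stored in an attribute-to-weight map:

```python
def tanimoto(wm, wn):
    """Weighted Tanimoto similarity between two objects given their
    attribute -> weight dictionaries. Missing attributes weigh 0."""
    atts = set(wm) | set(wn)                       # union of unique attributes
    num = sum(min(wm.get(a, 0.0), wn.get(a, 0.0)) for a in atts)
    den = sum(max(wm.get(a, 0.0), wn.get(a, 0.0)) for a in atts)
    return num / den if den else 0.0

a = {"ADJ:red": 1.0, "SUBJ:divide": 0.5}
b = {"ADJ:red": 1.0, "NN:blood": 0.5}
print(tanimoto(a, b))   # 1.0 / 2.0 = 0.5
```

With 0/1 weights this reduces to the binary Tanimoto (Jaccard) ratio |A ∩ B| / |A ∪ B|, as the note above observes.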

11
Evaluation
  • Testing the similarity of words against MED (a database of medical abstracts): 160,000 words → extracting the syntactic contexts of nouns produced 71,000 pairs of words. Of these, there were 53,300 unique pairs, composed of 5,289 unique words using 9,997 unique attributes. The closest terms to each term were calculated from these term-attribute pairs; 684 terms appeared sufficiently frequently. We iteratively expanded the queries by adding in the words closest to any of these 684 terms. By closest, we accepted the word with the smallest distance (0 to 1.0) to the term, or any other word within 0.01 of this distance.

12
Lexical Domain Map
  • A representation R such that, for any text items D1 and D2, R(D1) = R(D2) iff meaning(D1) = meaning(D2)
  • Normalization via:
  • Morphological stemming: retrieving → retriev
  • Lexicon-based normalization: retrieval → retrieve
  • Operator-argument representation of phrases: information retrieval, retrieving of information → retrieve+information
  • Content-based term clustering into synonymy classes and subsumption hierarchies: a take-over is an acquisition
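The first two normalization stages can be sketched with a toy suffix stripper and stem lexicon; both the suffix rules and the lexicon entries are illustrative stand-ins, not the actual system's resources:

```python
# Maps a morphological stem to its canonical lexical form (illustrative).
LEXICON = {"retriev": "retrieve"}

def stem(word):
    """Crude morphological stemming: strip the first matching suffix."""
    for suffix in ("ing", "al", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def normalize(word):
    """Stem, then map the stem to a lexical form if the lexicon knows it."""
    s = stem(word)
    return LEXICON.get(s, s)

print(normalize("retrieving"))   # -> 'retrieve'
print(normalize("retrieval"))    # -> 'retrieve'
```

Both "retrieving" and "retrieval" thus land on the same representation, which is exactly the R(D1) = R(D2) property the slide asks for.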

13
TTP (tagged text parser)
  • Text → NLP → representation → database search
  • NLP: tagger → parser → terms
  • Head-Modifier Structures
  • TTP parse structures are passed to the phrase extraction module, where head-modifier (including predicate-argument) pairs are extracted and collected into occurrence patterns
  • A head noun and its left adjective or noun adjunct
  • A head noun and the head of its right adjunct
  • The main verb of a clause and the head of its object phrase

14
Term Correlations from Text
  • SIM(x1, x2) = Σ_att MIN(W(x1, att), W(x2, att)) / Σ_att MAX(W(x1, att), W(x2, att))
  • with
  • W(x, y) = GEW(x) · log(f_x,y)
  • GEW(x) = 1 + Σ_y ((f_x,y / n_y) · log(f_x,y / n_y)) / log(N)
  • f_x,y stands for the absolute frequency of the pair x,y in the corpus, n_y is the frequency of term y, and N is the number of single-word terms
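A sketch of the GEW weighting under these definitions, with the pairs given as a frequency table; deriving N as the number of distinct terms seen in the table is an assumption made for the example:

```python
import math
from collections import defaultdict

def gew_weights(pairs):
    """pairs: dict mapping head-modifier pairs (x, y) to frequency f_xy.
    Returns W(x, y) = GEW(x) * log(f_xy) for each pair."""
    n = defaultdict(int)                          # n_y: frequency of term y
    for (x, y), f in pairs.items():
        n[y] += f
    N = len({t for xy in pairs for t in xy})      # number of single-word terms
    # GEW(x) = 1 + sum_y (f_xy/n_y) log(f_xy/n_y) / log(N)
    gew = defaultdict(lambda: 1.0)
    for (x, y), f in pairs.items():
        p = f / n[y]
        if 0 < p < 1:                             # p * log(p) = 0 when p == 1
            gew[x] += (p * math.log(p)) / math.log(N)
    return {(x, y): gew[x] * math.log(f) for (x, y), f in pairs.items()}

pairs = {("retrieve", "information"): 4,
         ("store", "information"): 2,
         ("retrieve", "document"): 3}
weights = gew_weights(pairs)
```

SIM(x1, x2) is then the same min/max (Tanimoto-style) ratio over these W values as in Part I.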

15
Global Term Specificity Measure
  • GTS(w) = IC_L(w) · IC_R(w)  if both exist
  •        = IC_R(w)            if only IC_R exists
  •        = IC_L(w)            otherwise
  • where, with n_w, d_w > 0:
  • IC_L(w) = IC(w, _) = n_w / (d_w · (n_w + d_w - 1))
  • IC_R(w) = IC(_, w) = n_w / (d_w · (n_w + d_w - 1))
  • d_w is the dispersion of term w, understood as the number of distinct contexts in which w is found
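Under these definitions, GTS can be sketched directly; the per-side (n_w, d_w) statistics are assumed precomputed elsewhere, and the argument names are illustrative:

```python
def ic(nw, dw):
    """IC = n_w / (d_w * (n_w + d_w - 1)), per the definition above."""
    return nw / (dw * (nw + dw - 1))

def gts(left=None, right=None):
    """left / right are (n_w, d_w) tuples for IC_L and IC_R, or None
    when w does not occur on that side of head-modifier pairs."""
    icl = ic(*left) if left else None
    icr = ic(*right) if right else None
    if icl is not None and icr is not None:
        return icl * icr          # both exist
    return icr if icr is not None else icl

print(gts(left=(10, 4), right=(6, 3)))
```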

16
Conclusion-I
  • The future of Information Retrieval lies in knowledge-based techniques. We have presented here a technique, and mentioned others, which provide a portion of the needed knowledge automatically, without using prior domain knowledge. Our tests show that despite using
  • imperfect morphological analysis
  • imperfect syntactic disambiguation
  • imperfect structural analysis
  • limited contexts
  • imperfectly understood similarity measures
  • useful term similarity lists can still be extracted.

17
Conclusion-II
  • Lexical relations between terms are calculated
    directly from the database and stored in the form
    of a domain map
  • Determines successful matches between documents
    and queries
  • Does not include semantic analysis