Extracting Lexical Features
1
Extracting Lexical Features
  • Development of software tools for a search engine:
  • 1. Convert an arbitrary pile of textual objects
    into a well-defined corpus of documents, each
    containing a string of terms to be indexed.
  • 2. Invert the index, so that rather than seeing
    all the words contained in a particular document,
    we can find all documents containing particular
    keywords.
  • 3. (Later chapters) Match queries to indices to
    retrieve the documents that are most similar.

2
Interdocument Parsing
  • The first step is to break the corpus (an
    arbitrary pile of text) into individually
    retrievable documents.
  • Our two example text corpora are AIT (AI theses)
    and email; documents are abstracts (AIT) or the
    entire message (email).
  • Filters such as DeTeX remove LaTeX markup;
    analogous filters strip tags such as <H1> from
    HTML (a minimal sketch follows below).
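
A minimal Python sketch of this step, under the hypothetical assumption
that documents in the pile are separated by blank lines (real corpora
need format-specific splitting, and real filters such as DeTeX are far
more careful than the crude tag-stripping regex used here; the file
name corpus.txt is likewise just for illustration):

    import re

    def split_corpus(raw_text, delimiter="\n\n"):
        # Break the arbitrary pile of text into individually
        # retrievable documents. Blank-line separation is an
        # assumption for illustration; an email corpus would
        # split on message boundaries instead.
        return [doc for doc in raw_text.split(delimiter) if doc.strip()]

    def strip_markup(document):
        # Crude filter: drop anything that looks like an HTML tag,
        # e.g. <H1>. DeTeX plays the analogous role for LaTeX.
        return re.sub(r"<[^>]+>", " ", document)

    docs = [strip_markup(d) for d in split_corpus(open("corpus.txt").read())]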

3
Intradocument Parsing
  • Reading each character of each document, deciding
    whether it is part of a meaningful token, and
    deciding whether these tokens are worth indexing
    is the most computationally intensive aspect of
    indexing, so it must be efficient.
  • Deal with text in situ rather than making a second
    copy for use by the indexing and retrieval system,
    by creating a system of pointers to locations
    within the corpus.
  • A lexical analyser tokenises the stream of
    characters into a sequence of word-like elements
    using a finite state machine (e.g. the UNIX lex
    tool is a lexical-analyser generator; Perl serves
    a similar role; see the sketch below and the next
    slide).
  • Folding case (treating upper and lower case
    interchangeably) saves space.
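
Since the lex example on the next slide is not transcribed, here is a
rough Python equivalent: a regular expression stands in for the finite
state machine, and case is folded as the tokens are emitted.

    import re

    TOKEN = re.compile(r"[A-Za-z]+")   # word-like elements only

    def tokenise(document):
        # Scan the character stream, emit word-like tokens, and
        # fold case so that CAR and car become the same index term.
        for match in TOKEN.finditer(document):
            yield match.group().lower()

    list(tokenise("The cat SAT on the mat."))
    # -> ['the', 'cat', 'sat', 'on', 'the', 'mat']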

4
(No Transcript)
5
Stemming
  • Stemming aims to remove surface markings (such as
    number) to reveal a token's root form.
  • Using a token's root form as an index term can
    give robust retrieval even when the query
    contains the plural CARS while the document
    contains the singular CAR.
  • Linguists distinguish inflectional morphology
    (plurals, third person singular, past tense,
    -ing) from derivational morphology (e.g. teach
    (verb) vs. teacher (noun)); this underlies the
    distinction between weak and strong stemming.

6
Plural to singular
  • Most common rule: remove a terminal S. But
  • we can't blindly strip the final S, e.g. of a
    double SS: that would give CRISIS → CRISI and
    CHESS → CHES.
  • Irregular forms need their own treatment: woman /
    women, leaf / leaves, ferry / ferries, fox /
    foxes, alumnus / alumni.
  • We need a context-sensitive transformational
    grammar which works reliably over groups of words
    (e.g. all words ending in CH). See the next slide.

7
Example stemming rules
  • (.*)SSES → \1SS is Perl-like syntax saying that
    strings ending in SSES should be transformed by
    taking the stem (the characters before SSES,
    captured as \1) and adding back only the two
    characters SS.
  • (.*)IES → \1Y
  • A complete stemmer contains many such rules (60
    in Lovins' set), plus a regime for handling
    conflicts when multiple rules match the same
    token, e.g. longest match or rule order (a sketch
    follows below).
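
The rules above can be sketched in Python as follows; this toy rule set
is nowhere near Lovins' 60 rules, and ordering the list with longer
suffixes first is one simple conflict-handling regime.

    import re

    # (pattern, replacement) pairs in the slide's notation; \1 is the stem.
    RULES = [
        (re.compile(r"(.*)sses$"), r"\1ss"),    # caresses -> caress
        (re.compile(r"(.*)ies$"),  r"\1y"),     # ferries  -> ferry
        (re.compile(r"(.*[^s])s$"), r"\1"),     # cars -> car, but leaves chess alone
    ]

    def stem(token):
        # First matching rule wins; irregulars (women, crisis)
        # would still need dedicated rules of their own.
        for pattern, replacement in RULES:
            if pattern.match(token):
                return pattern.sub(replacement, token)
        return token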

8
Pros and Cons of Stemming
  • Reduces the size of the keyword vocabulary,
    allowing compression of the index files of
    10–50%.
  • Increases recall: a query on FOOTBALL now also
    finds documents on FOOTBALLER(S) and FOOTBALLING.
  • Reduces precision: stripping away morphological
    features may obscure differences in word
    meanings. For example, GRAVITY has two senses
    (earth's pull, seriousness). GRAVITATION can only
    refer to earth's pull, but if we stem it to
    GRAVITY, it could mean either.

9
Noise words
  • A relatively small number of words account for a
    very significant fraction of all text's bulk.
    Words like IT, AND and TO can be found in
    virtually every sentence.
  • These noise words make very poor index terms.
    Users are unlikely to ask for documents about TO,
    and it is hard to imagine a document about BE.
  • Noise words should be ignored by our lexical
    analyser, e.g. by storing them in a negative
    dictionary or stop list (sketched below).
  • In general, noise words are the most frequent in
    the corpus. But beware: TIME, WAR, HOME, LIFE,
    WATER and WORLD are among the 200 most common
    words in English literature, yet they make
    perfectly good index terms.
  • The same tokens that are thrown away in IR are
    precisely the function words that are most
    important to the syntactic analysis of a
    well-formed sentence, and are indicators of an
    author's individual writing style.
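
A stop list is easy to sketch; the handful of words below is purely
illustrative, whereas a production negative dictionary would hold a few
hundred of the corpus's most frequent (but genuinely uninformative)
words.

    # Tiny illustrative negative dictionary.
    STOP_LIST = {"it", "and", "to", "the", "of", "a", "be", "in"}

    def non_noise_tokens(tokens):
        # Drop noise words so they never become index terms.
        return (t for t in tokens if t not in STOP_LIST)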

10
Example Corpus 1: AIT
  • AIT, the Artificial Intelligence Thesis corpus:
    about 5000 (mostly) Ph.D. and Master's
    dissertations in AI from 1987–1997.
  • Structured attributes are ones over which we can
    reason more formally, using database and AI
    techniques (thesis number, author, year,
    university, supervisor, language, degree).
  • Textual fields (IR): the abstract is the primary
    textual element associated with each thesis,
    while the title (also a textual field) will be
    used as its proxy (conveying much of the material
    in the abstract in a highly abbreviated form).
  • Proxies are important surrogates for the
    documents, e.g. when users are presented with
    hitlists of retrieved documents.

11
Example Corpus 2: Your Email
  • Email has structured attributes associated with
    it, in its header. These include:
  • From
  • To
  • Cc
  • Subject (proxy text)
  • Date
  • Other features we may associate with an email
    message are incoming/outgoing and the folder in
    which it was stored.
  • Parallels between the two example corpora: both
    have well-defined authors, time-stamps, and
    obvious candidates for proxy text.

12
(No Transcript)
13
Basic Algorithm for an IR system
  • We now assume that:
  • prior technology has successfully broken our
    large corpus into a set of documents;
  • within each document we have identified
    individual tokens;
  • noise words have been identified.
  • Then our basic algorithm proceeds as follows.

14
Algorithm 2.1
  • For every doc in corpus:
  •   while (token = getNonNoiseToken()):
  •     token = stem(token)
  •     save Posting(token, doc) in tree
  • A posting is simply a correspondence between a
    particular word and a particular document,
    representing the occurrence of that word in that
    document.
  • For every token in tree:
  •   accumulate totdoc(token), totfreq(token)
  •   sort postings data in descending order of
      docfreq
  •   write token, totdoc, totfreq, postings.
  • Also store a file of document lengths for
    normalisation purposes (a Python sketch follows
    below).
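
A sketch of Algorithm 2.1 in Python, reusing the tokenise, stem and
non_noise_tokens helpers sketched on the earlier slides; a plain dict
stands in for the tree (see the splay-tree slide later), and an integer
doc_id stands in for the document.

    from collections import Counter, defaultdict

    def build_index(corpus):
        postings = defaultdict(Counter)   # token -> {doc_id: frequency}
        doc_lengths = {}                  # kept for later normalisation
        for doc_id, document in enumerate(corpus):
            tokens = [stem(t) for t in non_noise_tokens(tokenise(document))]
            doc_lengths[doc_id] = len(tokens)
            for token in tokens:          # save Posting(token, doc)
                postings[token][doc_id] += 1

        index = {}
        for token, docs in postings.items():
            totdoc = len(docs)            # documents containing token
            totfreq = sum(docs.values())  # total occurrences of token
            # sort postings in descending order of within-document frequency
            ordered = sorted(docs.items(), key=lambda p: -p[1])
            index[token] = (totdoc, totfreq, ordered)
        return index, doc_lengths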

15
(No Transcript)
16
Refinements to the postings data structure
  • Once a token's postings have been sorted into
    descending order of frequency, it is likely that
    several of the documents in this list share the
    same frequency, and we can exploit this fact to
    compress their representation (sketched below).
  • Consider various keyword weighting schemes.
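
One way to exploit the shared frequencies, sketched in Python: store
each frequency once, followed by the run of documents that share it.

    from itertools import groupby

    def compress_postings(ordered):
        # ordered is a postings list of (doc_id, frequency) pairs,
        # already sorted in descending order of frequency; group it
        # into (frequency, [doc_ids]) runs.
        return [(freq, [doc for doc, _ in run])
                for freq, run in groupby(ordered, key=lambda p: p[1])]

    compress_postings([(3, 5), (7, 5), (2, 5), (9, 1)])
    # -> [(5, [3, 7, 2]), (1, [9])]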

17
(No Transcript)
18
Splay Trees
  • Splay trees are an appropriate data structure for
    these keywords and their postings.
  • A splay tree is a self-balancing binary search
    tree with the additional unusual property that
    recently accessed elements are quick to access
    again.
  • Splaying the tree for a certain element
    rearranges the tree so that the element is placed
    at the root of the tree (Wikipedia).
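
A minimal splay-tree sketch in Python, keyed by token with the postings
as the stored value; this is the textbook recursive formulation, not a
tuned production structure.

    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    def rotate_right(x):
        y = x.left
        x.left, y.right = y.right, x
        return y

    def rotate_left(x):
        y = x.right
        x.right, y.left = y.left, x
        return y

    def splay(root, key):
        # Bring the node holding key (or the last node on its search
        # path) to the root via zig, zig-zig and zig-zag rotations.
        if root is None or root.key == key:
            return root
        if key < root.key:
            if root.left is None:
                return root
            if key < root.left.key:                      # zig-zig
                root.left.left = splay(root.left.left, key)
                root = rotate_right(root)
            elif key > root.left.key:                    # zig-zag
                root.left.right = splay(root.left.right, key)
                if root.left.right is not None:
                    root.left = rotate_left(root.left)
            return root if root.left is None else rotate_right(root)
        else:
            if root.right is None:
                return root
            if key > root.right.key:                     # zig-zig
                root.right.right = splay(root.right.right, key)
                root = rotate_left(root)
            elif key < root.right.key:                   # zig-zag
                root.right.left = splay(root.right.left, key)
                if root.right.left is not None:
                    root.right = rotate_right(root.right)
            return root if root.right is None else rotate_left(root)

    def insert(root, key, value):
        # Splay first, then split the old root under the new node.
        if root is None:
            return Node(key, value)
        root = splay(root, key)
        if root.key == key:
            root.value = value
            return root
        node = Node(key, value)
        if key < root.key:
            node.left, node.right, root.left = root.left, root, None
        else:
            node.right, node.left, root.right = root.right, root, None
        return node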

19
Fine points
  • Posting resolution: some query languages allow
    proximity operators which let users specify how
    close two keywords must be (e.g. adjacent, same
    sentence, within a k-word window). This requires
    us to retain the exact position of each keyword,
    not just which document it is in (a sketch
    follows below).
  • Emphasising words in proxy text over those used
    in the rest of the corpus, e.g. tripling the
    keyword counters for title text.
  • Quoted email text is marked by >> ; we only want
    to index each piece of text once.
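
A sketch of positional postings in Python: each occurrence keeps its
word offset, which is enough to answer adjacency and k-word-window
queries (the helper names here are illustrative, not from the text).

    from collections import defaultdict

    def positional_postings(doc_id, tokens):
        # token -> [(doc_id, position), ...]
        positions = defaultdict(list)
        for pos, token in enumerate(tokens):
            positions[token].append((doc_id, pos))
        return positions

    def within_window(pos_a, pos_b, k):
        # True if the two keywords co-occur within a k-word window
        # of the same document.
        return any(da == db and abs(pa - pb) <= k
                   for da, pa in pos_a for db, pb in pos_b)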