Extracting Lexical Features
1
Extracting Lexical Features
  • Development of software tools for a search engine:
  • 1. Convert an arbitrary pile of textual objects
    into a well-defined corpus of documents, each
    containing a string of terms to be indexed.
  • 2. Invert the index, so that rather than seeing
    all the words contained in a particular document,
    we can find all documents containing particular
    keywords.
  • 3. (Later chapters) Match queries to indices to
    retrieve the documents that are most similar.

2
Interdocument Parsing
  • The first step is to break the corpus (an
    arbitrary pile of text) into individually
    retrievable documents.
  • Our two example text corpora are AIT (AI theses)
    and email; documents are abstracts (AIT) or the
    entire message (email).
  • Filters such as DeTeX remove LaTeX markup;
    analogous filters strip tags such as <H1> from
    HTML (a minimal sketch follows below).
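
A minimal Python sketch of this step, under the hypothetical assumption
that documents in the pile are separated by blank lines (real corpora
need format-specific splitting, and real filters such as DeTeX are far
more careful than the crude tag-stripping regex used here; the file
name corpus.txt is likewise just for illustration):

    import re

    def split_corpus(raw_text, delimiter="\n\n"):
        # Break the arbitrary pile of text into individually
        # retrievable documents. Blank-line separation is an
        # assumption for illustration; an email corpus would
        # split on message boundaries instead.
        return [doc for doc in raw_text.split(delimiter) if doc.strip()]

    def strip_markup(document):
        # Crude filter: drop anything that looks like an HTML tag,
        # e.g. <H1>. DeTeX plays the analogous role for LaTeX.
        return re.sub(r"<[^>]+>", " ", document)

    docs = [strip_markup(d) for d in split_corpus(open("corpus.txt").read())]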

3
Intradocument Parsing
  • Reading each character of each document, deciding
    whether it is part of a meaningful token, and
    deciding whether these tokens are worth indexing
    is the most computationally intensive aspect of
    indexing, so it must be efficient.
  • Deal with text in situ rather than making a second
    copy for use by the indexing and retrieval system,
    by creating a system of pointers to locations
    within the corpus.
  • A lexical analyser tokenises the stream of
    characters into a sequence of word-like elements
    using a finite state machine (e.g. the UNIX lex
    tool is a lexical-analyser generator; Perl serves
    a similar role; see the sketch below and the next
    slide).
  • Folding case (treating upper and lower case
    interchangeably) saves space.
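
Since the lex example on the next slide is not transcribed, here is a
rough Python equivalent: a regular expression stands in for the finite
state machine, and case is folded as the tokens are emitted.

    import re

    TOKEN = re.compile(r"[A-Za-z]+")   # word-like elements only

    def tokenise(document):
        # Scan the character stream, emit word-like tokens, and
        # fold case so that CAR and car become the same index term.
        for match in TOKEN.finditer(document):
            yield match.group().lower()

    list(tokenise("The cat SAT on the mat."))
    # -> ['the', 'cat', 'sat', 'on', 'the', 'mat']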

4
(No Transcript)
5
Stemming
  • Stemming aims to remove surface markings (such as
    number) to reveal a token's root form.
  • Using a token's root form as an index term can
    give robust retrieval even when the query
    contains the plural CARS while the document
    contains the singular CAR.
  • Linguists distinguish inflectional morphology
    (plurals, third person singular, past tense,
    -ing) from derivational morphology (e.g. teach
    (verb) vs. teacher (noun)); this underlies the
    distinction between weak and strong stemming.

6
Plural to singular
  • Most common rule: remove a terminal S. But
  • we can't blindly strip the final S, e.g. of a
    double SS: that would give CRISIS → CRISI and
    CHESS → CHES.
  • Irregular forms need their own treatment: woman /
    women, leaf / leaves, ferry / ferries, fox /
    foxes, alumnus / alumni.
  • We need a context-sensitive transformational
    grammar which works reliably over groups of words
    (e.g. all words ending in CH). See the next slide.

7
Example stemming rules
  • (.*)SSES → \1SS is Perl-like syntax saying that
    strings ending in SSES should be transformed by
    taking the stem (the characters before SSES,
    captured as \1) and adding back only the two
    characters SS.
  • (.*)IES → \1Y
  • A complete stemmer contains many such rules (60
    in Lovins' set), plus a regime for handling
    conflicts when multiple rules match the same
    token, e.g. longest match or rule order (a sketch
    follows below).
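
The rules above can be sketched in Python as follows; this toy rule set
is nowhere near Lovins' 60 rules, and ordering the list with longer
suffixes first is one simple conflict-handling regime.

    import re

    # (pattern, replacement) pairs in the slide's notation; \1 is the stem.
    RULES = [
        (re.compile(r"(.*)sses$"), r"\1ss"),    # caresses -> caress
        (re.compile(r"(.*)ies$"),  r"\1y"),     # ferries  -> ferry
        (re.compile(r"(.*[^s])s$"), r"\1"),     # cars -> car, but leaves chess alone
    ]

    def stem(token):
        # First matching rule wins; irregulars (women, crisis)
        # would still need dedicated rules of their own.
        for pattern, replacement in RULES:
            if pattern.match(token):
                return pattern.sub(replacement, token)
        return token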

8
Pros and Cons of Stemming
  • Reduces the size of the keyword vocabulary,
    allowing compression of the index files of
    10–50%.
  • Increases recall: a query on FOOTBALL now also
    finds documents on FOOTBALLER(S) and FOOTBALLING.
  • Reduces precision: stripping away morphological
    features may obscure differences in word
    meanings. For example, GRAVITY has two senses
    (earth's pull, seriousness). GRAVITATION can only
    refer to earth's pull, but if we stem it to
    GRAVITY, it could mean either.

9
Noise words
  • A relatively small number of words account for a
    very significant fraction of all text's bulk.
    Words like IT, AND and TO can be found in
    virtually every sentence.
  • These noise words make very poor index terms.
    Users are unlikely to ask for documents about TO,
    and it is hard to imagine a document about BE.
  • Noise words should be ignored by our lexical
    analyser, e.g. by storing them in a negative
    dictionary or stop list (sketched below).
  • In general, noise words are the most frequent in
    the corpus. But beware: TIME, WAR, HOME, LIFE,
    WATER and WORLD are among the 200 most common
    words in English literature, yet they make
    perfectly good index terms.
  • The same tokens that are thrown away in IR are
    precisely the function words that are most
    important to the syntactic analysis of a
    well-formed sentence, and are indicators of an
    author's individual writing style.
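
A stop list is easy to sketch; the handful of words below is purely
illustrative, whereas a production negative dictionary would hold a few
hundred of the corpus's most frequent (but genuinely uninformative)
words.

    # Tiny illustrative negative dictionary.
    STOP_LIST = {"it", "and", "to", "the", "of", "a", "be", "in"}

    def non_noise_tokens(tokens):
        # Drop noise words so they never become index terms.
        return (t for t in tokens if t not in STOP_LIST)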

10
Example Corpus 1: AIT
  • AIT, the Artificial Intelligence Thesis corpus:
    about 5000 (mostly) Ph.D. and Master's
    dissertations in AI from 1987–1997.
  • Structured attributes are ones over which we can
    reason more formally, using database and AI
    techniques (thesis number, author, year,
    university, supervisor, language, degree).
  • Textual fields (IR): the abstract is the primary
    textual element associated with each thesis,
    while the title (also a textual field) will be
    used as its proxy (conveying much of the material
    in the abstract in a highly abbreviated form).
  • Proxies are important surrogates for the
    documents, e.g. when users are presented with
    hitlists of retrieved documents.

11
Example Corpus 2: Your Email
  • Email has structured attributes associated with
    it, in its header. These include:
  • From
  • To
  • Cc
  • Subject (proxy text)
  • Date
  • Other features we may associate with an email
    message are incoming/outgoing and the folder in
    which it was stored.
  • Parallels between the two example corpora: both
    have well-defined authors, time-stamps, and
    obvious candidates for proxy text.

12
(No Transcript)
13
Basic Algorithm for an IR system
  • We now assume that:
  • prior technology has successfully broken our
    large corpus into a set of documents;
  • within each document we have identified
    individual tokens;
  • noise words have been identified.
  • Then our basic algorithm proceeds as follows.

14
Algorithm 2.1
  • For every doc in corpus:
  •   while (token = getNonNoiseToken()):
  •     token = stem(token)
  •     save Posting(token, doc) in tree
  • A posting is simply a correspondence between a
    particular word and a particular document,
    representing the occurrence of that word in that
    document.
  • For every token in tree:
  •   accumulate totdoc(token), totfreq(token)
  •   sort postings data in descending order of
      docfreq
  •   write token, totdoc, totfreq, postings.
  • Also store a file of document lengths for
    normalisation purposes (a Python sketch follows
    below).
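
A sketch of Algorithm 2.1 in Python, reusing the tokenise, stem and
non_noise_tokens helpers sketched on the earlier slides; a plain dict
stands in for the tree (see the splay-tree slide later), and an integer
doc_id stands in for the document.

    from collections import Counter, defaultdict

    def build_index(corpus):
        postings = defaultdict(Counter)   # token -> {doc_id: frequency}
        doc_lengths = {}                  # kept for later normalisation
        for doc_id, document in enumerate(corpus):
            tokens = [stem(t) for t in non_noise_tokens(tokenise(document))]
            doc_lengths[doc_id] = len(tokens)
            for token in tokens:          # save Posting(token, doc)
                postings[token][doc_id] += 1

        index = {}
        for token, docs in postings.items():
            totdoc = len(docs)            # documents containing token
            totfreq = sum(docs.values())  # total occurrences of token
            # sort postings in descending order of within-document frequency
            ordered = sorted(docs.items(), key=lambda p: -p[1])
            index[token] = (totdoc, totfreq, ordered)
        return index, doc_lengths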

15
(No Transcript)
16
Refinements to the postings data structure
  • Once a token's postings have been sorted into
    descending order of frequency, it is likely that
    several of the documents in this list share the
    same frequency, and we can exploit this fact to
    compress their representation (sketched below).
  • Consider various keyword weighting schemes.
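
One way to exploit the shared frequencies, sketched in Python: store
each frequency once, followed by the run of documents that share it.

    from itertools import groupby

    def compress_postings(ordered):
        # ordered is a postings list of (doc_id, frequency) pairs,
        # already sorted in descending order of frequency; group it
        # into (frequency, [doc_ids]) runs.
        return [(freq, [doc for doc, _ in run])
                for freq, run in groupby(ordered, key=lambda p: p[1])]

    compress_postings([(3, 5), (7, 5), (2, 5), (9, 1)])
    # -> [(5, [3, 7, 2]), (1, [9])]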

17
(No Transcript)
18
Splay Trees
  • Splay trees are an appropriate data structure for
    these keywords and their postings.
  • A splay tree is a self-balancing binary search
    tree with the additional unusual property that
    recently accessed elements are quick to access
    again.
  • Splaying the tree for a certain element
    rearranges the tree so that the element is placed
    at the root of the tree (Wikipedia).
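
A minimal splay-tree sketch in Python, keyed by token with the postings
as the stored value; this is the textbook recursive formulation, not a
tuned production structure.

    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    def rotate_right(x):
        y = x.left
        x.left, y.right = y.right, x
        return y

    def rotate_left(x):
        y = x.right
        x.right, y.left = y.left, x
        return y

    def splay(root, key):
        # Bring the node holding key (or the last node on its search
        # path) to the root via zig, zig-zig and zig-zag rotations.
        if root is None or root.key == key:
            return root
        if key < root.key:
            if root.left is None:
                return root
            if key < root.left.key:                      # zig-zig
                root.left.left = splay(root.left.left, key)
                root = rotate_right(root)
            elif key > root.left.key:                    # zig-zag
                root.left.right = splay(root.left.right, key)
                if root.left.right is not None:
                    root.left = rotate_left(root.left)
            return root if root.left is None else rotate_right(root)
        else:
            if root.right is None:
                return root
            if key > root.right.key:                     # zig-zig
                root.right.right = splay(root.right.right, key)
                root = rotate_left(root)
            elif key < root.right.key:                   # zig-zag
                root.right.left = splay(root.right.left, key)
                if root.right.left is not None:
                    root.right = rotate_right(root.right)
            return root if root.right is None else rotate_left(root)

    def insert(root, key, value):
        # Splay first, then split the old root under the new node.
        if root is None:
            return Node(key, value)
        root = splay(root, key)
        if root.key == key:
            root.value = value
            return root
        node = Node(key, value)
        if key < root.key:
            node.left, node.right, root.left = root.left, root, None
        else:
            node.right, node.left, root.right = root.right, root, None
        return node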

19
Fine points
  • Posting resolution: some query languages allow
    proximity operators which let users specify how
    close two keywords must be (e.g. adjacent, same
    sentence, within a k-word window). This requires
    us to retain the exact position of each keyword,
    not just which document it is in (a sketch
    follows below).
  • Emphasising words in proxy text over those used
    in the rest of the corpus, e.g. tripling the
    keyword counters for title text.
  • Quoted email text is marked by >> ; we only want
    to index each piece of text once.
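
A sketch of positional postings in Python: each occurrence keeps its
word offset, which is enough to answer adjacency and k-word-window
queries (the helper names here are illustrative, not from the text).

    from collections import defaultdict

    def positional_postings(doc_id, tokens):
        # token -> [(doc_id, position), ...]
        positions = defaultdict(list)
        for pos, token in enumerate(tokens):
            positions[token].append((doc_id, pos))
        return positions

    def within_window(pos_a, pos_b, k):
        # True if the two keywords co-occur within a k-word window
        # of the same document.
        return any(da == db and abs(pa - pb) <= k
                   for da, pa in pos_a for db, pb in pos_b)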