The term vocabulary and postings lists

Description:

Lecture 2: The term vocabulary and postings lists. Related to Chapter 2: http://nlp.stanford.edu/IR-book/pdf/02voc.pdf

Slides: 61
Provided by: Christop582

1
  • Lecture 2
  • The term vocabulary and postings lists
  • Related to Chapter 2
  • http://nlp.stanford.edu/IR-book/pdf/02voc.pdf

2
Recap of the previous lecture
Ch. 1
  • Basic inverted indexes
  • Structure: Dictionary and Postings
  • Key step in construction: Sorting
  • Boolean query processing
  • Intersection by linear time merging
  • Simple optimizations

3
Recall the basic indexing pipeline
Documents to be indexed.
Friends, Romans, countrymen.
First project
4
Plan for this lecture
  • Elaborate basic indexing
  • Preprocessing to form the term vocabulary
  • Documents
  • Tokenization
  • What terms do we put in the index?
  • Postings
  • Faster merges: skip lists
  • Positional postings and phrase queries

5
Parsing a document
Sec. 2.1
  • What format is it in?
  • pdf/word/excel/html?
  • What language is it in?
  • What character set is in use?

Each of these is a classification problem, which
we will study later in the course.
But in practice these tasks are often done heuristically.
6
Complications: Format/language
Sec. 2.1
  • Documents being indexed can include docs from
    many different languages
  • A single index may have to contain terms of
    several languages.
  • Sometimes a document or its components can
    contain multiple languages/formats
  • French email with a German pdf attachment.
  • Document unit

7
Tokens and Terms
8
Tokenization
  • Given a character sequence and a defined document
    unit, tokenization is the task of chopping it up
    into pieces, called tokens, perhaps at the same
    time throwing away certain characters, such as
    punctuation.

9
Tokenization
Sec. 2.2.1
  • Input: "university of Qom, computer department"
  • Output: Tokens
  • university
  • of
  • Qom
  • computer
  • department
  • A token is a sequence of characters in a document
  • Each such token is now a candidate for an index
    entry, after further processing
  • Described below
  • But what are valid tokens to emit?
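
A minimal sketch of the tokenization step described above, assuming the simplest possible rule: keep runs of word characters and throw everything else away. The function name `tokenize` and the regular expression are illustrative choices, not something the lecture prescribes.

```python
import re


def tokenize(text):
    """Chop a character sequence into tokens, discarding punctuation.

    Illustrative rule only: a token is a maximal run of letters,
    digits, or underscores.
    """
    return re.findall(r"\w+", text)


tokens = tokenize("university of Qom, computer department")
print(tokens)
```

A rule this simple also splits a hyphenated form such as "co-education" into two tokens, which is one of the design decisions a real tokenizer has to make deliberately.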

10
Issues in tokenization
Sec. 2.2.1
  • Iran's capital → Iran? Irans? Iran's?
  • Hyphen
  • Hewlett-Packard → Hewlett and Packard as two
    tokens?
  • the hold-him-back-and-drag-him-away maneuver:
    break up hyphenated sequence.
  • co-education
  • lowercase, lower-case, lower case?
  • Space
  • San Francisco: how do you decide it is one token?

11
Issues in tokenization
Sec. 2.2.1
  • Numbers
  • Older IR systems may not index numbers
  • But often very useful: think about things like
    looking up error codes/stack traces on the web
  • (One answer is using n-grams: Lecture 3)
  • Will often index meta-data separately
  • Creation date, format, etc.
  • 3/12/91 Mar. 12, 1991 12/3/91
  • 55 B.C.
  • B-52
  • My PGP key is 324a3df234cb23e
  • (800) 234-2333

12
Language issues in tokenization
Sec. 2.2.1
  • French
  • L'ensemble → one token or two?
  • L ? L' ? Le ?
  • Want l'ensemble to match with un ensemble
  • Until at least 2003, it didn't on Google
  • German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • life insurance company employee
  • German retrieval systems benefit greatly from a
    compound splitter module
  • Can give a 15% performance boost for German

13
Language issues in tokenization
Sec. 2.2.1
  • Chinese and Japanese have no spaces between
    words
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • Not always guaranteed a unique tokenization
  • Further complicated in Japanese, with multiple
    alphabets intermingled

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
14
Language issues in tokenization
Sec. 2.2.1
  • Arabic (or Hebrew) is basically written right to
    left, but with certain items like numbers written
    left to right
  • With modern Unicode representation concepts, the
    order of characters in files matches the
    conceptual order, and the reversal of displayed
    characters is handled by the rendering system,
    but this may not be true for documents in older
    encodings.
  • Other complexities that you know!

15
Stop words
Sec. 2.2.2
  • With a stop list, you exclude from the
    dictionary entirely the commonest words.
  • Intuition: They have little semantic content:
    the, a, and, to, be
  • Using a stop list significantly reduces the
    number of postings that a system has to store,
    because these words account for a large share of
    all postings.

16
Stop words
  • But you need them for:
  • Phrase queries: "President of Iran"
  • Various song titles, etc.: "Let it be", "To be or
    not to be"
  • Relational queries: "flights to London"
  • The general trend in IR systems has been from
    large stop lists (200–300 terms) to very small
    stop lists (7–12 terms) to no stop list.
  • Good compression techniques (lecture 5) mean the
    space for including stop words in a system is
    very small
  • Good query optimization techniques (lecture 7)
    mean you pay little at query time for including
    stop words.

17
Normalization to terms
Sec. 2.2.3
  • We want to match I.R. and IR
  • Token normalization is the process of
    canonicalizing tokens so that matches occur
    despite superficial differences in the character
    sequences of the tokens
  • Result is terms: a term is a (normalized) word
    type, which is an entry in our IR system
    dictionary

18
Normalization to terms
  • One way is using equivalence classes
  • A search for one term will retrieve documents
    that contain any member of its class.
  • Most commonly, equivalence classes of terms are
    defined implicitly rather than being fully
    calculated in advance (hand constructed), e.g.,
  • deleting periods to form a term
  • U.S.A., USA → USA
  • deleting hyphens to form a term
  • anti-discriminatory, antidiscriminatory → antidiscriminatory
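
As a small sketch of implicit equivalence classing, the two rules above (delete periods, delete hyphens) can be written as a single normalization function. The name `normalize` is an assumption made for illustration; tokens that map to the same string belong to the same class.

```python
def normalize(token):
    """Map a token to its term by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")


# Tokens that normalize to the same string fall into one equivalence class.
for a, b in [("U.S.A.", "USA"), ("anti-discriminatory", "antidiscriminatory")]:
    print(a, b, normalize(a) == normalize(b))
```

The same function has to be applied to query terms as well as to indexed text, otherwise the classes do not line up.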

19
Normalization: other languages
Sec. 2.2.3
  • Accents: e.g., French résumé vs. resume.
  • Umlauts: e.g., German Tuebingen vs. Tübingen
  • Normalization of things like date forms
  • 7月30日 vs. 7/30
  • Tokenization and normalization may depend on the
    language and so are intertwined with language
    detection
  • Crucial: Need to normalize indexed text as well
    as query terms into the same form

20
Case folding
Sec. 2.2.3
  • Reduce all letters to lower case
  • Exception: upper case in mid-sentence?
  • Often best to lower case everything, since users
    will use lowercase regardless of correct
    capitalization

21
Normalization to terms
Sec. 2.2.3
  • What is the disadvantage of equivalence classing?
  • An alternative to equivalence classing is to do
    asymmetric expansion (hand constructed)
  • An example of where this may be useful
  • Enter: window   Search: window, windows
  • Enter: windows   Search: Windows, windows, window
  • Enter: Windows   Search: Windows
  • Potentially more powerful, but less efficient
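
A sketch of how the asymmetric expansion above might be stored, assuming a hand-built, case-sensitive lookup table consulted at query time. The table name `EXPANSIONS` and the function `expand` are hypothetical.

```python
# Hand-constructed expansion table, copied from the window/Windows example.
# Keys are query terms as typed; values are the index terms to search for.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}


def expand(query_term):
    """Return the index terms to OR together for one query term."""
    # A term without an entry is searched as-is.
    return EXPANSIONS.get(query_term, [query_term])


for term in ["window", "windows", "Windows"]:
    print(term, "->", expand(term))
```

The expansion is asymmetric because "windows" reaches "Windows" but "Windows" does not reach "windows"; a single equivalence class cannot express that.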

22
Thesauri and soundex
  • Do we handle synonyms and homonyms?
  • E.g., by hand-constructed equivalence classes
  • car = automobile, color = colour
  • What about spelling mistakes?
  • One approach is soundex, which forms equivalence
    classes of words based on phonetic heuristics
  • More in lectures 3 and 9

23
Review
  • IR systems
  • Indexing
  • Searching
  • Indexing
  • Parsing document
  • Tokenization -> tokens
  • Normalization -> terms
  • Indexing -> index

24
Review
  • Normalization: consider tokens rather than query.
  • Examples
  • Case
  • Hyphen
  • Period
  • Synonyms
  • Spelling mistakes

25
Review
  • Two methods for normalization
  • Equivalence classing: often implicit.
  • Asymmetric expansion
  • Query time
  • a query expansion dictionary
  • more processing at query time
  • Indexing time
  • more space for storing postings.
  • Asymmetric expansion is considerably less
    efficient than equivalence classing but more
    flexible.

26
Stemming and lemmatization
  • Documents are going to use different forms of a
    word, such as organize, organizes, and
    organizing.
  • Additionally, there are families of
    derivationally related words with similar
    meanings, such as democracy, democratic, and
    democratization.
  • Reduce terms to their roots before indexing.
  • E.g.,
  • am, are, is → be
  • car, cars, car's, cars' → car
  • the boy's cars are different colors → the boy car
    be different color

27
Stemming
Sec. 2.2.4
  • Stemming suggests crude affix chopping
  • language dependent
  • Examples
  • Porter's algorithm
  • http://www.tartarus.org/~martin/PorterStemmer
  • Lovins stemmer
  • http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm

28
Porter's algorithm
Sec. 2.2.4
  • Commonest algorithm for stemming English
  • Results suggest it's at least as good as other
    stemming options
  • Conventions and 5 phases of reductions
  • phases applied sequentially
  • each phase consists of a set of commands
  • sample convention: Of the rules in a compound
    command, select the one that applies to the
    longest suffix.

29
Typical rules in Porter
Sec. 2.2.4
  • sses → ss      presses → press
  • ies → i        bodies → bodi
  • ss → ss        press → press
  • s →            cats → cat
  • Many other rules are sensitive to the measure of
    words
  • (m>1) EMENT →
  • replacement → replac
  • cement → cement
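
The rules on this slide can be sketched in a few lines. This is not the full Porter stemmer, only the four plural rules with the longest-suffix convention plus the single (m>1) EMENT rule; the function names are assumptions made for illustration, and the measure follows Porter's definition of m in the pattern [C](VC)^m[V].

```python
# The four rules shown on the slide: (suffix, replacement).
PLURAL_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_plural_rules(word):
    """Apply the rule whose suffix is the longest one matching the word."""
    matches = [(suf, rep) for suf, rep in PLURAL_RULES if word.endswith(suf)]
    if not matches:
        return word
    suffix, replacement = max(matches, key=lambda rule: len(rule[0]))
    return word[: -len(suffix)] + replacement


def is_consonant(word, i):
    """Porter's notion of consonant: not a/e/i/o/u, and y only after a vowel or at the start."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def strip_ement(word):
    """(m>1) EMENT -> '': drop the suffix only if the remaining stem has measure > 1."""
    if word.endswith("ement"):
        stem = word[: -len("ement")]
        if measure(stem) > 1:
            return stem
    return word


for w in ["presses", "bodies", "press", "cats"]:
    print(w, "->", apply_plural_rules(w))
for w in ["replacement", "cement"]:
    print(w, "->", strip_ement(w))
```

The measure condition is what keeps "cement" intact: its candidate stem "c" has measure 0, whereas "replac" has measure 2.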

30
Lemmatization
Sec. 2.2.4
  • Reduce inflectional/variant forms to base form
    (lemma) properly with the use of a vocabulary and
    morphological analysis of words
  • Lemmatizer: a tool from Natural Language
    Processing which does full morphological analysis
    to accurately identify the lemma for each word.

31
Stemming vs. Lemmatization
  • Example token: saw
  • Stemming might return just s,
  • Lemmatization would attempt to return either
  • see, if the token was used as a verb, or
  • saw, if the token was used as a noun

32
Helpfulness of normalization
  • Do stemming and other normalizations help?
  • Definitely useful for Spanish, German, Finnish, ...
  • 30% performance gains for Finnish!
  • What about English?

33
Helpfulness of normalization
  • English
  • Not much help overall!
  • Helps a lot for some queries, hurts performance a
    lot for others.
  • Stemming helps recall but harms precision
  • operative (dentistry) → oper
  • operational (research) → oper
  • operating (systems) → oper
  • For a case like this, moving to using a
    lemmatizer would not completely fix the problem

34
Project 2
  • Find rules for normalizing Farsi documents and
    implement.

35
Exercise
  • Are the following statements true or false? Why?
  • a. In a Boolean retrieval system, stemming never
    lowers precision.
  • b. In a Boolean retrieval system, stemming never
    lowers recall.
  • c. Stemming increases the size of the vocabulary.
  • d. Stemming should be invoked at indexing time
    but not while processing a query

36
Language-specificity
Sec. 2.2.4
  • Many of the above features embody transformations
    that are
  • Language-specific and
  • Often, application-specific
  • These are plug-in addenda to the indexing
    process
  • Both open source and commercial plug-ins are
    available for handling these

37
Faster postings merges: Skip lists
38
Recall basic merge
Sec. 2.3
  • Walk through the two postings simultaneously, in
    time linear in the total number of postings
    entries

Brutus: 2 → 4 → 8 → 41 → 48 → 64 → 128
Caesar: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31
Intersection: 2 → 8
If the list lengths are m and n, the merge takes
O(m+n) operations.
Can we do better? Yes.
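
The basic merge can be sketched as follows, assuming each postings list is a sorted Python list of document IDs. The function name `intersect` is illustrative; the two lists are the Brutus and Caesar postings from the slide.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(m + n) comparisons."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, which is where the linear bound comes from.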
39
Augment postings with skip pointers (at indexing
time)
Sec. 2.3
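[Figure: two postings lists augmented with skip pointers; skip targets 41 and 128 on the upper list, 11 and 31 on the lower list.]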
  • Why?
  • To skip postings that will not figure in the
    search results.
  • How?
  • Where do we place skip pointers?
  • The resulting list is a skip list.

40
Query processing with skip pointers
Sec. 2.3
[Figure: the two postings lists with skip pointers from the previous slide.]
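Suppose we've stepped through the lists until we
process 8 on each list. We match it and advance.
We then have 41 on the upper list and 11 on the
lower. 11 is smaller. But the skip successor of 11
on the lower list is 31, which is still less than
41, so we can skip ahead past the intervening
postings.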
41
Where do we place skips?
Sec. 2.3
  • Tradeoff
  • More skips → shorter skip spans → more likely to
    skip. But lots of comparisons to skip pointers.
  • Fewer skips → few pointer comparisons, but then
    long skip spans → few successful skips.

42
Placing skips
Sec. 2.3
  • Simple heuristic: for postings of length L, use
    √L evenly-spaced skip pointers.
  • This ignores the distribution of query terms.
  • Easy if the index is relatively static; harder if
    L keeps changing because of updates.
  • This definitely used to help; with modern
    hardware it may not (Bahle et al. 2002)
  • The I/O cost of loading a bigger postings list
    can outweigh the gains from quicker in-memory
    merging!
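
A sketch of intersection with skip pointers, assuming postings are sorted Python lists and skip pointers are kept in a side dictionary mapping a list index to the index it skips to. The helper names `build_skips` and `intersect_with_skips` are hypothetical, and the pointers are placed about √L apart as in the heuristic above, so they do not coincide with the pointers drawn in the slide figure.

```python
import math


def build_skips(postings):
    """Place skip pointers roughly sqrt(L) apart: index -> target index."""
    n = len(postings)
    step = int(math.sqrt(n))
    if step < 2:
        return {}  # list too short for skips to be worthwhile
    return {i: i + step for i in range(0, n - step, step)}


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following skips when safe."""
    skips1, skips2 = build_skips(p1), build_skips(p2)
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if its target does not overshoot p2[j].
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                j = skips2[j]
            else:
                j += 1
    return answer


brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(brutus, caesar))  # [2, 8]
```

A skip is safe because every posting it jumps over is smaller than the skip target, and the target is itself no larger than the current posting on the other list.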

D. Bahle, H. Williams, and J. Zobel. Efficient
phrase querying with an auxiliary index. SIGIR
2002, pp. 215-221.
43
Exercise
  • Do exercises 2.5 and 2.6 of your book.

44
Phrase queries and positional indexes
45
Phrase queries
Sec. 2.4
  • Want to be able to answer queries such as
    "stanford university" as a phrase
  • Thus the sentence "I went to university at
    Stanford" is not a match.
  • Most recent search engines support a double
    quotes syntax

46
Phrase queries
  • The concept of phrase queries has proven to be
    very easily understood and successfully used by
    users.
  • As many as 10% of web queries are phrase queries.
  • Many more queries are implicit phrase queries
  • For this, it no longer suffices to store only
  • <term : docs> entries
  • Solutions?

47
A first attempt: Biword indexes
Sec. 2.4.1
  • Index every consecutive pair of terms in the text
    as a phrase
  • For example, the text "Qom computer department"
    would generate the biwords
  • Qom computer
  • computer department
  • Each of these biwords is now a dictionary term
  • Two-word phrase query-processing is now immediate.
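
A minimal sketch of building a biword index, assuming documents arrive as a dictionary of doc ID to text and that tokens are produced by whitespace splitting. The names `biwords` and `build_biword_index` are illustrative only.

```python
from collections import defaultdict


def biwords(tokens):
    """Return every consecutive pair of tokens as one string."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text.split()):
            index[bw].add(doc_id)
    return index


print(biwords("Qom computer department".split()))
index = build_biword_index({1: "Qom computer department"})
print(sorted(index["computer department"]))
```

A two-word phrase query is then a single dictionary lookup.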

48
Longer phrase queries
Sec. 2.4.1
  • The query "modern information retrieval course"
    can be broken into the Boolean query on biwords:
  • modern information AND information retrieval AND
    retrieval course
  • Works fairly well in practice,
  • But there can and will be occasional false
    positives.
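
The decomposition above, and the false positives it allows, can be illustrated with a small self-contained sketch. The two sample documents and the function names are made up for the example; document 2 contains all three biwords but never the full phrase.

```python
from collections import defaultdict


def biwords(tokens):
    """Return every consecutive pair of tokens as one string."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map each biword to the set of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text.lower().split()):
            index[bw].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """AND together the postings of all biwords in the phrase."""
    result = None
    for bw in biwords(phrase.lower().split()):
        postings = index.get(bw, set())
        result = postings if result is None else result & postings
    return result or set()


docs = {
    1: "a modern information retrieval course",
    2: "modern information systems and information retrieval tools "
       "for a retrieval course",
}
index = build_biword_index(docs)
print(sorted(phrase_candidates(index, "modern information retrieval course")))
```

Both documents are returned, although only document 1 contains the phrase; filtering out document 2 would need a check against the text or positional information.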

49
Extended biwords
  • Now consider phrases such as "student of the
    computer"
  • Perform part-of-speech-tagging (POST).
  • POST classifies words as nouns, verbs, etc.
  • Group the terms into (say) Nouns (N) and
    articles/prepositions (X).

50
Extended biwords
Sec. 2.4.1
  • Call any string of terms of the form NXXN an
    extended biword.
  • Each such extended biword is made a term in the
    vocabulary
  • Segment query into enhanced biwords

51
Issues for biword indexes
Sec. 2.4.1
  • False positives, as noted before
  • Index blowup due to bigger dictionary
  • Infeasible for more than biwords, big even for
    them
  • Biword indexes are not the standard solution (for
    all biwords) but can be part of a compound
    strategy

52
Solution 2: Positional indexes
Sec. 2.4.2
  • In the postings, store for each term the
    position(s) in which tokens of it appear
  • <term, number of docs containing term;
  • doc1: position1, position2 ... ;
  • doc2: position1, position2 ... ;
  • etc.>
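
A sketch of this postings layout, assuming whitespace tokenization, positions counted from 1, and a nested dictionary of term to doc ID to position list. The function name and the two sample documents are invented for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build term -> {doc_id: [positions]}, with positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index


docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)

# Document frequency followed by the per-document position lists.
print("be", len(index["be"]), index["be"])
```

The number of documents containing the term is simply the length of the inner dictionary, so it does not need to be stored separately in this toy layout.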

53
Positional index example
Sec. 2.4.2
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3,
149; 4: 17, 191, 291, 430, 434; 5: 363, 367, ...>
Which of docs 1, 2, 4, 5 could contain "to be or not
to be"?
  • For phrase queries, we need to deal with more
    than just equality

54
Processing a phrase query
Sec. 2.4.2
  • Extract inverted index entries for each distinct
    term: to, be, or, not.
  • Merge their doc:position lists
  • to:
  • 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433;
    7: 13, 23, 191; ...
  • be:
  • 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
  • Same general method for proximity searches
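
A sketch of the positional check for the two-term phrase "to be", using the postings shown on this slide. For brevity it uses a set lookup per document rather than a true linear merge of the position lists, and the function name `adjacent_matches` is an assumption.

```python
def adjacent_matches(first, second):
    """Return {doc_id: [positions p of first]} where second occurs at p + 1."""
    hits = {}
    # Only documents that contain both terms can match.
    for doc_id in sorted(first.keys() & second.keys()):
        following = set(second[doc_id])
        starts = [p for p in first[doc_id] if p + 1 in following]
        if starts:
            hits[doc_id] = starts
    return hits


# Postings copied from the slide (each list is truncated there with "...").
to_postings = {
    2: [1, 17, 74, 222, 551],
    4: [8, 16, 190, 429, 433],
    7: [13, 23, 191],
}
be_postings = {
    1: [17, 19],
    4: [17, 191, 291, 430, 434],
    5: [14, 19, 101],
}
print(adjacent_matches(to_postings, be_postings))
```

Only document 4 contains both terms, and "to" is immediately followed by "be" at positions 16, 190, 429 and 433. A longer phrase repeats the same check with offsets 2, 3, and so on.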

55
Proximity queries
Sec. 2.4.2
  • LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
  • /k means "within k words of" (on either side).
  • Clearly, positional indexes can be used for such
    queries; biword indexes cannot.
  • Figure 2.12: The merge of postings to handle
    proximity queries.
  • This is a little tricky to do correctly and
    efficiently
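
A sketch of the per-document step of a proximity match, assuming two sorted position lists for the same document. It is a simplified window scan written for illustration, not a transcription of Figure 2.12, and the name `within_k` is hypothetical.

```python
def within_k(pos1, pos2, k):
    """Return all pairs (p1, p2) with |p1 - p2| <= k from two sorted lists."""
    pairs = []
    start = 0
    for p1 in pos1:
        # Positions more than k to the left of p1 can never match again,
        # because later values of p1 are even larger.
        while start < len(pos2) and pos2[start] < p1 - k:
            start += 1
        j = start
        while j < len(pos2) and pos2[j] <= p1 + k:
            pairs.append((p1, pos2[j]))
            j += 1
    return pairs


# Positions of "to" and "be" in document 4 of the earlier example.
print(within_k([8, 16, 190, 429, 433], [17, 191, 291, 430, 434], 2))
```

The window start only ever moves forward, which is what keeps the scan from restarting at the beginning of the second list for every position in the first.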

56
Positional index size
Sec. 2.4.2
  • Need an entry for each occurrence, not just once
    per document
  • Index size depends on average document size
  • Average web page has <1000 terms
  • Books, even some epic poems: easily 100,000
    terms
  • Consider a term with frequency 0.1%

Why?
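
The arithmetic behind the question can be sketched as follows, assuming a term that occurs once per 1000 tokens (0.1%) and is spread evenly through the text. The function name and its parameter are illustrative.

```python
def expected_entries(doc_length, occurrences_per_thousand=1):
    """Return (non-positional postings, positional postings) for one document."""
    occurrences = doc_length * occurrences_per_thousand // 1000
    # A non-positional index stores one posting per document that contains
    # the term, however often the term occurs in it.
    nonpositional = 1 if occurrences >= 1 else 0
    # A positional index stores one entry per occurrence.
    return nonpositional, occurrences


for length in (1000, 100000):
    print(length, expected_entries(length))
```

A 1000-term page contributes one entry either way, while a 100,000-term document contributes one non-positional posting but about 100 positional entries, so the positional index grows with document length.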
57
Rules of thumb
Sec. 2.4.2
  • A positional index is 2–4 times as large as a
    non-positional index
  • Positional index size is 35–50% of the volume of
    the original text
  • Caveat: all of this holds for English-like
    languages

58
Positional index size
Sec. 2.4.2
  • You can compress position values/offsets: we'll
    talk about that in lecture 5
  • Nevertheless, a positional index expands postings
    storage substantially
  • Nevertheless, a positional index is now
    standardly used because of the power and
    usefulness of phrase and proximity queries
    whether used explicitly or implicitly in a
    ranking retrieval system.

59
Combination schemes
Sec. 2.4.3
  • These two approaches can be profitably combined
  • For particular phrases ("Hossein Rezazadeh") it
    is inefficient to keep on merging positional
    postings lists

60
Combination schemes
  • Williams et al. (2004) evaluate a more
    sophisticated mixed indexing scheme
  • A typical web query mixture was executed in ¼ of
    the time of using just a positional index
  • It required 26% more space than having a
    positional index alone
  • H.E. Williams, J. Zobel, and D. Bahle. 2004.
    Fast Phrase Querying with Combined Indexes, ACM
    Transactions on Information Systems.
