FullText Indexing - PowerPoint PPT Presentation

Slides: 39
Provided by: umiac7
Transcript and Presenter's Notes

1
Full-Text Indexing
  • Session 10
  • INFM 718N
  • Web-Enabled Databases

2
Agenda
  • How to do it
  • How it works
  • The A Team

3
  • Relational normalization
  • Structured programming
  • Software patterns
  • Object-oriented design
  • Functional decomposition

[Diagram labels: Client Hardware, Web Browser, Database, Server Hardware]
4
Full-Text Indexing in MySQL
  • Create a MyISAM table (InnoDB supports FULLTEXT only from MySQL 5.6 onward)
  • Include a CHAR, VARCHAR, or TEXT field
  • Text fields can hold a bit over 10,000 words
  • Create a FULLTEXT index
  • ALTER TABLE x ADD FULLTEXT INDEX (y)
  • Issue a (ranked) query
  • SELECT y FROM x WHERE MATCH (y) AGAINST ('cat')

5
Other Types of Queries
  • Automatic (ranked) vocabulary expansion
  • SELECT y FROM x WHERE MATCH (y) AGAINST ('cat' WITH
    QUERY EXPANSION)
  • Boolean (unranked) search
  • SELECT y FROM x WHERE MATCH (y) AGAINST ('cat
    -dog' IN BOOLEAN MODE)

6
Query Details
  • No more than 254 characters (40 words)
  • Longer queries take more time
  • Multiple words are implicitly joined by OR
  • Boolean queries can use (unnested) operators
  • Words preceded by + must occur (AND)
  • Words preceded by - must not occur (AND NOT)
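
The operator rules above can be sketched in a few lines of Python. This illustrates the semantics only and is not MySQL's implementation; the function name `boolean_match` and the representation of a row as a set of lowercase words are assumptions made for the example.

```python
def boolean_match(query, row_words):
    """Return True if a row (a set of lowercase words) satisfies the query.

    Assumed semantics, following the slide: '+word' must occur,
    '-word' must not occur, and bare words are implicitly joined by OR.
    """
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            excluded.append(token[1:])
        else:
            optional.append(token)

    if any(word in row_words for word in excluded):
        return False
    if not all(word in row_words for word in required):
        return False
    # With no required words, at least one optional word must be present.
    if not required:
        return any(word in row_words for word in optional)
    return True
```

For example, `boolean_match("cat -dog", {"cat", "fish"})` holds, while a row containing both "cat" and "dog" is rejected.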

7
What's a Word?
  • Delimited by white space or "-"
  • White space includes space, tab, newline, ...
  • Not case sensitive
  • Exact string match
  • No stemming (automatic truncation)
  • Boolean search has additional options
  • Truncation (e.g., time*)
  • Phrases (e.g., "cats and dogs")
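
A minimal sketch of the word rules above, assuming that splitting on white space and hyphens and folding case is all that is needed. The function name `tokenize` is hypothetical, and punctuation handling is deliberately left out.

```python
import re

def tokenize(text):
    """Split on white space or '-', and fold case; no stemming is applied."""
    return [word for word in re.split(r"[\s\-]+", text.lower()) if word]
```

Because there is no stemming, `tokenize("Cats")` yields `["cats"]`, which would not match a query for "cat" under exact string matching.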

8
Unsearchable Words
  • Very common words
  • Those that appear in more than 50% of docs
  • Words of 3 or fewer characters
  • Rarely are topically specific
  • Other stopwords
  • able about above according accordingly across
    actually after afterwards again against ain't
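
The three exclusion rules can be combined into one filter. This is a sketch under stated assumptions: the function name, the parameter names, and their defaults (minimum length 4, document fraction 0.5) are chosen here to mirror the bullets and are not MySQL configuration names.

```python
def searchable_terms(rows, stopwords, min_length=4, max_doc_fraction=0.5):
    """rows: list of token lists. Returns the set of terms that stay searchable."""
    doc_count = {}
    for tokens in rows:
        for term in set(tokens):          # count each term once per row
            doc_count[term] = doc_count.get(term, 0) + 1

    limit = max_doc_fraction * len(rows)
    return {
        term
        for term, n in doc_count.items()
        if len(term) >= min_length        # drop words of 3 or fewer characters
        and term not in stopwords         # drop listed stopwords
        and n <= limit                    # drop words in more than 50% of rows
    }
```

With four rows in which "sleeps" occurs three times, "sleeps" is dropped as too common even though it is not on the stopword list.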

9
Human-Machine Synergy
  • Machines are good at
  • Doing simple things accurately and quickly
  • Scaling to larger collections in sublinear time
  • People are better at
  • Accurately recognizing what they are looking for
  • Evaluating intangibles such as quality
  • Both are pretty bad at
  • Mapping consistently between words and concepts

10
Supporting the Search Process
Source Selection
Choose
11
Supporting the Search Process
Source Selection
12
Taylor's Model of Question Formation
  • Q1 Visceral Need
  • Q2 Conscious Need
  • Q3 Formalized Need
  • Q4 Compromised Need (Query)
  • Diagram labels: Intermediated Search, End-user Search
13
Search Goal
  • Choose the same documents a human would
  • Without human intervention (less work)
  • Faster than a human could (less time)
  • As accurately as possible (less accuracy)
  • Humans start with an information need
  • Machines start with a query
  • Humans match documents to information needs
  • Machines match document and query representations

14
Search Component Model
  • Information Need → Query Formulation → Query
  • Query → Query Processing (Representation Function) → Query Representation
  • Document → Document Processing (Representation Function) → Document Representation
  • Query Representation and Document Representation → Comparison Function → Retrieval Status Value
  • Human Judgment → Utility
15
Relevance
  • Relevance relates a topic and a document
  • Duplicates are equally relevant, by definition
  • Constant over time and across users
  • Pertinence relates a task and a document
  • Accounts for quality, complexity, language, ...
  • Utility relates a user and a document
  • Accounts for prior knowledge
  • We seek utility, but relevance is what we get!

16
Problems With Word Matching
  • Word matching suffers from two problems
  • Synonymy: paper vs. article
  • Homonymy: bank (river) vs. bank (financial)
  • Disambiguation in IR: seek to resolve homonymy
  • Index word senses rather than words
  • Synonymy usually addressed by
  • Thesaurus-based query expansion
  • Latent semantic indexing

17
Bag of Terms Representation
  • Bag: a set that can contain duplicates
  • "The quick brown fox jumped over the lazy dog's
    back" →
  • {back, brown, dog, fox, jump, lazy, over,
    quick, the, the}
  • Vector: values recorded in any consistent order
  • {back, brown, dog, fox, jump, lazy, over, quick,
    the, the} →
  • [1 1 1 1 1 1 1 1 2]
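
The bag and its vector can be reproduced with Python's `collections.Counter`. The term list is copied from the slide, already lowercased and reduced to root forms; alphabetical order is just one possible consistent ordering.

```python
from collections import Counter

# Terms as listed on the slide (already lowercased and reduced to root forms).
terms = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog", "back"]

bag = Counter(terms)                    # a bag: a set that keeps duplicate counts
vocabulary = sorted(bag)                # any consistent order works; here alphabetical
vector = [bag[term] for term in vocabulary]
```

Here `vocabulary` ends with "the" and `vector` ends with 2, matching the last entry of the vector on the slide.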

18
Bag of Terms Example
Document 1: The quick brown fox jumped over the lazy dog's back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword List: for, is, of, the, to

Term     Document 1  Document 2
aid          0           1
all          0           1
back         1           0
brown        1           0
come         0           1
dog          1           0
fox          1           0
good         0           1
jump         1           0
lazy         1           0
men          0           1
now          0           1
over         1           0
party        0           1
quick        1           0
their        0           1
time         0           1
19
Boolean IR
  • Strong points
  • Accurate, if you know the right strategies
  • Efficient for the computer
  • Weaknesses
  • Often results in too many documents, or none
  • Users must learn Boolean logic
  • Sometimes finds relationships that don't exist
  • Words can have many meanings
  • Choosing the right words is sometimes hard

20
Proximity Operators
  • More precise versions of AND
  • NEAR n allows at most n-1 intervening terms
  • WITH requires terms to be adjacent and in order
  • Easy to implement, but less efficient
  • Store a list of positions for each word in each
    doc
  • Stopwords become very important!
  • Perform normal Boolean computations
  • Treat WITH and NEAR like AND with an extra
    constraint
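
A positional index makes both operators straightforward. The sketch below assumes NEAR is order-independent and counts stopwords when numbering positions (which is why stopwords matter here); the function names are hypothetical.

```python
def positions(text):
    """Map each lowercase word to its 1-based positions (stopwords are kept)."""
    index = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        index.setdefault(word.strip(".,"), []).append(pos)
    return index

def near(index, a, b, n):
    """True if a and b occur with at most n-1 intervening terms, in either order."""
    return any(
        0 < abs(pa - pb) <= n
        for pa in index.get(a, [])
        for pb in index.get(b, [])
    )

def with_op(index, a, b):
    """True if b immediately follows a (adjacent and in order)."""
    return any(
        pb - pa == 1
        for pa in index.get(a, [])
        for pb in index.get(b, [])
    )
```

Both functions first require that each word occurs at all, which is the plain AND; the position test is the extra constraint layered on top.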

21
Proximity Operator Example
  • time AND come
  • Doc 2
  • time (NEAR 2) come
  • Empty
  • quick (NEAR 2) fox
  • Doc 1
  • quick WITH fox
  • Empty

Term     Doc 1     Doc 2
aid      0         1 (13)
all      0         1 (6)
back     1 (10)    0
brown    1 (3)     0
come     0         1 (10)
dog      1 (9)     0
fox      1 (4)     0
good     0         1 (7)
jump     1 (5)     0
lazy     1 (8)     0
men      0         1 (8)
now      0         1 (1)
over     1 (6)     0
party    0         1 (16)
quick    1 (2)     0
their    0         1 (15)
time     0         1 (4)
22
Advantages of Ranked Retrieval
  • Closer to the way people think
  • Some documents are better than others
  • Enriches browsing behavior
  • Decide how far down the list to go as you read it
  • Allows more flexible queries
  • Long and short queries can produce useful results

23
Ranked Retrieval Challenges
  • "Best first" is easy to say but hard to do!
  • The best we can hope for is to approximate it
  • Will the user understand the process?
  • It is hard to use a tool that you don't
    understand
  • Efficiency becomes a concern
  • Only a problem for long queries, though

24
Similarity-Based Queries
  • Treat the query as if it were a document
  • Create a query bag-of-words
  • Find the similarity of each document
  • Using the coordination measure, for example
  • Rank order the documents by similarity
  • Most similar to the query first
  • Surprisingly, this works pretty well!
  • Especially for very short queries
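
As a sketch of these steps, the coordination measure is taken here to be the number of distinct terms that the query and a document share. The function names and the tie-breaking rule (lower document id first) are assumptions for the example.

```python
def coordination(query_terms, doc_terms):
    """Number of distinct terms shared by the query and the document."""
    return len(set(query_terms) & set(doc_terms))

def rank(query_terms, docs):
    """docs: {doc_id: list of terms}. Returns matching doc ids, most similar first."""
    scored = [(coordination(query_terms, terms), doc_id) for doc_id, terms in docs.items()]
    # Highest score first; ties broken by document id for a stable order.
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]
```

The query is treated exactly like a document: it is reduced to a set of terms and compared with each document's terms in the same way.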

25
Counting Terms
  • Terms tell us about documents
  • If "rabbit" appears a lot, it may be about
    rabbits
  • Documents tell us about terms
  • "the" is in every document -- not discriminating
  • Documents are most likely described well by rare
    terms that occur in them frequently
  • Higher term frequency is stronger evidence
  • Low collection frequency makes it stronger still

26
The Document Length Effect
  • Humans look for documents with useful parts
  • But probabilities are computed for the whole
  • Document lengths vary in many collections
  • So probability calculations could be inconsistent
  • Two strategies
  • Adjust probability estimates for document length
  • Divide the documents into equal passages

27
Incorporating Term Frequency
  • High term frequency is evidence of meaning
  • And high IDF is evidence of term importance
  • Recompute the bag-of-words
  • Compute TF × IDF for every element
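
A minimal sketch of the reweighting step, assuming IDF = log10(N / df), where N is the number of documents and df is the number of documents containing the term. That form is consistent with the IDF values 0.301, 0.125 and 0.602 in the four-document example in this deck, but it is an assumption rather than a stated definition. The function name and the toy counts are illustrative.

```python
import math

def tf_idf(term_counts):
    """term_counts: {doc_id: {term: tf}} -> {doc_id: {term: tf * idf}}."""
    n_docs = len(term_counts)
    df = {}
    for counts in term_counts.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return {
        doc_id: {term: tf * math.log10(n_docs / df[term]) for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }

# Hypothetical counts: which document holds which term is made up for illustration.
weights = tf_idf({
    1: {"information": 6, "siberia": 2},
    2: {"information": 3, "interesting": 1},
    3: {"information": 3},
    4: {"information": 2},
})
```

A term that occurs in every document, such as "information" here, gets weight zero no matter how often it appears.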

28
TFIDF Example
Term           TF (nonzero, docs 1-4 in order)   IDF     TF × IDF (nonzero, docs 1-4 in order)
complicated    5, 2                              0.301   1.51, 0.60
contaminated   4, 1, 3                           0.125   0.50, 0.13, 0.38
fallout        5, 4, 3                           0.125   0.63, 0.50, 0.38
information    6, 3, 3, 2                        0.000
interesting    1                                 0.602   0.60
nuclear        3, 7                              0.301   0.90, 2.11
retrieval      6, 1, 4                           0.125   0.75, 0.13, 0.50
siberia        2                                 0.602   1.20

  • Unweighted query: contaminated retrieval. Result: 2, 3, 1, 4
  • Weighted query: contaminated(3) retrieval(1). Result: 1, 3, 2, 4
  • IDF-weighted query: contaminated retrieval. Result: 2, 3, 1, 4
29
Document Length Normalization
  • Long documents have an unfair advantage
  • They use a lot of terms
  • So they get more matches than short documents
  • And they use the same words repeatedly
  • So they have much higher term frequencies
  • Normalization seeks to remove these effects
  • Related somehow to maximum term frequency
  • But also sensitive to the number of terms

30
Okapi Term Weights
[Formula not preserved in the transcript; its two factors are labeled "TF component" and "IDF component"]
31
MySQL Term Weights
  • local weight
  • (log(tf)+1)/sumtf * U/(1+0.0115*U)
  • global weight: log((N-nf)/nf)
  • query weight: local weight * global weight * qf
  • tf: How many times the term appears in the row
  • sumtf: The sum of "(log(tf)+1)" for all terms in the same row
  • U: How many unique terms are in the row
  • N: How many rows are in the table
  • nf: How many rows contain the term
  • qf: How many times the term appears in the query
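
The weighting can be written out directly. This sketch assumes the formula reads (log(tf)+1)/sumtf * U/(1+0.0115*U) for the local weight and that log is the natural logarithm; the function name and its argument names are hypothetical, and the code is not taken from MySQL.

```python
import math

def mysql_style_weight(tf, row_tfs, n_rows, rows_with_term, qf=1):
    """Weight of one query term for one row.

    tf             -- times the term appears in the row
    row_tfs        -- tf of every distinct term in the row (including this one)
    n_rows         -- N, rows in the table
    rows_with_term -- nf, rows containing the term
    qf             -- times the term appears in the query
    """
    sumtf = sum(math.log(t) + 1 for t in row_tfs)
    unique = len(row_tfs)                                   # U
    local = (math.log(tf) + 1) / sumtf * unique / (1 + 0.0115 * unique)
    global_weight = math.log((n_rows - rows_with_term) / rows_with_term)
    return local * global_weight * qf
```

Note that log((N-nf)/nf) is zero when a term occurs in exactly half the rows and negative beyond that, which fits the rule that words appearing in more than 50% of rows are unsearchable.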

32
Summary
  • Goal: find documents most similar to the query
  • Compute normalized document term weights
  • Some combination of TF, DF, and Length
  • Optionally, get query term weights from the user
  • Estimate of term importance
  • Compute inner product of query and doc vectors
  • Multiply corresponding elements and then add
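
The final step, sketched with sparse vectors stored as dictionaries. The function name and the weights in the usage example are illustrative, not values prescribed by the slides.

```python
def inner_product(query_weights, doc_weights):
    """Multiply corresponding elements and add; missing terms count as zero."""
    return sum(
        weight * doc_weights.get(term, 0.0)
        for term, weight in query_weights.items()
    )

# Illustrative weights: a user-weighted query against one document vector.
score = inner_product(
    {"contaminated": 3, "retrieval": 1},
    {"contaminated": 0.5, "fallout": 0.63},
)
```

Only terms present in both vectors contribute, so the cost depends on the length of the query rather than on the size of the vocabulary.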

33
The Indexing Process
Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8  Inverted File  Postings
aid       0      0      0      1      0      0      0      1    A > AI         4, 8
all       0      1      0      1      0      1      0      0    A > AL         2, 4, 6
back      1      0      1      0      0      0      1      0    B > BA         1, 3, 7
brown     1      0      1      0      1      0      1      0    B > BR         1, 3, 5, 7
come      0      1      0      1      0      1      0      1    C              2, 4, 6, 8
dog       0      0      1      0      1      0      0      0    D              3, 5
fox       0      0      1      0      1      0      1      0    F              3, 5, 7
good      0      1      0      1      0      1      0      1    G              2, 4, 6, 8
jump      0      0      1      0      0      0      0      0    J              3
lazy      1      0      1      0      1      0      1      0    L              1, 3, 5, 7
men       0      1      0      1      0      0      0      1    M              2, 4, 8
now       0      1      0      0      0      1      0      1    N              2, 6, 8
over      1      0      1      0      1      0      1      1    O              1, 3, 5, 7, 8
party     0      0      0      0      0      1      0      1    P              6, 8
quick     1      0      1      0      0      0      0      0    Q              1, 3
their     1      0      0      0      1      0      1      0    T > TH         1, 5, 7
time      0      1      0      1      0      1      0      0    T > TI         2, 4, 6
34
The Finished Product
Term    Inverted File  Postings
aid     A > AI         4, 8
all     A > AL         2, 4, 6
back    B > BA         1, 3, 7
brown   B > BR         1, 3, 5, 7
come    C              2, 4, 6, 8
dog     D              3, 5
fox     F              3, 5, 7
good    G              2, 4, 6, 8
jump    J              3
lazy    L              1, 3, 5, 7
men     M              2, 4, 8
now     N              2, 6, 8
over    O              1, 3, 5, 7, 8
party   P              6, 8
quick   Q              1, 3
their   T > TH         1, 5, 7
time    T > TI         2, 4, 6
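
Going from documents to postings lists can be sketched as follows. A plain dictionary stands in for the inverted file (the slide shows a letter-prefix structure instead), and the function names are hypothetical. The sample data is a three-document subset of the table above.

```python
def build_postings(docs):
    """docs: {doc_id: iterable of terms} -> {term: sorted list of doc ids}."""
    postings = {}
    for doc_id in sorted(docs):               # visiting ids in order keeps lists sorted
        for term in set(docs[doc_id]):
            postings.setdefault(term, []).append(doc_id)
    return postings

def boolean_and(postings, a, b):
    """Documents that contain both terms."""
    return sorted(set(postings.get(a, [])) & set(postings.get(b, [])))

postings = build_postings({
    3: ["dog", "fox", "jump"],
    5: ["dog", "fox"],
    7: ["fox"],
})
```

Only documents that actually contain a term appear in its list, so none of the zeros of the term-document matrix are stored.
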
35
How Big Is the Postings File?
  • Very compact for Boolean retrieval
  • About 10% of the size of the documents
  • If an aggressive stopword list is used!
  • Not much larger for ranked retrieval
  • Perhaps 20%
  • Enormous for proximity operators
  • Sometimes larger than the documents!

36
Building an Inverted Index
  • Simplest solution is a single sorted array
  • Fast lookup using binary search
  • But sorting large files on disk is very slow
  • And adding one document means starting over
  • Tree structures allow easy insertion
  • But the worst case lookup time is linear
  • Balanced trees provide the best of both
  • Fast lookup and easy insertion
  • But they require 45% more disk space
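
The single sorted array option can be sketched with Python's `bisect` module. The terms and postings are taken from the example index in this deck; the function names are hypothetical, and an in-memory list stands in for a file on disk.

```python
import bisect

# Parallel arrays: terms kept in sorted order, postings at the same position.
terms = ["aid", "all", "back", "brown", "come", "dog", "fox"]
postings = [[4, 8], [2, 4, 6], [1, 3, 7], [1, 3, 5, 7], [2, 4, 6, 8], [3, 5], [3, 5, 7]]

def lookup(term):
    """Binary search: O(log n) comparisons."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

def add_term(term, doc_ids):
    """Insertion must shift every later entry, which is O(n) per new term."""
    i = bisect.bisect_left(terms, term)
    terms.insert(i, term)
    postings.insert(i, list(doc_ids))
```

Lookup is fast, but each insertion moves the tail of the array, which is the cost that tree structures avoid.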

37
How Big is the Inverted Index?
  • Typically smaller than the postings file
  • Depends on number of terms, not documents
  • Eventually, most terms will already be indexed
  • But the postings file will continue to grow
  • Postings dominate asymptotic space complexity
  • Linear in the number of documents

38
Summary
  • Slow indexing yields fast query processing
  • Key fact: most terms don't appear in most
    documents
  • We use extra disk space to save query time
  • Index space is in addition to document space
  • Time and space complexity must be balanced
  • Disk block reads are the critical resource
  • This makes index compression a big win