Possible Project Topic II: Information Retrieval and Web Search

1
Possible Project Topic II: Information Retrieval
and Web Search
  • Heng Ji
  • hengji@cs.qc.cuny.edu
  • Sept 16, 2008

Acknowledgement: some slides from Jimmy Lin and
Victor Lavrenko
2
Announcement
  • Finalize Proposal Presentation Schedule on the
    course webpage
  • A proposal presentation can include:
  • Motivation
  • Input/Output
  • System Architecture
  • Time Table
  • Implementation Details (optional)
  • Preliminary Demo (optional)
  • Innovative Claims (optional)
  • Send me an email with your preferences for the
    individual review and final presentation dates

3
Outline
  • Introduction
  • IR Approaches and Ranking
  • Query Construction
  • Document Indexing
  • Web Search

4
What is Information Retrieval?
  • Most people equate IR with web-search
  • highly visible, commercially successful endeavors
  • leverage 3 decades of academic research
  • IR: finding any kind of relevant information
  • web pages, news events, answers, images, ...
  • relevance is a key notion

5
IR System
6
What types of information?
  • Text (Documents and portions thereof)
  • XML and structured documents
  • Images
  • Audio (sound effects, songs, etc.)
  • Video
  • Source code
  • Applications/Web services

7
Interesting Examples
  • Google image search
  • Google video search
  • NYU Prof. Sekine's N-gram search
  • http://linserv1.cims.nyu.edu:23232/ngram/
  • INDRI Demo Show
  • http://www.lemurproject.org/indri/

http://images.google.com/
http://video.google.com/
8
What about databases?
  • What are examples of databases?
  • Banks storing account information
  • Retailers storing inventories
  • Universities storing student grades
  • What exactly is a (relational) database?
  • Think of them as a collection of tables
  • They model some aspect of the world

9
A (Simple) Database Example
Student Table
Department Table
Course Table
Enrollment Table
10
Database Queries
  • What would you want to know from a database?
  • What classes is John Arrow enrolled in?
  • Who has the highest grade in LBSC 690?
  • Who's in the history department?
  • Of all the non-CLIS students taking LBSC 690 who
    have a last name shorter than six characters and
    were born on a Monday, who has the longest email
    address?

11
Comparing IR to databases
12
The IR Black Box
(Diagram: documents and a query go into the IR black box; hits come out)
13
Inside The IR Black Box
(Diagram: the query and the documents each pass through a representation
function; the document representations are stored in an index, and a
comparison function matches the query representation against the index to
produce hits)
14
Building the IR Black Box
  • Different models of information retrieval
  • Boolean model
  • Vector space model
  • Language models
  • Representing the meaning of documents
  • How do we capture the meaning of documents?
  • Is meaning just the sum of all terms?
  • Indexing
  • How do we actually store all those words?
  • How do we access indexed terms quickly?

15
Outline
  • Introduction
  • IR Approaches and Ranking
  • Query Construction
  • Document Indexing
  • Web Search

16
The Central Problem in IR
(Diagram: the information seeker's concepts are expressed as query terms;
the authors' concepts are expressed as document terms.
Do these represent the same concepts?)
17
Relevance
  • Relevance is a subjective judgment and may
    include
  • Being on the proper subject.
  • Being timely (recent information).
  • Being authoritative (from a trusted source).
  • Satisfying the goals of the user and his/her
    intended use of the information (information
    need).

18
IR Ranking
  • Early IR focused on set-based retrieval
  • Boolean queries: a set of conditions to be
    satisfied
  • document either matches the query or not
  • like classifying the collection into relevant /
    non-relevant sets
  • still used by professional searchers
  • "advanced search" in many systems
  • Modern IR: ranked retrieval
  • a free-form query expresses the user's information
    need
  • rank documents by decreasing likelihood of
    relevance
  • many studies show it is superior

19
A heuristic formula for IR
  • Rank docs by similarity to the query
  • suppose the query is "cryogenic labs"
  • Similarity = number of query words in the doc
  • favors documents with both "labs" and "cryogenic"
  • mathematically: sim(D, Q) = count of query terms
    that appear in D (see the sketch below)
  • Logical variations (set-based)
  • Boolean AND (require all words)
  • Boolean OR (any of the words)
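
To make the set-based vs. ranked distinction concrete, here is a minimal Python sketch (the documents are made up, and tokenization is just whitespace splitting) of Boolean AND, Boolean OR, and the word-overlap similarity described above:

# Toy illustration of set-based matching vs. word-overlap similarity.
def tokens(text):
    return set(text.lower().split())

def boolean_and(query, doc):          # require all query words
    return tokens(query) <= tokens(doc)

def boolean_or(query, doc):           # any query word suffices
    return bool(tokens(query) & tokens(doc))

def overlap_sim(query, doc):          # similarity = number of query words in the doc
    return len(tokens(query) & tokens(doc))

docs = ["the cryogenic labs opened today",
        "cryogenic storage of samples",
        "university labs hiring students"]
query = "cryogenic labs"
for d in docs:
    print(boolean_and(query, d), boolean_or(query, d), overlap_sim(query, d), "-", d)

Only the first document satisfies the Boolean AND; ranking by overlap puts it first while still returning the partial matches.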

20
Term Frequency (TF)
  • Observation
  • key words tend to be repeated in a document
  • Modify our similarity measure
  • give more weight if word occurs multiple times
  • Problem
  • biased towards long documents
  • spurious occurrences
  • normalize by length

21
Inverse Document Frequency (IDF)
  • Observation
  • rare words carry more meaning: "cryogenic", "apollo"
  • frequent words are linguistic glue: "of", "the",
    "said", "went"
  • Modify our similarity measure
  • give more weight to rare words, but don't be
    too aggressive (why?)
  • C = total number of documents
  • df(q) = number of documents that contain q
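
A minimal sketch of TF-IDF weighting that follows the definitions above, assuming the standard idf(q) = log(C / df(q)) form and length-normalized TF (the slide's own formula image did not survive, so the exact variant shown in class may differ):

import math
from collections import Counter

# Toy corpus; C = total number of documents.
docs = ["cryogenic labs at the university",
        "the apollo program",
        "the labs said the results were good"]
C = len(docs)
tokenized = [d.lower().split() for d in docs]

def df(term):                          # number of documents containing the term
    return sum(term in toks for toks in tokenized)

def idf(term):                         # rare words get higher weight
    return math.log(C / df(term))

def tf(term, toks):                    # term count, normalized by document length
    return Counter(toks)[term] / len(toks)

def score(query, toks):                # sum of tf * idf over query terms
    return sum(tf(q, toks) * idf(q) for q in query.split() if df(q) > 0)

for toks in tokenized:
    print(round(score("cryogenic labs", toks), 3), " ".join(toks))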

22
TF normalization
  • Observation
  • D1 = {cryogenic, labs}, D2 = {cryogenic, cryogenic}
  • which document is more relevant?
  • which one is ranked higher? (df(labs) > df(cryogenic))
  • Correction
  • first occurrence more important than a repeat
    (why?)
  • squash the linearity of TF, e.g. use 1 + log(tf)
    instead of raw tf (see the sketch below)
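
A small sketch of why squashing helps, scoring D1 and D2 for the query "cryogenic labs" with raw TF and with 1 + log(tf). The idf values are invented for illustration; only their order (df(labs) > df(cryogenic), so idf(labs) < idf(cryogenic)) matches the slide:

import math
from collections import Counter

D1 = ["cryogenic", "labs"]             # covers both query words once
D2 = ["cryogenic", "cryogenic"]        # repeats one of them
query = ["cryogenic", "labs"]
idf = {"cryogenic": 2.0, "labs": 1.5}  # hypothetical values

def score(doc, squash):
    counts = Counter(doc)
    total = 0.0
    for q in query:
        tf = counts[q]
        if tf:
            w = 1 + math.log(tf) if squash else tf   # squash the linearity of TF
            total += w * idf[q]
    return total

for name, doc in [("D1", D1), ("D2", D2)]:
    print(name, "raw:", score(doc, False), "squashed:", round(score(doc, True), 2))
# With raw TF, D2's repetition wins; with 1 + log(tf), D1 (covering both words) ranks higher.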

23
State-of-the-art Formula
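
The formula image on this slide did not survive the export. As a hedged stand-in, the sketch below implements Okapi BM25, one widely used formula of this family (sublinear TF, IDF, and document-length normalization); whether this is the formula the slide actually showed is an assumption.

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag-of-words query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    counts = Counter(doc_terms)
    score = 0.0
    for q in query_terms:
        df = sum(q in d for d in corpus)      # document frequency of q
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = counts[q]
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["cryogenic", "labs", "open"], ["apollo", "mission"], ["labs", "hiring"]]
print(round(bm25_score(["cryogenic", "labs"], corpus[0], corpus), 3))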
24
Vector-space approach to IR
(Diagram: documents plotted as vectors in a term space with axes cat, pig,
and dog, e.g. the documents "cat cat", "cat pig", and "pig cat")
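
A minimal sketch of the vector-space view, placing the slide's toy documents as term-count vectors over (cat, pig, dog) and ranking them by cosine similarity; the query vector here is an invented example:

import math

terms = ["cat", "pig", "dog"]
docs = {"cat cat": [2, 0, 0], "cat pig": [1, 1, 0], "pig cat": [1, 1, 0]}
query = [1, 0, 0]                      # hypothetical query: just "cat"

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

for name, vec in docs.items():
    print(name, round(cosine(query, vec), 3))
# "cat cat" points in the same direction as the query (cosine 1.0);
# "cat pig" and "pig cat" are at 45 degrees (cosine about 0.707).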
25
Language-modeling Approach
  • query is a random sample from a perfect
    document
  • words are sampled independently of each other
  • rank documents by the probability of generating
    the query

(Diagram: a document D and a query sampled from it, with example term
probabilities 4/9, 2/9, 4/9, 3/9)
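
A minimal sketch of query-likelihood ranking, assuming the simplest unigram model (term probability = count in the document divided by document length) with no smoothing; the 9-word document below is invented for illustration:

from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """P(query | D) under a unigram model: product of per-term probabilities."""
    counts = Counter(doc_terms)
    n = len(doc_terms)
    p = 1.0
    for q in query_terms:
        p *= counts[q] / n             # words sampled independently of each other
    return p

doc = "frog said that toad likes frog and frog likes".split()  # 9 words, illustrative only
print(query_likelihood(["frog", "likes"], doc))                 # (3/9) * (2/9)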
26
PageRank in Google
27
PageRank in Google (Cont)
(Diagram: a small link graph over pages I1, I2, A, and B)
  • Assign a numeric value to each page
  • The more a page is linked to by important
    pages, the more important that page is
  • d = damping factor (0.85); see the sketch below
  • Many other criteria are used, e.g. proximity of
    query words
  • "information retrieval" (adjacent phrase) is better
    than "information ... retrieval" (far apart)
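
A minimal sketch of the PageRank iteration with d = 0.85 over a hypothetical link graph named after the diagram's pages (the figure's actual link structure is not recoverable, so the links below are an assumption):

# links[p] = pages that p points to (hypothetical graph over I1, I2, A, B).
links = {"I1": ["A"], "I2": ["A"], "A": ["B"], "B": ["I1"]}
pages = list(links)
d = 0.85                               # damping factor
pr = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                    # power iteration until roughly converged
    new = {}
    for p in pages:
        incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
        new[p] = (1 - d) / len(pages) + d * incoming
    pr = new

for p, score in sorted(pr.items(), key=lambda kv: -kv[1]):
    print(p, round(score, 3))
# A is pointed to by the most pages, so it ends up with the highest score.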

28
Outline
  • Introduction
  • IR Approaches and Ranking
  • Query Construction
  • Document Indexing
  • Web Search

29
Keyword Search
  • Simplest notion of relevance is that the query
    string appears verbatim in the document.
  • Slightly less strict notion is that the words in
    the query appear frequently in the document, in
    any order (bag of words).

30
Problems with Keywords
  • May not retrieve relevant documents that include
    synonymous terms.
  • restaurant vs. café
  • PRC vs. China
  • May retrieve irrelevant documents that include
    ambiguous terms.
  • bat (baseball vs. mammal)
  • Apple (company vs. fruit)
  • bit (unit of data vs. act of eating)

31
Query Expansion
  • http://www.lemurproject.org/lemur/IndriQueryLanguage.php
  • Most errors caused by vocabulary mismatch
  • query: "cars", document: "automobiles"
  • solution: automatically add highly related words
  • Thesaurus / WordNet lookup
  • add semantically related words (synonyms)
  • cannot take context into account
  • "rail car" vs. "race car" vs. "car and cdr"
  • Statistical Expansion
  • add statistically related words (co-occurrence);
    see the sketch below
  • very successful
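
A toy sketch of the statistical-expansion idea: pick expansion terms that co-occur most often with a query term in a made-up mini-corpus. Real systems use large collections and better association measures; this only illustrates the mechanism:

from collections import Counter
from itertools import combinations

corpus = ["cars and automobiles for sale",       # hypothetical mini-corpus
          "used cars and trucks",
          "automobiles trucks and engines",
          "race car engines"]

# Count how often pairs of words co-occur in the same document.
cooc = Counter()
for doc in corpus:
    for a, b in combinations(sorted(set(doc.split())), 2):
        cooc[(a, b)] += 1

def expand(term, k=3):
    """Return the k words that co-occur with `term` most often."""
    scores = Counter()
    for (a, b), n in cooc.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return [w for w, _ in scores.most_common(k)]

print(expand("cars"))   # note that "and" ranks high, which is why stoplists matter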

32
IR Query Examples
  • http://nlp.cs.qc.cuny.edu/ir.zip
  • Query:
  • <parameters><query>#combine( #weight( 0.063356
    #1(explosion) 0.187417 #1(blast) 0.411817
    #1(wounded) 0.101370 #1(injured) 0.161191
    #1(death) 0.074849 #1(deaths)) #weight( 0.311760
    #1(Davao City international airport) 0.311760
    #1(Tuesday) 0.103044 #1(DAVAO) 0.195505
    #1(Philippines) 0.019817 #1(DXDC) 0.058113
    #1(Davao Medical Center)))</query></parameters>

33
Outline
  • Introduction
  • IR Approaches and Ranking
  • Query Construction
  • Document Indexing
  • Web Search

34
Document indexing
  • Goal: Find the important meanings and create an
    internal representation
  • Factors to consider
  • Accuracy to represent meanings (semantics)
  • Exhaustiveness (cover all the contents)
  • Facility for computer to manipulate
  • What is the best representation of contents?
  • Character string (char trigrams): not precise enough
  • Word: good coverage, not precise
  • Phrase: poor coverage, more precise
  • Concept: poor coverage, precise

(Diagram: accuracy/precision increases and coverage/recall decreases along
the String → Word → Phrase → Concept axis)
35
Indexer steps
  • Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
36
  • Multiple term entries in a single document are
    merged.
  • Frequency information is added.
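
A minimal sketch of these indexer steps on the two example documents: emit (token, docID) pairs, merge duplicate entries per document, and record term frequencies:

from collections import defaultdict, Counter

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, document ID) pairs.
pairs = [(tok.strip(".").lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# Steps 2-3: merge multiple entries per document and add frequency information.
# postings[term] = {docID: term frequency in that document}
postings = defaultdict(Counter)
for term, doc_id in pairs:
    postings[term][doc_id] += 1

print(dict(postings["caesar"]))   # {1: 1, 2: 2}
print(dict(postings["killed"]))   # {1: 2}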

37
Stopwords / Stoplist
  • function words do not bear useful information for
    IR
  • of, in, about, with, I, although, ...
  • A stoplist contains stopwords, which are not used
    as index terms
  • Prepositions
  • Articles
  • Pronouns
  • Some adverbs and adjectives
  • Some frequent words (e.g. document)
  • The removal of stopwords usually improves IR
    effectiveness
  • A few standard stoplists are commonly used.
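
A small sketch of stoplist filtering before indexing; the stoplist below is a tiny invented one, standing in for the standard lists mentioned above:

# A tiny illustrative stoplist; standard stoplists contain a few hundred words.
stoplist = {"of", "in", "about", "with", "i", "although", "the", "a", "an", "it"}

def index_terms(text):
    """Lowercase, tokenize, and drop stopwords."""
    return [t for t in text.lower().split() if t not in stoplist]

print(index_terms("I read about the retrieval of documents in the library"))
# ['read', 'retrieval', 'documents', 'library']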

38
Stemming
  • Reason
  • Different word forms may bear similar meaning
    (e.g. search, searching); stemming creates a
    standard representation for them
  • Stemming
  • Removing some word endings
  • computer
  • compute
  • computes
  • computing
  • computed
  • computation

→ comput
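
A toy suffix-stripping sketch that reproduces the mapping above; real stemmers such as the Porter stemmer use far more careful rules, so this is only an illustration of the idea:

# Strip the longest matching suffix from a small, hand-picked list.
SUFFIXES = ["ation", "ing", "ed", "er", "es", "e", "s"]

def naive_stem(word):
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

words = ["computer", "compute", "computes", "computing", "computed", "computation"]
print({w: naive_stem(w) for w in words})   # all six map to 'comput'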
39
Lemmatization
  • Transform words to a standard form according to
    their syntactic category.
  • E.g. verb + ing → verb
  • noun + s → noun
  • Needs POS tagging
  • More accurate than stemming, but needs more
    resources
  • crucial to choose stemming/lemmatization rules
  • noise vs. recognition rate
  • compromise between precision and recall
  • light/no stemming: lower recall, higher precision;
    severe stemming: higher recall, lower precision

40
Outline
  • Introduction
  • IR Approaches and Ranking
  • Query Construction
  • Document Indexing
  • Web Search

41
IR on the Web
  • No stable document collection (spider, crawler)
  • Invalid documents, duplication, etc.
  • Huge number of documents (partial collection)
  • Multimedia documents
  • Great variation of document quality
  • Multilingual problem

42
Web Search
  • Application of IR to HTML documents on the World
    Wide Web.
  • Differences
  • Must assemble document corpus by spidering the
    web.
  • Can exploit the structural layout information in
    HTML (XML).
  • Documents change uncontrollably.
  • Can exploit the link structure of the web.

43
Web Search System
IR System
44
  • Technical Backup

45
Some formulas for Sim
  • Dot product
  • Cosine
  • Dice
  • Jaccard

(Diagram: document vector D and query vector Q in a two-dimensional term
space with axes t1 and t2)
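
A minimal sketch of the four similarity measures over term-weight vectors, using their standard generalized forms; the exact formulas on the slide are assumed to be these common ones, and the weight vectors below are invented:

import math

def dot(d, q):
    return sum(a * b for a, b in zip(d, q))

def cosine(d, q):
    return dot(d, q) / (math.sqrt(dot(d, d)) * math.sqrt(dot(q, q)))

def dice(d, q):
    return 2 * dot(d, q) / (dot(d, d) + dot(q, q))

def jaccard(d, q):
    return dot(d, q) / (dot(d, d) + dot(q, q) - dot(d, q))

# Hypothetical weights for document D and query Q over terms (t1, t2).
D = [0.8, 0.3]
Q = [0.4, 0.6]
for name, f in [("dot", dot), ("cosine", cosine), ("dice", dice), ("jaccard", jaccard)]:
    print(name, round(f(D, Q), 3))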