Title: CS4323 INFORMATION RETRIEVAL
1 CS4323 INFORMATION RETRIEVAL
2 CS4323 Semester 2, 2004/05
- Web site
- Mailing list, student email list
- Primary references, additional references
- Topics covered
- Schedule
- Assignments
- Consultation hours
- Student contact person / class representative
- Copies of soft-copy supporting course materials
3 IR courses used as references
- Text Retrieval and Mining at Stanford.edu
- Information Retrieval at Umass.edu
- Information Retrieval at UMBC.edu
- Many of the slides are taken from the courses above; some assignments are taken from them as well.
4 What is Information Retrieval?
- Quite effective (at some things)
- Highly visible (mostly)
- Commercially successful (some of them, so far)
- But what goes on behind the scenes?
- How do they work?
- Is there more to it than the Web?
5 In this course, we ask
- What makes a system like Google or MSN Search tick?
- How does it gather information?
- What tricks does it use?
- Extending beyond the Web
- How can those approaches be made better?
- Natural language understanding?
- User interactions?
- What can we do to make things work quickly?
- Faster computers? Caching?
- Compression?
- How do we decide whether it works well?
- For all queries? For special types of queries?
- On every collection of information?
- What else can we do with the same approach?
- Other media?
- Other tasks?
6 What is Information Retrieval?
- Finding needles in haystacks
- Haystacks are pretty big (the Web, the LOC...)
- Needles can be pretty vague ("find me anything about...")
- Lots of kinds of hay (text, images, video, audio...)
- Compare a user's query to a large collection of documents, and give back a ranked list of the documents that best match the query
7 Comparing IR to databases
8 IR vs. text mining
9 Sample Systems (in flux)
- IR systems
- Verity, Fulcrum, Excalibur, Eurospider
- Hummingbird, Documentum
- Inquery, Smart, Okapi, Lemur, Indri
- Database systems
- Oracle, Informix, Access
- Web search and In-house systems
- West, LEXIS/NEXIS, Dialog
- Lycos, AltaVista, Excite, Yahoo, Google, Northern Light, Teoma, HotBot, Direct Hit, Ask Jeeves
- eLibrary, GOV.Research_center, Inquira
- And countless others...
19 Relevant Items are Similar
- Much of IR depends upon the idea that
- similar vocabulary -> relevant to the same queries
20 Bag of Words
- An effective and popular approach
- Compares words without regard to order (see the sketch below)
- Consider reordering the words in a headline:
- Random: beating takes points falling another Dow 355
- Alphabetical: 355 another beating Dow falling points
- "Interesting": Dow points beating falling 355 another
- Actual: Dow takes another beating, falling 355 points
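A minimal bag-of-words sketch in Python (not from the slides; the tokenizer and the overlap score are illustrative assumptions): it compares a query and a document purely by word counts, ignoring order.

    from collections import Counter

    def bag_of_words(text):
        # Lowercase and split on whitespace; punctuation handling is omitted for brevity.
        return Counter(text.lower().split())

    def overlap(query, doc):
        # Count query-term occurrences that also appear in the document,
        # with no regard to word order.
        q, d = bag_of_words(query), bag_of_words(doc)
        return sum(min(q[t], d[t]) for t in q)

    print(overlap("Dow falling points",
                  "Dow takes another beating, falling 355 points"))  # -> 3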
21 The Point?
- Basis of most IR is a very simple approach
- find words in documents
- compare them to words in a query
- this approach is very effective!
- Other types of features are often used
- phrases
- named entities (people, locations, organizations)
- special features (chemical names, product names)
- difficult to do in general; usually require hand-building
- Focus of research is on improving accuracy and speed
- and on extending the ideas elsewhere
24 Keyword Search
- Input: one or more keywords (with logical connectives AND, OR, NOT); a matching sketch follows this slide
- Which docs are relevant? How?
- term occurrences
- hyperlinks
- etc.
- Document similarity. How?
- Including meaning to find relevant docs. How?
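A minimal Boolean keyword-matching sketch (the tiny collection and the helper names are illustrative assumptions, not part of the course material):

    # Toy collection; a real system would consult an inverted index instead.
    DOCS = {
        1: "information retrieval finds relevant documents",
        2: "databases store structured records",
        3: "web search engines crawl and index pages",
    }

    def matches(doc_text, must=(), should=(), must_not=()):
        # AND terms must all appear, OR terms need at least one hit,
        # and NOT terms must be absent.
        words = set(doc_text.lower().split())
        return (all(t in words for t in must)
                and (not should or any(t in words for t in should))
                and not any(t in words for t in must_not))

    # Example: relevant AND documents, NOT databases
    hits = [d for d, text in DOCS.items()
            if matches(text, must=["relevant", "documents"], must_not=["databases"])]
    print(hits)  # -> [1]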
25 Relevance ranking using terms (TF-IDF)
- The set of relevant docs can be very large, so ranking is needed.
- Give the user the most relevant (highest-ranking) docs.
- Do more occurrences of a term mean more relevance? What about document size?
26 Relevance ranking (cont.)
- A simple way to measure the relevance of doc d to a term t, r(d,t), depends on
- n(d,t): the number of occurrences of term t in doc d
- n(d): the total number of terms in doc d
- r(d,t) = log(1 + n(d,t) / n(d))  (see the sketch below)
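A direct transcription of the slide's formula into Python, as a minimal sketch (whitespace tokenization is an assumption):

    import math

    def tf_relevance(doc_text, term):
        # r(d, t) = log(1 + n(d, t) / n(d)), as on the slide.
        words = doc_text.lower().split()
        n_d = len(words)                  # n(d): total number of terms in the doc
        n_dt = words.count(term.lower())  # n(d, t): occurrences of term t
        return math.log(1 + n_dt / n_d)

    print(tf_relevance("sql is a query language and sql is declarative", "sql"))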
28 Relevance ranking (cont.)
- There are many refinements of this metric that use other information, e.g. a term in the title gets more weight. Whatever the formula, r(d,t) is referred to as term frequency.
- Are all terms worth the same? Should they have the same weight? IDF (inverse document frequency)
- e.g. "web" vs. "Silberschatz"
29 IDF (Inverse document frequency)
- IDF(t) = 1 / n(t)
- n(t): the number of docs that contain term t
- Another formula: IDF(t) = log(n / k)  (see the sketch below)
- n: the total number of docs
- k: the number of docs in which term t appears -- the DF (document frequency)
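A minimal sketch of both IDF variants over a toy collection (the documents are illustrative assumptions):

    import math

    DOCS = [
        "sql is a relational query language",
        "the relational model underlies sql databases",
        "web search ranks pages by popularity",
    ]

    def idf_simple(term):
        # IDF(t) = 1 / n(t), where n(t) = number of docs containing t.
        n_t = sum(term in doc.split() for doc in DOCS)
        return 1 / n_t if n_t else 0.0

    def idf_log(term):
        # Log variant: IDF(t) = log(n / k), with k the document frequency of t.
        n = len(DOCS)
        k = sum(term in doc.split() for doc in DOCS)
        return math.log(n / k) if k else 0.0

    print(idf_simple("sql"), idf_log("sql"))  # "sql" occurs in 2 of the 3 docs
    print(idf_simple("web"), idf_log("web"))  # "web" occurs in 1 of the 3 docs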
30 TF-IDF
- Query Q with more than one term
- r(d,Q) depends on the TFs and IDFs of the query terms
- r(d,Q) = sum over all terms t in Q of TF(d,t) * IDF(t)  (see the sketch below)
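Putting the pieces together, a minimal TF-IDF ranking sketch that uses the earlier r(d,t) as the TF part and log(n/k) as the IDF part (the toy collection and query are illustrative assumptions):

    import math

    DOCS = {
        "d1": "sql is a relational query language",
        "d2": "the relation model underlies sql databases",
        "d3": "web search ranks pages by popularity",
    }

    def tf(words, term):
        return math.log(1 + words.count(term) / len(words))

    def idf(term):
        k = sum(term in text.split() for text in DOCS.values())
        return math.log(len(DOCS) / k) if k else 0.0

    def rank(query):
        # r(d, Q) = sum over query terms t of TF(d, t) * IDF(t)
        scores = {}
        for doc_id, text in DOCS.items():
            words = text.split()
            scores[doc_id] = sum(tf(words, t) * idf(t) for t in query.split())
        return sorted(scores.items(), key=lambda s: s[1], reverse=True)

    print(rank("relational query"))  # "d1" should rank first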
31 TF-IDF review
- Two-fold heuristics based on frequency
- TF (term frequency)
- More frequent within a document -> more relevant to its semantics
- e.g. "query" vs. "commercial"
- IDF (inverse document frequency)
- Less frequent among documents -> more discriminative
- e.g. "algebra" vs. "science"
32 Basic Concepts
- How to select terms to capture basic concepts (see the sketch below)
- Word stopping
- e.g. a, the, always, along
- Word stemming
- e.g. computer, computing, computerize -> compute
- Latent semantic indexing
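A minimal term-selection sketch combining a stop list with a very crude suffix-stripping stemmer (not a real stemmer such as Porter's; the stop list and suffix rules are illustrative assumptions):

    STOP_WORDS = {"a", "the", "always", "along", "is", "and"}
    SUFFIXES = ["izes", "ize", "ing", "ers", "er", "s"]  # crude, illustrative rules

    def stem(word):
        # Strip the first matching suffix; real stemmers are far more careful.
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) > len(suf) + 2:
                return word[:-len(suf)]
        return word

    def index_terms(text):
        # Drop stop words, then stem whatever remains.
        return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

    print(index_terms("The computer is computing and computerizes always"))
    # -> ['comput', 'comput', 'computer'] (single-pass stripping, so not perfectly uniform)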
33 Relevance ranking using hyperlinks
- People tend to search for popular docs.
- For web documents (i.e. web pages), relevance based only on the text is too simplistic.
- A popular web site has many other sites linking to it.
- The popularity of a site is therefore something we want to formalize and measure.
- PageRank, used by Google, is a ranking technique that is independent of the query (see the sketch below).
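A minimal PageRank sketch via power iteration over a tiny link graph (greatly simplified compared to the real algorithm; the graph, damping factor, and iteration count are illustrative assumptions):

    # LINKS[page] = pages that `page` links to
    LINKS = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for p, outgoing in links.items():
                # Each page shares its rank equally among the pages it links to.
                for q in outgoing:
                    new_rank[q] += damping * rank[p] / len(outgoing)
            rank = new_rank
        return rank

    print(pagerank(LINKS))  # "C" ends up with the highest score; note the query plays no role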
34 Similarity-based Retrieval
- The user's query is itself a document, A
- Collection of documents: B, C, D, E, F, G
- From doc A, find the terms with the largest r(A,t) values -> t1, t2, t3
- Use t1, t2, t3 to find the most relevant docs among B, C, ..., G (see the sketch below)
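A minimal sketch of this "more like this" idea (the toy collection is an illustrative assumption): take the top-weighted terms of document A and use them as a query against the collection. Plain term frequency stands in for r(A,t) here.

    from collections import Counter

    QUERY_DOC = "relational databases use sql queries over relational tables"
    COLLECTION = {
        "B": "sql queries retrieve rows from relational tables",
        "C": "image retrieval compares visual features",
        "D": "web crawlers gather pages for indexing",
    }

    def top_terms(text, k=3):
        # Stand-in for "terms with the largest r(A, t)": simply the k most frequent.
        return [t for t, _ in Counter(text.split()).most_common(k)]

    def rank_by_overlap(terms):
        scores = {d: sum(t in text.split() for t in terms)
                  for d, text in COLLECTION.items()}
        return sorted(scores.items(), key=lambda s: s[1], reverse=True)

    print(rank_by_overlap(top_terms(QUERY_DOC)))  # "B" should come out on top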
35 Indexing of Documents
- An inverted index maps each keyword to the set of docs that contain it.
- The mapping provides not just the identifiers of the docs but also the locations of the keyword within them (see the sketch below).
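A minimal positional inverted index sketch (the documents and whitespace tokenization are illustrative assumptions): each term maps to the docs that contain it, together with the word positions inside each doc.

    from collections import defaultdict

    DOCS = {
        "d1": "information retrieval ranks documents",
        "d2": "an inverted index maps terms to documents",
    }

    def build_index(docs):
        # index[term][doc_id] = list of word positions of `term` in that doc
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.lower().split()):
                index[term][doc_id].append(pos)
        return index

    index = build_index(DOCS)
    print(dict(index["documents"]))  # -> {'d1': [3], 'd2': [6]}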
36 Measuring retrieval effectiveness
- False negative (false drop), false positive
- True positive, true negative
- Two metrics (see the sketch below)
- Precision: the fraction of retrieved docs that are actually relevant (how accurate the answer is)
- Recall: the fraction of docs relevant to the query that were retrieved
- Ideally, both precision and recall are 100%
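A minimal sketch computing both metrics for a single query (the retrieved and relevant sets are illustrative assumptions):

    def precision_recall(retrieved, relevant):
        # Precision: fraction of retrieved docs that are relevant.
        # Recall: fraction of relevant docs that were retrieved.
        retrieved, relevant = set(retrieved), set(relevant)
        true_pos = len(retrieved & relevant)
        precision = true_pos / len(retrieved) if retrieved else 0.0
        recall = true_pos / len(relevant) if relevant else 0.0
        return precision, recall

    # 3 of the 4 retrieved docs are relevant; 3 of the 5 relevant docs were found.
    print(precision_recall(["d1", "d2", "d3", "d7"],
                           ["d1", "d2", "d3", "d4", "d5"]))  # -> (0.75, 0.6)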
37 Web search engines
- Web crawlers: programs that locate and gather information on the WWW for indexing; the web pages themselves may or may not be stored (see the sketch below)
- The crawling process takes a long time because of the huge size of the WWW, so it is commonly parallelized
- PageRank, PigeonRank
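A minimal crawling sketch over an in-memory toy "web" (no real networking; the page graph and the fake fetch are illustrative assumptions), showing the frontier-based traversal that real crawlers parallelize:

    from collections import deque

    # Toy web: each URL maps to (page text, outgoing links).
    WEB = {
        "http://a.example": ("start page", ["http://b.example", "http://c.example"]),
        "http://b.example": ("about retrieval", ["http://c.example"]),
        "http://c.example": ("about ranking", []),
    }

    def crawl(seed):
        # Breadth-first traversal of the link graph from a seed URL.
        # A real crawler fetches pages over HTTP and parallelizes this loop.
        frontier, seen, pages = deque([seed]), {seed}, {}
        while frontier:
            url = frontier.popleft()
            text, links = WEB[url]   # stand-in for an HTTP fetch
            pages[url] = text        # hand the page text to the indexer
            for link in links:
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages

    print(crawl("http://a.example"))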
38 Assignment 1 (due Monday, 7 March)
- 22.15 Compute the relevance (using TF, IDF) of each of the questions in chapter 22 to the query "SQL relation"
- 22.16 What is the difference between a false positive and a false negative?
- 22.17 Suppose you want to find documents that contain at least k of a given set of n keywords. Suppose also you have a keyword index that gives you a (sorted) list of identifiers of documents that contain a specified keyword. Give an efficient algorithm to find the desired set of documents.
- Note: you must genuinely understand your answers. You will be questioned about them orally.
39 REMINDER
- The next meeting (Monday, 28 Feb) is still part of the introduction.
- Before the next meeting, study
- the decision-support chapter of the Ramakrishnan book
- the Chauduri paper
- Prepare the assignments as early as possible