CS4323 INFORMATION RETRIEVAL - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: CS4323 INFORMATION RETRIEVAL


1
CS4323 INFORMATION RETRIEVAL
  • Lecture 2
  • Introduction

2
CS4323 sem2 2004/05
  • Web site
  • Mailing list, student email list
  • Main and supplementary references
  • Topics covered
  • Schedule
  • Assignments
  • Office hours
  • Student contact person / class representative
  • Soft copies of supporting course materials

3
IR courses used as references
  • Text Retrieval and Mining at Stanford.edu
  • Information Retrieval at Umass.edu
  • Information Retrieval at UMBC.edu
  • Many of these slides are taken from the courses
    above, and some of the assignments are drawn from
    them as well.

4
What is Information Retrieval?
  • Quite effective (at some things)
  • Highly visible (mostly)
  • Commercially successful (some of them, so far)
  • But what goes on behind the scenes?
  • How do they work?
  • Is there more to it than the Web?

5
In this course, we ask
  • What makes a system like Google or MSN Search
    tick?
  • How does it gather information?
  • What tricks does it use?
  • Extending beyond the Web
  • How can those approaches be made better?
  • Natural language understanding?
  • User interactions?
  • What can we do to make things work quickly?
  • Faster computers? Caching?
  • Compression?
  • How do we decide whether it works well?
  • For all queries? For special types of queries?
  • On every collection of information?
  • What else can we do with the same approach?
  • Other media?
  • Other tasks?

6
What is Information Retrieval?
  • Finding needles in haystacks
  • Haystacks are pretty big (the Web, the LOC...)
  • Needles can be pretty vague ("find me anything
    about...")
  • Lots of kinds of hay (text, images, video,
    audio...)
  • Compare a user's query to a large collection of
    documents, and give back a ranked list of the
    documents that best match the query

7
Comparing IR to databases
8
IR vs text mining
9
Sample Systems (in flux)
  • IR systems
  • Verity, Fulcrum, Excalibur, Eurospider
  • Hummingbird, Documentum
  • Inquery, Smart, Okapi, Lemur, Indri
  • Database systems
  • Oracle, Informix, Access
  • Web search and In-house systems
  • West, LEXIS/NEXIS, Dialog
  • Lycos, AltaVista, Excite, Yahoo, Google, Northern
    Light, Teoma, HotBot, Direct Hit,
  • Ask Jeeves
  • eLibrary, GOV.Research_center, Inquira
  • And countless others...

10-18
(No Transcript: image-only slides)
19
Relevant Items are Similar
  • Much of IR depends upon the idea that
  • similar vocabulary → relevant to the same queries

20
Bag of Words
  • An effective and popular approach
  • Compares words without regard to order
  • Consider reordering the words in a headline
  • Random: beating takes points falling another Dow
    355
  • Alphabetical: 355 another beating Dow falling
    points
  • Interesting: Dow points beating falling 355
    another
  • Actual: Dow takes another beating, falling 355
    points
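The bag-of-words idea can be sketched in a few lines of Python: the headline and its reorderings all map to the same bag. The whitespace tokenizer and comma stripping here are toy assumptions, not anything prescribed by the slides.

```python
from collections import Counter

# Bag-of-words sketch: word order is ignored, only term counts matter.
def bag_of_words(text):
    return Counter(text.lower().replace(",", "").split())

actual   = bag_of_words("Dow takes another beating, falling 355 points")
shuffled = bag_of_words("355 another beating Dow falling points takes")
# Every reordering of the headline produces the same bag.
same = (actual == shuffled)
```

Under this representation, the "random", "alphabetical", and "actual" orderings above are indistinguishable.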

21
The Point?
  • The basis of most IR is a very simple approach
  • find words in documents
  • compare them to words in a query
  • this approach is very effective!
  • Other types of features are often used
  • phrases
  • named entities (people, locations, organizations)
  • special features (chemical names, product names)
  • difficult to do in general; usually require
    hand-building
  • The focus of research is on improving accuracy and
    speed, and on extending these ideas elsewhere

22-23
(No Transcript: image-only slides)
24
Keyword Search
  • Input one or more keywords (with logical
    connectives AND, OR, NOT)
  • Which docs are relevant? How?
  • term occurrences
  • hyperlinks
  • etc.
  • Document similarity. How?
  • Incorporating meaning to find relevant docs. How?

25
Relevance ranking using terms (TF IDF)
  • The set of relevant docs can be very large, so
    ranking is needed.
  • Give the user the most relevant (highest-ranking)
    docs.
  • Do more term occurrences mean more relevance? What
    about document size?

26
Relevance ranking (..cont)
  • A simple way to measure the relevance of doc d to
    a term t is
  • r(d,t)
  • which depends on
  • n(d,t) = number of occurrences of term t in doc d
  • n(d) = total number of terms in doc d
  • r(d,t) = log( 1 + n(d,t)/n(d) )
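As a minimal sketch, the formula above translates directly into Python (the example document is hypothetical):

```python
import math

# Term-frequency relevance as on the slide:
#   r(d, t) = log(1 + n(d,t) / n(d))
def tf(doc_terms, t):
    n_dt = doc_terms.count(t)   # n(d,t): occurrences of t in doc d
    n_d = len(doc_terms)        # n(d): total terms in doc d
    return math.log(1 + n_dt / n_d)

doc = "sql query uses sql relation".split()
score = tf(doc, "sql")          # n(d,t) = 2, n(d) = 5
```

Note that dividing by n(d) normalizes for document size, addressing the "how about doc size" question on the previous slide.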

27
(No Transcript)
28
Relevance ranking (..cont)
  • There are many improvements to this metric that
    use other info, e.g., a term in the title gets
    more weight. Whatever the formula, r(d,t) is
    referred to as the term frequency.
  • Do all terms have the same worth? The same
    weight? IDF (inverse document frequency)
  • "Web" vs "Silberschatz"

29
IDF(Inverse document frequency)
  • IDF(t) = 1 / n(t)
  • n(t) = number of docs that contain term t
  • Another formula: IDF(t) = log( n / k ), where
  • n = total number of docs, k = number of docs in
    which term t appears
  • k is the DF (document frequency)

30
TF IDF
  • Query Q with more than one term
  • r(d,Q) depends on the TFs and IDFs of the terms
  • r(d,Q) = Σ TF · IDF
  • (summed over all terms in the query)
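Combining the two slides, a toy TF-IDF ranker might look as follows. It assumes the log-based TF from slide 26 and IDF(t) = log(n/k); the three-document corpus is invented for illustration.

```python
import math

# TF-IDF ranking sketch: r(d,Q) = sum over query terms of TF(d,t) * IDF(t)
def rank(docs, query_terms):
    n = len(docs)
    scores = {}
    for doc_id, terms in docs.items():
        score = 0.0
        for t in query_terms:
            k = sum(1 for d in docs.values() if t in d)  # doc frequency of t
            if k == 0:
                continue                                  # term absent everywhere
            tf = math.log(1 + terms.count(t) / len(terms))
            idf = math.log(n / k)
            score += tf * idf
        scores[doc_id] = score
    # Return doc ids, highest-scoring first
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "sql relation query".split(),
        2: "sql sql sql tuning".split(),
        3: "cooking with relation".split()}
order = rank(docs, ["sql", "relation"])
```

Doc 1 matches both query terms and so outranks doc 2, which repeats only one of them.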

31
TF IDF - review
  • Two-fold heuristics based on frequency
  • TF (Term frequency)
  • More frequent within a document → more relevant
    to semantics
  • e.g. "query" vs. "commercial"
  • IDF (Inverse document frequency)
  • Less frequent among documents → more
    discriminative
  • e.g. "algebra" vs. "science"

32
Basic Concepts
  • How to select terms to capture basic concepts
  • Word stopping
  • e.g. a, the, always, along
  • Word stemming
  • e.g. computer, computing, computerize →
    compute
  • Latent semantic indexing
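Word stopping and stemming can be sketched as below. The stopword set comes from the slide; the suffix-stripping "stemmer" is a deliberately crude stand-in (real systems use e.g. the Porter stemmer), so it maps the slide's examples to "comput" rather than "compute".

```python
# Toy word stopping and suffix-stripping "stemming" sketch.
STOPWORDS = {"a", "the", "always", "along"}

def stem(word):
    # Naive assumption: strip a few common suffixes, keeping a minimum stem.
    for suffix in ("erize", "ing", "er", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

terms = [stem(w) for w in "the computer is computing".split()
         if w not in STOPWORDS]
```

After stopping and stemming, "computer" and "computing" collapse to the same index term, so a query with one will match documents containing the other.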

33
Relevance ranking using hyperlinks
  • People tend to search for popular docs.
  • For web documents (i.e., web pages), relevance
    based on the text alone is simplistic.
  • Many other sites link to a popular web site.
  • The popularity of a site is important to
    formalize and measure.
  • PageRank, used by Google, is a ranking technique
    that is independent of the query.
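A minimal power-iteration sketch of the PageRank idea, on a hypothetical three-page graph; the damping factor 0.85 and the fixed iteration count are conventional choices, not values from the slides.

```python
# PageRank sketch: a page is important if important pages link to it.
# links[i] lists the pages that page i links to; d is the damping factor.
def pagerank(links, d=0.85, iters=50):
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) / n] * n          # random-jump component
        for i, outs in enumerate(links):
            if outs:
                share = d * rank[i] / len(outs)
                for j in outs:           # distribute rank along out-links
                    new[j] += share
            else:                        # dangling page: spread rank evenly
                for j in range(n):
                    new[j] += d * rank[i] / n
        rank = new
    return rank

# Page 0 is linked to by pages 1 and 2; page 2 has no in-links.
ranks = pagerank([[1], [0], [0, 1]])
```

Note that these scores depend only on the link graph, which is what makes the technique query-independent.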

34
Similarity-based Retrieval
  • The user's query is itself a document, A
  • Collection of documents: B, C, D, E, F, G
  • Doc A → find the terms with the biggest r(A,t)
    values → t1, t2, t3
  • Use t1, t2, t3 to find the most relevant doc
    among B, C, ..., G

35
Indexing of Documents
  • An inverted index maps each keyword to the set of
    docs that contain that keyword.
  • The mapping provides not just the identifiers of
    the docs but also the locations of the keyword
    within them.
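A positional inverted index of this kind can be sketched as below; the two-document corpus and whitespace tokenization are toy assumptions.

```python
from collections import defaultdict

# Positional inverted index: term -> doc id -> positions of term in that doc.
def build_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

docs = {1: "SQL is a relational query language",
        2: "the query optimizer rewrites SQL"}
index = build_index(docs)
# Docs containing both "sql" and "query" (a Boolean AND over posting lists):
hits = set(index["sql"]) & set(index["query"])
```

Storing positions (not just doc ids) is what later enables phrase and proximity queries.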

36
Measuring retrieval effectiveness
  • False negative (false drop), false positive
  • True positive, true negative
  • Two metrics
  • Precision: the fraction of retrieved docs that
    are actually relevant (how accurate the result is)
  • Recall: the fraction of docs relevant to the
    query that were retrieved
  • Ideally, both precision and recall = 100%
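The two metrics reduce to simple set arithmetic; the retrieved and relevant doc-id sets below are hypothetical.

```python
# Precision and recall over sets of document ids.
def precision_recall(retrieved, relevant):
    tp = len(retrieved & relevant)                        # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 3 truly relevant, 2 of them found:
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5})
```

Here doc 3 is a false positive (hurting precision) and doc 5 a false negative (hurting recall), illustrating the tension between the two metrics.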

37
Web search engines
  • Web crawlers: programs that locate and gather
    information on the WWW for indexing; the web
    pages may or may not be stored
  • The crawling process takes a long time because of
    the huge size of the WWW, so it is commonly
    parallelized
  • PageRank, PigeonRank

38
Assignment 1 (due Monday, 7 March)
  • 22.15 Compute the relevance (using TF and IDF) of
    each of the questions in chapter 22 to the query
    "SQL relation"
  • 22.16 What is the difference between a false
    positive and a false negative?
  • 22.17 Suppose you want to find documents that
    contain at least k of a given set of n keywords.
    Suppose also you have a keyword index that gives
    you a (sorted) list of identifiers of documents
    that contain a specified keyword. Give an
    efficient algorithm to find the desired set of
    documents.
  • Note: you must really understand your answers;
    they will be examined orally.

39
REMINDER
  • The next meeting (Monday, 28 Feb) is still
    introduction.
  • Before the next meeting, study:
  • The decision-support chapter of Ramakrishnan's
    book
  • The Chauduri paper
  • Prepare the assignments as early as possible