Title: CS4323 INFORMATION RETRIEVAL
1 CS4323 INFORMATION RETRIEVAL
2 CS4323 Semester 2, 2004/05
- Web site
- Mailing list, student email list
- Primary references, additional references
- Topics covered
- Schedule
- Assignments
- Consultation hours
- Student contact person / class representative
- Copies of soft-copy supporting course materials
3 IR courses used as references
- Text Retrieval and Mining at Stanford.edu
- Information Retrieval at Umass.edu
- Information Retrieval at UMBC.edu
- Many of the slides are taken from the courses above; some assignments are taken from them as well.
4 What is Information Retrieval?
- Quite effective (at some things)
- Highly visible (mostly)
- Commercially successful (some of them, so far)
- But what goes on behind the scenes?
- How do they work?
- Is there more to it than the Web?
5 In this course, we ask
- What makes a system like Google or MSN Search tick?
- How does it gather information?
- What tricks does it use?
- Extending beyond the Web
- How can those approaches be made better?
- Natural language understanding?
- User interactions?
- What can we do to make things work quickly?
- Faster computers? Caching?
- Compression?
- How do we decide whether it works well?
- For all queries? For special types of queries?
- On every collection of information?
- What else can we do with the same approach?
- Other media?
- Other tasks?
6 What is Information Retrieval?
- Finding needles in haystacks
- Haystacks are pretty big (the Web, the LOC...)
- Needles can be pretty vague ("find me anything about...")
- Lots of kinds of hay (text, images, video, audio...)
- Compare a user's query to a large collection of documents, and give back a ranked list of the documents that best match the query
7 Comparing IR to databases
8 IR vs. text mining
9 Sample Systems (in flux)
- IR systems
- Verity, Fulcrum, Excalibur, Eurospider
- Hummingbird, Documentum
- Inquery, Smart, Okapi, Lemur, Indri
- Database systems
- Oracle, Informix, Access
- Web search and In-house systems
- West, LEXIS/NEXIS, Dialog
- Lycos, AltaVista, Excite, Yahoo, Google, Northern Light, Teoma, HotBot, Direct Hit, Ask Jeeves
- eLibrary, GOV.Research_center, Inquira
- And countless others...
19 Relevant Items are Similar
- Much of IR depends upon the idea that
- similar vocabulary -> relevant to the same queries
20 Bag of Words
- An effective and popular approach
- Compares words without regard to order (see the sketch below)
- Consider reordering the words in a headline:
- Random: beating takes points falling another Dow 355
- Alphabetical: 355 another beating Dow falling points
- "Interesting": Dow points beating falling 355 another
- Actual: Dow takes another beating, falling 355 points
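A minimal bag-of-words sketch in Python (not from the slides; the tokenizer and the overlap score are illustrative assumptions): it compares a query and a document purely by word counts, ignoring order.

    from collections import Counter

    def bag_of_words(text):
        # Lowercase and split on whitespace; punctuation handling is omitted for brevity.
        return Counter(text.lower().split())

    def overlap(query, doc):
        # Count query-term occurrences that also appear in the document,
        # with no regard to word order.
        q, d = bag_of_words(query), bag_of_words(doc)
        return sum(min(q[t], d[t]) for t in q)

    print(overlap("Dow falling points",
                  "Dow takes another beating, falling 355 points"))  # -> 3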
21 The Point?
- Basis of most IR is a very simple approach
- find words in documents
- compare them to words in a query
- this approach is very effective!
- Other types of features are often used
- phrases
- named entities (people, locations, organizations)
- special features (chemical names, product names)
- difficult to do in general; usually require hand-building
- Focus of research is on improving accuracy and speed
- and on extending the ideas elsewhere
24 Keyword Search
- Input: one or more keywords (with logical connectives AND, OR, NOT); a matching sketch follows this slide
- Which docs are relevant? How?
- term occurrences
- hyperlinks
- etc.
- Document similarity. How?
- Including meaning to find relevant docs. How?
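A minimal Boolean keyword-matching sketch (the tiny collection and the helper names are illustrative assumptions, not part of the course material):

    # Toy collection; a real system would consult an inverted index instead.
    DOCS = {
        1: "information retrieval finds relevant documents",
        2: "databases store structured records",
        3: "web search engines crawl and index pages",
    }

    def matches(doc_text, must=(), should=(), must_not=()):
        # AND terms must all appear, OR terms need at least one hit,
        # and NOT terms must be absent.
        words = set(doc_text.lower().split())
        return (all(t in words for t in must)
                and (not should or any(t in words for t in should))
                and not any(t in words for t in must_not))

    # Example: relevant AND documents, NOT databases
    hits = [d for d, text in DOCS.items()
            if matches(text, must=["relevant", "documents"], must_not=["databases"])]
    print(hits)  # -> [1]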
25 Relevance ranking using terms (TF-IDF)
- The set of relevant docs can be very large, so ranking is needed.
- Give the user the most relevant (highest-ranking) docs.
- Do more occurrences of a term mean more relevance? What about document size?
26 Relevance ranking (cont.)
- A simple way to measure the relevance of doc d to a term t, r(d,t), depends on
- n(d,t): the number of occurrences of term t in doc d
- n(d): the total number of terms in doc d
- r(d,t) = log(1 + n(d,t) / n(d))  (see the sketch below)
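A direct transcription of the slide's formula into Python, as a minimal sketch (whitespace tokenization is an assumption):

    import math

    def tf_relevance(doc_text, term):
        # r(d, t) = log(1 + n(d, t) / n(d)), as on the slide.
        words = doc_text.lower().split()
        n_d = len(words)                  # n(d): total number of terms in the doc
        n_dt = words.count(term.lower())  # n(d, t): occurrences of term t
        return math.log(1 + n_dt / n_d)

    print(tf_relevance("sql is a query language and sql is declarative", "sql"))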
28 Relevance ranking (cont.)
- There are many refinements of this metric that use other information, e.g. a term in the title gets more weight. Whatever the formula, r(d,t) is referred to as term frequency.
- Are all terms worth the same? Should they have the same weight? IDF (inverse document frequency)
- e.g. "web" vs. "Silberschatz"
29 IDF (Inverse document frequency)
- IDF(t) = 1 / n(t)
- n(t): the number of docs that contain term t
- Another formula: IDF(t) = log(n / k)  (see the sketch below)
- n: the total number of docs
- k: the number of docs in which term t appears -- the DF (document frequency)
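A minimal sketch of both IDF variants over a toy collection (the documents are illustrative assumptions):

    import math

    DOCS = [
        "sql is a relational query language",
        "the relational model underlies sql databases",
        "web search ranks pages by popularity",
    ]

    def idf_simple(term):
        # IDF(t) = 1 / n(t), where n(t) = number of docs containing t.
        n_t = sum(term in doc.split() for doc in DOCS)
        return 1 / n_t if n_t else 0.0

    def idf_log(term):
        # Log variant: IDF(t) = log(n / k), with k the document frequency of t.
        n = len(DOCS)
        k = sum(term in doc.split() for doc in DOCS)
        return math.log(n / k) if k else 0.0

    print(idf_simple("sql"), idf_log("sql"))  # "sql" occurs in 2 of the 3 docs
    print(idf_simple("web"), idf_log("web"))  # "web" occurs in 1 of the 3 docs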
30 TF-IDF
- Query Q with more than one term
- r(d,Q) depends on the TFs and IDFs of the query terms
- r(d,Q) = sum over all terms t in Q of TF(d,t) * IDF(t)  (see the sketch below)
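Putting the pieces together, a minimal TF-IDF ranking sketch that uses the earlier r(d,t) as the TF part and log(n/k) as the IDF part (the toy collection and query are illustrative assumptions):

    import math

    DOCS = {
        "d1": "sql is a relational query language",
        "d2": "the relation model underlies sql databases",
        "d3": "web search ranks pages by popularity",
    }

    def tf(words, term):
        return math.log(1 + words.count(term) / len(words))

    def idf(term):
        k = sum(term in text.split() for text in DOCS.values())
        return math.log(len(DOCS) / k) if k else 0.0

    def rank(query):
        # r(d, Q) = sum over query terms t of TF(d, t) * IDF(t)
        scores = {}
        for doc_id, text in DOCS.items():
            words = text.split()
            scores[doc_id] = sum(tf(words, t) * idf(t) for t in query.split())
        return sorted(scores.items(), key=lambda s: s[1], reverse=True)

    print(rank("relational query"))  # "d1" should rank first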
31 TF-IDF review
- Two-fold heuristics based on frequency
- TF (term frequency)
- More frequent within a document -> more relevant to its semantics
- e.g. "query" vs. "commercial"
- IDF (inverse document frequency)
- Less frequent among documents -> more discriminative
- e.g. "algebra" vs. "science"
32 Basic Concepts
- How to select terms to capture basic concepts (see the sketch below)
- Word stopping
- e.g. a, the, always, along
- Word stemming
- e.g. computer, computing, computerize -> compute
- Latent semantic indexing
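A minimal term-selection sketch combining a stop list with a very crude suffix-stripping stemmer (not a real stemmer such as Porter's; the stop list and suffix rules are illustrative assumptions):

    STOP_WORDS = {"a", "the", "always", "along", "is", "and"}
    SUFFIXES = ["izes", "ize", "ing", "ers", "er", "s"]  # crude, illustrative rules

    def stem(word):
        # Strip the first matching suffix; real stemmers are far more careful.
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) > len(suf) + 2:
                return word[:-len(suf)]
        return word

    def index_terms(text):
        # Drop stop words, then stem whatever remains.
        return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

    print(index_terms("The computer is computing and computerizes always"))
    # -> ['comput', 'comput', 'computer'] (single-pass stripping, so not perfectly uniform)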
33 Relevance ranking using hyperlinks
- People tend to search for popular docs.
- For web documents (i.e. web pages), relevance based only on the text is too simplistic.
- A popular web site has many other sites linking to it.
- The popularity of a site is therefore something we want to formalize and measure.
- PageRank, used by Google, is a ranking technique that is independent of the query (see the sketch below).
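A minimal PageRank sketch via power iteration over a tiny link graph (greatly simplified compared to the real algorithm; the graph, damping factor, and iteration count are illustrative assumptions):

    # LINKS[page] = pages that `page` links to
    LINKS = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for p, outgoing in links.items():
                # Each page shares its rank equally among the pages it links to.
                for q in outgoing:
                    new_rank[q] += damping * rank[p] / len(outgoing)
            rank = new_rank
        return rank

    print(pagerank(LINKS))  # "C" ends up with the highest score; note the query plays no role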
34 Similarity-based Retrieval
- The user's query is itself a document, A
- Collection of documents: B, C, D, E, F, G
- From doc A, find the terms with the largest r(A,t) values -> t1, t2, t3
- Use t1, t2, t3 to find the most relevant docs among B, C, ..., G (see the sketch below)
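A minimal sketch of this "more like this" idea (the toy collection is an illustrative assumption): take the top-weighted terms of document A and use them as a query against the collection. Plain term frequency stands in for r(A,t) here.

    from collections import Counter

    QUERY_DOC = "relational databases use sql queries over relational tables"
    COLLECTION = {
        "B": "sql queries retrieve rows from relational tables",
        "C": "image retrieval compares visual features",
        "D": "web crawlers gather pages for indexing",
    }

    def top_terms(text, k=3):
        # Stand-in for "terms with the largest r(A, t)": simply the k most frequent.
        return [t for t, _ in Counter(text.split()).most_common(k)]

    def rank_by_overlap(terms):
        scores = {d: sum(t in text.split() for t in terms)
                  for d, text in COLLECTION.items()}
        return sorted(scores.items(), key=lambda s: s[1], reverse=True)

    print(rank_by_overlap(top_terms(QUERY_DOC)))  # "B" should come out on top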
35 Indexing of Documents
- An inverted index maps each keyword to the set of docs that contain it.
- The mapping provides not just the identifiers of the docs but also the locations of the keyword within them (see the sketch below).
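A minimal positional inverted index sketch (the documents and whitespace tokenization are illustrative assumptions): each term maps to the docs that contain it, together with the word positions inside each doc.

    from collections import defaultdict

    DOCS = {
        "d1": "information retrieval ranks documents",
        "d2": "an inverted index maps terms to documents",
    }

    def build_index(docs):
        # index[term][doc_id] = list of word positions of `term` in that doc
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.lower().split()):
                index[term][doc_id].append(pos)
        return index

    index = build_index(DOCS)
    print(dict(index["documents"]))  # -> {'d1': [3], 'd2': [6]}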
36 Measuring retrieval effectiveness
- False negative (false drop), false positive
- True positive, true negative
- Two metrics (see the sketch below)
- Precision: the fraction of retrieved docs that are actually relevant (how accurate the answer is)
- Recall: the fraction of docs relevant to the query that were retrieved
- Ideally, both precision and recall are 100%
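A minimal sketch computing both metrics for a single query (the retrieved and relevant sets are illustrative assumptions):

    def precision_recall(retrieved, relevant):
        # Precision: fraction of retrieved docs that are relevant.
        # Recall: fraction of relevant docs that were retrieved.
        retrieved, relevant = set(retrieved), set(relevant)
        true_pos = len(retrieved & relevant)
        precision = true_pos / len(retrieved) if retrieved else 0.0
        recall = true_pos / len(relevant) if relevant else 0.0
        return precision, recall

    # 3 of the 4 retrieved docs are relevant; 3 of the 5 relevant docs were found.
    print(precision_recall(["d1", "d2", "d3", "d7"],
                           ["d1", "d2", "d3", "d4", "d5"]))  # -> (0.75, 0.6)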
37 Web search engines
- Web crawlers: programs that locate and gather information on the WWW for indexing; the web pages themselves may or may not be stored (see the sketch below)
- The crawling process takes a long time because of the huge size of the WWW, so it is commonly parallelized
- PageRank, PigeonRank
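A minimal crawling sketch over an in-memory toy "web" (no real networking; the page graph and the fake fetch are illustrative assumptions), showing the frontier-based traversal that real crawlers parallelize:

    from collections import deque

    # Toy web: each URL maps to (page text, outgoing links).
    WEB = {
        "http://a.example": ("start page", ["http://b.example", "http://c.example"]),
        "http://b.example": ("about retrieval", ["http://c.example"]),
        "http://c.example": ("about ranking", []),
    }

    def crawl(seed):
        # Breadth-first traversal of the link graph from a seed URL.
        # A real crawler fetches pages over HTTP and parallelizes this loop.
        frontier, seen, pages = deque([seed]), {seed}, {}
        while frontier:
            url = frontier.popleft()
            text, links = WEB[url]   # stand-in for an HTTP fetch
            pages[url] = text        # hand the page text to the indexer
            for link in links:
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages

    print(crawl("http://a.example"))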
38 Assignment 1 (due Monday, 7 March)
- 22.15 Compute the relevance (using TF, IDF) of each of the questions in chapter 22 to the query "SQL relation"
- 22.16 What is the difference between a false positive and a false negative?
- 22.17 Suppose you want to find documents that contain at least k of a given set of n keywords. Suppose also you have a keyword index that gives you a (sorted) list of identifiers of documents that contain a specified keyword. Give an efficient algorithm to find the desired set of documents.
- Note: you must genuinely understand your answers. You will be questioned about them orally.
39 REMINDER
- The next meeting (Monday, 28 Feb) is still part of the introduction.
- Before the next meeting, study
- the decision-support chapter of the Ramakrishnan book
- the Chauduri paper
- Prepare the assignments as early as possible