Search Engine Technology - PowerPoint PPT Presentation

Slides: 42
Provided by: rad2
1
Search Engine Technology 3
http://www.cs.columbia.edu/radev/SET07.html
  • September 20, 2007
  • Prof. Dragomir R. Radev
  • radev_at_umich.edu

2
SET Fall 2007
5. Evaluation of IR systems
Reference collections
TREC
3
Relevance
  • Difficult to change: fuzzy, inconsistent
  • Methods: exhaustive, sampling, pooling,
    search-based

4
Contingency table
               retrieved     not retrieved
relevant       w = tp        x = fn           n1 = w + x
not relevant   y = fp        z = tn
               n2 = w + y                     N
5
Precision and Recall
Recall = w / (w + x)
Precision = w / (w + y)
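
As a minimal sketch, the two ratios can be computed directly from the contingency-table cells w (relevant and retrieved), x (relevant, not retrieved) and y (retrieved, not relevant). The function name and the counts are illustrative assumptions, not part of the slides.

```python
def precision_recall(w, x, y):
    """Return (precision, recall) from contingency-table counts.

    w = relevant and retrieved (tp)
    x = relevant but not retrieved (fn)
    y = retrieved but not relevant (fp)
    """
    recall = w / (w + x) if (w + x) else 0.0
    precision = w / (w + y) if (w + y) else 0.0
    return precision, recall


# Hypothetical counts: 30 relevant documents retrieved, 20 missed, 10 false hits.
p, r = precision_recall(w=30, x=20, y=10)
print(f"precision={p:.2f} recall={r:.2f}")
```
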
6
Exercise
Go to Google (www.google.com) and search for
documents on Tolkien's Lord of the Rings. Try
different ways of phrasing the query, e.g.,
"Tolkien", "JRR Tolkien", "JRR Tolkien Lord of
the Rings", etc. For each query, compute the
precision (P) based on the first 10 documents
returned by Google. Note! Before starting
the exercise, have a clear idea of what a
relevant document for your query should look
like. Try different information needs. Later,
try different queries.
7
From Salton's book
9
Interpolated average precision (e.g., 11-point)
Interpolation: what is precision at recall = 0.5?
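
One common convention, sketched here under stated assumptions, defines the interpolated precision at recall level r as the highest precision observed at any recall greater than or equal to r, and then averages over the 11 levels 0.0, 0.1, ..., 1.0. The function name and the toy ranking below are hypothetical.

```python
def interpolated_precision(ranking, num_relevant, levels=11):
    """Interpolated precision at evenly spaced recall levels.

    ranking: list of booleans, True where the document at that rank is relevant.
    num_relevant: total number of relevant documents for the query.
    """
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / num_relevant, hits / rank))

    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        # Best precision reachable at this recall level or beyond.
        reachable = [p for r, p in points if r >= level - 1e-12]
        curve.append(max(reachable) if reachable else 0.0)
    return curve


# Toy ranking: relevant documents at ranks 1 and 3, two relevant overall.
curve = interpolated_precision([True, False, True, False, False], num_relevant=2)
print("precision at recall 0.5:", curve[5])
print("11-point average:", sum(curve) / len(curve))
```
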
10
Issues
  • Why not use accuracy A = (w + z)/N?
  • Average precision
  • Average P at given document cutoff values
  • Report when P = R
  • F measure: F = (b^2 + 1)PR / (b^2 P + R)
  • F1 measure: F1 = 2/(1/R + 1/P), the harmonic
    mean of P and R
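
A short sketch of the F measure, F = (b^2 + 1)PR / (b^2 P + R), alongside accuracy (w + z)/N. The counts are made up for illustration; they are chosen to show why accuracy is uninformative when almost every document is non-relevant.

```python
def f_measure(p, r, b=1.0):
    """F = (b^2 + 1) P R / (b^2 P + R); b = 1 gives the harmonic mean F1."""
    if p == 0 and r == 0:
        return 0.0
    return (b * b + 1) * p * r / (b * b * p + r)


def accuracy(w, x, y, z):
    """Fraction of all N = w + x + y + z documents classified correctly."""
    return (w + z) / (w + x + y + z)


# Hypothetical collection of 10,000 documents with 50 relevant ones.
# A system that retrieves nothing still scores 99.5% accuracy.
print(accuracy(w=0, x=50, y=0, z=9950))
print(f_measure(p=0.75, r=0.6))         # F1
print(f_measure(p=0.75, r=0.6, b=2.0))  # weights recall more heavily
```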

11
Kappa
  • N: number of items (index i)
  • n: number of categories (index j)
  • k: number of annotators

12
Kappa example (from Manning, Schuetze, Raghavan)
13
Kappa (cont'd)
  • P(A) = 370/400 = 0.925
  • P(-) = (10 + 20 + 70 + 70)/800 = 0.2125
  • P(+) = (10 + 20 + 300 + 300)/800 = 0.7875
  • P(E) = 0.2125 * 0.2125 + 0.7875 * 0.7875 = 0.665
  • K = (0.925 - 0.665)/(1 - 0.665) = 0.776
  • Kappa higher than 0.67 is tentatively acceptable;
    higher than 0.8 is good
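
The arithmetic above can be reproduced with a small function. This is a sketch for the two-judge, two-category case with pooled marginals; the cell counts (300 agreed relevant, 70 agreed non-relevant, 20 and 10 disagreements) are inferred from the sums on the slide, and the function name is an assumption.

```python
def kappa_two_judges(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two judges and two categories, using pooled marginals."""
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n
    # Pool both judges' decisions to estimate the category probabilities.
    p_yes = ((yes_yes + yes_no) + (yes_yes + no_yes)) / (2 * n)
    p_no = 1.0 - p_yes
    p_chance = p_yes ** 2 + p_no ** 2
    return (p_agree - p_chance) / (1.0 - p_chance)


k = kappa_two_judges(yes_yes=300, yes_no=20, no_yes=10, no_no=70)
print(round(k, 3))
```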

14
Sample TREC query
<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy,
passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the
crashworthiness of a given vehicle or vehicles
that can be used to draw a comparison with other
vehicles. The document will have to
describe/compare vehicles, not drivers. For
instance, it should be expected that vehicles
preferred by 16-25 year-olds would be involved in
more crashes, because that age group is involved
in more crashes. I would view number of
fatalities per 100 crashes to be more revealing
of a vehicle's crashworthiness than the number of
crashes per 100,000 miles, for example.
</top>

LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218
LA082690-0158 LA112590-0109 FT944-136 LA020590-0119
FT944-5300 LA052190-0048 LA051689-0139 FT944-9371
LA032390-0172

LA042790-0172 LA021790-0136 LA092289-0167 LA111189-0013
LA120189-0179 LA020490-0021 LA122989-0063 LA091389-0119
LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208
15
<DOC>
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF
ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety
watchdog said Wednesday that the Ford Bronco II
appears to be involved in more fatal
roll-over accidents than other vehicles in its
class and that it will seek to determine if the
vehicle itself contributes to the accidents.
</P>
<P>The decision to do an engineering
analysis of the Ford Motor Co. utility-sport
vehicle grew out of a federal accident study of
the Suzuki Samurai, said Tim Hurd, a spokesman
for the National Highway Traffic Safety
Administration. NHTSA looked at Samurai accidents
after Consumer Reports magazine charged that the
vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study
showed that the "Ford Bronco II appears to have a
higher number of single-vehicle, first event
roll-overs, particularly those involving
fatalities," Hurd said. The engineering analysis
of the Bronco, the second of three levels of
investigation conducted by NHTSA, will cover the
1984-1989 Bronco II models, the agency said.
</P>
<P>According to a Fatal Accident Reporting
System study included in the September report on
the Samurai, 43 Bronco II single-vehicle roll-overs
caused fatalities, or 19 of every 100,000
vehicles. There were eight Samurai fatal
roll-overs, or 6 per 100,000; 13 involving the
Chevrolet S10 Blazers or GMC Jimmy, or 6 per
100,000, and six fatal Jeep Cherokee roll-overs,
for 2.5 per 100,000. After the accident report,
NHTSA declined to investigate the Samurai.
</P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford
Bronco II "appears to have a higher number of
single-vehicle, first event roll-overs," a
federal official said. </P></GRAPHIC>
<SUBJECT>
<P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL
HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE
INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR
CO; AUTOMOBILE SAFETY </P>
</SUBJECT>
</DOC>
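
Documents in this style are simple enough to pick apart with regular expressions. The sketch below assumes ordinary angle-bracket tags such as <DOCNO> ... </DOCNO>; the sample is an abbreviated version of the document above, and the helper name `field` is hypothetical.

```python
import re

# Abbreviated sample in the same SGML-like layout (body text shortened).
SAMPLE = """<DOC>
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF
ROLL-OVER ACCIDENTS </P></HEADLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday ... </P>
</TEXT>
</DOC>"""


def field(doc, tag):
    """Return the whitespace-normalized content of the first <tag> element."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", doc, flags=re.DOTALL)
    if match is None:
        return None
    text = re.sub(r"</?P>", " ", match.group(1))  # drop paragraph markers
    return " ".join(text.split())


print(field(SAMPLE, "DOCNO"))
print(field(SAMPLE, "HEADLINE"))
```
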
16
TREC (cont'd)
  • http://trec.nist.gov/tracks.html
  • http://trec.nist.gov/presentations/presentations.html

17
Most used reference collections
  • Generic retrieval: OHSUMED, CRANFIELD, CACM
  • Text classification: Reuters, 20newsgroups
  • Question answering: TREC-QA
  • Web: DOTGOV, wt100g
  • Blogs: Buzzmetrics datasets
  • TREC ad hoc collections, 2-6 GB
  • TREC Web collections, 2-100 GB

18
Comparing two systems
  • Comparing A and B
  • One query?
  • Average performance?
  • Need A to consistently outperform B

this slide courtesy James Allan
19
The sign test
  • Example 1
  • A > B (12 times)
  • A = B (25 times)
  • A < B (3 times)
  • p < 0.035 (significant at the 5% level)
  • Example 2
  • A > B (18 times)
  • A < B (9 times)
  • p < 0.122 (not significant at the 5% level)
  • http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html

this slide courtesy James Allan
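
A sketch of the two-sided sign test as an exact binomial computation, with ties discarded. Under that assumption the two examples come out at roughly 0.0352 and 0.1221, in line with the rounded figures quoted on the slide. The function name is illustrative.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test; queries where A = B are left out."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins for the weaker system under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(round(sign_test_p(12, 3), 4))  # Example 1
print(round(sign_test_p(18, 9), 4))  # Example 2
```
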
20
The t-test
  • Takes into account the actual performances, not
    just which system is better
  • http://www.socialresearchmethods.net/kb/stat_t.php
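
For per-query comparisons the paired form of the t-test is a natural fit. The sketch below computes only the t statistic from the per-query score differences; the scores are invented for illustration, and looking up the p-value (with n - 1 degrees of freedom) is left to a table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Paired t statistic: mean difference divided by its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Hypothetical average precision (in percent) of systems A and B on four queries.
a = [50, 60, 70, 80]
b = [40, 50, 50, 70]
print(paired_t(a, b))  # compare against a t distribution with 3 degrees of freedom
```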

21
SET Fall 2007
6. Automated indexing/labeling
Compression
22
Recap
  • 1. Introduction
  • 2. IR models, the Boolean model
  • 3. Storing, indexing, searching
  • 4. Word distributions, TFIDF
  • HW1 about local indexing and searching
  • HW2 coming up about blog classification
  • HW3 coming up about web crawling, information
    extraction, and graph mining

23
Indexing methods
  • Manual: e.g., Library of Congress subject
    headings, MeSH
  • Automatic: e.g., TFIDF based

24
LOC subject headings
A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION
     RESOURCES (GENERAL)
http://www.loc.gov/catdir/cpso/lcco/lcco.html
25
Medicine
CLASS R - MEDICINE
Subclass R
R5-920        Medicine (General)
R5-130.5      General works
R131-687      History of medicine. Medical expeditions
R690-697      Medicine as a profession. Physicians
R702-703      Medicine and the humanities. Medicine and disease
              in relation to history, literature, etc.
R711-713.97   Directories
R722-722.32   Missionary medicine. Medical missionaries
R723-726      Medical philosophy. Medical ethics
R726.5-726.8  Medicine and disease in relation to psychology.
              Terminal care. Dying
R727-727.5    Medical personnel and the public. Physician and
              the public
R728-733      Practice of medicine. Medical practice economics
R735-854      Medical education. Medical schools. Research
R855-855.5    Medical technology
R856-857      Biomedical engineering. Electronics. Instrumentation
R858-859.7    Computer applications to medicine. Medical
              informatics
R864          Medical records
R895-920      Medical physics. Medical radiology. Nuclear medicine
26
Automatic methods
  • TFIDF: pick terms with the highest TFIDF scores
  • Centroid-based: pick terms that appear in the
    centroid with high scores
  • The maximal marginal relevance principle (MMR)
  • Related to summarization, snippet generation
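
A minimal sketch of TFIDF-based index-term selection, assuming raw term frequency, idf = log(N/df), and whitespace tokenization. These are one common choice among several weighting variants; the toy documents and the function name are illustrative.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, doc_index, k=3):
    """Return the k terms of docs[doc_index] with the highest tf * idf."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    tf = Counter(tokenized[doc_index])
    scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
print(top_tfidf_terms(docs, 0))
```

Terms that occur in every document, such as "the", get an idf of zero and drop to the bottom of the ranking.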

27
Compression
  • Methods
  • Fixed length codes
  • Huffman coding
  • Ziv-Lempel codes

28
Fixed length codes
  • Binary representations
  • ASCII
  • Representational power (2^k symbols where k is the
    number of bits)
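
Read in the other direction, the 2^k relation gives the number of bits a fixed-length code needs for a given alphabet. A small sketch, with hypothetical function names:

```python
def bits_needed(num_symbols):
    """Smallest k such that 2**k >= num_symbols (at least 1 bit)."""
    return max(1, (num_symbols - 1).bit_length())


def fixed_length_encode(text, alphabet):
    """Encode each character as its alphabet index in k binary digits."""
    k = bits_needed(len(alphabet))
    return "".join(format(alphabet.index(ch), f"0{k}b") for ch in text)


print(bits_needed(26))    # letters only
print(bits_needed(128))   # 7-bit ASCII
print(fixed_length_encode("cab", "abcd"))
```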

29
Variable length codes
  • Alphabet
  • A .-    N -.    0 -----
  • B -...  O ---   1 .----
  • C -.-.  P .--.  2 ..---
  • D -..   Q --.-  3 ...--
  • E .     R .-.   4 ....-
  • F ..-.  S ...   5 .....
  • G --.   T -     6 -....
  • H ....  U ..-   7 --...
  • I ..    V ...-  8 ---..
  • J .---  W .--   9 ----.
  • K -.-   X -..-
  • L .-..  Y -.--
  • M --    Z --..
  • Demo
  • http://www.scphillips.com/morse/

30
Most frequent letters in English
  • Most frequent letters
  • E T A O I N S H R D L U
  • Demo
  • http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
  • Also: bigrams
  • TH HE IN ER AN RE ND AT ON NT

31
Huffman coding
  • Developed by David Huffman (1952)
  • Average of 5 bits per character (37.5%
    compression)
  • Based on frequency distributions of symbols
  • Algorithm: iteratively build a tree of symbols
    starting with the two least frequent symbols
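
The tree-building step can be sketched with a priority queue: repeatedly merge the two least frequent subtrees, prefixing 0 to the codes on one side and 1 to the other. The tie-breaking rule and the sample string are assumptions made for this illustration, so the exact codewords may differ from the lecture's example while the total encoded length stays optimal.

```python
import heapq
from collections import Counter


def huffman_code(freqs):
    """Build a prefix code {symbol: bitstring} from symbol frequencies."""
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {s: "0" for s in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # least frequent subtree
        w2, _, right = heapq.heappop(heap)  # second least frequent subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "abracadabra"
code = huffman_code(Counter(text))
encoded = "".join(code[ch] for ch in text)
print(code)
print(len(encoded), "bits instead of", 8 * len(text))
```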

33
[Figure: Huffman code tree over the symbols a-j, with branches labeled 0 and 1]
35
Exercise
  • Consider the bit string
    01101101111000100110001110100111000110101101011101
  • Use the Huffman code from the example to decode
    it.
  • Try inserting, deleting, and switching some bits
    at random locations and try decoding.
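
To experiment with the second part of the exercise, a generic prefix-code decoder is enough. The code table below is hypothetical (it is not the code from the lecture's example tree); it only illustrates how a single flipped bit can change both the symbols and the number of symbols decoded.

```python
def decode_prefix(bits, code):
    """Decode a bit string with a prefix code given as {symbol: bitstring}."""
    inverse = {word: symbol for symbol, word in code.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:  # a complete codeword has been read
            out.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


# Hypothetical prefix code, for illustration only.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

print(decode_prefix("010110111", CODE))  # original bit string
print(decode_prefix("110110111", CODE))  # same string with the first bit flipped
```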

36
Extensions
  • Word-based
  • Domain/genre dependent models

37
Ziv-Lempel coding
  • Two types - one is known as LZ77 (used in GZIP)
  • Code: set of triples <a,b,c>
  • a: how far back in the decoded text to look for
    the upcoming text segment
  • b: how many characters to copy
  • c: new character to add to complete segment

38
  • <0,0,p> p
  • <0,0,e> pe
  • <0,0,t> pet
  • <2,1,r> peter
  • <0,0,_> peter_
  • <6,1,i> peter_pi
  • <8,2,r> peter_piper
  • <6,3,c> peter_piper_pic
  • <0,0,k> peter_piper_pick
  • <7,1,d> peter_piper_picked
  • <7,1,a> peter_piper_picked_a
  • <9,2,e> peter_piper_picked_a_pe
  • <9,2,_> peter_piper_picked_a_peck_
  • <0,0,o> peter_piper_picked_a_peck_o
  • <0,0,f> peter_piper_picked_a_peck_of
  • <17,5,l> peter_piper_picked_a_peck_of_pickl
  • <12,1,d> peter_piper_picked_a_peck_of_pickled
  • <16,3,p> peter_piper_picked_a_peck_of_pickled_pep
  • <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
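
Decoding the triples is mechanical: step back a characters, copy b characters, then append c. The sketch below replays the trace above; the function name is an assumption, and real LZ77 implementations such as the one inside gzip use a bounded window and a different bit-level encoding.

```python
def lz77_decode(triples):
    """Decode (back, length, char) triples into text."""
    out = []
    for back, length, ch in triples:
        start = len(out) - back
        for i in range(length):
            # Copy one character at a time so overlapping copies also work.
            out.append(out[start + i])
        out.append(ch)
    return "".join(out)


TRIPLES = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"),
]

print(lz77_decode(TRIPLES))
```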

39
Links on text compression
  • Data compression
  • http://www.data-compression.info/
  • Calgary corpus
  • http://links.uwaterloo.ca/calgary.corpus.html
  • Huffman coding
  • http://www.compressconsult.com/huffman/
  • http://en.wikipedia.org/wiki/Huffman_coding
  • LZ
  • http://en.wikipedia.org/wiki/LZ77

40
100 alternative search engines
  • http://rss.slashdot.org/r/Slashdot/slashdot/3/83468703/article.pl

41
Readings
  • For September 20: MRS9
  • For September 27: MRS13, MRS14
  • For October 4: MRS15, MRS16