???,yhf@net.pku.edu.cn - PowerPoint PPT Presentation

About This Presentation

Title:

???,yhf@net.pku.edu.cn

Description:

yhf@net.pku.edu.cn 2004-12-24 @CERNET2004 ... – PowerPoint PPT presentation

Slides: 63

Provided by: freeEolCn

Transcript and Presenter's Notes

Title: ???,yhf@net.pku.edu.cn

1
??????

???,yhf@net.pku.edu.cn
?????????????
December 24, 2004 @CERNET2004

2
????

????????
???????????

3
???? Web Search Engines

??????????,???????????????,???????
???????
????
????
????
???????
???????

4
(No Transcript)
5
(No Transcript)
6
Two service extremes
Browsing Services
Search Engine Services
???
Web Pages
Bag of Words
???
Two semantics extremes
7
???????????
??
??
??

??
????,?????????,????
???
?????????????????
??
???????????????

8
????????
9
??????????
10
???Web??????
11
??????

version: 1.0                           // version number
url: http://www.pku.edu.cn/            // URL
origin: http://www.somewhere.cn/       // original URL
date: Tue, 15 Apr 2003 08:13:06 GMT    // time of harvest
ip: 162.105.129.12                     // IP address
unzip-length: 30233                    // if included, the data must be compressed
length: 18133                          // data length
                                       // a blank line
XXXXXXXX                               // the following is the data part
XXXXXXXX
...
XXXXXXXX                               // data end
                                       // insert a new line
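
A minimal sketch of how one record in this layout could be read back, assuming each header line is a `key: value` pair, the header ends at the first blank line, and `length` counts the bytes of the data part. The function name `parse_record` and the shortened sample record are hypothetical; handling of `unzip-length` (compressed data) is left out because the slide does not name the compression scheme.

```python
from typing import Dict, Tuple


def parse_record(raw: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split one stored page record into (header fields, data part)."""
    # Assumption: the header block ends at the first blank line.
    head, _, rest = raw.partition(b"\n\n")
    headers: Dict[str, str] = {}
    for line in head.decode("ascii").splitlines():
        # Split on the first colon only, so URLs and times stay intact.
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()
    # Assumption: "length" is the number of bytes in the data part.
    data = rest[: int(headers["length"])]
    return headers, data


# Hypothetical sample record, shortened to 8 bytes of data.
sample = (
    b"version: 1.0\n"
    b"url: http://www.pku.edu.cn/\n"
    b"date: Tue, 15 Apr 2003 08:13:06 GMT\n"
    b"ip: 162.105.129.12\n"
    b"length: 8\n"
    b"\n"
    b"XXXXXXXX\n"
)
headers, data = parse_record(sample)
```

Reading `length` bytes rather than scanning for a terminator lets the data part contain blank lines of its own.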

12
File Organizations (Indexes)

Choices for accessing data during query evaluation
Scan the entire collection
Typical in early (batch) retrieval systems
Computational and I/O costs are O(characters in collection)
Practical for only small text collections
Large memory systems make scanning feasible
Use indexes for direct access
Evaluation time is O(query term occurrences in collection)
Practical for large collections
Many opportunities for optimization
Hybrids: use a small index, then scan a subset of the collection
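
The hybrid choice can be sketched as follows: a small index records only which block of documents a word occurs in, and the query then scans just those blocks. This is an illustration under assumptions, not the slide's implementation; the function names, the block size, and the sample documents are all made up.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_block_index(docs: List[str], block_size: int) -> Dict[str, Set[int]]:
    """Map each word to the ids of the blocks (groups of documents) containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id // block_size)
    return index


def hybrid_search(docs: List[str], index: Dict[str, Set[int]],
                  block_size: int, word: str) -> List[int]:
    """Use the small index to pick candidate blocks, then scan only those blocks."""
    hits: List[int] = []
    for block in sorted(index.get(word, ())):
        start = block * block_size
        for doc_id in range(start, min(start + block_size, len(docs))):
            if word in docs[doc_id].lower().split():
                hits.append(doc_id)
    return hits


# Hypothetical five-document collection, grouped into blocks of two.
docs = ["web search engines", "inverted files", "signature files",
        "bitmaps", "web crawling"]
index = build_block_index(docs, block_size=2)
files_hits = hybrid_search(docs, index, 2, "files")
web_hits = hybrid_search(docs, index, 2, "web")
```

A larger block size shrinks the index but lengthens each scan, which is the trade-off the hybrid approach exposes.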

13
Indexes

What should the index contain?
Database systems index primary and secondary keys
This is the hybrid approach
Index provides fast access to a subset of database records
Scan subset to find solution set
IR problem
Cannot predict keys that people will use in queries
Every word in a document is a potential search term
IR solution: index by all keys (words), that is, full-text indexes

14
Index Contents

The contents depend upon the retrieval model
Feature presence/absence
Boolean
Statistical (tf, df, ctf, doclen, maxtf)
Often about 10% the size of the raw data, compressed
Positional
Feature location within document
Granularities include word, sentence, paragraph, etc.
Coarse granularities are less precise, but take less space
Word-level granularity is about 20-30% the size of the raw data, compressed
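
The statistical features named above can be gathered in one pass over tokenized documents. The sketch below assumes the usual readings of the abbreviations (tf: term frequency within a document, df: document frequency, ctf: collection term frequency, doclen: document length, maxtf: largest tf in a document); the function name and the two sample documents are hypothetical.

```python
from collections import Counter
from typing import List


def collect_statistics(docs: List[List[str]]):
    """Return (tf, df, ctf, doclen, maxtf) for a list of tokenized documents."""
    tf = [Counter(doc) for doc in docs]   # tf[d][t]: occurrences of t in document d
    df: Counter = Counter()               # df[t]: number of documents containing t
    ctf: Counter = Counter()              # ctf[t]: occurrences of t in the collection
    for counts in tf:
        df.update(counts.keys())          # each document counts once per term
        ctf.update(counts)                # add the per-document counts
    doclen = [len(doc) for doc in docs]   # number of tokens per document
    maxtf = [max(c.values(), default=0) for c in tf]
    return tf, df, ctf, doclen, maxtf


# Hypothetical two-document collection.
docs = [["web", "search", "web"], ["search", "engine"]]
tf, df, ctf, doclen, maxtf = collect_statistics(docs)
```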

15
Indexes Implementation

Common implementations of indexes
Bitmaps
Signature files
Inverted files
Common index components
Dictionary (lexicon)
Postings
document ids
word positions
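
As an illustration of the two components, the sketch below builds a word-level inverted file in memory: the dictionary (lexicon) is a mapping from each word to its postings, and each posting holds a document id with the word positions in that document. The layout as nested dicts, the function name, and the sample documents are assumptions chosen for brevity.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_file(docs: Dict[int, str]) -> Dict[str, Dict[int, List[int]]]:
    """Return lexicon[word][doc_id] = list of 1-based word positions."""
    lexicon: Dict[str, Dict[int, List[int]]] = defaultdict(dict)
    for doc_id in sorted(docs):
        words = docs[doc_id].lower().split()
        for position, word in enumerate(words, start=1):
            lexicon[word].setdefault(doc_id, []).append(position)
    return dict(lexicon)


# Hypothetical documents keyed by document id.
docs = {
    1: "inverted files map words to documents",
    2: "signature files hash words",
}
inverted = build_inverted_file(docs)
```

Dropping the position lists and keeping only the document ids gives the coarser, smaller document-level index.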

No positional data indexed
16
Inverted Files
17
Inverted Files
18
Word-Level Inverted File
19
Inverted Search Algorithm

Find query elements (terms) in the lexicon
Retrieve postings for each lexicon entry
Manipulate postings according to the retrieval model
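
For the Boolean model, the three steps can be sketched as a conjunctive (AND) query over sorted document-id lists. The lexicon below is a hypothetical toy example, and the function names are made up; other retrieval models would replace the intersection step with their own manipulation of the postings.

```python
from typing import Dict, List


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Merge-intersect two sorted lists of document ids."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_and(lexicon: Dict[str, List[int]], terms: List[str]) -> List[int]:
    postings: List[List[int]] = []
    for term in terms:
        if term not in lexicon:          # step 1: find the term in the lexicon
            return []                    # a missing term makes the AND empty
        postings.append(lexicon[term])   # step 2: retrieve its postings
    if not postings:
        return []
    postings.sort(key=len)               # start from the shortest list
    result = postings[0]
    for plist in postings[1:]:           # step 3: combine per the Boolean model
        result = intersect(result, plist)
    return result


# Hypothetical lexicon: term -> sorted document ids.
lexicon = {"porridge": [1, 2], "pot": [2, 5], "hot": [1, 4]}
both = boolean_and(lexicon, ["porridge", "pot"])
```

Intersecting the shortest postings list first keeps the intermediate result small.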

20
Word-Level Inverted File
lexicon
posting
Query
1. porridge pot (BOOL)
2. porridge pot (BOOL)
3. porridge pot (VSM)
Answer
21
????

????????
???????????

22
A Brief history of Modern Information Retrieval

In 1945, Vannevar Bush published "As We May Think" in The Atlantic Monthly.
In the 1960s, Gerard Salton and his students developed the SMART system.
The Cranfield evaluations were carried out by Cyril Cleverdon.
The 1970s and 1980s saw many developments built on the advances of the 1960s.
In 1992, the Text REtrieval Conference (TREC) was established.
From 1996, the algorithms developed in IR were employed for searching the Web.

23
Clustering of SIGIR papers by topic vs. year
24
Question answering
25
Clustering
26
Inverted files Implementations
27
Message understanding TDT
28
Filtering
29
Hypertext IR, Multiple evidence
30
Probabilistic Language models
31
Distributed IR
32
Evaluation
33
Topic distillation Linkage retrieval
34
Text categorisation
35
Document summarisation
36
Cross lingual
37
???????????

CIIR, University of Massachusetts
LTI, Carnegie Mellon University
The Stanford University DB Group
Microsoft Research Asia
TREC
????, ?????, ???

38
Lemur??

http://www-2.cs.cmu.edu/lemur/

39
Lemur Toolkit

?????LM?IR???research system
ad hoc, distributed retrieval, cross-language IR, summarization, filtering, and classification
??
?????????????
??Simple Language Model
????Language Model????????????
??
C and C++
Unix / Windows
Current version: 3.1

40
MRA Towards Next Generation Web Search

From Pages to Blocks
Analyze the Web at finer granularity
From Surface Web to Deep Web
Unleash the huge assets of high-value information
From Unstructured to Structured
Provide well organized results
From relevance to intelligence
Contribute knowledge discovery with search
From Desktop Search to Mobile Search
Bridge physical world search to digital world search

41
The Stanford Univ. DB Group

WebBase
Crawling, storage, indexing, and querying of large collections of Web pages.
Digital Libraries
Infrastructure and services for creating, disseminating, sharing and managing information

42
TREC Conference

Established in 1992 to evaluate large-scale IR
Retrieving documents from a gigabyte collection
Has run continuously since then
TREC 2004 (13th) meeting is in November
Run by NIST's Information Access Division
Probably the best-known IR evaluation setting
Started with 25 participating organizations in the 1992 evaluation
In 2003, there were 93 groups from 22 different countries
Proceedings available on-line (http://trec.nist.gov)
Overview of TREC 2003 at http://trec.nist.gov/pubs/trec12/papers/OVERVIEW.12.pdf

43
TREC General Format

TREC consists of IR research tracks
Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP, ...
Each track works on roughly the same model
November: track approved by the TREC community
Winter: track members finalize the format for the track
Spring: researchers train their systems based on the specification
Summer: researchers carry out the formal evaluation
Usually a blind evaluation: researchers do not know the answers
Fall: NIST carries out the evaluation
November: group meeting (TREC) to find out
How well your site did
How others tackled the problem
Many tracks are run by volunteers outside of NIST (e.g. Web)
"Coopetition" model of evaluation
Successful approaches are generally adopted in the next cycle

44
TREC Tracks
45
Summary of VLC/Web Track evaluation 1996 - 2003
46
Tianwang Group @PKU
47
http://www.infomall.cn/
48
(No Transcript)
49
(No Transcript)
50
CWT100g?????
?????,??????!
51
(No Transcript)
52
??2004-12-20??????????
2.5/8.8 28.4
53
????????
TEAM NAME TD-RUNS NPHP-RUNS
??????APEX??? APEX 5 5
?????????????? ANS 3 2
TRS?? TRS 5 2
?????????? MUMIAN1 3 1
?????????? MUMIAN2 2 1
??????????????????? SCUTDB 5 5
?????? WLL 1

?pooling???google,yisou,baidu,sogou,zhongsou??SE?
?????
54
????
????
????
??TIANWANG_RUN????
55
??

????????
???????????

56
??!
57
Vector Space Model

??d???q???????????m???,???????TFIDF,?????????????
,? (?????tf,idf??)
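
A minimal sketch of vector space ranking with TF-IDF weights and cosine similarity. The weighting used here (raw term frequency times log(N/df)) is an assumption and may differ from the formula on the slide; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vector(tokens: List[str], df: Counter, n_docs: int) -> Dict[str, float]:
    """Weight each term by tf * log(N / df); terms unseen in the collection are skipped."""
    counts = Counter(tokens)
    return {t: tf * math.log(n_docs / df[t]) for t, tf in counts.items() if df.get(t)}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank(docs: Dict[str, List[str]], query: List[str]) -> List[str]:
    """Return ids of documents with non-zero similarity, best match first."""
    n_docs = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    q_vec = tfidf_vector(query, df, n_docs)
    scores = {d: cosine(tfidf_vector(tokens, df, n_docs), q_vec)
              for d, tokens in docs.items()}
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])


# Hypothetical three-document collection.
docs = {
    "d1": ["web", "search", "engine"],
    "d2": ["web", "crawler"],
    "d3": ["inverted", "file", "index"],
}
ranking = rank(docs, ["web", "search"])
```

Unlike a Boolean AND, a document matching only one query term (here `d2`) still appears in the ranking, below documents that match more of the query.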

58
Query Answer

1. porridge pot (BOOL)
d2
2. porridge pot (BOOL)
null
3. porridge pot (VSM)
d2 > d1 > d5
Next page?

59
CIIR: Center for Intelligent Information Retrieval @UMass

One of the leading research groups in IR
Improved the probabilistic models
First description of a retrieval system based on statistical language models
Introduced and improved a number of techniques for text and query representation
Automatically representing databases and combining local searches for distributed IR (DIR)
First high-capacity probabilistic filtering architecture
Defined and evaluated the first versions of event detection and tracking software
Earliest research on ranking and representation techniques for Asian languages
First approaches to information extraction that emphasized learning
Novel techniques for indexing images and video

60
CIIR cont.

Research
More than 500 journal and refereed conference papers over the past 12 years (52 submissions in 2003)
Industrial and government collaboration
INQUERY
Software licensed to nearly 300 sites
Education
20 Ph.D.s, 29 M.S.
123/145, 34/4 graduate/undergraduate

61
CIIR cont.

Personnel
Faculty: 4 (W. Bruce Croft)
Technical personnel: 10
Graduate students: 34/10
Groups
IESL: Information Extraction and Synthesis Laboratory
IR: Information Retrieval Laboratory
MIR: Multimedia Indexing and Retrieval Laboratory
The CIIR is currently concentrating on the unsolved long-term research problems that underlie effective information retrieval:
text representation,
query acquisition,
retrieval models

62
LTI: Language Technologies Institute @CMU

Machine Translation, Natural Language Processing, Speech, and Information Retrieval
IR Projects (Jamie Callan and Yiming Yang)
Adaptive Information Filtering
Distributed Information Retrieval / Federated Search
Email Classification and Prioritization
Minerva: Web Mining for Question Answering
MuchMore: Translingual Information Retrieval
JAVELIN: Open-Domain Question Answering
