Intelligent Information Retrieval (and Web Search)

Slides: 41

Provided by: Raymond194

1
Intelligent Information Retrieval (and Web Search)

Professor Celso A A Kaestner, PhD.
Brazil

2
Site: www.dainf.ct.utfpr.edu.br/kaestner/Konstanz/iir.htm
3
Introduction
4
Introduction: Information Retrieval

IR: representation, storage, organization of, and
access to information items.
Focus is on the user information need.
User information need:
Find all docs containing information on college
football teams which (1) are maintained by a
university and (2) participate in the national
tournament.
Emphasis is on the retrieval of information (not
data).

5
Data Retrieval vs. Information Retrieval

Data Retrieval:
Which docs contain a set of keywords?
Well-defined semantics.
A single erroneous object implies failure!
Information Retrieval (IR):
Information about a subject or topic.
Semantics is frequently loose.
Small errors are tolerated.
IR system:
Interprets the contents of information items.
Generates a ranking which reflects relevance.
The notion of relevance is most important.

6
Information Retrieval (IR)

The indexing and retrieval of textual documents.
Searching for pages on the World Wide Web is the
most recent killer app.
Concerned firstly with retrieving documents
relevant to a query.
Concerned secondly with retrieving from large
sets of documents efficiently.

7
Typical IR Task

Given:
A corpus of textual natural-language documents.
A user query in the form of a textual string.
Find:
A ranked set of documents that are relevant to
the query.

8
IR System
[Figure: IR system diagram]
9
Relevance

Relevance is a subjective judgment and may
include:
Being on the proper subject.
Being timely (recent information).
Being authoritative (from a trusted source).
Satisfying the goals of the user and his/her
intended use of the information (information
need).

10
Keyword Search

Simplest notion of relevance is that the query
string appears verbatim in the document.
Slightly less strict notion is that the words in
the query appear frequently in the document, in
any order (bag of words).
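
The two notions of relevance above can be contrasted in a minimal Python sketch. The function names, the tokenizer, and the sample document are illustrative assumptions, not part of the original slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim (case-insensitive)."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: total count of query words in the document, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))

# Made-up sample document for illustration.
doc = "The football team won. The college praised its football players."
print(verbatim_match("college football", doc))      # False
print(bag_of_words_score("college football", doc))  # 3
```

The document never contains the exact phrase "college football", so the strict test fails, while the bag-of-words score still counts one "college" and two "football" occurrences.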

11
Problems with Keywords

May not retrieve relevant documents that include
synonymous terms.
restaurant vs. café
PRC vs. China
May retrieve irrelevant documents that include
ambiguous terms.
bat (baseball vs. mammal)
Apple (company vs. fruit)
bit (unit of data vs. act of eating)

12
Beyond Keywords

We will cover the basics of keyword-based IR, but
we will focus on extensions and recent
developments that go beyond keywords.
We will cover the basics of building an efficient
IR system, but
we will focus on basic capabilities and
algorithms rather than systems issues that allow
scaling to industrial-size databases.

13
Intelligent IR

Taking into account the meaning of the words
used.
Taking into account the order of words in the
query.
Adapting to the user based on direct or indirect
feedback.
Taking into account the authority of the source.

14
IR System Architecture
[Figure: IR system architecture diagram. Surviving
labels: User Interface, Text, User Need, Text
Operations, Logical View, Database Manager,
Indexing, Query Operations, User Feedback,
Inverted file, Searching, Index, Query, Text
Database, Ranked Docs, Retrieved Docs, Ranking.]
15
IR System Components

Text Operations forms index words (tokens).
Standardization (e.g., of capitalization)
Stopword removal
Stemming
Indexing constructs an inverted index of
word-to-document pointers.
Searching retrieves documents that contain a
given query token from the inverted index.
Ranking scores all retrieved documents according
to a relevance metric.
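
The four components above can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions: the stopword list, the one-rule stemmer, the sample documents, and the term-frequency ranking are all made up for the example, and a real system would use a proper stemmer and a better relevance metric.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}  # tiny illustrative list

def crude_stem(token):
    """Toy stemmer (assumption): strip one trailing 's' from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def text_operations(text):
    """Standardize case, drop stopwords, and stem to form index tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def build_inverted_index(docs):
    """Indexing: map each token to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text_operations(text):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

def search_and_rank(query, index):
    """Searching: find docs containing any query token.
    Ranking: order them by summed term frequency (ties broken by doc id)."""
    scores = defaultdict(int)
    for token in text_operations(query):
        for doc_id, tf in index.get(token, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Made-up corpus for illustration.
docs = {
    "d1": "College football teams in the national tournament",
    "d2": "The football team of the university",
    "d3": "Cooking recipes and restaurants",
}
index = build_inverted_index(docs)
print(search_and_rank("football teams", index))  # [('d1', 2), ('d2', 2)]
```

Because the query goes through the same text operations as the documents, "teams" in the query matches "team" in d2.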

16
IR System Components (continued)

User Interface manages interaction with the user:
Query input and document output.
Relevance feedback.
Visualization of results.
Query Operations transform the query to improve
retrieval:
Query expansion using a thesaurus.
Query transformation using relevance feedback.
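
Query expansion using a thesaurus can be sketched as follows. The thesaurus here is a hypothetical hand-built dictionary seeded with the synonym pairs used elsewhere in this presentation; a real system would load a curated resource.

```python
# Hypothetical hand-built thesaurus (assumption), keyed by lowercase term.
THESAURUS = {
    "restaurant": ["café"],
    "china": ["prc"],
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms plus any listed synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("Restaurant China"))  # ['restaurant', 'café', 'china', 'prc']
```

The expanded term list would then be passed to the search component in place of the original query, so documents that mention only a synonym can still be retrieved.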

17
IR and the Web

IR at the center of the stage
Advent of the Web changed this perception once
and for all
universal repository of knowledge
free (low cost) universal access
no central editorial board
many problems, though: IR is seen as key to
finding the solutions!

18
IR and the Web

And more
Most human tasks involve processing
information in textual and/or graphic form
(Lyman, 2003).
How Much Information project (Berkeley):
www.sims.berkeley.edu/how-much-info-2003.
Each person generates 800 Mbytes / year.

19
IR and the Web

In 2002: 5 Exabytes of new information
Magnetic media (HDs): 92%
Films: 7%
Print material: 0.01%
Optical media: 0.002%.
5 Exabytes = 5 million Terabytes
= 5,000,000,000,000,000,000 bytes
Twice the amount of 1999, given a growth
rate of about 30% / year.
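
The unit conversion and the growth claim above can be checked with a short calculation. This sketch assumes decimal (SI) units and takes the growth rate as 30% per year, compounded over the three years from 1999 to 2002.

```python
EXABYTE = 10**18   # decimal (SI) definition, assumed here
TERABYTE = 10**12

new_info_bytes = 5 * EXABYTE
print(new_info_bytes // TERABYTE)   # 5000000 (terabytes)

# Compounding 30% per year over the three years from 1999 to 2002.
growth_factor = 1.30 ** 3
print(round(growth_factor, 2))      # 2.2
```

A factor of about 2.2 is consistent with the statement that the 2002 figure is roughly twice that of 1999.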

20
IR and the Web

Information flow: radio, TV, Internet
18 Exabytes of new information in 2002,
3.5 times the amount stored.
Telephone lines (and cell phones): 98%.
320 million hours of radio and TV transmissions,
with 70 million new hours, with 81 Gigabytes of
texts.

21
IR and the Web

Email
31 billion e-mails / year: 400,000 Tbytes of
new information.
The Internet (Web)
170 Tbytes of information: 17 times the printed
content of the US Library of Congress.

22
IR and the Web

Search sites
Yahoo, Google, etc.: the first point of
access for users.
A typical Internet user: 11 h 20 min / month.
Access to the desired information: 1/3 of that
time.
The user must verify whether the retrieved
information is what was wanted, and it is often
impossible to find the information needed.

23
IR and the Web

Information Glut or Information Overload is the
main challenge to be overcome by automatic text
processing systems.

24
Web Search

Application of IR to HTML documents on the World
Wide Web.
Differences
Must assemble document corpus by spidering the
web.
Can exploit the structural layout information in
HTML (XML).
Documents change uncontrollably.
Can exploit the link structure of the web.
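
Two of the differences above, spidering and link structure, can be illustrated on a toy link graph. Everything below is an assumption made for the example: the in-memory "web", the page names, and the use of in-link counts as a crude link-structure signal. A real spider would fetch pages over HTTP and parse links out of the HTML.

```python
from collections import deque

# Hypothetical in-memory "web": page name -> list of outgoing links.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],   # never reached when crawling from a.html
}

def spider(seed, web):
    """Breadth-first crawl from a seed page; returns pages in visit order."""
    visited, queue = [seed], deque([seed])
    while queue:
        page = queue.popleft()
        for link in web.get(page, []):
            if link not in visited:
                visited.append(link)
                queue.append(link)
    return visited

def in_link_counts(pages, web):
    """Count links between crawled pages that point at each crawled page."""
    counts = {page: 0 for page in pages}
    for source in pages:
        for target in web.get(source, []):
            if target in counts:
                counts[target] += 1
    return counts

corpus = spider("a.html", TOY_WEB)
print(corpus)                            # ['a.html', 'b.html', 'c.html']
print(in_link_counts(corpus, TOY_WEB))   # {'a.html': 1, 'b.html': 1, 'c.html': 2}
```

The corpus contains only what the spider can reach from its seed, so d.html is missing, and the link from d.html to c.html is not counted.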

25
Web Search System
[Figure: web search system diagram; surviving label: IR System]
26
Other IR-Related Tasks

Automated document categorization
Information filtering (spam filtering)
Information routing
Automated document clustering
Recommending information or products
Information extraction
Information integration
Question answering

27
History of IR

1960-70s
Initial exploration of text retrieval systems
for small corpora of scientific abstracts, and
law and business documents.
Development of the basic Boolean and vector-space
models of retrieval.
Prof. Salton and his students at Cornell
University are the leading researchers in the
area.

28
IR History Continued

1980s
Large document database systems, many run by
companies
Lexis-Nexis
Dialog
MEDLINE

29
IR History Continued

1990s
Searching FTPable documents on the Internet
Archie
WAIS
Searching the World Wide Web
Lycos
Yahoo
Altavista

30
IR History Continued

1990s continued
Organized Competitions
NIST TREC
Recommender Systems
Ringo
Amazon
NetPerceptions
Automated Text Categorization and Clustering

31
Recent IR History

2000s
Link analysis for Web Search
Google
Automated Information Extraction
Whizbang
Fetch
Burning Glass
Question Answering
TREC Q/A track

32
Recent IR History

2000s continued
Multimedia IR
Image
Video
Audio and music
Cross-Language IR
DARPA Tides
Document Summarization

33
Related Areas

Database Management
Library and Information Science
Artificial Intelligence
Natural Language Processing
Machine Learning

34
Database Management

Focused on structured data stored in relational
tables rather than free-form text.
Focused on efficient processing of well-defined
queries in a formal language (SQL).
Clearer semantics for both data and queries.
Recent move towards semi-structured data (XML)
brings it closer to IR.

35
Library and Information Science

Focused on the human user aspects of information
retrieval (human-computer interaction, user
interface, visualization).
Concerned with effective categorization of human
knowledge.
Concerned with citation analysis and
bibliometrics (structure of information).
Recent work on digital libraries brings it closer
to CS IR.

36
Artificial Intelligence

Focused on the representation of knowledge,
reasoning, and intelligent action.
Formalisms for representing knowledge and
queries
First-order Predicate Logic
Bayesian Networks
Others
Recent work on web ontologies and intelligent
information agents brings it closer to IR.

37
Natural Language Processing

Focused on the syntactic, semantic, and pragmatic
analysis of natural language text and discourse.
Ability to analyze syntax (phrase structure) and
semantics could allow retrieval based on meaning
rather than keywords.

38
Natural Language Processing: IR Directions

Methods for determining the sense of an ambiguous
word based on context (word sense
disambiguation).
Methods for identifying specific pieces of
information in a document (information
extraction).
Methods for answering specific NL questions from
document corpora.
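
The first direction, word sense disambiguation, can be illustrated with a simple context-overlap heuristic applied to the ambiguous word "bat". This is a minimal sketch, not a method named in the slides: the sense inventory and its signature words are made up for the example.

```python
# Hypothetical mini sense inventory for "bat"; signature words are illustrative.
SENSES = {
    "bat (baseball)": {"baseball", "hit", "ball", "player", "swing", "wooden"},
    "bat (mammal)": {"mammal", "fly", "wings", "cave", "night", "nocturnal"},
}

def disambiguate(context, senses=SENSES):
    """Pick the sense whose signature words overlap most with the context.
    On a tie, the sense listed first wins."""
    context_words = set(context.lower().split())
    return max(senses, key=lambda sense: len(senses[sense] & context_words))

print(disambiguate("the player picked up the wooden bat to hit the ball"))
# bat (baseball)
print(disambiguate("a bat left the cave at night"))
# bat (mammal)
```

An IR system could use the chosen sense to filter out documents that use the query word in the other sense.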

39
Machine Learning

Focused on the development of computational
systems that improve their performance with
experience.
Automated classification of examples based on
learning concepts from labeled training examples
(supervised learning).
Automated methods for clustering unlabeled
examples into meaningful groups (unsupervised
learning).
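
Supervised learning from labeled examples can be sketched with a tiny text classifier applied to spam filtering. Naive Bayes with add-one smoothing is chosen here purely for illustration; the slides do not name a specific algorithm, and the training messages are made up.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) /
                                  (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up labeled training examples.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = train(training)
print(classify("money offer now", model))         # spam
print(classify("monday project meeting", model))  # ham
```

The classifier improves with experience in the sense described above: adding more labeled messages to the training list changes the word counts and therefore the decisions.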

40
Machine Learning: IR Directions