Chapter 1 Introduction

Slides: 59

Provided by: Hsin7

Transcript and Presenter's Notes

1
Chapter 1 Introduction
2
Motivation

Information retrieval
To retrieve information which might be useful or
relevant to the user
Issues: representation, storage, organization, and
access
An information need (in reality) has to be
expressed as a query
Example information need: find all the pages
containing information on college tennis teams which
(1) are maintained by a university in the USA
and
(2) participate in the NCAA tennis tournament.
To be relevant, the page must include information
on the national ranking of the team in the last
three years and the email or phone number of the
team coach.

3
4
[Figure: user, Web browser, Internet, query server (request/response), search engine, Web robot]
5
Information versus Data Retrieval

Data retrieval
Determine which documents of a collection contain
the keywords in the user query
Retrieve all objects which satisfy clearly
defined conditions, such as those in a regular
expression or a relational algebra expression
Data has a well-defined structure and semantics
A solution for the user of a database system
Information retrieval

6
Database Management

A specified set of attributes is used to
characterize each item, e.g. EMPLOYEE(NAME, SSN,
BDATE, ADDR, SEX, SALARY, DNO)
Exact match between the attributes used in query
formulations and those attached to the record.
SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME =
'John Smith'
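
The exact-match behaviour described above can be sketched with Python's built-in sqlite3 module. The table follows the EMPLOYEE schema on this slide; the rows are invented purely for illustration and do not come from the slides.

```python
import sqlite3

def find_employee(conn, name):
    """Return (BDATE, ADDR) rows whose NAME matches the given string exactly."""
    cur = conn.execute(
        "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", (name,)
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (NAME TEXT, SSN TEXT, BDATE TEXT, "
    "ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
)
# Hypothetical rows, made up for this sketch only.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "000-00-0001", "1970-01-01", "1 Example St", "M", 30000, 5),
        ("Jon Smith", "000-00-0002", "1980-02-02", "2 Example St", "M", 25000, 4),
    ],
)

exact = find_employee(conn, "John Smith")      # one matching record
near_miss = find_employee(conn, "John Smyth")  # no partial credit for a near match
```

A misspelled name returns nothing at all, which is the contrast with information retrieval, where partial matches are ranked rather than rejected.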

7
Basic Concepts for IR

Content identifiers (keywords, index terms,
descriptors) characterize the stored texts.
Retrieval depends on the degree of coincidence
between the sets of identifiers attached to
queries and documents

Logical view of the documents
User task
query formulation
content analysis
8
The User Task

Convey the semantics of information need
Retrieval and browsing

[Figure: labels Retrieval, Browsing, Database]
9
Logical View of Documents

Full text representation
A set of index terms
Elimination of stop-words
The use of stemming
The identification of noun groups
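
A minimal sketch of how stop-word elimination and stemming reduce a full text to a set of index terms. The stop-word list and the suffix-stripping rule are simplified assumptions for illustration; they are not the methods used in the course, and a real system would use a proper stemmer.

```python
import re

# Tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "by"}

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    if word.endswith(("ss", "is", "us")):
        return word  # avoid mangling words such as "tennis"
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, tokenize, drop stop-words, stem, and deduplicate."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({naive_stem(t) for t in tokens if t not in STOP_WORDS})

terms = index_terms("The teams are participating in the NCAA tennis tournaments")
# terms == ['ncaa', 'participat', 'team', 'tennis', 'tournament']
```

The stem "participat" shows why stems are index keys rather than readable words: the goal is only that variants of a word map to the same identifier.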

10
From full text to a set of index terms
[Figure: document (text + structure) -> structure recognition -> accents, spacing, etc. -> stopwords -> noun groups -> stemming -> automatic or manual indexing -> index terms. Logical views range from the full text plus structure (e.g. e-mail) down to a set of index terms.]
11
Indexing

Indexing: assign identifiers to text items.
Manual vs. automatic indexing (how the
identifiers are assigned)
Objective vs. nonobjective text identifiers
Cataloging rules define, e.g., author names,
publisher names, dates of publication
Controlled vs. uncontrolled vocabularies
(instruction manuals, terminological schedules)
Single-term vs. term-phrase identifiers

12
The retrieval process
[Figure: components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index, Text Database. Labeled flows: user need, text, logical view, query, user feedback, retrieved documents, ranked documents.]
13
Information Retrieval

Generic information retrieval system: select and
return to the user desired documents from a large
set of documents in accordance with criteria
specified by the user
Functions
Document search: the selection of documents from
an existing collection of documents
Document routing: the dissemination of incoming
documents to appropriate users on the basis of
user interest profiles

14
Detection Need

Definition: a set of criteria specified by the
user which describes the kind of information
desired.
queries in document search task
profiles in routing task
forms
keywords
keywords with Boolean operators
free text
example documents
...
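
One of the forms listed above, keywords with Boolean operators, can be sketched as a small recursive evaluator. The nested-tuple query syntax is a hypothetical representation chosen for this example, not a format defined in the slides.

```python
def matches(query, doc_terms):
    """Evaluate a nested Boolean query against a set of document terms.

    A query is either a keyword string or a tuple:
    ("AND", q1, q2, ...), ("OR", q1, q2, ...), or ("NOT", q).
    """
    if isinstance(query, str):
        return query in doc_terms
    op, *args = query
    if op == "AND":
        return all(matches(a, doc_terms) for a in args)
    if op == "OR":
        return any(matches(a, doc_terms) for a in args)
    if op == "NOT":
        return not matches(args[0], doc_terms)
    raise ValueError(f"unknown operator: {op}")

# tennis AND (ncaa OR college) AND NOT golf
need = ("AND", "tennis", ("OR", "ncaa", "college"), ("NOT", "golf"))
doc_a = {"tennis", "ncaa", "ranking"}
doc_b = {"tennis", "golf", "college"}

a_ok = matches(need, doc_a)  # satisfies every clause
b_ok = matches(need, doc_b)  # rejected by the NOT clause
```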

15
search vs. routing

The search process matches a single Detection
Need against the stored corpus to return a subset
of documents.
Routing matches a single document against a group
of Profiles to determine which users are
interested in the document.
Profiles stand as long-term expressions of user
needs.
Search queries are ad hoc in nature.
A generic detection architecture can be used for
both search and routing.

16
Search

retrieval of desired documents from an existing
corpus
Retrospective search is frequently interactive.
Methods
index the corpus by keyword, stem and/or
phrase
apply statistical and/or learning techniques to
better understand the content of the corpus
analyze free-text Detection Needs to compare with
the indexed corpus or a single document
...

17
Document Detection Search
18
Document Detection Search (Continued)

Document Corpus
the content of the corpus may be significant for
the performance in some applications
Preprocessing of Document Corpus
stemming
a list of stop words
phrases, multi-term items
...

19
Document Detection Search (Continued)

Building Index from Stems
key place for optimizing run-time performance
cost to build the index for a large corpus
Document Index
a list of terms, stems, phrases, etc.
frequency of terms in the document and corpus
frequency of the co-occurrence of terms within
the corpus
index may be as large as the original document
corpus
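
A minimal sketch of a document index that records term frequencies per document and across the corpus. To stay short it assumes whitespace tokenization and skips stemming; the three-document corpus is invented for illustration.

```python
from collections import Counter, defaultdict

def build_index(corpus):
    """Map each term to {doc_id: frequency of the term in that document}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return dict(index)

def corpus_frequency(index, term):
    """Total occurrences of a term across the whole corpus."""
    return sum(index.get(term, {}).values())

# Hypothetical toy corpus.
corpus = {
    "d1": "tennis team ranking tennis",
    "d2": "college tennis coach",
    "d3": "college golf team",
}
index = build_index(corpus)
# index["tennis"] == {"d1": 2, "d2": 1}
```

Even in this toy case the index holds one entry per distinct (term, document) pair, which hints at why a full index can approach the size of the corpus itself.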

20
Document Detection Search (Continued)

Detection Need
the user's criteria for a relevant document
Convert Detection Need to System Specific Query
The Detection Need is first transformed into a
detection query, and then into a retrieval query.
Detection query: specific to the retrieval
engine, but independent of the corpus
Retrieval query: specific to the retrieval
engine, and to the corpus
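
The two-step conversion can be illustrated with a sketch. The slides do not say what each step computes, so everything here is an assumption: the detection query is taken to be a normalized term list (no corpus knowledge needed), and the retrieval query is taken to be those terms filtered and weighted using corpus document frequencies (an inverse-document-frequency weighting chosen for illustration).

```python
import math

def to_detection_query(need):
    """Corpus-independent step: normalize the need into a sorted term list."""
    return sorted(set(need.lower().split()))

def to_retrieval_query(detection_query, doc_freq, n_docs):
    """Corpus-specific step: drop terms the corpus never uses and weight
    the rest by log(n_docs / document frequency), an assumed weighting."""
    return {
        term: math.log(n_docs / doc_freq[term])
        for term in detection_query
        if doc_freq.get(term, 0) > 0
    }

# Hypothetical document frequencies for a corpus of 4 documents.
doc_freq = {"tennis": 2, "team": 4, "coach": 1}

dq = to_detection_query("Tennis team coach rankings")
rq = to_retrieval_query(dq, doc_freq, n_docs=4)
# "rankings" is dropped (absent from the corpus); "team" gets weight 0.0
# because it occurs in every document.
```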

21
Document Detection Search (Continued)

Compare Query with Index
Resultant Rank Ordered List of Documents
Return the top N documents
Rank the list of relevant documents from the most
relevant to the query to the least relevant
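
A minimal sketch of producing a rank-ordered list and returning the top N documents. The scoring rule (summed frequency of the query terms) is a deliberately simple assumption; the slides do not specify a scoring function, and the corpus is invented.

```python
from collections import Counter

def rank(query, corpus, top_n=2):
    """Score each document by the summed frequency of the query terms and
    return the top_n (doc_id, score) pairs, best first."""
    q_terms = query.lower().split()
    scores = {}
    for doc_id, text in corpus.items():
        tf = Counter(text.lower().split())
        score = sum(tf[t] for t in q_terms)
        if score > 0:  # documents sharing no term with the query are not returned
            scores[doc_id] = score
    # Sort by descending score; break ties by document id for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_n]

# Hypothetical toy corpus.
corpus = {
    "d1": "tennis team ranking tennis",
    "d2": "college tennis coach",
    "d3": "college golf team",
}
ranked = rank("tennis team", corpus)
# ranked == [("d1", 3), ("d2", 1)]
```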

22
Routing
23
Routing (Continued)

Profile of Multiple Detection Needs
A Profile is a group of individual Detection
Needs that describes a user's areas of interest.
All Profiles will be compared to each incoming
document (via the Profile index).
If a document matches a Profile, the user is
notified about the existence of a relevant
document.

24
Routing (Continued)

Convert Detection Need to System Specific Query
Building Index from Queries
similar to building the corpus index for searching
the quantity of source data (Profiles) is usually
much less than a document corpus
Profiles may have more specific, structured data
in the form of SGML-tagged fields

25
Routing (Continued)

Routing Profile Index
The index will be system specific and will make
use of all the preprocessing techniques employed
by a particular detection system.
Document to be routed
A stream of incoming documents is handled one at
a time to determine where each should be
directed.
Routing implementation may handle multiple
document streams and multiple Profiles.

26
Routing (Continued)

Preprocessing of Document
A document is preprocessed in the same manner
as a query would be set up in a search
The document and query roles are reversed
compared with the search process
Compare Document with Index
Identify which Profiles are relevant to the
document
Given a document, which of the indexed profiles
match it?
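
The role reversal in routing can be sketched as follows: the Profiles are indexed once, and each incoming document is used as the "query" against that index. The overlap threshold and the sample profiles are assumptions made for this illustration.

```python
def build_profile_index(profiles):
    """Invert the profiles: term -> set of users whose Profile contains it."""
    index = {}
    for user, terms in profiles.items():
        for term in terms:
            index.setdefault(term, set()).add(user)
    return index

def route(document, profile_index, min_overlap=2):
    """Return the users whose Profile shares at least min_overlap terms
    with the incoming document."""
    hits = {}
    for term in set(document.lower().split()):
        for user in profile_index.get(term, ()):
            hits[user] = hits.get(user, 0) + 1
    return sorted(user for user, n in hits.items() if n >= min_overlap)

# Hypothetical user Profiles.
profiles = {
    "alice": {"tennis", "ncaa", "ranking"},
    "bob": {"golf", "ncaa"},
    "carol": {"chess"},
}
profile_index = build_profile_index(profiles)
recipients = route("NCAA tennis ranking released", profile_index)
# recipients == ["alice"]
```

Because the Profile index is typically much smaller than a document corpus, each incoming document can be matched against it one at a time as the stream arrives.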

27
Routing (Continued)

Resultant List of Profiles
The list of Profiles identifies which users should
receive the document

28
Summary

Generate a representation of the meaning or
content of each object based on its description.
Generate a representation of the meaning of the
information need.
Compare these two representations to select those
objects that are most likely to match the
information need.
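
The three steps above can be sketched end to end. The representation (a bag of words) and the comparison measure (cosine similarity) are assumptions chosen for illustration; the slide does not prescribe either.

```python
import math
from collections import Counter

def represent(text):
    """Represent an object or an information need as term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count representations."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def best_match(need, objects):
    """Select the object whose representation is closest to the need."""
    q = represent(need)
    return max(objects, key=lambda name: cosine(q, represent(objects[name])))

# Hypothetical object descriptions.
objects = {"d1": "tennis team ranking", "d2": "golf team news"}
winner = best_match("college tennis ranking", objects)  # "d1"
```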

29
Basic Architecture of an Information Retrieval
System
[Figure: documents and queries are mapped to a document representation and a query representation, which are then compared]
30
Text retrieval system
[Figure: indexing turns the document collections into a representation of documents; query formulation turns the user query into a representation of the query; similarity matching between the two yields the results of retrieval (relevant results) for the user]
31
Research Issues

Given a set of descriptions for the objects in the
collection and a description of an information
need, we must consider
Issue 1
What makes a good document representation?
What are retrievable units and how are they
organized?
How can a representation be generated from a
description of the document?

32
Research Issues (Continued)

Issue 2: How can we represent the information need
and how can we acquire this representation either
from a description of the information need or
through interaction with the user?
Issue 3: How can we compare representations to
judge the likelihood that a document matches an
information need?

33
Research Issues (Continued)

Issue 4: How can we evaluate the effectiveness of
the retrieval process?
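
One common way to approach this question is to compare the retrieved set with a set of documents judged relevant. The sketch below uses precision and recall; these measures are not named on this slide and are introduced here as an assumption, with invented document ids.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 3 judged relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
# p == 0.5, r == 2/3
```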

34
Text Data Mining Tasks

Information extraction -- facts, fill database
Summarization
Categorization
Clustering
Associations
Temporal analysis of document collection

35
Information Extraction: Beyond Document Retrieval

Question Answering
Q: Who is the author of the book "The Iron Lady:
A Biography of Margaret Thatcher"?
A: Hugo Young
Q: What was the monetary value of the Nobel Peace
Prize in 1989?
A: 469,000

36
Information Extraction

Generic Information Extraction System: An
information extraction system is a cascade of
transducers or modules that at each step add
structure and often lose information, hopefully
irrelevant, by applying rules that are acquired
manually and/or automatically.

37
Information Extraction (Continued)

What are the transducers or modules?
What are their input and output?
What structure is added?
What information is lost?
What is the form of the rules?
How are the rules applied?
How are the rules acquired?

38
Example: Parser

Transducer: parser
Input: the sequence of words or lexical items
Output: a parse tree
Information added: predicate-argument and
modification relations
Information lost: none
Rule form: unification grammars
Application method: chart parser
Acquisition method: manual

39
Modules

Text Zoner: turn a text into a set of text
segments
Preprocessor: turn a text or text segment into a
sequence of sentences, each of which is a
sequence of lexical items, where a lexical item
is a word together with its lexical attributes
Filter: turn a set of sentences into a smaller set
of sentences by filtering out the irrelevant ones
Preparser: take a sequence of lexical items and
try to identify various reliably determinable,
small-scale structures
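
The idea of a cascade of modules can be sketched with the first three modules above, each a function whose output feeds the next. The segmentation rules (blank lines, full stops) and the keyword-based relevance test are simplifying assumptions for illustration, and lexical attributes are omitted.

```python
def zone(text):
    """Text zoner: split a text into segments (here, at blank lines)."""
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]

def preprocess(segments):
    """Preprocessor: turn segments into sentences, each a list of lowercase words."""
    sentences = []
    for seg in segments:
        for sent in seg.split("."):
            words = sent.lower().split()
            if words:
                sentences.append(words)
    return sentences

def make_filter(keywords):
    """Filter: keep only sentences that mention at least one keyword."""
    def keep_relevant(sentences):
        return [s for s in sentences if keywords & set(s)]
    return keep_relevant

def cascade(*stages):
    """Compose the stages so each one consumes the previous stage's output."""
    def run(data):
        for stage in stages:
            data = stage(data)
        return data
    return run

pipeline = cascade(zone, preprocess, make_filter({"launch"}))
text = "The rocket launch succeeded. Weather was clear.\n\nA second launch is planned."
result = pipeline(text)
# The weather sentence is dropped: structure is added (segments, sentences,
# words) while information judged irrelevant is lost.
```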

40
Modules (Continued)

Parser: input a sequence of lexical items and
perhaps small-scale structures (phrases) and
output a set of parse tree fragments, possibly
complete
Fragment Combiner: turn a set of parse tree or
logical form fragments into a parse tree or
logical form for the whole sentence
Semantic Interpreter: generate a semantic
structure or logical form from a parse tree or
from parse tree fragments

41
Modules (Continued)

Lexical Disambiguation: turn a semantic structure
with general or ambiguous predicates into a
semantic structure with specific, unambiguous
predicates
Co-reference Resolution, or Discourse
Processing: turn a tree-like structure into a
network-like structure by identifying different
descriptions of the same entity in different
parts of the text
Template Generator: derive the templates from the
semantic structures

42
<DOC>
<DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID>
<DATASET> Air Vehicle Launch </DATASET>
<DD> 1997/06/12 </DD>
<DOCTYPE> ???? </DOCTYPE>
<DOCSRC> ???? </DOCSRC>
<TEXT>
[body text of the sample document lost in extraction]
43
[sample document text, continued; lost in extraction]
44
[sample document text, continued; lost in extraction]
</TEXT>
</DOC>
45
[Co-reference annotation example: text spans are tagged with ID attributes, and a REF attribute points to the ID of a co-referring span (IDs 4 and 5 refer to ID 3; IDs 64, 65, and 66 refer to ID 63; IDs 69 and 70 refer to ID 65). The annotated text itself was lost in extraction.]
46
The Advanced Research and Development Activity
(ARDA)

A joint activity of the Intelligence Community
(IC) and the Department of Defense (DOD) (late
November 1998)
The Intelligence Community's center for
conducting advanced research and development
related to extracting intelligence from, and
providing security for, information stored,
transmitted, or manipulated by electronic means
47
(No Transcript)
48
ARDA R&D Programs

Information Exploitation
Pulling Information
Pushing Information
Visualizing and Navigating Information
Quantum Information Science Photonics
Digital Network Intelligence

49
Pulling Information

Providing answers to complex, multifaceted
questions that analysts pose
The analyst seeks to "pull" the answer out of
multiple, very large, heterogeneous data sources
that may physically reside in diverse locations

50
Pulling Information (Continued)

Accepting complex questions in a form natural to
the analyst.
Questions may include judgment terms and an
acceptable answer may need to be based upon
conclusions and decisions reached by the system
and may require the summarization, fusion, and
synthesis of information drawn from multiple
sources.
Translating analytic questions into multiple
queries appropriate to the various data sets to
be searched.
Finding relevant information in distributed,
multimedia, multilingual, multi-agency data sets.
Analyzing, fusing and summarizing information
into a coherent answer.
Providing the answer to the analyst in the form
that he or she wants.

51
Pushing Information

Providing information that analysts have not
asked for, drawn from multiple, very large,
heterogeneous data sources
The system discovers information through
profiling, clustering, pattern recognition, data
mining, or some other means and "pushes" this
information to the analysts who, the system
determines, might have an interest.

52
Pushing Information (Continued)

Profiling and blind clustering of new data.
Detecting anomalies, patterns and changes in
large volumes of data.
Analyzing the nature and description of the
anomalies, patterns, and changes.
Alerting the appropriate analyst(s) of the newly
discovered information.

53
Topics

Introduction to Information Retrieval and
Extraction
Modeling
Retrieval Evaluation
Query Languages
Query Operations
Text and Multimedia Languages and Properties
Text Operations
Indexing and Searching

54
Topics (Continued)

User Interfaces and Visualization
Multimedia IR Models and Languages
Multimedia IR Indexing and Searching
Searching the Web
Digital Libraries
Information Extraction (Jerry R. Hobbs)
Text Data Mining (Marti Hearst)

55
Text IR
Applications for IR
Human-Computer Interaction for IR
Retrieval Models and Evaluation
Bibliographic Systems
Interfaces & Visualization
Improvements On Retrieval
The Web
Multimedia IR
Multimedia Modeling & Searching
Digital Libraries
Efficient Processing
56
Information Sources

Books
Baeza-Yates, R. and Ribeiro-Neto, B. (1999)
Modern Information Retrieval. Addison-Wesley.
Salton, G. (1989) Automatic Text Processing: The
Transformation, Analysis, and Retrieval of
Information by Computer. Reading, MA:
Addison-Wesley.
Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992)
Information Retrieval: Data Structures and
Algorithms. Englewood Cliffs, NJ: Prentice Hall.
Cheong, F. (1996) Internet Agents: Spiders,
Wanderers, Brokers, and Bots. Indianapolis, IN:
New Riders.
Sparck Jones, K. and Willett, P. (1997)
Readings in Information Retrieval. CA: Morgan
Kaufmann Publishers.

57
Information Sources

Conference Proceedings
ACM SIGIR Annual International Conference on
Research and Development in Information Retrieval
(1978-)
ACM International Conference on Digital Libraries
ACM Conference on Information and Knowledge
Management
Text Retrieval Conference

58
Information Sources (Continued)

Journals
ACM Transactions on Information Systems
Information Processing and Management (formerly
Information Storage and Retrieval)
Journal of the American Society for Information
Science (formerly American Documentation)
Journal of Documentation
Information Systems
Information Retrieval
Knowledge and Information Systems
