Title: Chapter 1 Introduction
Motivation
- Information retrieval
- To retrieve information which might be useful or
relevant to the user
- Issues: representation, storage, organization,
access
- Information need (in reality) versus query
- Find all the pages containing information on
college tennis teams which
- (1) are maintained by a university in the USA
and
- (2) participate in the NCAA tennis tournament.
- To be relevant, the page must include information
on the national ranking of the team in the last
three years and the email or phone number of the
team coach.
[Figure: diagram with the labels user, Web browser, request, response, query server, search engine, Web robot, Internet]
Information versus Data Retrieval
- Data retrieval
- Determine which documents of a collection contain
the keywords in the user query
- Retrieve all objects which satisfy clearly
defined conditions in a regular expression or
relational algebra expression
- Data has a well-defined structure and semantics
- Solution to the user of a database system
- Information retrieval
Database Management
- A specified set of attributes is used to
characterize each item.
EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)
- Exact match between the attributes used in query
formulations and those attached to the record.
SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'
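The exact-match behaviour of data retrieval can be tried out with a small sketch. It uses Python's standard sqlite3 module and the EMPLOYEE schema from the slide; the sample rows and the helper name lookup are made up for illustration and are not part of the original material.

```python
import sqlite3

# In-memory database; the columns follow the EMPLOYEE relation on the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE ("
    "NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
)

# Hypothetical sample rows, invented for this illustration only.
rows = [
    ("John Smith", "000-00-0001", "1965-01-09", "731 Fondren, Houston, TX", "M", 30000.0, 5),
    ("Jon Smith", "000-00-0002", "1970-03-02", "12 Main St, Austin, TX", "M", 28000.0, 4),
]
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def lookup(name):
    """Return (BDATE, ADDR) for every record whose NAME equals `name` exactly."""
    cur = conn.execute("SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", (name,))
    return cur.fetchall()


exact = lookup("John Smith")      # matches one record
near_miss = lookup("John Smyth")  # a one-letter difference matches nothing
```

A record either satisfies the condition or it does not; there is no notion of a partially relevant answer, which is the contrast with information retrieval drawn on the previous slide.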
Basic Concepts for IR
- Content identifiers (keywords, index terms,
descriptors) characterize the stored texts.
- Degrees of coincidence between the sets of
identifiers attached to queries and documents
[Figure labels: logical view of the documents, user task, query formulation, content analysis]
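One way to make "degree of coincidence" concrete is a set-overlap score. The sketch below uses the Jaccard coefficient, which is only one possible choice and is an assumption here; the slide does not name a measure. The documents, the query and the function name coincidence are illustrative.

```python
def coincidence(query_terms, doc_terms):
    """Jaccard coefficient between two sets of content identifiers."""
    q, d = set(query_terms), set(doc_terms)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


# Hypothetical identifier sets attached to three stored texts.
docs = {
    "d1": ["college", "tennis", "team", "ranking"],
    "d2": ["tennis", "racket", "review"],
    "d3": ["database", "query", "language"],
}
query = ["college", "tennis", "ranking"]

scores = {doc_id: coincidence(query, terms) for doc_id, terms in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Unlike the exact match of a database query, the score is graded, so documents can be ordered by how closely their identifiers coincide with those of the query.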
The User Task
- Convey the semantics of information need
- Retrieval and browsing
[Figure labels: retrieval, browsing, database]
Logical View of Documents
- Full text representation
- A set of index terms
- Elimination of stop-words
- The use of stemming
- The identification of noun groups
From full text to a set of index terms
[Figure labels: document; text; text structure; structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; automatic or manual indexing; structure (e.g., e-mail); full text; index terms]
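The text operations named above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stopword list is a tiny sample, toy_stem is a crude suffix stripper rather than a real stemming algorithm, and noun-group identification is left out.

```python
import re
import unicodedata

STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}  # illustrative sample
SUFFIXES = ("ing", "s")  # toy rules, not a real stemmer


def strip_accents(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Full text -> sorted set of index terms."""
    normalized = strip_accents(text).lower()          # accents, case
    tokens = re.findall(r"[a-z0-9]+", normalized)     # spacing, punctuation
    kept = [t for t in tokens if t not in STOPWORDS]  # stopword elimination
    return sorted({toy_stem(t) for t in kept})        # stemming


terms = index_terms("Indexing the résumés of college teams")
```

Each step shrinks the representation, which is the trade-off the slide describes: a smaller logical view is cheaper to store and search but carries less of the original text.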
Indexing
- Indexing: assign identifiers to text items.
- Assignment: manual vs. automatic indexing
- Identifiers
- objective vs. nonobjective text identifiers;
cataloging rules define, e.g., author names,
publisher names, dates of publication
- controlled vs. uncontrolled vocabularies:
instruction manuals, terminological schedules
- single-term vs. term phrase
The retrieval process
[Figure: the retrieval process. Components: user interface, text operations, query operations, indexing, searching, ranking, DB manager module, index, text database. Flows: user need, text, logical view, query, user feedback, retrieved documents, ranked documents.]
Information Retrieval
- Generic information retrieval system: select and
return to the user desired documents from a large
set of documents in accordance with criteria
specified by the user
- Functions
- document search: the selection of documents from
an existing collection of documents
- document routing: the dissemination of incoming
documents to appropriate users on the basis of
user interest profiles
Detection Need
- Definition: a set of criteria specified by the
user which describes the kind of information
desired.
- queries in document search task
- profiles in routing task
- Forms
- keywords
- keywords with Boolean operators
- free text
- example documents
- ...
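As an illustration of the "keywords with Boolean operators" form, the sketch below evaluates a Detection Need written as a nested tuple against keyword postings. The tuple notation, the sample documents and the function names are assumptions made for this example, not a format defined in the slides.

```python
def build_postings(docs):
    """Map each keyword to the set of ids of the documents that contain it."""
    postings = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings.setdefault(term, set()).add(doc_id)
    return postings


def evaluate(need, postings, all_ids):
    """Evaluate a keyword or a nested ("AND" | "OR" | "NOT", ...) tuple."""
    if isinstance(need, str):
        return set(postings.get(need, ()))
    op, *args = need
    if op == "AND":
        return evaluate(args[0], postings, all_ids) & evaluate(args[1], postings, all_ids)
    if op == "OR":
        return evaluate(args[0], postings, all_ids) | evaluate(args[1], postings, all_ids)
    if op == "NOT":
        return all_ids - evaluate(args[0], postings, all_ids)
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical three-document corpus.
docs = {
    1: "college tennis team ranking",
    2: "tennis coach email",
    3: "college football ranking",
}
postings = build_postings(docs)
all_ids = set(docs)

# "tennis AND NOT coach"
matches = evaluate(("AND", "tennis", ("NOT", "coach")), postings, all_ids)
```

Boolean needs return an unordered set; the free-text and example-document forms usually call for a graded comparison instead.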
Search vs. routing
- The search process matches a single Detection
Need against the stored corpus to return a subset
of documents.
- Routing matches a single document against a group
of Profiles to determine which users are
interested in the document.
- Profiles stand as long-term expressions of user
needs.
- Search queries are ad hoc in nature.
- A generic detection architecture can be used for
both search and routing.
Search
- Retrieval of desired documents from an existing
corpus
- Retrospective search is frequently interactive.
- Methods
- index the corpus by keyword, stem and/or
phrase
- apply statistical and/or learning techniques to
better understand the content of the corpus
- analyze free-text Detection Needs to compare with
the indexed corpus or a single document
- ...
Document Detection: Search
Document Detection: Search (Continued)
- Document Corpus
- the content of the corpus may be significant for
the performance in some applications
- Preprocessing of Document Corpus
- stemming
- a list of stop words
- phrases, multi-term items
- ...
Document Detection: Search (Continued)
- Building Index from Stems
- key place for optimizing run-time performance
- cost to build the index for a large corpus
- Document Index
- a list of terms, stems, phrases, etc.
- frequency of terms in the document and corpus
- frequency of the co-occurrence of terms within
the corpus
- index may be as large as the original document
corpus
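The contents of the document index listed above can be sketched with standard-library containers. This is a minimal sketch: the corpus is assumed to be already preprocessed into stems, co-occurrence is counted at document level (one possible definition), and all names and data are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_index(corpus):
    """Build postings, corpus-wide term frequencies and co-occurrence counts."""
    postings = defaultdict(dict)  # term -> {doc_id: frequency in that document}
    corpus_freq = Counter()       # term -> frequency in the whole corpus
    cooccurrence = Counter()      # (term_a, term_b) -> documents containing both
    for doc_id, terms in corpus.items():
        counts = Counter(terms)
        for term, freq in counts.items():
            postings[term][doc_id] = freq
            corpus_freq[term] += freq
        # Sorting gives each unordered pair one canonical key.
        for pair in combinations(sorted(counts), 2):
            cooccurrence[pair] += 1
    return dict(postings), corpus_freq, cooccurrence


# Hypothetical corpus of already-stemmed documents.
corpus = {
    "d1": ["tennis", "team", "rank", "tennis"],
    "d2": ["tennis", "coach"],
    "d3": ["team", "coach", "rank"],
}
postings, corpus_freq, cooccurrence = build_index(corpus)
```

The pairwise co-occurrence table grows much faster than the term list, which is one reason an index can approach the size of the corpus it describes.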
Document Detection: Search (Continued)
- Detection Need
- the user's criteria for a relevant document
- Convert Detection Need to System-Specific Query
- first transformed into a detection query, and
then a retrieval query.
- detection query: specific to the retrieval
engine, but independent of the corpus
- retrieval query: specific to the retrieval
engine, and to the corpus
Document Detection: Search (Continued)
- Compare Query with Index
- Resultant Rank Ordered List of Documents
- Return the top N documents
- Rank the list of relevant documents from the most
relevant to the query to the least relevant
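Comparing a query with the index and returning the top N documents might look like the sketch below. The weighting (term frequency times the logarithm of inverse document frequency) is an assumption chosen for illustration; the slides do not prescribe a scoring scheme, and the corpus is made up.

```python
import math
from collections import Counter


def rank(query_terms, corpus, top_n):
    """Score each document against the query and return the top N (doc_id, score)."""
    n_docs = len(corpus)
    doc_freq = Counter()  # term -> number of documents containing it
    for terms in corpus.values():
        doc_freq.update(set(terms))

    scores = {}
    for doc_id, terms in corpus.items():
        tf = Counter(terms)
        score = sum(
            tf[t] * math.log(n_docs / doc_freq[t]) for t in query_terms if t in tf
        )
        if score > 0:
            scores[doc_id] = score

    # Most relevant first; ties are broken by document id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


# Hypothetical corpus of already-stemmed documents.
corpus = {
    "d1": ["tennis", "team", "rank", "tennis"],
    "d2": ["tennis", "coach"],
    "d3": ["team", "coach", "rank", "rank"],
    "d4": ["football", "news"],
}
top = rank(["tennis", "rank"], corpus, top_n=2)
top_ids = [doc_id for doc_id, _ in top]
```

Documents that share no term with the query receive no score and are left out of the list rather than ranked last.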
Routing
Routing (Continued)
- Profile of Multiple Detection Needs
- A Profile is a group of individual Detection
Needs that describes a user's areas of interest.
- All Profiles will be compared to each incoming
document (via the Profile index).
- If a document matches a Profile, the user is
notified about the existence of a relevant
document.
Routing (Continued)
- Convert Detection Need to System-Specific Query
- Building Index from Queries
- similar to building the corpus index for searching
- the quantity of source data (Profiles) is usually
much less than a document corpus
- Profiles may have more specific, structured data
in the form of SGML-tagged fields
Routing (Continued)
- Routing Profile Index
- The index will be system-specific and will make
use of all the preprocessing techniques employed
by a particular detection system.
- Document to be routed
- A stream of incoming documents is handled one at
a time to determine where each should be
directed.
- Routing implementation may handle multiple
document streams and multiple Profiles.
Routing (Continued)
- Preprocessing of Document
- A document is preprocessed in the same manner
that a query would be set up in a search
- The document and query roles are reversed
compared with the search process
- Compare Document with Index
- Identify which Profiles are relevant to the
document
- Given a document, which of the indexed profiles
match it?
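The reversed roles in routing can be sketched as follows: the Profiles are indexed once, and each incoming document plays the part of the query. The matching rule used here (a Detection Need matches when all of its terms occur in the document) is an assumption for illustration, as are the profile data and function names.

```python
from collections import defaultdict


def build_profile_index(profiles):
    """Index Profiles: term -> set of (user, need number) that mention it."""
    index = defaultdict(set)
    need_sizes = {}
    for user, needs in profiles.items():
        for i, need in enumerate(needs):
            need_sizes[(user, i)] = len(need)
            for term in need:
                index[term].add((user, i))
    return index, need_sizes


def route(doc_terms, index, need_sizes):
    """Return the users with at least one Detection Need fully covered by the document."""
    hits = defaultdict(int)
    for term in set(doc_terms):
        for key in index.get(term, ()):
            hits[key] += 1
    return sorted({user for (user, i), n in hits.items() if n == need_sizes[(user, i)]})


# Hypothetical Profiles: each user has one or more Detection Needs (term sets).
profiles = {
    "alice": [{"tennis", "ranking"}, {"nobel", "prize"}],
    "bob": [{"football"}],
    "carol": [{"tennis", "coach"}],
}
index, need_sizes = build_profile_index(profiles)

# Incoming documents are handled one at a time.
recipients = route(["ncaa", "tennis", "ranking", "coach"], index, need_sizes)
```

Because the Profile index is small and long-lived while documents arrive continuously, the lookup per document touches only the Profiles that share a term with it.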
Routing (Continued)
- Resultant List of Profiles
- The list of Profiles identifies which users should
receive the document
Summary
- Generate a representation of the meaning or
content of each object based on its description.
- Generate a representation of the meaning of the
information need.
- Compare these two representations to select those
objects that are most likely to match the
information need.
Basic Architecture of an Information Retrieval
System
[Figure: documents are mapped to a document representation and queries to a query representation; the two representations meet in a comparison step]
Text retrieval system
[Figure labels: user, user query, query formulation, representation of a query, document collections, indexing, representation of documents, matching of similarity, results of retrieval, relevant results]
Research Issues
- Given a set of descriptions for objects in the
collection and a description of an information
need, we must consider
- Issue 1
- What makes a good document representation?
- What are retrievable units and how are they
organized?
- How can a representation be generated from a
description of the document?
Research Issues (Continued)
- Issue 2: How can we represent the information need
and how can we acquire this representation either
from a description of the information need or
through interaction with the user?
- Issue 3: How can we compare representations to
judge the likelihood that a document matches an
information need?
Research Issues (Continued)
- Issue 4: How can we evaluate the effectiveness of
the retrieval process?
Text Data Mining Tasks
- Information extraction -- facts, fill database
- Summarization
- Categorization
- Clustering
- Associations
- Temporal analysis of document collection
Information Extraction: Beyond Document Retrieval
- Question and Answering
- Q: Who is the author of the book, "The Iron Lady:
A Biography of Margaret Thatcher"?
A: Hugo Young
- Q: What was the monetary value of the Nobel Peace
Prize in 1989?
A: 469,000
Information Extraction
- Generic Information Extraction System: An
information extraction system is a cascade of
transducers or modules that at each step add
structure and often lose information, hopefully
irrelevant, by applying rules that are acquired
manually and/or automatically.
Information Extraction (Continued)
- What are the transducers or modules?
- What are their input and output?
- What structure is added?
- What information is lost?
- What is the form of the rules?
- How are the rules applied?
- How are the rules acquired?
Example: Parser
- transducer: parser
- input: the sequence of words or lexical items
- output: a parse tree
- information added: predicate-argument and
modification relations
- information lost: none
- rule form: unification grammars
- application method: chart parser
- acquisition method: manually
Modules
- Text Zoner: turn a text into a set of text
segments
- Preprocessor: turn a text or text segment into a
sequence of sentences, each of which is a
sequence of lexical items, where a lexical item
is a word together with its lexical attributes
- Filter: turn a set of sentences into a smaller set
of sentences by filtering out the irrelevant ones
- Preparser: take a sequence of lexical items and
try to identify various reliably determinable,
small-scale structures
Modules (Continued)
- Parser: input a sequence of lexical items and
perhaps small-scale structures (phrases) and
output a set of parse tree fragments, possibly
complete
- Fragment Combiner: turn a set of parse tree or
logical form fragments into a parse tree or
logical form for the whole sentence
- Semantic Interpreter: generate a semantic
structure or logical form from a parse tree or
from parse tree fragments
Modules (Continued)
- Lexical Disambiguation: turn a semantic structure
with general or ambiguous predicates into a
semantic structure with specific, unambiguous
predicates
- Co-reference Resolution, or Discourse
Processing: turn a tree-like structure into a
network-like structure by identifying different
descriptions of the same entity in different
parts of the text
- Template Generator: derive the templates from the
semantic structures
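The cascade idea can be illustrated with a deliberately small sketch that chains four of the modules: text zoner, preprocessor, filter and template generator. Everything here is a toy under stated assumptions: the rules are hand-written regular expressions, the parsing and semantic modules are skipped, the trigger word "launched" and the sample text are made up, and the template rule (words before the trigger form the agent, words after it form the object) is far cruder than a real system.

```python
import re

TRIGGER = "launched"  # hypothetical event trigger for this illustration


def text_zoner(text):
    """Turn a text into segments (here: paragraphs separated by blank lines)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]


def preprocessor(segment):
    """Turn a segment into sentences, each a list of (word, attributes) items."""
    sentences = re.split(r"(?<=[.!?])\s+", segment)
    return [
        [(w, {"capitalized": w[0].isupper()}) for w in re.findall(r"[A-Za-z0-9]+", s)]
        for s in sentences
        if s
    ]


def sentence_filter(sentences):
    """Keep only the sentences that contain the trigger word."""
    return [s for s in sentences if any(w.lower() == TRIGGER for w, _ in s)]


def template_generator(sentence):
    """Fill a flat template from one sentence using the toy agent/object rule."""
    words = [w for w, _ in sentence]
    i = [w.lower() for w in words].index(TRIGGER)
    return {
        "agent": " ".join(words[:i]),
        "event": "launch",
        "object": " ".join(words[i + 1:]),
    }


def extract(text):
    """Run the cascade: each module consumes the previous module's output."""
    templates = []
    for segment in text_zoner(text):
        for sentence in sentence_filter(preprocessor(segment)):
            templates.append(template_generator(sentence))
    return templates


sample = "HEADLINE\n\nAcme launched Rocket One. The weather was fine."
templates = extract(sample)
```

The sketch also shows the "add structure, lose information" pattern: the filter discards the weather sentence, and the template keeps three fields while dropping punctuation and word attributes.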
<DOC>
<DOCID> NTU-AIR_LAUNCH-[...]-19970612-002 </DOCID>
<DATASET> Air Vehicle Launch </DATASET>
<DD> 1997/06/12 </DD>
<DOCTYPE> [...] </DOCTYPE>
<DOCSRC> [...] </DOCSRC>
<TEXT> [Chinese text, not recoverable from this transcript] </TEXT>
</DOC>
[Example: annotated Chinese text, not recoverable from this transcript, marked up with tags of the form <ID="3">, <ID="4" REF="3">, <ID="5" REF="3">, <ID="63">, <ID="66" REF="63">, <ID="65" REF="63">, <ID="70" REF="65">, <ID="69" REF="65">, <ID="64" REF="63">]
The Advanced Research and Development Activity
(ARDA)
- a joint activity of the Intelligence Community
(IC) and the Department of Defense (DOD) in late
November 1998
- the Intelligence Community's (IC) center for
conducting advanced research and development
related to extracting intelligence from, and
providing security for, information stored,
transmitted, or manipulated by electronic means
ARDA R&D Programs
- Information Exploitation
- Pulling Information
- Pushing Information
- Visualizing and Navigating Information
- Quantum Information Science Photonics
- Digital Network Intelligence
Pulling Information
- Providing answers to complex, multifaceted
questions that analysts pose
- The analyst seeks to "pull" the answer out of
multiple, very large, heterogeneous data sources
that may physically reside in diverse locations
Pulling Information (Continued)
- Accepting complex questions in a form natural to
the analyst.
- Questions may include judgment terms, and an
acceptable answer may need to be based upon
conclusions and decisions reached by the system
and may require the summarization, fusion, and
synthesis of information drawn from multiple
sources.
- Translating analytic questions into multiple
queries appropriate to the various data sets to
be searched.
- Finding relevant information in distributed,
multimedia, multilingual, multi-agency data sets.
- Analyzing, fusing and summarizing information
into a coherent answer.
- Providing the answer to the analyst in the form
that he/she wants.
Pushing Information
- Providing information from multiple, very large,
heterogeneous data sources that analysts did not
ask for
- The system discovers information through
profiling, clustering, pattern recognition, data
mining, or some other means and "pushes" this
information to analysts that the system
determines might have an interest.
Pushing Information (Continued)
- Profiling and blind clustering of new data.
- Detecting anomalies, patterns and changes in
large volumes of data.
- Analyzing the nature and description of the
anomalies, patterns, and changes.
- Alerting the appropriate analyst(s) of the newly
discovered information.
Topics
- Introduction to Information Retrieval and
Extraction
- Modeling
- Retrieval Evaluation
- Query Languages
- Query Operations
- Text and Multimedia Languages and Properties
- Text Operations
- Indexing and Searching
Topics (Continued)
- User Interfaces and Visualization
- Multimedia IR Models and Languages
- Multimedia IR Indexing and Searching
- Searching the Web
- Digital Libraries
- Information Extraction (Jerry R. Hobbs)
- Text Data Mining (Marti Hearst)
[Figure: topic map with the labels Text IR, Retrieval Models and Evaluation, Improvements on Retrieval, Efficient Processing, Human-Computer Interaction for IR, Interfaces & Visualization, Multimedia IR, Multimedia Modeling & Searching, Applications for IR, Bibliographic Systems, The Web, Digital Libraries]
Information Sources
- Books
- Ricardo Baeza-Yates and Berthier Ribeiro-Neto
(1999) Modern Information Retrieval,
Addison-Wesley. (03)5720317
- Salton, G. (1989) Automatic Text Processing: The
Transformation, Analysis and Retrieval of
Information by Computer. Reading, MA:
Addison-Wesley.
- Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992)
Information Retrieval: Data Structures and
Algorithms. Englewood Cliffs, NJ: Prentice Hall.
- Cheong, F. (1996) Internet Agents: Spiders,
Wanderers, Brokers, and Bots. Indianapolis, IN:
New Riders.
- Karen Sparck Jones and Peter Willett (1997)
Readings in Information Retrieval, CA: Morgan
Kaufmann Publishers.
Information Sources
- Conference Proceedings
- ACM SIGIR Annual International Conference on
Research and Development in Information Retrieval
(1978-)
- ACM International Conference on Digital Libraries
- ACM Conference on Information and Knowledge
Management
- Text Retrieval Conference
Information Sources (Continued)
- Journals
- ACM Transactions on Information Systems
- Information Processing and Management (formerly
Information Storage and Retrieval)
- Journal of the American Society for Information
Science (formerly American Documentation)
- Journal of Documentation
- Information Systems
- Information Retrieval
- Knowledge and Information Systems