Chapter 1 Introduction - PowerPoint PPT Presentation

Slides: 59
Provided by: Hsin7
1
Chapter 1 Introduction
2
Motivation
  • Information retrieval
  • To retrieve information which might be useful or
    relevant to the user
  • Issues: representation, storage, organization,
    access
  • Information need (in reality) → query
  • Find all the pages containing information on
    college tennis teams which
  • (1) are maintained by a university in the USA
    and
  • (2) participate in the NCAA tennis tournament.
  • To be relevant, the page must include information
    on the national ranking of the team in the last
    three years and the email or phone number of the
    team coach.

3
(No Transcript)

4
[Figure: Web search architecture. Labels: user, Web browser, request, response, Internet, query server, Web robot, search engine]
5
Information versus Data Retrieval
  • Data retrieval
  • Determine which documents of a collection contain
    the keywords in the user query
  • Retrieve all objects which satisfy clearly
    defined conditions in regular expression or
    relational algebra expression
  • Data has a well defined structure and semantics
  • Solution to the user of a database system
  • Information retrieval

6
Database Management
  • A specified set of attributes is used to
    characterize each item, e.g., EMPLOYEE(NAME, SSN,
    BDATE, ADDR, SEX, SALARY, DNO)
  • Exact match between the attributes used in query
    formulations and those attached to the record:
    SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME =
    'John Smith'
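
The exact-match behavior of data retrieval can be tried out with Python's built-in sqlite3 module. This is only a sketch: the table follows the EMPLOYEE schema from the slide, but the two sample rows are invented for illustration.

```python
import sqlite3

# In-memory database; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (NAME TEXT, SSN TEXT, BDATE TEXT, "
    "ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
)
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "123456789", "1965-01-09", "731 Fondren, Houston, TX", "M", 30000, 5),
        ("John Smithson", "987654321", "1970-03-02", "12 Elm St, Austin, TX", "M", 28000, 4),
    ],
)

# Exact match: only the record whose NAME equals the query value is returned.
rows = conn.execute(
    "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", ("John Smith",)
).fetchall()
print(rows)
```

The near-miss "John Smithson" is not returned: the condition either holds for a record or it does not, with no notion of partial relevance.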

7
Basic Concepts for IR
  • Content identifiers (keywords, index terms,
    descriptors) characterize the stored texts.
  • Retrieval depends on the degree of coincidence
    between the sets of identifiers attached to
    queries and documents

[Figure labels: logical view of the documents, user task, query formulation, content analysis]
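
One simple way to quantify the degree of coincidence between two sets of identifiers is the Jaccard coefficient. The slides do not name a specific measure, so the choice of Jaccard, the function name `coincidence`, and the sample documents are assumptions made for this sketch.

```python
def coincidence(query_terms, doc_terms):
    """Jaccard coefficient: shared identifiers / all distinct identifiers."""
    q, d = set(query_terms), set(doc_terms)
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)

# Invented index terms for three documents.
docs = {
    "d1": ["tennis", "college", "ncaa", "ranking"],
    "d2": ["tennis", "racket", "shop"],
    "d3": ["football", "college"],
}
query = ["college", "tennis", "ranking"]

scores = {name: coincidence(query, terms) for name, terms in docs.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 2))
```

Unlike the exact match used in database management, every document gets a graded score, so partially matching documents can still be returned and ordered.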
8
The User Task
  • Convey the semantics of information need
  • Retrieval and browsing

[Figure labels: retrieval, browsing, database]
9
Logical View of Documents
  • Full text representation
  • A set of index terms
  • Elimination of stop-words
  • The use of stemming
  • The identification of noun groups

10
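From full text to a set of index terms
[Figure: text operations that take a document from full text to index terms. Labels: document (text, text structure), structure recognition, accents and spacing, stopwords, noun groups, stemming, automatic or manual indexing; resulting logical views: structure (e.g., e-mail), full text, index terms]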
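
The reduction from full text to index terms can be sketched in a few lines. The stop-word list and the suffix stripper below are deliberately tiny stand-ins (a real system would use a full stop list and an established stemmer), noun-group identification is left out, and all function names are hypothetical.

```python
import re
import unicodedata

STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}  # tiny illustrative list

def normalize(text):
    """Strip accents, lowercase, and split on non-alphanumeric characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.findall(r"[a-z0-9]+", ascii_text.lower())

def crude_stem(token):
    """Toy suffix stripper, not a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Full text -> normalized tokens -> stop-word removal -> stems."""
    tokens = normalize(text)
    kept = [t for t in tokens if t not in STOPWORDS]
    return sorted({crude_stem(t) for t in kept})

terms = index_terms("Résumés of the coaches ranking college teams")
print(terms)
```

Each step shrinks the representation: accents and case disappear first, then the stop words, and finally inflected forms collapse into a single stem.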
11
Indexing
  • Indexing: assign identifiers to text items.
  • Assignment: manual vs. automatic indexing
  • Identifiers:
  • objective vs. nonobjective text identifiers;
    cataloging rules define the objective ones,
    e.g., author names, publisher names, dates of
    publication
  • controlled vs. uncontrolled vocabularies, e.g.,
    instruction manuals, terminological schedules
  • single terms vs. term phrases

12
The retrieval process
[Figure: the retrieval process. Labels: user interface, user need, text, text operations, logical view, query operations, indexing, DB manager module, user feedback, query, searching, index, text database, retrieved documents, ranking, ranked documents]
13
Information Retrieval
  • Generic information retrieval system: selects and
    returns to the user desired documents from a large
    set of documents in accordance with criteria
    specified by the user
  • Functions
  • document search: the selection of documents from
    an existing collection of documents
  • document routing: the dissemination of incoming
    documents to appropriate users on the basis of
    user interest profiles

14
Detection Need
  • Definition: a set of criteria specified by the
    user which describes the kind of information
    desired.
  • queries in document search task
  • profiles in routing task
  • forms
  • keywords
  • keywords with Boolean operators
  • free text
  • example documents
  • ...

15
search vs. routing
  • The search process matches a single Detection
    Need against the stored corpus to return a subset
    of documents.
  • Routing matches a single document against a group
    of Profiles to determine which users are
    interested in the document.
  • Profiles stand as long-term expressions of user
    needs.
  • Search queries are ad hoc in nature.
  • A generic detection architecture can be used for
    both the search and routing.

16
Search
  • retrieval of desired documents from an existing
    corpus
  • Retrospective search is frequently interactive.
  • Methods
  • indexing the corpus by keyword, stem and/or
    phrase
  • apply statistical and/or learning techniques to
    better understand the content of the corpus
  • analyze free text Detection Needs to compare with
    the indexed corpus or a single document
  • ...

17
Document Detection Search
18
Document Detection Search (Continued)
  • Document Corpus
  • the content of the corpus may be significant for
    the performance in some applications
  • Preprocessing of Document Corpus
  • stemming
  • a list of stop words
  • phrases, multi-term items
  • ...

19
Document Detection Search (Continued)
  • Building Index from Stems
  • key place for optimizing run-time performance
  • cost to build the index for a large corpus
  • Document Index
  • a list of terms, stems, phrases, etc.
  • frequency of terms in the document and corpus
  • frequency of the co-occurrence of terms within
    the corpus
  • index may be as large as the original document
    corpus
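
A minimal sketch of such a document index follows. It assumes the corpus has already been preprocessed into stems; the data layout (postings as nested dictionaries, co-occurrence counted per document) and the name `build_index` are choices made for this illustration, not something the slides prescribe.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_index(corpus):
    """corpus: dict mapping document id -> list of already-preprocessed terms."""
    postings = defaultdict(dict)   # term -> {doc_id: frequency in that document}
    corpus_freq = Counter()        # term -> frequency in the whole corpus
    cooccurrence = Counter()       # (term_a, term_b) -> number of documents containing both
    for doc_id, terms in corpus.items():
        counts = Counter(terms)
        for term, freq in counts.items():
            postings[term][doc_id] = freq
            corpus_freq[term] += freq
        # Pairs are stored in alphabetical order so each pair has one key.
        for pair in combinations(sorted(counts), 2):
            cooccurrence[pair] += 1
    return postings, corpus_freq, cooccurrence

# Invented three-document corpus.
corpus = {
    "d1": ["tennis", "team", "rank", "tennis"],
    "d2": ["tennis", "coach"],
    "d3": ["team", "coach", "rank"],
}
postings, corpus_freq, cooccurrence = build_index(corpus)
print(dict(postings["tennis"]), corpus_freq["tennis"], cooccurrence[("rank", "team")])
```

The co-occurrence table grows with the square of the number of distinct terms per document, which is one reason an index can approach the size of the corpus itself.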

20
Document Detection Search (Continued)
  • Detection Need
  • the user's criteria for a relevant document
  • Convert Detection Need to System Specific Query
  • first transformed into a detection query, and
    then into a retrieval query
  • detection query: specific to the retrieval
    engine, but independent of the corpus
  • retrieval query: specific to both the retrieval
    engine and the corpus

21
Document Detection Search (Continued)
  • Compare Query with Index
  • Resultant Rank Ordered List of Documents
  • Return the top N documents
  • Rank the list of relevant documents from the most
    relevant to the query to the least relevant
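
Producing the rank-ordered list and returning the top N documents can be sketched as below. The scoring formula (term frequency times the logarithm of inverse document frequency) is an assumption chosen for illustration; the slides do not specify how relevance is computed, and the function name `rank` is hypothetical.

```python
import math
from collections import Counter

def rank(query_terms, corpus, top_n=2):
    """Return the top_n (doc_id, score) pairs, most relevant first."""
    n_docs = len(corpus)
    # Number of documents in which each term appears.
    doc_freq = Counter(t for terms in corpus.values() for t in set(terms))
    scores = {}
    for doc_id, terms in corpus.items():
        tf = Counter(terms)
        score = sum(
            tf[t] * math.log(n_docs / doc_freq[t])
            for t in query_terms
            if t in tf
        )
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by document id.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]

# Invented four-document corpus of preprocessed terms.
corpus = {
    "d1": ["tennis", "team", "rank", "tennis"],
    "d2": ["tennis", "coach"],
    "d3": ["team", "coach", "rank"],
    "d4": ["football", "score"],
}
results = rank(["tennis", "rank"], corpus, top_n=2)
print(results)
```

Documents that share no term with the query never enter the list, and `top_n` cuts the list after ranking, not before.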

22
Routing
23
Routing (Continued)
  • Profile of Multiple Detection Needs
  • A Profile is a group of individual Detection
    Needs that describes a user's areas of interest.
  • All Profiles will be compared to each incoming
    document (via the Profile index).
  • If a document matches a Profile, the user is
    notified about the existence of a relevant
    document.

24
Routing (Continued)
  • Convert Detection Need to System Specific Query
  • Building Index from Queries
  • similar to building the corpus index for searching
  • the quantity of source data (Profiles) is usually
    much less than a document corpus
  • Profiles may have more specific, structured data
    in the form of SGML tagged fields

25
Routing (Continued)
  • Routing Profile Index
  • The index will be system specific and will make
    use of all the preprocessing techniques employed
    by a particular detection system.
  • Document to be routed
  • A stream of incoming documents is handled one at
    a time to determine where each should be
    directed.
  • Routing implementation may handle multiple
    document streams and multiple Profiles.

26
Routing (Continued)
  • Preprocessing of Document
  • A document is preprocessed in the same manner
    that a query would be set up in a search
  • The document and query roles are reversed
    compared with the search process
  • Compare Document with Index
  • Identify which Profiles are relevant to the
    document
  • Given a document, which of the indexed profiles
    match it?
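
The role reversal can be made concrete with a small sketch: the Profiles are indexed once, and each incoming document is then used as the "query" against that index. The matching rule (fraction of a Detection Need's terms found in the document, compared with a threshold), the function names, and the sample profiles are all assumptions for this illustration.

```python
def build_profile_index(profiles):
    """profiles: dict user -> list of Detection Needs, each a set of terms.
    Returns term -> set of (user, need_number) entries containing that term."""
    index = {}
    for user, needs in profiles.items():
        for number, need in enumerate(needs):
            for term in need:
                index.setdefault(term, set()).add((user, number))
    return index

def route(doc_terms, profiles, index, threshold=1.0):
    """Return the users with at least one Detection Need whose matched
    fraction of terms reaches the threshold."""
    hits = {}
    for term in set(doc_terms):
        for key in index.get(term, ()):
            hits[key] = hits.get(key, 0) + 1
    matched = set()
    for (user, number), count in hits.items():
        if count / len(profiles[user][number]) >= threshold:
            matched.add(user)
    return sorted(matched)

# Invented user profiles, each a list of Detection Needs.
profiles = {
    "alice": [{"tennis", "ncaa"}, {"chess"}],
    "bob": [{"football"}],
    "carol": [{"tennis", "coach"}],
}
index = build_profile_index(profiles)
recipients = route(["ncaa", "tennis", "rank", "team"], profiles, index)
print(recipients)
```

Only the Profile index is consulted for each document, so the cost per incoming document depends on the document's terms rather than on the total number of Profiles.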

27
Routing (Continued)
  • Resultant List of Profiles
  • The list of Profiles identifies which users should
    receive the document

28
Summary
  • Generate a representation of the meaning or
    content of each object based on its description.
  • Generate a representation of the meaning of the
    information need.
  • Compare these two representations to select those
    objects that are most likely to match the
    information need.

29
Basic Architecture of an Information Retrieval
System
[Figure: Documents → Document Representation; Queries → Query Representation; both representations feed Comparison]
30
Text retrieval system
[Figure labels: user, user query, query formulation, representation of a query, document collections, indexing, representation of documents, matching of similarity, results of retrieval, relevant results]
31
Research Issues
  • Given a set of descriptions for objects in the
    collection and a description of an information
    need, we must consider:
  • Issue 1
  • What makes a good document representation?
  • What are retrievable units and how are they
    organized?
  • How can a representation be generated from a
    description of the document?

32
Research Issues (Continued)
  • Issue 2: How can we represent the information need
    and how can we acquire this representation either
    from a description of the information need or
    through interaction with the user?
  • Issue 3: How can we compare representations to
    judge the likelihood that a document matches an
    information need?

33
Research Issues (Continued)
  • Issue 4: How can we evaluate the effectiveness of
    the retrieval process?

34
Text Data Mining Tasks
  • Information extraction -- facts, fill database
  • Summarization
  • Categorization
  • Clustering
  • Associations
  • Temporal analysis of document collection

35
Information Extraction: Beyond Document Retrieval
  • Question Answering
  • Q: Who is the author of the book "The Iron Lady:
    A Biography of Margaret Thatcher"? A: Hugo Young
  • Q: What was the monetary value of the Nobel Peace
    Prize in 1989? A: 469,000

36
Information Extraction
  • Generic Information Extraction System: an
    information extraction system is a cascade of
    transducers or modules that at each step add
    structure and often lose information, hopefully
    irrelevant, by applying rules that are acquired
    manually and/or automatically.
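
The "cascade of transducers" idea maps naturally onto function composition: each stage takes the previous stage's output, adds structure, and may discard information. The three stages below are simplified stand-ins for a text zoner, a preprocessor, and a filter; their names, the hand-written rules, and the keyword list are hypothetical and chosen only for illustration.

```python
import re
from functools import reduce

def zone(text):
    """Text zoner stand-in: split the text into paragraph-like segments."""
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]

def to_sentences(segments):
    """Preprocessor stand-in: split segments into sentences of word tokens."""
    sentences = []
    for seg in segments:
        for sent in re.split(r"(?<=[.!?])\s+", seg):
            tokens = re.findall(r"\w+", sent)
            if tokens:
                sentences.append(tokens)
    return sentences

def keep_relevant(sentences, keywords=frozenset({"launch", "launched"})):
    """Filter stand-in: drop sentences that mention none of the keywords."""
    return [s for s in sentences if keywords & {t.lower() for t in s}]

def cascade(data, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda value, stage: stage(value), stages, data)

text = (
    "The rocket was launched on Thursday. Weather was clear.\n\n"
    "Officials declined to comment."
)
result = cascade(text, [zone, to_sentences, keep_relevant])
print(result)
```

Here the first two stages add structure (segments, then tokenized sentences), while the last one loses information by dropping the sentences judged irrelevant.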

37
Information Extraction (Continued)
  • What are the transducers or modules?
  • What are their input and output?
  • What structure is added?
  • What information is lost?
  • What is the form of the rules?
  • How are the rules applied?
  • How are the rules acquired?

38
Example Parser
  • transducer: parser
  • input: the sequence of words or lexical items
  • output: a parse tree
  • information added: predicate-argument and
    modification relations
  • information lost: none
  • rule form: unification grammars
  • application method: chart parser
  • acquisition method: manual

39
Modules
  • Text Zoner: turn a text into a set of text
    segments
  • Preprocessor: turn a text or text segment into a
    sequence of sentences, each of which is a
    sequence of lexical items, where a lexical item
    is a word together with its lexical attributes
  • Filter: turn a set of sentences into a smaller set
    of sentences by filtering out the irrelevant ones
  • Preparser: take a sequence of lexical items and
    try to identify various reliably determinable,
    small-scale structures

40
Modules (Continued)
  • Parser: input a sequence of lexical items and
    perhaps small-scale structures (phrases) and
    output a set of parse tree fragments, possibly
    complete
  • Fragment Combiner: turn a set of parse tree or
    logical form fragments into a parse tree or
    logical form for the whole sentence
  • Semantic Interpreter: generate a semantic
    structure or logical form from a parse tree or
    from parse tree fragments

41
Modules (Continued)
  • Lexical Disambiguation: turn a semantic structure
    with general or ambiguous predicates into a
    semantic structure with specific, unambiguous
    predicates
  • Co-reference Resolution, or Discourse
    Processing: turn a tree-like structure into a
    network-like structure by identifying different
    descriptions of the same entity in different
    parts of the text
  • Template Generator: derive the templates from the
    semantic structures

42
<DOC>
<DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID>
<DATASET> Air Vehicle Launch </DATASET>
<DD> 1997/06/12 </DD>
<DOCTYPE> [Chinese text lost in extraction] </DOCTYPE>
<DOCSRC> [Chinese text lost in extraction] </DOCSRC>
<TEXT>
[Chinese news text lost in extraction]
43
[Chinese news text, continued; lost in extraction]

44
[Chinese news text, continued; lost in extraction]
</TEXT>
</DOC>
45
[Co-reference annotation of the sample text; the annotated Chinese phrases were lost in extraction]
<ID="3">… <ID="4" REF="3">… <ID="5" REF="3">…
<ID="63">… <ID="66" REF="63">…
<ID="65" REF="63">… <ID="70" REF="65">… <ID="69" REF="65">… <ID="64" REF="63">…
46
The Advanced Research and Development Activity
(ARDA)
  • established as a joint activity of the
    Intelligence Community (IC) and the Department of
    Defense (DOD) in late November 1998
  • the intelligence community's (IC) center for
    conducting advanced research and development
    related to extracting intelligence from, and
    providing security for, information stored,
    transmitted, or manipulated by electronic means

47
(No Transcript)
48
ARDA R&D Programs
  • Information Exploitation
  • Pulling Information
  • Pushing Information
  • Visualizing and Navigating Information
  • Quantum Information Science Photonics
  • Digital Network Intelligence

49
Pulling Information
  • Providing answers to complex, multifaceted
    questions that analysts pose
  • The analyst seeks to "pull" the answer out of
    multiple, very large, heterogeneous data sources
    that may physically reside in diverse locations

50
Pulling Information (Continued)
  • Accepting complex questions in a form natural to
    the analyst.
  • Questions may include judgment terms and an
    acceptable answer may need to be based upon
    conclusions and decisions reached by the system
    and may require the summarization, fusion, and
    synthesis of information drawn from multiple
    sources.
  • Translating analytic questions into multiple
    queries appropriate to the various data sets to
    be searched.
  • Finding relevant information in distributed,
    multimedia, multilingual, multi-agency data sets.
  • Analyzing, fusing and summarizing information
    into a coherent answer.
  • Providing the answer to the analyst in the form
    that he/she wants

51
Pushing Information
  • Providing information from multiple, very large,
    heterogeneous data sources that analysts have not
    asked for
  • The system discovers information in some
    profiling, clustering, pattern recognition, data
    mining, or other fashion and "pushes" this
    information to analysts that the system
    determines might have an interest.

52
Pushing Information (Continued)
  • Profiling and blind clustering of new data.
  • Detecting anomalies, patterns and changes in
    large volumes of data.
  • Analyzing the nature and description of the
    anomalies, patterns, and changes.
  • Alerting the appropriate analyst(s) of the newly
    discovered information.

53
Topics
  • Introduction to Information Retrieval and
    Extraction
  • Modeling
  • Retrieval Evaluation
  • Query Languages
  • Query Operations
  • Text and Multimedia Languages and Properties
  • Text Operations
  • Indexing and Searching

54
Topics (Continued)
  • User Interfaces and Visualization
  • Multimedia IR Models and Languages
  • Multimedia IR Indexing and Searching
  • Searching the Web
  • Digital Libraries
  • Information Extraction (Jerry R. Hobbs)
  • Text Data Mining (Marti Hearst)

55
[Figure: map of topics. Labels: Text IR, Human-Computer Interaction for IR, Multimedia IR, Applications for IR, Retrieval Models and Evaluation, Improvements on Retrieval, Efficient Processing, Interfaces & Visualization, Multimedia Modeling & Searching, Bibliographic Systems, The Web, Digital Libraries]
56
Information Sources
  • Books
  • Ricardo Baeza-Yates and Berthier Ribeiro-Neto
    (1999) Modern Information Retrieval.
    Addison-Wesley.
  • Salton, G. (1989) Automatic Text Processing: The
    Transformation, Analysis, and Retrieval of
    Information by Computer. Reading, MA:
    Addison-Wesley.
  • Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992)
    Information Retrieval: Data Structures and
    Algorithms. Englewood Cliffs, NJ: Prentice Hall.
  • Cheong, F. (1996) Internet Agents: Spiders,
    Wanderers, Brokers, and Bots. Indianapolis, IN:
    New Riders.
  • Karen Sparck Jones and Peter Willett (1997)
    Readings in Information Retrieval. CA: Morgan
    Kaufmann Publishers.

57
Information Sources
  • Conference Proceedings
  • ACM SIGIR Annual International Conference on
    Research and Development in Information Retrieval
    (1978-)
  • ACM International Conference on Digital Libraries
  • ACM Conference on Information and Knowledge
    Management
  • Text Retrieval Conference

58
Information Sources (Continued)
  • Journals
  • ACM Transactions on Information Systems
  • Information Processing and Management (formerly
    Information Storage and Retrieval)
  • Journal of the American Society for Information
    Science (formerly American Documentation)
  • Journal of Documentation
  • Information Systems
  • Information Retrieval
  • Knowledge and Information Systems