Information Extraction: Beyond Document Retrieval - PowerPoint PPT Presentation

Provided by: hsinhs
1
Information Extraction: Beyond Document Retrieval
  • Robert Gaizauskas and Yorick Wilks
  • Computational Linguistics and Chinese Language
    Processing
  • vol. 3, no. 2, 1998, pp. 17-60
  • Journal of Documentation, Vol 54, No. 1, 1998,
    pp. 70-105.

2
IE and IR
  • IE
  • extracting pre-specified sorts of information
    from short, natural language texts
  • example
  • from business newswire texts on retirements,
    appointments and promotions,
  • extract the names of the participating companies
    and individuals, the post involved, the vacancy
    reason, and so on

3
IE and IR (Continued)
  • Populating a structured information source (or
    database) from an unstructured, or free text,
    information source
  • the structured database is used
  • for searching or analysis using conventional
    database queries or data-mining techniques
  • for generating a summary
  • for constructing indices into the source texts
  • ...

4
IE and IR (Continued)
  • IR
  • Given a user query, selects a relevant subset of
    documents from a larger set.
  • The user then browses the selected documents in
    order to fulfil his or her information need.
  • Differences
  • IR retrieves relevant documents from collections
  • IE extracts relevant information from documents

5
Combining IR and IE
(a) an IR query
    chief executive officer had president chairman post succeed name
(b) a retrieved text
    <DOC>
    <DOCNO> 940413-0062. </DOCNO>
    <HL> Who's News @ Burns Fry Ltd. </HL>
    <DD> 04/13/94 </DD>
    <SO> WALL STREET JOURNAL (J), PAGE B10 </SO>
    <TXT>
    <p>
    BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, was named
    executive vice president and director of fixed income at this
    brokerage firm. Mr. Wright resigned as president of Merrill Lynch
    Canada Inc., a unit of Merrill Lynch & Co., to succeed Mark Kassirer,
    48, who left Burns Fry last month. A Merrill Lynch spokeswoman said
    it has named a successor to Mr. Wright, who is expected to begin his
    new position by the end of the month.
    </p>
    </TXT>
    </DOC>
6
(c) an empty template
    <TEMPLATE>
        DOC_NR
        CONTENT
    <SUCCESSION_EVENT>
        SUCCESSION_ORG
        POST
        IN_AND_OUT
        VACANCY_REASON
    <IN_AND_OUT>
        IO_PERSON
        NEW_STATUS
        ON_THE_JOB
        OTHER_ORG
        REL_OTHER_ORG
    <ORGANIZATION>
        ORG_NAME
        ORG_ALIAS
        ORG_DESCRIPTOR
7
        ORG_TYPE
        ORG_LOCALE
        ORG_COUNTRY
    <PERSON>
        PER_NAME
        PER_ALIAS
        PER_TITLE
(d) a fragment of the filled template
    <TEMPLATE-9404130062-1>
        DOC_NR           9404130062
        CONTENT          <SUCCESSION_EVENT-9404130062-1>
    <SUCCESSION_EVENT-9404130062-1>
        SUCCESSION_ORG   <ORGANIZATION-9404130062-1>
        POST             executive vice president
        IN_AND_OUT       <IN_AND_OUT-9404130062-1>
                         <IN_AND_OUT-9404130062-2>
        VACANCY_REASON   OTH_UNK
8
    <IN_AND_OUT-9404130062-2>
        IO_PERSON        <PERSON-9404130062-1>
        NEW_STATUS       IN
        ON_THE_JOB       NO
        OTHER_ORG        <ORGANIZATION-9404130062-2>
        REL_OTHER_ORG    OUTSIDE_ORG
    <ORGANIZATION-9404130062-1>
        ORG_NAME         Burns Fry Ltd.
        ORG_ALIAS        Burns Fry
        ORG_DESCRIPTOR   this brokerage firm
        ORG_TYPE         COMPANY
        ORG_LOCALE       Toronto CITY
        ORG_COUNTRY      Canada
    <ORGANIZATION-9404130062-2>
        ORG_NAME         Merrill Lynch
        ORG_ALIAS        Merrill Lynch
        ORG_DESCRIPTOR   a unit of Merrill Lynch & Co.
        ORG_TYPE         COMPANY
9
    <PERSON-9404130062-1>
        PER_NAME         Donald Wright
        PER_ALIAS        Wright
        PER_TITLE        Mr.
    <PERSON-9404130062-2>
        PER_NAME         Mark Kassirer
(e) a summary generated from the filled template
    BURNS FRY Ltd. named Donald Wright as executive vice president.
    Donald Wright resigned as president of Merrill Lynch Canada Inc.
    Mark Kassirer left as president of BURNS FRY Ltd.
10
History of Information Extraction
  • Early work on template filling
  • work carried out or under way before the DARPA
    programme
  • work carried out in response to the DARPA MUC
    programme
  • recent work on IE outside the DARPA programme

11
Early Work on Template Filling
  • The Linguistic String Project at New York
    University
  • Derive information formats (regularised
    table-like forms) from the profusion of natural
    language forms
  • Permit fact retrieval (as opposed to document
    retrieval) on such a database

12
Early Work on Template Filling (Continued)
  • the information formats are not predefined a
    priori by experts in the field
  • the information formats are induced by using
    distributional analysis to discover word classes
    in a set of texts of a sub-language

13
Early Work on Template Filling (Continued)
  • Language understanding research at Yale
    University by Roger Schank
  • stories followed certain stereotypical patterns
    called scripts
  • knowing the script, language comprehenders are
    able to fill in details and make inferential
    leaps where the information required to make the
    leap is not present in the text
  • first attempt using this approach: FRUMP (Gerald
    De Jong)

14
Message Understanding Conferences
  • MUC-1 (May 1987, San Diego)
  • six systems participated
  • tactical naval operations reports on ship
    sightings and engagements
  • 12 training reports, 2 unseen messages
  • MUC-2 (May 1989, San Diego)
  • eight systems participated
  • the same domain as MUC-1
  • 105 training messages, 20 blind messages (1st
    run), 5 blind messages (2nd run)
  • a template and fill rules for the slots

15
Message Understanding Conferences (Continued)
  • MUC-3 (May 1991, San Diego)
  • fifteen systems participated
  • newswire stories about terrorist attacks in nine
    Latin American countries
  • 1,300 development texts, three blind test sets
    of 100 texts
  • a template consisting of 18 slots
  • formal evaluation criteria (precision and recall)
  • semi-automated scoring program available

16
Message Understanding Conferences (Continued)
  • MUC-4 (June 1992, McLean, Virginia)
  • seventeen sites participated
  • domain and template structures unchanged
  • changes to the task definitions, corpus, measures
    of performance, and test protocols

17
Message Understanding Conferences (Continued)
  • MUC-5 (August 1993, Baltimore, Maryland)
  • 17 systems participated (14 American, 1 British,
    1 Canadian, 1 Japanese)
  • financial newswire stories and microelectronics
    products announcements
  • English and Japanese
  • development and test corpora increased
  • new evaluation metrics and scoring programs

18
Message Understanding Conferences (Continued)
  • MUC-6 (Nov 1995, Columbia, Maryland)
  • 17 sites took part
  • named entity recognition, coreference
    identification, template element and scenario
    template extraction tasks
  • management succession events in financial news
    stories

19
Task complexity measures
  • text corpus complexity (vocabulary size, average
    sentence length)
  • text corpus dimensions (volume of texts, total
    number of sentences/words)
  • template characteristics (number of object types,
    number of slots)
  • difficulty of tasks (hard to measure; approximated
    by the number of pages of relevance rules and
    template fill definitions)

20
Evaluation Metrics
  • Recall
  • a measure of the fraction of the required
    information that has been correctly extracted
  • Precision
  • a measure of the fraction of the extracted
    information that is correct
  • Beyond Precision and Recall
  • correct, partially correct, incorrect, missing,
    spurious, non-committal
  • overgeneration
  • fraction of extracted information that is
    spurious
  • undergeneration
  • fraction of the information that should have been
    extracted but is missing
  • substitution
  • fraction of the nonspurious extracted information
    that is not correct
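The measures above can be written as simple ratios over the per-slot response counts. The sketch below follows the MUC scoring conventions as commonly described, with a partially correct fill earning half credit; that weighting, the function name, and the handling of empty denominators are assumptions rather than something stated on the slide, and non-committal responses are simply left out.

```python
def muc_scores(correct, partial, incorrect, missing, spurious):
    """Compute MUC-style measures from response category counts."""
    # What the answer key expected, and what the system actually produced.
    possible = correct + partial + incorrect + missing
    actual = correct + partial + incorrect + spurious
    credit = correct + 0.5 * partial  # assumed half credit for partial fills

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "recall": ratio(credit, possible),
        "precision": ratio(credit, actual),
        "overgeneration": ratio(spurious, actual),
        "undergeneration": ratio(missing, possible),
        "substitution": ratio(
            incorrect + 0.5 * partial, correct + partial + incorrect
        ),
    }


scores = muc_scores(correct=6, partial=2, incorrect=1, missing=1, spurious=2)
for name, value in scores.items():
    print(f"{name:16s}{value:.3f}")
```

Recall and undergeneration share the same denominator (what should have been extracted), while precision and overgeneration share the other (what was extracted), which is why a system can trade one pair against the other.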

21
MUC-5
  • Tasks
  • two domains: joint ventures and microelectronics
  • two languages: Japanese and English
  • acronyms: EJV, JJV, EME, JME
  • Resources
  • EJV materials: Wall Street Journal, Lexis/Nexis,
    Prompt
  • gazetteer of place names, list of corporate names
    and nationalities, list of corporate designators,
    list of countries, list of nationalities, list of
    international organizations, definitions of
    standard industry codes, list of currency
    names/nationalities, list of female forenames,
    list of male forenames, CIA world fact book.

22
MUC-6
  • Tasks
  • named entity recognition
  • recognition and classification of definite named
    entities such as organizations, persons,
    locations, dates and monetary amounts
  • <enamex type="organization">Bridgestone Sports
    Co.</enamex> said <timex type="date">Friday</timex>
    it has set up a joint venture in <enamex
    type="location">Taiwan</enamex> with a local
    concern and a Japanese trading house to produce
    golf clubs to be shipped to <pnamex>Japan</pnamex>
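Output marked up in this way is easy to post-process. The following sketch pulls (type, text) pairs out of a marked-up sentence with a regular expression; the quoted attribute syntax and the helper names are assumptions for illustration, and tags without a type attribute (such as the pnamex element above) are not matched by this pattern.

```python
import re

# Matches <enamex type="...">...</enamex> and the analogous timex/numex tags.
ENTITY_RE = re.compile(
    r'<(enamex|timex|numex)\s+type="([^"]+)">(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)


def extract_entities(marked_up):
    """Return (entity type, entity text) pairs in document order."""
    return [(m.group(2), m.group(3).strip()) for m in ENTITY_RE.finditer(marked_up)]


def strip_markup(marked_up):
    """Remove all SGML-style tags, leaving the plain sentence."""
    return re.sub(r"</?\w+[^>]*>", "", marked_up)


SAMPLE = (
    '<enamex type="organization">Bridgestone Sports Co.</enamex> said '
    '<timex type="date">Friday</timex> it has set up a joint venture in '
    '<enamex type="location">Taiwan</enamex> with a local concern'
)

for entity_type, text in extract_entities(SAMPLE):
    print(entity_type, "->", text)
```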

23
MUC-6 (Continued)
  • coreference resolution
  • identification of expressions in the text that
    referred to the same object, set or activity
  • <coref id="100">Galactic Enterprises</coref>
    said <coref id="101" type="ident" ref="100">
    it</coref> would build a new space station before
    the year 2016
  • template element filling
  • scenario template filling
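In the coreference markup, each later mention points back to an earlier one through its ref attribute, so the mentions of one entity form a chain. A minimal sketch of recovering those chains is shown below; the quoted attribute syntax, the attribute order, and the function name are assumptions, and the code assumes every ref points to an earlier mention (no cycles).

```python
import re
from collections import defaultdict

COREF_RE = re.compile(
    r'<coref\s+id="(\d+)"(?:\s+type="\w+")?(?:\s+ref="(\d+)")?\s*>(.*?)</coref>',
    re.DOTALL,
)


def coreference_chains(marked_up):
    """Group mention texts into chains by following ref links to their root."""
    parent = {}   # mention id -> id it refers to (None for a first mention)
    mention = {}  # mention id -> mention text
    for m in COREF_RE.finditer(marked_up):
        mention_id, ref, text = m.group(1), m.group(2), m.group(3).strip()
        parent[mention_id] = ref
        mention[mention_id] = text

    def root(mention_id):
        while parent.get(mention_id) is not None:
            mention_id = parent[mention_id]
        return mention_id

    chains = defaultdict(list)
    for mention_id, text in mention.items():
        chains[root(mention_id)].append(text)
    return list(chains.values())


SAMPLE = (
    '<coref id="100">Galactic Enterprises</coref> said '
    '<coref id="101" type="ident" ref="100">it</coref> would build a new '
    'space station before the year 2016'
)
print(coreference_chains(SAMPLE))
```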

24
The Generic IE System
  • text zoner
  • divide the input text into a set of segments
  • preprocessor
  • convert a text segment into a sequence of
    sentences, where each sentence is a sequence of
    lexical items, with associated lexical attributes
    (e.g., part-of-speech)
  • filter
  • eliminate some of the sentences from the previous
    stage by filtering out irrelevant ones
  • preparser
  • detect reliable small-scale structures in
    sequences of lexical items (e.g., noun groups,
    verb groups, etc.)

25
The Generic IE System
  • fragment combiner
  • turn a set of parse tree or logical form
    fragments into a parse tree or logical form for
    the whole sentence
  • semantic interpreter
  • generate a semantic structure, meaning
    representation or logical form from a parse tree
    or parse tree fragments
  • lexical disambiguation
  • disambiguate any ambiguous predicates in the
    logical form
  • coreference resolution or discourse processing
  • build a connected representation of the text by
    linking different descriptions of the same entity
    in different parts of the text
  • template generator
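The generic architecture is a chain of stages, each consuming the output of the previous one. The sketch below wires up only the first three stages (text zoner, preprocessor, filter) in a deliberately naive form; the function names, the blank-line zoning, the punctuation-based sentence split, and the keyword filter are all assumptions chosen to keep the example short, not a description of any particular MUC system.

```python
import re


def zone(text):
    """Text zoner: split the input into segments (blank-line-separated here)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]


def preprocess(segment):
    """Preprocessor: naive sentence split, then whitespace tokenisation."""
    sentences = re.split(r"(?<=[.!?])\s+", segment)
    return [sentence.split() for sentence in sentences if sentence]


def keep_relevant(sentences, keywords):
    """Filter: keep sentences that contain at least one trigger keyword."""
    return [
        tokens
        for tokens in sentences
        if any(tok.lower().strip(".,") in keywords for tok in tokens)
    ]


def run_pipeline(text, keywords):
    kept = []
    for segment in zone(text):
        kept.extend(keep_relevant(preprocess(segment), keywords))
    return kept


TEXT = (
    "Donald Wright was named director. The weather was fine.\n\n"
    "Mark Kassirer resigned last month."
)
for tokens in run_pipeline(TEXT, {"named", "resigned"}):
    print(" ".join(tokens))
```

Later stages (preparser, fragment combiner, semantic interpreter and so on) would be added to the chain in the same way, each taking the previous stage's output as input.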

26
LaSIE: A Case Study
  • Lexical Processing
  • Tokenisation
  • text segmentation: distinguish the document
    header and segment the text into paragraphs
  • tokenisation: identify which sequences of
    characters will be treated as individual tokens
  • Sentence splitting
  • determine sentence boundaries in the text
  • full stops alone are not sufficient guides, e.g.,
    "Allan J. Smith", "Mr."
  • Part-of-speech tagging
  • process one sentence at a time, and associate
    with each token one of the 48 part-of-speech tags
    of the University of Pennsylvania (Penn Treebank)
    tagset
  • Morphological analysis
  • determine root forms of nouns and verbs
  • Gazetteer lookup
  • employ 5 gazetteers (lists of names) to
    facilitate the process of recognizing and
    classifying named entities
  • organization names, location names, personal
    given names, company designators, and personal
    titles
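The sentence-splitting point is worth a concrete illustration: a full stop after an initial or a known abbreviation should not end the sentence. The heuristic below is a sketch, not LaSIE's actual splitter; the abbreviation list, the capitalisation test, and the function name are assumptions, and it will miss a boundary when a sentence genuinely ends in an abbreviation.

```python
import re

# Assumed, deliberately tiny abbreviation list; a real system would use more.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "inc.", "ltd.", "co."}


def split_sentences(text):
    """Split on ., ! or ? unless the token is an abbreviation or an initial."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if not tok.endswith((".", "!", "?")):
            continue
        is_abbreviation = tok.lower() in ABBREVIATIONS
        is_initial = re.fullmatch(r"[A-Z]\.", tok) is not None
        # A boundary also needs end of text or a capitalised next token.
        next_starts_sentence = i + 1 == len(tokens) or tokens[i + 1][0].isupper()
        if not is_abbreviation and not is_initial and next_starts_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


TEXT = "Mr. Allan J. Smith resigned. He left Burns Fry Ltd. last month."
for sentence in split_sentences(TEXT):
    print(sentence)
```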

27
LaSIE Parsing
  • Parsing with a special named entity grammar
  • recognize multi-word structures which identify
    organizations, persons, locations, dates, and
    monetary amounts
  • ORGAN_NP --> ORGAN_NP LOC_NP CDG
    e.g., Merrill Lynch Canada Inc.
  • PERSON_NP --> FIRST_NAME NNP
    e.g., Donald Wright
  • organization(e17), name(e17, Burns Fry Ltd.)
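The two grammar rules above can be tried out with a very small bottom-up rewriter over category sequences. This is only a sketch of how such rules combine items, not a reconstruction of LaSIE's parser; the input categories are assumed to come from the gazetteer lookup and part-of-speech tagging stages, and the function name is hypothetical.

```python
# (left-hand side, right-hand side) pairs taken from the slide.
RULES = [
    ("ORGAN_NP", ("ORGAN_NP", "LOC_NP", "CDG")),
    ("PERSON_NP", ("FIRST_NAME", "NNP")),
]


def apply_rules(items):
    """Rewrite (category, text) items until no rule matches any more."""
    items = list(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            width = len(rhs)
            for i in range(len(items) - width + 1):
                window = items[i:i + width]
                if tuple(category for category, _ in window) == rhs:
                    text = " ".join(t for _, t in window)
                    # Replace the matched span with a single larger item.
                    items[i:i + width] = [(lhs, text)]
                    changed = True
                    break
            if changed:
                break
    return items


organization = [("ORGAN_NP", "Merrill Lynch"), ("LOC_NP", "Canada"), ("CDG", "Inc.")]
person = [("FIRST_NAME", "Donald"), ("NNP", "Wright")]
print(apply_rules(organization))
print(apply_rules(person))
```

Every rewrite shortens the item list, so the loop always terminates for rules whose right-hand side has at least two symbols.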

28
LaSIE Parsing (Continued)
  • Parsing with a more general phrasal grammar
  • recognize noun phrases, verb phrases,
    prepositional phrases, adjective phrases,
    sentences, and relative clauses
  • [NP Donald Wright], [ADJP 46 years old], [VP [VP
    was named] [NP executive vice president and
    director of fixed income] [PP at this brokerage
    firm]]
  • person(e21), name(e21, Donald Wright)
    name(e22), lobj2(e22, e23)
    title(e23, executive vice president)
    firm(e24), det(e24, this)

29
LaSIE Parsing (Continued)
  • Select a best parse from the set of partial,
    fragmentary, and possibly overlapping phrasal
    analyses
  • choose that sequence of non-overlapping phrases
    of semantically interpretable categories
    (sentence, noun phrase, verb phrase and
    prepositional phrase) which covers the most words
    and consists of the fewest phrases
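The selection criterion (cover the most words, and among equal coverage use the fewest phrases) can be implemented with a short dynamic program over word positions. The sketch below assumes phrases are given as (start, end, category) spans with an exclusive end index and already restricted to the semantically interpretable categories; the representation and the function name are assumptions, not LaSIE's internals.

```python
def best_parse(phrases, n_words):
    """Pick non-overlapping phrases covering most words, then fewest phrases."""
    by_end = {}
    for phrase in phrases:
        by_end.setdefault(phrase[1], []).append(phrase)

    # best[i] = (words covered, -phrase count, chosen phrases) for words[0:i]
    best = [(0, 0, [])]
    for i in range(1, n_words + 1):
        candidate = best[i - 1]  # option: leave word i-1 uncovered
        for start, end, category in by_end.get(i, []):
            covered, neg_count, chosen = best[start]
            option = (
                covered + (end - start),
                neg_count - 1,
                chosen + [(start, end, category)],
            )
            # Compare on (coverage, -count) only, not on the phrase lists.
            if option[:2] > candidate[:2]:
                candidate = option
        best.append(candidate)
    return best[n_words][2]


WORDS = "Donald Wright was named executive vice president".split()
PHRASES = [
    (0, 1, "NP"),  # Donald
    (0, 2, "NP"),  # Donald Wright
    (2, 4, "VP"),  # was named
    (4, 7, "NP"),  # executive vice president
    (2, 7, "VP"),  # was named executive vice president
]
for start, end, category in best_parse(PHRASES, len(WORDS)):
    print(category, " ".join(WORDS[start:end]))
```

Here both the two-phrase and the three-phrase analyses cover all seven words, so the tie is broken in favour of the analysis with fewer phrases.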

30
LaSIE Discourse Processing
32
Application Areas of Information Extraction
  • Finance
  • categorize newswire stories of relevance to stock
    traders
  • Military Intelligence
  • Medicine
  • help classification of patient records and
    discharge summaries to assist in public health
    research and in medical treatment auditing
  • Law
  • support intelligent retrieval from legal texts
  • Police
  • extract information about road traffic incidents
    from police incident logs
  • Technology/product tracking
  • track commodity price changes and factors
    affecting changes in the relevant newsfeeds

33
Application Areas of Information Extraction (Continued)
  • Fault Diagnosis
  • extract information from reports of car faults
  • Software system requirements specification
  • NLP techniques used to assist in the process of
    deriving formal software specifications from less
    formal, natural language specifications
  • the formal specification is viewed as a template
    which needs to be filled from a natural language
    specification, supplemented with a dialogue with
    the user
  • Academic research
  • Academic journals and publications are
    increasingly becoming available on-line and offer
    a prime source of material for IE technology

34
Challenges for the future
  • Higher precision and recall
  • User-defined IE
  • let users define the extraction task, with the
    system then adapting to the new scenario
  • Integration with other technologies
  • information retrieval
  • natural language generation
  • machine translation
  • data mining