1
Corpus Linguistics vs. Ontologies
  • Martin Volk
  • Computational Linguistics
  • Stockholm University
  • volk_at_ling.su.se

2
Summing up
  • Corpus Linguistics, Ontologies and their
    neighbors?
  • and then a look into the future.

3
Corpus Linguistics
  • What kind of automatic Corpus Annotation is
    possible?
      • Part-of-Speech tagging
      • Lemmatisation
      • Named Entity Recognition
      • Phrase chunking
  • What is the advantage of manual corpus
    annotation?
      • Better accuracy
      • More annotation depth

4
Rule-based vs. Statistical Methods
  • Rule-based:
      • Patterns for recognising and classifying proper
        names
      • Patterns for chunking Part-of-Speech sequences
        into phrases
      • Morphological analysis
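
The idea of chunking Part-of-Speech sequences with patterns can be sketched as follows. This is only an illustration, not a method from the lecture: the toy tagset (DT, JJ, NN, VB, IN), the noun-phrase pattern and the function name chunk_noun_phrases are all assumptions.

```python
import re

# Assumed toy tagset: DT = determiner, JJ = adjective, NN = noun,
# VB = verb, IN = preposition. Real tagsets are much larger.
# Pattern: optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = re.compile(r"\b(?:DT )?(?:JJ )*(?:NN )+")


def chunk_noun_phrases(tagged):
    """Return noun-phrase chunks found by matching a pattern over the tags."""
    # Encode the tag sequence as one string, each tag followed by a space.
    tag_string = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Map character offsets back to token indices by counting spaces.
        start = tag_string[:match.start()].count(" ")
        end = tag_string[:match.end()].count(" ")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


sentence = [("the", "DT"), ("company", "NN"), ("bought", "VB"),
            ("a", "DT"), ("new", "JJ"), ("wheel-loader", "NN"),
            ("from", "IN"), ("Dresser", "NN")]
chunks = chunk_noun_phrases(sentence)
```

The pattern works on the tag sequence only, so the chunker finds phrases without knowing anything about the words themselves.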

5
Statistical Methods
  • Vector comparisons for:
      • matching queries against documents
      • finding near-synonyms (building a similarity
        thesaurus)
      • word sense disambiguation
  • Cooccurrence measures for:
      • collocations (e.g. support verb units)
      • PP-attachments
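
A vector comparison for matching a query against documents might look like the sketch below. It assumes plain word-count vectors and cosine similarity, which is one common choice and not necessarily the one used in the course; the function names and the three example documents are invented.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a word-count vector (whitespace tokenisation)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, document) pairs, best match first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


documents = [
    "part of speech tagging of corpora",
    "ontologies represent world knowledge",
    "named entity recognition in corpora",
]
ranking = rank("speech tagging", documents)
```

The same comparison, applied to vectors of context words instead of documents, is one way to look for near-synonyms.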

6
Corpus Linguistics vs. Ontologies
  • Corpus Linguistics: language perspective
  • Ontologies: knowledge perspective
7
Ontologies
  • Ontologies are modern means of representing world
    knowledge.
  • Previous means:
      • Expert Systems (in the 1980s)
      • Thesauri
      • Topic Maps
  • How can we represent what we know about a domain?
    About the world?

8
Ontologies
  • Usage of Ontologies in IR systems:
      • as a monolingual dictionary
      • as a bilingual dictionary
      • seldom for reasoning

9
Knowledge Management
  • First we need to define:
      • Data
      • Information
      • Knowledge

10
Knowledge Management
  • is about mapping and tracking key knowledge
    domains of each employee.
  • is about how to best create and support
    knowledge-creating and sharing behaviors.

11
What is Text Mining? by Marti Hearst
http://www.sims.berkeley.edu/hearst/text-mining.html
  • Text Mining is the discovery by computer of new,
    previously unknown information, by automatically
    extracting information from different written
    resources.
  • A key element is the linking together of the
    extracted information to form new facts or new
    hypotheses to be explored further.

12
What is Text Mining? by Marti Hearst
  • Text mining is different from web search.
  • In web search, the user is typically looking for
    something that is already known and has been
    written by someone else. The problem is pushing
    aside all the irrelevant material in order to
    find the relevant information.
  • In text mining, the goal is to discover
    heretofore unknown information, something that no
    one yet knows and so could not have yet written
    down.

13
What is Text Mining? by Marti Hearst
  • One promising application area for text mining is
    in the biosciences.
  • The best known example is Don Swanson's work on
    hypothesizing causes of rare diseases by looking
    for indirect links in different subsets of the
    bioscience literature.
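
Looking for indirect links can be sketched as follows: if term A co-occurs with B in some texts and B co-occurs with C in others, but A and C never occur together, then A and C form a candidate hypothesis. The code is a minimal illustration under that assumption. The term sets are invented toy data that loosely echo the fish oil and Raynaud's syndrome case usually associated with Swanson's work, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrences(documents):
    """Map each term to the set of terms it shares a document with."""
    links = defaultdict(set)
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def indirect_links(start, documents):
    """Find terms linked to `start` only via a shared bridging term."""
    links = cooccurrences(documents)
    direct = links.get(start, set())
    candidates = defaultdict(set)
    for bridge in direct:
        for target in links[bridge]:
            # Keep only targets never seen together with the start term.
            if target != start and target not in direct:
                candidates[target].add(bridge)
    return dict(candidates)


# Invented toy "literature": each set stands for the terms of one article.
literature = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "raynaud syndrome"},
    {"fish oil", "platelet aggregation"},
    {"platelet aggregation", "raynaud syndrome"},
]
hypotheses = indirect_links("fish oil", literature)
```

Each candidate comes with the bridging terms that support it, which is what a human expert would then have to examine further.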

14
Semantic Web
  • is about adding meta-data to web pages so that
    computers can access the meaning of a web
    document.

15
Is the information flood a problem?
  • Common wisdom: The vast amount of information on
    the internet makes it ever harder to find the
    relevant information.
  • But from an Information Retrieval perspective:
      • The more information the better!
      • The more information there is, the higher the
        chance of finding it phrased in the expected
        manner.

16
  • What will future corpus annotation look like?

17
Future corpus annotation
  • Ever larger automatically annotated corpora.
  • Manually annotated corpora:
      • grow in depth (additional annotation)
      • grow in breadth (additional data)
      • grow across languages

18
Future corpus annotation
  • Intersentential annotation
      • annotation across sentences
      • e.g. co-reference annotation
      • e.g. text and discourse structure
  • Semantic annotation
      • name classes
      • local, temporal, modal, causal units
      • word senses relative to a thesaurus
      • roles within a sentence (FrameNet)
      • (cp. http://www.icsi.berkeley.edu/framenet/ )

19
Future corpus annotation
  • Parallel treebanks
      • semi-manual annotation of translated texts
      • alignment on word and constituent level
      • the parallel text may serve as disambiguator

20
(No Transcript)
21
Future corpus annotation
  • Better parsers for corpus annotation
      • integration of shallow parsing and deep parsing
  • Long term:
      • automatic transcription of spoken language data
      • automatic search through audio and video data

22
Predicate-Argument Structures
  • Penn PropBank (Proposition Bank)
  • http://www.cis.upenn.edu/ace/
  • Example: The company bought a wheel-loader from
    Dresser.
      • Arg0: The company
      • rel: bought
      • Arg1: a wheel-loader
      • Arg2-from: Dresser
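
Such a predicate-argument annotation can be held in a very small data structure. The sketch below is an assumption made for illustration: the class Proposition and its fields are hypothetical and do not reproduce the actual PropBank file format. Only the labels and text spans are taken from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One annotated verb with its labelled arguments (hypothetical layout)."""
    rel: str
    args: dict = field(default_factory=dict)

    def describe(self):
        """Render the annotation as a single readable line."""
        parts = [f"rel={self.rel}"]
        parts += [f"{label}={text}" for label, text in self.args.items()]
        return "; ".join(parts)


# The example sentence: The company bought a wheel-loader from Dresser.
bought = Proposition(
    rel="bought",
    args={
        "Arg0": "The company",
        "Arg1": "a wheel-loader",
        "Arg2-from": "Dresser",
    },
)
summary = bought.describe()
```

With annotations stored like this, a query such as "who bought something?" reduces to collecting the Arg0 of every proposition whose rel is a form of buy.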

23
Predicate-Argument Structures
  • Example: The company's U.S. subsidiary,
    Matsushita Electric Corp. of America, had donated
    over $35,000 worth of Matsushita-made flashlights
    and batteries to residents shortly after the
    disaster, a company spokesman said.
      • Arg0: The company's U.S. subsidiary, Matsushita
        Electric Corp. of America,
      • REL: donated
      • Arg1: over $35,000 worth of Matsushita-made
        flashlights and batteries
      • Arg2-to: residents
      • ArgM-TMP: shortly after the disaster

24
PropBank Goals
  • Annotation of sentential verbs for the Penn
    TreeBank II Wall Street Journal Corpus of 1
    million words
  • by June 2003

25
FrameNet
  • Example for donate:
  • http://www.icsi.berkeley.edu/framenet/data/html/frames/Giving.html
  • A general frame (incl. give, donate, endow, hand
    over, ...) in an inheritance hierarchy
      • Inherits From: Relinquish
      • Is Inherited By: Commerce_pay, Commerce_sell

26
PropBank vs. FrameNet
  • PropBank
      • is based on Penn Treebank corpus sentences.
      • covers most semantic classes (some with few
        examples).
  • FrameNet
      • is based on invented sentences (to avoid
        ambiguity).
      • covers some semantic classes (but in depth).

27
Discourse annotation
  • What are discourse features?
  • Typically cohesion and coherence
      • coherence: what makes a text hang together in
        terms of content
      • cohesion: the means of making a text hang
        together
          • reference,
          • substitution,
          • ellipsis,
          • conjunctive relations (cause, result, effect
            etc.),
          • thematic development

28
Discourse annotation
  • Example: anaphoric relations in the IBM/Lancaster
    corpus (UCREL)
  • What are anaphoric relations?
      • links between a proform and an antecedent
  • Example:
      • The married couple said that they were happy with
        their lot.

29
Pragmatic annotation
  • anything beyond sentences and discourse: contexts
    of situation and culture.
  • Examples:
      • carry-on signals in conversation (e.g., Stenström
        87): which functions do carry-on signals such
        as "well", "you know" etc. have in conversation?
      • speech acts (e.g., Stiles 92): speech act types
        in conversation, e.g., in doctor-patient
        interactions
          • PATIENT: I have the headaches to the point that I
            have to vomit (D)
          • DOCTOR: Mm-hm (K)
          • PATIENT: Then I have to go to bed and I sleep for
            a while (E)
          • D: Disclosure
          • K: Acknowledgment
          • E: Edification

30
Future corpus access
  • through the web
  • the web as corpus
  • to multilingual corpora through the alignment of
    parallel texts

31
  • How will the successful natural language
    processing systems work in the future?

32
Observations
  • The Translation-Memory lesson: A sentence S is
    best translated by retrieving a human translation
    T.
  • A human translation is nothing but an annotated
    corpus.
  • The challenge: Match all variants and possibly
    fragments of S.
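
Matching variants of S against stored translations can be sketched with a fuzzy match over word sequences. This is only an illustration: the use of Python's difflib.SequenceMatcher, the threshold of 0.7, the function names and the two English-German memory entries are all assumptions and do not describe any particular translation-memory product.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase, split on whitespace, strip simple punctuation."""
    return [tok.strip(".,!?") for tok in text.lower().split()]


def best_match(sentence, memory, threshold=0.7):
    """Return ((source, target), score) for the closest stored sentence.

    The pair is None when no stored source reaches the threshold.
    """
    best_score, best_pair = 0.0, None
    query = tokenize(sentence)
    for source, target in memory:
        score = SequenceMatcher(None, query, tokenize(source)).ratio()
        if score > best_score:
            best_score, best_pair = score, (source, target)
    if best_score >= threshold:
        return best_pair, best_score
    return None, best_score


# Invented memory of (source sentence, human translation) pairs.
memory = [
    ("The company bought a wheel-loader.", "Die Firma kaufte einen Radlader."),
    ("The printer is out of paper.", "Der Drucker hat kein Papier mehr."),
]
pair, score = best_match("The company bought a wheel-loader yesterday.", memory)
```

A retrieved translation for a variant still has to be adapted (here the word for "yesterday" is missing), which is where matching of fragments and recombination come in.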

33
Observations
  • The FAQ lesson: A question Q is best answered by
    finding the QA pair in the database.
  • The challenge: Find all variants of Q.
  • Dialog Systems: Kiwilogic Assistants (by e.g.
    Artificial Solutions, Stockholm)
  • Examples:
      • http://www.tullverket.se/se/ (best with IE)

34
Kiwilogic Assistants
  • Processing works in this order:
      • Compare against FAQ
      • Search in word lists
      • Pick keywords
      • Handle basic questions (How are you?)
  • No understanding
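
The processing order above can be read as a cascade in which the first stage that finds something produces the answer. The sketch below only illustrates that reading. How the real assistants implement each stage is not stated here, so every table, answer text and function name is an invented assumption.

```python
def normalize(text):
    """Lowercase the input and drop simple punctuation."""
    for mark in "?.!,":
        text = text.replace(mark, " ")
    return " ".join(text.lower().split())


# All tables below are invented toy data.
FAQ = {"what are the opening hours": "We are open on weekdays."}
WORD_LISTS = {"customs": ["customs", "duty", "import"]}
TOPIC_ANSWERS = {"customs": "Please see the section on customs duties."}
KEYWORD_ANSWERS = {"price": "Prices are listed on the order page."}
SMALL_TALK = {"how are you": "Fine, thank you."}


def match_faq(question):
    return FAQ.get(question)


def match_word_lists(question):
    words = set(question.split())
    for topic, word_list in WORD_LISTS.items():
        if words & set(word_list):
            return TOPIC_ANSWERS[topic]
    return None


def match_keywords(question):
    for word in question.split():
        if word in KEYWORD_ANSWERS:
            return KEYWORD_ANSWERS[word]
    return None


def match_small_talk(question):
    return SMALL_TALK.get(question)


# Stages are tried in the order given on the slide.
STAGES = [match_faq, match_word_lists, match_keywords, match_small_talk]


def answer(question):
    """Return the reply of the first stage that matches."""
    normalized = normalize(question)
    for stage in STAGES:
        reply = stage(normalized)
        if reply is not None:
            return reply
    return "Sorry, I did not understand the question."
```

Nothing in this cascade analyses the question: each stage is a lookup, which fits the remark that no understanding is involved.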

35
Conclusions from the Observations
  • A lot can be gained from systematically
    harvesting human annotations.
  • My prediction: Lexical entries in NLP systems
    will be replaced by phrases.
  • NLP moves from detailed analysis of the input and
    aggregation of the output to matching of similar
    cases (of the input) and adaptation of the
    output.

36
Lessons learned
  • The overall lesson:
      • Concerning language skills (knowledge skills
        content skills) computers are a lot better
        (useful reliable) at retrieving examples of
        human production than at reproducing these
        skills!
  • Therefore: Investing manual labor
      • plus good storage
      • plus good similarity matching
      • plus good recombination strategies
  • will lead to improved systems.

37
Information Technology in the future
  • somewhat unrelated to this course.

38
11 Grand challenges in IT, found at
http://www.cordis.lu/ist/istag.htm
  • Information Society Technologies Advisory Group
  • Working Group
  • Grand Challenges in the Evolution of the
    Information Society
  • DRAFT Report (06 July 2004)
  • Working Group Members:
      • Chairman: Wolfgang Wahlster
      • Rapporteur: Mark Buchanan
      • ISTAG Members: Hervé Bourlard, Gabriel Ferrate I
        Pascual, Manuel Hermenegildo, Vladimir Kucera,
        Laure Reinhart
      • WG Members: Markus Gross, Thomas Lengauer, Joseph
        Mariani, Erik Sandewall, Walter Weigel

39
Grand challenges in IT (descriptions slightly
shortened)
  • The 100% Safe Car: Roadway accidents entail
    enormous human suffering and burden on the
    European society with tremendous economic costs.
    We envision the 100% safe automobile for
    eliminating traffic fatalities almost completely.
  • The Multilingual Companion: With the enlargement
    the EU faces a new multi-lingual challenge. We
    envision a powerful multi-lingual companion
    that will make multi-lingual and cross-lingual
    information access and communication virtually
    automatic.
  • The Service Robot Companion: As the European
    population ages, spiralling health-related costs
    will place an immense burden on European
    economies. We envision flexible home-care service
    robots, which will help people to care for
    themselves, improve their comfort of living and
    entertain them.

40
Grand challenges in IT
  • The Self-Monitoring and Self-Repairing Computer:
    System failures are extremely costly and all too
    frequent in today's complex ICT systems. We
    envision self-repairing computing systems that
    will greatly improve reliability.
  • The Internet Police Agent: To reap the full
    benefits of the Internet, we must counter
    criminal and anti-social activities (SPAM,
    viruses, worms, fraud, etc.). We envision an
    automated police agent that will be a socially
    beneficial force within the Internet environment.
  • The Disease and Treatment Simulator: We envision
    a computational platform for simulating the
    function of a concrete disease. This simulator
    will enable medicines to be tested without
    putting people at risk.

41
Grand challenges in IT
  • The Augmented Personal Memory: The ICT revolution
    will make it possible to store virtually every
    image, film or television program you have ever
    seen, every conversation you have ever had or
    book you have read. We envision a form of a
    personalised digital life diary and augmented
    memory assistant.
  • The Pervasive Communication Jacket: Most objects
    in the house, at work or in public spaces will
    soon carry wireless communications technology. We
    envision a communications jacket to exploit
    these information resources in a natural and
    beneficial way.
  • The Personal Everywhere Visualiser: Visualisation
    is key for people to exploit the information
    revolution. A grand challenge is a convenient
    personal and mobile visualisation system that
    will work anywhere and with minimal fuss.

42
Figure taken from H. Maurer: Der PC in zehn Jahren
(The PC in ten years). Informatik Spektrum, Feb. 2004.
Labels in the figure: Brain sensors, Camera,
Loudspeaker, Monitor, Microphone, Computer
43
The PC in 10 years
  • Output: Audio, Monitor on the Glasses
  • Input: Spoken, Virtual Keyboard
  • Problem: Energy Supply
  • Already available:
      • DejaView: permanent camera mounted on a head set
        that stores the last 30 seconds.

44
Grand challenges in IT
  • The Ultra-light Aerial Transport Agent: We
    envision an unmanned aerial transport agent for
    small scale logistics for the transport of
    small packages and products from point to point,
    monitoring of crime, and helping in search and
    rescue operations.
  • The Intelligent Retail Store: We envision the
    intelligent retail store where emerging ICT
    technologies are integrated in a way that brings
    more information and efficiency to both retailers
    and their customers alike.

45
Summary
  • Corpus Linguistics and Ontologies (will) profit
    from the same NLP techniques.
  • Future corpus annotation will be broader and
    deeper.
  • Many of the great IT challenges of the future
    involve language technology.

46
Any Questions?