Title: Corpus Linguistics vs. Ontologies
1Corpus Linguistics vs. Ontologies
- Martin Volk
- Computational Linguistics
- Stockholm University
- volk_at_ling.su.se
2Summing up
- Corpus Linguistics, Ontologies and their
neighbors
- and then a look into the future.
3Corpus Linguistics
- What kind of automatic corpus annotation is
possible?
- Part-of-Speech tagging
- Lemmatisation
- Named Entity Recognition
- Phrase chunking
- What is the advantage of manual corpus
annotation?
- Better accuracy
- More annotation depth
4Rule-based vs. Statistical Methods
- Rule-based
- Patterns for recognising and classifying proper
names
- Patterns for chunking Part-of-Speech sequences
into phrases
- Morphological analysis
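A minimal sketch of the second rule-based technique above, chunking Part-of-Speech sequences into phrases with a pattern. The tiny tag set (DT, JJ, NN, VB, IN), the noun-phrase pattern, and the function name chunk_noun_phrases are assumptions made for this illustration, not taken from the slides.

```python
import re

# Toy tag set assumed for this sketch:
# DT = determiner, JJ = adjective, NN = noun, VB = verb, IN = preposition.
# Pattern: optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = re.compile(r"\b(?:DT )?(?:JJ )*(?:NN )+")


def chunk_noun_phrases(tagged):
    """Return the noun phrases found in a list of (word, tag) pairs."""
    # Encode the tag sequence as a string such as "DT NN VB " so that a
    # regular expression can be matched against it.
    tags = "".join(tag + " " for _, tag in tagged)
    phrases = []
    for match in NP_PATTERN.finditer(tags):
        # Map character offsets back to token positions by counting spaces.
        start = tags[:match.start()].count(" ")
        end = tags[:match.end()].count(" ")
        phrases.append(" ".join(word for word, _ in tagged[start:end]))
    return phrases


sentence = [
    ("The", "DT"), ("company", "NN"), ("bought", "VB"),
    ("a", "DT"), ("wheel-loader", "NN"), ("from", "IN"), ("Dresser", "NN"),
]
noun_phrases = chunk_noun_phrases(sentence)
```

A real chunker would use a full tag set and further phrase types; the point here is only that the rule is a pattern over tags, not over words.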
5Statistical Methods
- Vector comparisons for
- matching queries against documents
- finding near-synonyms (building a similarity
thesaurus)
- word sense disambiguation
- Cooccurrence measures for
- collocations (e.g. support verb units)
- PP-attachments
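The two families of statistical methods listed above can be sketched in a few lines. The sketch assumes cosine similarity as the vector comparison and pointwise mutual information as the cooccurrence measure; the slides name neither, so both are stand-ins, and the function names and counts are invented for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def pmi(pair_count, count_x, count_y, total):
    """Pointwise mutual information: log2( P(x,y) / (P(x) * P(y)) ).

    Simplification: the same total is used to normalise pair counts
    and single-word counts.
    """
    p_xy = pair_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Matching a query against documents (invented mini-documents).
query = Counter("corpus annotation".split())
doc_a = Counter("manual corpus annotation".split())
doc_b = Counter("service robot companion".split())
score_a = cosine(query, doc_a)
score_b = cosine(query, doc_b)

# Collocation strength for a word pair (invented counts).
strength = pmi(pair_count=8, count_x=16, count_y=32, total=1024)
```

The same cosine function serves for query matching, near-synonym search and word sense disambiguation; only the way the vectors are built differs.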
6Corpus Linguistics vs. Ontologies
- Corpus Linguistics: language perspective
- Ontologies: knowledge perspective
7Ontologies
- Ontologies are a modern means of representing
world knowledge.
- Previous means
- Expert Systems (in the 1980s)
- Thesauri
- Topic Maps
- How can we represent what we know about a domain?
About the world?
8Ontologies
- Usage of Ontologies in IR systems
- as monolingual dictionary
- as bilingual dictionary
- seldom for reasoning
9Knowledge Management
- First we need to define
- Data
- Information
- Knowledge
10Knowledge Management
- is about mapping and tracking key knowledge
domains of each employee.
- is about how to best create and support
knowledge-creating and sharing behaviors.
11What is Text Mining? by Marti Hearst
http://www.sims.berkeley.edu/hearst/text-mining.html
- Text Mining is the discovery by computer of new,
previously unknown information, by automatically
extracting information from different written
resources.
- A key element is the linking together of the
extracted information to form new facts or new
hypotheses to be explored further.
12What is Text Mining? by Marti Hearst
- Text mining is different from web search.
- In web search, the user is typically looking for
something that is already known and has been
written by someone else. The problem is pushing
aside all the irrelevant material in order to
find the relevant information.
- In text mining, the goal is to discover
heretofore unknown information, something that no
one yet knows and so could not have yet written
down.
13What is Text Mining? by Marti Hearst
- One promising application area for text mining is
in the biosciences.
- The best known example is Don Swanson's work on
hypothesizing causes of rare diseases by looking
for indirect links in different subsets of the
bioscience literature.
14Semantic Web
- is about adding meta-data to web pages so that
computers can access the meaning of a web
document.
15Is the information flood a problem?
- Common wisdom: The vast amount of information on
the internet makes it ever harder to find the
relevant information.
- But from an Information Retrieval perspective:
- The more information the better!
- The more text there is, the higher the chance to
find the information phrased in the expected manner.
16What will future corpus annotation look like?
17Future corpus annotation
- Ever larger automatically annotated corpora.
- Manually annotated corpora
- grow in depth (additional annotation)
- grow in breadth (additional data)
- grow across languages
18Future corpus annotation
- Intersentential annotation
- annotation across sentences
- e.g. co-reference annotation
- e.g. text and discourse structure
- Semantic annotation
- name classes
- local, temporal, modal, causal units
- word senses relative to a thesaurus
- roles within a sentence (FrameNet)
- (cf. http://www.icsi.berkeley.edu/framenet/)
19Future corpus annotation
- Parallel treebanks
- semi-manual annotation of translated texts
- alignment on word and constituent level
- the parallel text may serve as disambiguator
21Future corpus annotation
- Better parsers for corpus annotation
- integration of shallow parsing and deep parsing
- Long term
- automatic transcription of spoken language data
- automatic search through audio and video data
22Predicate-Argument Structures
- Penn PropBank (Proposition Bank)
- http://www.cis.upenn.edu/ace/
- Example: The company bought a wheel-loader from
Dresser.
- Arg0: The company
- rel: bought
- Arg1: a wheel-loader
- Arg2-from: Dresser
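One way to hold such an annotation in a program is a small record with the predicate and a mapping from role labels to text spans. This is a sketch under assumptions: the class name Proposition and its fields are invented here and do not reflect the PropBank file format.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """A predicate with its labelled arguments (hypothetical structure)."""
    rel: str
    args: dict = field(default_factory=dict)

    def role_of(self, phrase):
        """Return the role label of a phrase, or None if it fills no role."""
        for label, text in self.args.items():
            if text == phrase:
                return label
        return None


# The example sentence from the slide, stored as one proposition.
buy = Proposition(
    rel="bought",
    args={
        "Arg0": "The company",
        "Arg1": "a wheel-loader",
        "Arg2-from": "Dresser",
    },
)
seller_role = buy.role_of("Dresser")
```

A query such as "who sold something to whom" then becomes a lookup on role labels instead of a search over surface word order.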
23Predicate-Argument Structures
- Example: The company's U.S. subsidiary,
Matsushita Electric Corp. of America, had donated
over $35,000 worth of Matsushita-made flashlights
and batteries to residents shortly after the
disaster, a company spokesman said.
- Arg0: The company's U.S. subsidiary, Matsushita
Electric Corp. of America,
- REL: donated
- Arg1: over $35,000 worth of Matsushita-made
flashlights and batteries
- Arg2-to: residents
- ArgM-TMP: shortly after the disaster
24PropBank Goals
- Annotation of sentential verbs for the Penn
TreeBank II Wall Street Journal Corpus of 1
million words
- by June 2003
25FrameNet
- Example for donate
- http://www.icsi.berkeley.edu/framenet/data/html/frames/Giving.html
- A general frame (incl. give, donate, endow, hand
over, ...) in an inheritance hierarchy
- Inherits From: Relinquish
- Is Inherited By: Commerce_pay, Commerce_sell
26PropBank vs. FrameNet
- PropBank is
- based on Penn Treebank corpus sentences.
- covers most semantic classes (some with few
examples).
- FrameNet is
- based on invented sentences (to avoid ambiguity).
- covers some semantic classes (but in depth).
27Discourse annotation
- What are discourse features?
- Typically cohesion and coherence
- coherence: what makes a text hang together in
terms of content
- cohesion: the means of making a text hang
together
- reference,
- substitution,
- ellipsis,
- conjunctive relations (cause, result, effect
etc.),
- thematic development
28Discourse annotation
- example: anaphoric relations in the IBM/Lancaster
corpus (UCREL)
- what are anaphoric relations?
- links between a proform and an antecedent
- example:
- The married couple said that they were happy with
their lot.
29Pragmatic annotation
- anything beyond sentences and discourse: contexts
of situation and culture.
- Examples:
- carry-on signals in conversation (e.g., Stenström
87): which functions do carry-on signals such
as "well", "you know" etc. have in conversation?
- speech acts (e.g., Stiles 92): speech act types
in conversation, e.g., in doctor-patient
interactions
- PATIENT: I have the headaches to the point that I
have to vomit (D)
- DOCTOR: Mm-hm (K)
- PATIENT: Then I have to go to bed and I sleep for
a while (E)
- D: Disclosure
- K: Acknowledgment
- E: Edification
30Future corpus access
- through the web
- the web as corpus
- to multilingual corpora through the alignment of
parallel texts
31How will successful natural language processing
systems work in the future?
32Observations
- The Translation-Memory lesson: A sentence S is
best translated by retrieving a human translation
T.
- A human translation is nothing but an annotated
corpus.
- The challenge: Match all variants and possibly
fragments of S.
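The retrieval step behind the Translation-Memory lesson can be sketched as fuzzy matching of S against stored source sentences. The sketch assumes token-level similarity from Python's difflib as the matching measure; the memory entries, the threshold of 0.7 and the function name best_match are invented for illustration and are not from the slides.

```python
from difflib import SequenceMatcher

# Invented translation memory: English source -> German human translation.
MEMORY = {
    "The company bought a wheel-loader.": "Die Firma kaufte einen Radlader.",
    "Any questions?": "Gibt es Fragen?",
}


def best_match(sentence, memory, threshold=0.7):
    """Return (source, translation, score) for the closest entry, or None."""
    tokens = sentence.lower().split()
    best = None
    for source, translation in memory.items():
        # Compare token sequences so that an inserted or changed word
        # lowers the score only slightly.
        score = SequenceMatcher(None, tokens, source.lower().split()).ratio()
        if best is None or score > best[2]:
            best = (source, translation, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


# A variant of a stored sentence still retrieves the human translation.
hit = best_match("The company bought a new wheel-loader.", MEMORY)
miss = best_match("Brain sensors", MEMORY)
```

The retrieved translation still has to be adapted (here: the extra word "new"), which is the adaptation step the later conclusions refer to. Matching fragments of S would need sub-sentence units in the memory, which this sketch does not cover.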
33Observations
- The FAQ lesson: A question Q is best answered by
finding the QA pair in the database.
- The challenge: Find all variants of Q.
- Dialog Systems: Kiwilogic Assistants (by e.g.
Artificial Solutions, Stockholm)
- Examples:
- http://www.tullverket.se/se/ (best with IE)
34Kiwilogic Assistants
- Processing works in this order:
- Compare against FAQ
- Search in word lists
- Pick keywords
- Handle basic questions (How are you?)
- No understanding
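The processing order above amounts to a cascade of shallow matchers in which the first stage that produces a reply wins. The following is a hypothetical sketch of such a cascade, not the vendor's implementation: all data, replies, function names and the exact behaviour of each stage are assumptions.

```python
import string

# All data below is invented for illustration.
FAQ = {"what are your opening hours": "We are open 9 to 5 on weekdays."}
WORD_LISTS = {"customs": ["duty", "import", "export"]}
STOPWORDS = {"the", "a", "is", "what", "how", "are", "you", "i", "do",
             "about", "tell", "me"}
SMALL_TALK = {"how are you": "Fine, thank you."}


def normalise(text):
    """Lowercase, strip punctuation and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def from_faq(q):
    return FAQ.get(q)


def from_word_lists(q):
    for topic, words in WORD_LISTS.items():
        if any(token in words for token in q.split()):
            return f"Your question seems to be about {topic}."
    return None


def from_keywords(q):
    content = [token for token in q.split() if token not in STOPWORDS]
    if content:
        return f"I have no answer about '{content[0]}' yet."
    return None


def from_small_talk(q):
    return SMALL_TALK.get(q)


# Stages in the order given on the slide.
STAGES = [from_faq, from_word_lists, from_keywords, from_small_talk]


def answer(question):
    """Run the stages in order and return the first reply found."""
    q = normalise(question)
    for stage in STAGES:
        reply = stage(q)
        if reply is not None:
            return reply
    return "Sorry, I did not understand."
```

No stage builds a representation of meaning; every reply comes from string matching against stored material, which is the point of "No understanding".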
35Conclusions from the Observations
- A lot can be gained from systematically
harvesting human annotations.
- My prediction: Lexical entries in NLP systems
will be replaced by phrases.
- NLP moves from detailed analysis of the input and
aggregation of the output to matching of similar
cases (of the input) and adaptation of the
output.
36Lessons learned
- The overall lesson
- Concerning language skills (and knowledge skills,
content skills), computers are a lot better (more
useful, more reliable) at retrieving examples of
human production than at reproducing these skills!
- Therefore: Investing manual labor
- plus good storage
- plus good similarity matching
- plus good recombination strategies
- will lead to improved systems.
37Information Technology in the future
- somewhat unrelated to this course.
38Eleven grand challenges in IT, found at
http://www.cordis.lu/ist/istag.htm
- Information Society Technologies Advisory Group
- Working Group
- Grand Challenges in the Evolution of the
Information Society
- DRAFT Report (06 July 2004)
- Working Group Members
- Chairman: Wolfgang Wahlster
- Rapporteur: Mark Buchanan
- ISTAG Members: Hervé Bourlard, Gabriel Ferrate I
Pascual, Manuel Hermenegildo, Vladimir Kucera,
Laure Reinhart
- WG Members: Markus Gross, Thomas Lengauer, Joseph
Mariani, Erik Sandewall, Walter Weigel
39Grand challenges in IT (descriptions slightly
shortened)
- The 100% Safe Car: Roadway accidents entail
enormous human suffering and burden on the
European society with tremendous economic costs.
We envision the 100% safe automobile for
eliminating traffic fatalities almost completely.
- The Multilingual Companion: With the enlargement,
the EU faces a new multi-lingual challenge. We
envision a powerful multi-lingual companion
that will make multi-lingual and cross-lingual
information access and communication virtually
automatic.
- The Service Robot Companion: As the European
population ages, spiralling health-related costs
will place an immense burden on European
economies. We envision flexible home-care service
robots, which will help people to care for
themselves, improve their comfort of living and
entertain them.
40Grand challenges in IT
- The Self-Monitoring and Self-Repairing Computer:
System failures are extremely costly and all too
frequent in today's complex ICT systems. We
envision self-repairing computing systems that
will greatly improve reliability.
- The Internet Police Agent: To reap the full
benefits of the Internet, we must counter
criminal and anti-social activities (SPAM,
viruses, worms, fraud, etc.). We envision an
automated police agent that will be a socially
beneficial force within the Internet environment.
- The Disease and Treatment Simulator: We envision
a computational platform for simulating the
function of a concrete disease. This simulator
will enable medicines to be tested without
putting people at risk.
41Grand challenges in IT
- The Augmented Personal Memory: The ICT revolution
will make it possible to store virtually every
image, film or television program you have ever
seen, every conversation you have ever had or
book you have read. We envision a form of a
personalised digital life diary and augmented
memory assistant.
- The Pervasive Communication Jacket: Most objects
in the house, at work or in public spaces will
soon carry wireless communications technology. We
envision a communications jacket to exploit
these information resources in a natural and
beneficial way.
- The Personal Everywhere Visualiser: Visualisation
is key for people to exploit the information
revolution. A grand challenge is a convenient
personal and mobile visualisation system that
will work anywhere and with minimal fuss.
42Figure taken from H. Maurer: Der PC in zehn
Jahren (The PC in ten years). Informatik Spektrum,
Feb. 2004.
- Labelled components: brain sensors, camera,
loudspeaker, monitor, microphone, computer
43The PC in 10 years
- Output: Audio, Monitor on the Glasses
- Input: Spoken, Virtual Keyboard
- Problem: Energy Supply
- Already available:
- DejaView: permanent camera
- mounted on a head set
- that stores the last 30 seconds.
44Grand challenges in IT
- The Ultra-light Aerial Transport Agent: We
envision an unmanned aerial transport agent for
small scale logistics for the transport of
small packages and products from point to point,
monitoring of crime, and helping in search and
rescue operations.
- The Intelligent Retail Store: We envision the
intelligent retail store where emerging ICT
technologies are integrated in a way that brings
more information and efficiency to both retailers
and their customers.
45Summary
- Corpus Linguistics and Ontologies (will) profit
from the same NLP techniques.
- Future corpus annotation will be broader and
deeper.
- Many of the great IT challenges of the future
involve language technology.
46Any Questions?