Title: Where the Web Went Wrong
1 Where the Web Went Wrong
http://gate.ac.uk/
http://nlp.shef.ac.uk/
Hamish Cunningham
Dept. Computer Science, University of Sheffield
Graz, May 2004
2 Contents
- The Web, presentation, and syndication
- A Semantic Web for eCulture
- annoy half the audience
- annoy the other half
- eCulture, metadata and human language
- motivation
- Information Extraction: quantified language computing
- MUMIS, GATE, ...
- Cultural memory is not a luxury
3 Syndication and Mediation
- The web promotes diversity, but also fragmentation
- Original web: separate content and presentation (this is a header, not "set in 20 point bold font")
- Now: many incompatible/inaccessible interfaces
- Memory Institutions (museums, libraries, archives) need to
- pool their impact: syndication in networked communities
- support repurposable content
- Therefore data must be presentation independent
- Candidate technologies: DC, CIDOC, XML, RSS, RDF, OWL (semantic web)...
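A minimal sketch of what "presentation independent" can mean in practice: one content-only record, serialised once as Dublin Core XML for syndication and once as HTML for display. The record fields, the `record` wrapper element and both function names are assumptions made for illustration; only the Dublin Core namespace URI is standard.

```python
import xml.etree.ElementTree as ET
from html import escape

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace

# Illustrative record: content only, no layout information.
RECORD = {
    "title": "Where the Web Went Wrong",
    "creator": "Hamish Cunningham",
    "date": "2004-05",
}


def to_dublin_core(record):
    """Serialise the record as Dublin Core XML (hypothetical <record> wrapper)."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for field, value in record.items():
        ET.SubElement(root, f"{{{DC_NS}}}{field}").text = value
    return ET.tostring(root, encoding="unicode")


def to_html(record):
    """Render the same record for a browser; layout decisions live only here."""
    rows = "".join(
        f"<dt>{escape(key)}</dt><dd>{escape(value)}</dd>"
        for key, value in record.items()
    )
    return f"<dl>{rows}</dl>"


dc_xml = to_dublin_core(RECORD)
html_view = to_html(RECORD)
```

Because the record carries no formatting, a third consumer (an RSS feed, say) would only need another small function, not a change to the data.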
4 Semantic Web (1)
- Memory Institutions (museums, libraries, archives) host massively diverse content
- "Fortunately, the differences are primarily at the level of data structure and syntax. Significant conceptual overlaps exist between the descriptive schema used by memory institutions: elemental concepts such as objects, people, places, events, and the interrelationships between them are almost universal." (Building semantic bridges between museums, libraries and archives: The CIDOC Conceptual Reference Model, T. Gill, April 2004)
- Therefore we can add a semantic metadata layer to provide generalised inter-institution resource location
- Syndication and mediation for free!
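A toy sketch of the "semantic metadata layer" idea: each institution keeps its own schema, and a small mapping lifts local fields onto the shared concepts named in the quotation (object, person, place, event). All field names, records and values are hypothetical, and this is far simpler than the CIDOC Conceptual Reference Model itself.

```python
# Hypothetical local schemas mapped onto shared concepts.
MAPPINGS = {
    "museum": {"artefact": "object", "maker": "person", "findspot": "place"},
    "library": {"title": "object", "author": "person", "published_in": "place"},
    "archive": {"item": "object", "correspondent": "person", "occasion": "event"},
}

# Hypothetical records, each in its institution's own schema.
RECORDS = [
    ("museum", {"artefact": "Clock 17", "maker": "A. Example", "findspot": "Graz"}),
    ("library", {"title": "On Clocks", "author": "A. Example", "published_in": "Sheffield"}),
    ("archive", {"item": "Letter 4", "correspondent": "B. Sample", "occasion": "Exhibition"}),
]


def to_shared(institution, record):
    """Translate a local record into the shared concept vocabulary."""
    mapping = MAPPINGS[institution]
    return {mapping[field]: value for field, value in record.items() if field in mapping}


def locate(records, concept, value):
    """Return the institutions holding a record whose shared concept has this value."""
    return [
        institution
        for institution, record in records
        if to_shared(institution, record).get(concept) == value
    ]


holders = locate(RECORDS, "person", "A. Example")
```

One query over the shared layer reaches both the museum and the library record, although neither uses a field called `person`.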
5 Semantic Web (2): good news and bad news
- The good news: SW focus of AI and metadata work
- The bad news: AI always fails
- How does the machine tell the difference between "Mother Teresa is a saint" and "Tony Blair is a saint"? (Or, who tells Google which statement is important?)
- Other web users do, by linking (also cf. Amazon)
- Two solutions to the AI problem:
- allow curators and users to build their own (simple, specific models can succeed, but the cost may be too high)
- use recommender systems to make the user a curator's assistant (researchers and students may barter for access)
- Any route to searchable content!
6 IT context: the Knowledge Economy and Human Language
- Gartner, December 2002:
- taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
- through 2012 more than 95% of human-to-computer information input will involve textual language
- A contradiction:
- to deal with the information deluge we need formal knowledge in semantics-based systems
- our archived history is in informal and ambiguous natural language
- The challenge: to reconcile these two phenomena
7 HLT: Closing the Loop
KEY:
- MNLG: Multilingual Natural Language Generation
- OIE: Ontology-aware Information Extraction
- AIE: Adaptive IE
- CLIE: Controlled Language IE
[Diagram: a loop between Human Language and Formal Knowledge (ontologies and instance bases; Semantic Web, Semantic Grid, Semantic Web Services). Labelled links: OIE, (A)IE, CLIE (from Controlled Language) and (M)NLG.]
8 Information Extraction
- Information Extraction (IE) pulls facts and structured information from the content of large text collections.
- Contrast: IE and Information Retrieval
- NLP history: from NLU to IE
- Progress driven by quantitative measures
- MUC: Message Understanding Conferences
- ACE: Automatic Content Extraction
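The "quantitative measures" in MUC-style evaluations are typically precision, recall and their harmonic mean (F-measure), computed by comparing system output against a human-annotated key. The sketch below assumes exact-match scoring over (start, end, label) tuples; the spans are made up for illustration, and real scorers also handle partial matches.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over two collections of annotations."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (start offset, end offset, label).
GOLD = {(0, 12, "Person"), (17, 28, "Person"), (40, 46, "Location")}
PREDICTED = {(0, 12, "Person"), (17, 28, "Person"), (17, 23, "Person"), (30, 35, "Date")}

precision, recall, f1 = precision_recall_f1(PREDICTED, GOLD)
```

Here two of four predictions are correct (precision 0.5) and two of three key annotations are found (recall 2/3).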
9 IE Example
- The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.
- NE: "rocket", "Tuesday", "Dr. Head", "We Build Rockets"
- CO: "it" = rocket; "Dr. Head" = "Dr. Big Head"
- TE: the rocket is "shiny red" and Head's "brainchild".
- TR: Dr. Head works for We Build Rockets Inc.
- ST: rocket launch event with various participants
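To make the first two task types concrete (NE, named entity recognition; CO, coreference), here is a deliberately crude sketch run on the example text. The three regular expressions and the surname-matching heuristic are assumptions written for this one passage, not how a real IE system works; they do not find "rocket", which needs more than surface patterns.

```python
import re

TEXT = (
    "The shiny red rocket was fired on Tuesday. "
    "It is the brainchild of Dr. Big Head. "
    "Dr. Head is a staff scientist at We Build Rockets Inc."
)

# Hypothetical hand-written patterns, one per entity type.
PATTERNS = {
    "Date": r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b",
    "Person": r"\bDr\. [A-Z][a-z]+(?: [A-Z][a-z]+)*",
    "Organization": r"\b(?:[A-Z][a-z]+ )+Inc\.",
}


def find_entities(text):
    """NE step: return (start, end, label, mention) tuples in text order."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            entities.append((match.start(), match.end(), label, match.group()))
    return sorted(entities)


def corefer_people(entities):
    """CO step: group Person mentions by final token (a crude surname match)."""
    chains = {}
    for _, _, label, mention in entities:
        if label == "Person":
            chains.setdefault(mention.split()[-1], []).append(mention)
    return chains


entities = find_entities(TEXT)
chains = corefer_people(entities)
mentions = [mention for _, _, _, mention in entities]
```

The later task types (TE, TR, ST) build on these outputs by attaching attributes, relations and event roles to the entities found.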
10 Performance levels
- (Extensive quantitative evaluation since early 90s: mainly on text, ASR; now also video OCR)
- Vary according to text type, domain, scenario, language
- NE: up to 97% (tested in English, Spanish, Japanese, Chinese, others)
- CO: 60-70% resolution
- TE: 80%
- TR: 75-80%
- ST: 60% (but human level may be only 80%)
11 Ontology-based IE
XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in
[Diagram: Ontology and KB. The text is linked to the classes Location, Company, City and Country through the relations type, partOf, HQ and establOn (value 03/11/1978).]
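One way to read the diagram is as a set of subject-predicate-object triples. The sketch below is a guess at that reading: the class and relation names come from the slide, but which entity carries which relation, the `subClassOf` links, and the "UK" node are assumptions added for illustration.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate object")

# Assumed reading of the slide; "UK" and subClassOf are not in the source.
KB = [
    Triple("City", "subClassOf", "Location"),
    Triple("Country", "subClassOf", "Location"),
    Triple("XYZ", "type", "Company"),
    Triple("XYZ", "establOn", "03/11/1978"),
    Triple("XYZ", "HQ", "London"),
    Triple("London", "type", "City"),
    Triple("London", "partOf", "UK"),  # hypothetical node
    Triple("Bulgaria", "type", "Country"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the KB."""
    return [t.object for t in KB if t.subject == subject and t.predicate == predicate]


def is_a(entity, cls):
    """True if entity has type cls, or a type that is a transitive subclass of cls."""
    pending = objects(entity, "type")
    seen = set()
    while pending:
        current = pending.pop()
        if current == cls:
            return True
        if current not in seen:
            seen.add(current)
            pending.extend(objects(current, "subClassOf"))
    return False


hq = objects("XYZ", "HQ")
london_is_location = is_a("London", "Location")
```

Typing mentions against the ontology is what lets a query for "companies with headquarters in a Location" match this sentence even though the word "location" never appears in it.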
12 Classes, instances and metadata
Gordon Brown met George Bush during his two day visit.
<metadata>
  <DOC-ID>http// 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>Person</class>
    <inst>Person67890</inst>
  </Annotation>
</metadata>
[Diagrams: "Classes/instances before" and "Classes/instances after", with "Bush" as the example label]
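The metadata above is standoff annotation: the text stays untouched and each annotation points into it by character offset, adding a class and an instance identifier. A minimal sketch of building such annotations follows; the `Annotation` dataclass and `annotate` function are illustrative names, and the offsets are computed from the sentence as printed here (0-based, end-exclusive), so they need not match the slide's figures.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int   # offset of the first character
    end: int     # offset one past the last character
    string: str  # the annotated surface text
    cls: str     # ontology class, e.g. "Person"
    inst: str    # instance identifier in the knowledge base


def annotate(text, mentions):
    """Build standoff annotations for (surface string, class, instance id) triples."""
    annotations = []
    for surface, cls, inst in mentions:
        start = text.find(surface)
        if start == -1:
            continue  # mention does not occur in this text
        annotations.append(Annotation(start, start + len(surface), surface, cls, inst))
    return annotations


TEXT = "Gordon Brown met George Bush during his two day visit."
MENTIONS = [
    ("Gordon Brown", "Person", "Person12345"),
    ("George Bush", "Person", "Person67890"),
]

annotations = annotate(TEXT, MENTIONS)
```

Both mentions share the class Person but carry different instance identifiers, which is the distinction the slide title draws between classes and instances.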
13 An example: the MUMIS project
- Multimedia Indexing and Searching Environment
- Composite index of a multimedia programme from multiple sources in different languages
- ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface
- University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
- An important experimental result: multiple sources for same events can improve extraction quality
- PrestoSpace: applications in news and sports archiving
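Why might multiple sources help? One simple intuition, sketched below, is that independent extractions of the same event can vote on each slot, so a gap or error in one source is outvoted by the others. This majority vote is an assumed stand-in, not the merging method MUMIS actually used, and the three extractions are made-up values.

```python
from collections import Counter


def merge_event(extractions):
    """Merge per-source slot dictionaries for one event by majority vote per slot."""
    merged = {}
    slots = {slot for extraction in extractions for slot in extraction}
    for slot in sorted(slots):
        votes = Counter(
            extraction[slot]
            for extraction in extractions
            if extraction.get(slot) is not None
        )
        if votes:
            merged[slot] = votes.most_common(1)[0][0]
    return merged


# Hypothetical extractions of one event from three sources.
SOURCES = [
    {"type": "goal", "player": "Beckham", "minute": 38},
    {"type": "goal", "player": "Beckham", "minute": 39},  # disagrees on the minute
    {"type": "goal", "minute": 38},                        # missed the player
]

merged = merge_event(SOURCES)
```

The merged record is complete and consistent even though no single source other than the first was.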
14 Semantic Query
Not: "goal Beckham" (includes e.g. missed goals, or "this was not a goal")
Instead: goal events with scorer David Beckham
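The contrast can be sketched in a few lines: a keyword search matches any sentence containing both words, while a query over extracted events matches on event type and role. The sentences, event records, slot names and "Player B" are invented for illustration.

```python
# Hypothetical commentary sentences and the events extracted from them.
SENTENCES = [
    "Beckham scores a goal from the free kick.",
    "Beckham shoots wide; this was not a goal.",
    "Player B scores a goal after a pass from Beckham.",
]

EVENTS = [
    {"type": "goal", "scorer": "David Beckham"},
    {"type": "missed_shot", "player": "David Beckham"},
    {"type": "goal", "scorer": "Player B", "assist": "David Beckham"},
]


def keyword_search(sentences, *terms):
    """Return sentences containing every term, ignoring case."""
    return [s for s in sentences if all(t.lower() in s.lower() for t in terms)]


def semantic_query(events, **constraints):
    """Return events whose slots equal every given constraint."""
    return [e for e in events if all(e.get(k) == v for k, v in constraints.items())]


keyword_hits = keyword_search(SENTENCES, "goal", "Beckham")
semantic_hits = semantic_query(EVENTS, type="goal", scorer="David Beckham")
```

The keyword search returns all three sentences, including the miss and the assist; the semantic query returns only the goal Beckham scored.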
15 The results: England win!
16 GATE, a General Architecture for Text Engineering, is...
- An architecture: A macro-level organisational picture for LE software systems.
- A framework: For programmers, GATE is an object-oriented class library that implements the architecture.
- A development environment: For language engineers, a graphical development environment.
- GATE comes with...
- Free components, and wrappers for other people's stuff
- Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
- Free software (LGPL) at http://gate.ac.uk/download/
- Used by thousands of people at hundreds of sites
17 A bit of a nuisance (GATE users)
- Thousands of users at hundreds of sites. A representative sample:
- the American National Corpus project
- the Perseus Digital Library project, Tufts University, US
- Longman Pearson publishing, UK
- Merck KgAa, Germany
- Canon Europe, UK
- Knight Ridder, US
- BBN (leading HLT research lab), US
- SMEs inc. Sirma AI Ltd., Bulgaria
- Stanford, Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
- UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...
- GATE team projects. Past:
- Conceptual indexing: MUMIS: automatic semantic indices for sports video
- MUSE, cross-genre entity finder
- HSL, Health-and-safety IE
- Old Bailey: collaboration with HRI on 17th century court reports
- Multiflora: plant taxonomy text analysis for biodiversity research e-science
- ACE / TIDES: Arabic, Chinese NE
- JHU summer w/s on semtagging
- EMILLE: S. Asian languages corpus
- hTechSight: chemical eng. K. portal
- Present:
- Advanced Knowledge Technologies: 12m UK five site collaborative project
- SEKT: Semantic Knowledge Technology
- PrestoSpace: MM Preservation/Access
- KnowledgeWeb: Semantic Web
- Future:
- New eContent project LIRICS
18 GATE: infrastructure for semantic metadata extraction
- Combines learning and rule-based methods (new work on mixed-initiative learning)
- Allows combination of IE and IR
- Enables use of large-scale linguistic resources for IE, such as WordNet
- Supports ontologies as part of IE applications: Ontology-Based IE
- Supports languages from Hindi to Chinese, Italian to German
19 PrestoSpace Semantics Architecture
[Diagram. Labels: AV Signals; ASR, etc.; Signal md, Transcriptions; Text Sources (IT, EN, ...); IE; Formal Text; Final Annotations; Multilingual Conceptual Q&A]
20 Memory is not a luxury
- C21st: all the C20th mistakes but bigger and better?
- If you don't know where you've been, how can you know where you're going?
- Archives: ammunition in the war on ignorance
- Ammunition is useless if you can't find it: new technology must make our history accessible to all, for all our futures
21 Links
- This talk:
- http://gate.ac.uk/sale/talks/eculture-graz-may2004.ppt
- Related projects