Title: GATE, a General Architecture for Text Engineering
1. GATE, a General Architecture for Text Engineering
- http://gate.ac.uk/ http://nlp.shef.ac.uk/
- Hamish Cunningham
- Department of Computer Science, University of Sheffield
- ENST, Paris, 20/1/2003
- Natural Language Engineering in Sheffield
- One of the largest Human Language Technology groups in the EU
- 50 staff in Language and Speech Processing, 25 in Information Retrieval, including 6 professors
- A focus on scientific method in AI (we participate in all the leading quantitative evaluation programmes in the US)
- A focus on engineering high-quality open-source software for applications and demonstrators
2. GATE, a General Architecture for Text Engineering
- GATE is:
- An architecture: a macro-level organisational picture for language engineering (LE) software systems.
- A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
- A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
- Free software (LGPL). Mature, robust software (in development since 1995). Download at http://gate.ac.uk/download
- Comes with:
- Some free components... ...and wrappers for other people's components
- Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
3. Applications
- GATE has been used for a variety of applications, including:
- MUMIS: automatic creation of semantic indexes for multimedia programme material
- MUSE: a multi-genre IE system
- EMILLE: a 70 million word corpus of Indic languages
- Metadata for Medline (at Merck)
- Creation of metadata for Semantic Web Services documentation using NLG
- HSE: summarisation of health and safety information from company reports
- OldBaileyIE: NE recognition on 17th century Old Bailey Court reports
- AKT: language technology in knowledge management
- AMITIES: call centre automation
- Digital libraries / e-philology for ancient languages researchers
- Various Medical Informatics and database technology projects
- IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian next year)
4. Some users
- At time of writing, a representative fraction of GATE users includes:
- Longman Pearson publishing, UK
- BT Exact Technologies, UK
- Merck KgAa, Germany
- Canon Europe, UK
- Knight Ridder (the second biggest US news publisher)
- BBN Technologies, US
- Sirma AI Ltd., Bulgaria
- Resco AB, Sweden/Finland/Germany
- Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts
- Master Foods NV: extraction of commodities events from news
- the American National Corpus project, US
- Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU universities
- the Perseus Digital Library project, Tufts University, US
5. Architectural principles
- Non-prescriptive, theory neutral (strength and weakness)
- Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka)
- (Almost) everything is a component, and component sets are user-extendable
- Component-based development
- An OO way of chunking software: Java Beans
- GATE components: CREOLE, modified Java Beans (Collection of REusable Objects for Language Engineering)
- The minimal component: 10 lines of Java, 10 lines of XML, 1 URL.
6. GATE Language Resources
- GATE LRs are documents, ontologies, corpora, lexicons, ...
- Documents / corpora
- GATE documents loaded from local files or the web...
- Diverse document formats: text, HTML, XML, email, RTF, SGML.
- Processing Resources
- Algorithmic components known as PRs: beans with execute methods.
- All PRs can handle Unicode data by default.
- Clear distinction between code and data (simple repurposing).
- 20-30 freebies with GATE
- e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
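The idea of PRs as "beans with execute methods", kept separate from the data they process, can be illustrated with a minimal sketch. Every class and method name below (`SimpleDocument`, `Span`, `WhitespaceTokeniser`, `tokenise`) is hypothetical and invented for illustration; this is not GATE's class library, only plain Java showing the shape of the pattern.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a bean-style processing component. Not GATE's actual API. */
public class BeanStylePrSketch {

    /** Data: a typed span over the document text (a stand-in for an annotation). */
    public static final class Span {
        public final int start;
        public final int end;
        public final String type;

        public Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /** Data: the text plus the spans that components have added to it. */
    public static final class SimpleDocument {
        public final String text;
        public final List<Span> spans = new ArrayList<>();

        public SimpleDocument(String text) {
            this.text = text;
        }
    }

    /** Code: configured through bean-style setters, run through execute(). */
    public static final class WhitespaceTokeniser {
        private SimpleDocument document;

        public void setDocument(SimpleDocument document) { this.document = document; }
        public SimpleDocument getDocument() { return document; }

        /** Adds one "Token" span per maximal run of non-whitespace characters. */
        public void execute() {
            String text = document.text;
            int i = 0;
            while (i < text.length()) {
                // Skip the whitespace that separates tokens.
                while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                int start = i;
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                if (i > start) {
                    document.spans.add(new Span(start, i, "Token"));
                }
            }
        }
    }

    /** Convenience wrapper: run the component over a string and return its spans. */
    public static List<Span> tokenise(String text) {
        SimpleDocument doc = new SimpleDocument(text);
        WhitespaceTokeniser pr = new WhitespaceTokeniser();
        pr.setDocument(doc); // the same component can be pointed at any other document
        pr.execute();
        return doc.spans;
    }

    public static void main(String[] args) {
        for (Span s : tokenise("GATE handles Unicode text")) {
            System.out.println(s);
        }
    }
}
```

Because the component holds no text of its own, repurposing it means calling `setDocument` with different data, which is the sense of "simple repurposing" assumed here.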
7. Visual Resources
8. Displaying Coreference Information
9. Displaying Syntactic Information
10. Lexicon Support: WordNet example
11. A Language Analysis Example
12. Building IE Components in GATE (1)
- The ANNIE system: a reusable and easily extendable set of components
13. Building IE Components in GATE (2)
- JAPE: a Java Annotation Patterns Engine
- Light, robust regular-expression-based processing
- Cascaded finite state transduction
- Low-overhead development of new components

    Rule: Company1
    Priority: 25
    (
      ( {Token.orthography == upperInitial} )+
      {Lookup.kind == companyDesignator}
    ):companyMatch
    -->
    :companyMatch.NamedEntity = { kind = "company", rule = "Company1" }
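To make the intent of the Company1 rule concrete, here is a plain-Java sketch of the pattern it appears to describe: one or more upper-initial tokens followed by a token that a gazetteer has marked as a company designator. This is not JAPE and does not reproduce its matching or priority semantics; the `Tok` class and `findCompanies` method are hypothetical names, and stopping the leftward walk at another gazetteer hit is a simplifying assumption made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical plain-Java approximation of the Company1 pattern. Not JAPE. */
public class Company1Sketch {

    /** A token with its text and an optional gazetteer kind (null if none). */
    public static final class Tok {
        public final String text;
        public final boolean upperInitial;
        public final String lookupKind;

        public Tok(String text, String lookupKind) {
            this.text = text;
            this.upperInitial = !text.isEmpty()
                    && Character.isUpperCase(text.codePointAt(0));
            this.lookupKind = lookupKind;
        }
    }

    /** Returns the text of every "upper-initial tokens + designator" sequence. */
    public static List<String> findCompanies(List<Tok> toks) {
        List<String> found = new ArrayList<>();
        for (int j = 0; j < toks.size(); j++) {
            if (!"companyDesignator".equals(toks.get(j).lookupKind)) {
                continue;
            }
            // Walk left over the run of upper-initial tokens before the designator.
            // Assumption: the run stops at any token that has its own gazetteer kind.
            int i = j;
            while (i > 0
                    && toks.get(i - 1).upperInitial
                    && toks.get(i - 1).lookupKind == null) {
                i--;
            }
            if (i == j) {
                continue; // the pattern needs at least one upper-initial token
            }
            StringBuilder sb = new StringBuilder();
            for (int k = i; k <= j; k++) {
                if (k > i) {
                    sb.append(' ');
                }
                sb.append(toks.get(k).text);
            }
            found.add(sb.toString());
        }
        return found;
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(
                new Tok("shares", null), new Tok("of", null),
                new Tok("Acme", null), new Tok("Widgets", null),
                new Tok("Ltd", "companyDesignator"), new Tok("rose", null));
        System.out.println(findCompanies(toks));
    }
}
```

In the rule itself the matched span is bound to the label `companyMatch` and turned into a `NamedEntity` annotation; the sketch only returns the matched text.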
14. Performance Evaluation
- At document level: annotation diff
- At corpus level: corpus benchmark tool, tracking a system's performance over time
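A document-level annotation diff is typically summarised as precision, recall and F-measure between a hand-annotated key set and the system's response set. The sketch below assumes strict matching (same start, end and type) and uses hypothetical names (`Ann`, `score`); it is not GATE's annotation diff tool and does not model partial or lenient matches.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical sketch of strict precision/recall/F1 over annotation sets. */
public class AnnotationDiffSketch {

    /** An annotation identified by its span and type. */
    public static final class Ann {
        public final int start;
        public final int end;
        public final String type;

        public Ann(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Ann)) {
                return false;
            }
            Ann a = (Ann) o;
            return start == a.start && end == a.end && type.equals(a.type);
        }

        @Override
        public int hashCode() {
            return Objects.hash(start, end, type);
        }
    }

    /** Returns {precision, recall, F1}, counting only exact matches as correct. */
    public static double[] score(Set<Ann> key, Set<Ann> response) {
        Set<Ann> correct = new HashSet<>(response);
        correct.retainAll(key); // annotations present in both sets
        double p = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double r = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        Set<Ann> key = new HashSet<>();
        key.add(new Ann(0, 4, "Organization"));
        key.add(new Ann(10, 15, "Person"));
        Set<Ann> response = new HashSet<>();
        response.add(new Ann(0, 4, "Organization"));
        response.add(new Ann(20, 25, "Organization"));
        double[] prf = score(key, response);
        System.out.println("P=" + prf[0] + " R=" + prf[1] + " F1=" + prf[2]);
    }
}
```

Running the same scoring over every document in a corpus, and storing the figures per software version, is one simple way to track performance over time at corpus level.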
15. Regression Testing: Corpus Benchmark Tool
16. The Semantic Web and GATE
- GATE is being used for development of (semi-)automatic methods for:
- linking web pages to Ontologies using Information Extraction
- learning and evolving Ontologies via IE and lexical semantic network traversal.
17. Populating Ontologies with IE
18. Protégé and Ontology Management
19. Information Retrieval Support: based on the Lucene IR engine
20. Editing Multilingual Data
- GATE Unicode Kit (GUK)
- Java provides no special support for text input (this may change)
- Support for defining additional Input Methods (IMs)
- Currently 30 IMs for 17 languages
- Pluggable in other applications
21. Processing Multilingual Data
- All the visualisation and editing tools for multilingual (ML) LRs use enhanced Java facilities
22. Dialogue Systems
- GATE is being used in the Amities project for automating call centres
- Creation of dialogue processing server components to run in the Galaxy Communicator architecture
- Easy adaptation of the portable IE components to work on noisy ASR output
- Robustness and speed of GATE components are vital for real-time dialogue systems
23. The MUMIS project
- Multimedia Indexing and Searching Environment
- Composite index of a multimedia programme from multiple sources in different languages
- ASR, video processing, information extraction (Dutch, English, German), merging, user interface
- University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
- Yorick Wilks, Hamish Cunningham, Horacio Saggion, Kalina Bontcheva, Diana Maynard, Oana Hamza, Cristian Ursu
24. The Whole Picture
- [Diagram. Labels: Ontology, Lexicon; IE for DE, NL and EN; Formal Text; Final Annotations; Text Sources; Video / Audio Signal; Speech Signals; ASR; Transcriptions; Multimedia Data Base; User Interface; Query; Results]
25. User Interface
26. Play
27. Conclusion
- GATE: an infrastructure that lowers the overhead of creating and embedding robust NLP components
- Further information: http://gate.ac.uk/
- Online demos, tutorials and documentation
- Software downloads
- Talks and papers