Title:
1- GATE, a General Architecture for Text Engineering
- Hamish Cunningham, Kalina Bontcheva
- Department of Computer Science, University of Sheffield
- Wednesday October 30th 2002
- Next generation web
- GATE, language technology infrastructure
1(20)
2- A Ubiquitous Permeable Web
- The next generation of the web must be:
  - ubiquitous: semantics for every device, every organisation, every individual
  - permeable: allow contextual data to penetrate and persist
  - companionable: able to engage with us via multiple natural modalities.
- Roles for Language Technology:
  - discovery of semantics (ubiquity)
  - mediating between context and personal semantic memories (permeability)
  - conversing with people and the semantic web (companionableness).
3- Critical Mass for the Semantic Web
- The SW: machine-processable, repurposable data to complement hypertext
- But semantics are 0.0000000... of the Web
- How to achieve critical mass? Huge scale automatic annotation. Requirements:
  - Huge scale; freely available to all EU citizens; distributed (over a Grid); re-purposeable (delivered as Web Services)
  - Portability and robustness via: simple and therefore shallow HLT methods; +ve and -ve learning; analogs of IPSEs for computer-literate users
4- Motivation for Software Infrastructure for Language Engineering
- Need for scalable, reusable, and portable HLT solutions
- Support for large data, in multiple media, languages, formats, and locations
- Lowering the cost of creation of new language processing components
- Promoting quantitative evaluation metrics via tools and a level playing field
5- Motivation (II): software lifecycle in collaborative research
- Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.
- Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.
- Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.
- Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").
- Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).
6- GATE, a General Architecture for Text Engineering
- An architecture: a macro-level organisational picture for LE software systems.
- A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
- A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
- Some free components... ...and wrappers for other people's components
- Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
- Free software (LGPL). Download at http://gate.ac.uk/download/
7- Architectural principles
- Non-prescriptive, theory neutral (strength and weakness)
- Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka)
- (Almost) everything is a component, and component sets are user-extendable
- Component-based development
  - An OO way of chunking software: Java Beans
  - GATE components: CREOLE, modified Java Beans (Collection of REusable Objects for Language Engineering)
  - The minimal component: 10 lines of Java, 10 lines of XML, 1 URL.
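The bean-style component idea can be illustrated with a minimal, self-contained Java sketch. It does not use the GATE class library: `ToyComponent`, `TokenCounter`, `ToyRegistry` and `countTokens` are hypothetical names invented for illustration, and the registry is only a stand-in for the XML metadata and URL that would describe a real component and where to load it from.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

final class ComponentDemo {

    /** Hypothetical component contract; not the GATE API. */
    interface ToyComponent {
        void execute();
    }

    /** Bean-style component: no-arg constructor, a settable property, execute(). */
    static final class TokenCounter implements ToyComponent {
        private String text = "";
        private int count;

        public void setText(String text) { this.text = text; }
        public int getCount() { return count; }

        @Override
        public void execute() {
            // Count whitespace-separated tokens in the configured text.
            String trimmed = text.trim();
            count = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    /** Toy registry: maps a declared name to a factory, as component metadata might. */
    static final class ToyRegistry {
        private final Map<String, Supplier<ToyComponent>> factories = new LinkedHashMap<>();

        void register(String name, Supplier<ToyComponent> factory) {
            factories.put(name, factory);
        }

        ToyComponent create(String name) {
            Supplier<ToyComponent> factory = factories.get(name);
            if (factory == null) {
                throw new IllegalArgumentException("Unknown component: " + name);
            }
            return factory.get();
        }
    }

    /** Convenience wrapper: create the counter by name, configure it, run it. */
    static int countTokens(String text) {
        ToyRegistry registry = new ToyRegistry();
        registry.register("tokenCounter", TokenCounter::new);
        TokenCounter counter = (TokenCounter) registry.create("tokenCounter");
        counter.setText(text);
        counter.execute();
        return counter.getCount();
    }

    public static void main(String[] args) {
        System.out.println(countTokens("GATE is a General Architecture for Text Engineering"));
    }
}
```

The point of the sketch is the shape: a component is a small class with properties and an `execute()` method, and the code that uses it only needs a name to obtain an instance.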
8- GATE Language Resources
- GATE LRs are documents, ontologies, corpora, lexicons
- Documents / corpora
  - GATE documents loaded from local files or the web...
  - Diverse document formats: text, HTML, XML, email, RTF, SGML.
- Processing Resources
  - Algorithmic components known as PRs: beans with execute methods.
  - All PRs can handle Unicode data by default.
  - Clear distinction between code and data (simple repurposing).
  - 20-30 freebies with GATE
  - e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
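The separation between language resources (data) and processing resources (code) can be sketched as follows. This is an illustrative model only, not the GATE API: `ToyDocument`, `Annotation`, `tokenise`, `tagUpperInitial` and `run` are hypothetical names, and the document is assumed to carry stand-off annotations as typed character spans.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PipelineDemo {

    /** Stand-off annotation: a typed span over the document text. */
    static final class Annotation {
        final String type;
        final int start;
        final int end;

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Data side: text plus annotations; it holds no processing logic. */
    static final class ToyDocument {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        ToyDocument(String text) { this.text = text; }
    }

    /** Code side: marks each run of non-space characters as a Token. */
    static void tokenise(ToyDocument doc) {
        Matcher m = Pattern.compile("\\S+").matcher(doc.text);
        while (m.find()) {
            doc.annotations.add(new Annotation("Token", m.start(), m.end()));
        }
    }

    /** Code side: marks tokens that start with an upper-case letter. */
    static void tagUpperInitial(ToyDocument doc) {
        List<Annotation> found = new ArrayList<>();
        for (Annotation a : doc.annotations) {
            if (a.type.equals("Token") && Character.isUpperCase(doc.text.charAt(a.start))) {
                found.add(new Annotation("UpperInitial", a.start, a.end));
            }
        }
        doc.annotations.addAll(found);
    }

    /** Counts annotations of one type on a document. */
    static long count(ToyDocument doc, String type) {
        return doc.annotations.stream().filter(a -> a.type.equals(type)).count();
    }

    /** Runs each processing step, in order, over a fresh document. */
    static ToyDocument run(String text, List<Consumer<ToyDocument>> pipeline) {
        ToyDocument doc = new ToyDocument(text);
        for (Consumer<ToyDocument> step : pipeline) {
            step.accept(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Consumer<ToyDocument>> pipeline =
            Arrays.asList(PipelineDemo::tokenise, PipelineDemo::tagUpperInitial);
        ToyDocument doc = run("Hamish works in Sheffield", pipeline);
        System.out.println(count(doc, "Token") + " tokens, "
            + count(doc, "UpperInitial") + " upper-initial");
    }
}
```

Because the steps only read and add annotations, the same pipeline can be pointed at a different document, or the same document handed to a different pipeline, without changing either side.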
9- A Language Analysis Example
- [Diagram; labels recoverable from the figure: ANNIE; POS tagger; Named entity; Coreference; Event extraction; Custom application 1; Document content; Document metadata; Document format data; Linguistic data; Relational Database (Oracle / PostgreSQL); File storage]
11- Building IE Components in GATE (1)
- The ANNIE system: a reusable and easily extendable set of components
12- Building IE Components in GATE (2)
- JAPE: a Java Annotation Patterns Engine
  - Light, robust regular-expression-based processing
  - Cascaded finite state transduction
  - Low-overhead development of new components

    Rule: Company1
    Priority: 25
    (
      ( {Token.orthography == upperInitial} )+
      {Lookup.kind == companyDesignator}
    ):companyMatch
    -->
    :companyMatch.NamedEntity = { kind = company, rule = Company1 }
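To make the Company1 pattern concrete, here is a rough plain-Java emulation of what the rule looks for: one or more upper-initial tokens followed by a company designator. It is a sketch under simplifying assumptions, not how JAPE is implemented: tokens are split on whitespace, the `DESIGNATORS` set is a hypothetical stand-in for gazetteer `Lookup` annotations, and rule priority and JAPE's matching styles are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CompanyRuleDemo {

    // Hypothetical gazetteer entries standing in for Lookup.kind == companyDesignator.
    static final Set<String> DESIGNATORS =
        new HashSet<>(Arrays.asList("Ltd", "Inc", "plc", "GmbH"));

    /** Stand-in for Token.orthography == upperInitial. */
    static boolean isUpperInitial(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    /** Returns each span of one or more upper-initial tokens followed by a designator. */
    static List<String> findCompanies(String text) {
        String trimmed = text.trim();
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        List<String> matches = new ArrayList<>();
        for (int d = 1; d < tokens.length; d++) {
            if (!DESIGNATORS.contains(tokens[d])) {
                continue;
            }
            // Walk left over the maximal run of upper-initial tokens before the designator.
            int start = d;
            while (start > 0 && isUpperInitial(tokens[start - 1])) {
                start--;
            }
            if (start < d) {
                matches.add(String.join(" ", Arrays.copyOfRange(tokens, start, d + 1)));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findCompanies("She joined Acme Widgets Ltd in May"));
    }
}
```

In the JAPE rule the matched span is bound to `companyMatch` and annotated as a `NamedEntity`; in this sketch the matched text is simply returned as a string.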
13- Performance Evaluation
- At document level: annotation diff
- At corpus level: corpus benchmark tool, tracking system performance over time
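A document-level comparison of this kind can be sketched as a set comparison between a hand-annotated key and a system response, scored with precision, recall and F1. The slides do not say which measures the GATE tools report, so the strict matching rule used here (same type, same start and end offsets) and the names `Span` and `score` are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class AnnotationDiffDemo {

    /** A typed span; two spans match only if type and both offsets are equal. */
    static final class Span {
        final String type;
        final int start;
        final int end;

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Span)) {
                return false;
            }
            Span other = (Span) o;
            return type.equals(other.type) && start == other.start && end == other.end;
        }

        @Override
        public int hashCode() {
            return Objects.hash(type, start, end);
        }
    }

    /** Returns {precision, recall, f1} for strict matching of response against key. */
    static double[] score(Set<Span> key, Set<Span> response) {
        Set<Span> correct = new HashSet<>(response);
        correct.retainAll(key);
        double p = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double r = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        Set<Span> key = new HashSet<>(Arrays.asList(
            new Span("Person", 0, 6), new Span("Organization", 16, 25),
            new Span("Location", 30, 39)));
        Set<Span> response = new HashSet<>(Arrays.asList(
            new Span("Person", 0, 6), new Span("Organization", 16, 24),
            new Span("Location", 30, 39), new Span("Date", 45, 49)));
        double[] s = score(key, response);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", s[0], s[1], s[2]);
    }
}
```

Running the same scoring over every document in a corpus, and storing the figures per system version, gives the kind of tracking over time that the corpus-level bullet describes.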
14- The Semantic Web and GATE
- GATE is being used for development of (semi-)automatic methods for:
  - linking web pages to Ontologies using Information Extraction
  - learning and evolving Ontologies via IE and lexical semantic network traversal.
15- Populating Ontologies with IE
16- Protégé and Ontology Management
17- Information Retrieval Support
- Based on the Lucene IR engine
18- Processing Multilingual Data
- All the visualisation and editing tools for ML LRs use enhanced Java facilities
19- Applications
- GATE has been used for a variety of applications, including:
  - MUMIS: automatic creation of semantic indexes for multimedia programme material
  - MUSE: a multi-genre IE system
  - Metadata for Medline (at Merck)
  - ACE: participation in the Automatic Content Extraction programme
  - HSE: summarisation of health and safety information from company reports
  - OldBaileyIE: NE recognition on 17th century Old Bailey Court reports.
  - AKT: language technology in knowledge management
  - AMITIES: call centre automation
  - Various Medical Informatics and database technology projects
  - IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian this autumn)
20- Conclusion
- GATE: an infrastructure that lowers the overhead of creating and embedding robust NLP components
- Further information: http://gate.ac.uk/
  - Online demos, tutorials and documentation
  - Software downloads
  - Talks and papers