1
Introduction to ANNIE
http://gate.ac.uk/ http://nlp.shef.ac.uk/
  • Diana Maynard
  • University of Sheffield
  • March 2004

2
What is ANNIE?
  • ANNIE is a vanilla information extraction system
    comprising a set of core processing resources (PRs):
  • Tokeniser
  • Sentence Splitter
  • POS tagger
  • Gazetteers
  • Semantic tagger (JAPE transducer)
  • Orthomatcher (orthographic coreference)

3
ANNIE Pipeline

4
Other Processing Resources
  • There are also lots of additional processing
    resources which are not part of ANNIE itself but
    which come with the default installation of GATE
  • Gazetteer collector
  • PRs for Machine Learning
  • Various exporters
  • Annotation set transfer
  • etc.

5
Creating a new application from ANNIE
  • Typically a new application will use most of the
    core components from ANNIE
  • The tokeniser, sentence splitter and orthomatcher
    are largely language-, domain- and
    application-independent
  • The POS tagger is language-dependent but domain-
    and application-independent
  • The gazetteer lists and JAPE grammars may act as
    a starting point but will almost certainly need
    to be modified
  • You may also require additional PRs (either
    existing or new ones)

6
Modifying gazetteers
  • Gazetteers are plain text files containing lists
    of names
  • Each gazetteer set has an index file listing all
    the lists, plus features of each list (majorType,
    minorType and language)
  • Lists can be modified either internally using
    Gaze, or externally in your favourite editor
  • Gazetteers can also be mapped to ontologies
  • To use Gaze and the ontology editor, you need to
    download the relevant creole files

7
JAPE grammars
  • A semantic tagger consists of a set of rule-based
    JAPE grammars run sequentially
  • JAPE is a pattern-matching language
  • The LHS of each rule contains patterns to be
    matched
  • The RHS contains details of annotations (and
    optionally features) to be created
  • More complex rules can also be created

8
Input specifications
  • The head of each grammar phase needs to contain
    certain information
  • Phase name
  • Inputs
  • Matching style
  • e.g.
  • Phase: location
  • Input: Token Lookup Number
  • Options: control = appelt

9
Matching algorithms and Rule Priority
  • 3 styles of matching
  • Brill (fire every rule that applies)
  • First (shortest rule fires)
  • Appelt (use of priorities)
  • Appelt priority is applied in the following order
  • Starting point of a pattern
  • Longest pattern
  • Explicit priority (default -1)
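
The three styles can be compared with a small Python sketch. It is not GATE's implementation: the `Match` record and the three functions are hypothetical, and each function only mirrors the one-line description on this slide (ties are broken by list order, which is an assumption).

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    rule: str
    start: int          # offset where the matched pattern begins
    end: int            # offset where it ends
    priority: int = -1  # explicit priority; -1 when the rule declares none


def brill_choice(candidates: List[Match]) -> List[Match]:
    """Brill: every rule that applies fires."""
    return list(candidates)


def first_choice(candidates: List[Match]) -> Match:
    """First: among the earliest-starting matches, the shortest one fires."""
    return min(candidates, key=lambda m: (m.start, m.end - m.start))


def appelt_choice(candidates: List[Match]) -> Match:
    """Appelt: earliest start, then longest pattern, then highest priority."""
    return min(candidates,
               key=lambda m: (m.start, -(m.end - m.start), -m.priority))


# Hypothetical candidate matches over the same stretch of text.
candidates = [
    Match("A", 0, 5),
    Match("B", 0, 9, priority=10),
    Match("C", 0, 9, priority=25),
    Match("D", 3, 20, priority=100),
]

print(appelt_choice(candidates).rule)
print(first_choice(candidates).rule)
print(len(brill_choice(candidates)))
```

Rule D has the highest priority but starts later, so it loses under Appelt; B and C tie on start and length, and only then does the explicit priority decide.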

10
NE Rule in JAPE
  Rule: Company1
  Priority: 25
  (
    ( {Token.orthography == upperInitial} )   // from tokeniser
    {Lookup.kind == companyDesignator}        // from gazetteer lists
  ):match
  -->
  :match.NamedEntity = { kind = company, rule = Company1 }

11
LHS of the rule
  • LHS is expressed in terms of existing
    annotations, and optionally features and their
    values
  • Any annotation to be used must be included in the
    input header
  • Any annotation not included in the input header
    will be ignored (e.g. whitespace)
  • Each annotation is enclosed in curly braces
  • Each pattern to be matched is enclosed in round
    brackets and has a label attached

12
Macros
  • Macros look like the LHS of a rule but have no
    label
  • Macro: NUMBER
  • (({Digit}))
  • They are used in rules by enclosing the macro
    name in round brackets
  • ( (NUMBER) ):match
  • Conventional to name macros in uppercase letters
  • Macros hold across an entire set of grammar phases

13
Contextual information
  • Contextual information can be specified in the
    same way, but has no label
  • Contextual information will be consumed by the
    rule
  • ({Annotation1})
  • ({Annotation2}):match
  • ({Annotation3})
  • -->

14
RHS of the rule
  • LHS and RHS are separated by -->
  • Label matches that on the LHS
  • Annotation to be created follows the label
  • ({Annotation1}):match
  • --> :match.NE = {feature1 = value1, feature2 = value2}

15
Using phases
  • Grammars usually consist of several phases, run
    sequentially
  • With Appelt matching, only one rule within a single
    phase can fire at a given position in the text
  • Temporary annotations may be created in early
    phases and used as input for later phases
  • Annotations from earlier phases may need to be
    combined or modified
  • A definition phase (conventionally called
    main.jape) lists the phases to be used, in order
  • Only the definition phase needs to be loaded

16
More complex JAPE rules
  • Any Java code can be used on the RHS of a rule
  • This is useful for e.g. feature percolation,
    ontology population, accessing information not
    readily available, comparing feature values,
    deleting existing annotations etc.
  • There are examples of these in the user guide and
    in the ANNIE NE grammars
  • Most JAPE rules end up being complex!

17
Using JAPE for other tasks
  • JAPE grammars are not just useful for NE
    annotation
  • They can be a quick and easy way of performing
    any kind of task where patterns can be easily
    recognised and a finite-state approach is
    possible, e.g. transforming one style of markup
    into another, deriving features for the learning
    algorithms

18
Example rule for deriving features
  Rule: Entity
  (
    {Gpe} | {Organization} | {Person} | {Location} | {Facility}
  ):entity
  -->
  {
    gate.AnnotationSet entityAS =
        (gate.AnnotationSet)bindings.get("entity");
    gate.Annotation entityAnn =
        (gate.Annotation)entityAS.iterator().next();
    gate.FeatureMap features = Factory.newFeatureMap();
    features.put("type", entityAnn.getType());
    outputAS.add(entityAnn.getStartNode(),
                 entityAnn.getEndNode(),
                 "Entity", features);
  }
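
The effect of this rule can be restated as a plain-Python sketch. It does not use the GATE API: annotations are modelled as dictionaries, and `add_entity_annotations` is a hypothetical helper written only to show what the rule produces.

```python
from typing import Dict, List

ENTITY_TYPES = {"Gpe", "Organization", "Person", "Location", "Facility"}


def add_entity_annotations(annotations: List[Dict]) -> List[Dict]:
    """For each matched entity, create an 'Entity' annotation over the same
    span whose 'type' feature records the original annotation type."""
    output = []
    for ann in annotations:
        if ann["type"] in ENTITY_TYPES:
            output.append({
                "type": "Entity",
                "start": ann["start"],
                "end": ann["end"],
                "features": {"type": ann["type"]},
            })
    return output


# Hypothetical input annotations.
annotations = [
    {"type": "Person", "start": 0, "end": 8},
    {"type": "Token", "start": 0, "end": 2},
    {"type": "Location", "start": 12, "end": 21},
]
derived = add_entity_annotations(annotations)
print(derived)
```

A learning algorithm can then read a single annotation type, Entity, and treat the original type as an ordinary feature value.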

19
Finding Examples
  • ANNIE for default NE rules
  • gate/src/gate/resources/creole/NEtransducer/NE/
  • MUSE for more complex NE rules
  • muse/src/muse/resources/grammar/main
  • h-TechSight for ontology population
  • htechsight/application/grammar
  • Various other applications generally follow the
    format
  • projectname/application/grammar/

20
Conclusion
  • This talk: http://gate.ac.uk/sale/talks/annie-tutorial.ppt
  • More information: http://gate.ac.uk/