Exploration 2000 - PowerPoint PPT Presentation

About This Presentation
Title:

Exploration 2000

Description:

Philadelphia, 12-15 December 2000. Towards a Generic Framework for Language Resource Annotation ... EAGLES. But also... ISO (MARTIF (DTD-oriented), TMF), OLIF, ...


Transcript and Presenter's Notes

Title: Exploration 2000


1
Towards a Generic Framework for Language Resource Annotation
  • Nancy Ide
  • Department of Computer Science
  • Vassar College
  • Laurent Romary
  • Equipe Langue et Dialogue
  • Laboratoire LORIA

2
Objectives
  • Provide a framework for annotation of language
    resources to enable
  • Ease of creation and use
  • Inter-operability between/across annotation
    schemes
  • Exploit and adapt to the developing WWW
    environment
  • XML (schemas, transformations), RDF

3
What do we need?
  • A theory- and tagset-independent annotation scheme
  • Differentiate structure and data categories
  • A DTD-independent annotation scheme?
  • Provide as much abstraction as possible
  • Then map onto specific annotation modules
  • Towards a generic LR annotation framework

4
Syntactic annotation as a testbed
  • An abstract XML representation
  • Basic structural skeleton
  • Hierarchy of <struct> elements
  • Pointers to underlying data
  • <seg> element + XLink
  • Associated characteristics
  • Features (<feat>) within the associated <struct>
  • Identification of dependencies
  • <rel>
  • Categorization of levels
  • A generic "type" attribute

5
Example - Penn Treebank
((S (NP-SBJ-1 Jones)
    (VP followed
        (NP him)
        (PP-DIR into
                (NP the front room))
        ,
        (S-ADV (NP-SBJ *-1)
               (VP closing
                   (NP the door)
                   (PP behind
                       (NP him)))))
    .))
  • NP-SBJ-1: categorial information "NP"; relational information "SBJ" (implicit head); node identification "1"
  • (NP him): categorial information "NP"; relational information: implicit "OBJ"
  • (NP-SBJ *-1): categorial information "NP"; relational information "SBJ" (implicit head); node reference

6
<struct xml:base="http://www.loria.fr/doc.xml">
  <struct id="s0">
    <feat type="CAT">S</feat>
    <struct id="s1">
      <feat type="CAT">NP</feat>
      <rel type="SBJ" head="s2"/>
      <seg xlink:href="xptr(substring(/p/s1/text(),1,5))"/>
    </struct>
    <struct id="s2">
      <feat type="CAT">VP</feat>
      <seg xlink:href="xptr(substring(/p/s1/text(),7,8))"/>
      <struct>
        <feat type="CAT">NP</feat>
        <!-- implicit OBJ relation here -->
        <seg xlink:href="xptr(substring(/p/s1/text(),16,3))"/>
      </struct>
      <struct>
        <feat type="CAT">PP</feat>
        <rel type="DIR" head="s2"/>
        <seg xlink:href="xptr(substring(/p/s1/text(),20,4))"/>
        <struct>
          <feat type="CAT">NP</feat>
          <seg xlink:href="xptr(substring(/p/s1/text(),25,14))"/>
        </struct>
      </struct>
    </struct>
    <!-- remaining constituents are not shown on the slide -->
  </struct>
</struct>
Text spans addressed by the <seg> pointers, in order: Jones; followed; him; into; the front room
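The way the <seg> pointers select text can be illustrated with a small Python sketch. It is not part of the presentation: the helper name resolve_segment and the regular expression are assumptions, and only the single pointer shape used on the slide is handled, with the 1-based (start, length) semantics of the XPath substring() function.

```python
import re

# Hypothetical helper: recognises only pointers of the form
# xptr(substring(<path>/text(),<start>,<length>)) as used on the slide.
POINTER = re.compile(
    r"xptr\(substring\((?P<path>.+?)/text\(\),(?P<start>\d+),(?P<length>\d+)\)\)"
)

def resolve_segment(href: str, text: str) -> str:
    """Return the span of `text` selected by a substring() pointer.

    XPath substring() counts characters from 1, so the start offset is
    shifted by one to get a Python slice.
    """
    match = POINTER.fullmatch(href)
    if match is None:
        raise ValueError(f"unsupported pointer: {href!r}")
    start = int(match.group("start"))
    length = int(match.group("length"))
    return text[start - 1 : start - 1 + length]

# Assumption: /p/s1/text() resolves to this sentence fragment.
SENTENCE = "Jones followed him into the front room"
HREFS = [
    "xptr(substring(/p/s1/text(),1,5))",
    "xptr(substring(/p/s1/text(),7,8))",
    "xptr(substring(/p/s1/text(),16,3))",
    "xptr(substring(/p/s1/text(),20,4))",
    "xptr(substring(/p/s1/text(),25,14))",
]
segments = [resolve_segment(href, SENTENCE) for href in HREFS]
print(segments)
```

Because the annotation only points into the text, the primary data stays untouched and several annotation layers can address the same spans.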
7
Comments
  • A generic element <rel>
  • Several attributes (cf. Carroll, Minnen & Briscoe,
    to appear)
  • type
  • head
  • dependent
  • introducer
  • initial
  • E.g. <rel type="subj" head="w2" dependent="w1"/>

8
Comments - cont.
  • Relations
  • Implicit dependent marking in relations
  • The default is the current <struct> element
  • XPath: ./parent::struct
  • Implicit or explicit head? Explicit
  • <rel> elements can be described externally to the
    syntactic bracketing
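A minimal Python sketch of the default-dependent rule, assuming the abstract representation is held in an ElementTree: a <rel> without a dependent attribute takes the enclosing <struct> as its dependent. The function name explicit_relations and the id "s3" are illustrative assumptions (the slides leave that node unnamed).

```python
import xml.etree.ElementTree as ET

# Reduced version of the Penn Treebank example; "s3" is a made-up id.
ABSTRACT = """
<struct id="s0">
  <struct id="s1">
    <rel type="SBJ" head="s2"/>
  </struct>
  <struct id="s2">
    <struct id="s3">
      <rel type="DIR" head="s2"/>
    </struct>
  </struct>
</struct>
"""

def explicit_relations(root):
    """Return (type, head, dependent) triples with implicit dependents filled in."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    triples = []
    for rel in root.iter("rel"):
        dependent = rel.get("dependent")
        if dependent is None:
            # Default: the enclosing <struct>, i.e. XPath ./parent::struct
            dependent = parent_of[rel].get("id")
        triples.append((rel.get("type"), rel.get("head"), dependent))
    return triples

relations = explicit_relations(ET.fromstring(ABSTRACT))
print(relations)
```

Making the triples explicit in this way is what would allow the relations to be stored separately from the bracketing.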

9
Example (Carroll, Minnen & Briscoe)
Paul intends to leave IBM.
<w id="w1">Paul</w> <w id="w2">intends</w> <w id="w3">to</w>
<w id="w4">leave</w> <w id="w5">IBM</w>
<struct>
  <rel type="subj" head="w2" dependent="w1"/>
  <rel type="xcomp" head="w2" dependent="w4" introducer="w3"/>
  <rel type="subj" head="w4" dependent="w1"/>
  <rel type="dobj" head="w4" dependent="w5"/>
</struct>
  • subj(intend,Paul,_)
  • xcomp(to,intend,leave)
  • subj(leave,Paul,_)
  • dobj(leave,IBM,_)
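The correspondence between the <rel> elements and the relation notation can be sketched in Python as follows. This is an illustration under assumptions, not the authors' tool: the wrapper element <s> and the function name to_relation_strings are invented, word forms are printed instead of lemmas (so "intends" rather than "intend"), and the trailing "_" slot is left out.

```python
import xml.etree.ElementTree as ET

# Assumption: the <w> elements are wrapped in an <s> element so they parse as one document.
WORDS = (
    '<s><w id="w1">Paul</w> <w id="w2">intends</w> <w id="w3">to</w> '
    '<w id="w4">leave</w> <w id="w5">IBM</w></s>'
)
RELS = """
<struct>
  <rel type="subj" head="w2" dependent="w1"/>
  <rel type="xcomp" head="w2" dependent="w4" introducer="w3"/>
  <rel type="subj" head="w4" dependent="w1"/>
  <rel type="dobj" head="w4" dependent="w5"/>
</struct>
"""

def to_relation_strings(words_root, struct_root):
    """Render each <rel> as type(args), looking word ids up in the <w> elements."""
    form = {w.get("id"): w.text for w in words_root.iter("w")}
    rendered = []
    for rel in struct_root.iter("rel"):
        args = [form[rel.get("head")], form[rel.get("dependent")]]
        introducer = rel.get("introducer")
        if introducer is not None:
            # The introducer comes first, as in xcomp(to, intend, leave).
            args.insert(0, form[introducer])
        rendered.append(f"{rel.get('type')}({','.join(args)})")
    return rendered

relations = to_relation_strings(ET.fromstring(WORDS), ET.fromstring(RELS))
print(relations)
```

Since the relations refer to words only by id, the <struct> block can live in a separate document from the tokenized text.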

10
Terminological description
  • Work being conducted within ISO TC37
  • Cf. ISO 12200 (MARTIF - Machine-Readable
    Terminology Interchange Format)
  • Specific format for computerized multilingual
    terminologies
  • TMF - terminological markup framework
  • Future ISO 16642 standard
  • Generic format encompassing any specific
    representation

11
MARTIF example
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>

12
Structural skeleton
TE: Terminological Entry; LS: Language Section; TS: Term Section
TE
  id = ID67 (attribute)
  subjectField = manufacturing (typedElement)
  definition = A value ... (typedElement)
  LS
    lang = en (attribute)
    TS
      term = alpha smoothing factor (element)
      termType = fullForm (typedElement)
  LS
    lang = hu (attribute)
    TS
      term (element)
13
Abstract XML Version
<struct type="TE">
  <feat type="id">ID67</feat>
  <feat type="subjectField">manufacturing</feat>
  <feat type="definition">A value between 0 and 1 used in ...</feat>
  <struct type="LS">
    <feat type="lang">en</feat>
    <struct type="TS">
      <feat type="term">alpha smoothing factor</feat>
      <feat type="termType">fullForm</feat>
    </struct>
  </struct>
  <struct type="LS">
    <feat type="lang">hu</feat>
    <struct type="TS">
      <feat type="term">Alfa ...</feat>
    </struct>
  </struct>
</struct>
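The step from the MARTIF entry to the abstract version can be approximated with a short Python filter. This is a sketch under assumptions rather than the project's actual transformation: the LEVELS table and the function to_abstract are hypothetical, and the input is a shortened copy of the MARTIF example (definition and Hungarian section omitted).

```python
import xml.etree.ElementTree as ET

MARTIF = """
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
</termEntry>
"""

# Structural skeleton: which MARTIF elements correspond to which abstract level.
LEVELS = {"termEntry": "TE", "langSet": "LS", "tig": "TS"}

def to_abstract(elem):
    """Map a structural MARTIF element onto a <struct> with <feat> children."""
    struct = ET.Element("struct", {"type": LEVELS[elem.tag]})
    # Attributes such as id or lang become features.
    for name, value in elem.attrib.items():
        ET.SubElement(struct, "feat", {"type": name}).text = value
    for child in elem:
        if child.tag in LEVELS:
            struct.append(to_abstract(child))
        else:
            # Typed elements carry the data category in @type;
            # plain elements (e.g. <term>) use their tag name.
            category = child.get("type", child.tag)
            ET.SubElement(struct, "feat", {"type": category}).text = child.text
    return struct

abstract = to_abstract(ET.fromstring(MARTIF))
print(ET.tostring(abstract, encoding="unicode"))
```

Only the LEVELS table is specific to MARTIF here; in principle a different concrete format would need a different table but could reuse the same traversal.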

14
Overall knowledge source architecture
15
Virtual AML
  • Compare annotation schemes
  • E.g., Penn Treebank, TALANA, Tiger, Italian
    Treebank, etc.
  • Map onto abstract XML representation
  • <struct>, <feat>, etc.
  • Design generic visualizing, editing, extraction,
    etc. tools

16
Concrete AML
  • Project-specific XML schema
  • Filters to transform to and from abstract XML
    representation
  • Automatically generated from the data category
    description and the structural skeleton

17
Data Categories
  • Description of Data Categories
  • Formal properties
  • E.g. name, definition, levels of occurrence,
    content type (possible values), target type (cf.
    <rel>)
  • Represented as an RDF structure
  • Registered in the Data Category Repository for
    universal reference and use
  • This is the work we have to do
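To make the formal properties concrete, here is a small Python sketch of a data category record and a check of a feature against it. Everything in it is hypothetical: the slides propose an RDF representation and a shared repository, whereas this uses an in-memory dictionary, and the definitions and the permitted value "abbreviation" are invented for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

@dataclass(frozen=True)
class DataCategory:
    name: str
    definition: str
    levels: FrozenSet[str]                    # structural levels where it may occur
    values: Optional[FrozenSet[str]] = None   # None means open content

# Hypothetical stand-in for the Data Category Repository.
REGISTRY = {
    "termType": DataCategory(
        name="termType",
        definition="Form of the term (illustrative wording)",
        levels=frozenset({"TS"}),
        values=frozenset({"fullForm", "abbreviation"}),
    ),
    "lang": DataCategory(
        name="lang",
        definition="Language of a section (illustrative wording)",
        levels=frozenset({"LS"}),
    ),
}

def check_feature(level: str, category: str, value: str) -> List[str]:
    """Return a list of problems for a feature found at a given level."""
    data_category = REGISTRY.get(category)
    if data_category is None:
        return [f"unregistered data category: {category}"]
    problems = []
    if level not in data_category.levels:
        problems.append(f"{category} is not allowed at level {level}")
    if data_category.values is not None and value not in data_category.values:
        problems.append(f"{value!r} is not a permitted value of {category}")
    return problems

print(check_feature("TS", "termType", "fullForm"))
print(check_feature("LS", "termType", "shortForm"))
```

A registry of this kind would let generic tools validate any annotation expressed with <struct> and <feat> without knowing the concrete scheme.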

18
Data Category
19
Existing initiatives
  • DTD-oriented
  • XCES, MATE, TEI
  • Basic lists of Data Categories
  • EAGLES
  • But also
  • ISO (MARTIF(DTD-oriented), TMF), OLIF, etc.

20
Where do we go from here?
  • Possible agenda
  • Create a repository of data categories for
    linguistic representation
  • Organize working groups for specific fields
  • Feedback on data categories
  • Underlying architecture and relations with other
    groups (e.g. syntactic annotation of transcribed
    speech)
  • Collaborate with other bodies
  • Industry, ISO