Title: Exploration 2000
1 Towards a Generic Framework for Language Resource Annotation
- Nancy Ide, Department of Computer Science, Vassar College
- Laurent Romary, Equipe Langue et Dialogue, Laboratoire LORIA
2 Objectives
- Provide a framework for annotation of language resources to enable
  - Ease of creation and use
  - Inter-operability between/across annotation schemes
- Exploit and adapt to the developing WWW environment
  - XML (schemas, transformations), RDF
3 What do we need?
- A theory- and tagset-independent annotation scheme
  - Differentiate structure and data categories
- A DTD-independent annotation scheme?
  - Provide as much abstraction as possible
  - Then map onto specific annotation modules
- Towards a generic LR annotation framework
4 Syntactic annotation as a testbed
- An abstract XML representation
- Basic structural skeleton
  - hierarchy of <struct> elements
- Pointers to underlying data
  - <seg> element, xlink
- Associated characteristics
  - features (<feat>) within associated <struct>
- Identification of dependencies
  - <rel>
- Categorization of levels
  - A generic type attribute
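The element inventory above can be tried out with a few lines of Python. This is a minimal sketch, not the authors' tooling: the helper name `make_struct` is hypothetical, the use of a category feature typed `CAT` and of `xlink:href` on the segment follows the Penn Treebank example in these slides, and the standard XLink namespace URI is assumed.

```python
import xml.etree.ElementTree as ET

# Assumption: the pointers use the standard XLink namespace.
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)

def make_struct(parent=None, ident=None, cat=None, rel=None, seg=None):
    """Create one <struct> node (hypothetical helper).

    cat -> <feat type="CAT">, rel -> <rel type=... head=...>,
    seg -> <seg xlink:href=...> pointing at the underlying data.
    """
    attrs = {"id": ident} if ident else {}
    if parent is None:
        node = ET.Element("struct", attrs)
    else:
        node = ET.SubElement(parent, "struct", attrs)
    if cat is not None:
        ET.SubElement(node, "feat", {"type": "CAT"}).text = cat
    if rel is not None:
        rel_type, head = rel
        ET.SubElement(node, "rel", {"type": rel_type, "head": head})
    if seg is not None:
        ET.SubElement(node, "seg", {"{%s}href" % XLINK: seg})
    return node

# Skeleton for the start of "Jones followed ...": S over NP-SBJ and VP.
s0 = make_struct(ident="s0", cat="S")
make_struct(s0, ident="s1", cat="NP", rel=("SBJ", "s2"),
            seg="xptr(substring(/p/s1/text(),1,5))")
make_struct(s0, ident="s2", cat="VP",
            seg="xptr(substring(/p/s1/text(),7,8))")

print(ET.tostring(s0, encoding="unicode"))
```

Keeping the structure (`struct`), the data categories (`feat`, `rel`) and the pointers (`seg`) in separate arguments mirrors the separation the slide argues for.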
5 Example - Penn Tree Bank
((S (NP-SBJ-1 Jones)
    (VP followed
        (NP him)
        (PP-DIR into
                (NP the front room))
        ,
        (S-ADV (NP-SBJ -1)
               (VP closing
                   (NP the door)
                   (PP behind
                       (NP him)))))
    .))
- NP-SBJ-1: categorial information NP, relational information SBJ (implicit head), node identification 1
- (NP him): categorial information NP, relational information implicit OBJ
- (NP-SBJ -1): categorial information NP, relational information SBJ (implicit head), node reference
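The three kinds of information packed into a Penn Treebank node label can be pulled apart mechanically. The following is a rough sketch under simplifying assumptions: the function name `split_label` is hypothetical, every numeric suffix is treated as a node index and every other suffix as a relational tag, and special labels such as `-NONE-` are not handled.

```python
def split_label(label):
    """Split a label such as 'NP-SBJ-1' into category, functions and index."""
    parts = label.split("-")
    suffixes = parts[1:]
    return {
        "cat": parts[0],                                    # categorial information
        "functions": [p for p in suffixes if not p.isdigit()],  # relational information
        "index": next((p for p in suffixes if p.isdigit()), None),  # node identification
    }

for label in ("NP-SBJ-1", "PP-DIR", "S-ADV", "VP"):
    print(label, split_label(label))
```

In the abstract representation these would end up in different places: the category in a `<feat>`, the function in a `<rel>`, and the index in an identifier.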
6
<struct xml:base="http://www.loria.fr/doc.xml">
  <struct id="s0">
    <feat type="CAT">S</feat>
    <struct id="s1">
      <feat type="CAT">NP</feat>
      <rel type="SBJ" head="s2"/>
      <seg xlink:href="xptr(substring(/p/s1/text(),1,5))"/>
    </struct>
    <struct id="s2">
      <feat type="CAT">VP</feat>
      <seg xlink:href="xptr(substring(/p/s1/text(),7,8))"/>
      <struct>
        <feat type="CAT">NP</feat>
        <!-- implicit OBJ relation here -->
        <seg xlink:href="xptr(substring(/p/s1/text(),16,3))"/>
      </struct>
      <struct>
        <feat type="CAT">PP</feat>
        <rel type="DIR" head="s2"/>
        <seg xlink:href="xptr(substring(/p/s1/text(),20,4))"/>
        <struct>
          <feat type="CAT">NP</feat>
          <seg xlink:href="xptr(substring(/p/s1/text(),25,14))"/>
        </struct>
      </struct>
    </struct>
Text spans addressed by the <seg> pointers, in order: Jones, followed, him, into, the front room
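The offsets in the pointers can be checked against the sentence. This sketch assumes that the text node addressed by the path holds the string "Jones followed him into the front room"; it ignores the path itself and only applies the start and length arguments, using the 1-based counting of the XPath substring() function. The names `POINTER` and `resolve` are hypothetical.

```python
import re

# Matches substring(<path>,<start>,<length>) inside an xptr(...) pointer.
POINTER = re.compile(r"substring\((?P<path>[^,]+),(?P<start>\d+),(?P<length>\d+)\)")

def resolve(pointer, text):
    """Return the span of `text` selected by a substring() pointer.

    XPath counts characters from 1, Python slices from 0, hence the shift.
    """
    match = POINTER.search(pointer)
    if match is None:
        raise ValueError("unsupported pointer: " + pointer)
    start, length = int(match["start"]), int(match["length"])
    return text[start - 1:start - 1 + length]

text = "Jones followed him into the front room"  # assumed content of the text node
pointers = [
    "xptr(substring(/p/s1/text(),1,5))",
    "xptr(substring(/p/s1/text(),7,8))",
    "xptr(substring(/p/s1/text(),16,3))",
    "xptr(substring(/p/s1/text(),20,4))",
    "xptr(substring(/p/s1/text(),25,14))",
]
spans = [resolve(p, text) for p in pointers]
print(spans)
```

Because the annotation only points into the text, the primary data stays untouched and several annotation layers can address the same spans.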
7 Comments
- A generic element <rel>
- Several attributes (cf. Carroll, Minnen & Briscoe, to appear)
  - type
  - head
  - dependent
  - introducer
  - initial
- E.g. <rel type="subj" head="w2" dependent="w1"/>
8 Comments - cont.
- Relations
  - Implicit dependent marking in relations
    - The default is the current struct element
    - XPath: ./parent::struct
  - Implicit or explicit head? Explicit
  - <rel> elements can be described externally to the syntactic bracketing
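The default for the dependent can be made explicit by a small resolution step. The sketch below is an illustration under assumptions, not a reference implementation: the function name `relations` and the toy document are hypothetical, and the parent lookup plays the role of the `./parent::struct` step, since Python's ElementTree keeps no parent links.

```python
import xml.etree.ElementTree as ET

# Toy document (hypothetical): one relation nested in its dependent,
# one relation stated externally with an explicit dependent.
doc = ET.fromstring("""
<struct id="s0">
  <struct id="s1">
    <rel type="SBJ" head="s2"/>
  </struct>
  <struct id="s2"/>
  <struct id="s3"/>
  <rel type="DIR" head="s2" dependent="s3"/>
</struct>
""")

def relations(root):
    """Yield (type, head, dependent) triples, filling in implicit dependents."""
    # Build a child -> parent map to emulate ./parent::struct.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for rel in root.iter("rel"):
        dependent = rel.get("dependent") or parent_of[rel].get("id")
        yield rel.get("type"), rel.get("head"), dependent

print(list(relations(doc)))
```

The head is always read from the attribute, in line with the choice of explicit heads on the slide.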
9 Example (Carroll, Minnen & Briscoe)
Paul intends to leave IBM.
<w id="w1">Paul</w> <w id="w2">intends</w> <w id="w3">to</w> <w id="w4">leave</w> <w id="w5">IBM</w>
<struct>
  <rel type="subj" head="w2" dependent="w1"/>
  <rel type="xcomp" head="w2" dependent="w4" introducer="w3"/>
  <rel type="subj" head="w4" dependent="w1"/>
  <rel type="dobj" head="w4" dependent="w5"/>
</struct>
- subj(intend,Paul,_)
- xcomp(to,intend,leave)
- subj(leave,Paul,_)
- dobj(leave,IBM,_)
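The correspondence between the `<rel>` elements and the functional notation can be sketched as a small conversion. Several things here are assumptions made for the illustration: the wrapping `<doc>` element and the function name `to_relations` are hypothetical, word forms are used instead of lemmas (so "intends" rather than "intend"), and the empty `initial` slot written as `_` in the slide is left out.

```python
import xml.etree.ElementTree as ET

SOURCE = """
<doc>
  <w id="w1">Paul</w> <w id="w2">intends</w> <w id="w3">to</w>
  <w id="w4">leave</w> <w id="w5">IBM</w>
  <struct>
    <rel type="subj" head="w2" dependent="w1"/>
    <rel type="xcomp" head="w2" dependent="w4" introducer="w3"/>
    <rel type="subj" head="w4" dependent="w1"/>
    <rel type="dobj" head="w4" dependent="w5"/>
  </struct>
</doc>
"""

def to_relations(xml_text):
    """Render each <rel> as type(introducer?, head, dependent) over word forms."""
    root = ET.fromstring(xml_text)
    words = {w.get("id"): w.text for w in root.iter("w")}
    rendered = []
    for rel in root.iter("rel"):
        args = [words[rel.get("head")], words[rel.get("dependent")]]
        if rel.get("introducer"):
            # The introducer comes first, as in xcomp(to, intend, leave).
            args.insert(0, words[rel.get("introducer")])
        rendered.append("%s(%s)" % (rel.get("type"), ", ".join(args)))
    return rendered

for line in to_relations(SOURCE):
    print(line)
```

Since the relations refer to the words by identifier only, the same conversion works whether the `<rel>` elements sit inside the bracketing or in a separate document.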
10 Terminological description
- Work being conducted within ISO TC37
- Cf. ISO 12200 (MARTIF - Machine-Readable Terminology Interchange Format)
  - Specific format for computerized multilingual terminologies
- TMF - terminological markup framework
  - Future ISO 16642 standard
  - Generic format encompassing any specific representation
11 MARTIF example
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
12 Structural skeleton
TE = Terminological Entry, LS = Language Section, TS = Term Section
TE
  id = ID67 (attribute)
  subjectField = manufacturing (typedElement)
  definition = A value (typedElement)
  LS
    lang = en (attribute)
    TS
      term = alpha smoothing factor (element)
      termType = fullForm (typedElement)
  LS
    lang = hu (attribute)
    TS
      term (element)
13 Abstract XML Version
<struct type="TE">
  <feat type="id">ID67</feat>
  <feat type="subjectField">manufacturing</feat>
  <feat type="definition">A value between 0 and 1 used in ...</feat>
  <struct type="LS">
    <feat type="lang">en</feat>
    <struct type="TS">
      <feat type="term">alpha smoothing factor</feat>
      <feat type="termType">fullForm</feat>
    </struct>
  </struct>
  <struct type="LS">
    <feat type="lang">hu</feat>
    <struct type="TS">
      <feat type="term">Alfa ...</feat>
    </struct>
  </struct>
</struct>
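The step from the MARTIF entry to the abstract version is regular enough to be written as a short filter. The following is only a sketch of one way to do it: the names `LEVELS` and `to_abstract` are hypothetical, and the rule that a typed element contributes its `type` attribute while a plain element contributes its tag name is an assumption read off the two examples.

```python
import xml.etree.ElementTree as ET

MARTIF = """
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
"""

# Structural skeleton: MARTIF elements that become <struct> levels.
LEVELS = {"termEntry": "TE", "langSet": "LS", "tig": "TS"}

def to_abstract(elem):
    """Map one structural MARTIF element onto <struct>/<feat>."""
    struct = ET.Element("struct", {"type": LEVELS[elem.tag]})
    for name, value in elem.attrib.items():          # attributes -> features
        ET.SubElement(struct, "feat", {"type": name}).text = value
    for child in elem:
        if child.tag in LEVELS:                      # nested structural level
            struct.append(to_abstract(child))
        else:                                        # element or typed element
            feat_type = child.get("type", child.tag)
            ET.SubElement(struct, "feat", {"type": feat_type}).text = child.text
    return struct

abstract = to_abstract(ET.fromstring(MARTIF))
print(ET.tostring(abstract, encoding="unicode"))
```

The filter never inspects which data categories it is moving, which is the point of separating the structural skeleton from the categories.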
14 Overall knowledge source architecture
15 Virtual AML
- Compare annotation schemes
  - E.g., Penn Treebank, TALANA, Tiger, Italian Treebank, etc.
- Map onto abstract XML representation
  - <struct>, <feat>, etc.
- Design generic visualizing, editing, extraction, etc. tools
16 Concrete AML
- Project-specific XML schema
- Filters to transform to and from abstract XML representation
  - Automatically generated from the data category description and the structural skeleton
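A filter from the abstract representation back to a concrete format can be driven entirely by two tables, which is one way to read "automatically generated". The sketch below uses the terminology example from these slides; the table names `SKELETON` and `CATEGORIES`, the function `to_concrete`, and the exact shape of the tables are assumptions made for the illustration, while the styles (attribute, element, typedElement) and the carrier elements `descrip` and `termNote` are taken from the MARTIF example.

```python
import xml.etree.ElementTree as ET

# Structural skeleton: abstract level -> concrete element name.
SKELETON = {"TE": "termEntry", "LS": "langSet", "TS": "tig"}

# Data category description: category -> (style, carrier element for typed elements).
CATEGORIES = {
    "id": ("attribute", None),
    "lang": ("attribute", None),
    "term": ("element", None),
    "subjectField": ("typedElement", "descrip"),
    "definition": ("typedElement", "descrip"),
    "termType": ("typedElement", "termNote"),
}

def to_concrete(struct):
    """Rebuild a concrete element from one abstract <struct>."""
    elem = ET.Element(SKELETON[struct.get("type")])
    for child in struct:
        if child.tag == "struct":
            elem.append(to_concrete(child))
            continue
        name = child.get("type")
        style, carrier = CATEGORIES[name]
        if style == "attribute":
            elem.set(name, child.text)
        elif style == "element":
            ET.SubElement(elem, name).text = child.text
        else:  # typedElement
            ET.SubElement(elem, carrier, {"type": name}).text = child.text
    return elem

ABSTRACT = """
<struct type="TE">
  <feat type="id">ID67</feat>
  <feat type="subjectField">manufacturing</feat>
  <struct type="LS">
    <feat type="lang">en</feat>
    <struct type="TS">
      <feat type="term">alpha smoothing factor</feat>
      <feat type="termType">fullForm</feat>
    </struct>
  </struct>
</struct>
"""
concrete = to_concrete(ET.fromstring(ABSTRACT))
print(ET.tostring(concrete, encoding="unicode"))
```

Swapping in different tables would target a different project-specific schema without touching the function.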
17 Data Categories
- Description of Data Categories
  - Formal properties
  - E.g. name, definition, levels of occurrence, content type (possible values), target type (cf. <rel>)
  - Represented as an RDF structure
- Registered in the Data Category Repository for universal reference and use
  - This is the work we have to do
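To make the formal properties concrete, a data category description can be written down as a record and flattened into subject, predicate, object statements of the kind RDF uses. Everything specific in this sketch is an assumption for illustration: the namespace `http://example.org/dcr#`, the property names, the class `DataCategory`, and the sample definition text are all hypothetical, and no literal escaping is attempted.

```python
from dataclasses import dataclass, field

BASE = "http://example.org/dcr#"  # hypothetical repository namespace

@dataclass
class DataCategory:
    name: str
    definition: str
    levels: list = field(default_factory=list)        # levels of occurrence
    content_type: list = field(default_factory=list)  # possible values
    target_type: str = ""                             # for relations, cf. <rel>

    def triples(self):
        """Yield (subject, predicate, object) statements for this category."""
        subject = "<%s%s>" % (BASE, self.name)
        yield subject, "<%sdefinition>" % BASE, '"%s"' % self.definition
        for level in self.levels:
            yield subject, "<%slevelOfOccurrence>" % BASE, '"%s"' % level
        for value in self.content_type:
            yield subject, "<%spossibleValue>" % BASE, '"%s"' % value
        if self.target_type:
            yield subject, "<%stargetType>" % BASE, '"%s"' % self.target_type

# Illustrative entry; the definition wording is invented for the example.
cat = DataCategory(
    name="CAT",
    definition="Syntactic category of a node",
    levels=["struct"],
    content_type=["S", "NP", "VP", "PP"],
)
lines = [" ".join(triple) + " ." for triple in cat.triples()]
print("\n".join(lines))
```

A shared repository would give each category a stable identifier of this kind, so that different annotation schemes can refer to the same definition.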
18 Data Category
19 Existing initiatives
- DTD-oriented
  - XCES, MATE, TEI
- Basic lists of Data Categories
  - EAGLES
- But also
  - ISO (MARTIF (DTD-oriented), TMF), OLIF, etc.
20 Where do we go from here?
- Possible agenda
  - Create a repository of data categories for linguistic representation
  - Organize working groups for specific fields
    - Feedback on data categories
    - Underlying architecture and relations with other groups (e.g. syntactic annotation of transcribed speech)
  - Collaborate with other bodies
    - Industry, ISO