Title: Linguistic Annotation Framework SC4 WG 1
1 Linguistic Annotation Framework, SC4 WG 1
- Nancy Ide Vassar College USA
2 LAF Goal
- Provide a generic means to represent linguistic
data and annotations
- Based on a formal model
- Users map their formats into/out of LAF
- User formats must conform to underlying model
- Pivot or dump format for exchange, machine
processing
3 [Figure: the DUMP FORMAT serves as an interlingua
between User A's representation and User B's
representation]
4 Principles
- Separation of data and annotations
- Stand-off annotation
- Separation of user annotation formats and the
exchange (dump) format
- Mappable to one another
- Separation of referential structure and
annotation content in dump format
- Separation of annotation structure (relationships
among parts) and content (data categories) in
representation of annotations
5 LAF Development
- LAF has gone through a slow evolution
- Model development (GMT as base)
- Consideration of processing needs
- Application to different annotation
types/structures/formats
- Adjustments to development in other WGs on
specific annotation types and feature structures
- Proof of concept instantiation in the American
National Corpus
- Transduction of several different annotation
types and formats to LAF format
- API to merge, transduce to other formats
6 LAF Status
- Have now
- Reduced FS specification
- Final XML format / schema
- GrAF: Graph Annotation Format
- Mapping rules and examples
- Also
- Coordination with UIMA
- Header specification including information about
annotation, similar to UIMA type definition
7 Basic Model
- Annotation content represented by feature
structures
- Powerful means to represent any/all annotations
- Referential structure represented as a directed
acyclic graph (DAG)
- Enables exploitation of well-understood graph
traversal and manipulation algorithms
8 Referential Structure
- Means by which annotation content is associated
with primary data or other annotations
- Very simple DAG model
- No need to consider internal structure of
annotation content (i.e. relations among bits of
annotation information)
9 Primary Data
- Primary data contains no annotations
- Read-only
- Modifications can be regarded as annotations
- Insistence on the identification of a base
segmentation of the primary data
- Identifies contiguous sequences of indivisible
logical units
- For text, usually a character
- Compatible annotations (i.e. those that can be
merged etc.) use common base segmentation
10 Primary Segmentation
- Set of disjoint edges over primary data
- Vertices
- Virtual, located between each logical unit
- Sequentially numbered
- Edges
- Each edge (x,y) in the graph delimits a
non-divisible region of primary data
- Conformance to MAF, SynAF
- Call each of these edges over primary data a span
11
- Multiple primary segmentations may be defined
over a single primary data set
- Specify segmentations at different levels of
granularity
- A segmentation is primary vis-à-vis a given
annotation, not the data itself
- Edges in a primary segmentation can be defined
over any span of contiguous primary data,
regardless of its length
- No need for spans to be contiguous
- For text, most common primary segmentation is the
token
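A minimal sketch of a primary segmentation in Python, assuming character-level logical units so that virtual vertex i sits just before character i. The helper names (`token_spans`, `is_disjoint`) and the regex tokenizer are illustrative assumptions, not part of LAF.

```python
import re

text = "The clock struck twenty-two"

def token_spans(data):
    """Return (x, y) edges between virtual vertices over the primary data."""
    # Assumption: a token is a run of word characters or a single punctuation mark.
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", data)]

def is_disjoint(spans):
    """True if no two edges overlap; gaps (e.g. whitespace) are allowed."""
    ordered = sorted(spans)
    return all(a[1] <= b[0] for a, b in zip(ordered, ordered[1:]))

spans = token_spans(text)
print(spans)  # [(0, 3), (4, 9), (10, 16), (17, 23), (23, 24), (24, 27)]
print([text[x:y] for x, y in spans])
print(is_disjoint(spans))  # True
```

The spans are not contiguous (whitespace is skipped), which the slide allows.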
12 Referring to Primary Segmentation
- Define an edge graph over the edges (spans) in
the primary segmentation
- Given an edge set, E, create an edge graph E
such that for each edge (x,y) in E, there is a
vertex xy in E
- Annotations are associated with regions of
primary data by referencing the edge graph
vertices
- Annotations never reference the primary data
directly
13
- Edges in E are defined when annotations
reference vertices in E
- Vertices may or may not be contiguous
- An annotation is associated with vertices in E
as follows
- Create a new vertex, v
- Label it with the FS containing the annotation
content
- Create an edge from v to 0 or more vertices in E
- Zero reference is used in the special case where
the annotation applies to information not present
in the data
- References to 2 or more vertices in E by
default concatenate the information covered by
the referenced vertices (in order)
- Can be overridden to specify vertices are to be
regarded as an ordered list or bag
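The three steps above (new vertex, FS label, edges to zero or more edge-graph vertices) can be sketched as follows. The `Node` class, the `annotate` and `covered_text` helpers, and the "number" FS are hypothetical names chosen for illustration. The edge ids follow the XML instantiation of "The clock struck twenty-two".

```python
from dataclasses import dataclass, field

text = "The clock struck twenty-two"

# Edge-graph vertices: one per span (x, y) of the primary segmentation.
edge_graph = {"e1": (0, 3), "e2": (4, 9), "e3": (10, 16),
              "e4": (17, 23), "e5": (23, 24), "e6": (24, 27)}

@dataclass
class Node:
    """Annotation vertex: an id, a feature structure, unlabeled outgoing edges."""
    id: str
    fs: dict
    edges_to: list = field(default_factory=list)  # empty list = zero reference

def annotate(node_id, fs, targets=()):
    """Create vertex v, label it with the FS, and point it at edge-graph vertices."""
    unknown = [t for t in targets if t not in edge_graph]
    if unknown:
        raise KeyError(f"unknown edge-graph vertices: {unknown}")
    return Node(node_id, fs, list(targets))

def covered_text(node):
    """Default reading: concatenate the referenced regions, in order."""
    return "".join(text[x:y] for x, y in (edge_graph[t] for t in node.edges_to))

# Hypothetical annotation spanning three vertices.
num = annotate("n1", {"type": "number", "value": "22"}, ["e4", "e5", "e6"])
print(covered_text(num))  # twenty-two
```

Note that the annotation only ever names edge-graph vertices; the character offsets are looked up through the edge graph, never stored in the annotation itself.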
14 Edge graph over primary data
[Figure: edge graph over the primary data "The clock
struck twenty-two", with annotations associated
with vertices in the primary data edge graph]
15
- As many annotations as desired can reference the
same segmentation or be layered over lower-level
annotations
[Figure: primary data with two segmentations (SEG1,
SEG2) and layered annotations MS1, MS2, MS3, Syn1,
Syn2, NP, Co-Ref, Sem]
16 Annotating Annotations
- Vertices in an annotation may be referenced from
other annotations
- Create a new vertex, v
- Label it with the FS containing the annotation
content
- Create an edge from v to one or more vertices
associated with an annotation
- The strategy described above may be applied
recursively, thus creating a DAG whose leaves are
the vertices in E
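A small sketch of the recursive layering, assuming the graph is held as a plain dictionary from node id to the ids it points to. The ids `s1`, `np1`, `t1` to `t3` and the `leaves` helper are hypothetical; ids starting with "e" stand for edge-graph vertices.

```python
# Hypothetical in-memory graph: annotation node id -> ids it points to.
edges_to = {
    "t1": ["e1"], "t2": ["e2"], "t3": ["e3"],   # token annotations
    "np1": ["t1", "t2"],                          # annotation over annotations
    "s1": ["np1", "t3"],                          # one more layer up
}

def leaves(node_id):
    """Follow outgoing edges recursively until edge-graph vertices are reached."""
    targets = edges_to.get(node_id)
    if targets is None:        # not an annotation node, so it is a leaf
        return [node_id]
    found = []
    for target in targets:
        found.extend(leaves(target))
    return found

print(leaves("s1"))  # ['e1', 'e2', 'e3']
```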
17 Annotations associated with token annotations
18 XML Instantiation
<!-- edges over primary data -->
<edge id="e1" from="0" to="3"/>
<edge id="e2" from="4" to="9"/>
<edge id="e3" from="10" to="16"/>
<edge id="e4" from="17" to="23"/>
<edge id="e5" from="23" to="24"/>
<edge id="e6" from="24" to="27"/>
19 Token Annotation
<node id="t2" edgesTo="e2">
  <fs type="token">
    <f name="base" value="clock"/>
    <f name="pos" value="NN"/>
  </fs>
</node>
Creates a new vertex (node) associated with the
FS with a single edge to vertex e2 in the
primary segmentation edge graph
20 NP Annotation
<node id="np1" edgesTo="t1 t2">
  <fs type="NP">
    <f name="number" sVal="singular"/>
  </fs>
</node>
Creates a new vertex (node) associated with the
FS with two outgoing edges to vertices t1 and
t2 in the token annotation
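As an illustration of how a processor might read this markup, the sketch below parses edges and nodes with Python's standard `xml.etree.ElementTree` and resolves `np1` down to spans of primary data. The `<graph>` wrapper element, the `t1` node, and the `resolve` helper are assumptions added so the fragment is self-contained; they are not taken from the LAF schema.

```python
import xml.etree.ElementTree as ET

text = "The clock struck twenty-two"

# Assumed wrapper <graph>; the t1 node is hypothetical, added so np1 resolves.
doc = """
<graph>
  <edge id="e1" from="0" to="3"/>
  <edge id="e2" from="4" to="9"/>
  <node id="t1" edgesTo="e1"><fs type="token"><f name="base" value="the"/></fs></node>
  <node id="t2" edgesTo="e2"><fs type="token"><f name="base" value="clock"/></fs></node>
  <node id="np1" edgesTo="t1 t2"><fs type="NP"><f name="number" sVal="singular"/></fs></node>
</graph>
"""

root = ET.fromstring(doc)
# Edge id -> (from, to) offsets into the primary data.
spans = {e.get("id"): (int(e.get("from")), int(e.get("to"))) for e in root.iter("edge")}
# Node id -> list of target ids (edgesTo may be empty).
nodes = {n.get("id"): n.get("edgesTo", "").split() for n in root.iter("node")}

def resolve(ident):
    """Return the primary-data spans reachable from a node or edge id."""
    if ident in spans:
        return [spans[ident]]
    return [span for target in nodes[ident] for span in resolve(target)]

print([text[x:y] for x, y in resolve("np1")])  # ['The', 'clock']
```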
21 Question
- When referring to annotations, edge targets
typically represent components
- E.g. in the example, "the" and "clock" are
components of NP
- But this is not always the case
- Could be e.g. a list of co-referents
- Others?
- Possible solution: let the processor deal with it
using the FS type
22 Note
- Edges are never labeled, unlike in many
linguistic analyses
- Preserves simplicity of the graph
- Relations are DatCats
- edgesTo attribute can be empty
- Can create pseudo-nodes
- Implies a flat (non-nested) structure in the dump
format
23 [Figure: dependency graph over the words HAVE,
FLEA, DOG, DOG, MY with relation labels s, obj,
head, subj, gen, head]
<node type="clone" id="E2" ref="t2"/>
<node id="c5" edgesTo="t5"><f name="role" sVal="gen"/></node>
<node id="c6" edgesTo="t2"><f name="role" sVal="head"/></node>
<node id="c7" edgesTo="c5 c6"/>
<node id="c1" edgesTo="t1"><f name="role" sVal="head"/></node>
<node id="c7" edgesTo="c7"><f name="role" sVal="s"/></node>
<node id="c3" edgesTo="t3"><f name="role" sVal="obj"/></node>
<node id="c4" edgesTo="E2"><f name="role" sVal="subj"/></node>
<node id="D1" edgesTo="c1 c7 c3 c4 E2"/>
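One possible way to move relation labels off the edges and into node content, in the spirit of the example above, is to give each relation its own pseudo-node whose FS carries the role. The sketch below is an assumption about how such a conversion could look, not the procedure used for the slide; `unlabel`, the triple format, and the generated `c1`, `c2`, ... ids are hypothetical.

```python
from itertools import count

def unlabel(labelled_edges):
    """Turn (head, role, dependent) triples into nodes with unlabeled edges.

    Each relation becomes a pseudo-node whose FS carries the role and whose
    single edge points at the dependent; the head points at its pseudo-nodes.
    """
    ids = count(1)
    nodes = {}  # node id -> {"fs": {...}, "edgesTo": [...]}
    for head, role, dependent in labelled_edges:
        cid = f"c{next(ids)}"
        nodes[cid] = {"fs": {"role": role}, "edgesTo": [dependent]}
        nodes.setdefault(head, {"fs": {}, "edgesTo": []})["edgesTo"].append(cid)
    return nodes

# Hypothetical labelled dependencies over token ids t1..t3.
nodes = unlabel([("D1", "head", "t1"), ("D1", "obj", "t3"), ("D1", "subj", "t2")])
for node_id, node in nodes.items():
    print(node_id, node)
```

The resulting graph has no edge labels at all: the role is ordinary annotation content on a node, so the same traversal code works for every annotation type.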
24 Advantages of DAG
- Can apply graph algorithms to traverse the graph
- Breadth-first, depth-first traversal, shortest
path, minimum spanning tree
- Connectedness, articulation vertices
- Topological sort
- Graph coloring, graph partitioning
- Etc.
- What can we do with this?
- What is all info on path to/from node x
- What is nearest common ancestor of nodes x and y
- Find matching sub-graphs
- Identify connected components
- Which nodes (phenomena) are most connected, form
articulation vertices, etc.
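Two of the operations listed above, topological sort and nearest common ancestor, can be sketched with the Python standard library alone. The graph contents and the helper names (`ancestors`, `height`, `nearest_common_ancestor`) are hypothetical; in a DAG the nearest common ancestor need not be unique, and this sketch simply returns one with minimal height.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical annotation graph: node id -> ids it points to (edgesTo targets).
edges_to = {
    "s1": ["np1", "t3"],
    "np1": ["t1", "t2"],
    "t1": ["e1"], "t2": ["e2"], "t3": ["e3"],
}

# TopologicalSorter treats each value list as "must come first",
# so edge targets are emitted before the nodes that point to them.
order = list(TopologicalSorter(edges_to).static_order())

def ancestors(node_id):
    """All nodes from which node_id can be reached."""
    parents = {p for p, targets in edges_to.items() if node_id in targets}
    return parents.union(*(ancestors(p) for p in parents))

def height(node_id):
    """Length of the longest path from node_id down to a leaf."""
    return 1 + max((height(t) for t in edges_to.get(node_id, [])), default=-1)

def nearest_common_ancestor(x, y):
    """A common ancestor of x and y with minimal height, or None."""
    common = ancestors(x) & ancestors(y)
    return min(common, key=height, default=None)

print(order)
print(nearest_common_ancestor("t1", "t2"))  # np1
print(nearest_common_ancestor("t1", "t3"))  # s1
```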
25 Feature Structures
- Each edge is labeled with a feature value
- Can be FS, collection (list, bag, set), atom
- Alternation and grouping handled by the FS
mechanisms
- Need to identify basic FS mechanisms
- 90% of annotations use only these
- Annotations may (optionally) use only this set
- Ease of use
- No need to implement procedures to handle full
power of FS
- Need to create a FS library for abbreviation
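To make the value kinds above concrete, here is a rough sketch that models a FS as a Python dict and checks that it uses only atoms, collections, and nested FSs. The slide leaves the basic set of FS mechanisms open, so the choice of Python types (`Counter` for a bag, `dict` for a nested FS) and the name `is_basic_value` are assumptions made for illustration.

```python
from collections import Counter

def is_basic_value(value):
    """True if value is an atom, a collection (list, set, bag), or a nested FS."""
    if isinstance(value, (str, int, float, bool)):
        return True                                    # atom
    if isinstance(value, Counter):                     # bag: element -> count
        return all(is_basic_value(v) for v in value)
    if isinstance(value, dict):                        # nested FS: feature -> value
        return all(isinstance(k, str) and is_basic_value(v)
                   for k, v in value.items())
    if isinstance(value, (list, set, frozenset)):      # ordered list or set
        return all(is_basic_value(v) for v in value)
    return False

# Hypothetical token FS; "alts" and "agr" are made-up feature names.
token_fs = {
    "type": "token",
    "base": "clock",
    "pos": "NN",
    "alts": ["NN", "VB"],
    "agr": {"number": "singular"},
}
print(is_basic_value(token_fs))  # True
```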
26 Implications for Other WGs
- Should (conceptually at least) separate
referential structure from annotation content
- E.g. tlink in TimeML/SemAF: the link itself is
the edge, tlink is the annotation content (?)
- Need for coordination
- Inter-project coordination committee?
- Need examples!
27 Today's Work
- Discuss the format in terms of specific
annotation types
- Remember that dump format is in principle never
seen by the user
- Map user format into and out of dump format
- Two topics
- DAG for referential structure
- FS for representing annotation content