Annotation as Algebra: a formal framework for linguistic annotation

HP Labs Bangalore, 8/21/2003.

Slides: 83

1
Annotation as Algebra: a formal framework for
linguistic annotation
  • Mark Liberman, University of Pennsylvania,
    myl@cis.upenn.edu

(joint work with Steven Bird, Melbourne
University)
2
Outline
  • Motivation
  • Sketch of the idea
  • Survey of linguistic annotation
  • Annotation graphs as a formal framework
  • Practical implementations and experience
  • Issues for the future

3
What linguistic annotation is (and isn't)
  • Linguistic annotation means symbolic
    descriptions of specific linguistic signals
  • e.g. transcriptions, parses, etc.
  • it does not include things like
  • metadata
  • e.g. information about speakers, recordings,
    documents, etc.
  • typically stored in RDB referenced by elements
    of linguistic annotation
  • lexicons
  • but these can be treated in a common framework

4
Motivation
  • A jungle of annotation file formats
  • e.g. more than 20 common formats for time-marked
    orthographic transcriptions
  • Many new formats every year
  • Multiple annotations of the same data
  • No good way to search annotations
  • different coding needed for each format
  • extra difficulty of searches across formats
  • Problems for
  • tool builders
  • researchers
  • corpus builders and maintainers

5
Basic idea #1: what to do
  • Abstract away from file formats, to the logical
    structure of linguistic annotation
  • Replace two-level model with three-level model
  • as in database technology several decades ago
  • so many applications can access many kinds of
    data through a consistent API
  • Choose a logical structure with good properties
  • simple, conceptually natural, computationally
    efficient
  • algebra to facilitate boolean combination of
    queries

6
Two-level model
7
Three-level model
8
Basic idea #2: how to do it
  • Three kinds of assertion recur in linguistic
    annotation
  • assigning a label: "This chunk of stuff has
    property X"
  • sequencing labels: "chunk B immediately follows
    chunk A"
  • anchoring the edges of labels: "this chunk
    boundary has coordinates k" (in time,
    space, text...)
  • Formalized as a labeled DAG, these
    primitives provide a logical structure
    adequate for all linguistic annotation
  • The result also defines an algebra useful
    for searching and in other ways

9
Basic assertion type 1: labeling
Associate a label (typed, structured symbolic
information) with a region of a linguistic signal
10
Basic assertion type 2: sequencing
Example: The stretch of signal labeled "this" is
followed by a stretch of signal labeled "is"
11
Basic assertion type 3: anchoring
Example: The stretch of signal labeled
"this" begins 137.4592 seconds from the start
of file XYZ.
12
Informal formalization
  • An annotation graph (AG) is
  • a directed acyclic graph
  • whose arcs are labeled with fielded records
  • e.g. phoneme=p or word=this
  • whose nodes may be labeled with signal
    coordinates
  • e.g. 3.45692 seconds
  • Labeling → arc labels; Sequencing → graph
    structure; Anchoring → signal coordinates on nodes
  • That's all!
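The three assertion types can be sketched in a few lines of Python. This is a minimal illustration only: the class and field names (`Node`, `Arc`, `offset`, `label`) are assumptions made for this sketch and are not the AGTK API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical node type; `offset` is an optional signal coordinate."""
    name: str
    offset: Optional[float] = None  # anchoring, e.g. seconds into a file


@dataclass(frozen=True)
class Arc:
    """Hypothetical arc type carrying a fielded record as its label."""
    src: Node
    dst: Node
    label: tuple  # labeling, e.g. (("word", "this"),)


# Anchoring: "this" begins 137.4592 seconds from the start of the file.
n0 = Node("n0", 137.4592)
n1 = Node("n1")  # boundary between the two words, left unanchored
n2 = Node("n2")

# Labeling: each arc carries a fielded record.
this_arc = Arc(n0, n1, (("word", "this"),))
is_arc = Arc(n1, n2, (("word", "is"),))

# Sequencing: "is" immediately follows "this" because the arcs share node n1.
follows = this_arc.dst == is_arc.src
print(follows)
```

Nothing beyond arcs, nodes and optional offsets is needed to state all three kinds of assertion.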

13
Outcome
API, open source toolkit (C, C++, Tcl, Python)
sample tools
Java version (ATLAS) developed by NIST
14
Annotation formats & tools
  • Surveyed in 1999 by Liberman and Bird
  • Documented on web page:
    http://ldc.upenn.edu/annotation
  • Used in designing the annotation graph system
    & AG software
  • Survey is updated periodically

15
Some animals in the annotation zoo
  • TIMIT
  • BAS Partitur
  • CHILDES
  • LACITO
  • LDC CALLHOME
  • NIST UTF
  • Switchboard (four types of annotation)
  • ... etc. ...

16
Sample TIMIT data
train/dr1/fjsp0/sa1.wrd     train/dr1/fjsp0/sa1.phn
2360 5200 she               0 2360 h#
5200 9680 had               2360 3720 sh
9680 11077 your             3720 5200 iy
11077 16626 dark            5200 6160 hv
16626 22179 suit            6160 8720 ae
22179 24400 in              8720 9680 dcl
24400 30161 greasy          9680 10173 y
30161 36150 wash            10173 11077 axr
36720 41839 water           11077 12019 dcl
41839 44680 all             12019 12257 d
44680 49066 year            ...
17
TIMIT interpreted graphically
(Figure: boundary labels 5200, 6160, 8720, 9680)
18
TIMIT as Annotation Graph
W (word level): 5200 9680 had
P (phoneme level): 5200 6160 hv; 6160 8720 ae;
8720 9680 dcl
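As a rough sketch of this mapping, the word and phoneme lines for "had" can be read into arcs whose endpoints are the sample offsets themselves. The function name `timit_to_arcs` and the tuple layout are assumptions made for illustration, not part of any TIMIT or AGTK tooling.

```python
def timit_to_arcs(lines, tier):
    """Turn 'start end label' lines into (tier, start, end, label) arcs.

    Hypothetical helper: node identity is the sample offset itself, so arcs
    from different tiers that share a boundary automatically share a node.
    """
    arcs = []
    for line in lines:
        start, end, label = line.split()
        arcs.append((tier, int(start), int(end), label))
    return arcs


wrd = ["5200 9680 had"]
phn = ["5200 6160 hv", "6160 8720 ae", "8720 9680 dcl"]
arcs = timit_to_arcs(wrd, "W") + timit_to_arcs(phn, "P")

# Phoneme arcs that lie within the span of the word arc "had".
word = next(a for a in arcs if a[0] == "W")
w_start, w_end = word[1], word[2]
inside = [a[3] for a in arcs if a[0] == "P" and w_start <= a[1] and a[2] <= w_end]
print(inside)
```

Because the two tiers share the offsets 5200 and 9680, the word arc and the chain of phoneme arcs end up spanning the same pair of nodes.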
19
BAS Partitur
Goal: a common format for research results
from many German speech projects.
A multi-tier description of speech signals:
  • KAN - the canonical transcription
  • ORT - orthographic transcription
  • TRL - transliteration
  • MAU - phonetic transcription
  • DAS - dialogue act transcription
20
BAS Partitur example
KAN: 0 j'a       ORT: 0 ja        MAU: 4160 1119 0 j
KAN: 1 S'2n@n    ORT: 1 schönen   MAU: 5280 2239 0 a
KAN: 2 d'aNk     ORT: 2 Dank      MAU: 7520 2399 1 S
KAN: 3 das       ORT: 3 das       MAU: 9920 1599 1 2
KAN: 4 vEr@      ORT: 4 wäre      MAU: 11520 479 1 n
KAN: 5 z'e6      ORT: 5 sehr      MAU: 12000 479 1 n
KAN: 6 n'Et      ORT: 6 nett      MAU: 12480 479 -1
DAS: 0,1,2 @(THANK_INIT BA)
DAS: 3,4,5,6 @(FEEDBACK_ACKNOWLEDGEMENT BA)
21
BAS Partitur graphical structure
KAN: 0 j'a       ORT: 0 ja         MAU: 4160 1119 0 j
KAN: 1 S'2n@n    ORT: 1 sch"onen   MAU: 5280 2239 0 a
DAS: 0,1,2 @(THANK_INIT BA)
22
Partitur differences from TIMIT
  • File organization: everything is in a single
    file (even metadata)
  • Time marking: time anchors are in only one
    tier (MAU); time anchors use
    <start offset, duration-1>
  • Relationship between the tiers: KAN tier
    supplies a set of identifiers; MAU tier has
    several lines for each KAN line; DAS tier has
    one line for several KAN lines
  • Temporal structure: MAU and DAS define
    convex intervals
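The <start offset, duration-1> convention can be checked against the MAU lines of the example with a short sketch. The helper names are hypothetical, and the field order (start, duration-1, KAN index, label) is assumed from the lines as shown on the example slide.

```python
def mau_interval(start, duration_minus_1):
    """Convert a MAU <start offset, duration-1> pair to inclusive (first, last) samples."""
    return start, start + duration_minus_1


def mau_to_arcs(mau_lines):
    """Parse 'start dur-1 kan_index label' lines into half-open (start, end, kan, label)."""
    arcs = []
    for line in mau_lines:
        start, dur, kan, label = line.split()
        first, last = mau_interval(int(start), int(dur))
        # The end node is the first sample after the segment.
        arcs.append((first, last + 1, int(kan), label))
    return arcs


mau = ["4160 1119 0 j", "5280 2239 0 a", "7520 2399 1 S"]
arcs = mau_to_arcs(mau)

# Each segment ends exactly where the next begins, so adjacent arcs can share a node.
contiguous = all(a[1] == b[0] for a, b in zip(arcs, arcs[1:]))
print(arcs[0], contiguous)
```

With this reading, 4160 + 1119 + 1 = 5280, which is the start of the next segment, so the MAU tier forms a connected chain of arcs.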
23
BAS Partitur Annotation graph
ORT: 0 ja          MAU: 4160 1119 0 j
ORT: 1 sch"onen    MAU: 5280 2239 0 a
                   MAU: 7520 2399 1 S
                   MAU: 9920 1599 1 2
                   MAU: 11520 479 1 n
DAS: 0,1,2 @(THANK_INIT BA)
24
CHILDES
  • Child language acquisition data
  • Archive organized by Brian MacWhinney at
    CMU
  • CHAT transcription format
  • Tools for creating, browsing, searching
  • Contributions by many researchers around the
    world

25
CHILDES Annotation
*ROS: yahoo.
%snd: "boys73a.aiff" 7349 8338
*FAT: you got a lot more to do don't you?
%snd: "boys73a.aiff" 8607 9999
*MAR: yeah.
%snd: "boys73a.aiff" 10482 10839
*MAR: because I'm not ready to go to <the bathroom> > /.
%snd: "boys73a.aiff" 11621 13784
26
CHILDES differences from TIMIT
  • long recordings with multiple speakers
  • time specified at turn level only
  • there are gaps between the turns
  • the transcription contains embedded annotations

27
CHILDES annotation graph
*ROS: yahoo.
%snd: "boys73a.aiff" 7349 8338
*FAT: you got a lot more to do don't you?
%snd: "boys73a.aiff" 8607 9999
N.B. incomplete time info, disconnected structure
28
CHILDES RDB connection
Metadata about speakers, recordings etc.
stored separately in relational tables

ID  NAME   ROLE    AGE     SEX   BIRTH
1   Ross   Child   6;3.11  male  23-DEC-1977
2   Mark   Child   4;4.15  male  19-NOV-1979
3   Brian  Father
4   Mary   Mother
29
LACITO
  • Langues et Civilisations à Tradition Orale
  • recordings of unwritten languages, collected and
    transcribed over three decades
  • preservation and dissemination
  • Based on XML
  • markup for alignment to audio signal
  • different XSL style sheets for display
  • generating HTML
  • with hyperlinks to audio clips

30
LACITO example
<S id="s1">
  <AUDIO start="2.3656" end="7.9256"/>
  <TRANSCR>
    <W><FORM>nakpu</FORM><GLS>deux</GLS></W>
    <W><FORM>nonotso</FORM><GLS>soeurs</GLS></W>
    <W><FORM>si&#x014b;</FORM><GLS>bois</GLS></W>
    <W><FORM>pa</FORM><GLS>faire</GLS></W>
    <W><FORM>la&#x0294;natshem</FORM><GLS>allerent</GLS></W>
    <W><FORM>are</FORM><GLS>dit.on</GLS></W>
    <PONCT>.</PONCT>
  </TRANSCR>
  <TRADUC lang="Francais">On raconte que deux soeurs
    allerent chercher du bois.</TRADUC>
  <TRADUC lang="Anglais">They say that two sisters went to
    get firewood.</TRADUC>
</S>
31
LACITO as AG
<AUDIO start="2.3656" end="7.9256"/>
<W><FORM>nakpu</FORM><GLS>deux</GLS></W>
<W><FORM>nonotso</FORM><GLS>soeurs</GLS></W>
<W><FORM>si&#x014b;</FORM><GLS>bois</GLS></W>
<W><FORM>pa</FORM><GLS>faire</GLS></W>
<TRADUC lang="Francais">On raconte que deux ...</TRADUC>
<TRADUC lang="Anglais">They say that two ...</TRADUC>
32
LACITO discussion
  • Two kinds of partiality for times
  • where they are simply unknown
  • where they are inappropriate
  • Unknown times
  • the annotation is incomplete
  • time-alignment is coarse-grained
  • Inappropriate times
  • for word boundaries in the phrasal translation
  • for punctuation?

33
LDC Call Home example
980.18 989.56 A: you know, given how he's how far he's
  gotten, you know, he got his degree at Tufts and all,
  I found that surprising that for the first time as an
  adult they're diagnosing this. um
989.42 991.86 B: mm. I wonder about it. But anyway.
991.75 994.65 A: yeah, but that's what he said. And um
994.19 994.46 B: yeah.
995.21 996.59 A: He um
996.51 997.61 B: Whatever's helpful.
997.40 1002.55 A: Right. So he found this new job as a
  financial consultant and seems to be happy with that.
1003.14 1003.45 B: Good.
34
LDC CallHome as AG
995.21 996.59 A: He um
996.51 997.61 B: Whatever's helpful.
997.40 1002.55 A: Right. So ...
35
CallHome discussion
  • Speaker overlap
  • No special devices, just turn time-marks
  • Scales for an arbitrary number of speakers
  • Information about word-level overlap is left
    ambiguous
  • Additional time references could easily
    specify word overlap

36
NIST UTF (circa 1999)
  • NIST: National Institute of Standards and
    Technology (USA)
  • UTF: Universal Transcription Format
  • Intended to generalize over several earlier LDC
    broadcast news and conversation transcription
    formats
  • Special treatment for
  • metadata, time stamps, speaker overlap,
    contractions

N.B. now abandoned in favor of AG-based
representations
37
NIST UTF example (from BN)
<turn speaker="Roger_Hedgecock" spkrtype="male"
      dialect="native" start="2348.811875"
      end="2391.606000" mode="spontaneous"
      fidelity="high">
<time sec="2387.353875"> on welfare and away from
real ownership \breath and
<contraction e_form="that>that's>is">that's
a real problem in this
<b_overlap start="2391.115375" end="2391.606000">
country <e_overlap></turn>
<turn speaker="Gloria_Allred" spkrtype="female"
      dialect="native" start="2391.299625"
      end="2439.820312" mode="spontaneous"
      fidelity="high">
<b_overlap start="2391.299625" end="2391.606000">
well i <e_overlap> think the real problem is that uh
these kinds of republican attacks
<time sec="2395.462500"> i see as code words for
discrimination</turn>
38
NIST UTF turn element
<turn speaker="Roger_Hedgecock"
      spkrtype="male" dialect="native"
      start="2348.811875" end="2391.606000"
      mode="spontaneous" fidelity="high">
39
NIST UTF Contraction
<contraction e_form="that>that's>is">
that's
40
NIST UTF overlap
<b_overlap start="2391.115375"
           end="2391.606000"> country <e_overlap>
41
NIST UTF discussion
  • Relational data (e.g. speaker demographics) is
    embedded in the annotation (redundantly).
  • Time stamps are stored in three different
    places.
  • Speaker overlap is convolved with the speaker
    turn, so a time relation with an external event
    disrupts the internal structure of a turn.
  • Contractions are treated in a way that
    facilitates linking to a lexicon, but may be
    hard to ignore in a search function.
42
NIST UTF as AG
43
AG contraction treatment
Additional textual annotations (e.g. for
expanding a contraction) don't complicate the
existing representation, which facilitates search
44
NIST UTF / AG version
  • Metadata: stored in a separate RDB table (cf.
    CHILDES)
  • Time stamps: stored in a single place -- AG
    nodes
  • Speaker overlap: not convolved with the speaker
    turn, so temporal relationship with an external
    event remains external to the structure of a
    turn
  • Contractions: no new device, easily ignored in
    search
  • No artificial order on speaker turns
45
Switchboard
  • Corpus of 2400 5-minute telephone conversations
    collected at Texas Instruments in 1991
  • Transcribed and aligned on three levels:
    conversation, speaker turn, word
  • Subsequently annotated for POS, syntactic
    structure, breath groups, disfluencies, speech
    acts, phonetic segments, etc.
  • Then re-transcribed with many corrections!
  • -- Proliferation of layers with different
    tokenizations
  • -- Problem of correction after annotation
46
SWB example (1, 2)
B 21.86 0.26 Metric
B 22.12 0.26 system,
B 22.38 0.18 no
B 22.56 0.06 one's
B 22.86 0.32 very,
B 23.88 0.14 uh,
B 24.02 0.16 no
B 24.18 0.32 one
B 24.52 0.28 wants
B 24.80 0.06 it
B 24.86 0.12 at
B 24.98 0.22 all
B 25.66 0.22 seems
B 25.88 0.22 like.

Metric/JJ system/NN ,/, no/DT one/NN
's/BES very/RB ,/, uh/UH ,/, no/DT
one/NN wants/VBZ it/PRP at/IN all/DT
seems/VBZ like/IN ./.
47
SWB example (3, 4)
B.22 Yeah, / no one seems to be adopting it.
/ Metric system, no one's very, F uh, no
one wants it at all seems like. /

((S (NP-TPC Metric system) ,
    (S-TPC-1 (EDITED (RM )
                     (S (NP-SBJ no one)
                        (VP 's (ADJP-PRD-UNF very))) ,
                     (IP ))
             (INTJ uh) ,
             (NP-SBJ no one)
             (VP wants (RS )
                 (NP it)
                 (ADVP at all)))
    (NP-SBJ )
    (VP seems (SBAR like (S T-1)))
    . E_S))
48
Switchboard AG
49
Another multiple annotation
It is quite realistic to have this many diverse
annotations (and more!) for the same material...
50
AG formalization Background
  • Annotation - the basic action
  • associate a label with an extent of signal
  • labels may be of different types
  • different types may span different amounts of
    time, and need not form a hierarchy
  • Minimal formalization
  • directed graph
  • typed, fielded records on the arcs
  • optional time references on the nodes

51
Timelines
  • Nodes are anchored to signals using offsets
  • An annotation may reference more than one signal
  • e.g. simultaneous audio and video signals;
    signals from multiple microphones; audio and
    physiological signals
  • All the signals covered by a given annotation
    must be from the same "flow of time": timeline
    T
  • but signals may cover a timeline only partially
  • (Other ordered sets, such as the sequence of
    characters in a text, may also be treated as
    timelines... )

52
Two Signals, One Timeline
(Could be treated as a single multi-channel
signal -- but different channels might be in
different files, have different frame rates,
etc.)
53
AG Formal Definition
  • An Annotation Graph G over a label set L and
    timeline T is a 3-tuple <N, A, t>
  • N: set of nodes
  • A: set of arcs labelled with elements of L
  • t: partial function from N to T
  • satisfying the following conditions:
  • <N, A> is acyclic, with no nodes of degree zero
  • for any path from node n1 to n2, if t(n1) and
    t(n2) are defined, then t(n1) ≤ t(n2)
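The two conditions can be checked mechanically. The following is a minimal sketch, assuming nodes are plain identifiers, arcs are (source, destination, label) triples, and t is a dictionary covering only the anchored nodes; the function name `is_valid_ag` is hypothetical, and the time comparison is written as non-strict, which is an assumption of this sketch.

```python
def is_valid_ag(nodes, arcs, t):
    """Check the two AG conditions (hypothetical helper).

    nodes: set of node ids
    arcs:  set of (src, dst, label) triples
    t:     dict giving times for *some* nodes (a partial function)
    """
    nodes = set(nodes)
    if not set(t) <= nodes:
        return False

    # No node of degree zero: every node is the endpoint of some arc.
    touched = {n for src, dst, _ in arcs for n in (src, dst)}
    if touched != nodes:
        return False

    succ = {n: set() for n in nodes}
    for src, dst, _ in arcs:
        succ[src].add(dst)

    # Acyclic: depth-first search with three colors.
    WHITE, GREY, BLACK = 0, 1, 2
    color = dict.fromkeys(nodes, WHITE)

    def has_cycle(n):
        color[n] = GREY
        for m in succ[n]:
            if color[m] == GREY or (color[m] == WHITE and has_cycle(m)):
                return True
        color[n] = BLACK
        return False

    if any(color[n] == WHITE and has_cycle(n) for n in nodes):
        return False

    # Along every path, anchored times must not decrease.
    def reachable(n):
        seen, stack = set(), [n]
        while stack:
            for m in succ[stack.pop()]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    return all(t[n1] <= t[n2] for n1 in t for n2 in reachable(n1) if n2 in t)


nodes = {0, 1, 2, 3}
arcs = {(0, 3, "W/had"), (0, 1, "P/hv"), (1, 2, "P/ae"), (2, 3, "P/dcl")}
times = {0: 5200, 1: 6160, 2: 8720, 3: 9680}
print(is_valid_ag(nodes, arcs, times))
```

The example reuses the TIMIT word "had" and its three phonemes; adding an arc from node 3 back to node 0, or moving node 1 later than node 2, makes the check fail.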

54
Condition 1
  • 1. <N, A> is acyclic, with no nodes of degree zero
  • 1a. AGs are acyclic
  • expresses the linearity of signal annotations
  • an important property wrt implementations and to
    QLs containing path expressions
  • 1b. AGs have no orphan nodes
  • the only point of nodes is to anchor the arcs
  • avoids the situation of AGs that are identical
    but for orphan nodes

55
Condition 2
  • for any path from node n1 to n2, if t(n1) and
    t(n2) are defined, then t(n1) ≤ t(n2)
  • 2. AGs respect the flow of time (or the
    structure of another anchoring space)

56
AG Interpretation of Labels
  • Arc labels may be interpreted as
  • substantive content
  • conforming to a coding practice
  • as meta-commentary
  • as a reference to other material
  • as an identifier
  • as arbitrary binary data
  • Choice of label interpretations falls outside
    the scope of the formalism

57
AG Expressiveness
  • Is the formalism too minimalist?
  • Some things that some people want
  • 1. cross-reference from a label to another
    arbitrary label, arc or node
  • 2. labels as well as anchors for nodes
  • 3. anchoring nodes to arcs or labels rather than
    timelines
  • 4. anchoring arcs/labels in 2- or 3-dimensional
    spaces
  • 5. recursive structures in labels
  • Core AG has sufficient expressive capacity to
    encode, in an intuitive way, all commonly used
    formats, and also good properties wrt creation,
    maintenance, search
  • Our strategy
  • - see how far we can go with this core
  • - dispense with more complex syntax and focus on
    semantics
  • - but some of (1) has been added in core AG
    implementation, and (4) has been added in ATLAS
    (NIST version)

58
Structures for a single layer
All of these have (one or more) natural
representations in the basic AG
formalism. Multiple layers can of course be added
in a general way.
59
Equivalence classes
Equivalence classes (joint reference to an
external ID) provide a way to establish
symmetrical inter-label linkages without any new
formal devices
60
AG as algebra
  • An AG can be represented as a set of arcs, each
    with an associated label and (optionally
    anchored) source and destination nodes
  • The power set of this arc set defines a boolean
    algebra (as usual)
  • Every member of the power set is itself a
    well-defined AG
  • This algebra can be used for queries, just as
    the relational algebra is for RDBs
  • Adding e.g. pointers from labels to other
    arcs compromises this property (because arc
    subsets are not well-formed if pointers cannot
    be dereferenced)
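As an illustration of the algebra, an AG can be held as a frozen set of arcs and queried with ordinary set operations. The tuple layout and the `select` helper are assumptions made for this sketch.

```python
# An AG as a frozenset of arcs: (src, src_time, dst, dst_time, label).
ag = frozenset({
    ("n0", 5200, "n3", 9680, "W/had"),
    ("n0", 5200, "n1", 6160, "P/hv"),
    ("n1", 6160, "n2", 8720, "P/ae"),
    ("n2", 8720, "n3", 9680, "P/dcl"),
})


def select(graph, predicate):
    """A query result is just a subset of the arc set (hypothetical helper)."""
    return frozenset(a for a in graph if predicate(a))


phonemes = select(ag, lambda a: a[4].startswith("P/"))
early = select(ag, lambda a: a[3] <= 8720)

both = phonemes & early        # conjunction = intersection
either = phonemes | early      # disjunction = union
not_phonemes = ag - phonemes   # negation = complement relative to ag

print(sorted(a[4] for a in both))
```

Every result is again a set of self-contained arcs drawn from the same AG, which is why boolean combinations of queries need no extra machinery.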

61
AG as RDB
  • An AG can therefore also be interpreted as a
    relational table
  • or (more conveniently) as a set of three
    relational tables
  • This allows standard RDB implementations to be
    used for AG storage and retrieval
  • Obvious advantages, though standard RDB may
    not use AG structure optimally...

62
Relational Representation
  • Three relations
  • anchor, annotation (arc), feature (label)

63
Anchor Relation
Ann1 <l1, l2, ..., ln>
  • AnchorId Offset
  • a1 t1
  • a2 t2

64
Annotation (arc) Relation
Ann1 <l1, l2, ..., ln>
  • AnnotationId Source Destination
  • Ann1 a1 a2

65
Feature Relation
Ann1 <l1, l2, ..., ln>
  • AnnotationId Feature Value
  • Ann1 F1 l1
  • Ann1 F2 l2
  • ... ... ...
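The three relations can be produced from an in-memory AG with a few list comprehensions. This is a sketch under assumed data shapes (a dictionary of arcs and a dictionary of anchor offsets); the function name `to_relations` and the feature names are hypothetical.

```python
def to_relations(arcs, offsets):
    """Flatten an AG into anchor, annotation and feature rows.

    arcs:    dict annotation_id -> (source_anchor, dest_anchor, {feature: value})
    offsets: dict anchor_id -> offset (anchors without a time are simply absent)
    """
    anchor = sorted(offsets.items())
    annotation = [(ann, src, dst) for ann, (src, dst, _) in sorted(arcs.items())]
    feature = [(ann, name, value)
               for ann, (_, _, feats) in sorted(arcs.items())
               for name, value in sorted(feats.items())]
    return anchor, annotation, feature


arcs = {"Ann1": ("a1", "a2", {"type": "word", "label": "had"})}
offsets = {"a1": 5200, "a2": 9680}
anchor, annotation, feature = to_relations(arcs, offsets)
print(anchor, annotation, feature, sep="\n")
```

One arc with n label fields becomes one annotation row and n feature rows, matching the Ann1 <l1, l2, ..., ln> picture above.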

66
Queries across multiple tables
Lexicon:
ha     /hh aa1/
habit  /hh ae1 b ix t/
had    /hh ae1 d/
hafta  /hh ae1 f t ax/

Speaker info:
ID    Sex  DR  Ht
AKS0  F    1   5'04"
ASW0  F    5   5'06"
BJL0  F    5   5'07"

train/dr2/fbjl0/
67
Queries on AG Tables
  • select * from FEATURE where FEATURE.AGID="TimitAG
    80"
  • select ANNOTATIONID, SPKRINFO.ID
  • from FEATURE, SPKRINFO
  • where SPKRINFO.DR=1
  • and SPKRINFO.Ht=70
  • and FEATURE.VALUE="dark"
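A query of this shape can be tried out with Python's built-in sqlite3 module. The schema below is an assumption made for the sketch: the SPKRID column that links a feature row to a speaker, the NAME column, the annotation ids, and the storage of height in inches are all hypothetical, and the rows are illustrative rather than real corpus data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE FEATURE  (AGID TEXT, ANNOTATIONID TEXT, SPKRID TEXT,
                           NAME TEXT, VALUE TEXT);
    CREATE TABLE SPKRINFO (ID TEXT, Sex TEXT, DR INTEGER, Ht INTEGER);
""")

# Illustrative rows only; SPKRID is a hypothetical join column.
con.executemany("INSERT INTO FEATURE VALUES (?, ?, ?, ?, ?)", [
    ("ag1", "ag1:Ann4", "BJL0", "word", "dark"),
    ("ag1", "ag1:Ann5", "BJL0", "word", "suit"),
    ("ag2", "ag2:Ann4", "AKS0", "word", "dark"),
])
con.executemany("INSERT INTO SPKRINFO VALUES (?, ?, ?, ?)", [
    ("AKS0", "F", 1, 64),   # 5'04" stored as inches (assumption)
    ("BJL0", "F", 5, 67),   # 5'07" stored as inches (assumption)
])

# Annotations of the word "dark" by speakers from dialect region 5 who are 67 inches tall.
rows = con.execute("""
    SELECT FEATURE.ANNOTATIONID, SPKRINFO.ID
    FROM FEATURE JOIN SPKRINFO ON FEATURE.SPKRID = SPKRINFO.ID
    WHERE SPKRINFO.DR = 5 AND SPKRINFO.Ht = 67 AND FEATURE.VALUE = 'dark'
""").fetchall()
print(rows)
```

The explicit join condition is a design choice of this sketch: without some column tying annotations to speakers, the two tables would simply be combined row by row.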

68
AG software
  • AGTK
  • provides API
  • and language bindings
  • version 2.0 recently released
  • Sample applications
  • Open-source license
  • Available on sourceforge

69
AGTK architecture
70
API Summary
  • Functions for creating, accessing, modifying,
    storing and loading AGs
  • C library
  • Compiles on Unix and Windows
  • Scripting language access
  • Python, Tcl/tk

71
File I/O Library
  • Approach
  • build import methods for all widely used formats
  • public API documentation to encourage others to
    contribute code for their formats
  • Currently supported
  • AIF (ATLAS Interchange Format - XML)
  • BAS, BU, CALLHOME, CSV, Switchboard, TIMIT,
    Treebank, xlabel

72
Integration with other tools
  • Example WaveSurfer/SNACK
  • Sjölander and Beskow
  • www.speech.kth.se/wavesurfer/
  • open source software for sound visualization,
    analysis and manipulation
  • Linux, Windows 95/98/NT/2k, Mac, Solaris, ...
  • customizable, extensible, embeddable
  • can read and write
  • wav, au, aiff, mp3, csl, sd, sphere
  • unlimited file size
  • Unicode support

73
Wavesurfer Screenshot 1
74
Wavesurfer Screenshot 2
75
Wavesurfer Screenshot 3
76
Wavesurfer Screenshot 4
77
Annotation Component Spreadsheet (TRAINSDAMSL)
Annotation here presented in spreadsheet mode
Each row is an annotation of a stretch of
signal. Each column is a type of annotation.
78
TableTrans tool
Seamless integration of AGTK for annotation, and
Wavesurfer for audio display and playback.
79
Components in TableTrans
80
Another annotation GUI
81
Issues for the future
  • Some positive things
  • stand-off (rather than in-line) annotation
  • is now common though by no means universal
  • but in-line annotators mostly realize they are
    sinful
  • AGTK implementation is mature
  • libraries are well designed & implemented
  • good integration with GUIs and DB backends
  • can read/write many common formats
  • Some AG-based tools are good
  • basically, those that have really been used
  • demand pull: influence of users on development

82
Issues for the future
  • Some things need more work
  • AG API and AGTK are not yet widely used
  • Many AG-based tools are rough sketches
  • NIST ATLAS is not popular with researchers
    (java, complexity)
  • For many projects, something simpler & less
    general is still the local optimum
  • lines of tab-separated fields, or
  • in-line mark-up (XML or ad hoc), or
  • other legacy or new ad hoc formats
  • but it's still early days...