Title: Semantic Annotation for Interlingual Representation of Multilingual Texts
1Semantic Annotation for Interlingual
Representation of Multilingual Texts
- Teruko Mitamura (CMU), Keith Miller (MITRE), Bonnie Dorr (Maryland), David Farwell (NMSU), Nizar Habash (Columbia), Stephen Helmreich (NMSU), Eduard Hovy (ISI), Lori Levin (CMU), Owen Rambow (Columbia), Flo Reeder (MITRE), Advaith Siddharthan (Columbia)
- LREC 2004 Workshop: Beyond Named Entity Recognition - Semantic labelling for NLP tasks
3IAMTC (Interlingua Annotation of Multilingual
Corpora) Project
- Collaboration
- New Mexico State University
- University of Maryland
- Columbia University
- MITRE
- Carnegie Mellon University
- ISI, University of Southern California
4Goals of IAMTC
- Interlingua design
- Three levels of depth
- Annotation methodology
- manuals, tools, evaluations
- Annotated multi-parallel texts
- Foreign language original and multiple English translations
- Foreign languages: Arabic, French, Hindi, Japanese, Korean, Spanish
5Getting at Meaning (Two translations of Korean original text)
- Starting January 1st
- of next year
- customers of SK Telecom
- can change their service company to
- LG Telecom or KTF
- Once a service company swap has been made,
- customers
- are not allowed to change
- companies again
- within the first three months,
- although they can cancel
- the change
- anytime within 14 days
- if problems
- such as poor call quality
- are experienced.
- Starting on January 1
- of next year,
- SK Telecom subscribers
- can switch to
- less expensive LG Telecom or KTF.
- The Subscribers
- cannot switch again
- to another provider
- for the first 3 months,
- but they can cancel
- the switch
- in 14 days
- if they are not satisfied with services
- like voice quality.
6Color Key
- Black: same meaning and same expression
- Green: small syntactic difference
- Blue: lexical difference
- Red: not contained in the other text
- Purple: larger difference; some inference is needed to know that the meaning is the same
7Getting at Meaning (Two translations of a Japanese original text)
- This year,
- which has already seen
- the announcement
- of the birth
- of Mitsubishi Chemical Corporation
- as well as
- the continuous
- numbers of big mergers,
- may
- too
- be recorded
- as the "year of the merger"
- for all we know.
- This year,
- too,
- in addition to
- the birth
- of Mitsubishi Chemical,
- which has already been announced,
- other rather large-scale mergers
- may continue,
- and be recorded
- as a "year of mergers."
More lexical similarity. More differences in
dependency relations.
8Toward a Theory of Annotation
- Recently, sharp increase in number of annotated resources being built
- Penn Treebank, PropBank, many others
- For annotation, need
- Theory behind phenomena being annotated (for)
- Annotation termsets (even WordNet, FrameNet, VerbNet, HowNet)
- Standard (?) annotation corpus (same old Treebank?)
- Annotation tools: they make an immense difference
- Carefully considered annotation procedure (interleaving per text vs. per sentence, etc.)
- Reconciliation and consistency checking procedures
- Evaluation measures, appropriately defined
9Corpus and Data
- Initial Corpus
- 10 texts in each language
- 2 translations each into English
- Interlingua designed for MT
- Multiple English translations of same source show translation divergences. Some phenomena:
- Lexical level: word changes
- Syntactic level: phrasing, thematization, nominalization
- Semantic level: additional/different content
- Discourse level: multi-clause structure, anaphor
- Pragmatic level: Speech Acts, implicatures, style, interpersonal
- Causes of divergence
- Genuine ambiguity/vagueness of source meaning
- Translator error/reinterpretation
10IL Development: Staged, deepening
- IL0: simple dependency tree gives structure
- IL1: semantic annotations for Nouns, Verbs, Adjs, Advs, and Theta Roles
- Not yet semantic: buy vs. sell, many remaining simplifications
- Concept senses from ISI's Omega ontology
- Theta Roles from Dorr's LCS work
- Elaborate annotation manuals
- Tiamat annotation interface
- Post-annotation reconciliation process and interface
- Evaluation scores: annotator agreement
- IL2: that comes next
11Details of IL0
- Deep syntactic dependency representation
- Removes auxiliary verbs, determiners, and some function words
- Normalizes passives, clefts, etc.
- Includes syntactic roles (Subj, Obj)
- Construction
- Dependency parsed using Connexor (English)
- Tapanainen and Jarvinen, 1997
- Hand-corrected
- Extensive manual and instructions on IAMTC Wiki
website
12Example of IL0
Sheikh Mohammed, who is also the Defense Minister
of the United Arab Emirates, announced at the
inauguration ceremony that we want to make Dubai
a new trading center
TrEd, Pajas, 1998
13Example of IL0
- Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony that we want to make Dubai a new trading center
- announced V Root
- Mohammed PN Subj
- Sheikh PN Mod
- Defense_Minister PN Mod
- who Pron Subj
- also Adv Mod
- of P Mod
- UAE PN Obj
- at P Mod
- ceremony N Obj
- inauguration N Mod
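The node list above can be held in a small tree structure. The following is a minimal Python sketch, not the project's actual data format: the `Node` class, its field names, and the `render` helper are hypothetical, and the parent-child attachments are assumptions, since the slide's indentation is not preserved in this transcript.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One IL0 node: a word, its part of speech, and its syntactic relation."""
    word: str
    pos: str
    rel: str
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a dependent and return it so further dependents can be chained."""
        self.children.append(child)
        return child


def render(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented 'word POS relation' lines, depth-first."""
    lines = [f"{'  ' * depth}{node.word} {node.pos} {node.rel}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# The attachments below are assumed for illustration, not taken from the slide.
root = Node("announced", "V", "Root")
subj = root.add(Node("Mohammed", "PN", "Subj"))
subj.add(Node("Sheikh", "PN", "Mod"))
minister = subj.add(Node("Defense_Minister", "PN", "Mod"))
minister.add(Node("who", "Pron", "Subj"))
minister.add(Node("also", "Adv", "Mod"))
of_node = minister.add(Node("of", "P", "Mod"))
of_node.add(Node("UAE", "PN", "Obj"))
at_node = root.add(Node("at", "P", "Mod"))
ceremony = at_node.add(Node("ceremony", "N", "Obj"))
ceremony.add(Node("inauguration", "N", "Mod"))

print("\n".join(render(root)))
```

Printing the rendered tree restores an indented view in which each dependent sits one level below its head.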
14Details of IL1
- Intermediate semantic representation
- Annotations performed manually by each person alone
- Associate open-class lexical items with Omega Ontology items
- Replace syntactic relations by one of approx. 20 semantic (theta) roles (from Dorr), e.g., AGENT, THEME, GOAL, INSTR
- No treatment of prepositions, quantification, negation, time, modality, idioms, proper names, NP-internal structure
- Nodes may receive more than one concept
- Average about 1.2
- Manual under development; annotation tool built
Sheikh Mohammed, who is also the Defense Minister
of the United Arab Emirates, announced at the
inauguration ceremony that we want to make Dubai
a new trading center
16Example of IL1 internal representation
- The study led them to ask the Czech government to recapitalize CSA at this level.
- 3, lead, V, lead, Root, LEAD<GET, GUIDE
- 2, study, N, study, AGENT, SURVEY<WORK, REPORT
- 4, they, N, they, THEME, ---, ---
- 6, ask, V, ask, PROPOSITION, ---, ---
- 9, government, N, government, GOAL, AUTHORITIES, GOVERNMENTAL-ORGANIZATION
- 8, Czech, Adj, Czech, MOD, CZECHCZECHOSLOVAKIA, ---
- 11, recapitalize, V, recapitalize, PROP, CAPITALIZE<SUPPLY, INVEST
- 12, csa, N, csa, THEME, AIRLINE<LINE, ---
- 16, at, P, value_at, GOAL, ---, ---
- 15, level, N, level, ---, DEGREE, MEASURE
- 14, this, Det, this, ---, ---, ---
Semantic Roles
Concepts from the Omega Ontology
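Each record in the internal representation appears to follow the field order: node id, surface word, part of speech, lemma, theta role, then up to two Omega concepts, with `---` marking an empty slot. The Python sketch below parses a few such records under that assumed reading. `IL1Node` and `parse_record` are hypothetical names, and concept labels are treated as opaque strings.

```python
from dataclasses import dataclass
from typing import List, Optional

EMPTY = "---"  # placeholder used in the records for an unfilled slot


@dataclass
class IL1Node:
    node_id: int
    word: str
    pos: str
    lemma: str
    role: Optional[str]   # theta role, or None when the slot is empty
    concepts: List[str]   # zero, one, or two Omega concept labels


def parse_record(line: str) -> IL1Node:
    """Parse one comma-separated record (assumed to have exactly 7 fields)."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    node_id, word, pos, lemma, role, *concepts = fields
    return IL1Node(
        node_id=int(node_id),
        word=word,
        pos=pos,
        lemma=lemma,
        role=None if role == EMPTY else role,
        concepts=[c for c in concepts if c != EMPTY],
    )


RECORDS = [
    "3, lead, V, lead, Root, LEAD<GET, GUIDE",
    "4, they, N, they, THEME, ---, ---",
    "9, government, N, government, GOAL, AUTHORITIES, GOVERNMENTAL-ORGANIZATION",
    "15, level, N, level, ---, DEGREE, MEASURE",
    "14, this, Det, this, ---, ---, ---",
]

nodes = [parse_record(r) for r in RECORDS]
for n in nodes:
    print(n.node_id, n.lemma, n.role, n.concepts)
```

Keeping the concepts as a list reflects the point that a node may receive more than one concept, or none at all.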
17Details of IL2: In development
- Start capturing meaning
- Handle proper names: one of around 5 classes (PERSON, LOCATION, TIME, ORGANIZATION)
- Conversives (buy vs. sell) at the FrameNet level
- Non-literal language usage (open the door to customers vs. start doing business)
- Extended paraphrases involving syntax, lexicon, grammatical features
- Possible incorporation of other standardized notations for temporal and spatial expressions
- Still excluded
- Quantification and negation
- Discourse structure
- Pragmatics
18Omega ontology
- Single set of all semantic terms, taxonomized and interconnected (http://omega.isi.edu)
- Merger of existing ontologies and other resources
- Manually built top structure from ISI
- WordNet (110,000 nodes) from Princeton
- Mikrokosmos (6000 nodes) from NMSU
- Penman Upper model (300 nodes) from ISI
- 1-million instances (people, locations) from ISI
- TAP domain relations from Stanford
- Undergoing constant reconciliation and pruning
- Used in several past projects (metadata formation for database integration, MT, QA, summarization)
19Dependency parser and Omega ontology
Omega (ISI): 110,000 concepts (WordNet, Mikrokosmos, etc.), 1.1 million instances. URL: http://omega.isi.edu
Dependency parser (Prague)
20Tiamat annotation interface
For each new sentence
Step 1: find Omega concepts for objects and events
Candidate concepts
Step 2: select event frame (theta roles)
21Evaluation webpage
22Evaluation
- Three approaches to evaluation
- Inter-annotator agreement: completed
- Sentence generation from extracted annotation structure: to be completed
- Comparison of interlingual structures (graph comparisons): not planned
- Inter-annotator agreement: Is the IL sufficiently defined to permit consistent annotation?
- Impacts ontology, theta-roles coverage and precision
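The slides do not say how inter-annotator agreement was scored. As one illustrative possibility only, the Python sketch below averages the Jaccard overlap of the concept sets that each pair of annotators chose for the same node, which accommodates nodes carrying more than one concept. The measure, the function names, and the sample annotations are all assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

# One annotator's work: node label -> set of concepts chosen for that node.
Annotation = Dict[str, FrozenSet[str]]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Overlap of two concept sets; two empty sets count as full agreement."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_agreement(annotations: List[Annotation]) -> float:
    """Mean Jaccard overlap over every annotator pair and every shared node."""
    scores = []
    for x, y in combinations(annotations, 2):
        for node in x.keys() & y.keys():
            scores.append(jaccard(x[node], y[node]))
    if not scores:
        raise ValueError("no shared nodes to compare")
    return sum(scores) / len(scores)


# Made-up sample data for two annotators.
ann_a: Annotation = {
    "lead": frozenset({"GUIDE"}),
    "study": frozenset({"REPORT"}),
    "level": frozenset({"DEGREE", "MEASURE"}),
}
ann_b: Annotation = {
    "lead": frozenset({"GUIDE"}),
    "study": frozenset({"SURVEY"}),
    "level": frozenset({"DEGREE"}),
}

print(round(pairwise_agreement([ann_a, ann_b]), 3))
```

A chance-corrected statistic such as kappa would be a natural alternative; the overlap measure is used here only because it is simple to state.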
23Annotation Issues
- Post-annotation consistency checking
- Novice annotators may make inconsistent annotations within the same text.
- Intra-annotator consistency checking procedure
- e.g. If two nodes in different sentences are co-indexed, then annotators must ensure that the two nodes carry the same meaning in the context of the two different sentences
- Post-annotation reconciliation
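Part of the co-indexing check described above could be automated by flagging candidates for the annotator to review. The Python sketch below assumes each annotated node carries a co-index label and a set of concepts; `AnnotatedNode`, `inconsistent_groups`, and the sample data are hypothetical. It only reports groups whose concept sets differ and leaves the judgment about meaning to the annotator.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple


class AnnotatedNode(NamedTuple):
    sentence: int             # sentence number within the text
    word: str
    coindex: str              # nodes sharing this label are co-indexed
    concepts: FrozenSet[str]  # concepts the annotator selected


def inconsistent_groups(nodes: List[AnnotatedNode]) -> Dict[str, List[AnnotatedNode]]:
    """Group nodes by co-index and keep groups whose concept sets differ."""
    groups: Dict[str, List[AnnotatedNode]] = defaultdict(list)
    for node in nodes:
        groups[node.coindex].append(node)
    return {
        label: members
        for label, members in groups.items()
        if len({m.concepts for m in members}) > 1
    }


# Made-up sample: "e1" is annotated inconsistently, "e2" consistently.
sample = [
    AnnotatedNode(1, "government", "e1", frozenset({"AUTHORITIES"})),
    AnnotatedNode(4, "government", "e1", frozenset({"GOVERNMENTAL-ORGANIZATION"})),
    AnnotatedNode(2, "CSA", "e2", frozenset({"AIRLINE"})),
    AnnotatedNode(5, "airline", "e2", frozenset({"AIRLINE"})),
]

flagged = inconsistent_groups(sample)
for label, members in flagged.items():
    print(label, [(m.sentence, sorted(m.concepts)) for m in members])
```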
242. Post-annotation reconciliation
- Question: How much can annotators be brought into agreement?
- Procedure
- Annotator sees all annotations, votes Yes/Maybe/No on each
- Annotators then discuss all differences (telephone conf)
- Annotators then vote again, independently
- We collapse all Yes and Maybe votes, compare them with No to identify all serious disagreement
- Result
- Annotators derive common methodology
- Small errors and oversights removed during discussion
- Inter-annotator agreement improved
- Serious problems of interpretation or error identified
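The vote-collapsing step in the procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names and the sample votes are made up, and "serious disagreement" is read here as an annotation on which the collapsed votes are split between accept (Yes or Maybe) and reject (No).

```python
from typing import Dict, List


def collapse(vote: str) -> bool:
    """Yes and Maybe both count as acceptance; No counts as rejection."""
    v = vote.strip().lower()
    if v not in {"yes", "maybe", "no"}:
        raise ValueError(f"unexpected vote: {vote!r}")
    return v != "no"


def serious_disagreements(votes: Dict[str, List[str]]) -> List[str]:
    """Return the annotations whose collapsed votes are not unanimous."""
    flagged = []
    for annotation, ballot in votes.items():
        if len({collapse(v) for v in ballot}) > 1:
            flagged.append(annotation)
    return flagged


# Made-up second-round votes from three annotators.
votes = {
    "lead=GUIDE": ["Yes", "Maybe", "Yes"],      # all accept after collapsing
    "study=REPORT": ["Yes", "No", "Maybe"],     # split: accept vs. reject
    "level=DEGREE": ["No", "No", "No"],         # all reject
}

print(serious_disagreements(votes))
```

Under this reading, a Yes/Maybe difference is treated as minor and only an accept/reject split is surfaced for further attention.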
25Annotation across Translations
- Question: How different are the translations?
- Procedure
- Annotator sees annotations across both translations, identifies differences of form and meaning
- Annotator selects true meaning(s)
- Results (work still in progress)
- Impacts ontology richness/conciseness
- Improvement in Interlingua representation depth
- Useful for IL2 design development
- Observations
- This is very hard work
- Methodology unclear: what is seen first, how to show alternatives, what to do with results
26Principal problems to date
- Proper nouns
- Proposed solution: automatically tag with one of 6 types (Person, Location, Org, DateTime, etc.)
- Noun compounds
- Alternatives: tag head only; parse and tag whole structure
- Omega is too rich
- Hard to distinguish from the others
- Granularity of concept selection
- Light verbs
- Proposed solution: rephrase to remove light verb if possible (take a shower → shower, but take a shower → ?)
- Vagueness and ambiguity
- Annotate all plausible senses (propose as Urge and Suggest)
- Idioms and metaphors
- Proposed solution: ?
27Discussion and conclusion
- Results are encouraging
- But more work must be done to solidify them
- Outcomes: how have we done?
- IL design: partly, and IL2 in the works
- Annotation methodology, manuals, tools, evals: yes
- Annotated parallel texts: approx. 150 done
- Six texts, two translations, 10-12 annotators
- Next steps
- Foreign language annotation standards and tools
- Development of IL2
- Addressing coverage gaps (1/3 of open class words marked as having no concept)
- Generation of surface structure from deep structure
- Is it possible?
28Contact information
- URLs and Wiki pages
- Project website: http://aitc.aitcnet.org/nsf/iamtc/