Language Documentation - PowerPoint PPT Presentation

Title: Language Documentation


1
Language Documentation & Archiving Activities at
the Linguistic Data Consortium
Christopher Cieri, Mark Liberman
Linguistic Data Consortium
{ccieri,myl}_at_ldc.upenn.edu
2
Model
  • LDC is an open, international, non-profit
    consortium of researchers, technology developers
    and educators in the non-profit, commercial and
    government sectors.
  • Consortium is a group of organizations, 2019 in 57
    countries
  • Staff now 44 FT and 65 PT in Philadelphia
  • 31,300 copies of 558 corpora, 30/year
  • LDC was founded to address need for stable
    repository and distribution center for linguistic
    data.
  • Organizations support Consortium through annual
    membership, data and tool contributions.
  • Members receive rights to corpora released in
    years in which they contribute
  • All memberships: use of LDC Online, discounts
    on data licenses
  • Standard membership: rights to 16 corpora at no
    extra charge
  • Subscription membership: rights to all corpora
    (except special issues), 2 copies of media at no
    extra charge
  • Much data available to non-members via licenses
  • Benefits
  • broad data distribution with uniform licensing
    across research communities
  • funding agencies avoid distribution costs
  • members receive vast amounts of data and avoid
    enormous development costs
  • development costs of many corpora 1, 2 or even 3
    orders of magnitude greater than membership fees

3
Origin & Expanding Scope
  • LDC founded to address need for stable repository
    and distribution center for linguistic data
  • Scope has expanded in response to, and
    occasionally in anticipation of, community needs.
  • 1992: LDC founded
  • Distribute linguistic data
  • Archive and provide stable repository
  • Encourage and facilitate data creation activities
  • 1995: add collections, especially telephone
    conversations
  • CallHome, CallFriend: conversational telephone
    speech in 2 dozen languages, some with
    transcripts and pronouncing lexicons
  • 1998: add annotation
  • Topic Detection and Tracking project: newswire,
    broadcast news in Arabic, English, Mandarin;
    audio transcribed, all segmented, topic labeled
  • 1999: add tool creation
  • TalkBank project
  • 2002: add large-scale project management
  • EARS, TIDES, GALE: improved speech-to-text,
    translingual information detection, extraction,
    summarization, translation and distillation in
    Arabic, English and Mandarin
  • 2005: rededicate focus on less commonly taught
    languages
  • REFLEX LCTL: resource kits for a dozen
    under-resourced languages
  • documentation of practices

4
Trends
  • General trend toward
  • greater volume of data
  • growing number of languages
  • increasingly sophisticated annotation
  • new research communities
  • Language resource production occasionally
    outpaces use
  • DARPA TIDES/EARS groups in machine translation
    and speech-to-text
  • focus shifted to source variation, annotation
    richness, quality, coordination
  • As technologies approach human performance,
    improving quality and understanding the natural
    limits of human annotation performance become
    very important
  • New communities and interdisciplinary teams adopt
    LR sharing, demand simple, adaptive access to data
    and flexible standards.
  • Worldwide computing growth increases diversity of
    languages represented, demand for technologies
    and thus LR
  • Linguists and language teachers are now staking a
    claim in what was once the domain only of human
    language technology developers.

5
LDC Activities
  • Data Collection
  • news text
  • parallel text
  • blogs
  • zines
  • newsgroups
  • broadcast news and broadcast talk
  • telephone conversation
  • meetings
  • read and prompted speech

6
LDC Activities
  • Annotation
  • transcription
  • time-alignment
  • turn and word segmentation
  • morphological
  • part-of-speech
  • gloss
  • syntactic
  • semantic
  • discourse
  • disfluency
  • topic relevance
  • identification and classification of
  • entities
  • relations
  • events
  • co-reference
  • summarization
  • translation and multiple translation

7
LDC Activities
  • Lexicon Building
  • pronunciation, morphological, translation
  • Tools
  • Transcriber
  • MultiTrans, TableTrans
  • Buckwalter Arabic Morphological Analyzer
  • BITS (Bilingual Internet Text Search), Champollion
  • XTrans: multichannel transcription
  • Infrastructure Building
  • OLAC: Open Language Archives Community
  • Annotation Graph Toolkit
  • SPHERE Utilities
  • annotation workflow systems
  • Standards and Best Practices
  • Topic Detection and Tracking v1.4, Entity
    Annotation Guidelines v2.5, Relation Annotation
    Guidelines v3.6, Simple MDE v6.2
  • Data Resource Coordination
  • common task programs, outsourcing (ELRA MED
    Center, Arabic Transcription)
  • Consulting and Training
  • Hosting and Maintaining research fora

8
Publications
  • Enron E-mails annotated for topic
  • Read/Prompted Speech in Arabic, Croatian,
    American dialects of English, Russian, Turkish
    and Urdu
  • Conversations in Levantine, Iraqi and Gulf Arabic
  • Broadcast news in Arabic, Czech, English,
    Mandarin and Korean
  • Emotional Speech in Mandarin
  • Human-Computer Dialogues in English
  • Entity Annotated Text in Arabic, Chinese, English
  • Parallel Text in Arabic-English, Chinese-English
  • Treebanks in Arabic, Chinese, English and Korean
  • Santa Barbara Corpora of Spoken American English
    III, IV
  • Field Recordings of Vervet Monkey Calls
  • FORM1 Kinematic Gestures
  • Mawukakan Lexicon
  • American National Corpus

9
LCTL
  • Monolingual Text: 500K words
  • Parallel Text: 175K words translated from the
    LCTL; 75K translated from English to the LCTL,
    including 30K of English news, 20K from
    Elicitation Corpus, 25K words from other genres
  • Bilingual Lexicon: minimum of 10K lemmas,
    targeting 90-95% token coverage over monolingual
    text
  • Encoding Converters: convert all raw text and
    lexicon encodings into the standard encoding
    selected
  • Sentence Segmenter, Word Segmenter
  • POS Tagset, Tagger and Tagged Text
  • Morphological Analyzer, Morphologically Tagged
    Text
  • Named-Entity Tagged Text, Named Entity Tagger
  • Personal Name Transliterator
  • Grammar Sketch
  • Amazigh, Bengali, Hungarian, Kurdish, Pashto,
    Punjabi, Tagalog, Tamil, Thai, Tigrigna, Urdu,
    Uzbek, Yoruba
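
The lexicon target above (token coverage over the monolingual text) can be illustrated with a minimal sketch. It assumes coverage means the share of running tokens whose form appears in the lexicon; the function name token_coverage and the toy data are hypothetical and not part of any LDC tool. Since the slide counts lemmas, a real check would likely need to lemmatize tokens first, which this sketch skips.

```python
from collections import Counter


def token_coverage(tokens, lexicon):
    """Return the fraction of running tokens whose form is in the lexicon.

    Assumption: tokens are matched by exact surface form; no
    lemmatization or normalization is applied.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Sum the frequencies of every token type found in the lexicon.
    covered = sum(n for form, n in counts.items() if form in lexicon)
    return covered / len(tokens)


# Toy data, purely illustrative (not from any LCTL kit).
tokens = "the cat sat on the mat near the door".split()
lexicon = {"the", "cat", "sat", "on", "mat"}

coverage = token_coverage(tokens, lexicon)
print(f"token coverage: {coverage:.0%}")
```

Because coverage is counted over tokens rather than types, a small number of frequent entries contributes most of the total, which is why a lexicon of around 10K lemmas can plausibly aim for a high coverage figure.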

10
Potential Collaboration
  • LDC
  • could share data from its Catalog
  • could distribute contributed data providing broad
    exposure
  • could share tools, specifications
  • could collaborate on joint production of
    resources
  • LDC looking for
  • collaborations that produce concrete results
  • resources that can be shared among members
  • expertise complementary to our experience