Archiving Strategies for Multimedia Language Documentation

1
Archiving Strategies for Multimedia Language
Documentation
Research Implication and Feasibility of
Distributed Digital Archives (Tools for
Transcription and Annotation)
Peter Wittenburg Max-Planck-Institute for
Psycholinguistics Nijmegen, The Netherlands
Archives Workshop Sydney October 2003
2
Who am I?
  • MPI for Psycholinguistics
  • about 180 people
  • experimental and observational psychologists
    linguists
  • large multimedia/multimodal corpus
  • per year about 30 expeditions, producing lots of AV tapes
  • all digital for the past 4/5 years
  • currently about 5 TB / 5000 h and > 15,000
    sessions
  • Technical Department
  • giving service / running infrastructure
  • developing methods and tools
  • participating in externally funded projects
  • IMDI (ISLE Metadata Initiative)
  • DOBES (Documentation of EL)
  • COREX (Tools for Dutch Spoken Corpus)
  • INTERA (Integrated European LR Area)
  • ECHO (European Cultural Heritage Online)

3
DOBES Teams
Chintang/Puma
Tofa
Svan/Tush
Hocank
Wichita
Salar/Monguor
Chol
Mawe
Marquesan
Lacandon
Tsafiki
Ega
Waimaa
Kuikuro
Uru-Chipaya
Trumai
Teop
Aweti
Chaco
Hai//om
!Xoo
Iwaidja
  • currently 21 docu teams working on about 30
    languages
  • 2 to 3 further rounds of funding; deadline always
    in November

4
What happened?
Short look back (EU biased)
  • 1998, LREC Granada: first workshop about sharing
    resources; at MPI started to clean up a mess; first
    metadata attempts; first scheme for MM annotations;
    TEI already around, but too complex and no tools;
    all fairly new
  • 2000, LREC Athens: start of ISLE, seed point for
    IMDI; OLAC workshop together with Nancy Ide
    (annotation structures; MD with presentation of IMDI
    roadmap); start of DOBES with many bottom-up
    discussions; lots of discussions with LE, MM, FL
  • 2002, LREC Las Palmas: established frameworks for
    annotations and MD; XML as common syntax; ATLAS,
    MPI-EAF; tools such as ELAN; ISO TC37/SC4 was formed
5
Where are we?
  • just 5 years ago!
  • better understanding of all aspects and the
    different requirements
  • for annotations and MD
  • but also annotations as relations (RDF triples)
  • but special cases like music annotation
  • but different MD sets, profiles, ...
  • different MD goals - IMDI: discovery, structuring,
    managing
  • some good tools supporting these standards
  • have built import/export modules (Shoebox, EAF) and
    converters (Word to XML)
  • have built online archives
  • some agreements about formats (JPEG?, MP3?,
    MPEG2?)
  • have seen excellent tools for naïve users
  • but
  • Enabler Workshop, INTERA and ECHO experiences:
  • very few researchers are aware and convinced enough
    to invest time
  • very few language resources are visible
  • people/institutions still live with chaos
  • many relational DBs are created; will we ever be able
    to access the data?

6
What is missing?
  • Awareness, Training,
  • Easy access
  • Easy ingest
  • Long-term strategies
  • Common management domain
  • Shared semantics (short)

7
What is missing? - Awareness
  • Awareness, Training,
  • convince people to use MD frameworks to improve
    visibility
  • convince people that much more data can be made
    open
  • publicly funded resources should in general be open
  • train people in modern techniques
  • convince people to think of exports to XML
  • make schemas available
  • offer and do much training these days

8
What is missing? Easy Access
  • Easier access to archives - metadata
  • example: MPI/DOBES metadata domain
  • excellent browser operating on XML with nice
    features
  • requires download and install: a big threshold for
    many

9
What is missing? Easy Access
  • Easier access to archives - annotations
  • example: MPI/DOBES annotations
  • excellent annotation exploitation tool
  • needs download and install, plus the resource: a big
    threshold

10
What is missing? Easy Access
  • Easier access to archives - others
  • making material available for communities (Don
    Vincent)
  • linking of end-user tools with archives
  • currently they are isolated showcases (David's
    tool)
  • should build on archive material: is it
    possible?
  • task for DELAN
  • many more resources should be open
  • need suitable management tools as pre-requisite

11
What is missing? Easy Ingest
  • Easy ingest
  • many people would contribute if others take over
    responsibility
  • adding sub-corpus via web-forms
  • request space and rights
  • define data organization of sub-corpus
    (iterative)
  • define bundles and load sets and associate with
    nodes
  • create MD templates
  • request check on completeness and ingest
  • improve metadata (add specific information)
  • again, an excellent editor is available, but it
    requires download and install
  • same should work to add annotations
  • how to keep control of quality and consistency?
  • responsibility for correct input lies with the user
  • need version control
  • are currently working on this
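
The "request check on completeness and ingest" step above could look roughly like the sketch below. This is only an illustration under stated assumptions: the `Bundle` class, the `REQUIRED_FIELDS` tuple and the function names are hypothetical and are not taken from IMDI or from the MPI tools.

```python
from dataclasses import dataclass, field

# Hypothetical required descriptors; a real MD template defines its own set.
REQUIRED_FIELDS = ("title", "language", "collector", "date")


@dataclass
class Bundle:
    """One resource bundle: a metadata record plus the files it describes."""
    node: str                                   # corpus node the bundle is attached to
    metadata: dict = field(default_factory=dict)
    files: list = field(default_factory=list)


def missing_items(bundle):
    """Return a list of readable problems; an empty list means ready to ingest."""
    problems = [
        f"missing metadata field: {name}"
        for name in REQUIRED_FIELDS
        if not str(bundle.metadata.get(name, "")).strip()
    ]
    if not bundle.files:
        problems.append("bundle contains no resource files")
    return problems


def ingest(corpus, bundle):
    """Attach the bundle to its node only if the completeness check passes."""
    if missing_items(bundle):
        return False
    corpus.setdefault(bundle.node, []).append(bundle)
    return True
```

A web form could call `missing_items` first and show the returned list to the depositor, which keeps the responsibility for correct input with the user while the archive only enforces completeness.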

12
What is missing? Long term
  • data survival primarily a matter of social
    acceptance (no control)
  • but there are some aspects we can influence
  • three relevant layers
  • physical storage layer
  • how long is data physically available?
  • encoding layer
  • how long can data be interpreted?
  • organizational layer
  • how long can we assure manageability? (not
    today)
  • must not forget: we speak about dynamic,
    online-accessible archives

13
What is missing? Long term
figure indicates our problem
various e-media
clay tablets
  • persistence times
  • hard discs 4 years ?
  • CDROMS 10 years ?
  • MPI institution 20 years ??
  • MPG society 100 years ???
  • Germany, Netherlands, ????????
  • continuous migration and replacement
  • institute copying: at MPI 2 copies, originals
    (with scientists)
  • campus copying: on campus 3rd copy
  • society copying: within MPG 4th and 5th copy
    (rsync, AFS, ...)
  • GRID copying: automatic, international copying

14
What is missing? Long term
  • copying to compensate for small media lifetimes
    and errors
  • distributing to compensate for political
    uncertainties
  • Consequences
  • need to keep the costs low
  • hide our data (no costs for DOBES for 4th/5th
    copy)
  • need efficient Data GRID technology
  • need faster lines (400 kB/s currently via hop
    NL-D)
  • need good ethical and legal agreements
  • copying and migration require URIDs
  • distinguish between object and instance
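
Copying only compensates for media errors if the copies are compared from time to time. The sketch below shows one possible way to do that; it assumes (this is not stated on the slides) that each instance of an object is a local file path and that a SHA-256 checksum is an acceptable fixity measure. The function names are hypothetical.

```python
import hashlib
from collections import Counter


def checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large AV files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_damaged_copies(copies):
    """Return the paths whose checksum differs from the majority checksum."""
    sums = {str(path): checksum(path) for path in copies}
    if not sums:
        return []
    majority, _count = Counter(sums.values()).most_common(1)[0]
    return sorted(path for path, value in sums.items() if value != majority)
```

With only two copies a mismatch cannot say which one is damaged, so a majority vote like this needs at least three instances of the same object.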

15
What is missing? Long term
011001010100001010110100101010
  • Guarantee interpretability of data independent of
    technology change
  • there is no good solution: emulation is
    expensive
  • format and encoding migration - also a cost
    problem, data loss
  • Is this really a long-term issue?
  • some argue that future generations will be smart
    enough
  • nevertheless, let's do our best now
  • including: let's take the best standard we have
    (MPEG2 for video)

16
What is missing? Long term
  • Guarantee coherent and accessible archive
  • easily find the resource bundles
  • easily discover interesting resources
  • easily manipulate within the archive (add, move,
    copy parts, ...)
  • archives are organized with IMDI principles
  • DOBES applies the immediate way
  • do MD description/organization and conversion
    to standards now
  • Is it really a long-term issue?
  • not per se - although would be helpful
  • needed for short/medium term usage
  • how long will archive organization survive?
  • (new descriptors, new technologies, ...)
  • nevertheless, let's do our best now:
    explicitness
  • knowledge about archive content will decrease

17
What is missing? Management
create distributed corpus management domain
[Diagram: four sites, each with a metadata repository (MD Rep), a
URID service (URID S), a user service (User S) and an access rights
service (AR S), plus Resources. Relation labels: "makes use of",
"is associated with", "has access rights". Shared services: URID
Resolving Service (based on Handle System?), User group Management
(LDAP ?), Access Rights Management (LDAP ?).]
18
What is missing? Management
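Different scenarios - here two examples: IMDI (Digilib)
[Diagram: message flow between CLIENT, Authentication Service, URID
Resolving Service, Resource Server and Access Rights Service]
  • CLIENT to Authentication Service: send name, passwd;
    reply: authenticated
  • CLIENT to URID Resolving Service: send URID;
    reply: send URLs
  • CLIENT to Resource Server: show me resource with URL
    (URID, UID); reply: send resource
  • Resource Server to Access Rights Service: send URID,
    UID; reply: send status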
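
The message flow of this scenario can be sketched as a handful of functions. Everything here is an assumption made for illustration: the services are plain in-memory tables, and the user names, URID and URLs are made-up sample data, not an actual IMDI or DELAN interface.

```python
class AccessDenied(Exception):
    """Raised when authentication or the access-rights check fails."""


# Made-up sample data standing in for the four services.
USERS = {"anna": "secret", "ben": "pw"}                    # Authentication Service
URID_TO_URLS = {                                           # URID Resolving Service
    "urid:demo/1": ["http://archive-a.example/r1.mpg",
                    "http://archive-b.example/r1.mpg"],
}
ACCESS_RIGHTS = {("urid:demo/1", "anna")}                  # Access Rights Service
RESOURCES = {"http://archive-b.example/r1.mpg": b"video bytes"}  # Resource Server


def authenticate(name, password):
    """Send name and passwd; get back a user ID (UID) when authenticated."""
    if name not in USERS or USERS[name] != password:
        raise AccessDenied("authentication failed")
    return name


def resolve(urid):
    """Send URID; get back the URLs of all known instances."""
    return list(URID_TO_URLS.get(urid, []))


def access_status(urid, uid):
    """Send URID and UID; get back the access status."""
    return (urid, uid) in ACCESS_RIGHTS


def resource_server(url, urid, uid):
    """Check rights with the Access Rights Service, then send the resource."""
    if not access_status(urid, uid):
        raise AccessDenied(f"{uid} may not read {urid}")
    return RESOURCES[url]


def client_fetch(name, password, urid):
    """Full client flow: authenticate, resolve, then try each instance."""
    uid = authenticate(name, password)
    for url in resolve(urid):
        if url in RESOURCES:          # skip instances that are not reachable
            return resource_server(url, urid, uid)
    raise LookupError(f"no reachable instance for {urid}")
```

Because the client asks for an object by URID and only then picks one of the returned URLs, a site that is down (the first URL in the sample data) does not block access.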
19
What is missing? Management
Different scenarios - here two examples: (IMDI) Digilib
[Diagram: DIGILIB Client, DIGILIB Server, Proxy Service, URID
Resolving Service and Resource Server. Messages shown: send DL
Request / send DL Result; send URID / send URLs (between several
components); send URL / send resource.]
20
What is missing? Management
  • Some criteria
  • one DELAN mechanism, different authorities
  • URID example
  • DELAN handle 18.4/trumai123456789
  • truly distributed: can't be dependent on one
    site/server
  • delegation principle (users/groups and AR by
    researchers)
  • services must be fast (load sharing)
  • must be safe (handling partly secret
    information)
  • have to get rid of ad hoc solutions (web forms
    and ht-access)
  • DELAN: let's do it now

[Figure: the example handle with its parts labeled: DELAN authority,
DOBES authority, DOBES ID naming convention (may be different)]
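
A handle such as the example above splits at the first "/" into a prefix (the naming authority) and a local suffix. The sketch below parses that shape. Reading "18" as the DELAN authority, "4" as the DOBES authority and the remainder as a DOBES-style local ID is an assumption based on the figure labels; `Handle` and `parse_handle` are hypothetical names.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    authorities: tuple   # dot-separated prefix segments, e.g. ("18", "4")
    local_id: str        # everything after the first "/"


def parse_handle(text):
    """Split 'prefix/suffix' into authority segments and a local ID."""
    prefix, separator, suffix = text.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix handle: {text!r}")
    return Handle(tuple(prefix.split(".")), suffix)


example = parse_handle("18.4/trumai123456789")
# Assumed reading: example.authorities[0] is the DELAN authority,
# example.authorities[1] the DOBES authority, example.local_id the DOBES ID.
```

Keeping the authority segments separate from the local ID is what allows one mechanism with different authorities: each authority can choose its own naming convention for the suffix.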
21
What is missing? Semantics
  • Need a domain of shared and re-usable semantics
  • currently hard-wired transformation between OLAC
    and IMDI
  • better
  • all term definitions in open repositories
    according to ISO 11179
  • all relations in open repositories
  • all in RDF/OWL standards
  • MPI could state
  • someone does not like this and realizes his
    dream
  • all OLAC and IMDI terms will soon be fully defined
    this way
  • (INTERA project)

rdfs:isSubpartOf: IMDI:Collector, DC:Creator
owl:isEqualClass: IMDI:Collector, DC:Creator
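
The idea that different parties can publish different relations between the same terms can be sketched with plain triples. The relation names are copied from the slide as strings and are not checked against the actual RDFS/OWL vocabularies; the direction of each triple (IMDI term first) and the names `MPI_RELATIONS`, `OTHER_RELATIONS` and `related_terms` are assumptions for illustration.

```python
# Each relation repository is a list of (subject, relation, object) triples.
MPI_RELATIONS = [("IMDI:Collector", "rdfs:isSubpartOf", "DC:Creator")]
OTHER_RELATIONS = [("IMDI:Collector", "owl:isEqualClass", "DC:Creator")]


def related_terms(term, relations, accepted):
    """Terms reachable from `term` through any of the accepted relations."""
    return [obj for subj, rel, obj in relations
            if subj == term and rel in accepted]


# A broad search accepts both kinds of relation; a strict one only equality.
BROAD = {"rdfs:isSubpartOf", "owl:isEqualClass"}
STRICT = {"owl:isEqualClass"}

broad_hits = related_terms("IMDI:Collector", MPI_RELATIONS, BROAD)
strict_hits = related_terms("IMDI:Collector", MPI_RELATIONS, STRICT)
```

Swapping `MPI_RELATIONS` for `OTHER_RELATIONS` changes the mapping without touching the search code, which is the contrast with a hard-wired OLAC to IMDI transformation.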
22
Is something exciting?
  • nothing except the semantics issue is scientifically
    exciting
  • exciting is the perspective to create joint
    world-wide service domains
  • it's a lot of work
  • good engineering work required, no quick shot
  • systems have to run smoothly
  • create dependencies
  • there is now a short time slot; we can miss it
  • looking back at 4/5 years of work at MPI
  • IMDI creation was exciting (community process,
    tools)
  • ELAN work was exciting (annotation formats)
  • but biggest jobs
  • maintaining a consistent organized large
    dynamic corpus
  • improving accessibility at various levels

23
End
Thanks