Title: Archiving Strategies for Multimedia Language Documentation
1 Archiving Strategies for Multimedia Language Documentation
Research Implication and Feasibility of Distributed Digital Archives (Tools for Transcription and Annotation)
Peter Wittenburg
Max-Planck-Institute for Psycholinguistics, Nijmegen, The Netherlands
Archives Workshop Sydney October 2003
2 Who am I?
- MPI for Psycholinguistics
  - about 180 people
  - experimental and observational psychologists, linguists
- large multimedia/multimodal corpus
  - about 30 expeditions per year, so lots of AV tapes
  - all digital for the past 4-5 years
  - currently about 5 TB / 5000 h and more than 15,000 sessions
- Technical Department
  - giving service / running infrastructure
  - developing methods and tools
  - participating in externally funded projects
    - IMDI (ISLE Metadata Initiative)
    - DOBES (Documentation of Endangered Languages)
    - COREX (Tools for Dutch Spoken Corpus)
    - INTERA (Integrated European LR Area)
    - ECHO (European Cultural Heritage Online)
3 DOBES Teams
Chintang/Puma
Tofa
Svan/Tush
Hocank
Wichita
Salar/Monguor
Chol
Mawe
Marquesan
Lacandon
Tsafiki
Ega
Waimaa
Kuikuro
Uru-Chipaya
Trumai
Teop
Aweti
Chaco
Hai//om
!Xoo
Iwaidja
- currently 21 documentation teams working on about 30 languages
- 2 to 3 further rounds of funding; the deadline is always in November
4 What happened?
Short look back (EU biased)
- 1998: LREC Granada
  - first workshop about sharing resources
  - at MPI started to clean up a mess
  - first metadata attempts
  - first scheme for multimedia annotations
  - TEI already around: too complex, no tools
  - all fairly new
- 2000: LREC Athens
  - start of ISLE, seed point for IMDI
  - OLAC workshop together with Nancy Ide (annotation structures, MD), with presentation of IMDI roadmap
  - start of DOBES with many bottom-up discussions
  - lots of discussions with LE, MM, FL
- 2002: LREC Las Palmas
  - established frameworks for annotations and MD
  - XML as common syntax
  - ATLAS, MPI-EAF
  - tools such as ELAN
  - ISO TC37/SC4 was formed
5 Where are we?
- just 5 years ago!
- better understanding of all aspects and the different requirements
  - for annotations and MD
  - but also annotations as relations (RDF triples)
  - but special cases like music annotation
  - but different MD sets, profiles, ...
  - different MD goals (IMDI: discovery, structuring, managing)
- some good tools supporting these standards
- have built import/export modules (Shoebox, EAF) and converters (Word to XML)
- have built online archives
- some agreements about formats (JPEG?, MP3?, MPEG2?)
- have seen excellent tools for naïve users
- but
  - Enabler Workshop, INTERA, ECHO experiences
  - very few researchers are aware and convinced to invest time
  - very few language resources are visible
  - people/institutions still live with chaos
  - many relational databases are created: will we ever be able to access the data?
6 What is missing?
- Awareness, Training
- Easy access
- Easy ingest
- Long-term strategies
- Common management domain
- Shared semantics (short)
7 What is missing? Awareness
- Awareness, Training
  - convince people to use MD frameworks to improve visibility
  - convince people that much more data can be made open
  - publicly funded resources should in general be open
  - train people in modern techniques
  - convince people to think of exports to XML
  - make schemas available
  - offer and do much training these days
8 What is missing? Easy Access
- Easier access to archives: metadata
  - example: MPI/DOBES metadata domain
  - excellent browser operating on XML with nice features
  - requires download and install: a big threshold for many
9 What is missing? Easy Access
- Easier access to archives: annotations
  - example: MPI/DOBES annotations
  - excellent annotation exploitation tool
  - needs download, install and resource: a big threshold
10 What is missing? Easy Access
- Easier access to archives: others
  - making material available for communities (Don Vincent)
  - linking of end-user tools with archives
    - currently they are isolated showcases (David's tool)
    - should build on archive material: is it possible?
    - task for DELAN
  - many more resources should be open
    - need suitable management tools as a prerequisite
11 What is missing? Easy Ingest
- Easy ingest
  - many people would contribute if others take over responsibility
  - adding a sub-corpus via web forms
    - request space and rights
    - define data organization of sub-corpus (iterative)
    - define bundles and load sets and associate them with nodes
    - create MD templates
    - request check on completeness and ingest
  - improve metadata (add specific information)
    - again, an excellent editor is available, but it requires download and install
  - same should work to add annotations
  - how to keep control of quality and consistency
    - responsibility for correct input lies with the user
    - need version control
  - are currently working on this
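The "request check on completeness" step in the ingest workflow can be pictured with a minimal Python sketch. It assumes a bundle is a metadata description plus a list of resource files; the `Bundle` class and the required field names (`title`, `language`, `collector`) are hypothetical illustrations and are not taken from the IMDI schema.

```python
from dataclasses import dataclass, field

# Hypothetical required descriptors; a real archive would derive these
# from its own MD template rather than from this hard-coded tuple.
REQUIRED_FIELDS = ("title", "language", "collector")


@dataclass
class Bundle:
    """One session bundle: a metadata description plus its resource files."""
    name: str
    metadata: dict = field(default_factory=dict)
    resources: list = field(default_factory=list)


def completeness_problems(bundle):
    """Return a list of problems; an empty list means ready for ingest."""
    problems = [
        f"missing metadata field: {key}"
        for key in REQUIRED_FIELDS
        if not bundle.metadata.get(key)
    ]
    if not bundle.resources:
        problems.append("no resource files attached")
    return problems
```

A web form could show the returned list to the contributor, which keeps responsibility for correct input with the user while the archive only checks the formal minimum.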
12 What is missing? Long term
- data survival is primarily a matter of social acceptance (no control)
- but there are some aspects we can influence
- three relevant layers
  - physical storage layer
    - how long is data physically available?
  - encoding layer
    - how long can data be interpreted?
  - organizational layer
    - how long can we assure manageability? (not today)
- must not forget
  - we speak about dynamic, online accessible archives
13 What is missing? Long term
figure indicates our problem: various e-media, clay tablets
- persistence times
  - hard discs: 4 years ?
  - CD-ROMs: 10 years ?
  - MPI (institution): 20 years ??
  - MPG (society): 100 years ???
  - Germany, Netherlands, ...: ????????
- continuous migration and replacement
  - institute: copying at MPI, 2 copies, originals (with scientists)
  - campus: copying on campus, 3rd copy
  - society: copying within MPG, 4th and 5th copy (rsync, AFS, ...)
  - GRID: automatic, international copying
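Keeping several copies only helps if damaged copies are noticed. The following is a minimal Python sketch of such a check, comparing SHA-256 checksums of each copy against the original. It is not how rsync or AFS work internally, and the function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large media files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def divergent_copies(original, copies):
    """Return the copies that are missing or whose content differs from the original."""
    reference = sha256_of(original)
    bad = []
    for copy in copies:
        if not Path(copy).is_file() or sha256_of(copy) != reference:
            bad.append(copy)
    return bad
```

Any path returned by `divergent_copies` would be a candidate for re-copying from a good instance.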
14 What is missing? Long term
- copying to compensate for short media lifetimes and errors
- distributing to compensate for political uncertainties
- Consequences
  - need to keep the costs low
    - hide our data (no costs for DOBES for 4th/5th copy)
  - need efficient Data GRID technology
  - need faster lines (400 kB/s currently via hop NL-D)
  - need good ethical and legal agreements
  - copying and migration require URIDs
    - distinguish between object and instance
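Why faster lines are needed can be seen from a rough calculation: how long would one full copy of the roughly 5 TB archive take at 400 kB/s? The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 kB = 1000 bytes) and a line that sustains the full rate without protocol overhead, so it is a lower bound rather than a measurement.

```python
ARCHIVE_BYTES = 5 * 10**12       # about 5 TB, decimal units assumed
LINE_BYTES_PER_S = 400 * 10**3   # 400 kB/s via the NL-D hop, decimal units assumed
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, rate_bytes_per_s):
    """Days needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


days = transfer_days(ARCHIVE_BYTES, LINE_BYTES_PER_S)
print(f"{days:.1f} days")  # prints "144.7 days"
```

Under these assumptions a single full copy ties up the line for almost five months, before counting new material added in the meantime.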
15 What is missing? Long term
011001010100001010110100101010
- Guarantee interpretability of data independent of technology change
  - there is no good solution: emulation is expensive
  - format and encoding migration: also a cost problem, data lost
- Is this really a long-term issue?
  - some argue that future generations will be smart enough
  - nevertheless, let's do our best now
  - including: let's take the best standard we have (MPEG2 for video)
16 What is missing? Long term
- Guarantee coherent and accessible archive
  - easily find the resource bundles
  - easily discover interesting resources
  - easily manipulate within the archive (add, move, copy parts, ...)
- archives are organized with IMDI principles
- DOBES applies the immediate way
  - do MD description/organization and conversion to standards now
- Is it really a long-term issue?
  - not per se, although it would be helpful
  - needed for short/medium term usage
  - how long will archive organization survive?
  - (new descriptors, new technologies, ...)
  - nevertheless, let's do our best now: explicitness
  - knowledge about archive content will decrease
17 What is missing? Management
create distributed corpus management domain
Diagram (labels only):
- URID Resolving Service (based on Handle System?)
- User group Management (LDAP ?)
- Access Rights Management (LDAP ?)
- four instances each of: MD Rep, URID S, User S, AR S
- Resources
- relations: "makes use of", "is associated with", "has access rights"
18 What is missing? Management
Different scenarios, here two examples: IMDI (Digilib)
Diagram (labels only):
- participants: CLIENT, Authentication Service, URID Resolving Service, Access Rights Service, Resource Server
- messages:
  - send name, passwd / authenticated
  - send URID / send URLs
  - show me resource with URL (URID, UID) / send resource
  - send URID, UID / send status
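The message exchange on this slide can be sketched as plain Python functions. Which participant sends which message is an assumed reading of the diagram labels (client to authentication, client to URID resolver, client to resource server, resource server to access rights service). All users, URIDs, URLs and data below are invented, and the four services are reduced to in-memory dictionaries.

```python
# Illustrative stand-ins for the four services; all content is made up.
USERS = {"alice": "secret"}                          # Authentication Service
URID_TO_URLS = {                                     # URID Resolving Service
    "urid:demo/1": ["http://archive-a.example/r1",
                    "http://archive-b.example/r1"],
}
ACCESS_RIGHTS = {("urid:demo/1", "alice")}           # Access Rights Service
RESOURCES = {                                        # Resource Server(s)
    "http://archive-a.example/r1": b"media bytes",
    "http://archive-b.example/r1": b"media bytes",
}


def authenticate(name, passwd):
    """send name, passwd -> authenticated (returns the UID or None)."""
    return name if USERS.get(name) == passwd else None


def resolve(urid):
    """send URID -> send URLs (one object, possibly several instances)."""
    return URID_TO_URLS.get(urid, [])


def has_access(urid, uid):
    """send URID, UID -> send status."""
    return (urid, uid) in ACCESS_RIGHTS


def serve(url, urid, uid):
    """show me resource with URL (URID, UID) -> send resource."""
    if not has_access(urid, uid):
        raise PermissionError(f"{uid} may not access {urid}")
    return RESOURCES[url]


def fetch(name, passwd, urid):
    """Client side: authenticate, resolve the URID, then request the first URL."""
    uid = authenticate(name, passwd)
    if uid is None:
        raise PermissionError("authentication failed")
    urls = resolve(urid)
    if not urls:
        raise LookupError(f"unknown URID: {urid}")
    return serve(urls[0], urid, uid)
```

Because the resolver returns several URLs for one URID, a client could fall back to another instance if the first site is unreachable; the sketch simply takes the first one.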
19 What is missing? Management
Different scenarios, here two examples: (IMDI) Digilib
Diagram (labels only):
- participants: DIGILIB Client, DIGILIB Server, Proxy Service, URID Resolving Service, Resource Server
- messages:
  - send DL Request / send DL Result
  - send URL / send resource
  - send URID / send URLs (shown three times)
20 What is missing? Management
- Some criteria
  - one DELAN mechanism, different authorities
  - URID example
    - DELAN handle 18.4/trumai123456789
  - truly distributed: can't be dependent on one site/server
  - delegation principle (users/groups and AR by researchers)
  - services must be fast (load sharing)
  - must be safe (handling partly secret information)
  - have to get rid of ad hoc solutions (web forms and .htaccess)
- DELAN: let's do it now
Labels on the handle example: DELAN authority, DOBES authority, DOBES ID naming convention (may be different)
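The handle example can be taken apart with a few lines of Python. Handle System identifiers have the form prefix/suffix, split at the first slash. Reading the two prefix segments of `18.4` as the DELAN and DOBES authorities is an assumption based on the slide's labels, and `parse_handle` is a hypothetical helper, not part of any Handle System library.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    prefix: str          # naming authority part, e.g. "18.4"
    authorities: tuple   # prefix segments, e.g. ("18", "4")
    suffix: str          # local name chosen by the authority


def parse_handle(handle):
    """Split a handle at the first '/', since the suffix may itself contain slashes."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix handle: {handle!r}")
    return Handle(prefix, tuple(prefix.split(".")), suffix)


parsed = parse_handle("18.4/trumai123456789")
print(parsed.authorities, parsed.suffix)
```

Delegation then means that each authority only has to guarantee uniqueness of the part it issues: the local name under its own prefix.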
21 What is missing? Semantics
- Need a domain of shared and re-usable semantics
  - currently hard-wired transformation between OLAC and IMDI
  - better
    - all term definitions in open repositories according to ISO 11179
    - all relations in open repositories
    - all in RDF/OWL standards
  - MPI could state: IMDI:Collector rdfs:isSubpartOf DC:Creator
  - someone does not like this and realizes his dream: IMDI:Collector owl:isEqualClass DC:Creator
  - all OLAC and IMDI terms will soon be fully defined this way (INTERA project)
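The difference between a hard-wired transformation and relations kept in a repository can be sketched in Python without any RDF library. Relations are plain (subject, predicate, object) triples, and the predicate name is the slide's informal label; in the standard vocabularies the closest built-in terms would probably be `rdfs:subPropertyOf` and `owl:equivalentProperty`. The record content and function names are invented for illustration.

```python
# A tiny stand-in for an open relation repository.
RELATIONS = [
    ("IMDI:Collector", "rdfs:isSubpartOf", "DC:Creator"),
]


def broader_terms(term, relations):
    """Follow narrower-to-broader relations transitively, in discovery order."""
    found, frontier = [], [term]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in relations:
            if subj == current and pred == "rdfs:isSubpartOf" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


def translate(record, relations):
    """Rename each field to the last broader term reached; keep unknown fields."""
    out = {}
    for key, value in record.items():
        broader = broader_terms(key, relations)
        out[broader[-1] if broader else key] = value
    return out


record = {"IMDI:Collector": "A. Researcher", "IMDI:Title": "Session 1"}
print(translate(record, RELATIONS))
```

Someone who prefers a different mapping only swaps the relation list; the translation code itself stays untouched, which is the point of moving relations out of the converter.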
22 Is something exciting?
- nothing except the semantics issue is scientifically exciting
- exciting is the perspective to create joint world-wide service domains
  - it's a lot of work
  - good engineering work required, no quick shot
  - systems have to run smoothly
  - create dependencies
  - there is now a short time slot: we can miss it
- looking back at 4/5 years of work at MPI
  - IMDI creation was exciting (community process, tools)
  - ELAN work was exciting (annotation formats)
  - but biggest jobs
    - maintaining a consistent, organized, large dynamic corpus
    - improving accessibility at various levels
23 End
Thanks