Possibilities for Broadly Shared, Crosslinked Metadata - PowerPoint PPT Presentation

Slides: 14
Provided by: stephaniem8
Transcript and Presenter's Notes

1
Possibilities for Broadly Shared, Cross-linked
Metadata
  • Christopher Cieri
  • ccieri _at_ ldc.upenn.edu
  • University of Pennsylvania
  • Linguistic Data Consortium

2
LDC Catalog Metadata
  • ISO 639-3
  • includes sample

3
Issues in LDC Metadata
  • Author
  • What exactly is an LR author? Rule versus practice
  • older and external corpora
  • Title
  • The Data Corpus
  • Catalog Number: LDCYYYYSTLEG
  • ISBN: standard reference, similarity to other
    published work
  • Language: ISO 639-3 code, plus language name
  • Taiwanese Putonghua
  • Member Year: when produced, available to members
    at no charge (n/c)
  • Type: speech, text, transcripts, lexicon
  • Source: broadcast, conversation, news text
  • Research Project: sponsoring, benefitting
  • Recommended Applications: education, issues with
    reuse
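
The fields on this slide can be pictured as one catalog record. What follows is a minimal sketch in Python, not LDC's actual schema: the class name `CatalogRecord`, every field name, and all sample values are hypothetical. The only borrowed facts are the field list above and `cmn`, the ISO 639-3 code for Mandarin Chinese. Pairing that code with the name "Taiwanese Putonghua" is one reading of why the slide keeps a language name next to the code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogRecord:
    """Hypothetical record mirroring the slide's fields (not LDC's schema)."""
    catalog_number: str
    title: str
    authors: List[str]
    language_codes: List[str]          # ISO 639-3 codes
    language_names: List[str]          # free-text names, finer than the code
    member_year: Optional[int] = None  # year the corpus was produced
    isbn: Optional[str] = None
    data_types: List[str] = field(default_factory=list)   # speech, text, ...
    sources: List[str] = field(default_factory=list)      # broadcast, ...
    research_projects: List[str] = field(default_factory=list)
    recommended_applications: List[str] = field(default_factory=list)


# Placeholder values only; none of these describe a real catalog entry.
example = CatalogRecord(
    catalog_number="PLACEHOLDER-0001",
    title="Placeholder Broadcast Speech Corpus",
    authors=["Placeholder Author"],
    language_codes=["cmn"],
    language_names=["Taiwanese Putonghua"],
    data_types=["speech", "transcripts"],
    sources=["broadcast"],
)
```

The code alone cannot express the variety, which is why the sketch keeps the name as a separate field.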

4
Current Searching
  • Many users comment that this interface is too
    complicated. They want(ed) a single field with
    Google style search and ordering of results.
  • Compliance with the controlled vocabulary is an
    ongoing struggle.
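
The single-field search that users ask for can be sketched as a keyword match over all metadata fields at once, with results ordered by a simple score. This is only an illustration under stated assumptions: the function name `search`, the dictionary layout, and the sample records are hypothetical, and counting term occurrences is a stand-in for whatever ranking a real search engine would use.

```python
from typing import Dict, List


def search(records: List[Dict[str, str]], query: str) -> List[Dict[str, str]]:
    """Return records matching any query term, best match first."""
    terms = query.lower().split()
    scored = []
    for rec in records:
        # Treat every field as one searchable text, as a single box would.
        text = " ".join(str(value) for value in rec.values()).lower()
        score = sum(text.count(term) for term in terms)
        if score > 0:
            scored.append((score, rec))
    # Highest score first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored]


# Hypothetical records for illustration only.
RECORDS = [
    {"title": "Placeholder News Text", "type": "text", "source": "news text"},
    {"title": "Placeholder Broadcast Speech", "type": "speech",
     "source": "broadcast"},
    {"title": "Placeholder Lexicon", "type": "lexicon", "source": ""},
]

hits = search(RECORDS, "news text")
```

Because the query is matched against free text rather than controlled vocabulary fields, this style of search also tolerates records that do not fully comply with the vocabulary.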

5
OLAC
  • OLAC: Open Language Archives Community
  • 37 Archives including ELDA & LDC (also on
    steering committees)
  • 49,063 Resources including corpora, lexica,
    papers, advice
  • Data Providers
  • static uploads
  • regular data harvest
  • Service Providers
  • customized views on metadata
  • specific selections of metadata
  • custom searches
  • Advantages: standardization, broader exposure,
    easier search
  • Disadvantages: slower change, metadata by
    committee
  • Missing fields relevant to large data centers
  • license
  • cost
  • not included in original OLAC metadata set
  • subsequently added as proposed extensions under
    Net-DC

6
5 OLAC Gold Stars
  • metadata validation, archive rating service
  • LDC routinely rated 3/5 stars; when rating
    dropped to 2, we decided to improve
  • analysis revealed simple repeating errors
  • missing agent, date, DCMI type
  • gold star was easy to acquire
  • standards should probably toughen
  • QCed language codes
  • Also converted to ISO 639-3
  • OLAC confirms code is present & possible
  • but not that it is accurate
  • LDC linguist reviewed every publication
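
The difference between "present and possible" and "accurate" can be shown with a small checker. This is a sketch under assumptions, not OLAC's validator: the field names (`agent`, `date`, `dcmi_type`, `language_code`), the function `check_record`, and the tiny `KNOWN_CODES` set are hypothetical. `eng` and `ben` are the ISO 639-3 codes for English and Bengali.

```python
import re
from typing import Dict, List, Set

REQUIRED_FIELDS = ("agent", "date", "dcmi_type")  # the errors named above
CODE_SHAPE = re.compile(r"^[a-z]{3}$")            # ISO 639-3 codes: 3 letters
KNOWN_CODES: Set[str] = {"eng", "ben"}            # tiny stand-in code table


def check_record(record: Dict[str, str], known_codes: Set[str]) -> List[str]:
    """List mechanical problems; says nothing about whether values are true."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not record.get(name):
            problems.append(f"missing {name}")
    code = record.get("language_code", "")
    if not code:
        problems.append("missing language_code")
    elif not CODE_SHAPE.match(code) or code not in known_codes:
        problems.append(f"impossible language_code: {code}")
    return problems


# A Bengali corpus mislabeled as English passes every mechanical check.
mislabeled = {"agent": "Placeholder", "date": "2009", "dcmi_type": "Sound",
              "language_code": "eng"}
incomplete = {"date": "2009", "language_code": "zz"}
```

Here `check_record(mislabeled, KNOWN_CODES)` reports nothing, which is the gap a human review of every publication closes.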

7
Issues in Current Standards
  • Actual use often reveals issues
  • Mawukakan
  • autonym Mawukakan missing from ISO 639-3 &
    Ethnologue. French-based Mahou is reference name
  • alternatives are Maou, Mau, Mahu, Mauka, and
    Mauke.
  • See change request: http://www.sil.org/ISO639-3/cr_files/2009-001.pdf
  • Amazighe
  • autonym missing from ISO 639-3 & Ethnologue
  • names for individual dialects present in both
    but marked as not in use in Ethnologue 15th
    edition.
  • exonym Berber used in family tree
  • though we've known the preferred name for some
    time
  • OED entry for Berber quotes Latham (1854): "The
    Amazirg tongues are often called Berber."
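
One way a catalog could cope with a missing autonym is to keep its own index from every known name to the reference name. The sketch below assumes a hypothetical entry layout and hypothetical helper names (`build_name_index`, `resolve`). The names themselves come from this slide, and listing Mawukakan as an autonym reflects the slide, not the standard.

```python
from typing import Dict, List, Optional


def build_name_index(entries: List[dict]) -> Dict[str, str]:
    """Map every known name (case-insensitively) to its reference name."""
    index: Dict[str, str] = {}
    for entry in entries:
        names = [entry["reference_name"],
                 *entry.get("autonyms", []),
                 *entry.get("alternates", [])]
        for name in names:
            index[name.casefold()] = entry["reference_name"]
    return index


def resolve(index: Dict[str, str], name: str) -> Optional[str]:
    """Return the reference name for any listed name, or None if unknown."""
    return index.get(name.strip().casefold())


ENTRIES = [
    {
        "reference_name": "Mahou",
        "autonyms": ["Mawukakan"],  # locally added; absent from the standard
        "alternates": ["Maou", "Mau", "Mahu", "Mauka", "Mauke"],
    },
]

NAME_INDEX = build_name_index(ENTRIES)
```

With this index, a search for the autonym still reaches the record filed under the reference name, while a name that nobody has registered resolves to nothing.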

8
Issues in Current Standards

9
Specifications
  • related to ORI goals
  • definition (EAGLES, ISLE)
  • versioned, published
  • promotes good science
  • currently (only) linked to data matrix records
    for some projects

10
LR (Harvest) Wiki

11
LR Wiki, Bengali
  • pre-cataloging, ex-catalog exercise
  • unique editor by language
  • for LCTLs, the wiki supports the location and
    evaluation of LRs
  • for well (better) represented languages, the wiki
    supports evaluation and selection of LRs

12
Timeline
  • LDC & ELDA join OLAC
  • LDC creates internal LCTL Harvest Wiki
  • LDC adds metadata for specification in sponsored
    projects
  • ELDA announces Universal Catalog initiative
  • NICT begins mining and mirroring ELDA and LDC
    Catalogs
  • ELDA, NICT, LDC discuss joint cataloging
  • LDC creates external LR (Harvest) Wiki
  • LDC begins corpus citation index

13
Proposed Collaborations
  • Universal Catalog with cross-linked records
  • LRs released by members & others (OLAC, ELDA,
    NICT)
  • corpora and lexicons but also (sources of) raw
    data (LDC)
  • tools
  • specifications (LDC)
  • papers about LRs, papers that somehow use LRs
    (OLAC)
  • Records created
  • manually as members register LRs
  • seeded through one-time projects
  • LR Citation Index (cf LREC 2010 Map), LR Wiki
  • submitted by providers through web form or
    structured file (OLAC)
  • shows penetration of LRs
  • provides publicity for paper authors
  • eventually becomes a useful index
  • mined from the web (cf NICT), optionally also
    harvested (need name space transfer lexicon)
  • Flexible Search (OLAC)
  • Boolean search, Google, graphical
  • Using standards as appropriate, not captive to
    them
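
The cross-linked records proposed here can be sketched as a small graph in which corpora, papers, and specifications are records and each link is stored in both directions. Everything in this sketch is an assumption made for illustration: the `Catalog` class, the relation names (`uses`, `used_by`, `specifies`, `specified_by`), and the record identifiers are hypothetical and describe no existing catalog.

```python
from typing import Dict, List, Set, Tuple


class Catalog:
    """Toy universal catalog: typed records joined by two-way links."""

    def __init__(self) -> None:
        self.kinds: Dict[str, str] = {}                    # id -> record kind
        self.links: Dict[str, Set[Tuple[str, str]]] = {}   # id -> (rel, id)

    def add(self, record_id: str, kind: str) -> None:
        self.kinds[record_id] = kind
        self.links.setdefault(record_id, set())

    def link(self, source: str, relation: str,
             target: str, inverse: str) -> None:
        """Record source -relation-> target and the inverse link back."""
        for record_id in (source, target):
            if record_id not in self.kinds:
                raise KeyError(f"unknown record: {record_id}")
        self.links[source].add((relation, target))
        self.links[target].add((inverse, source))

    def related(self, record_id: str, relation: str) -> List[str]:
        found = self.links.get(record_id, set())
        return sorted(t for rel, t in found if rel == relation)


catalog = Catalog()
catalog.add("corpus:placeholder-1", "corpus")
catalog.add("paper:placeholder-1", "paper")
catalog.add("paper:placeholder-2", "paper")
catalog.add("spec:placeholder-1", "specification")
catalog.link("paper:placeholder-1", "uses", "corpus:placeholder-1", "used_by")
catalog.link("paper:placeholder-2", "uses", "corpus:placeholder-1", "used_by")
catalog.link("spec:placeholder-1", "specifies",
             "corpus:placeholder-1", "specified_by")
```

Asking for `catalog.related("corpus:placeholder-1", "used_by")` then lists the papers that use the corpus, which is the lookup a citation index would serve.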