The Rosetta Project ALL Language Archive - PowerPoint PPT Presentation

About This Presentation
Slides: 41
Provided by: laurab172

Transcript and Presenter's Notes

1
The Rosetta Project ALL Language Archive
Presented by Laura Buszard-Welcher, The Rosetta
Project / University of California, Berkeley
  • A Project of the Long Now Foundation
  • A National Science Digital Library
  • www.rosettaproject.org

2
Primary Goals
  • Support the documentation of the world's nearly
    7000 languages through building:
  • A digital archive of language documentation
  • A linguistically sophisticated site that is also
    useful and interesting for the general public
  • Networks of speakers, educators, linguists
  • Contributes to the effort to document endangered
    languages
  • Promotes linguistic diversity by educating the
    public about languages with small numbers of
    speakers.

3
Secondary Goals
  • Support metadata standardization and
    interoperability
  • OLAC
  • EMELD
  • Develop tools for collaborative linguistic
    research
  • Endangered Language Query Room
  • Wordlist Tool
  • Collaborative document editing/creation (new site)

4
Roles
  • The Long Now Foundation
  • Parent organization of The Rosetta Project
  • Projects, seminars on topics that foster long
    term thinking
  • The National Science Digital Library
  • U.S. National Science Foundation Program
  • Goal is to bring online high quality STEM
    (Science, Technology, Engineering, and Math)
    resources for education
  • Sponsor of Rosetta Project (NSF 333727)
  • Stanford University
  • Online and offline storage of Rosetta materials

5
The Long Now Foundation
6
The National Science Digital Library
7
Stanford University Libraries
8
Project History: The 1000 Language Archive
  • Initiated by The Long Now Foundation
  • Wanted to experiment with new microetching
    technology, looking for suitable content
  • Decided to collect basic descriptive information
    for 1000 of the world's approximately 7000
    languages

9
Why language information?
  • Most natural human languages are products of
    millennia of human history (therefore a good long
    term thinking project)
  • Repositories of cultural information
  • Languages showcase
  • Human intellectual sophistication
  • Cultural diversity
  • To draw attention to the critical issue of
    language endangerment

10
The Rosetta Disk
  • Next generation microfiche
  • Micro-etched 2" nickel disk at densities of up to
    200,000 page images per disk
  • Developed by Los Alamos Laboratories and Norsam
    Technologies
  • Reading the disk requires a microscope, either
    optical or electron, depending on the density of
    encoding

11
The Rosetta Stone
  • Not us! (196 BC)
  • Parallel text written in three scripts
  • Hieroglyphic
  • Demotic (script form)
  • Greek
  • The key to deciphering Egyptian Hieroglyphs

12
Rosetta Stone Language Learning Software (Also
not us!)
13
Design of the Disk
  • Original design has human-eye readable text
    (Genesis text) and micro-etched text inside an
    index
  • New design has human-eye readable text
    (instructions) on one side and microetched images
    on the reverse

14
In-House Scanning
  • HP CapShare Scanners
  • Scan printed page in multiple passes, any
    direction
  • Page is assembled into one image
  • Stores about 50 pages at a time (300 dpi bitmap
    .tif)
  • Uploads numerically sequenced images to computer
    by infrared port

15
In-House Scanning
  • Minolta PS 7000 Overhead
  • Bitmap and grayscale scans up to 600 dpi
  • Multiple sizes, orientations
  • Single page / double page spread (good for text
    collections with verso annotations)
  • Best for fragile books, manuscripts that would be
    damaged by hand scanning
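
To put the scan resolutions on the two scanning slides in perspective, here is a rough sketch that estimates the uncompressed size of one page image. The page size (US Letter), the 8 bits per pixel for grayscale, and the absence of compression are assumptions for illustration; none of them is stated in the slides.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each pixel row is padded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed page size: US Letter, 8.5 x 11 inches (not given in the slides).
PAGE = (8.5, 11.0)

for dpi in (300, 600):
    bitmap = raw_scan_bytes(*PAGE, dpi, bits_per_pixel=1)  # 1-bit bitmap
    gray = raw_scan_bytes(*PAGE, dpi, bits_per_pixel=8)    # assumed 8-bit grayscale
    print(f"{dpi} dpi: bitmap {bitmap / 1e6:.1f} MB, grayscale {gray / 1e6:.1f} MB")
```

Under these assumptions, doubling the resolution roughly quadruples the raw size, and grayscale is about eight times larger than bitmap at the same resolution.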

16
Categories of Collection (1)
17
Categories of Collection (2)
18
Language Curation
19
Rosetta Project Web Site
  • Welcome
  • Search for a language
  • Language overview page
  • Browse (by name, family, country)
  • Wordlist tool

20
Welcome
21
Search
22
Language Overview
23
Browse
24
Projects
  • Endangered Language Query Rooms
  • Digital Online Curation Services for Endangered
    Language Archives (DOCS)
  • Wordlist Tool
  • LangGator

25
Endangered Language Query Rooms
http://emeld.rosettaproject.org/
26
Query Room Virtual Keyboard
27
Potawatomi Query Room
Re: Bozho, by Donald Perrot (host) on July 9 2004, 8:53 PM
Nmedagwe'ndan e'gi nebye'ge'yen ngom. Neaseno ndesh ne kas ge' nin, mine E'shkanabe' e'nda ge' nin.
I like what you have written. I am called Neaseno (Southwind) myself, and I live in Escanaba, MI.
Re: Bozho, by Justin Neely on September 7 2004, 1:16 PM
Bozho Neaseno mine Lameen Zagnenibi ndeznekas. Nishnabe ndaw ipi Bodewadmi ndaw. Shi shi ban nee yek ndebendagwes. Zego ndotem. Kansas City,Mo ndoch bya. Eskanabe edayen ge nin. Bama pi ngom Zagnenibi ndeznekas
Hello Neaseno and Lameen, my name is Zagnenibi. I'm Native and Potawatomi. I belong to the Citizen Band. I'm Crane Clan. I'm from Kansas City, Missouri. I also live in Escanaba. Bye for now, Zagnenibi.
28
Taking Conversational Risks
by TL on July 17 2004, 10:30 AM
mbesuk onago ngi zhyamen . nseze wgi bye tot i jiman ewi nepamshkamen be gishek. wabek nuwi zhya men ibe eje shna mbesuk . ngi wabmak gode chemokmanuk demojgewat. wabek nin gezhe ni demojgeyan gnebech. bama mine mtego
I went to the lake yesterday. My brother brought a canoe so we could float around all day. Tomorrow we'll go there to the lake. I saw the white folks fishing. Tomorrow I'll fish too, maybe. So long for now, Mtego.
Re: onago egi zhejkeyak, by JN on July 17 2004, 8:12 PM
mbesek ndazhya ngom. Mbish ksenyak shode. Nedwendan ode Mbish gshatek. Megwa Nwinebyege ode bodewadmi kiktowenen bama. Megwetch Zagnenibi nin se
I should go to the lake today. The water is cold here. I wish the water were warm. I'll write more of this Potawatomi conversation later. Thanks, yours truly, Zagnenibi.
29
Factors in query room success
30
DOCS Project
  • Digital Online Curation Services for Endangered
    Language Archives
  • Many small language archives are beginning to
    digitize their materials
  • Lack technical infrastructure to bring resources
    online
  • Goal is to provide access through Rosetta

31
DOCS Project Archives
  • Endangered Language Fund (ELF)
  • Survey of California and Other Indian
    Languages (SCOIL)
  • The Alaska Native Language Center (ANLC)
  • Max-Planck Institute for Evolutionary
    Anthropology (Leipzig)

32
Wordlist Tool
  • Swadesh lists (100, 200, 207 terms) from:
  • Tryon's Comparative Austronesian Dictionary
    (rekeyed)
  • Tim Usher's Indo-Pacific database (2002 version)
  • Paul Whitehouse's Australian and New Guinea
    database (2002 version)
  • George Starostin's Dravidian database
  • Ilya Peiros' Mon Khmer database
  • Total of 1,384 languages, 3,090 lists online
  • Additional 3000 lists, up to 1850 terms per list,
    most 300-500 words in length.
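
As a rough illustration of the kind of data a wordlist tool works with, the sketch below groups wordlist entries by language and lines them up by Swadesh concept for side-by-side comparison. The entries are common English, German, and Spanish words chosen only for illustration; they are not taken from the databases listed above, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative entries only: (language, Swadesh concept, word form).
ENTRIES = [
    ("English", "water", "water"),
    ("German", "water", "Wasser"),
    ("Spanish", "water", "agua"),
    ("English", "hand", "hand"),
    ("German", "hand", "Hand"),
    ("Spanish", "hand", "mano"),
    ("English", "two", "two"),
    ("German", "two", "zwei"),
]


def build_wordlists(entries):
    """Group entries into {language: {concept: form}}."""
    lists = defaultdict(dict)
    for language, concept, form in entries:
        lists[language][concept] = form
    return dict(lists)


def compare(lists, concepts):
    """Return (concept, {language: form or None}) rows for a side-by-side view."""
    languages = sorted(lists)
    return [
        (concept, {lang: lists[lang].get(concept) for lang in languages})
        for concept in concepts
    ]


wordlists = build_wordlists(ENTRIES)
for concept, forms in compare(wordlists, ["water", "hand", "two"]):
    print(concept, forms)
```

A missing form shows up as `None`, which matters in practice because lists of different lengths (100, 200, or 207 terms) do not cover the same concepts.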

33
LangGator
  • A linguistic Wayback Machine
  • Language resource location and aggregation
  • Use alternate language names, spellings
  • Deutsch, Hochdeutsch, High German, Allemande
  • Fadicca, Fadicha, Fedija, Fadija, Fiadidja,
    Fiyadikkya, and Fedicca
  • Character identification (inventory,
    distribution)
  • Dera (Chadic, Nigeria)
  • Dera (Trans-New Guinea, Indonesia)
  • Seed crawler with Wordlist terms (see previous
    slide), weighted towards longer terms
  • Archiving through Internet Archive
  • Serve results through the Rosetta site
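
Two of the ideas on this slide, matching alternate language names and seeding a crawler with terms weighted towards longer ones, can be sketched as follows. This is a minimal sketch, not LangGator's implementation: the name variants are the ones listed on the slide, but using the first variant as the group label, the function names, and the sample terms (words taken from the query room posts quoted earlier) are all assumptions.

```python
import random

# Name groups from the slide; labelling each group by its first variant is an assumption.
NAME_GROUPS = {
    "Deutsch": ["Deutsch", "Hochdeutsch", "High German", "Allemande"],
    "Fadicca": ["Fadicca", "Fadicha", "Fedija", "Fadija", "Fiadidja",
                "Fiyadikkya", "Fedicca"],
    "Dera (Chadic, Nigeria)": ["Dera"],
    "Dera (Trans-New Guinea, Indonesia)": ["Dera"],
}


def build_index(groups):
    """Map each lowercased name variant to the set of group labels it may denote."""
    index = {}
    for label, variants in groups.items():
        for name in variants:
            index.setdefault(name.lower(), set()).add(label)
    return index


def resolve(index, name):
    """Return every candidate language for a name (sorted), or [] if unknown."""
    return sorted(index.get(name.strip().lower(), ()))


def pick_seed_terms(terms, k, rng):
    """Sample k seed terms (with replacement), weighted by term length."""
    weights = [len(term) for term in terms]
    return rng.choices(terms, weights=weights, k=k)


index = build_index(NAME_GROUPS)
print(resolve(index, "hochdeutsch"))  # one candidate
print(resolve(index, "Dera"))         # two candidates: the name alone is ambiguous

# Illustrative terms only; a real crawler would draw on the wordlists.
terms = ["jiman", "mbesuk", "nepamshkamen", "demojgeyan"]
print(pick_seed_terms(terms, k=3, rng=random.Random(0)))
```

A name such as Dera resolves to more than one language, which is where additional evidence such as character inventory and distribution would be needed; that step is left out of this sketch.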

34
Collaborations
  • Electronic Metastructure for Endangered Languages
    Data (E-MELD)
  • General Ontology for Linguistic Description
    (GOLD)
  • Open Language Archives Community (OLAC)

35
E-MELD
  • Electronic Metastructure for Endangered Languages
    Data
  • School of Best Practice
    http://emeld.org/school/index.html
  • Guidelines and examples for putting linguistic
    data into best practice digital formats
  • XML with XML Schema or DTD
  • Mapping terminology to ontology (GOLD)
  • FIELD lexical database tool
    http://emeld.org/tools/field/beta/
  • Online collaborative tool to build linguistic
    dictionaries, backed by ontology (GOLD)

36
GOLD
  • General Ontology for Linguistic Description
  • Built in OWL (Web Ontology Language), linked to
    SUMO (Suggested Upper Merged Ontology)
  • Best practice resources should include a mapping
    between the researcher's terms and a standard
    set, known as the profile
  • independent (mine) → main clause (GOLD)
  • obviative (mine) → fourth person (GOLD)
  • The standard terminology set can then allow
    sophisticated searches across disparate resources.
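
The idea of searching across resources through a shared terminology can be sketched as below. This is only an illustration under assumptions: the two mappings for the first resource are the ones on the slide, while the second resource, its "matrix clause" term, the annotation lists, and the function name are hypothetical. It does not use OWL or the actual GOLD ontology.

```python
# Each resource carries a profile: researcher's term -> standard (GOLD) term.
# resource_A's mappings come from the slide; resource_B is hypothetical.
RESOURCES = {
    "resource_A": {
        "profile": {"independent": "main clause", "obviative": "fourth person"},
        "annotations": ["independent", "obviative", "independent"],
    },
    "resource_B": {
        "profile": {"matrix clause": "main clause"},
        "annotations": ["matrix clause"],
    },
}


def find_by_standard_term(resources, standard_term):
    """Return {resource name: number of annotations mapping to standard_term}."""
    hits = {}
    for name, resource in resources.items():
        profile = resource["profile"]
        count = sum(
            1 for label in resource["annotations"]
            if profile.get(label) == standard_term
        )
        if count:
            hits[name] = count
    return hits


print(find_by_standard_term(RESOURCES, "main clause"))
```

A query for "main clause" finds material in both resources even though neither researcher used that label, which is the point of mapping local terms to a standard set.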

37
GOLD Community Model
38
OLAC
  • Open Language Archives Community
  • Set of 23 metadata elements and controlled
    vocabularies (based on Dublin Core)
  • Subject.language (language described, rather than
    audience language) uses SIL language codes
  • Type.linguistic (grammar, lexicon, text)
  • IMDI (ISLE Metadata Initiative) has 135 elements
  • Recommended extensions (Discourse Types,
    Linguistic Field, Participant roles)
  • Enables searches across a network of archives
    that use OLAC metadata
    http://www.language-archives.org/tools/search/
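
A search across archives that share metadata fields might look roughly like the sketch below. It is a simplified illustration, not the OLAC schema or search service: only two of the fields named on the slide are modelled, and the record class, archive names, titles, and language codes (placeholders, not real SIL codes) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A simplified record with two OLAC-style fields; not the full element set."""
    archive: str
    title: str
    subject_language: str  # language described, as a code (placeholder values here)
    linguistic_type: str   # "grammar", "lexicon", or "text"


# Hypothetical records from two hypothetical archives.
RECORDS = [
    Record("archive_1", "Sketch grammar", "qaa", "grammar"),
    Record("archive_2", "Wordlist", "qaa", "lexicon"),
    Record("archive_2", "Story collection", "qab", "text"),
]


def search(records, subject_language=None, linguistic_type=None):
    """Return the records matching every criterion that is not None."""
    results = []
    for rec in records:
        if subject_language is not None and rec.subject_language != subject_language:
            continue
        if linguistic_type is not None and rec.linguistic_type != linguistic_type:
            continue
        results.append(rec)
    return results


for rec in search(RECORDS, subject_language="qaa"):
    print(rec.archive, rec.title, rec.linguistic_type)
```

Because both archives describe the language with the same code in the same field, one query returns material from each of them.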

39
URLs
  • Electronic Metastructure for Endangered Languages
    Data (E-MELD) http://www.emeld.org (School of
    Best Practice, FIELD Tool)
  • Endangered Language Query Rooms
    http://rosettaproject.org:8080/emeldbase/
  • The Ethnologue http://www.ethnologue.com
  • General Ontology for Linguistic Description
    (GOLD) http://www.linguistics-ontology.org
  • ISLE MetaData Initiative (IMDI)
    http://www.mpi.nl/IMDI/
  • National Science Digital Library (NSDL)
    http://nsdl.org
  • Open Language Archives Community (OLAC)
    http://www.language-archives.org
  • The Rosetta Project,
    http://www.rosettaproject.org/live
    A preview of the new Web site (currently
    under construction) is available at
    http://preview.rosettaproject.org

40
Credits
  • This project is funded by the US National Science
    Digital Library (NSF 333727)