Title: The Rosetta Project ALL Language Archive
1 The Rosetta Project ALL Language Archive
Presented by Laura Buszard-Welcher
The Rosetta Project / University of California, Berkeley
- A Project of the Long Now Foundation
- A National Science Digital Library
- www.rosettaproject.org
2 Primary Goals
- Support the documentation of the world's nearly 7000 languages through building
  - A digital archive of language documentation
  - A linguistically sophisticated site that is also useful and interesting for the general public
  - Networks of speakers, educators, linguists
- Contributes to the effort to document endangered languages
- Promotes linguistic diversity by educating the public about languages with small numbers of speakers.
3 Secondary Goals
- Support metadata standardization and interoperability
  - OLAC
  - E-MELD
- Develop tools for collaborative linguistic research
  - Endangered Language Query Room
  - Wordlist Tool
  - Collaborative document editing/creation (new site)
4 Roles
- The Long Now Foundation
  - Parent organization of The Rosetta Project
  - Projects, seminars on topics that foster long-term thinking
- The National Science Digital Library
  - U.S. National Science Foundation Program
  - Goal is to bring online high quality STEM (Science, Technology, Engineering, and Math) resources for education
  - Sponsor of Rosetta Project (NSF 333727)
- Stanford University
  - Online and offline storage of Rosetta materials
5 The Long Now Foundation
6 The National Science Digital Library
7 Stanford University Libraries
8 Project History: The 1000 Language Archive
- Initiated by The Long Now Foundation
- Wanted to experiment with new microetching technology, looking for suitable content
- Decided to collect basic descriptive information for 1000 of the world's approximately 7000 languages
9 Why language information?
- Most natural human languages are products of millennia of human history (therefore a good long-term thinking project)
- Repositories of cultural information
- Languages showcase
  - Human intellectual sophistication
  - Cultural diversity
- To draw attention to the critical issue of language endangerment
10 The Rosetta Disk
- Next generation microfiche
- Micro-etched 2" nickel disk at densities of up to 200,000 page images per disk
- Developed by Los Alamos Laboratories and Norsam Technologies
- Reading the disk requires a microscope, either optical or electron, depending on the density of encoding
11 The Rosetta Stone
- Not us! (196 BC)
- Parallel text written in three scripts
  - Hieroglyphic
  - Demotic (script form)
  - Greek
- The key to deciphering Egyptian Hieroglyphs
12 Rosetta Stone Language Learning Software (Also not us!)
13 Design of the Disk
- Original design has human-eye readable text (Genesis text) and micro-etched text inside an index
- New design has human-eye readable text (instructions) on one side and microetched images on the reverse
14 In-House Scanning
- HP CapShare Scanners
  - Scan printed page in multiple passes, any direction
  - Page is assembled into one image
  - Stores about 50 pages at a time (300 dpi bitmap .tif)
  - Uploads numerically sequenced images to computer by infrared port
15 In-House Scanning
- Minolta PS 7000 Overhead
  - Bitmap and grayscale scans up to 600 dpi
  - Multiple sizes, orientations
  - Single page / double page spread (good for text collections with verso annotations)
  - Best for fragile books, manuscripts that would be damaged by hand scanning
16 Categories of Collection (1)
17 Categories of Collection (2)
18 Language Curation
19 Rosetta Project Web Site
- Welcome
- Search for a language
- Language overview page
- Browse (by name, family, country)
- Wordlist tool
20 Welcome
21 Search
22 Language Overview
23 Browse
24 Projects
- Endangered Language Query Rooms
- Digital Online Curation Services for Endangered Language Archives (DOCS)
- Wordlist Tool
- LangGator
25 Endangered Language Query Rooms
http://emeld.rosettaproject.org/
26 Query Room Virtual Keyboard
27 Potawatomi Query Room
Re: Bozho by Donald Perrot (host) on July 9 2004, 8:53 PM
Nmedagwe'ndan e'gi nebye'ge'yen ngom. Neaseno ndesh ne kas ge' nin, mine E'shkanabe' e'nda ge' nin.
I like what you have written. I am called Neaseno (Southwind) myself, and I live in Escanaba, MI.
Re: Bozho by Justin Neely on September 7 2004, 1:16 PM
Bozho Neaseno mine Lameen Zagnenibi ndeznekas. Nishnabe ndaw ipi Bodewadmi ndaw. Shi shi ban nee yek ndebendagwes. Zego ndotem. Kansas City,Mo ndoch bya. Eskanabe edayen ge nin. Bama pi ngom Zagnenibi ndeznekas
Hello Neaseno and Lameen my name is Zagnenibi. I'm Native and Potawatomi. I belong to the Citizen Band. I'm Crane Clan. I'm from Kansas City, Missouri. I also live in Escanaba. Bye for now, Zagnenibi.
28 Taking Conversational Risks
by TL on July 17 2004, 10:30 AM
mbesuk onago ngi zhyamen . nseze wgi bye tot i jiman ewi nepamshkamen be gishek. wabek nuwi zhya men ibe eje shna mbesuk . ngi wabmak gode chemokmanuk demojgewat. wabek nin gezhe ni demojgeyan gnebech. bama mine mtego
I went to the lake yesterday. My brother brought a canoe so we could float around all day. Tomorrow we'll go there to the lake. I saw the white folks fishing. Tomorrow I'll fish too, maybe. So long for now, Mtego.
Re: onago egi zhejkeyak by JN on July 17 2004, 8:12 PM
mbesek ndazhya ngom. Mbish ksenyak shode. Nedwendan ode Mbish gshatek. Megwa Nwinebyege ode bodewadmi kiktowenen bama. Megwetch Zagnenibi nin se
I should go to the lake today. The water is cold here. I wish the water were warm. I'll write more of this Potawatomi conversation later. Thanks, yours truly Zagnenibi.
29 Factors in query room success
30 DOCS Project
- Digital Online Curation Services for Endangered Language Archives
- Many small language archives are beginning to digitize their materials
- Lack technical infrastructure to bring resources online
- Goal is to provide access through Rosetta
31 DOCS Project Archives
- Endangered Language Fund (ELF)
- Survey of California and Other Indian Languages (SCOIL)
- The Alaska Native Language Center (ANLC)
- Max Planck Institute for Evolutionary Anthropology (Leipzig)
32 Wordlist Tool
- Swadesh lists (100, 200, 207 terms) from
  - Tryon's Comparative Austronesian Dictionary (rekeyed)
  - Tim Usher's Indo-Pacific database (2002 version)
  - Paul Whitehouse's Australian and New Guinea database (2002 version)
  - George Starostin's Dravidian database
  - Ilya Peiros' Mon Khmer database
- Total of 1,384 languages, 3,090 lists online
- Additional 3000 lists, up to 1850 terms per list, most 300-500 words in length.
33 LangGator
- A linguistic Wayback Machine
- Language resource location and aggregation
- Use alternate language names, spellings
  - Deutsch, Hochdeutsch, High German, Allemande
  - Fadicca, Fadicha, Fedija, Fadija, Fiadidja, Fiyadikkya, and Fedicca
- Character identification (inventory, distribution)
  - Dera (Chadic, Nigeria)
  - Dera (Trans-New Guinea, Indonesia)
- Seed crawler with Wordlist terms (see previous slide), weighted towards longer terms
- Archiving through Internet Archive
- Serve results through the Rosetta site
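The name handling and seed weighting listed on this slide can be illustrated with a minimal Python sketch. It is not LangGator's implementation: the language labels, the function names, and the length-proportional weighting scheme are all assumptions made for illustration. Only the alternate names and the two Dera entries come from the slide. Note that a name such as Dera resolves to more than one language, which is why a further signal (for example, character inventory) would be needed to tell them apart.

```python
import random
from collections import defaultdict

# Names come from the slide; the labels on the left are hypothetical
# identifiers invented for this sketch.
LANGUAGES = [
    ("german", ["Deutsch", "Hochdeutsch", "High German", "Allemande"]),
    ("fadicca", ["Fadicca", "Fadicha", "Fedija", "Fadija",
                 "Fiadidja", "Fiyadikkya", "Fedicca"]),
    ("dera-chadic-nigeria", ["Dera"]),
    ("dera-trans-new-guinea-indonesia", ["Dera"]),
]


def build_name_index(languages):
    """Map each case-folded name or spelling to the set of labels using it."""
    index = defaultdict(set)
    for label, names in languages:
        for name in names:
            index[name.casefold()].add(label)
    return index


def candidates(name, index):
    """Return every language label that a name could refer to, sorted."""
    return sorted(index.get(name.strip().casefold(), set()))


def seed_weights(terms):
    """Weight each term by its length, normalized so the weights sum to 1.

    Assumes a non-empty list of distinct, non-empty terms.
    """
    total = sum(len(term) for term in terms)
    return {term: len(term) / total for term in terms}


def pick_seed_terms(terms, k, rng):
    """Draw k seed terms (with replacement), favoring longer terms."""
    weights = seed_weights(terms)
    return rng.choices(terms, weights=[weights[t] for t in terms], k=k)


index = build_name_index(LANGUAGES)
print(candidates("hochdeutsch", index))  # one candidate
print(candidates("Dera", index))         # two candidates: ambiguous name
print(pick_seed_terms(["ab", "abcdef"], k=3, rng=random.Random(0)))
```

Returning a list of candidates rather than a single language keeps the ambiguity visible to whatever step runs next.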
34 Collaborations
- Electronic Metastructure for Endangered Languages Data (E-MELD)
- General Ontology for Linguistic Description (GOLD)
- Open Language Archives Community (OLAC)
35 E-MELD
- Electronic Metastructure for Endangered Languages Data
- School of Best Practice http://emeld.org/school/index.html
  - Guidelines and examples for putting linguistic data into best practice digital formats
  - XML with XML Schema or DTD
  - Mapping terminology to ontology (GOLD)
- FIELD lexical database tool http://emeld.org/tools/field/beta/
  - Online collaborative tool to build linguistic dictionaries, backed by ontology (GOLD)
36 GOLD
- General Ontology for Linguistic Description
- Built in OWL (Web Ontology Language), linked to SUMO (Suggested Upper Merged Ontology)
- Best practice resources should include a mapping between the researcher's terms and a standard set, known as the profile
  - independent (mine) -> main clause (GOLD)
  - obviative (mine) -> fourth person (GOLD)
- The standard terminology set can then allow sophisticated searches across disparate resources.
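A minimal Python sketch of how a profile might enable a search across resources that use different terminology. It is an illustration only: the resource names, the second profile, the annotations, and both functions are hypothetical, and the standard terms are the two labels shown on the slide rather than identifiers taken from the GOLD ontology itself.

```python
# Hypothetical resources. The first profile uses the two mappings shown on
# the slide; the second resource and all annotations are invented.
RESOURCES = {
    "resource_a": {
        "profile": {"independent": "main clause", "obviative": "fourth person"},
        "annotations": ["independent", "obviative", "independent"],
    },
    "resource_b": {
        "profile": {"4th": "fourth person"},
        "annotations": ["4th", "plural"],
    },
}


def to_standard(term, profile):
    """Translate a researcher's term to the standard set; None if unmapped."""
    return profile.get(term)


def search(resources, standard_term):
    """Find, per resource, the local terms that map to a standard term."""
    hits = {}
    for name, resource in resources.items():
        matched = sorted({
            term for term in resource["annotations"]
            if to_standard(term, resource["profile"]) == standard_term
        })
        if matched:
            hits[name] = matched
    return hits


print(search(RESOURCES, "fourth person"))
# {'resource_a': ['obviative'], 'resource_b': ['4th']}
print(search(RESOURCES, "main clause"))
# {'resource_a': ['independent']}
```

A single query for "fourth person" finds material in both resources even though neither researcher used that label, which is the point of mapping local terms to a shared set.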
37 GOLD Community Model
38 OLAC
- Open Language Archives Community
- Set of 23 metadata elements and controlled vocabularies (based on Dublin Core)
  - Subject.language (language described, rather than audience language) uses SIL language codes
  - Type.linguistic (grammar, lexicon, text)
- IMDI (ISLE Metadata Initiative) has 135 elements
- Recommended extensions (Discourse Types, Linguistic Field, Participant roles)
- Enables searches across a network of archives that use OLAC metadata http://www.language-archives.org/tools/search/
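A minimal Python sketch of why shared elements and a controlled vocabulary make cross-archive search simple. This is not the OLAC schema or its search service: the `Record` class, its field names, the archive names, and the placeholder language codes are hypothetical, and treating the three Type.linguistic values named on the slide as the complete vocabulary is an assumption.

```python
from dataclasses import dataclass

# The three Type.linguistic values named on the slide; treating them as the
# complete vocabulary is an assumption of this sketch.
LINGUISTIC_TYPES = {"grammar", "lexicon", "text"}


@dataclass(frozen=True)
class Record:
    archive: str
    title: str
    subject_language: str  # stands in for Subject.language (a language code)
    type_linguistic: str   # stands in for Type.linguistic

    def __post_init__(self):
        # Reject values outside the controlled vocabulary.
        if self.type_linguistic not in LINGUISTIC_TYPES:
            raise ValueError(f"unknown linguistic type: {self.type_linguistic!r}")


def search(records, language=None, linguistic_type=None):
    """Return records matching the given language code and/or type."""
    return [
        r for r in records
        if (language is None or r.subject_language == language)
        and (linguistic_type is None or r.type_linguistic == linguistic_type)
    ]


# Invented records; "xxa" and "xxb" are placeholder codes, not real ones.
RECORDS = [
    Record("archive_1", "Sketch grammar", "xxa", "grammar"),
    Record("archive_2", "Field wordlist", "xxa", "lexicon"),
    Record("archive_2", "Narrative texts", "xxb", "text"),
]

for record in search(RECORDS, language="xxa"):
    print(record.archive, record.title)
```

Because every archive describes the language by code rather than by name, the same query works regardless of which spelling or alternate name each archive prefers.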
39 URLs
- Electronic Metastructure for Endangered Languages Data (E-MELD) http://www.emeld.org (School of Best Practice, FIELD Tool).
- Endangered Language Query Rooms http://rosettaproject.org:8080/emeldbase/.
- The Ethnologue http://www.ethnologue.com.
- General Ontology for Linguistic Description (GOLD) http://www.linguistics-ontology.org.
- ISLE MetaData Initiative (IMDI) http://www.mpi.nl/IMDI/.
- National Science Digital Library (NSDL) http://nsdl.org
- Open Language Archives Community (OLAC) http://www.language-archives.org.
- The Rosetta Project, http://www.rosettaproject.org/live. A preview of the new Web site (currently under construction) is available at http://preview.rosettaproject.org.
40 Credits
- This project is funded by the US National Science Digital Library (NSF 333727)