The Rosetta Project ALL Language Archive

Slides: 41

Provided by: laurab172

Transcript and Presenter's Notes

Title: The Rosetta Project ALL Language Archive

1
The Rosetta Project: ALL Language Archive
Presented by Laura Buszard-Welcher
The Rosetta Project / University of California, Berkeley

A Project of the Long Now Foundation
A National Science Digital Library
www.rosettaproject.org

2
Primary Goals

Support the documentation of the world's nearly
7000 languages through building
A digital archive of language documentation
A linguistically sophisticated site that is also
useful and interesting for the general public
Networks of speakers, educators, linguists
Contributes to the effort to document endangered
languages
Promotes linguistic diversity by educating the
public about languages with small numbers of
speakers.

3
Secondary Goals

Support metadata standardization and
interoperability
OLAC
EMELD
Develop tools for collaborative linguistic
research
Endangered Language Query Room
Wordlist Tool
Collaborative document editing/creation (new site)

4
Roles

The Long Now Foundation
Parent organization of The Rosetta Project
Projects, seminars on topics that foster
long-term thinking
The National Science Digital Library
U.S. National Science Foundation Program
Goal is to bring online high quality STEM
(Science, Technology, Engineering, and Math)
resources for education
Sponsor of Rosetta Project (NSF 333727)
Stanford University
Online and offline storage of Rosetta materials

5
The Long Now Foundation
6
The National Science Digital Library
7
Stanford University Libraries
8
Project History: The 1000 Language Archive

Initiated by The Long Now Foundation
Wanted to experiment with new microetching
technology, looking for suitable content
Decided to collect basic descriptive information
for 1000 of the world's approximately 7000
languages

9
Why language information?

Most natural human languages are products of
millennia of human history (therefore a good
long-term thinking project)
Repositories of cultural information
Languages showcase
Human intellectual sophistication
Cultural diversity
To draw attention to the critical issue of
language endangerment

10
The Rosetta Disk

Next generation microfiche
Micro-etched 2" nickel disk at densities of up to
200,000 page images per disk
Developed by Los Alamos Laboratories and Norsam
Technologies
Reading the disk requires a microscope, either
optical or electron, depending on the density of
encoding

11
The Rosetta Stone

Not us! (196 BC)
Parallel text written in three scripts
Hieroglyphic
Demotic (script form)
Greek
The key to deciphering Egyptian Hieroglyphs

12
Rosetta Stone Language Learning Software
(Also not us!)
13
Design of the Disk

Original design has human-eye readable text
(Genesis text) and micro-etched text inside an
index
New design has human-eye readable text
(instructions) on one side and microetched images
on the reverse

14
In-House Scanning

HP CapShare Scanners
Scan printed page in multiple passes, any
direction
Page is assembled into one image
Stores about 50 pages at a time (300 dpi bitmap
.tif)
Uploads numerically sequenced images to computer
by infrared port

15
In-House Scanning

Minolta PS 7000 Overhead
Bitmap and grayscale scans up to 600 dpi
Multiple sizes, orientations
Single page / double page spread (good for text
collections with verso annotations)
Best for fragile books, manuscripts that would be
damaged by hand scanning

16
Categories of Collection (1)
17
Categories of Collection (2)
18
Language Curation
19
Rosetta Project Web Site

Welcome
Search for a language
Language overview page
Browse (by name, family, country)
Wordlist tool

20
Welcome
21
Search
22
Language Overview
23
Browse
24
Projects

Endangered Language Query Rooms
Digital Online Curation Services for Endangered
Language Archives (DOCS)
Wordlist Tool
LangGator

25
Endangered Language Query Rooms
http://emeld.rosettaproject.org/
26
Query Room Virtual Keyboard
27
Potawatomi Query Room
Re: Bozho by Donald Perrot (host) on July 9 2004, 8:53 PM
Nmedagwe'ndan e'gi nebye'ge'yen ngom.
Neaseno ndesh ne kas ge' nin, mine E'shkanabe'
e'nda ge' nin.
I like what you have written. I am called Neaseno
(Southwind) myself, and I live in Escanaba, MI.

Re: Bozho by Justin Neely on September 7 2004, 1:16 PM
Bozho Neaseno mine Lameen Zagnenibi ndeznekas.
Nishnabe ndaw ipi Bodewadmi ndaw. Shi shi ban nee
yek ndebendagwes. Zego ndotem. Kansas City,Mo
ndoch bya. Eskanabe edayen ge nin. Bama pi ngom
Zagnenibi ndeznekas
Hello Neaseno and Lameen my name is Zagnenibi.
I'm Native and Potawatomi. I belong to the
Citizen Band. I'm Crane Clan. I'm from Kansas
City, Missouri. I also live in Escanaba. Bye for
now, Zagnenibi.
28
Taking Conversational Risks
by TL on July 17 2004, 10:30 AM
mbesuk onago ngi zhyamen . nseze wgi bye tot i
jiman ewi nepamshkamen be gishek. wabek nuwi zhya
men ibe eje shna mbesuk . ngi wabmak gode
chemokmanuk demojgewat. wabek nin gezhe ni
demojgeyan gnebech. bama mine mtego
I went to the lake yesterday. My brother brought
a canoe so we could float around all day.
Tomorrow we'll go there to the lake. I saw the
white folks fishing. Tomorrow I'll fish too,
maybe. So long for now, Mtego.

Re: onago egi zhejkeyak by JN on July 17 2004, 8:12 PM
mbesek ndazhya ngom. Mbish ksenyak shode.
Nedwendan ode Mbish gshatek. Megwa Nwinebyege ode
bodewadmi kiktowenen bama. Megwetch Zagnenibi nin
se
I should go to the lake today. The water is cold
here. I wish the water were warm. I'll write more
of this Potawatomi conversation later. Thanks,
yours truly Zagnenibi.
29
Factors in query room success
30
DOCS Project

Digital Online Curation Services for Endangered
Language Archives
Many small language archives are beginning to
digitize their materials
Lack technical infrastructure to bring resources
online
Goal is to provide access through Rosetta

31
DOCS Project Archives

Endangered Language Fund (ELF)
Survey of California and Other Indian
Languages (SCOIL)
The Alaska Native Language Center (ANLC)
Max-Planck Institute for Evolutionary
Anthropology (Leipzig)

32
Wordlist Tool

Swadesh lists (100, 200, 207 terms) from
Tryon's Comparative Austronesian Dictionary
(rekeyed)
Tim Usher's Indo-Pacific database (2002 version)
Paul Whitehouse's Australian and New Guinea
database (2002 version)
George Starostin's Dravidian database
Ilya Peiros' Mon Khmer database
Total of 1,384 languages, 3,090 lists online
Additional 3000 lists, up to 1850 terms per list,
most 300-500 words in length.

33
LangGator

A linguistic Wayback Machine
Language resource location and aggregation
Use alternate language names, spellings
Deutsch, Hochdeutsch, High German, Allemande
Fadicca, Fadicha, Fedija, Fadija, Fiadidja,
Fiyadikkya, and Fedicca
Character identification (inventory,
distribution)
Dera (Chadic, Nigeria)
Dera (Trans-New Guinea, Indonesia)
Seed crawler with Wordlist terms (see previous
slide), weighted towards longer terms
Archiving through Internet Archive
Serve results through the Rosetta site
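
The name handling and seed weighting listed above can be sketched roughly as follows. This is a minimal illustration, not LangGator's actual code: the alias table layout, the function names, the family and country given for German, and the seed terms are all assumptions made for the example. A name such as "Dera" resolves to two candidates, which is where character identification would be needed to tell them apart.

```python
from collections import defaultdict

# Hypothetical alias table; keys are (name, family, country).
# The German family/country values are added here for illustration only.
ALIASES = {
    ("German", "Indo-European", "Germany"): [
        "Deutsch", "Hochdeutsch", "High German", "Allemande",
    ],
    ("Dera", "Chadic", "Nigeria"): [],
    ("Dera", "Trans-New Guinea", "Indonesia"): [],
}


def build_index(aliases):
    """Map every lowercased name or alternate name to the languages using it."""
    index = defaultdict(list)
    for key, alternates in aliases.items():
        for name in {key[0], *alternates}:
            index[name.lower()].append(key)
    return index


def lookup(index, name):
    """Return all candidate languages for a name (several means ambiguous)."""
    return index.get(name.lower(), [])


def seed_weights(terms):
    """Weight crawler seed terms in proportion to their length."""
    total = sum(len(t) for t in terms)
    return {t: len(t) / total for t in terms}


index = build_index(ALIASES)
print(lookup(index, "Hochdeutsch"))   # one candidate
print(lookup(index, "Dera"))          # two candidates: needs disambiguation
print(seed_weights(["ngom", "ndeznekas"]))  # illustrative seed terms
```

Weighting by length is one simple reading of "weighted towards longer terms": longer strings are less likely to match unrelated pages by accident.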

34
Collaborations

Electronic Metastructure for Endangered Languages
Data (E-MELD)
General Ontology for Linguistic Description
(GOLD)
Open Language Archives Community (OLAC)

35
E-MELD

Electronic Metastructure for Endangered Languages
Data
School of Best Practice
http://emeld.org/school/index.html
Guidelines and examples for putting linguistic
data into best practice digital formats
XML with XML Schema or DTD
Mapping terminology to ontology (GOLD)
FIELD lexical database tool
http://emeld.org/tools/field/beta/
Online collaborative tool to build linguistic
dictionaries, backed by ontology (GOLD)

36
GOLD

General Ontology for Linguistic Description
Built in OWL (Web Ontology Language), linked to
SUMO (Suggested Upper Merged Ontology)
Best practice resources should include a mapping
between the researcher's terms and a standard
set, known as the profile
independent (mine) = main clause (GOLD)
obviative (mine) = fourth person (GOLD)
The standard terminology set can then allow
sophisticated searches across disparate resources.
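
A rough sketch of how a profile could support search across resources that use different terminology. The two profile entries for the first resource come from the slide; the second resource, its "matrix clause" term, the example identifiers, and the function name are hypothetical, and GOLD itself is an OWL ontology rather than a Python dictionary.

```python
# Each resource carries a profile: the researcher's term -> GOLD term.
RESOURCES = [
    {
        "name": "resource_a",  # hypothetical; profile terms taken from the slide
        "profile": {"independent": "main clause", "obviative": "fourth person"},
        "annotations": [("ex1", "independent"), ("ex2", "obviative")],
    },
    {
        "name": "resource_b",  # hypothetical second resource with its own terms
        "profile": {"matrix clause": "main clause"},
        "annotations": [("ex7", "matrix clause")],
    },
]


def search_by_gold(resources, gold_term):
    """Find annotated examples whose local term maps to the given GOLD term."""
    hits = []
    for res in resources:
        for example_id, local_term in res["annotations"]:
            if res["profile"].get(local_term) == gold_term:
                hits.append((res["name"], example_id, local_term))
    return hits


print(search_by_gold(RESOURCES, "main clause"))
```

A single query for "main clause" reaches both resources even though neither uses that term in its own annotations.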

37
GOLD Community Model
38
OLAC

Open Language Archives Community
Set of 23 metadata elements and controlled
vocabularies (based on Dublin Core)
Subject.language (language described, rather than
audience language) uses SIL language codes
Type.linguistic (grammar, lexicon, text)
IMDI (ISLE Metadata Initiative) has 135 elements
Recommended extensions (Discourse Types,
Linguistic Field, Participant roles)
Enables searches across a network of archives
that use OLAC metadata
http://www.language-archives.org/tools/search/
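
To illustrate why shared elements and controlled vocabularies enable cross-archive search, here is a minimal sketch using plain dictionaries. It is not the OLAC XML format or its harvesting protocol: the archive names, titles, and language codes are placeholders, and only the two element names and the three Type.linguistic values come from the slide.

```python
# Hypothetical records from two archives, described with the same elements.
ARCHIVES = {
    "archive_a": [
        {"title": "Sample grammar", "Subject.language": "POT",
         "Type.linguistic": "grammar"},
        {"title": "Sample wordlist", "Subject.language": "POT",
         "Type.linguistic": "lexicon"},
    ],
    "archive_b": [
        {"title": "Sample texts", "Subject.language": "POT",
         "Type.linguistic": "text"},
        {"title": "Another lexicon", "Subject.language": "XXX",
         "Type.linguistic": "lexicon"},
    ],
}


def search(archives, criteria):
    """Return (archive, title) pairs for records matching every criterion."""
    return [
        (archive, record["title"])
        for archive, records in archives.items()
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# Everything describing one language, regardless of which archive holds it.
print(search(ARCHIVES, {"Subject.language": "POT"}))
# Narrowed by resource type using the controlled vocabulary.
print(search(ARCHIVES, {"Subject.language": "POT", "Type.linguistic": "lexicon"}))
```

Because every archive uses the same element names and the same code for a language, one query works everywhere without per-archive translation.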

39
URLs

Electronic Metastructure for Endangered Languages
Data (E-MELD): http://www.emeld.org (School of
Best Practice, FIELD Tool).
Endangered Language Query Rooms:
http://rosettaproject.org:8080/emeldbase/.
The Ethnologue: http://www.ethnologue.com.
General Ontology for Linguistic Description
(GOLD): http://www.linguistics-ontology.org.
ISLE MetaData Initiative (IMDI):
http://www.mpi.nl/IMDI/.
National Science Digital Library (NSDL):
http://nsdl.org
Open Language Archives Community (OLAC):
http://www.language-archives.org.
The Rosetta Project: http://www.rosettaproject.org/live.
A preview of the new Web site (currently
under construction) is available at
http://preview.rosettaproject.org.