Mining Gazetteer Data from Digital Library Collections

1
Mining Gazetteer Data from Digital Library Collections
  • David Smith
  • Perseus Project
  • Tufts University

2
Corpus Preview

3
Preview 1400-1600

4
What DLs can do for gazetteers
  • Directly manage gazetteers
  • Raw materials for gazetteers
  • Reference works
  • Monolingual and parallel corpora
  • Testbeds for improving these technologies
  • E.g., alignment helps name tagging, and name tagging helps alignment

5
Lexicographical parallels
  • Original slipping process
  • First, get a madman ...
  • Creation of Brown and other corpora
  • Kucera and Francis
  • Cobuild dictionary and friends
  • But names get no respect in lexicography (McDonald, 1996)

6
Cultural dependencies

7
Toponym Results

8
Projection principles
  • Exploits asymmetry in human language technologies (Yarowsky, HLT 2001)
  • English, French, Chinese, Czech (!) have:
      • POS taggers, morphological analyzers
      • Named entity identifiers
      • Parsers and bracketers
  • Parallel corpus alignment allows projection of these resources
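
One way to picture projection across an alignment is the minimal Python sketch below. It assumes a word-level alignment is already available as (source index, target index) pairs. The function name `project_tags`, the `PLACE` tag, and the English/French toy sentence are illustrative assumptions, not taken from the slides or from any particular system.

```python
from typing import List, Tuple


def project_tags(source_tags: List[str],
                 alignment: List[Tuple[int, int]],
                 target_len: int,
                 default: str = "O") -> List[str]:
    """Copy each source token's tag onto the target tokens aligned with it.

    Hypothetical helper: `alignment` holds (source_index, target_index) pairs.
    """
    projected = [default] * target_len
    for src_idx, tgt_idx in alignment:
        tag = source_tags[src_idx]
        # If several source tokens align to one target token,
        # keep the first non-default tag that arrives.
        if tag != default and projected[tgt_idx] == default:
            projected[tgt_idx] = tag
    return projected


# Toy data (assumed): the English side is tagged, the French side is not.
source_tokens = ["Athens", "is", "a", "city"]
source_tags = ["PLACE", "O", "O", "O"]
target_tokens = ["Athènes", "est", "une", "ville"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]

projected = project_tags(source_tags, alignment, len(target_tokens))
print(list(zip(target_tokens, projected)))
```

Target tokens with no aligned source token simply keep the default tag in this sketch.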

9
Projection principles

10
Projection on the cheap
  • Align texts at coarse structural level
  • Geocode source text (English)
  • Optionally winnow target text (e.g. non-capitalized words where applicable)
  • Calculate mutual information (Church and Hanks, 1990)
  • Transliteration may be too ad hoc
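
The steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the Perseus implementation: chunks are assumed to be already aligned and the English side already geocoded, target tokens are given in Latin transliteration, probabilities are estimated from chunk-level co-occurrence counts, and the names `winnow`, `pmi_table`, and `best_matches` are hypothetical. The score is pointwise mutual information, log2(P(x,y) / (P(x)P(y))).

```python
import math
from collections import Counter
from itertools import product
from typing import Dict, List, Set, Tuple

Chunk = Tuple[Set[str], List[str]]  # (English toponyms, target-language tokens)


def winnow(tokens: List[str]) -> Set[str]:
    """Keep only capitalized tokens as candidate target-language names."""
    return {t for t in tokens if t[:1].isupper()}


def pmi_table(chunks: List[Chunk]) -> Dict[Tuple[str, str], float]:
    """Pointwise mutual information for every co-occurring (toponym, candidate) pair."""
    n = len(chunks)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for toponyms, target_tokens in chunks:
        candidates = winnow(target_tokens)
        src_count.update(toponyms)
        tgt_count.update(candidates)
        pair_count.update(product(toponyms, candidates))
    # log2( (c/n) / ((cs/n) * (ct/n)) ) simplifies to log2( c*n / (cs*ct) )
    return {
        (s, t): math.log2(c * n / (src_count[s] * tgt_count[t]))
        for (s, t), c in pair_count.items()
    }


def best_matches(scores: Dict[Tuple[str, str], float]) -> Dict[str, Tuple[str, float]]:
    """For each source toponym, pick the candidate with the highest score."""
    best: Dict[str, Tuple[str, float]] = {}
    for (s, t), score in scores.items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


# Toy aligned chunks (assumed data, transliterated for readability).
chunks: List[Chunk] = [
    ({"Athens", "Sparta"}, ["Athenai", "kai", "Sparte"]),
    ({"Athens"}, ["en", "Athenai"]),
    ({"Sparta"}, ["Sparte", "de"]),
    ({"Thebes"}, ["Thebai", "men"]),
]

scores = pmi_table(chunks)
matches = best_matches(scores)
for toponym, (candidate, score) in sorted(matches.items()):
    print(toponym, candidate, round(score, 2))
```

Counting at the chunk level keeps the sketch compatible with coarse structural alignment, since no word-level alignment is needed.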

11
Preliminary results
  • Greek/English testbed
  • 98% precision
  • 70.8% recall (Why?)
  • Ethnic designations present interesting problems
  • Stephanus of Byzantium
  • Morphology outside of English
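
For reference, precision and recall over extracted name pairs can be computed as in the short sketch below. The gold and predicted pairs are made-up toy data chosen only to show the arithmetic; they are not the testbed behind the figures on this slide, and the function name `precision_recall` is an assumption.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (English toponym, target-language form)


def precision_recall(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Precision = correct / predicted; recall = correct / gold."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data (assumed): one gold pair is never proposed by the extractor.
gold = {
    ("Athens", "Athenai"),
    ("Sparta", "Sparte"),
    ("Thebes", "Thebai"),
    ("Corinth", "Korinthos"),
}
predicted = {
    ("Athens", "Athenai"),
    ("Sparta", "Sparte"),
    ("Thebes", "Thebai"),
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A high-precision, lower-recall outcome like this toy one arises when every proposed pair is right but some gold pairs are never proposed.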

12
Proposals
  • Preservation of gazetteer source materials
  • DLs as home for gazetteer slips
  • Parallel texts as key resource
  • (also cf. Berkeley TIDES work)
  • Persistent documents as training sets for automatic methods
  • http://www.perseus.tufts.edu