Metadata Extraction - PowerPoint PPT Presentation

About This Presentation
Title:

Metadata Extraction

Description:

Since 2000, 20 thematic, event-based collections. 100 TB of data collected. 12,500 URLs ... One record per recommended URL for each distinct collection ...

Slides: 15
Provided by: gjon

Transcript and Presenter's Notes


1
Metadata Extraction and Web Archives: Automating
the Record Creation Process
  • Abbie Grotke / abgr@loc.gov
  • Gina Jones / gjon@loc.gov
  • Library of Congress
  • Office of Strategic Initiatives
  • Web Capture Team

2
Library of Congress Web Archives
  • Since 2000, 20 thematic, event-based collections
  • 100 TB of data collected
  • 12,500 URLs

http://www.loc.gov/lcwa
3
Web Archiving Tools
  • Crawling
      • Heritrix
      • WARC
  • Access
      • Wayback Machine
      • NutchWAX
  • International Internet Preservation Consortium
      • netpreserve.org

4
LC's Web Archive Workflow
  • Identify and select URLs (LS or LAW)
  • Determine crawl strategy, create a seed list for
    crawling (OSI)
  • Sites harvested by Internet Archive or in-house
    crawlers (OSI)
  • Quality Review (OSI curators)
  • Create catalogers' list (OSI) and XML MODS
    template (LS) for metadata extraction

5
Describing the Archives
  • Collection-level MARC record in OPAC
  • Item-level MODS records in LCWA
  • One record per recommended URL for each distinct
    collection
  • With so many thousands of URLs to process, how do
    we streamline the process?

6
XML MODS Template
7
Metadata Extraction
  • For each URL that will be cataloged:
      • Get archived web site metadata
      • Combine with URL Nominations Database metadata
      • If elections/campaign web site, metadata also
        pulled from our candidate Access database (used
        to create subject terms)
      • Using XML template, we add collection and record
        level metadata
  • Create a single file for delivery
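
The steps above can be sketched in Python as a rough illustration only. The slides do not show the actual template or scripts, so the dictionary keys (`url`, `access_rights`, `languages`, `subject_terms`, `title`, `abstract`, `first_capture`, `last_capture`) and the function names are hypothetical. The MODS element names and namespace are standard MODS; which of them the LC template actually uses is an assumption.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Return the namespace-qualified form of a MODS tag name."""
    return f"{{{MODS_NS}}}{tag}"


def build_record(nomination, archived, subjects=()):
    """Merge the data sources for one URL into a single MODS element.

    nomination: dict from the nominations database (hypothetical keys)
    archived:   dict of metadata taken from the archived site (hypothetical keys)
    subjects:   extra subject terms, e.g. derived from candidate data
    """
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = archived["title"]
    if archived.get("abstract"):
        ET.SubElement(mods, q("abstract")).text = archived["abstract"]
    location = ET.SubElement(mods, q("location"))
    ET.SubElement(location, q("url")).text = nomination["url"]
    for code in nomination.get("languages", []):
        language = ET.SubElement(mods, q("language"))
        ET.SubElement(language, q("languageTerm")).text = code
    ET.SubElement(mods, q("accessCondition")).text = nomination["access_rights"]
    origin = ET.SubElement(mods, q("originInfo"))
    for point, key in (("start", "first_capture"), ("end", "last_capture")):
        date = ET.SubElement(origin, q("dateCaptured"), point=point)
        date.text = archived[key]
    for term in list(nomination.get("subject_terms", [])) + list(subjects):
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = term
    return mods


def build_delivery_file(records):
    """Wrap all records in one modsCollection, giving a single XML string."""
    collection = ET.Element(q("modsCollection"))
    collection.extend(records)
    return ET.tostring(collection, encoding="unicode")
```

Building each record as an element tree and serializing once at the end keeps the "single file for delivery" step trivial: every per-URL record is appended to one `modsCollection` root.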

8
Data Sources for Metadata Extraction
9
URL Nominations Database
  • URL
  • Access Rights
  • Language(s)
  • Category
  • Subject Terms

10
Election Candidate Metadata
  • Name
  • URL
  • Party Affiliation
  • State
  • Race
  • District (House)

11
Archived Web Site Metadata
  • From 1st capture
      • Document Title
      • Keywords
      • Abstract
      • MIME Types
  • From Wayback index
      • Capture Dates (First and Last)
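
A minimal Python sketch of this kind of extraction follows, using only the standard library. It assumes the title comes from the HTML `<title>` element, keywords from `<meta name="keywords">`, and the abstract from `<meta name="description">`; the slides do not say which tags were actually used. It also assumes capture times are available as 14-digit Wayback-style timestamps (YYYYMMDDhhmmss). MIME types are not covered, and the class and function names are hypothetical.

```python
from datetime import datetime
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and named <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Meta names are matched case-insensitively.
            self.meta[attrs["name"].lower()] = attrs["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_metadata(html):
    """Return title, keywords, and abstract from the first capture's HTML."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    keywords = parser.meta.get("keywords", "")
    return {
        "title": " ".join(parser.title.split()),  # collapse whitespace
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        "abstract": parser.meta.get("description", ""),
    }


def capture_date_range(timestamps):
    """Return (first, last) capture dates as ISO strings."""
    parsed = sorted(datetime.strptime(t, "%Y%m%d%H%M%S") for t in timestamps)
    return parsed[0].date().isoformat(), parsed[-1].date().isoformat()
```

Pages without a keywords or description tag simply yield an empty list or empty string, which leaves the corresponding template field for a cataloger to fill in.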

12
Combined Data in Template
13
Combined Data in Template
14
Combined Data in Template