BRITISH NEWSPAPERS 1800-1900


1
BRITISH NEWSPAPERS 1800-1900
How to Access the Content: Nineteenth Century Newspapers
  • Jane Shaw
  • Project Manager

2
Overview
  • 46 titles
  • 2 million pages
  • ? articles
  • Variations in quality
  • Variations in structure
    • Daily vs weekly
    • Size
    • Layout

3
Newspaper Titles
  • Brighton Patriot
  • The Champion
  • Charter
  • Chartist
  • Chartist Circular
  • Cobbett's Weekly Political Register
  • London Dispatch
  • Northern Liberator
  • Northern Star
  • Odd Fellow
  • Operative
  • Poor Man's Guardian
  • Southern Star
  • The Examiner
  • The Graphic
  • Morning Chronicle
  • Trewman's Exeter Flying Post
  • Jackson's Oxford Journal
  • Newcastle Courant
  • Pall Mall Gazette
  • Leeds Mercury
  • Belfast Newsletter
  • Glasgow Herald
  • Aberdeen Journal
  • Baner
  • Bristol Mercury
  • Caledonian Mercury
  • Daily News
  • Derby Mercury
  • Freeman's Journal
  • Genedl
  • Goleuad
  • Hampshire/Portsmouth Telegraph
  • Hull Packet
  • Ipswich Journal
  • Liverpool Mercury
  • Lloyd's Illustrated Newspaper
  • Manchester Times

4
Three ways to access information
  • By:
    • Metadata: title, place of publication, dates of
      publication, issue number, number of pages, page
      quality rating, illustration indicator
    • Browsing: article images, page images, browse by
      issue or title
    • OCR: actual text of page as rendered by
      automatic OCR process
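
As a rough illustration of the metadata route, the elements listed above could be held in a record like the one below. This is a sketch only: the class name, field names, types, the quality scale and the sample records are all assumptions made for the example, not the project's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class IssueMetadata:
    """Hypothetical issue-level record; names and types are assumptions."""
    title: str
    place_of_publication: str
    publication_date: date
    issue_number: int
    number_of_pages: int
    page_quality_ratings: List[int]  # one rating per page; scale is assumed
    has_illustrations: bool          # yes/no illustration indicator


def find_issues(issues: List[IssueMetadata], title: str, year: int) -> List[IssueMetadata]:
    """Return the issues of one title published in a given year."""
    return [i for i in issues
            if i.title == title and i.publication_date.year == year]


# Invented sample records, for illustration only.
issues = [
    IssueMetadata("Leeds Mercury", "Leeds", date(1850, 1, 5), 1, 8, [3] * 8, False),
    IssueMetadata("Leeds Mercury", "Leeds", date(1851, 1, 4), 2, 8, [2] * 8, True),
    IssueMetadata("Hull Packet", "Hull", date(1850, 1, 4), 1, 4, [3] * 4, False),
]
print(find_issues(issues, "Leeds Mercury", 1850))
```

A search by metadata then needs no OCR at all, which is why it stays reliable even where the page text is poor.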

5
Ways forward in Metadata
  • Refine current elements
    • Subject category for articles
      • Sub-categories and further classification
    • Type of illustration
      • Photograph, map, cartoon, etc.
    • Change to title
      • Dates when it occurred
  • Gather new elements
    • Original size
    • Price
    • Publisher
    • Author
  • User-entered metadata
    • Keywords entered by a user when viewing an article
    • Could be used in specific searches if desired

6
Ways forward in Browsing
  • Difficult without ITT specification
  • This Day in History
  • Interactive Timeline
  • Interactive Map for Browsing
  • Custom, personalised viewing: Bookbag
  • Using statistics from usage
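
One of the browsing ideas above, This Day in History, could be sketched as a filter on issue dates. The function name and the sample dates are invented for the example, and a real delivery platform would query an index rather than a list.

```python
from datetime import date
from typing import Iterable, List


def this_day_in_history(issue_dates: Iterable[date], today: date) -> List[date]:
    """Return issue dates that share today's month and day, oldest first."""
    return sorted(d for d in issue_dates
                  if (d.month, d.day) == (today.month, today.day))


# Invented publication dates, for illustration only.
issue_dates = [date(1838, 6, 2), date(1851, 6, 2), date(1851, 6, 3), date(1870, 12, 25)]
print(this_day_in_history(issue_dates, date(2024, 6, 2)))
```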

7
Quality of OCR being produced

8
Key factors affecting OCR accuracy
  1. Mass production environment: impossible to
    hand-tweak every image; a compromise between time
    and quality
  2. Software: always improving and developing
  3. Quality of text: varies within a run (see images)
  4. Complexity of layouts and formats: varies between
    the 46 titles
  5. Microfilm source: doesn't affect this project, as
    the microfilm is of a very good standard, but
    could in future projects

9
Excessive inking and washed-out text

10
Quality of texts
  • Loss of text due to truncation
  • Loss of text due to tight bindings

11
Why bother with OCR?
  • Calculating OCR character accuracy is
    time-consuming and ultimately misleading
    • Character accuracy vs word accuracy
    • Word accuracy vs significant words
  • Why OCR?
    • Provides smallest level of access into the
      information
    • Size of project is such that detailed
      descriptions in the metadata are impossible
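
To show why character accuracy can mislead, the sketch below compares three measures on one invented line of text. It assumes accuracy is defined as 1 minus the edit distance divided by the length of the ground truth, and that "significant words" are simply the words left after removing a small stopword list; both definitions are assumptions made for the example, not the project's method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def accuracy(truth, ocr):
    """1 - (edit distance / ground-truth length), never below 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def char_accuracy(truth: str, ocr: str) -> float:
    return accuracy(truth, ocr)


def word_accuracy(truth: str, ocr: str) -> float:
    return accuracy(truth.split(), ocr.split())


def significant_word_accuracy(truth: str, ocr: str, stopwords) -> float:
    """Share of significant ground-truth words found verbatim in the OCR."""
    significant = [w for w in truth.lower().split() if w not in stopwords]
    if not significant:
        return 1.0
    ocr_words = set(ocr.lower().split())
    return sum(w in ocr_words for w in significant) / len(significant)


# Invented example: the OCR misreads "the" as "tbe" twice.
truth = "the meeting of the chartists at leeds"
ocr = "tbe meeting of tbe chartists at leeds"
stopwords = {"the", "of", "at"}

print(round(char_accuracy(truth, ocr), 3))            # 0.946
print(round(word_accuracy(truth, ocr), 3))            # 0.714
print(significant_word_accuracy(truth, ocr, stopwords))  # 1.0
```

The same two errors give three quite different scores, and only the last reflects whether a search for the words a user is likely to type would succeed.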

12
Ways forward in OCR
  • Software and technology improvements
  • OCR from greyscale rather than bitonal
  • Increase resolution: is 300 dpi sufficient?
  • Give user raw text to view even though imperfect

13
Searchability Testing
  • OCR statistics not useful when determining value
    of full text searching
  • Difficult to assess whether the current standard
    of OCR will deliver searching ability without a
    delivery platform

14
Lessons learnt
  • Significant names and words repeat in newspapers;
    in longer articles this improves retrieval
    chances even when the OCR is poor
  • Poor microfilms can be duped darker to improve
    density, making faint lines darker and characters
    stronger
  • The microfilm standards the BL is working to are
    consistently high
  • Quality of the input image is a good predictor of
    OCR accuracy: use greyscale as the source for the
    OCR
  • Quality of service images as expected from mass
    production run
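
The first lesson above, that repetition helps retrieval, can be illustrated with a simple model. It assumes every occurrence of a word is recognised independently with the same probability, which real OCR errors will not strictly satisfy (a damaged page tends to spoil neighbouring words together), so the figures are indicative only. The function name is invented for the example.

```python
def retrieval_chance(word_accuracy: float, occurrences: int) -> float:
    """Chance that at least one occurrence of a word is recognised correctly,
    assuming independent errors with the same per-word accuracy."""
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must not be negative")
    return 1.0 - (1.0 - word_accuracy) ** occurrences


# With only half of all words recognised correctly:
for n in (1, 3, 5):
    print(n, retrieval_chance(0.5, n))
```

Under this model a name that appears five times in an article is found about 97% of the time even when only half the words are read correctly.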

15
Conclusions
  • For titles with poor OCR, only option for
    searching is to rekey (if legible to human eye)
  • Benefit of issues being bound in volumes is they
    are in correct sequence and missing issues can be
    identified
  • A darker density greyscale for OCR purposes is
    recommended
  • Using keywords to enhance article categorisation
    helps to clarify meaning for operators
  • Mark-up style sheets are relevant only to a single
    title, so are of limited use for a large group of
    diverse titles
  • Visual clues about style and structures per title
    could be explored
  • Mass production environment is a limitation
  • Title and issue level metadata, pagination and
    quality rating captured from originals by BL
    staff aids accuracy in later mass production run

16
Conclusions
  • A better way to define OCR accuracy is by proper
    names rather than characters
  • Rekeying the article title gives high accuracy; if
    there is no title, the first two lines are keyed
  • High levels of QA are necessary to deliver good
    quality source material for the digitisation
    production environment
  • Random sampling of issues rather than articles to
    check sequence of pages is preferred
  • Condition information about each page being
    captured by BL should be included in metadata
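
The preference for sampling whole issues to check page sequence could be sketched as follows. The function names, the issue identifiers and the page lists are hypothetical, and the check assumes pages within an issue are numbered from 1 upwards, which may not hold for every title.

```python
import random
from typing import Dict, List, Optional


def sample_issues(issue_ids, sample_size: int, seed: Optional[int] = None) -> List[str]:
    """Pick a random sample of whole issues (not individual articles)."""
    pool = list(issue_ids)
    rng = random.Random(seed)  # a fixed seed makes the QA sample repeatable
    return rng.sample(pool, min(sample_size, len(pool)))


def pages_in_sequence(page_numbers: List[int]) -> bool:
    """True when the pages run 1, 2, 3, ... with no gaps, repeats or swaps."""
    return list(page_numbers) == list(range(1, len(page_numbers) + 1))


# Invented issues mapped to the page numbers found in scan order.
scanned: Dict[str, List[int]] = {
    "issue-001": [1, 2, 3, 4],
    "issue-002": [1, 2, 4, 3],  # two pages swapped
    "issue-003": [1, 2, 3, 5],  # page 4 missing
    "issue-004": [1, 2, 3, 4],
}

for issue_id in sample_issues(scanned, sample_size=2, seed=1):
    print(issue_id, pages_in_sequence(scanned[issue_id]))
```

Sampling by issue keeps each page run intact, so a swapped or missing page shows up in a way that a sample of isolated articles would not reveal.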

17
Future Developments
  • We could develop zone types within structures,
    e.g. main body of text, table, caption to
    illustration, etc.
  • OCR engines are improving; the content can be
    reprocessed, with better fuzzy matching
  • To increase interoperability scope, use more
    Dublin Core standard elements, e.g. dc:publisher,
    dc:description, for a limited number of articles
  • Yes/No Boolean indicator for illustrations could
    be expanded to type of illustration, using a
    restricted vocabulary, e.g. map, cartoon,
    photograph
  • Using automatically collected usage statistics to
    create links and associations which people have
    found and browsed
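
As an indication of what fuzzy matching offers over exact matching, the sketch below uses Python's standard difflib module to find OCR words that are close to a search term. The function name, the 0.8 similarity cutoff and the sample text are assumptions for the example; a production search platform would use its own index and matching rules.

```python
import difflib
from typing import List


def fuzzy_find(query: str, ocr_text: str, cutoff: float = 0.8) -> List[str]:
    """Return words in the OCR text that closely resemble the query."""
    words = sorted(set(ocr_text.lower().split()))
    return difflib.get_close_matches(query.lower(), words, n=5, cutoff=cutoff)


# Invented OCR output in which "chartists" was misread as "chartlsts".
ocr_text = "the chartlsts met at leeds"

print("chartists" in ocr_text.split())    # exact match fails: False
print(fuzzy_find("chartists", ocr_text))  # ['chartlsts']
```

A lower cutoff finds more damaged words but also returns more false hits, so the threshold would need tuning against real search behaviour.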

18
Summary
  • Access is determined by:
    • The available technology, e.g. OCR, document
      structure analysis
    • The size of the project: mass digitisation
      means no hand-tweaking
    • The source material: there are limitations
      with poor source material
  • We have learnt a great deal about giving users
    better, quicker and fuller access to the content.
    This project is a good pilot for future projects.