The Technical Aspects of Text Digitization

1
The Technical Aspects of Text Digitization
  • David Wiesenfeld, Matt Depetro, Jordan Stephens

2
Benefits of Scanning
  • No physical space needed: data is stored on a
    hard drive
  • Documents/books searchable with Ctrl-F
  • Inter-library loan becomes a non-issue

3
How digitization works
  • The book/journal is scanned
  • The picture resulting from the scanning is fed
    through an optical character recognition program
    and converted to searchable text
  • OR: the scanned image itself can be kept and used
    as a graphical representation
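
As a rough illustration of what "searchable text" buys you, the sketch below builds a small word index over OCR output and looks up which pages mention a term. The OCR step itself is not shown: `ocr_pages` is hypothetical sample data standing in for whatever the OCR program produced, and `build_index` and `search` are illustrative helper names, not part of any OCR package.

```python
import re
from collections import defaultdict

# Hypothetical OCR output: page number -> recognized text (sample data only).
ocr_pages = {
    1: "The book is scanned on a flatbed scanner.",
    2: "Optical character recognition converts the picture to text.",
    3: "Searchable text lets readers find a scanned passage quickly.",
}

def build_index(pages):
    """Map each lowercase word to the set of pages it appears on."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_no)
    return index

def search(index, term):
    """Return the sorted page numbers that contain the given word."""
    return sorted(index.get(term.lower(), set()))

index = build_index(ocr_pages)
print(search(index, "scanned"))
print(search(index, "text"))
```

A page image kept only as a graphical representation cannot be indexed this way, which is the trade-off between the two options on this slide.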

4
Types of Scanners
  • Flatbed Scanner (the one on the right)
    Price: $100 and up
  • Planetary Scanner (the one on the left)
    Price: expensive
  • High Speed document feeder Scanner
    Price: $6000

5
Scanner Speed
  • Flatbed Scanner: 5 ppm (USB 1.1 connection),
    12 ppm (USB 2.0 / FireWire)
  • Planetary Scanner: 2-5 seconds/page, not
    counting page turning, dependent on quality
  • High speed document feeder: 5 seconds/page
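
The speeds above mix pages per minute with seconds per page, so a small conversion makes them comparable. The sketch below puts everything in seconds per page and estimates scan-only time for one book. It assumes a 200-page book (the figure used in the cost estimate later in the deck) and reads the planetary figure as a range of 2 to 5 seconds per page; the variable and function names are illustrative.

```python
# Scan-only time for one book, using the speeds quoted on the slide.
PAGES = 200  # ASSUMPTION: book length, borrowed from the cost-estimate slide

# Seconds per page; pages-per-minute figures are converted as 60 / ppm.
seconds_per_page = {
    "Flatbed (USB 1.1, 5 ppm)": 60 / 5,
    "Flatbed (USB 2.0 / FireWire, 12 ppm)": 60 / 12,
    "Planetary (best case)": 2,
    "Planetary (worst case)": 5,
    "High-speed document feeder": 5,
}

def minutes_for_book(sec_per_page, pages=PAGES):
    """Scan-only minutes for a book, ignoring page turning and setup."""
    return sec_per_page * pages / 60

scan_minutes = {name: minutes_for_book(s) for name, s in seconds_per_page.items()}

for name, minutes in scan_minutes.items():
    print(f"{name}: {minutes:.1f} min")
```

These numbers leave out page turning and handling, which the slide notes are not counted for the planetary scanner.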

6
Pros/Cons of Different Scanner Types
  • Pros
    - Flatbed: simple, inexpensive
    - Planetary: does not damage book binding, high
      quality, quick page-turning
    - High Speed: fast, no human interaction while
      scanning
  • Cons
    - Flatbed: slowest
    - Planetary: expensive, large, expensive software
      ($4000)
    - High Speed: can only scan sheets, so book
      bindings must be removed first

7
OCR Software
  • Readiris, OmniPage, and ABBYY FineReader are the 3
    main OCR software packages - Multiple
    languages supported
  • With basic text formatting and modern-quality
    printing, OCR makes about 1 mistake per 1000
    characters
  • The error rate is worse for older books
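
To get a feel for what 1 mistake per 1000 characters means in practice, the sketch below scales that rate up to a page and to a whole book. The 200-page book length comes from the cost estimate later in the deck; the 2000 characters per page is purely an illustrative assumption, not a figure from the presentation.

```python
ERRORS_PER_1000_CHARS = 1   # from the slide: 1 mistake per 1000 characters
PAGES_PER_BOOK = 200        # average book length used in the cost estimate
CHARS_PER_PAGE = 2000       # ASSUMPTION: illustrative only, not from the slides

def expected_errors(pages, chars_per_page=CHARS_PER_PAGE,
                    errors_per_1000=ERRORS_PER_1000_CHARS):
    """Expected number of OCR mistakes for the given number of pages."""
    return pages * chars_per_page * errors_per_1000 / 1000

errors_per_page = expected_errors(1)
errors_per_book = expected_errors(PAGES_PER_BOOK)

print(f"Expected mistakes per page: {errors_per_page:.0f}")
print(f"Expected mistakes per book: {errors_per_book:.0f}")
```

Under these assumptions even clean modern printing leaves a few hundred mistakes per book, which is why proofreading matters so much to the overall cost.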

8
More OCR Software
  • OCR software can learn - Recognize
    handwriting - Can be taught new words and
    possibly new characters
  • Not good at handling strange formatting -
    Columns, tables - Mathematical equations

9
Very Rough Cost Estimate
  • Assume 2 pages/scan on scanner/copier
  • 5 seconds/scan, 5 seconds to turn page, avg.
    book length 200 pages, work-study is paid
    $9/hr
  • Time to scan book: about 17 minutes
  • Books/hr: about 3
  • Pay work-study student to do this => cost/book
    about $3 for scanning only
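
The arithmetic behind this estimate can be written out as follows. This is a sketch using only the slide's inputs; the wage is assumed to be in dollars, and the variable names are illustrative.

```python
PAGES_PER_SCAN = 2          # two facing pages per scan
SECONDS_PER_SCAN = 5
SECONDS_PER_PAGE_TURN = 5
PAGES_PER_BOOK = 200
WAGE_PER_HOUR = 9.0         # ASSUMPTION: dollars per hour

scans_per_book = PAGES_PER_BOOK / PAGES_PER_SCAN
seconds_per_book = scans_per_book * (SECONDS_PER_SCAN + SECONDS_PER_PAGE_TURN)
minutes_per_book = seconds_per_book / 60
books_per_hour = 3600 / seconds_per_book

# Exact cost, and the cost if only whole books per hour are counted.
cost_per_book_exact = WAGE_PER_HOUR / books_per_hour
cost_per_book_rounded = WAGE_PER_HOUR / int(books_per_hour)

print(f"Minutes per book: {minutes_per_book:.1f}")
print(f"Books per hour: {books_per_hour:.1f}")
print(f"Cost per book (exact): {cost_per_book_exact:.2f}")
print(f"Cost per book (whole books/hr): {cost_per_book_rounded:.2f}")
```

The exact figures come to roughly 16.7 minutes, 3.6 books per hour, and 2.50 per book; the slide rounds down to 3 whole books per hour, which gives its figure of 3 per book.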

10
Did Not Take Into Account
  • Time for OCR to run (probably significant)
  • Cropping of scanned document
  • Each page is saved as a separate file unless a
    document feeder is used, so time to compile the
    pages into one document must be added
  • PROOFREADING!! (This would take a lot of time and
    money)

11
Conclusions
  • Both time and money make scanning a large portion
    of the collection unrealistic
  • Could scan TOCs/indexes to add keywords for the
    book to HOLLIS - This would take much less
    time, and require less proofreading - Done by
    the Library of Congress

12
Suggestions
  • Could scan only books that have been recalled
    from the library, before they are returned -
    manageable number at a time
  • Million Books Project: could send the pre-1923
    part of the collection (public domain cutoff) to
    India to be scanned by the Project. Books would
    be returned, along with the scanned, searchable
    text

13
And now for something completely different

14
Half-Baked Idea about Journals
  • Prof. Shieber mentioned in his talk on the evils
    of for-profit journals that one of the reasons
    the editorial board and reviewers just didn't up
    and leave the publisher was that the publisher
    provided infrastructure
  • If a number of Universities set up large servers,
    journals could be published to those servers,
    giving access to everyone

15
Continuing that Thought
  • The journals could be set up with thumbnails in a
    serendipitous-browsing-friendly manner
  • References could be set up as links to the cited
    articles, making researching a topic much
    easier
  • A nice paper-archive copy could be printed out on
    nice, acid-free paper, bound, and sent out to the
    depository in case the servers decide to melt