Title: Digital Image Scanning
1Digital Image Scanning
- Instructor
- Geri Bunker Ingram
- geri_at_dimema.com
- An Infopeople Workshop
- August 2005
2This Workshop Is Brought to You By the Infopeople
Project
Infopeople is a federally-funded grant project
supported by the California State Library. It
provides a wide variety of training to California
libraries. Infopeople workshops are offered
around the state and are open registration on a
first-come, first-served basis. For a complete
list of workshops, and for other information
about the Project, go to the Infopeople website
at infopeople.org.
3Introductions
- Please tell us again, your
- Name
- Library
- Position and role within the Local History
Project
- Are there lingering questions from yesterday that
we should discuss?
4Learning Objectives
- Understand the basics of digital imaging
- Interpret and evaluate scanning specifications
for your project
- Differentiate among technology options
for various formats
- Understand the significance of standard metadata
- Learn about display and navigation options.
5Agenda
- 9:00-10:30 What is Digitization?
- 10:30-10:45 BREAK
- 10:45-12:00 Technology Infrastructure
- 12:00-1:00 LUNCH
- 1:00-2:30 Metadata, Rights, Quality Control
- 2:30-2:45 BREAK
- 2:45-4:00 Effectiveness
6What is Digitization?
7What is Digitization?
- Process of digitization
- resolution
- bit depth
- The Local History Project guidelines and
standards
- The implications of these standards
8A Refresher on Scanning
- Scanning takes reflected light signals and
changes them to digital data.
- The resulting digitized image is made up of a
grid of individual picture elements.
- Picture elements are known as pixels. Each pixel
is described by one or more binary digits (bits).
- Each bit is expressed as either 0 or 1.
9Controlling Spatial Detail and Accuracy
- Two settings affect spatial detail and accuracy
during the scanning process
- bit depth (the number of bits recorded per pixel)
- resolution (the number of pixels sampled per inch)
10Adjusting Bit Depth
- Binary digit (bit) depth
- number of bits used to define each pixel.
- the greater the bit depth, the greater the number
of tones (grays or colors)
- Black and white (bitonal): 1 bit per pixel
- Grayscale: 8 bits per pixel (256 shades of gray)
- Color: 24 bits per pixel (16.7 million color
tones)
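The tone counts on this slide follow from the fact that n bits can encode 2 to the power n distinct values. A minimal sketch of that arithmetic (the function name tone_count is only illustrative):

```python
def tone_count(bit_depth: int) -> int:
    """Number of distinct tones one pixel can take at the given bit depth."""
    return 2 ** bit_depth


# The three bit depths named on the slide.
for label, bits in [("bitonal", 1), ("grayscale", 8), ("color", 24)]:
    print(f"{label}: {bits} bits per pixel -> {tone_count(bits):,} tones")
```

At 24 bits the count is 16,777,216, which is the "16.7 million" quoted above.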
11Adjusting Resolution
- Resolution is a sampling rate
- how many dots per inch will you scan?
- E.g., 400 dpi.
- The effect
- the higher the rate, the smoother the image
- the more it can be magnified before its
individual pixels become visible
- High resolution: many dots per inch
- Low resolution: fewer dots per inch
12Sometimes Resolution Is Expressed As Absolute
Pixel Dimensions
- Pixel dimensions
- (dpi x width) x (dpi x height)
- Example: an 8 x 10 inch image scanned at 400 dpi
has pixel dimensions of 3200 x 4000
(400 x 8 = 3200 and 400 x 10 = 4000)
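The pixel-dimension formula can be sketched in a few lines. This is only an illustration of the arithmetic on the slide; the function and parameter names are assumptions, not part of any scanning software.

```python
def pixel_dimensions(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Return (width, height) in pixels for an original measured in inches."""
    return round(dpi * width_in), round(dpi * height_in)


# The slide's example: an 8 x 10 inch original scanned at 400 dpi.
print(pixel_dimensions(8, 10, 400))
```

Doubling the dpi doubles each dimension, which quadruples the total number of pixels.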
13Storing Your Images
- Very high quality images create very large files
- The higher the resolution, the greater the file
size
- The higher the bit depth, the greater the file
size
- For the exercise coming up
- two different formulas
- to figure out how much disk space images need
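The exercise formulas themselves are not reproduced on this slide. As a hedged sketch, one common way to estimate the size of an uncompressed image is pixels wide x pixels high x bit depth, divided by 8 to convert bits to bytes. The function below assumes that formula and may differ from the ones in the exercise handout.

```python
def uncompressed_size_bytes(width_in: float, height_in: float,
                            dpi: int, bit_depth: int) -> int:
    """Estimate the uncompressed size of a scan in bytes (assumed formula)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bit_depth
    # Round up to a whole byte; 8 bits make one byte.
    return (total_bits + 7) // 8


# An 8 x 10 inch original at 400 dpi, at the three common bit depths.
for bits in (1, 8, 24):
    size = uncompressed_size_bytes(8, 10, 400, bits)
    print(f"{bits}-bit: {size / 1_000_000:.1f} MB")
```

File headers and any compression are ignored here, so real files will differ somewhat from the estimate.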
14Three Or More Files For Every Image
- Master image
- This is the one you do not tamper with; store it
in a file format that does not lose data when
you save it.
- Two derivatives
- access (service) image
- small (thumbnail) image.
15Master Files
- Stored offline
- it is valuable
- usually too large for common bandwidth
- Not uncommon to have multi-megabyte master
images.
- The exception is the JPEG2000 format, which
supports progressive display (details later).
16Service or Access Images
- By contrast, a common range for the service or
access image is
- 100 to 500 KB
17Thumbnail: The Smallest Access Image
- A thumbnail may be only a few KB, and typically
is no larger than
- about 150-200 pixels on a side
18For the Local History Project
- Full resolution image and large service image
delivered directly to libraries
- Import either of them to CONTENTdm to derive a
service image and thumbnail
- Automatic with CONTENTdm software
19Keeping Your Master
- Retain on your local system, on the CDs
delivered, or in any other manner you like.
- CDL will also receive a copy of both master and
derivative.
- Store the master as your preservation copy.
- Important to understand the storage implications
of your master images
20Local History Project Scanning
- A common specification has been developed
- Scanning vendor (will have been) selected
- It is still important to understand the
specification and infrastructure issues.
21Exercise 1: Calculating File Sizes for Digital
Images
22Technology Infrastructure
23Technology Infrastructure
- In this unit we will discuss the hardware,
software and networking requirements of digital
projects.
- We will touch on data storage again briefly and
will delve into the question of compression and
file formats.
24The Local History Project
- Will run on computers located around the state,
connected through the Internet.
- The smooth operation of this distributed
infrastructure involves not only hardware and
software, but also depends upon good
communication among people.
25All Of This Takes Planning
- All the partners in the project
- including the partners providing info tech
services
- Must demonstrate good communication skills and
consistently confer with each other
26Library Policies
- Security
- Intellectual property
- Policies must be in synch with info tech provider
- regardless of whom that may be
27CDL Will Be Providing Access To Your Collections
- They must be able to protect their networks from
misuse.
- The end-users must be able to easily access
unrestricted material.
28Distributed Architecture
- Designed for the Local History Project, it has
local libraries feeding material into a central
databank
- Fairly sophisticated, and yet divides the labor
according to appropriate tasks.
29The Local History Project Will Comprise A Set Of
Collections
- Each built locally
- Using the CONTENTdm Acquisition Station software,
and stored on the CONTENTdm server. The materials
will be copied to the CDL
- Part of a collaborative program for both access and
preservation
30Local History Project Offers At Least Three
Outlets For Collections
- The way your metadata will get into the CDL is
through the use of the CONTENTdm export function.
- A customized export/import mechanism writes your
metadata in the METS format
- You will be trained in its use during your
CONTENTdm training session
31Managing The Digital Files
- Because your scanning will have been done by a
vendor, we will not discuss the attributes of
scanning software fully.
- But you will need to know something about the
various pieces of software in use.
32The Processing That Will Be Done For You Includes
- Scanning: representing a print item as a digital
image, e.g., with the software that runs your digital
camera or your scanner.
- OCR software: if you have text that you would
like made searchable, software such as OmniPage
converts the words in the image to a text
file that can be searched.
- Lastly, a Digital Asset Management System (e.g.,
CONTENTdm) provides a way to organize the image
files, make derivatives and add metadata to each
image.
33CONTENTdm Selected
- High-performance tool
- Easy-to-use interface
- Will scale as the collections grow
- i.e., it will continue to perform well and be
manageable even when there are millions of
objects
34Hardware From Scanning To Storage
- The lifecycle of collections now includes
preservation of the digital image.
- Before scanning hardware or specifications are
set
- consider the technical issues
- for access AND for
- long-term preservation of the digital image
35Sustaining Collections Over Time
- Data needs to be saved and protected at every
stage in its life-cycle
- Many ways of accomplishing this are in
experimental stages
36Preservation Of Digital Files
- Data migration
- e.g. moving files from CD to DVD
- Backup and archiving plans
- e.g. storing files online or on a central backup
server
- Disaster recovery plans for both analog and
digital resources
- Heaven forbid the library burns down! What
happens to your CDs, your computers?
37Preservation Repository Must Also Be Managed
- Sized, weeded, protected and moved
- Because CDL is offering long-term preservation,
- your scans and metadata must meet the standards
set for the repository!
38Choosing Among File Formats
- One decision that affects collections'
accessibility and preservation potential is
- The format of the files you choose to keep
39Many File Formats (LHDRP is requiring these)
- TIFF
- JPEG2000
- GIF
- JPEG
- PDF
- MrSID: proprietary, wavelet-based compression for
progressive display
40Choosing among the file formats means you need to
understand something about what the file format
specification implies.
41Compression Used To Reduce File Sizes
- Two kinds: lossy and lossless.
- Lossy: an irrecoverable loss of data,
- considerable size reductions (JPEG).
- Lossless (JPEG2000 in its lossless mode, and TIFF),
- no loss of data.
- TIFF: no loss of data, but an uncompressed TIFF
file is not reduced in size
- JPEG2000: no loss of data, but can also reduce
the size of the file delivered for display, as it
is decompressed at the point of display.
42TIFF Tagged Image File Format
- TIFF itu-t.7
- Is a 24-bit storage format in widespread use.
- Useful for both color and bitonal (black and white)
images
- Provides a high level of detail. It is used for
archival files (masters).
- When compression is used, it should be lossless.
43JPEG Joint Photographic Experts Group/JFIF
(JPEG File Interchange Format)
- JPEGs are commonly used in bitmap image editing
programs
- e.g., Paintshop
- In viewers, and most important for our project,
- web browsers
- 24-bit, lossy compression format
- Well suited for screen and print presentations.
44JPEG2000
- Provides highly detailed views of objects
- Not a proprietary format
- but not all software can handle a JPEG2000 file
- both Photoshop and CONTENTdm have that capability
- To view a file saved as JPEG2000, some products
require a browser plug-in.
- CONTENTdm does not require one; it has a
built-in viewer in the extended server software.
- CDL does not currently support JPEG2000, so for
this project, you will not create JPEG2000 files.
45GIF Graphics Interchange Format
- 8-bit, lossless compression format
- Well-suited to low resolution screen display
- Often used for thumbnails
- Supported by all major computer platforms and web
browsers
46PDF Portable Document Format
- Proprietary (Adobe) format, now
- de facto standard (is actually several formats)
- All need a plug-in or external application for
web display,
- but that reader is free to download.
- Widely used for printing and viewing multi-page
documents
47A Word About File Naming
- Best practice is to use the standard 8.3
convention, e.g., house178.txt.
- Use lower-case characters only, as some operating
systems such as Unix are case-sensitive.
- Avoid punctuation characters in filenames
altogether.
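The three naming rules above can be checked mechanically. The sketch below is one strict reading of them (up to eight lower-case letters or digits, a dot, and up to three more); the function name follows_convention is hypothetical and not part of CONTENTdm or any vendor tool.

```python
import re

# One to eight lower-case letters or digits, a dot, then one to three more.
NAME_PATTERN = re.compile(r"[a-z0-9]{1,8}\.[a-z0-9]{1,3}")


def follows_convention(filename: str) -> bool:
    """True if the name is 8.3, lower-case, and free of extra punctuation."""
    return NAME_PATTERN.fullmatch(filename) is not None


for name in ["house178.txt", "House178.txt", "my house.tif", "longfilename.tiff"]:
    print(name, follows_convention(name))
```

This strict reading also rejects underscores, so a project that allows names such as page_01.jpg would need to add the underscore to the permitted characters.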
48File Naming
- Simple: a single image
- Compound: more than one image
- Components need to be named and stored in logical
fashion
- E.g., when assembling, page_01.jpg will precede
page_02.jpg (alphanumeric sort)
- E.g., when assembling a hierarchy, items need to
be stored in logical directories
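The reason for the leading zero in page_01.jpg shows up as soon as a compound object has ten or more parts. A short sketch, assuming the page_NN.jpg pattern from the slide (the helper page_filename is illustrative):

```python
def page_filename(number: int, extension: str = "jpg", width: int = 2) -> str:
    """Build a zero-padded component name such as page_01.jpg."""
    return f"page_{number:0{width}d}.{extension}"


pages = (1, 2, 10)
unpadded = sorted(f"page_{n}.jpg" for n in pages)
padded = sorted(page_filename(n) for n in pages)

print(unpadded)  # page_10.jpg sorts before page_2.jpg
print(padded)    # pages stay in reading order
```

For objects with more than 99 parts, a wider padding (for example width=3) keeps the alphanumeric sort in page order.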
49Local History Project Conventions
- Vendor must deliver files named with an
appropriate scheme
- that works for your library
- And for the Local History Project
- Exercise will focus on file handling
- File formats, naming and organization
50Hardware
- Digital project hardware components will include
at minimum
- Servers
- Desktop computers
- Network components
51Your CONTENTdm Environment
- Server located and managed remotely for the Local
History Project.
- Computer on your desktop
- Network: IT provider uses components
- e.g., routers, cables, access points, network
interface cards
- to connect everything together and to the
internet.
52Data Storage Day-to-day, And Over The Long Haul
- As you populate your collections, it is important
to back up the workstations and network drives
regularly. At the site of the CONTENTdm server,
as well as at CDL, servers will also be backed up
regularly.
53Digital collection servers
- Remember: form follows function.
- Hardware is sized for the project and for the
environment,
- After the software has been chosen.
54CONTENTdm Server is Hosted by OCLC
- For LHDRP
- One-year license
- After that, depends on funding; if funded, could
be extended
55Considerations If You Run A Server
- Processor style and speed
- Minimum RAM
- Minimum online storage
- These variables always depend upon the context of
your organization, the operating system
environments supported, and the application
requirements.
- The minimum requirements for servers in general
assure good performance, i.e., you can very
rapidly search and retrieve dense data, and
display to many concurrent users.
56CONTENTdm 4 Minimum Server Requirements
- CPU: Intel Pentium 4 or greater
- RAM: 512 MB minimum
- Operating Systems
- Linux, Unix, Sun Solaris 8 or higher, Windows
2000/2003
- Dedicated Web server
- (IIS 4.0 or later with Windows, Apache with UNIX)
57Storage For Files
- Both derivatives (service images and
thumbnails) are
- kept online
- The archival TIFF is stored offline
58The Files Most Commonly Seen As Derivative
(Access) Files
- JPEGs averaging 100 KB (with most CONTENTdm
collections)
- Estimate: 500 JPEGs will need about 50 MB of space to
store the access (service, derivative) images
- To size a CONTENTdm server, assume that a
- 1 GB disk
- Will store 10,000 jpgs for high-quality display
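The sizing rule on this slide is a simple multiplication, sketched below. The 100 KB default comes from the slide; the function name and the use of decimal units (1 MB = 1,000 KB) are assumptions made for the illustration.

```python
def access_storage_mb(image_count: int, average_kb: float = 100) -> float:
    """Estimate online storage for access JPEGs, in megabytes."""
    return image_count * average_kb / 1000


print(access_storage_mb(500))     # the 500-image estimate from the slide
print(access_storage_mb(10_000))  # the number of images assumed to fit on 1 GB
```

Thumbnails add only a few KB per image, so they change the estimate very little; master TIFFs are stored offline and are not counted here.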
59To Populate The Collections, On The Desktop
CONTENTdm Requires
- Monitor capable of 1024 x 768 resolution
- 256 MB RAM (512 recommended)
- Disk capacity to hold images (temporarily) and
software
- i.e. 100 MB for installation of Acquisition
Station
- Windows 2000 or XP
- 128 Kbps minimum network connection
60A Desktop Wish List: Not Required, But Nice!
- A dedicated computer for digitization with
- A 19 or 21 inch display monitor
- 1 GB RAM (for multi-media)
- 3.2GHz/800MHz processors optimized for image
manipulation
- Graphics processors (up to 128 MB dedicated RAM)
for high quality video, multiple monitors, etc.
- High-quality loupes, scales and updated targets
61Digitizing Devices: Scanners and Cameras
- In this phase of the project, your scanning will
be outsourced
- But info on scanners and cameras is included here
for future reference
62We Will Discuss The Primary Types Of These
Capture Devices
- Flatbed scanners
- Transparency scanners
- Overhead scanners
- Wide format scanners
- Cameras
- Copy stand cameras
- Camera backs
63The Flatbed Scanner
- Chances are you have one of these in your library
(or your home). They handle unbound material up
to 11 x 17 inches in size, and some come with
automatic document feeder attachments so that you
can stack a document for scanning.
- The makes and models vary greatly in cost and
quality. Some have transparency adapters too, but
if you have a lot of film (slides) to scan, you
may want a specialized scanner just for them.
64Transparency Scanner
- For transparent material, both negatives and
slides, there are many makes and models to choose
from, but a commonly used one is made by Nikon.
- E.g., Nikon LS-2000 Film Scanner
- 36-bit color
- 58 MB file size
- 20 second scan speed
- 2700 dpi resolution
- 35mm film strip or slide format
65Overhead Scanner
- If you do a lot of interlibrary loan, you may
already own an overhead scanner. It was designed
for books and other bound documents, so that the
page is protected from touch by the machine.
- E.g., the Minolta PS 3000 and PS 7000 are in wide
use
66Cameras
- For 3-dimensional items and sometimes for
oversize items, cameras are becoming very
popular. Discussions on various listservs such as
imagelib are lively with comparisons of cameras,
from the consumer models we carry on our
vacations to high-quality professional setups.
- E.g., Nikon COOLPIX 3100
- Effective pixels: 3.2 million (total pixels: 3.34
million)
67Copy Stands
- Are used for long exposures, repeated placement
of objects, etc.
- An example of a high quality camera and copy
stand is the Leica S1 Pro Digital Camera used in
the digitization lab at the University of Utah.
It is described as
- Triple linear color CCD line, high-performance
full step motor.
- Full scan time is 185 seconds. Viewfinder offers
laterally correct image on a focusing screen with
a grid.
- Produces file sizes of
- 75MB at 36 bit color or
- 150MB at 48 bit color.
68Camera Specification
- Resolution for cameras is often given as the
total number of pixels delivered by a device. For
example, a camera may be described as having x number
of mega-pixels
- A mega-pixel is 1,000,000 pixels.
- E.g., Canon's S45 (4.5 Megapixel): maximum
resolution 2272 x 1704, which if you do the math
is closer to 3.9 megapixels
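The "do the math" step for a camera's stated resolution is width times height, divided by one million. A minimal sketch (the function name megapixels is illustrative):

```python
def megapixels(width_px: int, height_px: int) -> float:
    """Total pixels in millions for a given maximum image resolution."""
    return width_px * height_px / 1_000_000


# The maximum resolution quoted on the slide.
print(f"{megapixels(2272, 1704):.2f} megapixels")
```

For 2272 x 1704 this works out to roughly 3.87 million pixels, which is why an advertised figure and the usable image size can differ.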
69For Highest Quality Professional Work
- Photographers fit 4x5 traditional film cameras
with camera backs that store the images
digitally instead of in analog format. E.g.,
PhaseOne PowerPhase, a digital back for a 4x5
view camera that can produce resolutions of
10,000 x 12,000 pixels.
70Camera vs. Scanner
- Scanners and cameras share broadly similar
technologies, and at this point there are
negligible quality differences at the high end.
Of course scanners can only handle 2-dimensional
or flat images, while cameras can handle both
2-dimensional and 3-dimensional objects.
71Digital Cameras: Versatile and Fast
- They are preferred for delicate or fragile
originals and increasingly for large flat works
such as maps and aerial photos.
- But the lighting is hard to control. To get
professional quality work you may find yourself
hiring a professional photographer to come in.
Rare materials should not be subjected to strong
light, of course, so if doing that sort of
photography in-house, you might use a strobe
light.
72Prints From Digital: What Does The User Need?
- Many libraries are creating revenue-generating
(cost-recovery) programs that provide prints from
the collection.
- With the advent of digitization programs, these
prints are increasingly made from digitized
copies of the original.
- Occasionally users even purchase the digital file
itself.
73Cost of Commercial Printer
- To serve the occasional professional user
- Outsource to a commercial house or offer to sell
the digital image instead.
- Pro-sumer photo-quality printers can be had for
under $100
- e.g., Canon i560s
- Some of your users may prefer to buy the TIFF and
print at home
74If You Print From Digital, You Will House Large
Files
- The B&H Photo house in New York City estimates
these file sizes for good output
- Up to 3 MB: good for proofing, web use,
presentations
- 3-20 MB: good for up to 8x10 prints
- 21-50 MB: good for up to 16x20 prints
- 51-99 MB: good for up to 24x30 prints
- 100-125 MB: good for over 24x30 prints
75Networking Puts It All Together
- To move your digital images from your workstation
to your CONTENTdm server, you will use the
internet.
- Your connection should have sufficient bandwidth
for the digital formats you are importing.
- Your users will of course need to have
connections fast enough to download the images
in real time.
76Speeds
- T1: 1.544 million bits per second (Mbps). This
bandwidth is sufficient for building the
collection.
- T3: 45 Mbps. Of course this is even better, much
faster.
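To get a feel for what these line speeds mean in practice, transfer time can be estimated as file size in bits divided by the line rate. The sketch below is a best case: it assumes the full line rate is available and ignores protocol overhead and sharing, so real transfers will be slower. The names are illustrative.

```python
T1_MBPS = 1.544  # line rates from the slide, in millions of bits per second
T3_MBPS = 45.0


def transfer_seconds(size_bytes: int, link_mbps: float) -> float:
    """Best-case time to move a file over a link of the given speed."""
    bits = size_bytes * 8
    return bits / (link_mbps * 1_000_000)


access_image = 100_000      # a 100 KB access JPEG
master_image = 38_400_000   # an uncompressed 24-bit 8 x 10 inch scan at 400 dpi

for label, size in [("access image", access_image), ("master image", master_image)]:
    print(f"{label}: {transfer_seconds(size, T1_MBPS):.1f} s on T1, "
          f"{transfer_seconds(size, T3_MBPS):.1f} s on T3")
```

On a T1 line the access image takes about half a second, while the uncompressed master takes roughly 200 seconds, which is one reason masters are kept offline.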
77Wireless
- The most popular wireless mode
- 802.11 b/g (WiFi)
- shared 11 Mbps for b and 33-54 Mbps for g.
- This should be quite adequate for your end-users
to access your collections.
78Security
- Network access is made secure through various
methods,
- IP ranges (addresses like 209.116.xxx.xxx)
- Passwords
- Mixed models
- Integrated with a parent organization's model!
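As an illustration of how an IP-range check works in principle, the sketch below uses Python's standard ipaddress module. It reads the slide's 209.116.xxx.xxx pattern as the network 209.116.0.0/16, which is an assumption made for the example; it does not describe how CONTENTdm or any particular network actually implements access control.

```python
import ipaddress

# Hypothetical allow-list: every address beginning with 209.116.
ALLOWED_NETWORKS = [ipaddress.ip_network("209.116.0.0/16")]


def is_allowed(client_ip: str) -> bool:
    """True if the client address falls inside any allowed range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed("209.116.4.20"))  # inside the range
print(is_allowed("10.0.0.5"))      # outside the range
```

A mixed model would fall back to a username and password challenge when the address check fails.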
79Exercise 2: Materials Preparation
80Metadata, Rights, Quality Control
81Metadata
- Standards and schemes
- Access and preservation
- A full one-day workshop on the metadata
- Template for Local History Project in Project
Guide
82A Refresher What Do We Mean By Metadata?
- Metadata is information about the digital
object.
- Good metadata helps in finding and preserving a
digital object or aggregation of digital objects.
83Metadata Schema Examples
- AACR2 (MARC format)
- Dublin Core (DC)
- Visual Resources Association Core (VRA Core)
- Metadata Object Description Schema (MODS)
- Encoded Archival Description (EAD)
84Types of Metadata
- Descriptive
- Administrative
- Structural
- Technical
- Preservation
85Descriptive Metadata
- Terms that say what the digital object
represents, what it is about
- It's what your users expect: it identifies the
information resources in a way that allows them
to be discovered.
86Administrative Metadata
- Facilitates both short-term and long-term
management and processing of digital collections
- Includes data pertinent to the creation of the
digital object
- Includes rights management, access control and
use requirements
87Structural Metadata
- Facilitates navigation and presentation
- Provides information about the internal structure
of resources
- including page, section, chapter numbering,
indexes, and table of contents
- Describes the relationship among materials (e.g.,
photograph B was the front of postcard A)
88Technical Metadata
- Describes the features of the digital file
- e.g. resolution,
- pixel dimensions,
- and the compression factor used in saving the
file.
89Preservation Metadata
- The ability to preserve your digital resources
into the future depends in part on how completely
you've applied metadata, especially
- administrative
- structural
- technical metadata
90LHDRP, CONTENTdm and the Dublin Core
- Format
- Identifier
- Source
- Language
- Relation
- Coverage
- Rights
- CONTENTdm offers Audience too
- Title
- Creator
- Subject
- Description
- Publisher
- Contributor
- Date
- Type
91You Will Use These Elements To Describe Your
Collections
- At the item level
- during the CONTENTdm building process.
- Later, your collections will be
- exported
- Imported to OAC
- Metadata and CONTENTdm classes scheduled
- There we will delve into applying the Dublin Core
element set
92Rights Metadata
- When material needs to be restricted
- The reasons should be made clear to the
end-users,
- If possible, the right to access the objects
should be negotiated.
- You will have to clear your materials of any
restrictions so that they can be freely displayed
on the CDL's public access site(s).
93Access To CONTENTdm Server
- The Dublin Core Rights field can be used to
explain the rights situation for the item
- Mechanisms are in place to allow you to restrict
access to materials at the item and the
collection level.
- Some commonly used mechanisms for controlling
access to digital materials are user
name/password challenges and IP (internet
protocol) address ranges.
94CONTENTdm
- Uses both usernames/passwords and IP ranges
- Control access at the collection and the item
level
- When your users are viewing your images on a
CONTENTdm server.
95Quality control
- Getting the materials off to the vendor
- appropriately packed, tagged and flagged
- Getting the materials back from the vendor
- what will you check for?
- texts and photos: different things to look for
96Texts And Images Of Texts
- The scan produces a file in image format, which
in itself is not searchable
- There are a number of ways to create searchable
text from images of text.
97Converting Images To Text
- Re-keying
- very expensive, but high-quality
- handwritten text, or foreign language fonts
- you will have to create typescripts by hand.
- OCR (Optical Character Recognition) is the
automated way
- With correction, expensive,
- but without correction, lower accuracy
98What is OCR?
- OCR engines are
- pattern recognition algorithms which can
- convert images of alphanumeric characters
- into machine-recognizable characters.
99OCR Has Been Around Since The 1970s
- Much research to improve accuracy and extend the
readable language sets.
- Very expensive in the early days
- Available to desktop consumers in the mid-to-late
1990s
100Now There Is Decent Pro-sumer Desktop Software
- Such as ABBYY FineReader
- (e.g., this is offered as an extension of
CONTENTdm.)
- Service bureaus (vendors) have also developed
proprietary software
- get up to 90% accuracy
- can handle large volumes
- use filters, formulas and multi-pass methods
101The Problem With OCR
- When used on barely legible old texts, film,
etc., creates dirty ASCII
- Guesses are saved in a string
- not intended for human view. (These should be
cleaned up if display is important.)
- Can hide the dirty ASCII from display but allow
the search engine to index on it
102To What Degree Is The Accuracy Of The OCR
Important?
- This depends on the quality of the image being
processed, and on the intended use of the
captured text.
- A rule of thumb: higher resolution and greater
bit depth give more accurate OCR (and larger
file sizes).
103Imaging Vendor Checklist: Identifying
Unacceptable Scans
- Image not correct size
- File name is incorrect
- File format is incorrect
- Loss of detail
- Too light or too dark
- Image cropped incorrectly
- Image rotated incorrectly
- Image reversed
104Identifying Correct Packaging Of Digital Materials
- Object identifier
- The order of the compound object's parts
- corresponding file names and directory structure
- Verify to CALIFA
105Exercise 3: Quality Control
106Effectiveness
107What Is Success ?
- Best practices in the digitization process,
evaluation and quality control.
- Usability testing
- As technology changes,
- as long as you are relying on agreed-upon
standards,
- You will be able to go back and correct, improve
and expand.
108User-driven Purposes
- Many reasons for undertaking a digitization
project,
- All include improving and expanding end-user access
to your materials.
- Even preserving the content and conserving the
originals
- It is because someday a person may need to access
the resource
109 Late Turn-of-the-century History
- Regular use of digitization in cultural heritage
organizations
- such as libraries and archives
- Leaders in the field like the California Digital
Library and the Digital Library Federation
documented best practices
110Principles, Part 1
- Leading practices proven over time
- Scan at the highest resolution appropriate to the
informational content of the originals
- Scan at an appropriate level of quality to avoid
rescanning and re-handling of the originals in
the future: scan once
- Create and store a master image file that can be
used to produce derivative image files and serve
a variety of current and future user needs
- Use image file formats and compression techniques
that conform to industry standards
- Create backup copies of all files on a stable
medium
111Principles, Part 2
- Create meaningful metadata for image files or
collections
- Store media in an appropriate environment
- Monitor and recopy data as necessary
- Outline a migration strategy for transferring
data across generations of technology
- Anticipate and plan for future technological
developments
- Scan (or have your vendor scan) at the
appropriate settings for source material
- Inspect master images at 100% magnification (all
or a sample)
112Local History Project Standards
- The California State Library, CALIFA and CDL
- Partnered to create a set of standards for
digital imaging and metadata - To ensure that your collections are accessible to
your public and well-preserved into the future. - Selected a digital collection management tool
- Prepared a straightforward path for your
materials from CONTENTdm to the CDL
113Let's Revisit Our Project Plans
- And make sure we chart our course for the next
steps!
114Exercise 4: Assessing and Improving Your Local
History Project
115Conclusion
- Please fill out your evaluation forms
- See you in a few weeks for CONTENTdm training!