Title: Building collections with Greenstone
1Building collections with Greenstone
- How to Build a Digital Library
- Ian H. Witten and David Bainbridge
2Digital Library Collections
- There is a distinction between
- BUILDING collections
- DELIVERING information to users
- Similar to compile-time versus runtime distinction in computer programming
- Information structures should usually be prepared in advance
3Building a Collection
- The Collector
- A subsystem that takes you step by step through building a simple collection
- Conceals details behind the scenes
- First locate information on your computer or the Web
- Plain text, HTML, Word, PDF, email file, etc.
4Plug-ins
- Plug-ins are software modules that handle
- Format conversion
- Metadata extraction
- Plug-ins promote extensibility
5Greenstone Archive Format
- Greenstone Archive Format
- XML-based file format
- File format for
- Documents
- Metadata
6Collection Configuration File
- Collection Configuration File
- Defines the structure of a collection
- Governs how the collection is built
- Specifies how the collection will appear to users
7Greenstone Extended Capabilities
- Extending the Capabilities of Greenstone
- Plug-ins
- Handle different document and metadata formats
- Classifiers
- Handle different kinds of browsing structures
- Format statements and Macros
- Govern the user interface content and appearance
8Why Greenstone?
9Benefits of Greenstone
- General system for constructing and presenting digital collections
- Handles millions of documents, text, images, audio, video
- User interfaces identical in Web-based and CD-ROM versions
- Installs on Windows and Linux
- Access locally or remotely using web browser
10Organization of Collections
- Each collection can be organized differently
- Format of source documents
- Metadata
- Directory structure
- Document structure
- Searching and browsing services
- Presentation
- Auxiliary services
11Variation of Source Format
- Source documents can be supplied in
- Plain text
- HTML
- PostScript
- PDF
- Word
- E-mail
- Other file types
- Images
- Video
- Audio
12Variation of Metadata
- Different types of metadata
- Metadata can be supplied differently
- Fields in MS Word
- <meta> tags in HTML
- Information coded into filename and directories
- Spreadsheet or other data file
- Explicit metadata format like MARC
13Variation of Directory Structure
- Collections can vary in the directory structure
in which the information is located
14Variation of Document Structure
- Document structure
- Flat
- Divided sequentially into pages
- Hierarchical organization
- Title or other metadata available at each level
15Variation of Services
- Searching
- Metadata
- Indexes
- Hierarchical levels
- Browsing
- Metadata
- Browser type
16Variation of Presentation
- Results can be presented to users in various ways
- Format that target documents are shown in
- Search results page
- Metadata browsers
- Interface language
17Variation of Auxiliary Services
- A collection may require additional services
- User logging
- Etc.
18Collection Configuration File Allows Variation
- A digital library collection is made by
- Gathering raw material
- Designing the collection
- Putting design information about the structure
and presentation of the collection in the
Collection Configuration File
19Front Page of Collection
- Statement of collection's purpose
- Statement of collection's coverage
- Explanation of how collection is organized
20Searching Involves Indexes
- Searching is provided by indexes built from different parts of the documents
- Entire documents
- Paragraphs
- Titles
- Sections
- Section headings
- Figure captions
21Indexes
- Indexes can be created automatically using
- Documents
- Supporting files
- Indexes can be rebuilt automatically
- New document in the same format becomes available
- Process can awake, check for new material, and
rebuild the indexes
22Plug-ins for Indexing
- Source documents are converted into standard XML form for indexing using plug-ins
- Standard plug-ins process
- Plain text
- HTML
- Word
- PDF
- Usenet and email messages
- New plug-ins can be written for other document
types
23Browsing Involves Lists
- Browsing involves lists that can be examined by the user
- Authors
- Titles
- Dates
- Hierarchical classification structures
24Classifier Modules
- Modules called classifiers are used to create browsers and build browsing structures from metadata
- Scrollable lists
- Alphabetic selectors
- Dates
- Hierarchies
- Programmers can write new classifiers to create
novel browsing capabilities
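As a rough illustration of what a classifier produces, the sketch below groups documents into an alphabetic selector keyed on Title metadata. It is not Greenstone's classifier code (Greenstone classifiers are Perl modules); the function name `alphabetic_selector` and the dictionary-based document records are assumptions made for the example.

```python
from collections import defaultdict


def alphabetic_selector(docs):
    """Group document titles by first letter (hypothetical helper).

    docs is a list of dicts holding metadata, e.g. {"Title": "..."}.
    Returns a dict mapping an upper-case letter to its sorted titles.
    """
    buckets = defaultdict(list)
    for doc in docs:
        title = (doc.get("Title") or "").strip()
        if not title:
            continue  # documents without Title metadata cannot be listed
        buckets[title[0].upper()].append(title)
    # Sort the letters, and sort titles case-insensitively within each letter.
    return {
        letter: sorted(titles, key=str.lower)
        for letter, titles in sorted(buckets.items())
    }


if __name__ == "__main__":
    sample = [
        {"Title": "Snail Farming"},
        {"Title": "Butterfly Farming"},
        {"Title": "bee keeping"},
        {"Author": "Unknown"},
    ]
    for letter, titles in alphabetic_selector(sample).items():
        print(letter, titles)
```

The same grouping idea extends to dates or hierarchies by changing the key that each document is filed under.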
25Search Terms
- Search Terms in Greenstone
- Alphabetic characters
- Digits
- Separated by white space
- Punctuation acts as white space
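The term rules above can be sketched in a few lines. This is an illustrative tokenizer, not Greenstone's own; the name `search_terms` is an assumption, and case-folding is deliberately left out because it is a separate preference.

```python
import re


def search_terms(query: str) -> list[str]:
    """Split a query into terms made of letters and digits (hypothetical helper).

    Every other character, including punctuation and the underscore,
    is treated as a separator, just like white space.
    """
    # [^\W_] matches one letter or digit; + collects a whole run of them.
    return re.findall(r"[^\W_]+", query)


if __name__ == "__main__":
    print(search_terms("digital-library, 2nd ed."))
```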
26Two Types of Queries
- Query for ALL of the words
- Boolean AND
- Query for SOME of the words
- Ranked
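The difference between the two query types can be shown with a small sketch. The ranking here simply counts how many distinct query terms a document contains, which is a stand-in chosen for brevity; it is not the ranking function Greenstone uses. The function names and the `docs` layout (document name mapped to a set of words) are assumptions.

```python
def query_all(terms, docs):
    """Boolean AND: return documents containing every term, sorted by name."""
    return sorted(name for name, words in docs.items()
                  if all(term in words for term in terms))


def query_some(terms, docs):
    """Ranked: return documents containing at least one term, best match first.

    The score is the number of distinct query terms present (a simplification).
    """
    wanted = set(terms)
    scores = {name: len(wanted & words) for name, words in docs.items()}
    hits = [name for name, score in scores.items() if score > 0]
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(hits, key=lambda name: (-scores[name], name))


if __name__ == "__main__":
    docs = {
        "a": {"snail", "farming"},
        "b": {"snail"},
        "c": {"butterfly"},
    }
    print(query_all(["snail", "farming"], docs))
    print(query_some(["snail", "farming"], docs))
```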
27Indexes to Search
- In most collections, you can choose different indexes to search
- Examples
- Author and title indexes
- Chapter and paragraph indexes
- Usually the full matching document is returned
regardless of index searched
28Preferences Page
- Preferences Page
- Allows advanced control over search operation
- Case-folding and stemming
- Advanced query mode where users specify Boolean operators
- Large-query interface
- Display search history
29Preferences Page
- Preferences Page
- Specify subcollections to be included in searches
- Specify presentation language
- Customize interface
- Textual vs. standard interface
- Suppress navigation bar
- Suppress alert system
30Using the Collector
31The Greenstone Collector
- Easiest way to build a simple collection
- The Collector allows you to
- Create a new collection
- Modify or add to an existing collection
- Delete a collection
32Starting the Collector
- Click the Collector link from the default Greenstone home page
- Log in
- When Greenstone is installed, an account called admin is set up with a password chosen during installation
- The Collector works through a standard web interface
33Creating a New Collection
- The Collector's main purpose is to build a new collection
- Structure of a collection is determined when the collection is set up
- Simplest to copy the structure of an existing collection and then edit
34Collection Building Steps
- Collection Information
- Source Data
- Configuration
- Building
- Viewing
361. Collection Information
- Give the collection a name and provide associated information
- Title
- Short phrase used to identify the collection within the digital library
- Contact e-mail address
- Brief description
- Sets out the principles that govern what is included in the collection
382. Source Data
- Specify the location of the sources
- Clone existing collection
- Specify on a pull-down menu the existing collection
- Create a completely new collection
392. Source Data
- In the provided boxes, indicate where Source Documents are located
- Specification of sources
- file://
- http://
- ftp://
40file://
- File name on the Greenstone server system
- That file will be included in collection
- Directory name on the Greenstone server
- Everything in the folder and its subfolders will
be included
41http://
- Web page
- The web page will be downloaded
- All pages it links to (and all pages they link to) that reside on the same site, below the URL, will also be downloaded
- URL that leads to a list of files
- Everything in the folder and its subfolders will be included in collection
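The "same site, below the URL" rule can be read as a simple scope test on each link. The sketch below is one plausible interpretation, not the Collector's actual mirroring logic: the function name `in_scope` is hypothetical, and "below the URL" is assumed to mean "under the directory that contains the starting page".

```python
from urllib.parse import urlsplit


def in_scope(start_url: str, link: str) -> bool:
    """Return True if link is on the same site and below start_url (assumed rule)."""
    start, target = urlsplit(start_url), urlsplit(link)
    # Same site: host names must match (case-insensitively).
    if target.netloc.lower() != start.netloc.lower():
        return False
    # Below the URL: the target path must sit under the starting directory.
    base_dir = start.path.rsplit("/", 1)[0] + "/"
    return target.path.startswith(base_dir)


if __name__ == "__main__":
    start = "http://example.org/docs/index.html"
    for link in (
        "http://example.org/docs/guide/intro.html",
        "http://example.org/other/page.html",
        "http://elsewhere.example.com/docs/index.html",
    ):
        print(link, in_scope(start, link))
```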
42ftp://
- File to be downloaded using FTP
- Directory name on the FTP server
- Downloads everything in the folder and its
subfolders
443. Configuration
- This step can be bypassed
- Allows adjustment of configuration options
- The construction and presentation of all
collections are controlled by specifications in a
special collection configuration file
464. Building
- The computer does the work of the building process
- Indexes are built
- For browsing
- For searching
- Following specifications in the collection configuration file
- Status line shows progress
- Warnings shown if files can't be found
485. Viewing
- View the collection that has just been created
- E-mail can be sent to the collection's contact address
- Must enable by editing main.cfg configuration file
49Working with Existing Collections
- Add more material and rebuild the collection
- Edit the configuration file to modify the collection's structure
- Delete the collection
- Put the collection on CD-ROM
50Adding Material to a Collection
- Do not re-specify files that are already in the collection
- Files would be included twice
- If the building process fails, the old version remains unchanged
- Structure of collection can be changed
- Edit the configuration file
- May add plug-ins or an option to a plug-in
51Plug-ins Document Formats
- Plug-ins are specified in the collection configuration file
- File name determines document format
- Widely used document formats
- TEXTPlug, HTMLPlug, WORDPlug, PDFPlug, PSPlug, EMAILPlug, ZIPPlug
52Text Files
- TEXTPlug Plug-In
- .txt
- .text
- Plain text file
- Title metadata based on the first line of the file
53HTML Files
- HTMLPlug Plug-In
- .htm
- .html
- .shtml
- .shm
- .asp
- .php
- .cgi
54HTML Files
- HTMLPlug Plug-In
- Imports HTML files
- Title metadata extracted from the HTML <title> tag
- Other HTML <meta> tag data can be extracted
- Parses and processes any links in the file
- Links to other files in the collection are trapped and replaced by references to the document
55HTML Files
- file_is_url
- Optional switch within the HTML plug-in
- Causes URL metadata to be inserted into each
document, based on the file-name convention that
is adopted by the mirroring package. The
collection uses this metadata to allow readers to
refer to the original source material rather than
a local copy
56Microsoft Word Files
- WORDPlug Plug-In
- .doc
- Imports Microsoft Word documents
- Greenstone uses independent programs to convert Word files to HTML
- Many variants on the Word format
- Older Word formats use a simple text string extraction
57PDF Files
- PDFPlug Plug-In
- .pdf
- Imports PDF Files
- Adobe's Portable Document Format
- Greenstone uses independent programs to convert
PDF files to HTML
58PostScript Files
- PSPlug Plug-In
- .ps
- Imports PostScript Files
- Works best when a standard conversion program is already installed on the computer
- Uses simple text extraction algorithm if no conversion program is present
59Email Files
- EMAILPlug
- .email
- Imports files containing email
- Each source is checked for e-mail contents
- Extracts metadata
- Subject
- To
- From
- Date
- Deals with common formats
- Netscape, Eudora, Unix mail readers
60Compressed Archived Files
- ZIPPlug Plug-In
- .zip
- .tar
- .gz
- .z
- .tgz
- .bz
- Relies on standard utility programs being present
61Building Collections Manually
62Building a Collection
- Building a Collection
- The process of taking a set of documents and
metadata information and creating all the indexes
and data structures that support the searching,
browsing, and viewing operations that the
collection offers
63Building a Collection
- Four Phases in Building a Collection
- Make
- Make a skeleton framework structure to contain the collection
- Import
- Import the documents and metadata, convert to a Greenstone standard form
- Build
- Build the required indexes and data structures
- Install
- Make the collection operational
64Building Collections Manually
- Getting Started
- Making a framework for the collection
- Importing the documents
- Building the indexes
- Installing the collection
65Getting Started
- Locate the command prompt
- Go to the directory where Greenstone was installed
- cd C:\Program Files\gsdl
- Tell system where to find Greenstone files
- setup.bat
- Sets the variable GSDLHOME to the Greenstone home directory
- To return later
- cd %GSDLHOME%
67Make a framework for the collection
- Use the Perl program mkcol.pl to make a collection
- Get description of usage and arguments
- perl -S mkcol.pl
- mkcol.pl
- May leave off first part if system recognizes that .pl files are associated with Perl
68Make a framework for the collection
- perl -S mkcol.pl -creator emailAddress collectionName
69Make a framework for the collection
- Examine the file structure
- cd %GSDLHOME%\collect\collectionName
- List directory contents
- dir
- Seven subdirectories are created
- images, import, index, perllib, archives, building, etc (contains collect.cfg file)
70Make a framework for the collection
- collect.cfg File
- emailAddress placed in the creator and maintainer lines
- collectionName placed in collection-meta lines
- Plug-ins are inserted
72Importing the documents
- The collection's import directory should contain the source material
- Drag the directory containing the source material into the import directory
- You may drag several source directories and hierarchies
73Importing the documents
- The import process
- Brings documents into the Greenstone system
- Standardizes document format (the way that metadata is specified)
- Standardizes the file structure (that contains the documents)
74Importing the documents
- To get a list of options for the import program
- perl -S import.pl
- The basic import command is
- perl -S import.pl collectionName
75Importing the documents
- You may be in any directory when the import command is issued
- The software works by knowing the collection's name and the Greenstone home directory
- Warnings may appear
- When files are found without corresponding plug-ins
- These files will be ignored
77Building the indexes
- Use the program buildcol.pl
78Building the indexes
- Modify collect.cfg file to customize the collection's appearance
- collectionname
- Web browsers receive this name as the title of the collection's front page
- collectionextra
- Description of the collection
- Appears under "About this collection" on the collection's home page
- Enter as a single line in the editor
79Building the indexes
- Modify collect.cfg file to customize the collection's appearance
- iconcollection
- Give the collection an icon image
- Put the location of the image between quotes
- If absent, the collection's name will be used
- Use _httpprefix_ as a shorthand way of beginning any URL that points within the Greenstone file area
- Example: _httpprefix_/collect/collectionName/images/icon.gif
80Building the indexes
- To get a list of options for the build program
- perl -S buildcol.pl
- The basic build command is
- perl -S buildcol.pl collectionName
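For scripting the manual steps covered so far, the command lines can be assembled programmatically. The sketch below only constructs the argument lists for the three scripts named in the slides; it does not run them. The helper name `manual_build_commands` is hypothetical, and the `-S` and `-creator` spellings follow common Perl and Greenstone usage, so check them against each script's own usage message.

```python
def manual_build_commands(collection: str, creator: str) -> list[list[str]]:
    """Return the command lines for making, importing and building a collection.

    Assumes setup.bat (or its equivalent) has already been run so that Perl
    can find the Greenstone scripts on the search path.
    """
    return [
        # 1. Make the skeleton framework for the collection.
        ["perl", "-S", "mkcol.pl", "-creator", creator, collection],
        # (Copy the source documents into the import directory before step 2.)
        # 2. Import the documents into the archives directory.
        ["perl", "-S", "import.pl", collection],
        # 3. Build the indexes into the building directory.
        ["perl", "-S", "buildcol.pl", collection],
    ]


if __name__ == "__main__":
    for command in manual_build_commands("collectionName", "emailAddress"):
        print(" ".join(command))
```

Each list can be handed to `subprocess.run(command, check=True)` so that a failing step stops the sequence instead of continuing with a half-built collection.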
81Building the indexes
- The building process takes about a minute on small collections and can take much longer for very large collections
- You may ignore most warning messages
- Serious problems will cause the program to terminate
83Installing the collection
- Building is done in the building directory
- Collection must be moved to the index directory before users can see it
- Drag contents of the building directory to the index directory
- If index already contains files, remove them first
- Forgetting to move the contents of building to index is a common mistake
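Because the move from building to index is easy to forget, it can be worth scripting. The following is a minimal sketch of that step under the directory layout described in these slides; `install_collection` is a hypothetical helper, not a Greenstone script.

```python
import shutil
from pathlib import Path


def install_collection(collection_dir: Path) -> None:
    """Replace the contents of index with the contents of building."""
    building = collection_dir / "building"
    index = collection_dir / "index"
    if not building.is_dir() or not any(building.iterdir()):
        # Refuse to wipe a working index when there is nothing to install.
        raise FileNotFoundError(f"nothing to install in {building}")
    if index.exists():
        shutil.rmtree(index)  # remove old index files first
    index.mkdir()
    # Snapshot the listing so moving entries does not disturb the iteration.
    for item in list(building.iterdir()):
        shutil.move(str(item), str(index / item.name))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        collection = Path(tmp)
        (collection / "building" / "text").mkdir(parents=True)
        (collection / "building" / "text" / "doc.idx").write_text("new")
        install_collection(collection)
        print(sorted(p.name for p in (collection / "index").iterdir()))
```

Checking that building is non-empty before deleting index keeps a failed build from taking the live collection offline.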
84Installing the collection
- To view the newly built collection
- Restart Greenstone
- If using the Local Library version
- Reload Greenstone Home Page
- If using the Web version
85Importing and Building
86General Information
- Two Main Parts to Collection Building
- Importing (import.pl)
- Building (buildcol.pl)
87Files and Directories
88Collection Specific Directories
- GSDLHOME
- collect: all the digital library collections
- collectionName: directory of collection
- import: original source material
- archives: result of import process
- building: temporary, contents manually moved to index
- index: bulk of info served to users
- (import, archives and building can be deleted)
- etc: contains collect.cfg file
- images: icons used for the collection
- perllib: Perl programs specific to collection
89Other Greenstone Directories
- GSDLHOME
- lib: common software for both the collection server and receptionist
- bin: programs used for building process
- script: Perl programs used
- (mkcol.pl, import.pl, buildcol.pl)
- perllib: Perl modules
- plugins: Perl plugins
- classify: Perl classifiers
- cgi-bin: Greenstone runtime system
- (absent in Local Library version)
- src: source code in C
- colservr: the collection server
- recpt: the receptionist
90Other Greenstone Directories
- GSDLHOME
- packages: source code for external software packages used by Greenstone
- (indexing and compression program, database manager program, etc.)
- (each package is stored in a directory of its own with a readme file)
- bin: executables
- mappings: Unicode translation tables
- etc: configuration files for the entire system, initialization and error logs, user authorization database
- images: user interface images and icons
- macros: small code fragments that drive the user interface
- tmp: temporary files
- docs: documentation for the system
91Object Identifiers
- Document's permanent name in the system
- Remain the same when collection rebuilt
- Assigned by the import process
- Stored as an attribute in the document archive file
- Character strings starting with the letters HASH (HASH0109d3850a6de440c4d1ca2)
- Used to name directory where archive file is stored
92Plug-Ins
- Plug-ins do most of the work of the import process
- Operate in the order in which they are listed in the collect.cfg file
- Input file is passed to each plug-in until one is found that can process it
- If there is no plug-in that can process a file, a warning is printed
- Plug-ins determine the traversal of the subdirectory structure in the import directory
- RecPlug: processes directories, recurses through directory structures and passes the name through the plug-in list
- GAPlug: processes Greenstone Archive Format documents (in the archives directory structure)
- ArcPlug: used during building, processes list of document OIDs produced during import (list is stored in archives.inf file)
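The "pass the file down the list until a plug-in accepts it" behaviour can be sketched as follows. This models only the dispatch order described above; real Greenstone plug-ins are Perl modules, and the `ExtensionPlugin` class, its `can_process` method, and the extension-only matching are assumptions made for illustration. The plug-in names and extensions come from the earlier slides.

```python
import sys
from pathlib import Path


class ExtensionPlugin:
    """Hypothetical plug-in that recognizes files purely by extension."""

    def __init__(self, name, extensions):
        self.name = name
        self.extensions = set(extensions)

    def can_process(self, path):
        return Path(path).suffix.lower() in self.extensions


# The order matters: it mirrors the order of plug-ins in the configuration file.
PLUGINS = [
    ExtensionPlugin("TEXTPlug", [".txt", ".text"]),
    ExtensionPlugin("HTMLPlug", [".htm", ".html", ".shtml"]),
    ExtensionPlugin("PDFPlug", [".pdf"]),
]


def choose_plugin(path, plugins=PLUGINS):
    """Return the name of the first plug-in that accepts path, or None."""
    for plugin in plugins:
        if plugin.can_process(path):
            return plugin.name
    # No plug-in claimed the file: warn and let the caller skip it.
    print(f"warning: no plug-in can process {path}", file=sys.stderr)
    return None


if __name__ == "__main__":
    for name in ("notes.txt", "report.PDF", "photo.xyz"):
        print(name, "->", choose_plugin(name))
```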
93The Import Process
94The Import Process
- Brings documents and metadata into the system in a standardized XML form
- Original material placed in import directory
- Import process transforms it to files in the archives directory
- The original material can be deleted
- Collection can be rebuilt from archive files
- New material added to collection by placing it in import directory and re-executing the import process
- The new material finds its way into archives along with existing files
- To keep the source form of collections
- Do not delete the archives
- Source form can be augmented and rebuilt later
95The Build Process
96The Build Process
- Creates the indexes and data structures that make the collection operational
- Indexes for the whole collection are built all at once
- Build process does not work incrementally
- Adding new material to archives requires that entire collection be rebuilt (by issuing buildcol.pl)
- Most collections can be rebuilt overnight
97Options for Import and Build
98Additional Options for Import
99Additional Options for Build
100Options for Import and Build
- To see options for any Greenstone script, type its name at the command prompt
- Options for Import and Build help with debugging (see Table 6.5 on page 310)
- verbosity
- archivedir
- maxdocs
- collectdir
- out
- keepold
- debug
101Greenstone Archive Documents
102Greenstone Archive Format
<!DOCTYPE GreenstoneArchive [
  <!ELEMENT Section (Description,Content,Section*)>
  <!ELEMENT Description (Metadata*)>
  <!ELEMENT Content (#PCDATA)>
  <!ELEMENT Metadata (#PCDATA)>
  <!ATTLIST Metadata name CDATA #REQUIRED>
]>
103Document Metadata
- Metadata: descriptive information about author, title, date and keywords
- Stored with metadata name
- Stored at the beginning of the section
- Example
- <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
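To make the Section, Description, Metadata and Content nesting concrete, the sketch below assembles one section with Python's standard XML library. It illustrates the element structure only and is not how the import process itself writes archive files; the helper name `make_section` and the placeholder body text are assumptions.

```python
import xml.etree.ElementTree as ET


def make_section(metadata: dict[str, str], text: str) -> ET.Element:
    """Build a Section element holding metadata followed by content."""
    section = ET.Element("Section")
    # Metadata is stored at the beginning of the section, inside Description.
    description = ET.SubElement(section, "Description")
    for name, value in metadata.items():
        item = ET.SubElement(description, "Metadata", name=name)
        item.text = value
    content = ET.SubElement(section, "Content")
    content.text = text
    return section


if __name__ == "__main__":
    section = make_section(
        {"Title": "Freshwater Resources in Arid Lands"},
        "(text of section goes here)",
    )
    print(ET.tostring(section, encoding="unicode"))
```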
104Document Metadata
- Dublin Core: a metadata standard
- New metadata types can be invented
- Metadata can be assigned by an automatic process
rather than manually entered
105The Dublin Core
106Collection Configuration File
107Collection Configuration File
108Default Configuration File
109Getting the Most Out of Your Documents
110Basic Plug-In Options
111Document Processing Plug-ins
112Document Processing Plug-ins
113Document Processing Plug-ins
114Assigning Metadata from a File
- XML Document Type Definition (DTD)
- Example XML Metadata File
115Document Type Definition (DTD)
<!DOCTYPE GreenstoneDirectoryMetadata [
  <!ELEMENT DirectoryMetadata (FileSet*)>
  <!ELEMENT FileSet (FileName+,Description)>
  <!ELEMENT FileName (#PCDATA)>
  <!ELEMENT Description (Metadata*)>
  <!ELEMENT Metadata (#PCDATA)>
  <!ATTLIST Metadata name CDATA #REQUIRED>
  <!ATTLIST Metadata mode (accumulate|override) "override">
]>
116Example XML Metadata File
<?xml version="1.0" ?>
<!DOCTYPE GreenstoneDirectoryMetadata SYSTEM "http://greenstone.org/dtd/GreenstoneDirectoryMetadata/1.0/GreenstoneDirectoryMetadata.dtd">
<DirectoryMetadata>
  <FileSet>
    <FileName>nugget.*</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>nugget-point-1.jpg</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Subject">Lighthouse</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>
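One way to read such a metadata file is sketched below: each FileSet whose FileName matches a file contributes its Metadata, with "accumulate" adding a value and "override" (the default) replacing earlier ones. This is an interpretation for illustration, not Greenstone's implementation. Treating FileName as a regular expression, the sample pattern `nugget.*`, and the helper name `metadata_for` are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """
<DirectoryMetadata>
  <FileSet>
    <FileName>nugget.*</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>nugget-point-1.jpg</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Subject">Lighthouse</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>
"""


def metadata_for(filename: str, metadata_xml: str) -> dict[str, list[str]]:
    """Collect the metadata that applies to filename (assumed semantics)."""
    result: dict[str, list[str]] = {}
    for fileset in ET.fromstring(metadata_xml).iter("FileSet"):
        patterns = [node.text or "" for node in fileset.findall("FileName")]
        # Assumption: each FileName is a regular expression for the whole name.
        if not any(re.fullmatch(pattern, filename) for pattern in patterns):
            continue
        for meta in fileset.iter("Metadata"):
            name, value = meta.get("name"), meta.text or ""
            if meta.get("mode", "override") == "accumulate":
                result.setdefault(name, []).append(value)  # keep earlier values
            else:
                result[name] = [value]  # replace earlier values
    return result


if __name__ == "__main__":
    print(metadata_for("nugget-point-1.jpg", SAMPLE))
    print(metadata_for("unrelated.jpg", SAMPLE))
```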
117Tagging Document Files
<!--
<Section>
  <Description>
    <Metadata name="Title">Realizing human rights for poor people: Strategies for achieving the international development targets</Metadata>
  </Description>
-->
(text of section goes here)
<!--
</Section>
-->
118Classifiers
119Format Statements
120Format Statements
121Examples of Format Strings