Building collections with Greenstone - PowerPoint PPT Presentation

Original file title: Chapter One. Last modified by: John Leggett. Created: 2/7/2003, 5:52:10 AM. Presentation format: On-screen Show.

Provided by: csdlTamu7
1
Building collections with Greenstone
  • How to Build a Digital Library
  • Ian H. Witten and David Bainbridge

2
Digital Library Collections
  • There is a distinction between
  • BUILDING collections
  • DELIVERING information to users
  • Similar to compile-time versus runtime
    distinction in computer programming
  • Information structures should usually be prepared
    in advance

3
Building a Collection
  • The Collector
  • A subsystem that takes you step by step through
    building a simple collection
  • Conceals details behind the scenes
  • First locate information on your computer or the
    Web
  • Plain text, HTML, Word, PDF, email file, etc.

4
Plug-ins
  • Plug-ins are software modules that handle
  • Format conversion
  • Metadata extraction
  • Plug-ins promote extensibility

5
Greenstone Archive Format
  • Greenstone Archive Format
  • XML-based file format
  • File format for
  • Documents
  • Metadata

6
Collection Configuration File
  • Collection Configuration File
  • Defines the structure of a collection
  • Governs how the collection is built
  • Specifies how the collection will appear to users

7
Greenstone Extended Capabilities
  • Extending the Capabilities of Greenstone
  • Plug-ins
  • Handle different document and metadata formats
  • Classifiers
  • Handle different kinds of browsing structures
  • Format statements and Macros
  • Govern the user interface content and appearance

8
Why Greenstone?
9
Benefits of Greenstone
  • General system for constructing and presenting
    digital collections
  • Handles millions of documents, text, images,
    audio, video
  • User interfaces identical in Web-based and CD-ROM
    versions
  • Installs on Windows and Linux
  • Access locally or remotely using web browser

10
Organization of Collections
  • Each collection can be organized differently
  • Format of source documents
  • Metadata
  • Directory structure
  • Document structure
  • Searching and browsing services
  • Presentation
  • Auxiliary services

11
Variation of Source Format
  • Source documents can be supplied in
  • Plain text
  • HTML
  • PostScript
  • PDF
  • Word
  • E-mail
  • Other file types
  • Images
  • Video
  • Audio

12
Variation of Metadata
  • Different types of metadata
  • Metadata can be supplied differently
  • Fields in MS Word
  • <meta> tags in HTML
  • Information coded into filename and directories
  • Spreadsheet or other data file
  • Explicit metadata format like MARC

13
Variation of Directory Structure
  • Collections can vary in the directory structure
    in which the information is located

14
Variation of Document Structure
  • Document structure
  • Flat
  • Divided sequentially into pages
  • Hierarchical organization
  • Title or other metadata available at each level

15
Variation of Services
  • Searching
  • Metadata
  • Indexes
  • Hierarchical levels
  • Browsing
  • Metadata
  • Browser type

16
Variation of Presentation
  • Results can be presented to users in various
    ways
  • Format that target documents are shown in
  • Search results page
  • Metadata browsers
  • Interface language

17
Variation of Auxiliary Services
  • A collection may require additional services
  • User logging
  • Etc.

18
Collection Configuration File Allows Variation
  • A digital library collection is made by
  • Gathering raw material
  • Designing the collection
  • Putting design information about the structure
    and presentation of the collection in the
    Collection Configuration File

19
Front Page of Collection
  • Statement of the collection's purpose
  • Statement of the collection's coverage
  • Explanation of how collection is organized

20
Searching Involves Indexes
  • Searching is provided by indexes built from
    different parts of the documents
  • Entire documents
  • Paragraphs
  • Titles
  • Sections
  • Section headings
  • Figure captions

21
Indexes
  • Indexes can be created automatically using
  • Documents
  • Supporting files
  • Indexes can be rebuilt automatically
  • New document in the same format becomes available
  • Process can awake, check for new material, and
    rebuild the indexes

22
Plug-ins for Indexing
  • Source documents are converted into standard XML
    form for indexing using plug-ins
  • Standard plug-ins process
  • Plain text
  • HTML
  • Word
  • PDF
  • Usenet and email messages
  • New plug-ins can be written for other document
    types

23
Browsing Involves Lists
  • Browsing involves lists that can be examined by
    the user
  • Authors
  • Titles
  • Dates
  • Hierarchical classification structures

24
Classifier Modules
  • Modules called classifiers are used to create
    browsers and build browsing structures from
    metadata
  • Scrollable lists
  • Alphabetic selectors
  • Dates
  • Hierarchies
  • Programmers can write new classifiers to create
    novel browsing capabilities
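A minimal sketch of what an alphabetic-selector classifier does, written in Python rather than as a Greenstone classifier module. The function name `alphabetic_selector` and the sample records are hypothetical; the sketch only groups documents by the first letter of one metadata field.

```python
from collections import defaultdict


def alphabetic_selector(docs, field="Title"):
    """Group metadata values by their first letter, as an A-Z browser would."""
    buckets = defaultdict(list)
    for doc in docs:
        value = doc.get(field)
        if not value:
            continue  # documents without this metadata are left out of the browser
        buckets[value[0].upper()].append(value)
    # Sort the letters and the entries under each letter
    return {letter: sorted(values) for letter, values in sorted(buckets.items())}


# Hypothetical sample records
sample_docs = [
    {"Title": "Arid Lands"},
    {"Title": "water supply"},
    {"Title": "Agriculture"},
    {"Author": "No title here"},
]
browser = alphabetic_selector(sample_docs)
print(browser)
```

Passing a different `field` (for example "Author") builds a second browser from the same documents, which is the sense in which browsing structures are built from metadata.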

25
Search Terms
  • Search Terms in Greenstone
  • Alphabetic characters
  • Digits
  • Separated by white space
  • Punctuation acts as white space
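The term rules on this slide can be sketched in a few lines of Python. This is an illustration only, not Greenstone's own tokenizer; the function name `split_terms` is hypothetical, and treating the underscore as a separator is an assumption.

```python
import re


def split_terms(query: str) -> list[str]:
    """Split a query into terms made of letters and digits.

    White space and punctuation both act as separators.
    """
    # [^\W_] matches one letter or digit (a word character that is not "_")
    return re.findall(r"[^\W_]+", query)


terms = split_terms("digital-library, 2003!")
print(terms)
```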

26
Two Types of Queries
  • Query for ALL of the words
  • Boolean AND
  • Query for SOME of the words
  • Ranked
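A rough sketch of the difference between the two query types, using a tiny in-memory collection. The documents, the function names, and the scoring rule (a plain count of matching term occurrences) are assumptions for illustration; a real search engine would use an index and a more refined ranking formula.

```python
from collections import Counter

# Hypothetical three-document collection
DOCS = {
    "d1": "water resources in arid lands",
    "d2": "water supply and sanitation",
    "d3": "arid land agriculture",
}


def match_all(terms, docs):
    """ALL of the words (Boolean AND): documents containing every term."""
    return sorted(d for d, text in docs.items()
                  if all(t in text.split() for t in terms))


def match_some(terms, docs):
    """SOME of the words (ranked): best-matching documents first."""
    scores = {}
    for d, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in terms)
        if score:
            scores[d] = score
    # Highest score first; ties broken by document id for a stable order
    return sorted(scores, key=lambda d: (-scores[d], d))


all_hits = match_all(["water", "arid"], DOCS)
some_hits = match_some(["water", "arid"], DOCS)
print(all_hits, some_hits)
```

The AND query returns only the document that contains both words, while the ranked query returns every document that contains at least one of them, ordered by score.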

27
Indexes to Search
  • In most collections, you can choose different
    indexes to search
  • Examples
  • Author and title indexes
  • Chapter and paragraph indexes
  • Usually the full matching document is returned
    regardless of index searched

28
Preferences Page
  • Preferences Page
  • Allows advanced control over search operation
  • Case-folding and stemming
  • Advanced query mode where users specify Boolean
    operators
  • Large-query interface
  • Display search history

29
Preferences Page
  • Preferences Page
  • Specify subcollections to be included in searches
  • Specify presentation language
  • Customize interface
  • Textual vs. standard interface
  • Suppress navigation bar
  • Suppress alert system

30
Using the Collector
31
The Greenstone Collector
  • Easiest way to build a simple collection
  • The Collector allows you to
  • Create a new collection
  • Modify or add to an existing collection
  • Delete a collection

32
Starting the Collector
  • Click the Collector link from the default
    Greenstone home page
  • Log in
  • When Greenstone is installed, an account called
    admin is set up with a password chosen during
    installation
  • The Collector works through a standard web
    interface

33
Creating a New Collection
  • The Collector's main purpose is to build a new
    collection
  • Structure of a collection is determined when the
    collection is set up
  • Simplest to copy the structure of an existing
    collection and then edit

34
Collection Building Steps
  1. Collection Information
  2. Source Data
  3. Configuration
  4. Building
  5. Viewing

35
Collection Building Steps
  • Collection Information
  • Source Data
  • Configuration
  • Building
  • Viewing

36
1. Collection Information
  • Give the collection a name and provide associated
    information
  • Title
  • Short phrase used to identify the collection
    within the digital library
  • Contact e-mail address
  • Brief description
  • Sets out the principles that govern what is
    included in the collection

37
Collection Building Steps
  • Collection Information
  • Source Data
  • Configuration
  • Building
  • Viewing

38
2. Source Data
  • Specify the location of the sources
  • Clone existing collection
  • Specify on a pull-down menu the existing
    collection
  • Create a completely new collection

39
2. Source Data
  • In the provided boxes, indicate where Source
    Documents are located
  • Specification of sources
  • file://
  • http://
  • ftp://

40
file://
  • File name on the Greenstone server system
  • That file will be included in collection
  • Directory name on the Greenstone server
  • Everything in the folder and its subfolders will
    be included

41
http://
  • Web page
  • The web page will be downloaded
  • All pages it links to (and all pages they link
    to) that reside on the same site, below the URL,
    will also be downloaded
  • URL that leads to a list of files
  • Everything in the folder and its subfolders will
    be included in collection

42
ftp://
  • File to be downloaded using FTP
  • Directory name on the FTP server
  • Downloads everything in the folder and its
    subfolders
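As a sketch of how the three kinds of source specification differ, the following Python function reads the scheme of a source string and reports what the slides say happens to it. The function name `describe_source` and the example paths and hosts are hypothetical; this is not the Collector's code.

```python
from urllib.parse import urlparse

# What each scheme means, paraphrased from the slides
ACTIONS = {
    "file": "include the file, or the folder and all its subfolders, from the server",
    "http": "download the page and the same-site pages below its URL",
    "ftp": "download the file, or the folder and all its subfolders, using FTP",
}


def describe_source(spec: str) -> str:
    """Return the action taken for a file://, http:// or ftp:// source."""
    scheme = urlparse(spec).scheme
    if scheme not in ACTIONS:
        raise ValueError(f"unsupported source specification: {spec!r}")
    return ACTIONS[scheme]


for source in ("file:///home/gsdl/docs", "http://www.example.org/reports/",
               "ftp://ftp.example.org/pub/docs/"):
    print(source, "->", describe_source(source))
```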

43
Collection Building Steps
  • Collection Information
  • Source Data
  • Configuration
  • Building
  • Viewing

44
3. Configuration
  • This step can be bypassed
  • Allows adjustment of configuration options
  • The construction and presentation of all
    collections are controlled by specifications in a
    special collection configuration file

45
Collection Building Steps
  • Collection Information
  • Source Data
  • Configuration
  • Building
  • Viewing

46
4. Building
  • The computer does the work of the building
    process
  • Indexes are built
  • For browsing
  • For searching
  • Following specifications in the collection
    configuration file
  • Status line shows progress
  • Warnings shown if files can't be found

47
Collection Building Steps
  • Collection Information
  • Source Data
  • Configuration
  • Building
  • Viewing

48
5. Viewing
  • View the collection that has just been created
  • E-mail can be sent to the collection's contact
    address
  • Must enable by editing main.cfg configuration file

49
Working with Existing Collections
  • Add more material and rebuild the collection
  • Edit the configuration file to modify the
    collection's structure
  • Delete the collection
  • Put the collection on CD-ROM

50
Adding Material to a Collection
  • Do not re-specify files that are already in the
    collection
  • Files would be included twice
  • If the building process fails, the old version
    remains unchanged
  • Structure of collection can be changed
  • Edit the configuration file
  • May add plug-ins or an option to a plug-in

51
Plug-ins: Document Formats
  • Plug-ins are specified in the collection
    configuration file
  • File name determines document format
  • Widely used document formats

  • TEXTPlug
  • HTMLPlug
  • WORDPlug
  • PDFPlug
  • PSPlug
  • EMAILPlug
  • ZIPPlug

52
Text Files
  • TEXTPlug Plug-In
  • .txt
  • .text
  • Plain text file
  • Title metadata based on the first line of the file

53
HTML Files
  • HTMLPlug Plug-In
  • .htm
  • .html
  • .shtml
  • .shm
  • .asp
  • .php
  • .cgi

54
HTML Files
  • HTMLPlug Plug-In
  • Imports HTML files
  • Title metadata extracted from the HTML <title>
    tag
  • Other HTML <meta> tag data can be extracted
  • Parses and processes any links in the file
  • Links to other files in the collection are
    trapped and replaced by references to the document
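The title and meta-tag extraction described here can be sketched with Python's standard HTML parser. This illustrates the idea only and is not the HTML plug-in's code; the class name, function name, and sample author value are hypothetical.

```python
from html.parser import HTMLParser


class TitleMetaExtractor(HTMLParser):
    """Collect the <title> text and any <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text may arrive in several chunks, so append
            self.metadata["Title"] = self.metadata.get("Title", "") + data


def extract_metadata(html: str) -> dict:
    parser = TitleMetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.metadata


page = ('<html><head><title>Freshwater Resources in Arid Lands</title>'
        '<meta name="Author" content="A. Writer"></head><body>Text</body></html>')
extracted = extract_metadata(page)
print(extracted)
```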

55
HTML Files
  • file_is_url
  • Optional switch within the HTML plug-in
  • Causes URL metadata to be inserted into each
    document, based on the file-name convention that
    is adopted by the mirroring package. The
    collection uses this metadata to allow readers to
    refer to the original source material rather than
    a local copy

56
Microsoft Word Files
  • WORDPlug Plug-In
  • .doc
  • Imports Microsoft Word documents
  • Greenstone uses independent programs to convert
    Word files to HTML
  • Many variants on the Word format
  • Older Word formats use a simple text string
    extraction

57
PDF Files
  • PDFPlug Plug-In
  • .pdf
  • Imports PDF Files
  • Adobe's Portable Document Format
  • Greenstone uses independent programs to convert
    PDF files to HTML

58
PostScript Files
  • PSPlug Plug-In
  • .ps
  • Imports PostScript Files
  • Works best when a standard conversion program is
    already installed on the computer
  • Uses simple text extraction algorithm if no
    conversion program is present

59
Email Files
  • EMAILPlug
  • .email
  • Imports files containing email
  • Each source is checked for e-mail contents
  • Extracts metadata
  • Subject
  • To
  • From
  • Date
  • Deals with common formats
  • Netscape, Eudora, Unix mail readers

60
Compressed Archived Files
  • ZIPPlug Plug-In
  • .zip
  • .tar
  • .gz
  • .z
  • .tgz
  • .bz
  • Relies on standard utility programs being present

61
Building Collections Manually
62
Building a Collection
  • Building a Collection
  • The process of taking a set of documents and
    metadata information and creating all the indexes
    and data structures that support the searching,
    browsing, and viewing operations that the
    collection offers

63
Building a Collection
  • Four Phases in Building a Collection
  • Make
  • Make a skeleton framework structure to contain
    the collection
  • Import
  • Import the documents and metadata, convert to a
    Greenstone standard form
  • Build
  • Build the required indexes and data structures
  • Install
  • Make the collection operational
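The first three phases correspond to the three Perl scripts named later in these slides (mkcol.pl, import.pl, buildcol.pl). As a sketch, the Python function below only assembles the command lines and does not run them. The collection name and e-mail address are hypothetical, and the `-S` and `-creator` switches are an assumed reading of the commands shown on the later slides.

```python
def manual_build_commands(collection: str, creator_email: str) -> list[list[str]]:
    """Return the command lines for the make, import and build phases.

    The install phase is not a script: it moves the contents of the
    collection's building directory into its index directory.
    """
    return [
        ["perl", "-S", "mkcol.pl", "-creator", creator_email, collection],  # make
        ["perl", "-S", "import.pl", collection],                            # import
        ["perl", "-S", "buildcol.pl", collection],                          # build
    ]


commands = manual_build_commands("demo", "librarian@example.org")
for command in commands:
    print(" ".join(command))
```

Keeping each command as a list of arguments means it could later be handed to `subprocess.run` without any shell quoting.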

64
Building Collections Manually
  • Getting Started
  • Making a framework for the collection
  • Importing the documents
  • Building the indexes
  • Installing the collection

65
Getting Started
  • Locate the command prompt
  • Go to the directory where Greenstone was
    installed
  • cd C:\Program Files\gsdl
  • Tell system where to find Greenstone files
  • setup.bat
  • Sets the variable GSDLHOME to the Greenstone home
    directory
  • To return later
  • cd "%GSDLHOME%"

66
Building Collections Manually
  • Getting Started
  • Making a framework for the collection
  • Importing the documents
  • Building the indexes
  • Installing the collection

67
Make a framework for the collection
  • Use the Perl program mkcol.pl to make a
    collection
  • Get description of usage and arguments
  • perl -S mkcol.pl
  • mkcol.pl
  • May leave off first part if system recognizes
    that .pl files are associated with Perl

68
Make a framework for the collection
  • perl -S mkcol.pl -creator emailAddress
    collectionName

69
Make a framework for the collection
  • Examine the file structure: cd "%GSDLHOME%\collect\collectionName"
  • List directory contents: dir
  • Seven subdirectories are created
  • archives
  • building
  • etc (contains collect.cfg file)
  • images
  • import
  • index
  • perllib
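To make the resulting layout concrete, here is a Python sketch that creates the same seven subdirectories and an empty collect.cfg in a temporary folder. It mimics only the directory structure and is not how mkcol.pl is implemented; the function name and the collection name "demo" are hypothetical.

```python
import tempfile
from pathlib import Path

SUBDIRS = ["archives", "building", "etc", "images", "import", "index", "perllib"]


def make_skeleton(gsdlhome: Path, collection: str) -> Path:
    """Create collect/<collection>/ with the seven standard subdirectories."""
    root = gsdlhome / "collect" / collection
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    # The real file is filled in by mkcol.pl; here it is just a placeholder
    (root / "etc" / "collect.cfg").touch()
    return root


with tempfile.TemporaryDirectory() as tmp:
    skeleton = make_skeleton(Path(tmp), "demo")
    created = sorted(p.name for p in skeleton.iterdir())
    has_config = (skeleton / "etc" / "collect.cfg").is_file()

print(created, has_config)
```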
70
Make a framework for the collection
  • collect.cfg File
  • emailAddress placed in the creator and maintainer
    lines
  • collectionName placed in collection-meta lines
  • Plug-ins are inserted

71
Building Collections Manually
  • Getting Started
  • Making a framework for the collection
  • Importing the documents
  • Building the indexes
  • Installing the collection

72
Importing the documents
  • The collection's import directory should contain
    the source material
  • Drag the directory containing the source material
    into the import directory
  • You may drag several source directories and
    hierarchies

73
Importing the documents
  • The import process
  • Brings documents into the Greenstone system
  • Standardizes document format (the way that
    metadata is specified)
  • Standardizes the file structure (that contains
    the documents)

74
Importing the documents
  • To get a list of options for the import program
  • perl -S import.pl
  • The basic import command is
  • perl -S import.pl collectionName

75
Importing the documents
  • You may be in any directory when the import
    command is issued
  • The software works by knowing the collection's
    name and the Greenstone home directory
  • Warnings may appear
  • When files are found without corresponding
    plug-ins
  • These files will be ignored

76
Building Collections Manually
  • Getting Started
  • Making a framework for the collection
  • Importing the documents
  • Building the indexes
  • Installing the collection

77
Building the indexes
  • Use the program buildcol.pl

78
Building the indexes
  • Modify the collect.cfg file to customize the
    collection's appearance
  • collectionname
  • Web browsers receive this name as the title of
    the collection's front page
  • collectionextra
  • Description of the collection
  • Appears under "About this collection" on the
    collection's home page
  • Enter as a single line in the editor

79
Building the indexes
  • Modify the collect.cfg file to customize the
    collection's appearance
  • iconcollection
  • Give the collection an icon image
  • Put the location of the image between quotes
  • If absent, the collection's name will be used
  • Use _httpprefix_ as a shorthand way of beginning
    any URL that points within the Greenstone file
    area
  • Example: _httpprefix_/collect/collectionName/images/icon.gif

80
Building the indexes
  • To get a list of options for the build program
  • perl -S buildcol.pl
  • The basic build command is
  • perl -S buildcol.pl collectionName

81
Building the indexes
  • The building process takes about a minute on
    small collections and can take much longer for
    very large collections
  • You may ignore most warning messages
  • Serious problems will cause the program to
    terminate

82
Building Collections Manually
  • Getting Started
  • Making a framework for the collection
  • Importing the documents
  • Building the indexes
  • Installing the collection

83
Installing the collection
  • Building is done in the building directory
  • Collection must be moved to the index directory
    before users can see it
  • Drag contents of the building directory to the
    index directory
  • If index already contains files, remove them
    first
  • Forgetting to move the contents of building to
    index is a common mistake
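The install step can also be scripted instead of dragged by hand. The Python sketch below clears the index directory and then moves everything from building into it, following the order given on this slide. The function name and the demo file names are hypothetical, and the sketch assumes the collection directory has the standard building and index subdirectories.

```python
import shutil
import tempfile
from pathlib import Path


def install_collection(collection_dir: Path) -> None:
    """Replace the contents of index/ with the contents of building/."""
    building = collection_dir / "building"
    index = collection_dir / "index"
    items = list(building.iterdir())  # raises if building/ is missing
    if not items:
        raise RuntimeError("building directory is empty; nothing to install")
    if index.exists():
        shutil.rmtree(index)          # remove the previously installed files first
    index.mkdir()
    for item in items:
        shutil.move(str(item), str(index / item.name))


# Demonstration on a throwaway collection directory
with tempfile.TemporaryDirectory() as tmp:
    collection = Path(tmp) / "demo"
    (collection / "building" / "text").mkdir(parents=True)
    (collection / "building" / "text" / "new.idx").write_text("new index data")
    (collection / "index").mkdir()
    (collection / "index" / "old.idx").write_text("old index data")

    install_collection(collection)
    installed = sorted(p.name for p in (collection / "index").iterdir())
    left_in_building = list((collection / "building").iterdir())

print(installed, left_in_building)
```

Checking that building is not empty before deleting anything keeps the old installed version intact if the build produced no output.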

84
Installing the collection
  • To view the newly built collection
  • Restart Greenstone
  • If using the Local Library version
  • Reload Greenstone Home Page
  • If using the Web version

85
Importing and Building
86
General Information
  • Two Main Parts to Collection Building
  • Importing (import.pl)
  • Building (buildcol.pl)

87
Files and Directories
88
Collection Specific Directories
  • GSDLHOME
    • collect: all the digital library collections
      • collectionName: directory of the collection
        • import: original source material
        • archives: result of the import process
        • building: temporary; contents manually moved to
          index
        • index: bulk of the information served to users
          (import, archives and building can be deleted)
        • etc: contains the collect.cfg file
        • images: icons used for the collection
        • perllib: Perl programs specific to the collection

89
Other Greenstone Directories
  • GSDLHOME
  • lib: common software for both the collection
    server and receptionist
  • bin: programs used for the building process
  • script: Perl programs used
    (mkcol.pl, import.pl, buildcol.pl)
  • perllib: Perl modules
  • plugins: Perl plug-ins
  • classify: Perl classifiers
  • cgi-bin: Greenstone runtime system
    (absent in Local Library version)
  • src: source code in C
  • colservr: the collection server
  • recpt: the receptionist

90
Other Greenstone Directories
  • GSDLHOME
  • packages: source code for external software
    packages used by Greenstone
    (indexing and compression program, database
    manager program, etc.)
    (each package is stored in a directory of its
    own with a readme file)
  • bin: executables
  • mappings: Unicode translation tables
  • etc: configuration files for the entire
    system, initialization and error logs, user
    authorization database
  • images: user interface images and icons
  • macros: small code fragments that drive the
    user interface
  • tmp: temporary files
  • docs: documentation for the system

91
Object Identifiers
  • A document's permanent name in the system
  • Remain the same when collection rebuilt
  • Assigned by the import process
  • Stored as an attribute in the document archive
    file
  • Character strings starting with the letters HASH
    (HASH0109d3850a6de440c4d1ca2)
  • Used to name directory where archive file is
    stored
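One way to get identifiers that stay the same across rebuilds is to derive them from the document's content. The Python sketch below does that with an MD5 digest. This is an assumption made for illustration: the slides do not say which hash function Greenstone uses, and the function name and the 24-character length are arbitrary choices.

```python
import hashlib


def make_oid(content: bytes, length: int = 24) -> str:
    """Return 'HASH' followed by the first `length` hex digits of a digest.

    Assumption: MD5 stands in for whatever hashing Greenstone really uses.
    """
    digest = hashlib.md5(content).hexdigest()
    return "HASH" + digest[:length]


first = make_oid(b"Freshwater Resources in Arid Lands")
again = make_oid(b"Freshwater Resources in Arid Lands")
other = make_oid(b"A different document")
print(first)
```

Because the identifier depends only on the content, re-importing an unchanged document yields the same name, which is the property the slide describes.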

92
Plug-Ins
  • Plug-ins do most of the work of the import
    process
  • Operate in the order in which they are listed in
    the collect.cfg file
  • Input file is passed to each plug-in until one is
    found that can process it
  • If there is no plug-in that can process a file, a
    warning is printed
  • Plug-ins determine the traversal of the
    subdirectory structure in the import directory
  • RecPlug: processes directories, recurses through
    directory structures and passes the name through
    the plug-in list
  • GAPlug: processes Greenstone Archive Format
    documents (in the archives directory structure)
  • ArcPlug: used during building, processes the list of
    document OIDs produced during import (the list is
    stored in the archives.inf file)
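The "pass the file down the list until one plug-in accepts it" behaviour can be sketched as follows. The plug-in names and file extensions are the ones listed on the earlier slides; the function name is hypothetical, and matching on the extension alone is a simplification of what real plug-ins do.

```python
import os

# Order matters: the first plug-in that accepts the file wins
PLUGINS = [
    ("TEXTPlug", (".txt", ".text")),
    ("HTMLPlug", (".htm", ".html", ".shtml", ".shm", ".asp", ".php", ".cgi")),
    ("WORDPlug", (".doc",)),
    ("PDFPlug", (".pdf",)),
    ("PSPlug", (".ps",)),
    ("EMAILPlug", (".email",)),
    ("ZIPPlug", (".zip", ".tar", ".gz", ".z", ".tgz", ".bz")),
]


def choose_plugin(filename, plugins=PLUGINS):
    """Return the name of the first plug-in that can process the file, or None."""
    extension = os.path.splitext(filename)[1].lower()
    for name, extensions in plugins:
        if extension in extensions:
            return name
    print(f"warning: no plug-in can process {filename}")
    return None


for name in ("report.PDF", "index.html", "notes.xyz"):
    print(name, "->", choose_plugin(name))
```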

93
The Import Process
94
The Import Process
  • Brings documents and metadata into the system in
    a standardized XML form
  • Original material placed in import directory
  • Import process transforms it to files in the
    archives directory
  • The original material can be deleted
  • Collection can be rebuilt from archive files
  • New material added to collection by placing it in
    import directory and re-executing the import
    process
  • The new material finds its way into archives along
    with existing files
  • To keep the source form of collections
  • Do not delete the archives
  • Source form can be augmented and rebuilt later

95
The Build Process
96
The Build Process
  • Creates the indexes and data structures that make
    the collection operational
  • Indexes for the whole collection are built all at
    once
  • Build process does not work incrementally
  • Adding new material to archives requires that
    entire collection be rebuilt (by issuing
    buildcol.pl)
  • Most collections can be rebuilt overnight

97
Options for Import and Build
98
Additional Options for Import
99
Additional Options for Build
100
Options for Import and Build
  • To see options for any Greenstone script, type
    its name at the command prompt
  • Options for Import and Build help with debugging
    (see Table 6.5 on page 310)
  • -verbosity
  • -archivedir
  • -maxdocs
  • -collectdir
  • -out
  • -keepold
  • -debug

101
Greenstone Archive Documents
102
Greenstone Archive Format
  <!DOCTYPE GreenstoneArchive [
    <!ELEMENT Section (Description,Content,Section*)>
    <!ELEMENT Description (Metadata*)>
    <!ELEMENT Content (#PCDATA)>
    <!ELEMENT Metadata (#PCDATA)>
    <!ATTLIST Metadata name CDATA #REQUIRED>
  ]>
103
Document Metadata
  • Metadata: descriptive information about author,
    title, date and keywords
  • Stored with metadata name
  • Stored at the beginning of the section
  • Example
  • <Metadata name="Title">Freshwater Resources in
    Arid Lands</Metadata>
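A short Python sketch of how a document in this format could be assembled: each Section holds a Description (with its Metadata elements), a Content element, and any nested Sections. The helper name `make_section` and the sample chapter text are hypothetical, and the sketch produces only the element structure, not a complete archive file.

```python
import xml.etree.ElementTree as ET


def make_section(metadata: dict, text: str, subsections=()):
    """Build a Section element: Description, then Content, then subsections."""
    section = ET.Element("Section")
    description = ET.SubElement(section, "Description")
    for name, value in metadata.items():
        item = ET.SubElement(description, "Metadata", name=name)
        item.text = value
    content = ET.SubElement(section, "Content")
    content.text = text
    section.extend(subsections)  # nested Sections carry their own metadata
    return section


document = make_section(
    {"Title": "Freshwater Resources in Arid Lands"},
    "Introductory text.",
    [make_section({"Title": "Chapter 1"}, "Chapter text.")],
)
print(ET.tostring(document, encoding="unicode"))
```

Because every Section has its own Description, a title or other metadata can be attached at each level of a hierarchical document.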

104
Document Metadata
  • Dublin Core: a metadata standard
  • New metadata types can be invented
  • Metadata can be assigned by an automatic process
    rather than manually entered

105
The Dublin Core
106
Collection Configuration File
107
Collection Configuration File
108
Default Configuration File
109
Getting the Most Out of Your Documents
110
Basic Plug-In Options
111
Document Processing Plug-ins
112
Document Processing Plug-ins
113
Document Processing Plug-ins
114
Assigning Metadata from a File
  • XML Document Type Definition (DTD)
  • Example XML Metadata File

115
Document Type Definition (DTD)
  <!DOCTYPE GreenstoneDirectoryMetadata [
    <!ELEMENT DirectoryMetadata (FileSet*)>
    <!ELEMENT FileSet (FileName+,Description)>
    <!ELEMENT FileName (#PCDATA)>
    <!ELEMENT Description (Metadata*)>
    <!ELEMENT Metadata (#PCDATA)>
    <!ATTLIST Metadata name CDATA #REQUIRED>
    <!ATTLIST Metadata mode (accumulate|override) "override">
  ]>
116
Example XML Metadata File
  <?xml version="1.0" ?>
  <!DOCTYPE GreenstoneDirectoryMetadata SYSTEM
    "http://greenstone.org/dtd/GreenstoneDirectoryMetadata/1.0/GreenstoneDirectoryMetadata.dtd">
  <DirectoryMetadata>
    <FileSet>
      <FileName>nugget.*</FileName>
      <Description>
        <Metadata name="Title">Nugget Point Lighthouse</Metadata>
        <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
      </Description>
    </FileSet>
    <FileSet>
      <FileName>nugget-point-1.jpg</FileName>
      <Description>
        <Metadata name="Title">Nugget Point Lighthouse</Metadata>
        <Metadata name="Subject">Lighthouse</Metadata>
      </Description>
    </FileSet>
  </DirectoryMetadata>
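A sketch of how such a metadata file could be applied to a file name, in Python. It assumes that each FileName is a regular expression matched against the whole file name (so the first pattern is read as nugget.*), that "accumulate" appends a value, and that "override" replaces earlier values. The DOCTYPE line is left out of the embedded sample for brevity, and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

METADATA_XML = """<DirectoryMetadata>
  <FileSet>
    <FileName>nugget.*</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Place" mode="accumulate">Nugget Point</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>nugget-point-1.jpg</FileName>
    <Description>
      <Metadata name="Title">Nugget Point Lighthouse</Metadata>
      <Metadata name="Subject">Lighthouse</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>"""


def metadata_for(filename: str, xml_text: str) -> dict:
    """Collect the metadata of every FileSet whose FileName matches."""
    result = {}
    root = ET.fromstring(xml_text)
    for fileset in root.findall("FileSet"):
        patterns = [fn.text or "" for fn in fileset.findall("FileName")]
        if not any(re.fullmatch(p, filename) for p in patterns):
            continue
        for meta in fileset.findall("Description/Metadata"):
            name, value = meta.get("name"), meta.text or ""
            if meta.get("mode", "override") == "accumulate":
                result.setdefault(name, []).append(value)  # add to earlier values
            else:
                result[name] = [value]                     # replace earlier values
    return result


assigned = metadata_for("nugget-point-1.jpg", METADATA_XML)
print(assigned)
```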
117
Tagging Document Files
  <!--
  <Section>
    <Description>
      <Metadata name="Title">Realizing human rights for poor people: Strategies for achieving the international development targets</Metadata>
    </Description>
  -->
  (text of section goes here)
  <!--
  </Section>
  -->
118
Classifiers
119
Format Statements
120
Format Statements
121
Examples of Format Strings