Title: Building Greenstone Collections from the Command Line
1 Building Greenstone Collections from the Command Line
2 Basic commands
- Type setup.bat (for Windows users) or setup.sh (for Unix/Linux users) when you're in the Greenstone installation directory
- To create a collection, type perl -S mkcol.pl -creator youremail@somewhere.com collection_name
- To import documents into a collection, type perl -S import.pl collection_name
- To build a collection, type perl -S buildcol.pl collection_name
- For further details, read pages 9-19 of the Developer's Guide
3 Building A Collection In Greenstone
- [Diagram: documents (e.g. from the Web) in Import are converted by import.pl (plugins) into XML documents in Archives, which buildcol.pl (classifiers) turns into the browsing and full-text Index]
4 Importing documents
- Plugins are used to process source documents in different formats and associate the corresponding metadata with them
- The output of this process is XML documents encoded in the Greenstone Archive format, specified by the following DTD:
    <!DOCTYPE GreenstoneArchive [
      <!ELEMENT Section (Description,Content,Section*)>
      <!ELEMENT Description (Metadata*)>
      <!ELEMENT Content (#PCDATA)>
      <!ELEMENT Metadata (#PCDATA)>
      <!ATTLIST Metadata name CDATA #REQUIRED>
    ]>
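The DTD can be read as a small structural rule: each Section holds one Description, one Content, and then any number of nested Sections. The sketch below checks that rule by hand in Python. It is an illustration only, not Greenstone's own validation: the function name conforms is hypothetical, and it assumes the content model is Description, Content, Section* with Metadata* inside Description.

```python
import xml.etree.ElementTree as ET


def conforms(section):
    """Check one Section element against the assumed content model."""
    children = list(section)
    tags = [child.tag for child in children]
    # A Section must start with exactly one Description and one Content.
    if section.tag != "Section" or tags[:2] != ["Description", "Content"]:
        return False
    description, content = children[0], children[1]
    # Description holds only Metadata elements, each with a name attribute.
    for item in description:
        if item.tag != "Metadata" or "name" not in item.attrib:
            return False
    # Content is plain text, so it must have no child elements.
    if len(content) != 0:
        return False
    # Anything after Content must be a nested Section that also conforms.
    return all(conforms(child) for child in children[2:])


good = ET.fromstring(
    '<Section><Description><Metadata name="Title">A</Metadata></Description>'
    "<Content>text</Content>"
    "<Section><Description/><Content/></Section></Section>"
)
bad = ET.fromstring("<Section><Content>text</Content></Section>")
print(conforms(good), conforms(bad))
```

A full check would use a validating XML parser with the DTD itself; the hand-written version only shows what the DTD requires.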
5 Automating collection building tasks
- Batch files can automate many of these tasks
- You can create a batch file to import and rebuild a collection
- Try copying and pasting the following lines into a batch file named rebuild.bat:
    perl -S import.pl -removeold %1
    perl -S buildcol.pl %1
- Execute the batch file by typing rebuild.bat collection_name
- There are many commands that you can combine in a batch file
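The same import-then-build sequence can be scripted on any platform. The following is a minimal Python sketch, assuming perl and the Greenstone scripts are on the path (that is, setup.bat or setup.sh has already been run). The function names rebuild_commands and rebuild are hypothetical, and by default the script only prints the commands instead of running them.

```python
import subprocess


def rebuild_commands(collection, remove_old=True):
    """Return the import and build command lines for one collection."""
    import_cmd = ["perl", "-S", "import.pl"]
    if remove_old:
        import_cmd.append("-removeold")
    import_cmd.append(collection)
    build_cmd = ["perl", "-S", "buildcol.pl", collection]
    return [import_cmd, build_cmd]


def rebuild(collection, dry_run=True):
    """Print the commands, or run them in order and stop at the first failure."""
    for cmd in rebuild_commands(collection):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Dry run: shows what would be executed for a collection.
rebuild("collection_name")
```

Calling rebuild("collection_name", dry_run=False) would execute both steps; check=True makes a failed import stop the script before the build starts.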
6 Importing documents (cont.)
- An example:
    <Section>
      <Description>
        <Metadata name="gsdlsourcefilename">ec158e.txt</Metadata>
        <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
        <Metadata name="Identifier">HASH0158f56086efffe592636058</Metadata>
        <Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
        <Metadata name="gsdlassocfile">p07a.png:image/png:</Metadata>
      </Description>
    </Section>
- Note: gsdlsourcefilename is the original file from which the Greenstone archive file was generated, and gsdlassocfile is a file associated with the document (e.g. an image file)
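To see how such an archive Section can be read back, here is a minimal Python sketch that collects the Metadata elements into a dictionary. It is not part of Greenstone; the sample text is a shortened version of the example, the Content element is a placeholder, and the colon-separated form of the gsdlassocfile values is an assumption. Values are kept in lists because a name such as gsdlassocfile can occur more than once.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the Content text is a placeholder.
ARCHIVE_SECTION = """\
<Section>
  <Description>
    <Metadata name="gsdlsourcefilename">ec158e.txt</Metadata>
    <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
    <Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
    <Metadata name="gsdlassocfile">p07a.png:image/png:</Metadata>
  </Description>
  <Content>(document text)</Content>
</Section>
"""


def section_metadata(section):
    """Map each metadata name to the list of values found in the Description."""
    metadata = {}
    for item in section.findall("./Description/Metadata"):
        metadata.setdefault(item.get("name"), []).append(item.text or "")
    return metadata


meta = section_metadata(ET.fromstring(ARCHIVE_SECTION))
print(meta["Title"])
print(meta["gsdlassocfile"])
```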
7 Document Metadata
- Greenstone plugins recognize only a small set of metadata tags
- There are three ways to assign metadata to documents in a collection: 1) index.txt, 2) metadata.xml and 3) modifying an existing Greenstone plugin
- An index.txt file is a space-separated file that assigns a list of metadata to documents in a collection. It should be placed in the collection's import directory
8 Document Metadata (cont.)
- To inform Greenstone about the existence of this file, include the IndexPlug plugin in your collect.cfg file or add this plugin to your plugin list in GLI
- An example of the index.txt file is as follows:
    key: Title Date Cast Director
    "analyze.html" "Analyze That" "2002" "Robert De Niro, Billy Crystal, Lisa Kudrow" "Harold Ramis"
    "majestic.html" "Majestic, The" "2001" "Jim Carrey, Bob Balaban, Jeffrey DeMunn" "Frank Darabont"
- Each field in this file is separated by a space and enclosed in double quotes. Fields are matched by position with the field names listed in the first line of the file
- Note that the first field of each entry must be the filename of a document
- The trailers collection uses this approach to assign metadata to its documents
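The positional matching between the first line and the quoted fields can be sketched in a few lines of Python. This is an illustration of the file layout described above, not the IndexPlug implementation; the function name parse_index is hypothetical, and the sketch assumes the first token of the first line (the key marker) is skipped and every value is double-quoted.

```python
import shlex

INDEX_TXT = '''\
key: Title Date Cast Director
"analyze.html" "Analyze That" "2002" "Robert De Niro, Billy Crystal, Lisa Kudrow" "Harold Ramis"
"majestic.html" "Majestic, The" "2001" "Jim Carrey, Bob Balaban, Jeffrey DeMunn" "Frank Darabont"
'''


def parse_index(text):
    """Return {filename: {field name: value}} for an index.txt-style listing."""
    lines = [line for line in text.splitlines() if line.strip()]
    # First line: the key marker followed by the metadata field names.
    fields = lines[0].split()[1:]
    records = {}
    for line in lines[1:]:
        # shlex.split keeps each double-quoted value together as one item.
        values = shlex.split(line)
        filename, rest = values[0], values[1:]
        records[filename] = dict(zip(fields, rest))
    return records


records = parse_index(INDEX_TXT)
print(records["analyze.html"]["Director"])
```

Because the fields are matched by position, a missing closing quote shifts or merges values, which is why each value in the file needs both quotes.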
9 Document Metadata (cont.)
- The second approach uses an XML file to assign metadata to documents in a collection
- To inform Greenstone that you would like to use the metadata.xml file, include the string plugin RecPlug -use_metadata_files in your collect.cfg file, or check the use_metadata_files flag after clicking on the configure plugin button in the GLI
- The benefit of using an XML file over the previous approach is that the browser can perform tag checking for you
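The tag checking a browser performs amounts to testing whether the file is well-formed XML. The same check can be sketched in Python with the standard library parser. The function name check_well_formed is hypothetical, and the check covers only well-formedness (matching and properly nested tags), not whether the metadata names make sense to Greenstone.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if the text parses as XML, otherwise the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return str(error)
    return None


# A FileSet whose Description is never closed is reported as an error.
broken = "<DirectoryMetadata><FileSet><Description></FileSet></DirectoryMetadata>"
print(check_well_formed(broken))
```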
10 Document Metadata (cont.)
    <?xml version="1.0" ?>
    <DirectoryMetadata>
      <FileSet>
        <FileName>MARTYN_DR_02002066.html</FileName>
        <Description>
          <Metadata name="PlayerID">MARTYN_DR_02002066</Metadata>
          <Metadata name="PlayerProfile"></Metadata>
          <Metadata name="PlayerName">Damien Richard Martyn</Metadata>
          <Metadata name="FullSizeImage">http://www-usa.cricket.org//perl/picture.cgi/030730</Metadata>
          <Metadata name="ThumbnailImage">http://www-usa.cricket.org//perl/picture.cgi/030730/inline?alt=1</Metadata>
          <Metadata name="CoverImage">MARTYN_DR_02002066.jpg</Metadata>
          <Metadata name="Country">Australia</Metadata>
          <Metadata name="BattingStyle">Right Hand Bat</Metadata>
          <Metadata name="BowlingStyle">Right Arm Medium</Metadata>
        </Description>
      </FileSet>
      <FileSet>
        <FileName>POTHECARY_JE_03001137.html</FileName>
        <Description>
11 Document Metadata (cont.)
- Here's the answer:
    <DirectoryMetadata>
      <FileSet>
        <FileName>text</FileName>
        <Description>
          <Metadata name="name1">some text</Metadata>
          <Metadata name="name 2">some text</Metadata>
          other Metadata tags
        </Description>
      </FileSet>
      other FileSet tags
    </DirectoryMetadata>
- Note that XML is case-sensitive
- The cricket collection uses metadata.xml to assign metadata to its documents
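When there are many documents, a metadata.xml file with this DirectoryMetadata / FileSet / Description layout can be generated instead of typed by hand. The sketch below is one possible way to do that in Python; the function name build_directory_metadata is hypothetical, and the input dictionary is sample data taken from the cricket example.

```python
import xml.etree.ElementTree as ET


def build_directory_metadata(files):
    """Build a DirectoryMetadata tree from {filename: {metadata name: value}}."""
    root = ET.Element("DirectoryMetadata")
    for filename, metadata in files.items():
        fileset = ET.SubElement(root, "FileSet")
        ET.SubElement(fileset, "FileName").text = filename
        description = ET.SubElement(fileset, "Description")
        for name, value in metadata.items():
            ET.SubElement(description, "Metadata", {"name": name}).text = value
    return root


root = build_directory_metadata({
    "MARTYN_DR_02002066.html": {
        "PlayerName": "Damien Richard Martyn",
        "Country": "Australia",
    },
})
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Generating the file this way also guarantees that every tag is closed and that element names keep a consistent case.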
12 Document Metadata (cont.)
- We can also customize a plugin to extract metadata from a document
- We will look at modifying the TextPlug to extract Ratings, Genre and Subject from a few documents in the trailers collection
13 Structuring Documents into Sections
- Sometimes source documents have to be structured into sections and subsections
- This can be done easily by incorporating the following tags, wrapped in HTML comments, into your documents:
    <!--
    <Section>
      <Description>
        <Metadata name="Title">Realizing human rights for poor people: Strategies for achieving the international development targets</Metadata>
      </Description>
    -->
    (text of section goes here)
    <!--
    </Section>
    -->
- You can also embed subsections within a section by placing another level of <Section> before the </Section> tag
- Look at one of the HTML files in the demo collection for an example
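The nesting rule (a subsection goes before the closing tag of its parent) can be illustrated with a small Python helper that writes the comment-wrapped markers around a piece of text. The helper name section_block is hypothetical, and the sketch inserts the title as given, without any escaping.

```python
def section_block(title, body, subsections=()):
    """Wrap body text in comment-wrapped Section markers, nesting subsections."""
    parts = [
        "<!--",
        "<Section>",
        "<Description>",
        f'<Metadata name="Title">{title}</Metadata>',
        "</Description>",
        "-->",
        body,
    ]
    # Subsections go after the parent's own text and before its closing tag.
    parts.extend(subsections)
    parts += ["<!--", "</Section>", "-->"]
    return "\n".join(parts)


inner = section_block("Subsection", "(text of subsection goes here)")
outer = section_block("Main section", "(text of section goes here)", [inner])
print(outer)
```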
14 Browsing Indexes
15 Types of Browsing Indexes
- SectionList
- AZList
- AZSectionList
- DateList
- Hierarchy
16 Creating Browsing Indexes
- Certain classifiers generate browsing structures that are hierarchical
- They are useful for subject classifications and organization hierarchies
- Specific hierarchies therefore have to be provided using the flag -hfile <filename> when the classifier is defined in the collect.cfg file
- For example:
    classify Hierarchy -hfile sub.txt -metadata Subject -sort Title
17 Creating Browsing Indexes (cont.)
- Note that sub.txt has to reside in the collection's etc directory
- Certain classifiers don't require explicit hierarchies to be defined, for instance the AZList, DateList and List classifiers, which generate a selection list of the corresponding metadata:
    classify List -metadata Howto
    classify AZList -metadata Title
18 Creating Browsing Indexes (cont.)
- Explicit hierarchies have to be defined according to the following format:
    <identifier> <position in hierarchy> <name>
- For example:
    1 1 General reference
    1.2 1.2 Something else
    2 2 ...
- This means that a document is assigned to the first classification if the metadata associated with the current classifier has the value 1 in that document
- Look at the demo collections for examples
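The three-column hierarchy format can be read with a short Python sketch like the one below. It is an illustration of the layout only, not Greenstone's parser: the function name parse_hierarchy is hypothetical, the sketch assumes the name is everything after the second column (with optional surrounding double quotes removed), and it derives the nesting depth from the number of dots in the position.

```python
HIERARCHY = """\
1 1 General reference
1.2 1.2 Something else
"""


def parse_hierarchy(text):
    """Return one dict per line: identifier, position, name and nesting depth."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split into at most three parts so the name may contain spaces.
        identifier, position, name = line.split(None, 2)
        entries.append({
            "identifier": identifier,
            "position": position,
            "name": name.strip('"'),
            "depth": position.count(".") + 1,
        })
    return entries


for entry in parse_hierarchy(HIERARCHY):
    print("  " * (entry["depth"] - 1) + entry["name"])
```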
19 Creating Browsing Indexes (cont.)
- Documents are treated internally as tree nodes by Greenstone
- There are three types of nodes: VList, HList and DateList
- For example, an AZList consists of a collection of VList nodes that represent documents
- Arguments accepted by the various classifiers are listed on page 48 of the Developer's Guide
20 Formatting Browsing Indexes
- Each classifier has an implicit name derived from its position in the collect.cfg file. For example, the third classifier specified in the file is called CL3
- Tags in the formatting strings:
- [Text]: document text
- [link] ... [/link]: link to the document itself
- [icon]: icon representing the resource
- [metadata-name]: value of the metadata associated with this document
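The implicit classifier names follow directly from the order of the classify lines in collect.cfg. The Python sketch below numbers them in that order. It is only an illustration of the naming rule: the function name classifier_names is hypothetical, and the sample configuration reuses the classify lines shown earlier, assuming each flag is written with a leading hyphen.

```python
COLLECT_CFG = """\
classify Hierarchy -hfile sub.txt -metadata Subject -sort Title
classify List -metadata Howto
classify AZList -metadata Title
"""


def classifier_names(collect_cfg):
    """Map CL1, CL2, ... to the classifier types in the order they are listed."""
    names = {}
    position = 0
    for line in collect_cfg.splitlines():
        words = line.split()
        if len(words) > 1 and words[0] == "classify":
            position += 1
            names[f"CL{position}"] = words[1]
    return names


print(classifier_names(COLLECT_CFG))
```

With this configuration the AZList classifier is the third one, so a format statement aimed at it would use the name CL3.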
21 Formatting Browsing Indexes (cont.)
- For example:
    format CL4VList "<br>[link][Howto][/link]"
- Conditional statements are supported in the formatting string. They are enclosed by the { and } characters in these formats:
    {If}{[metadata], then clause, else clause}
    {Or}{action, another-action, another-action, etc}
- The If statement works as in most programming languages
- The Or statement evaluates the items in the list in order and stops at the first one that is non-null. Its value is sent to the output and evaluation is terminated.
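The behaviour of the If and Or statements can be modelled in a few lines of Python. This is a toy model of the semantics described above, not Greenstone's format-string parser: the function names are hypothetical, metadata references are assumed to be written as [name] in square brackets, and the special tags [Text], [link], [/link] and [icon] are simply left untouched.

```python
import re

RESERVED = {"Text", "link", "/link", "icon"}


def expand_metadata(text, metadata):
    """Replace each [name] with the document's value, or '' if it has none."""
    def substitute(match):
        name = match.group(1)
        if name in RESERVED:
            return match.group(0)
        return metadata.get(name, "")
    return re.sub(r"\[([^\[\]]+)\]", substitute, text)


def format_if(condition, then_clause, else_clause, metadata):
    """Use then_clause when the condition expands to a non-empty value."""
    chosen = then_clause if expand_metadata(condition, metadata) else else_clause
    return expand_metadata(chosen, metadata)


def format_or(actions, metadata):
    """Return the first action that expands to a non-empty value."""
    for action in actions:
        value = expand_metadata(action, metadata)
        if value:
            return value
    return ""


doc = {"Title": "Analyze That", "Howto": ""}
print(format_or(["[Howto]", "[Title]", "Untitled"], doc))
print(format_if("[HasAudio]", "audio available", "no audio", doc))
```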
22 Formatting Browsing Indexes (cont.)
- For example:
    format VList "<td valign=top>[link]<img src=_httpprefix_/collect/cricket/images/[PlayerID].jpg border=0>[/link]</td><td>[link][Title][/link]</td><td>{If}{[HasAudio],<a href=[audioURL]><img src=_httpprefix_/collect/cricket/images/wav.jpg border=0></a>}</td>"
23 Customizing the look and feel of Greenstone
24 Customizing the look and feel of Greenstone
- The files involved are in the gsdl/macros directory
- Base.dm: global macros, such as custom buttons
- English.dm: text for the corresponding language
- Home.dm: the main GSDL page
- Gsdl.dm: About Greenstone page
- Style.dm: page layout
- Query.dm: query form layout
25 Customizing the look and feel of Greenstone (cont.)
- Background image (chalk.gif)
- Base.dm
    _httpiconchalk_ {_httpimg_/chalk.gif}
    _widthchalk_ {2000}
    _heightchalk_ {10}
- Custom Button
- Base.dm
    _Genrewidth_ {_widthtGenrex_}
    _imageGenre_ {_gsimage_(_httpbrowseGenre_,_httpicontGenreof_,_httpicontGenreon_,Genre,_textimageGenre_)}
    _icontabGenregreen_ {<img src="_httpicontGenregr_" width=_widthtGenrex_ border=0>}
    _icontabGenregreen_ [v=1] {_texticontabGenregreen_}
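Macros are defined in terms of other macros, so a value such as the chalk image location is only known once the nested names are expanded. The Python sketch below shows that idea on a tiny sample. It is a toy model, not Greenstone's macro engine: it assumes definitions are written as _name_ {value} on one line, ignores parameters such as version selectors, and the value given to _httpimg_ is a made-up placeholder.

```python
import re

# Sample definitions; the _httpimg_ value is a hypothetical placeholder.
MACROS_DM = """\
_httpimg_ {/gsdl/images}
_httpiconchalk_ {_httpimg_/chalk.gif}
_widthchalk_ {2000}
"""

MACRO_LINE = re.compile(r"^(_[A-Za-z0-9]+_)\s*\{(.*)\}\s*$")
# A reference is _name_ whose first underscore is not backslash-escaped.
MACRO_REF = re.compile(r"(?<!\\)_[A-Za-z0-9]+_")


def parse_macros(text):
    """Collect one-line '_name_ {value}' definitions into a dictionary."""
    macros = {}
    for line in text.splitlines():
        match = MACRO_LINE.match(line.strip())
        if match:
            macros[match.group(1)] = match.group(2)
    return macros


def expand(value, macros, depth=0):
    """Replace known macro names by their values, following nested references."""
    if depth > 10:  # guard against definitions that refer to themselves
        return value

    def substitute(match):
        name = match.group(0)
        if name in macros:
            return expand(macros[name], macros, depth + 1)
        return name  # unknown names are left as written

    return MACRO_REF.sub(substitute, value)


macros = parse_macros(MACROS_DM)
print(expand("_httpiconchalk_", macros))
```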
26 Customizing the look and feel of Greenstone (cont.)
- Document.dm
    _textGenrepage_ {_texticonhGenre_}
    _iconGenrepage_ {<img src="_httpiconhGenre_" width="_widthhGenre_" height="_heighthGenre_">}
    _iconGenrepage_ [v=1] {<h2>_texticonhGenre_</h2>}
27 Customizing the look and feel of Greenstone (cont.)
- English.dm
    _textimageGenre_ {Browse by Genre}
    _texticontabGenregreen_ {Genre}
    _httpicontGenregr_ {_httpimg_/tGenregr.gif}
    _httpicontGenreon_ {_httpimg_/tGenreon.gif}
    _httpicontGenreof_ {_httpimg_/tGenreof.gif}
    _widthtGenrex_ {114}
    _texticonhGenre_ {Genre}
    _httpiconhGenre_ {_httpimg_/h\_Genre.gif}
    _widthhGenre_ {250}
    _heighthGenre_ {57}
    _textGenreshort_ {access publications by Genre}
    _textGenrelong_ {<p>You can <i>access my documents by