Title: From Data to Discovery
1From Data to Discovery
- Building Automated Cataloguing Tools with Perl
Huw Jones Cambridge University Library
3Cambridge
Small city, big University, lots of libraries!
8Lots of libraries, lots of books
9Bibliographic records
- University Library 3.85 M
- Other libraries 2.5 M
- 8 databases
10Data problems
11Quality - fullness
Of 2.5 M records in our databases, 1 M are short records
12Quality - coding
13Duplication
14Effects
- Difficulty in resource discovery
- Patchy retrieval
- Lack of authority control
- Difficulty with standard deduplication
- Burden on staff time
- Ties us to multiple database model
15Aims
- Better records
- Fewer records
16Existing Solutions?
- Manual recataloguing
- Commercial solutions
- Universal catalogue
- Discovery layer
- Either don't solve the core problem, or are expensive and/or time-consuming
17Our solution
- Automated Cataloguing Tools!
- Short record enrichment
- Automated MARC correction
- Deduplication
- Order is important: full, well-coded records are easier to deduplicate
18General principles
- Retrieve some records from a Voyager database
- Examine and/or manipulate them
- If necessary, make changes in the database
- N.B. Watch indexes and table space!
19General tools
- Perl holds everything together
- Perl DBI connects to databases
- SQL retrieves records from database
- MARC::Record modules (from CPAN) to examine/manipulate records
- Pbulkimport/Batchcat to make changes to the database
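The retrieve, examine, change cycle described above can be sketched as follows. This is a minimal illustration in Python rather than the talk's Perl, with the standard-library sqlite3 module standing in for Perl DBI and a Voyager database. The table `bib_records`, its columns, and the `add_final_full_stop` rule are hypothetical, and real changes would go through Pbulkimport or Batchcat rather than a direct UPDATE.

```python
import sqlite3


def process_records(conn, transform):
    """Retrieve every record, pass it to transform, and write back any change.

    transform(record_text) returns the new text, or None to leave it alone.
    Returns the list of bib ids that were changed.
    """
    changed = []
    rows = conn.execute(
        "SELECT bib_id, record FROM bib_records ORDER BY bib_id"
    ).fetchall()
    for bib_id, record in rows:
        new_record = transform(record)
        if new_record is not None and new_record != record:
            conn.execute(
                "UPDATE bib_records SET record = ? WHERE bib_id = ?",
                (new_record, bib_id),
            )
            changed.append(bib_id)
    conn.commit()
    return changed


def add_final_full_stop(record):
    """Hypothetical rule: make sure the text ends with a full stop."""
    return record if record.endswith(".") else record + "."


# In-memory stand-in for the real database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bib_records (bib_id INTEGER PRIMARY KEY, record TEXT)")
conn.executemany(
    "INSERT INTO bib_records VALUES (?, ?)",
    [(1, "Includes index."), (2, "Includes index")],
)
changed_ids = process_records(conn, add_final_full_stop)
```

Only records that the rule actually alters are written back, which keeps the number of database changes (and the index and table space they consume) as small as possible.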
20Batchcat vs Pbulkimport
- Batchcat: installed on PC with Voyager
- More versatile
- Can't be used on server
- Pbulkimport: limited functionality
- Needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN)
- Can be used on server
21Books
- Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif.: O'Reilly, 2001). ISBN 0596001320
- Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif.: O'Reilly, 2000). ISBN 1565926994
22Enriching short records
24Basic mechanism
- Take short record
- Find a matching full record
- Overlay short record with full record
- Need a source of full records
- In Cambridge: the University Library has a large database of full, authority-controlled records
25- File of SHORT RECORD bib ids
- Connects to LOCAL database and checks that each is a valid bib id
- Retrieves SHORT RECORD info from local database
- Connects to EXTERNAL source, finds best FULL RECORD match and scores it
- Compares match score to overlay threshold; if OK, retrieves MARC record for FULL RECORD
- Corrects FULL MARC record, removes inappropriate fields, and inserts fields to be retained from SHORT RECORD
- In local database, overlays SHORT RECORD with FULL RECORD
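The match, score and overlay steps above might look roughly like this. It is a sketch in Python rather than Perl, and everything specific in it is an assumption: the scoring formula, the 0.8 threshold, the choice of 852 as a retained local field, and the simplified one-string-per-tag record layout are illustrative, not the service's actual rules.

```python
from difflib import SequenceMatcher

OVERLAY_THRESHOLD = 0.8   # hypothetical cut-off
RETAINED_TAGS = ("852",)  # hypothetical: local fields kept from the short record


def normalise(text):
    """Lower-case the text and collapse punctuation and spacing."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(cleaned.split())


def match_score(short, full):
    """Score 0..1: ISBN agreement counts for half, title similarity for the rest."""
    same_isbn = bool(short.get("020")) and short.get("020") == full.get("020")
    title_similarity = SequenceMatcher(
        None, normalise(short.get("245", "")), normalise(full.get("245", ""))
    ).ratio()
    return (0.5 if same_isbn else 0.0) + 0.5 * title_similarity


def enrich(short, candidates):
    """Return the overlay record, or None if no candidate clears the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda full: match_score(short, full))
    if match_score(short, best) < OVERLAY_THRESHOLD:
        return None
    overlay = dict(best)
    for tag in RETAINED_TAGS:
        if tag in short:
            overlay[tag] = short[tag]  # keep local data from the short record
    return overlay


short = {"020": "9780961751111", "245": "How to build a habitable planet",
         "852": "Local shelfmark"}
full_good = {"020": "9780961751111", "245": "How to build a habitable planet /",
             "260": "New York : Eldigio Press, 1985.", "650": "Astronomy."}
full_bad = {"020": "0000000000", "245": "Astrophysics for beginners"}

upgraded = enrich(short, [full_bad, full_good])
rejected = enrich(short, [full_bad])
```

Keeping the threshold as a single named constant makes it easy to tune how cautious the overlay is without touching the matching logic.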
26Output
27Interface
28Results
- Service has been running for 1 year (much of which was testing)
- 18 libraries subscribed to use service
- 90,000 short records upgraded
29MARC checking and correction
- Bibliographic standard: agreed minimum standard for cataloguing
- Every week, libraries receive an automatically generated file of MARC coding errors for correction
- Based on MARC::Lint module with many alterations
30Output
31Mechanism
- Connects to database using Perl DBI
- Retrieves MARC records for records created/edited in the last week
- Runs them through MARC check
- Prints errors to file
- Emails file to library
- Over 100,000 errors pointed out so far!
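To give a feel for the checking step, here is a small lint-style sketch in Python (the service itself builds on the Perl MARC::Lint module). The three rules, the message wording, and the representation of a record as (tag, value) pairs with `$` as the subfield delimiter are all assumptions made for illustration.

```python
import re


def check_record(fields):
    """Return a list of 'TAG: message' strings for a list of (tag, value) pairs."""
    errors = []
    for tag, value in fields:
        # Statement of responsibility should be introduced by space-slash.
        if tag == "245" and "$c" in value and not re.search(r" /\$c", value):
            errors.append("245: subfield c must be preceded by ' /'")
        # Publication and physical description fields end with a full stop.
        if tag in ("260", "300") and not value.endswith("."):
            errors.append(f"{tag}: field must end with a full stop")
        # Pagination and size need a space between the number and the unit.
        if tag == "300" and re.search(r"\d(p|cm)\b", value):
            errors.append("300: space needed between digits and unit")
    return errors


fields = [
    ("245", "$aHow to build a habitable planet $cBy Wallace S. Broecker."),
    ("260", "$aNew York $bEldigio Press,$cc1985"),
    ("300", "$a291p $bill $c23cm"),
    ("504", "$aIncludes index."),
]
errors = check_record(fields)
```

The returned messages could then be written to a per-library file and emailed, as in the mechanism above.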
32MARC Correction
- LDR 00472nam\\2200157\a\4500
- 001 662002
- 005 20071205064734.0
- 008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
- 020 \\$a9780961751111
- 100 1\$aBroecker, W.S.,$d1931-
- 245 10$aHow to build a habitable planet $cBy Wallace S. Broecker.
- 260 \\$aNew York $bEldigio Press,$cc1985
- 300 \\$a291p $bill $c23cm
- 504 \\$aIncludes index.
- 650 \0$aAstronomy.
- 650 \0$aAstrophysics.
33- LDR 00453nam\\2200157\a\4500
- 001 662002
- 005 20071205064734.0
- 008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
- 020 \\$a9780961751111
- 100 1\$aBroecker, W. S.,$d1931-
- 245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
- 260 \\$aNew York :$bEldigio Press,$cc1985.
- 300 \\$a291 p. :$bill. ;$c23 cm.
- 504 \\$aIncludes index.
- 650 \0$aAstronomy.
- 650 \0$aAstrophysics.
34MARC Correction
- Version of module which, where there is no ambiguity, corrects errors
- Built into short record upgrade program
- Also offered as a retrospective service to clean up legacy records
- Possibility of building it into weekly check
35Mechanism
- Connects to database using Perl DBI
- Retrieves full MARC record
- Runs against correction module
- Replaces corrected record in database
36Output
- Bib id 662002
- How to build a habitable planet By Wallace S. Broecker.
- 100 UPDATE Spaces inserted between initials in subfield _a
- 245 UPDATE By uncapitalised at start of subfield c
- 245 UPDATE Space forward slash inserted before subfield _c
- 260 UPDATE Full stop inserted at end of field
- 260 UPDATE Space colon inserted before subfield _b
- 300 UPDATE Full stop inserted after the p in pagination
- 300 UPDATE Full stop inserted at end of field
- 300 UPDATE Illustration abbreviation has been corrected
- 300 UPDATE Space colon inserted before subfield _b
- 300 UPDATE Space inserted between digits and cm
- 300 UPDATE Space inserted between digits and p in pagination
- 300 UPDATE Space semi-colon inserted before subfield c
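A rule-driven corrector that yields a log of this shape could be sketched as below. This is Python rather than the Perl module used in the service, and the regular expressions are guesses at rules that would produce the updates listed for bib id 662002; the `$` subfield delimiter and the (tag, value) record layout are also assumptions.

```python
import re

# (tag, log message, pattern, replacement): an illustrative subset of rules.
RULES = [
    ("100", "Spaces inserted between initials in subfield _a",
     r"\b([A-Z])\.(?=[A-Z]\.)", r"\1. "),
    ("245", "By uncapitalised at start of subfield c", r"\$cBy\b", "$cby"),
    ("245", "Space forward slash inserted before subfield _c",
     r"\s*/?\s*\$c", " /$c"),
    ("260", "Space colon inserted before subfield _b", r"\s*:?\s*\$b", " :$b"),
    ("260", "Full stop inserted at end of field", r"(?<!\.)\Z", "."),
    ("300", "Space inserted between digits and p in pagination",
     r"(\d)p\b", r"\1 p"),
    ("300", "Full stop inserted after the p in pagination",
     r"(\d p)(?!\.)", r"\1."),
    ("300", "Illustration abbreviation has been corrected",
     r"\$bill\b(?!\.)", "$bill."),
    ("300", "Space colon inserted before subfield _b", r"\s*:?\s*\$b", " :$b"),
    ("300", "Space semi-colon inserted before subfield c",
     r"\s*;?\s*\$c", " ;$c"),
    ("300", "Space inserted between digits and cm", r"(\d)cm\b", r"\1 cm"),
    ("300", "Full stop inserted at end of field", r"(?<!\.)\Z", "."),
]


def correct_record(fields):
    """Apply each matching rule in order; log only rules that changed a field."""
    corrected, log = [], []
    for tag, value in fields:
        for rule_tag, message, pattern, replacement in RULES:
            if rule_tag != tag:
                continue
            new_value = re.sub(pattern, replacement, value)
            if new_value != value:
                log.append(f"{tag} UPDATE {message}")
                value = new_value
        corrected.append((tag, value))
    return corrected, sorted(log)


before = [
    ("100", "$aBroecker, W.S.,$d1931-"),
    ("245", "$aHow to build a habitable planet $cBy Wallace S. Broecker."),
    ("260", "$aNew York $bEldigio Press,$cc1985"),
    ("300", "$a291p $bill $c23cm"),
]
after, log = correct_record(before)
```

Because a rule is logged only when it changes the field, running the corrector over an already clean record produces an empty log, which makes it safe to re-run over legacy data.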
37Results
- In testing, 70,000 records processed
- Corrected over 200,000 MARC coding errors
- May run ALL our existing records through at some
stage
38Deduplication - in progress!
- Three stages
- Identification of groups of duplicates
- Identification/construction of best record
- Deletion of other records and relinking of holdings/items/Purchase Orders to best record
39Identification of duplicates
- Connect to a database with Perl DBI
- Use SQL to retrieve records
- For each record, retrieve all available data from tables
- Use matching algorithm to identify groups of duplicates
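One simple way to form groups of duplicates is to reduce each record to a match key and collect the bib ids that share it. The sketch below (Python, not the project's Perl) assumes a key built from normalised title plus publication year; the real matching algorithm is not described in the slides, and bib ids other than 662002 are invented for the example.

```python
from collections import defaultdict


def match_key(record):
    """Hypothetical key: title reduced to lower-case letters and digits, plus year."""
    title = "".join(ch.lower() for ch in record["title"] if ch.isalnum())
    return (title, record["year"])


def duplicate_groups(records):
    """Return sorted lists of bib ids for every key shared by two or more records."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record["bib_id"])
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


records = [
    {"bib_id": 662002, "title": "How to build a habitable planet", "year": "1985"},
    {"bib_id": 700001, "title": "How to Build a Habitable Planet.", "year": "1985"},
    {"bib_id": 700002, "title": "Astrophysics", "year": "1985"},
]
groups = duplicate_groups(records)
```

An exact-key approach like this only catches records that normalise identically, which is one reason full, well-coded records are easier to deduplicate.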
40- And you'll end up with something like this
41Identification of best record
- For each group of duplicates, MARC records are retrieved
- Passed to scoring algorithm
- Record with highest score forms basis of best record
- Retains set fields (e.g. subject headings) from other records
- Corrects any MARC coding errors
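Choosing and building the best record could be sketched along these lines, again in Python rather than Perl. The scoring formula (one point per field, two extra per subject heading) and the choice of 650 as the only retained tag are assumptions for illustration; the slides do not give the actual weights.

```python
RETAINED_TAGS = {"650"}  # hypothetical: fields merged in from the other records


def score(record):
    """Hypothetical score: one point per field, plus two per subject heading."""
    return len(record) + 2 * sum(1 for tag, _ in record if tag == "650")


def build_best_record(group):
    """Start from the highest-scoring record and add retained fields from the rest."""
    best = max(group, key=score)
    merged = list(best)
    for record in group:
        if record is best:
            continue
        for field in record:
            if field[0] in RETAINED_TAGS and field not in merged:
                merged.append(field)
    return merged


full = [
    ("245", "$aHow to build a habitable planet /$cby Wallace S. Broecker."),
    ("260", "$aNew York :$bEldigio Press,$cc1985."),
    ("650", "$aAstronomy."),
]
brief = [
    ("245", "$aHow to build a habitable planet"),
    ("650", "$aAstrophysics."),
]
best_record = build_best_record([brief, full])
```

The merged record would then go through the MARC correction step before replacing the duplicates.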
46But
- No relinking functionality, even in Batchcat
- No viable workaround for libraries using Acquisitions, or that avoids losing circulation history
47In conclusion
- Tools for librarians, not replacements!
- Do the stuff programs do well, allowing humans to concentrate on what humans do well
- Won't do all the work, just makes a solution to major data problems feasible
48Questions?