Title: Introducing Voyager with Unicode
1 Voyager with Unicode: A Cataloger's Session
Connie Braun, Training Consultant
2 Agenda
- Introduction
- Your Work Environment
- Conversion
- New Features
- Learning More
- Q&A
3 Release Update
- General release occurred October 6, 2004!
- 4 production partners
- 1 Windows Server, 3 Solaris
- 8 test server partners
- 4 Task Force members (large non-Roman collections)
- 1 large consortium with Universal Borrowing and Universal Catalog
- 2 European customers
- As of 01/20/05, 71 customers have upgraded and are functioning in a production environment with Voyager with Unicode. Approximately 50 upgrades are scheduled between now and May 2005.
4 Why Unicode in Voyager?
- Brings Voyager up to current IT standards
- Finds and displays records in the native language
- Create and edit any MARC record using UTF-8
- Import and export of records with any supported character set
- Operators may select a Unicode-compliant font of their choice
- Display Unicode characters in OPAC without proprietary software
5 Implementing Voyager with Unicode
For our customers, it's business as usual, but with some interesting changes and improvements, especially in Cataloging. Helping everyone to implement a Unicode-compliant system is Endeavor's aim. The Unicode standard is an important step towards realizing that goal. Implementing the Unicode standard is an extension of Endeavor's original mission: access to information regardless of location or format.
6 Following Standards
- Follows Standards (not proprietary)
- See http://www.unicode.org for much more detail on these standards.
- See http://lcweb.loc.gov/marc/specifications/speccharucs.html for details on LC's format of MARC records that use Unicode. Voyager follows this specification.
- Specifics on the Code Tables may be viewed at http://www.loc.gov/marc/specifications/specchartables.html
- The Voyager implementation of the Unicode standard gives libraries and their users greater flexibility when accessing collection materials that contain both Roman and non-Roman text.
7 Multilingual Input and Display
- By introducing improved multilingual input and display capabilities in Voyager, characters now display correctly according to the Unicode and MARC standards.
- Greater script coverage for cataloging items in your collections, published in languages around the world.
- How many? The original UTF-8 design can encode 2,147,483,648 (2^31) distinct values! The Unicode standard itself caps the code space at 1,114,112 code points.
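The figure on this slide can be checked with a little arithmetic. The sketch below is illustrative only and is not part of Voyager: it shows where 2,147,483,648 comes from (31 usable bits in the original UTF-8 design), compares it with the code space of the Unicode standard, and shows that UTF-8 uses between one and four bytes per character today. The sample characters are arbitrary choices.

```python
# Illustrative arithmetic only; nothing here is Voyager-specific.

# The original UTF-8 design allowed sequences of up to 6 bytes,
# carrying 31 payload bits.
ORIGINAL_UTF8_SPACE = 2 ** 31

# The Unicode standard defines 17 planes of 65,536 code points each.
UNICODE_CODE_SPACE = 17 * 2 ** 16


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# Arbitrary sample characters: Latin letter, n with tilde, CJK ideograph,
# and a musical symbol from outside the Basic Multilingual Plane.
SAMPLES = ["A", "\u00f1", "\u65e5", "\U0001d11e"]
LENGTHS = [utf8_length(ch) for ch in SAMPLES]

if __name__ == "__main__":
    print(ORIGINAL_UTF8_SPACE, UNICODE_CODE_SPACE, LENGTHS)
```

Plain ASCII text keeps its one-byte-per-character size, which is why most Roman-script data does not grow after conversion.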
8 Preview Server
- Anyone interested in trying out Voyager with Unicode before your upgrade? You can!
- http://support.endinfosys.com/cust/voy/upgrade/unicode/testwv_pre.html provides all the details necessary to get you started
- Preview Server uses the Voyager training database that has been augmented with numerous records in both Roman and non-Roman languages
- Try keyword searches:
- non roman script japanese
- non roman script arabic
- roman script french
- roman script italian
9 Agenda
- Introduction
- Your Work Environment
- Workstation Requirements
- Setting Up For Languages Other Than English
- Tag Tables
- Session Defaults and Preferences
- Conversion
- New Features
- Q&A
10 Workstation Requirements
- In order to enjoy the full range of benefits, PCs must have up-to-date operating systems and productivity software.
- This means that staff PCs will need:
- Windows 2000 or XP operating system
- Unicode standard compliant Internet browser
- IE 6
- Netscape 6
- Unicode-compliant font: Lucida Sans Unicode and Arial Unicode MS
11 MS Windows
- Voyager is more integrated with Windows in terms of:
- Standard Windows 2000/XP Unicode support
- Standard Unicode fonts
- Standard input using Input Method Editors (IMEs)
- Standard browser support
12 Setting Up for Languages Other Than English
- Workstations need to be specifically configured to work with languages other than English
- Likely will require technical IT assistance to install needed languages on staff PCs
- Best to install all languages so that cataloger may easily include new ones as necessary
13 Adding Languages to PCs
- Regional and language options are specific to each PC
- Among options available via Start > Settings > Control Panel
- Details button on Languages tab lets operator view or change languages and methods to enter text
- Can include supplemental language support, too
14 Choosing Languages
- Languages added to PCs will match languages for items found in your collections
- Add and remove according to your needs: as few or as many as necessary
- May also set preferences for language bar and key settings
15 Tag Tables
- MARC Tag Tables have been completely revised and rewritten for Voyager with Unicode
16 Tag Tables
- Ability to modify tag table configuration remains the same as in earlier releases
- But, may not specify anything for Leader position 9 since that byte is now hard-coded to identify records that have been converted to UTF-8
- May want to consider whether or not library will need or want to revise Tag Tables for local use
- See Appendix A of Cataloging User's Guide for full details on revising, maintaining and updating the Tag Tables
17 Record Validation
- MARC validation
- MARC21 character set validation
- Authority control validation
- Decomposition of accented characters for MARC21
18 Session Defaults and Preferences: Record Validation
- Bypass MARC21 Character set validation
- Uses MARC21 Repertoire.cfg to control validation of the MARC21 character set
- Helps to enforce MARC21 standard
- Bypass Decomposition of accented characters for MARC21
- Allows records to be saved to the database without decomposing the characters
- IMPORTANT: If you select this option, MARC21 rules are ignored. We strongly recommend that this check box be un-checked, in order to comply with the MARC21 standard.
19 Session Defaults and Preferences: Mapping Tab
- Expected Character Set of Imported Records now has six options
20 Session Defaults and Preferences: Colors/Fonts Tab
21 Agenda
- Introduction
- Your Work Environment
- Conversion
- Data Conversion
- Conversion Error Logging
- Conversion Details
- Identifying Non-Unicode Data
- The Rest of Voyager
- New Features
- Learning More
- Q&A
22 Data Conversion
- Conversion process during upgrade treats data differently than when importing records through Cataloging client or via BulkImport
- MARC records are converted from VRLIN (Voyager legacy encoding) to MARC21 compliant UTF-8 encoding
- Leader position 9 becomes "a"
- Conversion log created
- UTF-8 allows for variable length characters. The majority of characters in the database occupy the same amount of space as before conversion.
- Note: All indexes and database columns with MARC data are regenerated after conversion.
23 Conversion Details
- IMPORTANT! NO RECORDS ARE LOST
- Each field in the record is handled individually.
- As each field is processed, it may change length, requiring adjustments to the leader and directory of the record.
- Records are saved to the database with leader position 9 set to "a".
- Both record-level and field-level checking are performed. In rare cases an entire record might fail conversion; it is more likely that an individual field fails to be converted.
- Records may not convert if they contain text that cannot be mapped into Unicode according to the standard MARC-8 to Unicode mappings.
- Records that do not convert are stored in the database as is, without being converted to Unicode.
24 Conversion Error Logging
- Libraries need to know the details about the results of the conversion process.
- Full error checking and logging is included as part of the upgrade
- Technical User's Guide, Chapter 4
- Cataloging User's Guide, Appendix C
- Library designees should review this file to plan for correcting any records that have errors
25 Sample from Conversion Log File
26 Conversion Log Details 1
- 1 2 3 4 5 6 7
- 11 secs read=982 changed=791 880=0 okay=982 errors=0 written=982
- 21 secs read=1931 changed=1558 880=0 okay=1931 errors=0 written=1931
- 29 secs read=2848 changed=2087 880=0 okay=2848 errors=0 written=2848
- 36 secs read=3699 changed=2533 880=0 okay=3699 errors=0 written=3699
- 43 secs read=4607 changed=3076 880=0 okay=4607 errors=0 written=4607
- 51 secs read=5519 changed=3610 880=0 okay=5519 errors=0 written=5519
Legend:
1 = number of seconds used by job so far
2 = read: number of records processed
3 = changed: number of records changed
4 = 880: how many records contain 880s
5 = okay: records processed successfully
6 = errors: records not processed due to errors
7 = written: records written to the database
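When reviewing a long conversion log, the progress lines can be tallied by script. The sketch below is a minimal, unofficial example: it assumes each counter is written as a key=value pair after the elapsed seconds (the separator is not legible in the slide text, so this is a guess), and the function name parse_progress_line is hypothetical.

```python
import re

# ASSUMPTION: progress lines look like
#   "51 secs read=5519 changed=3610 880=0 okay=5519 errors=0 written=5519"
# The "=" separator is inferred, not confirmed by the log format itself.
PROGRESS = re.compile(r"^(?P<secs>\d+) secs\b(?P<rest>.*)$")
PAIR = re.compile(r"(\w+)=(\d+)")


def parse_progress_line(line: str):
    """Return the counters on one progress line as a dict, or None."""
    match = PROGRESS.match(line.strip())
    if match is None:
        return None
    counters = {key: int(value) for key, value in PAIR.findall(match.group("rest"))}
    counters["secs"] = int(match.group("secs"))
    return counters


SAMPLE = "51 secs read=5519 changed=3610 880=0 okay=5519 errors=0 written=5519"
RESULT = parse_progress_line(SAMPLE)

if __name__ == "__main__":
    print(RESULT)
    # Sanity check a reviewer might apply: every record read is either okay or an error.
    print(RESULT["okay"] + RESULT["errors"] == RESULT["read"])
```

A non-zero errors counter on the final progress line is the cue to look at the detailed error entries in the log.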
27 Conversion Log Details 2
- 1 2 3 4 5 6 7 8
- bib 6213 17(700) c->8 loose char page0 at 20 '091e ..
- 9
- bib 35322 14(856) c->8 undefined char page0 at 61 'fc7220486973746f .r Histo
- 10
- bib 35516 23(856) c->8 no char to combine to page0 at 82 '1e .
Legend:
1 = record type and id
2 = index within record of field that generated error
3 = tag that generated error
4 = c->8 indicates conversion to UTF-8 encoding
5 = description of error
6 = page: subset to which source character belongs
7 = at: position of source character that caused error
8 = hex dump of source character
9 = description of error
10 = description of error
28 Conversion Log Details 3
loose char: a warning message indicating that a character not strictly part of Voyager encoding has been converted (e.g. unexpected carriage return)
no char to combine to: a warning message indicating that a combining character appeared but it lacks a base character with which to combine (e.g. umlaut but no a, o, u base letter)
undefined char: an error message indicating that there is a single character that cannot be mapped to UTF-8
29 Identifying non-Unicode data
- To identify a non-Unicode record in the Cataloging client, select a color for Conversion records in Session Defaults and Preferences > Colors/Fonts tab.
30 Identifying non-Unicode data
- Any non-converted record displays in the color
selected in Options/Preferences.
31 Identifying non-Unicode data
- There are other ways to identify records that have conversion errors.
Records that cannot be converted to Unicode are viewable in the Cataloging module with "nc" (not converted) displayed in the Title Bar.
Any characters that cannot be matched or
recognized are replaced with a Unicode
substitution character.
32 Fonts and Unicode
- A MARC record may contain non-Roman characters even though you cannot see them.
- Records are sure to display correctly if a Unicode-compliant font has been selected.
- Lucida Sans Unicode: installed by default with Windows
- Arial Unicode MS
- Good choice for libraries with mixed cataloging
- Included with Microsoft Office and other Microsoft products
33 The Rest of Voyager
- Non-MARC data is not converted
- Acquisitions data
- Circulation data (patron info, etc.)
- Item data
- Reporter
- Not Unicode standard compliant
- Translates data to LATIN1
- Dots appear where you used to see squares
34 Agenda
- Introduction
- Your Work Environment
- Conversion
- New Features
- Cataloging
- Diacritics & Special Characters, Importing Records, New Record Views, Search URIs
- WebVoyáge
- Browsers, Searching, Displaying
- Interacting with Other Systems
- Learning More
- Q&A
35 Diacritic and Special Character Entry
- Cataloging practices then and now
- Pre-Unicode input in Cataloging: accent character (diacritic) precedes the base character.
- Example: Espa~na
- Post-Unicode input in Cataloging: accent character (diacritic) follows the base character.
- Example: Espan~a
- Ability to display combined characters is an improvement over past versions and a way to ensure accurate entry
- Example: España
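The "diacritic follows the base character" rule can be seen directly with Python's standard unicodedata module. This is only an illustration of the Unicode convention, not Voyager code; using NFD normalization as a stand-in for the decomposition Voyager performs on save is an assumption.

```python
import unicodedata

# "España" typed with a single precomposed n-with-tilde (U+00F1).
COMPOSED = "Espa\u00f1a"

# Decompose: the base letter "n" comes first, then COMBINING TILDE (U+0303).
# ASSUMPTION: NFD is used here as an approximation of MARC21-style decomposition.
DECOMPOSED = unicodedata.normalize("NFD", COMPOSED)

# Recompose for comparison; both forms display identically in a Unicode font.
RECOMPOSED = unicodedata.normalize("NFC", DECOMPOSED)

if __name__ == "__main__":
    for ch in DECOMPOSED:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The two forms look the same on screen but differ in length (6 characters composed, 7 decomposed), which is one reason fields can change size during conversion.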
36 SpecialCharacters.cfg
SpecialCharacters.cfg, located in the C:\Voyager\Catalog folder, defines the content of the special character entry dialog box. Operators may define their most frequently used characters here.
37 Special Character Entry
This is what the dialog box in Cataloging looks
like.
The key press column identifies the keyboard
equivalent that may be used instead of turning on
Special Character Mode in Cataloging.
38 Finding Little Used Characters
- For situations where a character not part of the Special Characters list is needed, operator can use Character Map from MS Windows
- Start > Programs > Accessories > System Tools > Character Map
- Locate character or perform search
- Select and Copy character, then paste into position in bib record
39 Cataloging Input of Non-Roman Text
Voyager with Unicode allows Cataloging operators to use all of the standard Microsoft Windows keyboards and input method editors (IMEs). With this functionality in place, operators may search for, display, and edit the contents of all MARC records using the full range of UTF-8 characters. The entire JACKPHY group is part of the UTF-8 character set, which includes the right-to-left input needed for Arabic, Persian, Hebrew and Yiddish.
Reminder: JACKPHY = Japanese, Arabic, Chinese, Korean, Persian, Hebrew, Yiddish
40 Linking in a MARC21 Record
Tag  I1  I2  Subfield Data
100  1       $6 880-01 $a An, Zhen.
245  1   0   $6 880-02 $a Ri yue yun yan / $c An Zhen zhu.
250          $6 880-03 $a Di 1 ban.
260          $6 880-04 $a Changchun Shi $b Changchun chu ban she, $c 1997.
300          $a 4, 2, 291 p. $c 21 cm.
440      0   $6 880-05 $a Zhongguo li dai wang chao xing shuai qu shi lu
500          $a Non-Roman script Chinese
651      0   $a China $x History $y Ming dynasty, 1368-1644.
880  1       $6 100-01/$1 $a ? ?.
880  1   0   $6 245-02/$1 $a ?? ?? / $c ? ? ?.
880          $6 250-03/$1 $a ?1?.
880          $6 260-04/$1 $a ??? $b ?? ???, $c 1997.
880      0   $6 440-05/$1 $a ?? ?? ?? ?? ???
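The linking shown in this record works through subfield 6: a regular field points to its 880 with "880-nn", and the 880 points back with the original tag and the same occurrence number. The sketch below pairs the two sides. It is a simplified illustration with hypothetical names (pair_linked_fields, a list of tag and subfield-dictionary tuples), not how Voyager stores or processes records, and the original-script text is shown as "?" because it did not survive in the slide text.

```python
import re

# Subfield 6 begins with a linking tag and a two-digit occurrence number,
# e.g. "880-01" in a regular field or "100-01/..." in an 880 field.
LINK = re.compile(r"^(\d{3})-(\d{2})")


def pair_linked_fields(fields):
    """Pair each 880 field with the regular field it links to.

    `fields` is a list of (tag, subfields) tuples where subfields is a dict
    keyed by subfield code. This data shape is an assumption for illustration.
    """
    regular = {}
    for tag, subfields in fields:
        match = LINK.match(subfields.get("6", ""))
        if match and tag != "880":
            regular[(tag, match.group(2))] = subfields

    pairs = []
    for tag, subfields in fields:
        match = LINK.match(subfields.get("6", ""))
        if match and tag == "880":
            linked_tag, occurrence = match.groups()
            pairs.append((linked_tag, occurrence,
                          regular.get((linked_tag, occurrence)), subfields))
    return pairs


# Abbreviated version of the record on the slide.
FIELDS = [
    ("100", {"6": "880-01", "a": "An, Zhen."}),
    ("250", {"6": "880-03", "a": "Di 1 ban."}),
    ("300", {"a": "4, 2, 291 p.", "c": "21 cm."}),
    ("880", {"6": "100-01/$1", "a": "? ?."}),
    ("880", {"6": "250-03/$1", "a": "?1?."}),
]
PAIRS = pair_linked_fields(FIELDS)

if __name__ == "__main__":
    for linked_tag, occurrence, romanized, original in PAIRS:
        print(linked_tag, occurrence, romanized["a"], "<->", original["a"])
```

Fields without a subfield 6, such as the 300 above, are simply left out of the pairing.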
41 Using On-Screen Keyboard
- Typically, the path is Start > Programs > Accessories > Accessibility > On-Screen Keyboard
42 Importing Records
- Conversion process is separate and distinct from the process of importing records
- Important distinction for operators who import records through the Cataloging client or via BulkImport
- Expected character set needs to be accurately identified if records are to be imported correctly
- Some experimentation may be necessary to determine the correct character set
- Let's look at some details to help everyone understand what is happening
43 Record Exchange Scenarios
44 Voyager 2001.2 and earlier
- In Voyager 2001.2 and earlier, there were several options from which to choose regarding the character set:
- Latin1
- OCLC
- RLIN legacy
- MARC21 MARC8
- Until now it has been quite simple to choose the correct option when importing records through the Cataloging client or processing large numbers of records through BulkImport.
45 After Upgrade to Voyager 2003.1
- From Voyager 2003.1 forward, there are numerous options from which to choose regarding the character set:
- Latin1 (non-Unicode)
- MARC21 MARC8 (non-Unicode)
- MARC21 UTF8
- OCLC (non-Unicode)
- RLIN legacy (non-Unicode)
- Voyager legacy (non-Unicode)
- With Voyager 2003.1 and beyond, it is very important to determine the character set of records before importing records through the Cataloging client or processing large numbers of records through BulkImport. Some experimentation may be necessary.
- Transition to MARC21 UTF8 occurs as Unicode standard becomes pervasive
46 One Year From Now
- In Voyager 2003.1 and beyond, numerous options for character sets will continue to be needed:
- Latin1 (non-Unicode)
- MARC21 MARC8 (non-Unicode)
- MARC21 UTF8
- OCLC (non-Unicode)
- RLIN legacy (non-Unicode)
- Voyager legacy (non-Unicode)
- But, the Unicode standard will be much more pervasive, having been adopted and deployed by bibliographic utilities, vendors who massage records, vendors who supply records, and others.
- This means that selecting the correct option will again be simpler, even though knowing the character sets will continue to be very important.
47 Bulk Import
- Bulk Import of MARC Records
- Fundamentally the same as before
- Leader byte 9 is checked against the incoming character set identified in the import rule.
- Blank = non-Unicode: converted, imported
- a = Unicode: imported
- Neither blank nor a = errors out, not imported
- See log.imp.yyyymmdd for details on import success
- Records that cannot be converted are not imported; found in err.imp.yyyymmdd
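The leader byte 9 rule above amounts to a three-way decision. The sketch below restates it in code for clarity. It is not Bulk Import's actual logic: the function name, the returned labels, and the sample leader values are made up for illustration, and the real program also checks the character set named in the import rule, which is not modeled here.

```python
def bulk_import_action(leader: str) -> str:
    """Decide what happens to a record based on leader position 9.

    Hypothetical restatement of the slide's rule; labels are illustrative.
    """
    if len(leader) != 24:
        raise ValueError("a MARC leader is always 24 characters long")
    coding_scheme = leader[9]  # positions are counted from 0
    if coding_scheme == " ":
        return "convert and import"  # non-Unicode record
    if coding_scheme == "a":
        return "import"              # already Unicode (UTF-8)
    return "reject"                  # neither blank nor "a"


def make_leader(coding_scheme: str) -> str:
    """Build a sample 24-character leader with the given byte 9 (made-up values)."""
    return "00714cam " + coding_scheme + "2200205 a 4500"


ACTIONS = [bulk_import_action(make_leader(c)) for c in (" ", "a", "x")]

if __name__ == "__main__":
    print(ACTIONS)
```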
48 Bulk Import and Expected Character Set
- Character set mapping for Bulk Import is designated in the Bulk Import rule in SysAdmin > Cataloging > Bulk Import Rules.
49 MARC Export
- Default export character set is MARC21 UTF-8
- Use the -a option to choose different character set (in the command line)
- See page 10-8 in Technical User's Guide for more detail
- LATIN1 records will get a dot exported for characters outside the LATIN1 character set
- If mapping for a composed character is not found, it decomposes and Voyager attempts to find a match for each part.
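The fallback described in the last two bullets can be sketched as follows. This is an approximation written for illustration, not the export program's real mapping tables: it uses Python's built-in latin-1 codec and NFD decomposition as stand-ins, and it assumes that any part still without a match is written as a dot.

```python
import unicodedata


def export_latin1(text: str) -> bytes:
    """Encode text as Latin-1, decomposing characters that have no direct match.

    ASSUMPTIONS: Python's "latin-1" codec stands in for the LATIN1 mapping,
    NFD stands in for decomposition, and unmatched parts become ".".
    """
    out = bytearray()
    for ch in unicodedata.normalize("NFC", text):
        try:
            out += ch.encode("latin-1")
            continue
        except UnicodeEncodeError:
            pass
        # No direct match: decompose and try each part on its own.
        for part in unicodedata.normalize("NFD", ch):
            try:
                out += part.encode("latin-1")
            except UnicodeEncodeError:
                out += b"."
    return bytes(out)


SPANISH = export_latin1("Espa\u00f1a")       # n-with-tilde exists in Latin-1
MACRONS = export_latin1("T\u014dky\u014d")   # o-with-macron does not
KANJI = export_latin1("\u65e5\u672c")        # no Latin-1 match at all

if __name__ == "__main__":
    print(SPANISH, MACRONS, KANJI)
```

In this sketch the base letter "o" survives while the macron, which has no Latin-1 equivalent, becomes a dot; characters with no decomposition at all become a single dot each.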
50 New ISBN Indexes
- For improved duplicate detection
- New ISBN Index
- 020N: 020a, number only
- 020R: 020z, number only
- 020 a 1234567890 (Knopf)
- 020 a 1234567890
- Check Bibliographic and Authority duplicate detection profiles in System Administration!
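The point of a "number only" index is that the two 020 examples above should match each other. A rough sketch of that normalization follows. The function name and regular expression are illustrative assumptions, not the rule Voyager's indexer actually applies.

```python
import re

# Leading run of digits, hyphens and an optional X check character.
# ASSUMPTION: this pattern is a simplification chosen for illustration.
ISBN_PREFIX = re.compile(r"^\s*([0-9Xx-]+)")


def isbn_number_only(subfield_value: str) -> str:
    """Return just the ISBN from an 020 subfield, dropping qualifiers."""
    match = ISBN_PREFIX.match(subfield_value)
    if match is None:
        return ""
    return match.group(1).replace("-", "").upper()


WITH_QUALIFIER = isbn_number_only("1234567890 (Knopf)")
PLAIN = isbn_number_only("1234567890")

if __name__ == "__main__":
    # Both forms reduce to the same index key, so duplicate detection can match them.
    print(WITH_QUALIFIER, PLAIN, WITH_QUALIFIER == PLAIN)
```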
51 HTTP Posting
- Much easier access to WebVoyáge display from clients
- Available in Cataloging, Acquisitions & Circulation
- Toggle record view from staff client to WebVoyáge
- Record menu in Cataloging contains a "Send Record To" option
- Send Record To WebVoyáge
- LinkFinderPlus available in Cataloging, Acquisitions & Circulation
- Record menu in Cataloging contains a "Send Record To" option
- Send Record To LinkFinderPlus
- Configured in voyager.ini file, [MARC POSTing] stanza
52 Enabling HTTP Posting
- To enable HTTP posting, a stanza is added to the voyager.ini file. An example is shown below.
- [MARC POSTing]
- WebVoyage="http://train20031-c1db.comet.endinfosys.com/cgi-bin/Pbibredirect.cgi"
- LinkfinderPlus="http://207.56.64.116/cgi-bin/Phttplinkresolver.cgi"
53 Easier Access to OPAC Display
- Send Record To... in Cataloging
- Send Record To... in Acquisitions
54 Search URI
- Staff Client Search URI in Cataloging, Circulation and Acquisitions
- Drive searches to resources on the web
- Add new button to search interface in staff clients
- Click the button: a browser is opened and the search is executed
- This is PC specific (voyager.ini)
- Possible applications
- Link to another OPAC
- Link to one of your vendors
- Link to an online book seller
55 Presenting Search URI
Staff client search URI
Available in Cataloging, Circulation, and
Acquisitions
56 Adding Search URIs
- clipped from voyager.ini
- [SearchURI]
- Name=Google
- URI=http://www.google.com
- Copy=Y
- SearchSyntax=/search?q=<searchtext>
- Name=BarnesNoble
- URI=http://search.barnesandnoble.com
- Copy=Y
- SearchSyntax=/booksearch/results.asp?WRD=<searchtext>
- Name=Gale Group
- URI=http://www.galegroup.com
- Copy=Y
- SearchSyntax=/servlet/SearchPageServlet?region=9&imprint=<searchtext>
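To see what a Search URI entry produces, the sketch below joins the URI and SearchSyntax values and substitutes the search term for the <searchtext> placeholder. It only illustrates the idea. The function name is hypothetical, and the way the staff client actually escapes the search text is not documented in these slides, so the percent-encoding used here (Python's quote_plus) is an assumption.

```python
from urllib.parse import quote_plus


def build_search_url(uri: str, search_syntax: str, search_text: str) -> str:
    """Combine a Search URI entry with a search term.

    ASSUMPTION: the term is UTF-8 percent-encoded with spaces as "+";
    the staff client may escape it differently.
    """
    encoded = quote_plus(search_text)
    return uri.rstrip("/") + search_syntax.replace("<searchtext>", encoded)


GOOGLE_URL = build_search_url(
    "http://www.google.com", "/search?q=<searchtext>", "Ri yue yun yan"
)
ACCENTED_URL = build_search_url(
    "http://www.google.com", "/search?q=<searchtext>", "Espa\u00f1a"
)

if __name__ == "__main__":
    print(GOOGLE_URL)
    print(ACCENTED_URL)
```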
57 WebVoyáge and Unicode
- MARC data supplied to the browser in UTF-8
- IE 6 generally displays Unicode characters correctly. Some characters do not display correctly unless a Unicode-compliant font is selected.
- Netscape 6 figures out that it needs to display Unicode characters without any special settings
- Consider new help text in your OPAC to help patrons understand about language options, especially if there are records using different languages in your database
- New UTF-8 download/save format
58 Searching in WebVoyáge
- Search and display in native languages for staff and users.
- WebVoyáge and Cataloging allow Unicode character input: you can search for and retrieve records in native languages.
- Record display includes non-Latin scripts, including right-to-left scripts like Arabic and Hebrew. Voyager takes advantage of the web browser's native rendering support.
59 Records with Other Languages in the OPAC
60 Displaying Records in WebVoyáge
61 Linking in a MARC21 Record
Tag  I1  I2  Subfield Data
100  1       $6 880-01 $a An, Zhen.
245  1   0   $6 880-02 $a Ri yue yun yan / $c An Zhen zhu.
250          $6 880-03 $a Di 1 ban.
260          $6 880-04 $a Changchun Shi $b Changchun chu ban she, $c 1997.
300          $a 4, 2, 291 p. $c 21 cm.
440      0   $6 880-05 $a Zhongguo li dai wang chao xing shuai qu shi lu
500          $a Non-Roman script Chinese
651      0   $a China $x History $y Ming dynasty, 1368-1644.
880  1       $6 100-01/$1 $a ? ?.
880  1   0   $6 245-02/$1 $a ?? ?? / $c ? ? ?.
880          $6 250-03/$1 $a ?1?.
880          $6 260-04/$1 $a ??? $b ?? ???, $c 1997.
880      0   $6 440-05/$1 $a ?? ?? ?? ?? ???
62 Interacting with Other Systems
- Incoming Z39.50 Connections
- Records in Unicode databases are UTF8 encoded
- z3950svr may send MARC8-encoded records, UTF8-encoded records, or both
- Default is set to send MARC8 encoded records
- But, two different z3950svr ports can be configured to provide records in both formats, thereby accommodating all sites connecting to database
63 Interacting with Other Systems
- Outgoing Z39.50 Connections
- Retrieves and displays records of any type in UTF-8
- Converts incoming records based on new Database Definitions setting in System Administration called Source Character Set:
- Latin1 (non Unicode)
- MARC 21 MARC8 (non Unicode)
- MARC21 UTF8
- OCLC (non Unicode)
- RLIN legacy (non Unicode)
- Voyager legacy (non Unicode)
64 Agenda
- Introduction
- Your Work Environment
- Conversion
- New Features
- Learning More
- Final Q&A
65 If you want to know more about...
- Coded Character Sets: EndUser 2004 Session 29
- Title: Coded Character Sets: A Technical Primer for Librarians
- Presenters: Michael Doran, Systems Librarian, University of Texas at Arlington; Dan Sweeney, Business Analyst II, Endeavor Information Systems
- Great website: http://rocky.uta.edu/doran/charsets/
- Strategies and Tools for Cleaning Up Your Data: EndUser 2004 Session 45
- Title: Transitioning To Unicode: Strategies for Tidying Your Data
- Presenters: Fran Budde, Acquisitions & Cataloging Specialist, Pacific Lutheran University; Francesca Lane Rasmus, Director, Technical Services, Pacific Lutheran University; Layne Nordgren, Director of Instructional Technologies/Library Systems, Pacific Lutheran University
66 If you want to know more about...
- Special Character Input/Issues: EndUser 2004 Session 65
- Title: Why Unicode?
- Presenter: Martin Heijdra, Chinese Bibliographer/Head of Public Services, East Asian Library, Princeton University
- Preparing for Unicode Conversion, Cataloging Issues: EndUser 2004 Session 74
- Title: Unicode Conversion at the Library of Congress
- Presenter: Ann Della Porta, Assistant Coordinator, Integrated Systems Office, Library of Congress
- SupportWeb: KnowledgeBase, EndUser archives
- http://support.endinfosys.com/cust/index.html
67 If you want to know more about...
- 880 Alternate Graphic Representation (R)
- http://www.loc.gov/marc/bibliographic/ecbdhold.html#mrcb880
- OCLC Character Sets
- http://www.oclc.org/support/documentation/worldcat/records/subscription/5/5.pdf
- Original Scripts in RLG Databases
- http://www.rlg.org/origscripts.html
- MARC 21 Concise Bibliographic: Control Subfields
- http://www.loc.gov/marc/bibliographic/ecbdcntf.html
- MARC 21 Concise Bibliographic: Multiscript Records
- http://www.loc.gov/marc/bibliographic/ecbdmulti.html
68 Thank you!