Title: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users
1. Capturing Untapped Descriptive Data: Creating Value for Librarians and Users
- Lynn Silipigni Connaway
- OCLC Research
- ASIST 2006 Conference
- November 9, 2006
2. WorldCat, July 2006
- Manifestations (records): 67,282,165
- Works: 53,472,668
- Total holdings: 1,071,507,045
- Digital items: 1,571,803
- Institutions: 26,236
- Physical items: 1.6 billion (estimated)
3. Origin of materials represented in WorldCat
- Unknown: 14%
- Rest of World: 40%
- US: 34%
- Canada: 3%
- UK: 9%
4. Some aspects of Global WorldCat
- Materials with non-US origins: 35.3 million (52%)
  - Top 5: UK 6.1 million, Germany 4.0 million, France 2.9 million, Netherlands 2.2 million, Canada 2.1 million
- Content languages: 476; 43% of WorldCat is non-English
  - Top 5 non-English: German 4.5 million, French 4.2 million, Spanish 2.9 million, Dutch 2.1 million, Chinese 1.6 million
- Non-English metadata language: 9.3 million (20 languages)
  - Top 5: Dutch 4.1 million, French 1.4 million, German 1.0 million, Japanese 0.7 million, Finnish 0.7 million
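As a quick consistency check, the non-US share can be recomputed from the record total on the July 2006 slide and the 35.3 million figure above. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Figures taken from the slides; variable names are illustrative.
total_records = 67_282_165   # manifestations (records), July 2006
non_us_records = 35.3e6      # materials with non-US origins

share = non_us_records / total_records * 100
print(f"{share:.1f}%")  # prints 52.5%
```

The result agrees with the figure of 52 quoted on the slide, which appears to be truncated rather than rounded up.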
5. OCLC WorldCat™: Decision-making Resource
- Collection management
  - Cooperative collection development
  - Comparative collection analysis
  - Collection assessment
  - Mass digitization
  - Off-site storage
  - Preservation
- Services
  - Virtual reference
  - Recommender services
- Systems
  - Precision
6. OCLC WorldCat™ Data Mining Research Projects
- Audience Level
- Publisher Name Server
- WorldMap
7. Audience Level: Rationale and Objectives
- Holdings represent selection decisions by librarians
  - Implies there are about 1 billion individual selection decisions in the WorldCat holdings file
- Selections are made to serve the interests of a library's target community
  - Associate target community (audience level) to particular library profiles
  - e.g., ARL, non-ARL academic, public, K-12 school
- Implies we can infer a material's audience level from holdings patterns, which in turn can support:
  - Collection management
  - Readers' advisory services
  - Reference services
  - Information retrieval
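One way to picture the inference from holdings patterns is a holdings-weighted average over library profiles. The sketch below is an assumption made for illustration, not the method used in the OCLC project: the profile scores, the function name `audience_level`, and the holdings counts are all made up.

```python
# Hypothetical scores per library profile: low = juvenile, high = research.
# These weights are illustrative assumptions, not values from the project.
PROFILE_SCORES = {
    "k12_school": 0.1,
    "public": 0.3,
    "non_arl_academic": 0.6,
    "arl": 0.9,
}

def audience_level(holdings_by_profile):
    """Return the holdings-weighted mean profile score for one title."""
    total = sum(holdings_by_profile.values())
    if total == 0:
        raise ValueError("title has no holdings")
    weighted = sum(PROFILE_SCORES[profile] * count
                   for profile, count in holdings_by_profile.items())
    return weighted / total

# Made-up holdings counts for a children's title and a research monograph.
childrens_title = {"k12_school": 600, "public": 350, "non_arl_academic": 40, "arl": 10}
monograph = {"k12_school": 0, "public": 5, "non_arl_academic": 45, "arl": 50}

print(round(audience_level(childrens_title), 3))
print(round(audience_level(monograph), 3))
```

Under these assumed weights, a title held mostly by school and public libraries scores near the low end of the scale, and a title held mostly by ARL libraries scores near the high end.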
14. Example: Mother Goose
15. Publisher Name Server: Research Objectives
- Resolve, for data mining and quality of WorldCat:
  - ISBN prefixes to publisher name
  - Variant publisher names to a preferred form
- Complement Collection Analysis Service
  - Librarians
  - Publishers
- Capture and make available various attributes of individual publishers
  - Location of publisher
  - Language(s) of materials published
  - Genre(s)/format(s) of materials published
  - Dominant subject domain(s) of the publisher's output
  - Parent company and subsidiaries
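The two resolution tasks named above can be sketched as table lookups. This is a minimal illustration under stated assumptions: the tables, the normalization rule, and the function names `publisher_for_isbn` and `preferred_form` are hypothetical and do not describe the project's actual implementation.

```python
# Illustrative lookup tables; a real name server would hold many more entries.
# Keys are ISBN-10 prefixes (group + registrant) with hyphens removed.
PREFIX_TO_PUBLISHER = {
    "013": "Prentice-Hall, Inc.",
    "0201": "Addison-Wesley Publishing Company",
}
# Keys are normalized variant names (see normalize below).
VARIANT_TO_PREFERRED = {
    "prentice hall": "Prentice-Hall, Inc.",
    "prentice hall inc": "Prentice-Hall, Inc.",
    "addison wesley": "Addison-Wesley Publishing Company",
}

def normalize(name):
    """Lowercase, drop commas and periods, treat hyphens as spaces."""
    for ch in ",.":
        name = name.replace(ch, "")
    return " ".join(name.replace("-", " ").lower().split())

def publisher_for_isbn(isbn):
    """Return the publisher for the longest known prefix of the ISBN, or None."""
    digits = isbn.replace("-", "").replace(" ", "")
    for length in range(len(digits), 0, -1):
        name = PREFIX_TO_PUBLISHER.get(digits[:length])
        if name is not None:
            return name
    return None

def preferred_form(variant):
    """Return the preferred form for a variant name, or None if unknown."""
    return VARIANT_TO_PREFERRED.get(normalize(variant))

print(publisher_for_isbn("0-13-110362-8"))
print(preferred_form("PRENTICE-HALL"))
```

Matching on the longest known prefix lets the lookup work on ISBNs stored without hyphens, where the boundary between registrant and title number is not visible.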
16. Publisher Name Server: Methodology
- Programmatically cluster publishers using ISBN prefixes
  - Data clustering (The Free Dictionary)
    - "The science of extracting useful information from large data sets or databases"
    - Classification of similar objects into different groups
    - Partitioning of a data set into subsets (clusters)
    - Data in each subset (ideally) share some common trait
- Hand parse the entities and resolve ISBN prefixes
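The clustering step can be pictured as grouping records that share an ISBN prefix, which is a simple exact-key form of partitioning. The sketch below assumes hyphenated ISBN-10s and uses sample (ISBN, publisher string) pairs chosen for illustration; the function names are hypothetical, and the hand-parsing step is left to a human reviewer.

```python
from collections import Counter, defaultdict

def isbn_prefix(hyphenated_isbn):
    """Return 'group-registrant' from a hyphenated ISBN-10 such as '0-13-110362-8'."""
    parts = hyphenated_isbn.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated ISBN-10, got {hyphenated_isbn!r}")
    return f"{parts[0]}-{parts[1]}"

def cluster_by_prefix(records):
    """Map each ISBN prefix to a Counter of the publisher strings seen with it."""
    clusters = defaultdict(Counter)
    for isbn, name in records:
        clusters[isbn_prefix(isbn)][name] += 1
    return clusters

# Sample pairs for illustration; the name variants are made up.
records = [
    ("0-13-110362-8", "Prentice Hall"),
    ("0-13-110163-3", "Prentice-Hall"),
    ("0-13-937681-X", "Prentice Hall"),
    ("0-201-83595-9", "Addison-Wesley"),
    ("0-201-63361-2", "Addison Wesley"),
]

clusters = cluster_by_prefix(records)
for prefix, names in sorted(clusters.items()):
    print(prefix, dict(names))
```

Each cluster collects the name variants that a reviewer would then reconcile into one preferred form.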
17. Publisher Name Server: Database
- To date: >800 records
- Relational database, preserving hierarchical relationships
- Begins with high-occurrence entities to identify
  - Top 10 lists (USA, UK, Canada, Australia, Germany, France, Japan, Italy)
  - Top university presses
  - Mergers and acquisitions
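A relational table can preserve parent and subsidiary relationships with a self-referencing foreign key. The sketch below uses Python's built-in sqlite3 module; the table name, column names, and the specific parent links are assumptions made for illustration and are not taken from the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE publisher (
        id INTEGER PRIMARY KEY,
        preferred_name TEXT NOT NULL,
        parent_id INTEGER REFERENCES publisher(id)  -- NULL for a top-level entity
    )
""")
# Illustrative rows; the parent links are assumed for this example.
conn.executemany(
    "INSERT INTO publisher (id, preferred_name, parent_id) VALUES (?, ?, ?)",
    [
        (1, "Pearson PLC", None),
        (2, "Pearson Education, Inc.", 1),
        (3, "Prentice-Hall, Inc.", 2),
        (4, "Penguin Books", 1),
    ],
)

def descendants(conn, root_name):
    """Return (name, depth) for every entity below root_name in the hierarchy."""
    rows = conn.execute("""
        WITH RECURSIVE tree(id, preferred_name, depth) AS (
            SELECT id, preferred_name, 0 FROM publisher WHERE preferred_name = ?
            UNION ALL
            SELECT p.id, p.preferred_name, tree.depth + 1
            FROM publisher AS p JOIN tree ON p.parent_id = tree.id
        )
        SELECT preferred_name, depth FROM tree
        WHERE depth > 0
        ORDER BY depth, preferred_name
    """, (root_name,))
    return rows.fetchall()

for name, depth in descendants(conn, "Pearson PLC"):
    print("  " * depth + name)
```

A recursive query walks the hierarchy to any depth, so a merger or acquisition can be recorded by changing a single `parent_id` value.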
18. Top U.S. Publishing Entities in WorldCat (22,680,201 total U.S. records)
19. Publisher Name Server: Database
- Database Fields
  - Publisher Name, Preferred Form
  - Source of Preferred Form
  - Former Names
  - Variant Forms
  - ISBN Prefixes
  - HQ City
  - HQ Country
  - Other Cities
  - URL
  - Languages
  - Formats
  - DDC Subjects
  - LCC Subjects
- Data Sources
  - U.S. Library of Congress, National Authority File, 110 (Corporate Name) field
  - Books In Print Online (R.R. Bowker)
  - The International ISBN Registry (K.G. Saur)
  - Publishers Weekly Online
  - Hoover's Handbook Online
  - Standard and Poor's Corporate Descriptions
  - The Directory of Corporate Affiliations (DIALOG)
  - Company websites
  - Data mining
20. Entity-Parsing in a World of Mergers and Acquisitions
Entities shown in the Pearson PLC organization chart (hierarchy not preserved in this transcript):
- Pearson PLC
- Pearson Canada
- Pearson Technology Group
- Penguin Books
- Copp Clark
- Adobe Press
- Cisco Press
- Allen Lane
- Ladybird Books
- Riverhead Books
- Puffin Books
- Putnam Books
- Berkley Publishing Group
- Pearson Education, Inc.
- Avery
- Addison-Wesley Publishing Company
- Prentice-Hall, Inc.
- Allyn and Bacon
- Dominie Press
- Benjamin/Cummings Publishing Company
- Scott, Foresman and Company
- HarperCollins Educational Publishers
- Longmans, Green, and Co.
21. OCLC WorldMap™: Objectives
- Geographically represent library data from UNESCO, ARL, and NCES
  - Number of libraries
  - Amount of library expenditures
  - Number of volumes and titles
  - Number of librarians
  - Number of users
22. OCLC WorldMap™: Objectives
- Research prototype
  - Test geographical representation of WorldCat
    - Titles and holdings by country of publication
- Support data mining research area
  - Visually display mined data to ease review and analysis
- Internal use
  - Sales and marketing
- External use
  - Library collection assessment and comparison
  - Complement the AAU/ARL Global Resources Network project
  - Project of the Council on Library and Information Resources (CLIR)
23. OCLC WorldMap™: Technology
- First implemented in SVG
  - Open standard maintained by W3C
  - Simple XML file
  - Young technology
    - Browser support limited
    - Requires plug-in
- Converted to Flash
  - Browser compatibility
  - Plug-in compatibility (if a plug-in was installed!)
- For a detailed comparison of SVG and Flash, see http://www.carto.net/papers/svg/comparison_flash_svg/
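To illustrate why SVG counts as a simple XML file, the sketch below writes a small bar chart as SVG using only Python's standard library. It draws bars rather than a map, and the layout values and the function name `build_svg` are assumptions made for this example; the counts are the top five non-US origins quoted earlier in the deck.

```python
import xml.etree.ElementTree as ET

# Record counts in millions, from the "Top 5" non-US origins figures.
COUNTS = {"UK": 6.1, "Germany": 4.0, "France": 2.9, "Netherlands": 2.2, "Canada": 2.1}

def build_svg(counts, bar_height=20, scale=40):
    """Return an SVG document (as a string) with one horizontal bar per country."""
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": "400",
        "height": str(bar_height * len(counts)),
    })
    for row, (country, millions) in enumerate(counts.items()):
        y = row * bar_height
        # Bar length is proportional to the record count.
        ET.SubElement(svg, "rect", {
            "x": "130", "y": str(y),
            "width": str(round(millions * scale)),
            "height": str(bar_height - 4),
            "fill": "steelblue",
        })
        label = ET.SubElement(svg, "text", {"x": "0", "y": str(y + bar_height - 6)})
        label.text = f"{country} ({millions}M)"
    return ET.tostring(svg, encoding="unicode")

print(build_svg(COUNTS))
```

The output is plain text that can be saved with an .svg extension and opened in a browser that supports the format.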
24. OCLC WorldMap™
38. Potential Future Projects
- Audience Level
  - Integrate into WorldCat.org and OPACs to limit searches and retrieved sources
- Publisher Name Server
  - Integrate into OCLC Collection Analysis Service for publisher business intelligence
- WorldMap
  - Subject information (aboutness)
  - Language of item
    - Content language
    - Metadata language
  - Holdings by country of library
39. Presentation will be available at http://www.oclc.org/research/presentations/default.htm
Prototypes available at http://www.oclc.org/research/researchworks/default.htm
Project Web Site: http://www.oclc.org/research/projects/default.htm
40. Questions and Discussion
- Contact Information
- connawal_at_oclc.org