Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

About This Presentation
Slides: 41
Provided by: OCLCRe6
Learn more at: https://www.oclc.org

Transcript and Presenter's Notes


1
Capturing Untapped Descriptive Data: Creating
Value for Librarians and Users
  • Lynn Silipigni Connaway
  • OCLC Research
  • ASIST 2006 Conference
  • November 9, 2006

2
WorldCat: July 2006
  • Manifestations (records): 67,282,165
  • Works: 53,472,668
  • Total holdings: 1,071,507,045
  • Digital items: 1,571,803
  • Institutions: 26,236
  • Physical items: 1.6 billion (estimated)
3
Origin of materials represented in WorldCat
  • Unknown: 14%
  • Rest of World: 40%
  • US: 34%
  • Canada: 3%
  • UK: 9%
4
Some aspects of Global WorldCat
  • Materials w/non-US origins: 35.3 million (52%)
      • Top 5: UK 6.1 million, Germany 4.0 million,
        France 2.9 million, Netherlands 2.2 million,
        Canada 2.1 million
  • Content languages: 476; 43% of WorldCat is non-English
      • Top 5 non-English: German 4.5 million, French 4.2
        million, Spanish 2.9 million, Dutch 2.1 million,
        Chinese 1.6 million
  • Non-English metadata language: 9.3 million (20
    languages)
      • Top 5: Dutch 4.1 million, French 1.4 million,
        German 1.0 million, Japanese 0.7 million,
        Finnish 0.7 million
5
OCLC WorldCat™: Decision-making Resource
  • Collection management
      • Cooperative collection development
      • Comparative collection analysis
      • Collection assessment
      • Mass digitization
      • Off-site storage
      • Preservation
  • Services
      • Virtual reference
      • Recommender services
  • Systems
      • Precision

6
OCLC WorldCat™: Data Mining Research Projects
  • Audience Level
  • Publisher Name Server
  • WorldMap

7
Audience Level: Rationale and Objectives
Holdings represent selection decisions by
librarians, which implies there are about 1 billion
individual selection decisions in the WorldCat
holdings file
  • Selections are made to serve the interests of a
    library's target community
  • Associate target community (audience level) with
    particular library profiles, e.g., ARL, non-ARL
    academic, public, K-12 school
  • This implies we can infer a material's audience
    level from holdings patterns, which in turn can
    support
      • Collection management
      • Readers' advisory services
      • Reference services
      • Information retrieval
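
The inference described on this slide can be sketched roughly as follows. This is a minimal illustration and not OCLC's actual algorithm: the library-type weights, the 0 to 1 scale, the function name `estimate_audience_level`, and the holdings counts are all assumptions made up for the example.

```python
# Hypothetical weights: a higher value stands for a more specialized audience.
# Both the values and the 0-1 scale are assumptions for illustration only.
LIBRARY_TYPE_WEIGHT = {
    "k12_school": 0.1,
    "public": 0.3,
    "non_arl_academic": 0.6,
    "arl": 0.9,
}


def estimate_audience_level(holdings_by_type):
    """Return the holdings-weighted mean of the library-type weights."""
    total = sum(holdings_by_type.values())
    if total == 0:
        raise ValueError("no holdings to infer from")
    weighted = sum(
        LIBRARY_TYPE_WEIGHT[library_type] * count
        for library_type, count in holdings_by_type.items()
    )
    return weighted / total


# Made-up holdings counts for two titles (not WorldCat data).
picture_book = {"k12_school": 400, "public": 500, "non_arl_academic": 80, "arl": 20}
monograph = {"public": 5, "non_arl_academic": 60, "arl": 135}

print(round(estimate_audience_level(picture_book), 3))
print(round(estimate_audience_level(monograph), 3))
```

A title held mostly by school and public libraries lands near the low end of the scale, while one held mostly by ARL libraries lands near the high end, which is the intuition behind using holdings patterns as a proxy for audience.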

8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
Example: Mother Goose
15
Publisher Name Server: Research Objectives
  • Resolve, for data mining and quality of WorldCat
      • ISBN prefixes to publisher name
      • Variant publisher names to a preferred form
  • Complement Collection Analysis Service
      • Librarians
      • Publishers
  • Capture and make available various attributes of
    individual publishers
      • Location of publisher
      • Language(s) of materials published
      • Genre(s)/format(s) of materials published
      • Dominant subject domain(s) of the publisher's
        output
      • Parent company and subsidiaries

16
Publisher Name Server: Methodology
  • Programmatically cluster publishers using ISBN
    prefixes
  • Data clustering (The Free Dictionary)
  • "The science of extracting useful information
    from large data sets or databases"
  • Classification of similar objects into different
    groups
  • Partitioning of a data set into
    subsets (clusters)
  • Data in each subset (ideally) share some common
    trait
  • Hand parse the entities and resolve ISBN prefixes
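
A rough sketch of the "cluster by ISBN prefix, then hand review" idea is shown here. It is only an illustration under stated assumptions: the input ISBNs are assumed to be already hyphenated ISBN-10s (splitting an unhyphenated ISBN needs the registration-range tables, which are not handled), the helper names are hypothetical, and the sample records are illustrative rather than drawn from WorldCat.

```python
from collections import Counter, defaultdict


def isbn_publisher_prefix(hyphenated_isbn10):
    """Return 'group-registrant' from a hyphenated ISBN-10, e.g. '0-13'."""
    parts = hyphenated_isbn10.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated ISBN-10: {hyphenated_isbn10!r}")
    return f"{parts[0]}-{parts[1]}"


def cluster_by_prefix(records):
    """Group publisher-name strings by ISBN prefix, counting each variant."""
    clusters = defaultdict(Counter)
    for isbn, publisher in records:
        clusters[isbn_publisher_prefix(isbn)][publisher.strip()] += 1
    return clusters


def candidate_preferred_forms(clusters):
    """Pick the most frequent name per cluster as a candidate for hand review."""
    return {prefix: names.most_common(1)[0][0] for prefix, names in clusters.items()}


# Illustrative (ISBN, publisher string) pairs; not taken from WorldCat.
records = [
    ("0-13-110362-8", "Prentice Hall"),
    ("0-13-110163-3", "Prentice-Hall"),
    ("0-13-937681-X", "Prentice Hall"),
    ("0-201-83595-9", "Addison-Wesley"),
    ("0-201-63361-2", "Addison-Wesley Pub. Co."),
    ("0-201-89683-4", "Addison-Wesley"),
]

clusters = cluster_by_prefix(records)
preferred = candidate_preferred_forms(clusters)
for prefix, name in sorted(preferred.items()):
    print(prefix, "->", name, "| variants:", sorted(clusters[prefix]))
```

The most frequent string is only a starting point; the slide's final step, hand parsing the entities, is where a cataloguer would confirm or override the candidate.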

17
Publisher Name Server: Database
  • To date: >800 records
  • Relational database, preserving hierarchical
    relationships
  • Begins with high-occurrence entities to identify
      • Top 10 lists (USA, UK, Canada, Australia,
        Germany, France, Japan, Italy)
      • Top university presses
      • Mergers and acquisitions
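
One common way to preserve hierarchical relationships in a relational database is a self-referencing parent column queried with a recursive common table expression. The sketch below uses SQLite for convenience; the table name, column names, and the specific parent/child links are assumptions for illustration (a few entity names are borrowed from the mergers-and-acquisitions slide later in the deck), not the actual Publisher Name Server schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE publisher ("
    " id INTEGER PRIMARY KEY,"
    " preferred_name TEXT NOT NULL,"
    " parent_id INTEGER REFERENCES publisher(id))"
)
# Illustrative rows; the parent/child links are assumed for the example.
conn.executemany(
    "INSERT INTO publisher (id, preferred_name, parent_id) VALUES (?, ?, ?)",
    [
        (1, "Pearson PLC", None),
        (2, "Penguin Books", 1),
        (3, "Puffin Books", 2),
        (4, "Pearson Education, Inc.", 1),
        (5, "Prentice-Hall, Inc.", 4),
    ],
)


def descendants(connection, root_name):
    """Return (name, depth) for every entity below root_name in the hierarchy."""
    return connection.execute(
        """
        WITH RECURSIVE tree(id, preferred_name, depth) AS (
            SELECT id, preferred_name, 0
            FROM publisher
            WHERE preferred_name = ?
            UNION ALL
            SELECT p.id, p.preferred_name, tree.depth + 1
            FROM publisher AS p
            JOIN tree ON p.parent_id = tree.id
        )
        SELECT preferred_name, depth
        FROM tree
        WHERE depth > 0
        ORDER BY depth, preferred_name
        """,
        (root_name,),
    ).fetchall()


for name, depth in descendants(conn, "Pearson PLC"):
    print("  " * depth + name)
```

Keeping the hierarchy in a single `parent_id` column means a merger or acquisition can be recorded by updating one row, while the recursive query still returns the full tree beneath any parent.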

18
Top U.S. Publishing Entities in
WorldCat (22,680,201 total U.S. records)
19
Publisher Name Server: Database
  • Database Fields
      • Publisher Name, Preferred Form
      • Source of Preferred Form
      • Former Names
      • Variant Forms
      • ISBN Prefixes
      • HQ City
      • HQ Country
      • Other Cities
      • URL
      • Languages
      • Formats
      • DDC Subjects
      • LCC Subjects
  • Data Sources
      • U.S. Library of Congress, National Authority
        File, 110 (Corporate Name) field
      • Books In Print Online (W.W. Bowker)
      • The International ISBN Registry (K.G. Saur)
      • Publishers Weekly Online
      • Hoover's Handbook Online
      • Standard and Poor's Corporate Descriptions
      • The Directory of Corporate Affiliations (DIALOG)
      • Company websites
      • DATA MINING
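
The field list above maps naturally onto a simple record type. The following is a sketch only: the class name, attribute names, types, and the `matches` helper are assumptions chosen to mirror the slide, and the sample values are placeholders invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PublisherRecord:
    """One publisher entry; attribute names mirror the slide's field list."""

    preferred_name: str
    source_of_preferred_form: Optional[str] = None
    former_names: List[str] = field(default_factory=list)
    variant_forms: List[str] = field(default_factory=list)
    isbn_prefixes: List[str] = field(default_factory=list)
    hq_city: Optional[str] = None
    hq_country: Optional[str] = None
    other_cities: List[str] = field(default_factory=list)
    url: Optional[str] = None
    languages: List[str] = field(default_factory=list)
    formats: List[str] = field(default_factory=list)
    ddc_subjects: List[str] = field(default_factory=list)
    lcc_subjects: List[str] = field(default_factory=list)

    def all_names(self) -> List[str]:
        """Preferred form first, then former names and variant forms."""
        return [self.preferred_name, *self.former_names, *self.variant_forms]

    def matches(self, name: str) -> bool:
        """True if name equals any known form, ignoring case and outer spaces."""
        wanted = name.strip().casefold()
        return any(wanted == known.casefold() for known in self.all_names())


# Placeholder values, invented for the example.
record = PublisherRecord(
    preferred_name="Example Press",
    former_names=["Example Publishing Company"],
    variant_forms=["Example Pr.", "Example Press, Inc."],
    hq_city="Springfield",
    hq_country="US",
    languages=["eng"],
)

print(record.matches("example pr."))
print(record.matches("Another Press"))
```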

20
Entity-Parsing in a World of Mergers and
Acquisitions
Pearson PLC
Pearson Canada
Pearson Technology Group
Penguin Books
Copp Clark
Adobe Press
Cisco Press
Allen Lane
Ladybird Books
Riverhead Books
Puffin Books
Putnam Books
Berkley Publishing Group
Pearson Education, Inc.
Avery
Addison-Wesley Publishing Company
Prentice-Hall, Inc.
Allyn and Bacon
Dominie Press
Benjamin/Cummings Publishing Company
Scott, Foresman and Company
HarperCollins Educational Publishers
Longmans, Green, and Co.
21
OCLC WorldMap™: Objectives
  • Geographically represent library data from
    UNESCO, ARL, and NCES
      • Number of libraries
      • Amount of library expenditures
      • Number of volumes and titles
      • Number of librarians
      • Number of users

22
OCLC WorldMap™: Objectives
  • Research prototype
  • Test geographical representation of WorldCat
  • Titles and holdings by country of publication
  • Support data mining research area
  • Visually display mined data to ease review and
    analysis
  • Internal use
  • Sales and marketing
  • External use
  • Library collection assessment and comparison
  • Complement the AAU/ARL Global Resources Network
    project
  • Project of the Council on Library and Information
    Resources (CLIR)

23
OCLC WorldMap™: Technology
  • First implemented: SVG
      • Open standard maintained by W3C
      • Simple XML file
      • Young technology
      • Browser support limited
      • Requires plug-in
  • Converted to Flash
      • Browser compatibility
      • Plug-in compatibility (if a plug-in was
        installed!)
  • For a detailed comparison of SVG and Flash, see
    http://www.carto.net/papers/svg/comparison_flash_svg/
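
To show what "simple XML file" means in practice, the sketch below writes a tiny SVG document with Python's standard library. It is not the WorldMap implementation: the function name `build_svg`, the canvas size, the circle scaling, and the x/y positions are arbitrary assumptions; only the three record counts are taken from the "Top 5" non-US origins listed earlier in the deck.

```python
import math
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_svg(points, width=400, height=200):
    """Return SVG markup with one circle per (label, x, y, millions) tuple."""
    ET.register_namespace("", SVG_NS)  # serialize without a namespace prefix
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {
            "width": str(width),
            "height": str(height),
            "viewBox": f"0 0 {width} {height}",
        },
    )
    for label, x, y, millions in points:
        circle = ET.SubElement(
            svg,
            f"{{{SVG_NS}}}circle",
            {
                "cx": str(x),
                "cy": str(y),
                # Radius grows with the square root so area tracks the count.
                "r": f"{5 * math.sqrt(millions):.1f}",
                "fill": "steelblue",
            },
        )
        title = ET.SubElement(circle, f"{{{SVG_NS}}}title")
        title.text = f"{label}: {millions} million records"
    return ET.tostring(svg, encoding="unicode")


# Positions are arbitrary placeholders, not real map coordinates.
points = [
    ("UK", 100, 80, 6.1),
    ("Germany", 200, 90, 4.0),
    ("France", 150, 130, 2.9),
]
svg_text = build_svg(points)
print(svg_text)
```

Because the output is plain XML text, it can be generated by any script and inspected in a text editor, which is part of what made SVG attractive despite the limited browser support noted on the slide.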

24
OCLC WorldMap™
25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
(No Transcript)
34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
(No Transcript)
38
Potential Future Projects
  • Audience Level
      • Integrate into WorldCat.org and OPACs to limit
        searches and retrieved sources
  • Publisher Name Server
      • Integrate into OCLC Collection Analysis Service
        for publisher business intelligence
  • WorldMap
      • Subject information (aboutness)
      • Language of item
          • Content language
          • Metadata language
      • Holdings by country of library

39
Presentation will be available at
http://www.oclc.org/research/presentations/default.htm
Prototypes available at
http://www.oclc.org/research/researchworks/default.htm
Project Web Site:
http://www.oclc.org/research/projects/default.htm
40
Questions and Discussion
  • Contact Information
  • connawal_at_oclc.org