Lynn Silipigni Connaway, Ph.D.

1
Data Mining, Advanced Collection Analysis, and
Publisher Profiles: An Update on the OCLC
Publisher Name Authority File
  • Lynn Silipigni Connaway, Ph.D.
  • Senior Research Scientist
  • OCLC Research
  • Timothy J. Dickey, Ph.D.
  • Post-Doctoral Researcher
  • OCLC Research

2
Overall Research Goals
  • To Build a Database that Will
  • Identify
  • Authoritative strings for publisher names
  • Common variants for names and locations
  • Hierarchical references indicating relationships
    and nesting of subsidiaries
  • Definitions of publishing entities

3
Overall Research Goals
  • To Build a Database that Will
  • Produce
  • Profiles, including data-mined information
    regarding formats, languages, subjects, etc. for
    publishers
  • Conform
  • to international authority and standards
    practice, and
  • inter-operate with other OCLC products

4
Issues and Challenges
  • Database Quality
  • Historical Practices
  • "the shortest form in which it can be
    understood" (AACR2, 2004)
  • Different versions of cataloging rules
  • Abbreviations
  • Errors and misspellings
  • Local Practices
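
Misspellings and abbreviations of the kind listed above can be surfaced for human review with approximate string matching. The following is a minimal sketch, assuming Python's standard `difflib` module and a hypothetical short list of preferred forms; the slides do not say which matching technique, if any, the project used.

```python
from difflib import get_close_matches

# Hypothetical preferred forms; a real authority file would hold many more.
PREFERRED_FORMS = [
    "Oxford University Press",
    "Pearson Education",
    "Springer",
]

def suggest_preferred_form(raw_name, preferred_forms=PREFERRED_FORMS, cutoff=0.8):
    """Return the closest preferred form for raw_name, or None if none is close enough."""
    matches = get_close_matches(raw_name, preferred_forms, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A transposed spelling such as "Oxfrod University Press" would be suggested as "Oxford University Press", while an unrelated string returns `None` and is left for manual handling. The 0.8 cutoff is an arbitrary starting point, not a value from the project.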

5
Method: Data Mining in an Aggregate Collection
  • Data Mining and Analysis of WorldCat
  • "affords high-level perspective on historical
    patterns, suggests future trends, and supplies
    useful intelligence with which to inform decision
    making."
  • Lavoie, B. F., Connaway, L. S., & O'Neill, E. T.
    (2007). Mapping WorldCat's digital landscape.
    Library Resources & Technical Services, 51,
    106-115, at 107.

6
WorldCat, July 2008
  • Manifestations (records): 108,828,533
  • Works: 84,096,107
  • Total holdings: 1,292,763,300
  • Digital items: 3,182,550
  • Institutions: 69,000
  • Physical items: 1.2 billion

7
Global Origins of WorldCat Materials
  • Germany: 10%
  • Rest of World: 27%
  • Unknown: 17%
  • France: 4%
  • Canada: 3%
  • UK: 8%
  • US: 28%

8
Global Origins of WorldCat Materials
  • Materials with non-US origins: 57.9 million (55%)
  • Top 5: Germany 10.0 million, UK 8.8 million,
    France 4.2 million, Netherlands 2.9 million,
    Canada 2.9 million
  • Content languages: 478; 49% of WorldCat is
    non-English
  • Top 5 non-English: German 12 million, French 6.1
    million, Spanish 3.5 million, Dutch 2.6 million,
    Japanese 2.4 million
  • Non-English metadata language: 28 million (66
    languages)
  • Top 5: German 11 million, French 1.8 million,
    Dutch 5.0 million, Finnish 0.7 million,
    Swedish 1.9 million

9
OCLC Publisher Name Server
10
Publisher Name Server Objectives
  • Resolve for data mining and quality of WorldCat
  • ISBN prefixes to publisher name
  • Variant publisher names to a preferred form
  • Complement Collection Analysis Service
  • Librarians and Publishers
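
Resolving an ISBN to a publisher name can be pictured as a longest-prefix lookup. This is only a sketch: the table below is a small illustrative sample of widely known prefixes, not data from the OCLC file, and the function and table names are hypothetical.

```python
# Illustrative sample only: hyphen-stripped ISBN-10 prefixes
# (group identifier + publisher prefix) mapped to a preferred name.
PREFIX_TABLE = {
    "019": "Oxford University Press",
    "013": "Prentice-Hall, Inc.",
    "0201": "Addison-Wesley Publishing Company",
    "3540": "Springer (Firm)",
}

def resolve_publisher(isbn, prefix_table=PREFIX_TABLE):
    """Return the preferred name for the longest known prefix of isbn, or None."""
    digits = isbn.replace("-", "").replace(" ", "")
    # Try the longest candidate prefix first so that a more specific
    # prefix wins over a shorter one that happens to share its start.
    for length in range(len(digits) - 1, 0, -1):
        name = prefix_table.get(digits[:length])
        if name is not None:
            return name
    return None
```

Calling `resolve_publisher("0-19-852663-6")` with this sample table returns "Oxford University Press"; an ISBN with no known prefix returns `None`.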

11
Publisher Name Server Objectives
  • Capture and profile attributes of individual
    publishers
  • Location(s)
  • Language(s) of materials published
  • Genre(s)/format(s)
  • Dominant subject domain(s)
  • Parent company and subsidiaries

12
Publisher Name Server Methodology
  • Programmatically cluster publishers' records
    using ISBN prefixes
  • Data clustering
  • Classification of similar objects into different
    groups
  • Partitioning of a data set into
    subsets (clusters)
  • Hand parse the entities and resolve ISBN prefixes
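
The clustering step described above can be sketched as grouping records by ISBN prefix and counting the publisher-name strings that fall into each group. This assumes hyphenated ISBN-10s, where the first two segments are the group identifier and publisher prefix; the sample records use made-up placeholder ISBNs, and all names here are hypothetical.

```python
from collections import Counter, defaultdict

def isbn10_prefix(isbn):
    """Return 'group-publisher' from a hyphenated ISBN-10 such as 0-19-852663-6."""
    parts = isbn.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated ISBN-10, got {isbn!r}")
    return "-".join(parts[:2])

def cluster_by_prefix(records):
    """Map each ISBN prefix to a Counter of the publisher-name strings seen with it."""
    clusters = defaultdict(Counter)
    for isbn, publisher_string in records:
        clusters[isbn10_prefix(isbn)][publisher_string] += 1
    return clusters

# Placeholder records: the ISBNs are invented and their check digits are not valid.
RECORDS = [
    ("0-19-000001-X", "Oxford University Press"),
    ("0-19-000002-X", "Oxford University Press"),
    ("0-19-000003-X", "Oxford Univ. Press"),
    ("0-13-000001-X", "Prentice-Hall"),
    ("0-13-000002-X", "Prentice Hall, Inc."),
    ("0-13-000003-X", "Prentice-Hall"),
]

CLUSTERS = cluster_by_prefix(RECORDS)
```

Each cluster then lists the variant strings a human reviewer would parse by hand; `CLUSTERS["0-19"].most_common(1)` gives the most frequent string as a candidate for the preferred form.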

13
Publisher Name Server Database
  • 1750 publishing entities
  • Relational database, preserving hierarchical
    relationships
  • Begins with high-occurrence entities
  • Top 10 lists
  • Top 10 university presses
  • Mergers and acquisitions, last 8 years

14
Example: Top U.S. Publishing Entities by ISBN
15
Publisher Name Server Data Captured
  • Data
  • Publisher Name, Preferred Form
  • Source of Preferred Form
  • Former Names
  • Variant Forms
  • ISBN Prefixes
  • HQ City
  • HQ Country
  • Other Cities
  • URL
  • -----
  • Languages
  • Formats
  • Conspectus Subjects
  • Sources
  • U.S. Library of Congress, Name Authority
    File, 110 (Corporate Name) field
  • Books In Print Online (R.R. Bowker)
  • The International ISBN Registry (K.G. Saur)
  • Publishers Weekly Online
  • Hoover's Handbook Online
  • Standard and Poor's Corporate Descriptions
  • The Directory of Corporate Affiliations (DIALOG)
  • Company websites
  • DATA MINING
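
The data elements listed above could be held in a record shaped roughly like the one below. This is a sketch only: the slides describe a relational database, and the class name, field names, and types here are assumptions chosen to mirror the slide, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PublisherEntity:
    # Authority data
    preferred_name: str
    source_of_preferred_form: str
    former_names: List[str] = field(default_factory=list)
    variant_forms: List[str] = field(default_factory=list)
    isbn_prefixes: List[str] = field(default_factory=list)
    hq_city: Optional[str] = None
    hq_country: Optional[str] = None
    other_cities: List[str] = field(default_factory=list)
    url: Optional[str] = None
    # Profile data derived by data mining
    languages: List[str] = field(default_factory=list)
    formats: List[str] = field(default_factory=list)
    conspectus_subjects: List[str] = field(default_factory=list)

# Hypothetical example entity, not taken from the authority file.
EXAMPLE = PublisherEntity(
    preferred_name="Example Press",
    source_of_preferred_form="Company website",
    variant_forms=["Example Pr."],
    isbn_prefixes=["0-00"],
)
```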

16
(No Transcript)
17
Publisher Name Server Current Scope
  • More than 56,000 separate strings mapped to 1750
    entities
  • 8.5 million OCLC records
  • 22% of these are Library of Congress records
  • 490 million holdings
  • Hierarchical relationships maintained
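
Mapping many recorded strings to one entity can be sketched as a lookup keyed on a normalized form of each string. The normalization rules (case folding, punctuation removal, whitespace collapsing) and the sample variants are assumptions for illustration; the slides do not describe the project's own rules.

```python
import re

def normalize(name):
    """Case-fold, replace punctuation with spaces, and collapse whitespace."""
    lowered = name.casefold()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())

def build_variant_index(entities):
    """Map the normalized form of every preferred and variant string to its preferred form."""
    index = {}
    for preferred, variants in entities.items():
        for form in [preferred, *variants]:
            index[normalize(form)] = preferred
    return index

# Hypothetical sample of preferred forms and variants.
ENTITIES = {
    "Oxford University Press": ["Oxford Univ. Press", "Oxford U.P."],
    "Prentice-Hall, Inc.": ["Prentice Hall", "Prentice-Hall"],
}
INDEX = build_variant_index(ENTITIES)

def to_preferred(raw, index=INDEX):
    """Return the preferred form for a raw publisher string, or None if unknown."""
    return index.get(normalize(raw))
```

With this sample, "OXFORD UNIV PRESS" and "prentice-hall" both resolve to their preferred forms, and an unlisted string returns `None`.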

18
Entity-Parsing in a World of Mergers and
Acquisitions
Pearson PLC
Pearson Canada
Pearson Technology Group
Penguin Books
Copp Clark
Adobe Press
Cisco Press
Allen Lane
Ladybird Books
Riverhead Books
Puffin Books
Putnam Books
Berkley Publishing Group
Pearson Education, Inc.
Avery
Addison-Wesley Publishing Company
Prentice-Hall, Inc.
Allyn and Bacon
Dominie Press
Benjamin/Cummings Publishing Company
Scott, Foresman and Company
HarperCollins Educational Publishers
Longmans, Green, and Co.
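
A hierarchy like this one can be stored as parent pointers and rolled up to the top-level entity when counting records. The sketch below is illustrative: the slide's nesting is not preserved here, so the parent links shown are assumptions covering only a few of the names, and the function names are hypothetical.

```python
# Assumed parent links for illustration; not the project's actual hierarchy table.
PARENT_OF = {
    "Penguin Books": "Pearson PLC",
    "Puffin Books": "Penguin Books",
    "Pearson Education, Inc.": "Pearson PLC",
    "Addison-Wesley Publishing Company": "Pearson Education, Inc.",
}

def top_level_entity(name, parent_of=PARENT_OF):
    """Follow parent links upward until reaching an entity with no parent."""
    seen = set()
    current = name
    while current in parent_of:
        if current in seen:
            raise ValueError(f"cycle in hierarchy at {current!r}")
        seen.add(current)
        current = parent_of[current]
    return current

def aggregate_counts(record_counts, parent_of=PARENT_OF):
    """Sum per-entity record counts under each top-level entity."""
    totals = {}
    for name, count in record_counts.items():
        root = top_level_entity(name, parent_of)
        totals[root] = totals.get(root, 0) + count
    return totals
```

Keeping the links in a separate table means a merger or acquisition is recorded by changing one parent pointer, without touching the subsidiary's own record.
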
19
Publisher Profiles within WorldCat
  • Oxford University Press
  • 119,237 records with ISBNs mapped to 210,095
    records (0.19% of WorldCat)
  • Pearson PLC
  • Includes 14 subsidiaries and acquisitions
  • Aggregate 291,433 records (0.27% of WorldCat)
  • Springer (Firm)
  • 197,263 records (0.18% of WorldCat)
  • Reed Elsevier PLC
  • Includes dozens of subsidiaries
  • Aggregate 370,029 records (0.34% of WorldCat)
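
The WorldCat shares quoted above appear to be percentages of total records. A short check, assuming the denominator is the July 2008 manifestation count of 108,828,533 given earlier in the deck (the slides do not state the denominator explicitly):

```python
WORLDCAT_RECORDS_JULY_2008 = 108_828_533  # manifestations (records), July 2008

def share_of_worldcat(publisher_records, total_records=WORLDCAT_RECORDS_JULY_2008):
    """Return a publisher's record count as a percentage of all WorldCat records."""
    return 100.0 * publisher_records / total_records

PUBLISHER_RECORDS = {
    "Oxford University Press": 210_095,
    "Pearson PLC": 291_433,
    "Springer (Firm)": 197_263,
    "Reed Elsevier PLC": 370_029,
}

for name, count in PUBLISHER_RECORDS.items():
    print(f"{name}: {share_of_worldcat(count):.2f}%")
```

Under that assumption the computed shares round to 0.19, 0.27, 0.18, and 0.34, matching the figures on the slide.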

20
WorldCat Publisher Profiles: Top Languages (%)
  • Pearson PLC
  • English 95.27
  • Spanish 1.43
  • German 1.33
  • French 0.60
  • Dutch 0.55
  • Latin 0.26
  • Malay 0.06
  • Ancient Greek 0.05
  • Portuguese 0.05
  • Italian 0.04
  • Oxford Univ. Press
  • English 96.74
  • Latin 0.51
  • German 0.39
  • Chinese 0.39
  • French 0.37
  • Spanish 0.28
  • Afrikaans 0.14
  • Middle English 0.13
  • Malay 0.09
  • Swahili 0.09

21
WorldCat Publisher Profiles: Top Languages (%)
  • Reed Elsevier PLC
  • English 83.64
  • French 9.34
  • Dutch 2.32
  • Spanish 0.95
  • Italian 0.60
  • Latin 0.27
  • Afrikaans 0.16
  • Ancient Greek 0.12
  • Portuguese 0.09
  • Polish 0.06
  • Springer (Firm)
  • English 61.25
  • German 37.10
  • French 1.02
  • Italian 0.29
  • Polish 0.13
  • Czech 0.04
  • Spanish 0.04
  • Hungarian 0.03
  • Dutch 0.02
  • Danish 0.02
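
A language profile of this kind can be derived by counting one attribute across a publisher's records and expressing each value as a percentage. The sketch below assumes records are simple dictionaries with hypothetical keys; the same function would serve for formats or subjects.

```python
from collections import Counter

def profile(records, attribute):
    """Return (value, percent) pairs for one attribute, most frequent first."""
    counts = Counter(rec[attribute] for rec in records if attribute in rec)
    total = sum(counts.values())
    return [(value, round(100.0 * n / total, 2)) for value, n in counts.most_common()]

# Tiny made-up sample; real profiles are computed over many thousands of records.
SAMPLE_RECORDS = [
    {"language": "eng", "format": "Printed Material"},
    {"language": "eng", "format": "Printed Material"},
    {"language": "eng", "format": "Computer File"},
    {"language": "ger", "format": "Printed Material"},
]
```

For the sample above, `profile(SAMPLE_RECORDS, "language")` gives English at 75.0 and German at 25.0.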

22
WorldCat Publisher Profiles: Formats (%)
  • Oxford University Press
  • Printed Material 89.57
  • Computer File 8.23
  • Microform 1.39
  • Sound Recording 0.50
  • Video Recording 0.16
  • Springer (Firm)
  • Printed Material 81.69
  • Computer File 17.51
  • Microform 0.71
  • Video Recording 0.05
  • Pearson PLC
  • Printed Material 92.98
  • Microform 2.82
  • Computer File 2.15
  • Video Recording 0.70
  • Sound Recording 0.67
  • Reed Elsevier PLC
  • Printed Material 92.31
  • Computer File 5.46
  • Microform 1.85
  • Video Recording 0.14

23
WorldCat Publisher Profiles: Conspectus Divisions (%)
  • Pearson PLC
  • Language/ Literature 18.67
  • Business/ Economics 13.30
  • Computer Science 9.42
  • Engineering 8.04
  • History 7.59
  • Mathematics 6.04
  • Education 5.64
  • Sociology 4.18
  • Philosophy/ Religion 3.81
  • Physical Sciences 2.75
  • Oxford Univ. Press
  • Language/ Literature 27.12
  • History 11.92
  • Music 9.78
  • Philosophy/ Religion 9.55
  • Business/ Economics 6.15
  • Medicine 4.36
  • Law 3.85
  • Sociology 3.75
  • Political Science 3.58
  • Biology 2.60

24
WorldCat Publisher Profiles: Conspectus
Categories (%)
  • Pearson PLC
  • English language 7.74
  • Business admin. 4.62
  • English literature 3.63
  • Economics 2.94
  • Comp. programming 2.39
  • Electrical engineering 2.24
  • Early childhood ed. 2.05
  • Computer software 1.88
  • U.S. federal law 1.80
  • Computer Science 1.54
  • Oxford Univ. Press
  • English literature 10.66
  • English language 5.86
  • Instrumental music 3.48
  • Vocal music 3.09
  • Literature on music 2.26
  • History Britain 1.82
  • Economic history 1.38
  • American lit. 1.35
  • History S. Asia 1.30
  • General history 1.29

25
WorldCat Publisher Profiles: Conspectus Subjects (%)
  • Pearson PLC
  • English modern 7.68
  • Management 2.53
  • Programming 1.74
  • Arithmetic 1.09
  • Economic theory 1.06
  • Marketing 1.06
  • General algebra 1.04
  • Accounting 0.97
  • Juvenile lit. 0.93
  • English lit. 19th c. 0.89
  • Oxford Univ. Press
  • English modern 5.57
  • English lit. prose 2.51
  • English lit. 19th c. 2.23
  • Juvenile lit. 1.06
  • English lit. poetry 1.03
  • English lit. collections 0.80
  • Biographies 0.76
  • English lit. 1900-1960 0.74
  • Shakespeare 0.68
  • Sacred choruses 0.66

26
WorldCat Publisher Profiles: Conspectus Divisions (%)
  • Reed Elsevier PLC
  • Language/ Literature 14.18
  • Law 11.78
  • Engineering 11.73
  • Business/ Economics 6.82
  • Medicine 6.50
  • Physical Sciences 5.01
  • History 4.57
  • Biology 4.32
  • Health Professions 3.70
  • Chemistry 3.51
  • Springer (Firm)
  • Computer Science 16.83
  • Engineering 15.12
  • Mathematics 12.96
  • Medicine 9.93
  • Physical Sciences 9.83
  • Biology 5.22
  • Business/ Economics 5.13
  • Health Professions 4.48
  • Chemistry 3.14
  • Geography 2.58

27
WorldCat Publisher Profiles: Conspectus
Categories (%)
  • Reed Elsevier PLC
  • English literature 5.84
  • Health professions 3.40
  • English language 2.79
  • U.S. federal law 2.32
  • General engineering 2.26
  • Electrical engineering 2.10
  • General law 1.70
  • Industrial economics 1.65
  • Business admin. 1.53
  • U.S. state law 1.46
  • Springer (Firm)
  • Computer science 5.23
  • General math 4.48
  • Health professions 4.03
  • Electrical engineering 3.73
  • General engineering 3.25
  • Mathematical analysis 3.06
  • Computer software 2.37
  • Comp. programming 2.34
  • Probability/ Statistics 2.20
  • Mech. engineering 2.17

28
WorldCat Publisher Profiles: Conspectus Subjects (%)
  • Reed Elsevier PLC
  • English modern 2.68
  • English - prose 2.06
  • Health professions 1.92
  • U.S. state law 1.37
  • Industrial management 1.22
  • Legal periodicals 1.16
  • English lit. - 1900-1960 1.15
  • Engineering materials 0.86
  • English fiction 0.83
  • Nuclear physics 0.68
  • Springer (Firm)
  • Health professions 3.56
  • Math collections 2.76
  • Computer science 1.84
  • Programming 1.46
  • Access/ security 1.10
  • Artificial intelligence 1.03
  • Mathematical stats 1.03
  • Analytical physics 1.02
  • Industrial management 0.99
  • Engineering materials 0.90

29
Projected MARC coding of Authorized Forms
  • 710 Added Entry-Corporate Name
  • Add $4 for publisher name
  • Add $2 NAF where preferred form matches existing
    authority record (44% of current PNAF)
  • 752 Added Entry-Hierarchical Place Name
  • Add $2 FAST where place of publication matches
    FAST geographical subject headings

30
Ongoing Research
  • Further data mining
  • Profile other aspects of publication output
  • Profile other publishers
  • Trends over time
  • Author clusters
  • Geographic holdings patterns
  • Collection Analysis

31
Ongoing Research
  • Plan for long-term maintenance
  • ISBN-13 compliance
  • File expansion for ongoing merger/acquisition
    activities
  • Deeper scaling into WorldCat (beyond ISBN)
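
ISBN-13 compliance matters here because prefixes mined from ISBN-10s need an ISBN-13 equivalent. The standard conversion prepends 978 to the first nine digits and recomputes the check digit with alternating weights of 1 and 3. A minimal sketch, with hypothetical function names:

```python
def isbn13_check_digit(first12):
    """Compute the ISBN-13 check digit from the first 12 digits (weights 1, 3, 1, 3, ...)."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)

def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 (hyphens or spaces allowed) to an unhyphenated ISBN-13."""
    digits = isbn10.replace("-", "").replace(" ", "")
    # The old check digit (which may be 'X') is discarded, so only the
    # first nine characters must be numeric.
    if len(digits) != 10 or not digits[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    first12 = "978" + digits[:9]
    return first12 + isbn13_check_digit(first12)
```

After conversion, an ISBN-10 publisher prefix such as 0-19 is carried under the 978 element as 978-0-19, so a prefix table keyed on ISBN-10 prefixes would need the same treatment.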

32
OCLC Publisher Name Server
  • Project page
  • http://www.oclc.org/research/projects/publisherns/

33
Thank You!
  • Questions and Discussion
  • Lynn Silipigni Connaway connawal_at_oclc.org
  • Timothy J. Dickey dickeyt_at_oclc.org