Caltech Theses Collection Usage Analysis

Provided by: Troy101

Transcript and Presenter's Notes

2
Caltech Theses Collection Usage Analysis
  • Ed Sponsler
  • George Porter
  • Betsy Coles
  • California Institute of Technology
  • Library System

3
Three Kinds of Lies
  • White Lies
  • Damned Lies
  • Statistics

4
The Devil's in the Data's Details
5
Examining the Data's Details
  • Study the data: What created it? Human? Computer?
    What does it mean?
  • WRONG: How can the data address my questions?
  • RIGHT: What questions can the data address?

6
Let's Put Some Honesty into Statistics
7
Caltech Theses Facts
  • First Digital Deposit: July 2001
  • Number of Theses: 1208
  • Software Used: VT ETDdb (but not for much longer)
  • Campus Mandate: June 2002
  • Defense Date Range: 1922 to present

8
Caltech Theses Statistics
  • Data Source: Apache Web Logs
  • What is an access?
  • What can be ignored and why?
  • What do human vs. robot accesses look like?
  • What is a referrer? User Agent? Host IP?
    Requested Object?

9
Apache Combined Log Format
  • 63.89.199.36 - - [21/Jul/2003:12:53:01 -0700]
    "GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1"
    200 15767
    "http://etd.caltech.edu/etd/available/etd-2182002-190040/"
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
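
A minimal sketch of how a line in this format could be split into the fields named on the previous slide (host IP, requested object, referrer, user agent). The regular expression and the function name parse_line are illustrative assumptions, not the presenters' actual tooling.

```python
import re

# Field layout of the Apache "combined" format:
# host ident user [time] "request" status bytes "referrer" "user agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of the log fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request is "METHOD path PROTOCOL"; the path is the requested object.
    parts = entry["request"].split()
    entry["path"] = parts[1] if len(parts) >= 2 else ""
    return entry


# The example entry from the slide, joined back into a single log line.
sample = (
    '63.89.199.36 - - [21/Jul/2003:12:53:01 -0700] '
    '"GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1" '
    '200 15767 '
    '"http://etd.caltech.edu/etd/available/etd-2182002-190040/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"'
)
entry = parse_line(sample)
```

Named groups keep later filtering steps readable, since they can refer to entry["agent"] or entry["referrer"] rather than positional indexes.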

10
DeDupe
The dedupe filter counts each host's access to a
thesis only once. Duplicate requests are
ignored, even if the request is for a different
file from the same thesis, such as a different
chapter.
11
DeDupe
The result of the dedupe filter is an access_log
containing at most one log entry for each unique
host that has accessed any file of a given thesis.
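
The filter described above can be sketched as follows, assuming each log entry has already been reduced to a (host, path) pair and that a thesis is identified by the "etd-..." directory in its path. The function name dedupe, the regular expression, and the sample paths are hypothetical; the presenters' actual script is not shown in the slides.

```python
import re
from collections import defaultdict

# Assumption: the thesis ID is the "etd-..." directory component of the path.
THESIS_ID = re.compile(r"/(etd-[^/]+)/")


def dedupe(entries):
    """Keep only the first entry for each (host, thesis) pair.

    `entries` is an iterable of (host, path) tuples in log order.
    Returns the kept entries and the thesis -> set-of-hosts mapping.
    """
    hosts_by_thesis = defaultdict(set)
    kept = []
    for host, path in entries:
        match = THESIS_ID.search(path)
        if match is None:
            continue  # not a thesis file, so it is not counted
        thesis = match.group(1)
        if host in hosts_by_thesis[thesis]:
            continue  # this host already accessed this thesis
        hosts_by_thesis[thesis].add(host)
        kept.append((host, path))
    return kept, hosts_by_thesis


# Hypothetical entries: the second is a different chapter of the same thesis.
log = [
    ("131.212.13.22", "/etd/available/etd-3493/unrestricted/thesis.pdf"),
    ("131.212.13.22", "/etd/available/etd-3493/unrestricted/chapter2.pdf"),
    ("124.24.21.1", "/etd/available/etd-3493/unrestricted/thesis.pdf"),
    ("131.212.13.22", "/etd/available/etd-1139/unrestricted/thesis.pdf"),
]
kept, hosts_by_thesis = dedupe(log)
```

A set per thesis keeps the membership test cheap, and the same host is still counted once for each distinct thesis it accesses.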
12
DeDupe Data Structure
Theses ID:  etd-3493       etd-1139     etd-944
Host IP:    131.212.13.22  124.24.21.1  145.46.55.6
Host IP:    131.212.13.22  133.25.5.12  154.21.78.9
Host IP:    131.215.12.22  133.42.3.99  101.24.21.99
access_log:
  131.212.13.22 - - 21/Jul/2003:12
  124.24.21.1 - - 12/Aug/2003:15
  145.46.55.6 - - 05/Sep/2003:05
  131.212.13.22 - - 20/Sep/2003:04
  133.25.5.12 - - 28/Sep/2003:11
  154.21.78.9 - - 03/Oct/2003:09
  131.215.12.22 - - 05/Jan/2004:02
  133.42.3.99 - - 09/Jan/2004:07
  101.24.21.99 - - 14/Feb/2004:01
13
DeDupe Processing
14
Apache Status Codes
15
User Agents
16
User Agents
  • Known Human Users 71% (Internet Explorer 60%, Netscape 11%)
  • Bots/Harvesters/Other 29% (Googlebot 14%, Other 15%)
17
Search Servers
18
PDF Downloads from 7/1/2001 - 5/31/2004
19
Country of Origin Report
The GeoIP database contains IP blocks and their
country of origin. It is more useful and complete
than top-level domain names (.edu, .de, .uk, etc.).
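
As a rough illustration of the block-to-country idea, the sketch below matches a host IP against a small table of IP blocks using Python's standard ipaddress module. The blocks are reserved documentation ranges paired with made-up countries, not real GeoIP data, and country_of is a hypothetical helper rather than the GeoIP product's API.

```python
import ipaddress
from collections import Counter

# Hypothetical table: documentation-only address blocks with invented
# country assignments, standing in for a real GeoIP database.
BLOCKS = [
    (ipaddress.ip_network("192.0.2.0/24"), "United States"),
    (ipaddress.ip_network("198.51.100.0/24"), "Germany"),
    (ipaddress.ip_network("203.0.113.0/24"), "Japan"),
]


def country_of(ip):
    """Return the country of the first block containing `ip`, else 'Unknown'."""
    address = ipaddress.ip_address(ip)
    for network, country in BLOCKS:
        if address in network:
            return country
    return "Unknown"


# Tally accesses per country for a few illustrative host IPs.
hosts = ["192.0.2.10", "192.0.2.77", "203.0.113.5", "10.0.0.1"]
report = Counter(country_of(ip) for ip in hosts)
```

A linear scan is fine for a handful of blocks; a full database would more likely use a sorted list of ranges with binary search.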
20
Geographic Analysis: 153 countries represented
  • United States 76294
  • China 7943
  • Germany 4763
  • United Kingdom 4646
  • Canada 3918
  • India 3328
  • Japan 3271
  • France 2887
  • Italy 2066
  • Taiwan 2063
  • Korea 1639
  • Spain 1300
  • Australia 1249
  • Netherlands 1239
  • Iran 1208
  • Malaysia 1160
  • Hong Kong 1007
  • Turkey 961
  • Brazil 860
  • Poland 853
  • Singapore 847
  • Russian Fed. 812
  • Switzerland 810
  • Sweden 759
  • Israel 743
  • Belgium 735
  • Mexico 724
  • Thailand 648
  • Egypt 542
  • Greece 511
  • Romania 480
  • Vietnam 455
  • Indonesia 451
  • Portugal 438
  • Finland 419
  • Philippines 418

21
Most Popular Theses
Count  Defense Date
3322   2000-10-23
3199   2002-08-07
3174   2002-07-16
2457   2001-10-23
2153   2002-10-02
2120   2002-09-25
2098   2001-05-18
2073   2002-10-04
1959   2002-11-05
1848   2003-01-14
1675   2002-08-14
1614   2002-05-02
1486   2002-09-04
1378   2003-09-02
1304   2001-02-09
1296   2003-05-15
1176   2003-05-15
1134   2001-05-07
1130   2002-01-16
1124   2001-03-08
1123   2003-06-02
1091   2001-01-19
1087   2003-03-20
22
Most Popular Theses
Defense Date  Title (>1000 downloads)
2000-10-23    Blocking Adhesion to Cell and Tissue Surfaces via Steric Stabilization with Graft Copolymers containing Poly(Ethylene Glycol) and Phenylboronic Acid
2002-08-07    Electrochemical Sensors Based on DNA-Mediated Charge Transport Chemistry
2002-07-16    Effects of Surface Modification on Charge-Carrier Dynamics at Semiconductor Interfaces
2001-10-23    I. Seafloor Morphology of the Osbourn Trough and Kermadec Trench and II. Multiscale Dynamics of Subduction Zones
2002-10-02    I. Structure-Function Analysis of the Mechanosensitive Channel of Large Conductance. II. Design of Novel Magnetic Materials using Crystal Engineering.
23
Most Popular Theses
Defense Date  Title
2002-09-25    Modeling a Hox Gene Network: Stochastic Simulation with Experimental Perturbation
2001-05-18    All-Optical Logic Circuits based on the Polarization Properties of Non-Degenerate Four-Wave Mixing
2002-10-04    Site-specific incorporation of synthetic amino acids into functioning ion channels
2002-11-05    Impact-Ionization Mass Spectrometry of Cosmic Dust
2003-01-14    Force-Detected Nuclear Magnetic Resonance Independent of Field Gradients
2002-08-14    Fast, High-Order Methods for Scattering by Inhomogeneous Media
2002-05-02    Neural dynamics underlying complex behavior in a songbird
2002-09-04    Spectroscopic Characterization of DNA-mediated Charge Transfer
24
Most Popular Theses
Defense Date  Title
2003-09-02    Protein Engineering Through in vivo Incorporation of Phenylalanine Analogs
2001-02-09    Synthesis, Passivation and Charging of Silicon Nanocrystals
2003-05-15    Sensitizer-linked substrates as probes of heme enzyme structure and catalysis
2003-05-15    Mirror Thermal Noise in Interferometric Gravitational Wave Detectors
2001-05-07    Analysis and Design of Turbo-like Codes
2002-01-16    Computational Enzyme Design
2001-03-08    An Investigation of Ion Engine Erosion by Low Energy Sputtering
2003-06-02    Laboratory Evolution of Cytochrome P450 Peroxygenase Activity
2001-01-19    Passive Hypervelocity Boundary Layer Control Using an Acoustically Absorptive Surface
2003-03-20    Mapping the cytochrome c folding landscape
25
Human / Robot Split
Human activity is identified by "MSIE" or
"Mozilla" in the User Agent field of the
apache_log.
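
The rule on this slide amounts to a substring test on the User Agent field. A minimal sketch, where is_human is a hypothetical name and the three user agent strings are illustrative examples rather than entries from the Caltech logs:

```python
# Markers named on the slide; any user agent containing one counts as human.
HUMAN_MARKERS = ("MSIE", "Mozilla")


def is_human(user_agent):
    """Apply the slide's heuristic: True if the agent mentions MSIE or Mozilla."""
    return any(marker in user_agent for marker in HUMAN_MARKERS)


# Illustrative user agent strings.
agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "Wget/1.9",
]
humans = [agent for agent in agents if is_human(agent)]
robots = [agent for agent in agents if not is_human(agent)]
```

Some crawlers also send a Mozilla-style string, so a stricter filter would probably check for known robot names before applying this test.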
26
Referrers by Human Use: MSIE, Mozilla
  • etd.caltech.edu 33%
  • www.google.com 32%
  • search.yahoo.com 8%
  • www.google.de 3%
  • all others <2% (each)
  • 492 total referrers
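
Shares like those listed above can be computed by reducing each referrer URL to its host name and counting. This is a sketch under the assumption that the referrer field has already been extracted from each human access; referrer_shares and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def referrer_shares(referrers):
    """Map each referring host to its percentage of all accesses."""
    # Apache writes "-" when no referrer was sent; urlsplit gives it no hostname.
    hosts = Counter(urlsplit(ref).hostname or "(none)" for ref in referrers)
    total = sum(hosts.values())
    return {host: 100.0 * count / total for host, count in hosts.most_common()}


# Illustrative referrer values, not taken from the actual logs.
referrers = [
    "http://etd.caltech.edu/etd/available/etd-12182002-190040/",
    "http://etd.caltech.edu/ETD-db/",
    "http://www.google.com/search?q=caltech+thesis",
    "-",
]
shares = referrer_shares(referrers)
```

Grouping by host rather than full URL is what lets many different search result pages collapse into a single www.google.com entry.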

27
Most Active Robots Since April, 2004
  • Googlebot 3524
  • Googlebot/Test 1100
  • TurnitinBot 362
  • Wget 252
  • msnbot 162
  • DA 41
  • Contype 36
  • ia_archiver 33
  • FAST-WebCrawler 18
  • NPBot 16
  • NetAnts 16

28
Summary
  • Keep Statistics Honest: understand and scrub your
    data before analysis
  • Google is key for discovery
  • Theses are popular because they are new and have
    useful content

29
Next Steps
  • Compare download frequencies, not just totals
  • Create local IP -> domain name database
  • Adapt DeDupe to CODA EPrints Archives
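
One possible reading of the first step, comparing frequencies rather than totals, is to divide each thesis's download count by the number of days it could have been downloaded. The sketch below assumes a thesis was available from its defense date or from the start of the reporting period (7/1/2001), whichever is later, and uses the period end of 5/31/2004; that availability rule and the name downloads_per_day are assumptions, since actual deposit dates are not given in the slides.

```python
from datetime import date

# Reporting period from the "PDF Downloads" slide.
PERIOD_START = date(2001, 7, 1)
PERIOD_END = date(2004, 5, 31)


def downloads_per_day(count, defense_date):
    """Average downloads per day over the assumed availability window."""
    # Assumption: availability starts at the later of defense date and period start.
    start = max(defense_date, PERIOD_START)
    days = max((PERIOD_END - start).days, 1)  # avoid dividing by zero
    return count / days


# Two rows from the Most Popular Theses table: (count, defense date).
older = downloads_per_day(3322, date(2000, 10, 23))
newer = downloads_per_day(1378, date(2003, 9, 2))
```

Under these assumptions the thesis with fewer total downloads has the higher daily rate, which is the kind of difference a totals-only ranking hides.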

30
Caltech Library Systems Online Digital Archives
  • Theses
  • http://etd.caltech.edu
  • All Archives
  • http://coda.caltech.edu