Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective
1
Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective
Johan Bollen
Digital Library Research Prototyping Team
Los Alamos National Laboratory - Research Library
jbollen@lanl.gov
Acknowledgements: Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL),
Lyudmila L. Balakireva (LANL), Wenzhong Zhao (LANL), Aric Hagberg (LANL)
Research supported by the Andrew W. Mellon Foundation.
NISO Usage Data Forum, November 1-2, 2007, Dallas, TX
2
Batgirl: "Who's going to be Batman now?"
3
Usage data has arrived.
  • Value of usage data/statistics is undeniable
  • Business intelligence
  • Scholarly assessment
  • Monitoring of scholarly trends
  • Enhanced end-user services
  • Production chain
  • Recording
  • Aggregation
  • Analysis
  • Challenges and opportunities at all links in
    the chain.
  • But of particular interest to this workshop:
  • Recording: data models, standardization
  • Aggregation: restricted sampling vs.
    representativeness
  • Analysis: frequentist vs. structural

This presentation: an overview of lessons learned
by the MESUR project at LANL. Technical project
details follow later this afternoon.
4
Structure of presentation.
  1. MESUR introduction and overview
  2. Aggregation: value proposition
  3. Data models
  4. Usage data representation
  5. Aggregation frameworks
  6. Usage data providers: technical and
    socio-cultural issues, a survey
  7. Sampling strategies: theoretical issues
  8. Discussion: way forward, roadmap

5
Structure of presentation.
  1. MESUR introduction and overview
  2. Aggregation: value proposition
  3. Data models
  4. Usage data representation
  5. Aggregation frameworks
  6. Usage data providers: technical and
    socio-cultural issues, a survey
  7. Sampling strategies: theoretical issues
  8. Discussion: way forward, roadmap

6
The MESUR project.
Johan Bollen (LANL): Principal investigator.
Herbert Van de Sompel (LANL): Architectural consultant.
Aric Hagberg (LANL): Mathematical and statistical consultant.
Marko Rodriguez (LANL): PhD student (Computer Science, UCSC).
Lyudmila Balakireva (LANL): Database management and development.
Wenzhong Zhao (LANL): Data processing, normalization and ingestion.
The Andrew W. Mellon Foundation has awarded a grant to Los Alamos
National Laboratory (LANL) in support of a two-year project that will
investigate metrics derived from the network-based usage of scholarly
information. The Digital Library Research Prototyping Team of the LANL
Research Library will carry out the project. The project's major
objective is enriching the toolkit used for the assessment of the
impact of scholarly communication items, and hence of scholars, with
metrics that derive from usage data.
7
Project data flow and work plan.
[Diagram: project data flow and work plan, with numbered phases:
(1) negotiation, (2) ingestion, (3) aggregation, (4a) reference data
set, (4b) metrics survey]
8
Project timeline.
We are here!
9
Lecture focus: MESUR lessons learned
10
Structure of presentation.
  1. MESUR introduction and overview
  2. Aggregation: value proposition
  3. Data models
  4. Usage data representation
  5. Aggregation frameworks
  6. Usage data providers: technical and
    socio-cultural issues, a survey
  7. Sampling strategies: theoretical issues
  8. Discussion: way forward, roadmap

11
Lesson 1: aggregation adds significant value.
Focus on institution-oriented services and
analysis
Multiple institutions, each with focus on
institution-oriented services and analysis
Note: MESUR is about analysis, not services
  • Aggregating usage data provides
  • Validation: reliability and span
  • Perspective: group-based services and analysis

12
Aggregation value: validation
  • Aggregating usage data provides
  • Reliability: overlaps and intersections
  • Validation: error checking
  • Reliability: repeated measurements
  • Span
  • Diversity: multiple communities
  • Representativeness: sample of scholarly community
13
Aggregation value: convergence.
CSU    Rank  Usage PR  IF (2003)  Title (abbv.)
       1     78.565    21.455     JAMA-J AM MED ASSOC
       2     71.414    29.781     SCIENCE
       3     60.373    30.979     NATURE
       4     40.828     3.779     J AM ACAD CHILD PSY
       5     39.708     7.157     AM J PSYCHIAT

LANL   Rank  Usage PR  IF (2003)  Title (abbv.)
       1     60.196     7.035     PHYS REV LETT
       2     37.568     2.950     J CHEM PHYS
       3     34.618     1.179     J NUCL MATER
       4     31.132     2.202     PHYS REV E
       5     30.441     2.171     J APPL PHYS

MSR    Rank  Usage PR  IF (2005)  Title (abbv.)
       1     15.830    30.927     SCIENCE
       2     15.167    29.273     NATURE
       3     12.798    10.231     PNAS
       4     10.131     0.402     LECT NOTES COMP SCI
       5      8.409     5.854     J BIOL CHEM
  • Convergence!
  • Open research questions
  • Is this guaranteed?
  • To what? A common baseline?
  • What we do know
  • Institutional perspective can be contrasted to a
    baseline.
  • As aggregation increases in size, so does value.
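The degree of convergence between usage-based rankings and Impact Factor rankings can be checked with a rank correlation. A minimal Python sketch using the MSR top-5 rows from the table above (tie-free Spearman formula; this is illustrative, not MESUR's actual methodology):

```python
def rank(values):
    """Ascending 1-based ranks; ties ignored for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = position + 1
    return ranks

def spearman(x, y):
    """Tie-free Spearman rank correlation: 1 - 6*sum(d^2) / (n(n^2-1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank(x), rank(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# MSR top-5 rows from the table: usage PageRank vs. 2005 Impact Factor.
usage_pr = [15.830, 15.167, 12.798, 10.131, 8.409]
impact_factor = [30.927, 29.273, 10.231, 0.402, 5.854]
rho = spearman(usage_pr, impact_factor)  # 0.9: strongly, not perfectly, aligned
```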

14
We have COUNTER/SUSHI. What about the aggregation
of item-level usage data?
If there is value in aggregating COUNTER and
other reports, there is considerable value in
aggregating item-level usage data.
15
Structure of presentation.
  1. MESUR introduction and overview
  2. Aggregation: value proposition
  3. Data models
  4. Usage data representation
  5. Aggregation frameworks
  6. Usage data providers: technical and
    socio-cultural issues, a survey
  7. Sampling strategies: theoretical issues
  8. Discussion: way forward, roadmap

16
Lesson 2: we need a standardized representation
framework for item-level usage data.
  • MESUR: ad hoc parsing is highly problematic
  • - field semantics
  • - field relations
  • - various data models
  • Framework objectives
  • Minimize data loss
  • Preserve event info
  • Preserve sequence info
  • Preserve document metadata
  • Realistic scalability and granularity
  • Apply to a variety of usage data, i.e. no inherent
    bias towards a specific type of usage data

17
Requirements for usage data representation
framework.
  • Needs to minimally represent the following concepts:
  • Event ID: distinguish usage events
  • Referent ID: DOI, SICI, metadata
  • User/Session ID: define groups of events related
    by user
  • Date and time: identify date and time of event
  • Request type: identify type of request issued by
    user
  • Implications
  • Sequence: session ID and date/time preserve
    sequence
  • Privacy: session ID groups events without user ID
  • Request types: filter on types of usage
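The five required concepts can be captured in a minimal record type. A sketch in Python; the field names and types are my own illustration, not a proposed standard:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class UsageEvent:
    """Minimal usage-event record covering the five required concepts."""
    event_id: str      # globally unique identifier for this event
    referent_id: str   # e.g. a DOI such as "info:doi/10.1000/xyz"
    session_id: str    # groups related events without identifying the user
    timestamp: datetime
    request_type: str  # e.g. "full-text", "abstract", "citation-export"

def events_in_sequence(events):
    """Reconstruct per-session click sequences from session ID + timestamp."""
    by_session = {}
    for e in sorted(events, key=lambda e: e.timestamp):
        by_session.setdefault(e.session_id, []).append(e.referent_id)
    return by_session
```

Session ID plus timestamp is enough to preserve sequence, which is what the structural analyses later in the talk rely on.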

18
COUNTER reports: information loss
From www.niso.org/presentations/MEC06-03-Shepherd.pdf
  1. Event ID: distinguish usage events ?
  2. Referent ID: DOI, SICI, metadata
  3. User/Session ID: define groups of events related
    by user ?
  4. Date and time: identify date and time of event
  5. Request type: identify type of request issued by
    user

19
From the same usage data: journal usage graphs
20
From the same data: article usage graphs
21
Usage graphs connect this domain to 50 years of
network science
  • social network analysis
  • small world graphs
  • network science
  • graph theory
  • social modeling
  • Good reads:
  • Barabasi (2003). Linked.
  • Wasserman (1994). Social network analysis.

Heer (2005) - Large-Scale Online Social Network
Visualization
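A usage graph of the kind shown on the preceding slides can be built directly from session data: items requested within the same session become linked nodes. A minimal, illustrative Python construction (real usage graphs would weight, window, and filter edges more carefully):

```python
from collections import defaultdict
from itertools import combinations

def build_usage_graph(sessions):
    """Weighted co-usage graph: two items are linked when they appear in
    the same session; the weight counts co-occurring sessions."""
    weights = defaultdict(int)
    for items in sessions.values():
        # Each unordered pair of distinct items in a session adds one edge.
        for a, b in combinations(sorted(set(items)), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Toy sessions: lists of journal titles requested within one session.
sessions = {
    "s1": ["NATURE", "SCIENCE", "NATURE"],
    "s2": ["SCIENCE", "PNAS"],
}
graph = build_usage_graph(sessions)
```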
22
Structure of presentation.
  1. MESUR introduction and overview
  2. Aggregation: value proposition
  3. Data models
  4. Usage data representation
  5. Aggregation frameworks
  6. Usage data providers: technical and
    socio-cultural issues, a survey
  7. Sampling strategies: theoretical issues
  8. Discussion: way forward, roadmap

23
Lesson 3: aggregating item-level usage data
requires a standardized aggregation framework.
  • Standardization objectives similar to work done
    for COUNTER and SUSHI
  • Serialization (cf. COUNTER)
  • A standard to serialize usage data
  • Suitable for large-scale, open aggregation
  • Event provenance and identification
  • Transfer protocol (cf. SUSHI)
  • Communication of usage data between log archive
    and aggregator
  • Allow open aggregation across stakeholders in the
    scholarly community
  • Privacy standard
  • Standards need to address privacy concerns
  • Should allow emergence of trusted intermediaries:
    an aggregation ecology

LANL has made proposals based on community
standards
24
OpenURL ContextObject to represent usage data
  <?xml version="1.0" encoding="UTF-8"?>
  <ctx:context-object timestamp="2005-06-01T10:22:33Z"
      identifier="urn:UUID:58f202ac-22cf-11d1-b12d-002035b29062">
    <ctx:referent>
      <ctx:identifier>info:pmid/12572533</ctx:identifier>
      <ctx:metadata-by-val>
        <ctx:format>info:ofi/fmt:xml:xsd:journal</ctx:format>
        <ctx:metadata>
          <jou:journal xmlns:jou="info:ofi/fmt:xml:xsd:journal">
            <jou:atitle>Isolation of common receptor for coxsackie B ...
            <jou:jtitle>Science</jou:jtitle>
            ...
    </ctx:referent>
    <ctx:requester>
      <ctx:identifier>urn:ip:63.236.2.100</ctx:identifier>
    </ctx:requester>
    <ctx:service-type>
      <full-text>yes</full-text>
    </ctx:service-type>
    ... (Resolver, Referrer)
  </ctx:context-object>

Event information: event datetime, globally unique event ID
Referent: identifier, metadata
Requester: user or user proxy (IP, session)
ServiceType
Resolver: identifier of linking server
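An aggregator consuming ContextObject-serialized events might extract the fields above as follows. A hedged sketch using Python's standard ElementTree; the ctx namespace URI is assumed from the OpenURL standard (ANSI/NISO Z39.88-2004), not stated on the slide:

```python
import xml.etree.ElementTree as ET

# Assumed ContextObject namespace URI (per ANSI/NISO Z39.88-2004).
NS = {"ctx": "info:ofi/fmt:xml:xsd:ctx"}

def parse_context_object(xml_text):
    """Pull the core usage-event fields out of one serialized event."""
    root = ET.fromstring(xml_text)
    return {
        "event_id": root.get("identifier"),
        "timestamp": root.get("timestamp"),
        "referent": root.findtext("ctx:referent/ctx:identifier", namespaces=NS),
        "requester": root.findtext("ctx:requester/ctx:identifier", namespaces=NS),
    }

# Minimal well-formed fragment mirroring the slide's example.
sample = (
    '<ctx:context-object xmlns:ctx="info:ofi/fmt:xml:xsd:ctx" '
    'timestamp="2005-06-01T10:22:33Z" '
    'identifier="urn:UUID:58f202ac-22cf-11d1-b12d-002035b29062">'
    '<ctx:referent><ctx:identifier>info:pmid/12572533</ctx:identifier></ctx:referent>'
    '<ctx:requester><ctx:identifier>urn:ip:63.236.2.100</ctx:identifier></ctx:requester>'
    '</ctx:context-object>'
)
event = parse_context_object(sample)
```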
25
Aggregation framework: existing standards
[Architecture diagram: log repositories 1-3 expose usage data serialized
as OpenURL ContextObjects; a log harvester collects the logs via the
OAI-PMH transfer protocol into a log DB of aggregated usage data]
Johan Bollen and Herbert Van de Sompel. An
Architecture for the aggregation and analysis of
scholarly usage data. In Joint Conference on
Digital Libraries (JCDL2006), pages 298-307, June
2006.
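The transfer leg of this architecture can be sketched as an OAI-PMH ListRecords harvest that follows resumptionTokens until a repository's list is exhausted. An illustrative Python sketch; the repository URL and the "ctx" metadata prefix are placeholder assumptions, not values defined by the project:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"  # OAI-PMH response namespace

def extract_records(xml_bytes):
    """Parse one ListRecords response; return (records, resumption_token)."""
    root = ET.fromstring(xml_bytes)
    records = list(root.iter(f"{{{OAI}}}record"))
    token = root.findtext(f".//{{{OAI}}}resumptionToken")
    return records, (token or None)

def harvest(base_url, metadata_prefix="ctx"):
    """Yield all records from a log repository, page by page."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            records, token = extract_records(resp.read())
        yield from records
        if not token:
            return
        params = {"verb": "ListRecords", "resumptionToken": token}

# Usage (hypothetical endpoint):
#   for record in harvest("http://logs.example.org/oai"):
#       ...
```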
26
The issue of anonymization
  • Privacy and anonymization concerns play on
    multiple levels that a standard needs to address
  • Institutions: where was usage data recorded?
  • Providers: who provided usage data?
  • Users: who is the user?
  • Goes beyond naming and masking identity: simple
    statistics can reveal identity
  • User identity can be inferred from activity
    patterns (AOL search data)
  • Law enforcement issues
  • MESUR
  • Session IDs preserve sequence without any
    references to individual users
  • Negotiated filtering of usage data
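One way to realize such session IDs is a keyed hash that maps (user identifier, day) to an opaque token: events from the same user on the same day stay grouped, so sequence is preserved, but the raw identifier is never stored. A sketch under that assumption, not MESUR's actual procedure:

```python
import hashlib
import hmac
import secrets

# Per-aggregation secret salt. If it is discarded after processing, the
# mapping from raw identifiers to session IDs cannot be re-derived later.
SALT = secrets.token_bytes(32)

def anonymous_session_id(raw_user_key: str, day: str) -> str:
    """Map (user identifier, day) to an opaque session ID. Events from the
    same user on the same day share an ID; the raw identifier never leaves
    this function."""
    msg = f"{raw_user_key}|{day}".encode()
    return hmac.new(SALT, msg, hashlib.sha256).hexdigest()[:16]
```

A keyed HMAC rather than a plain hash matters here: without the secret salt, an attacker could hash candidate identifiers (e.g. IP addresses) and reverse the mapping by brute force.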

27
Structure of presentation.
  1. MESUR introduction and overview
  2. Aggregation: value proposition
  3. Data models
  4. Usage data representation
  5. Aggregation frameworks
  6. Usage data providers: technical and
    socio-cultural issues, a survey
  7. Sampling strategies: theoretical issues
  8. Discussion: way forward, roadmap

28
Lesson 4: socio-cultural issues matter.
  • MESUR's efforts
  • Negotiated with
  • Publishers
  • Aggregators
  • Institutions
  • Results
  • 1B usage events obtained
  • Usage data provided ranging from 100M to 500K
  • Spanning multiple communities and collections
  • BUT
  • 12 months
  • 550 email messages
  • Countless teleconferences
  • Correspondence reveals socio-cultural issues for
    each class of providers
  • Privacy
  • Business models
  • Ownership
  • World-view
  • Study by means of email term analysis of
    correspondence
  • Term frequencies: tag clouds
  • Term co-occurrences: concept networks

29
The world according to usage data providers:
institutions
Note: strong focus on technical matters related
to sharing SFX data
30
The world according to usage data providers:
institutions
31
The world according to usage data providers:
publishers
Note: strong focus on procedural matters related
to agreements and project objectives
32
The world according to publishers
33
The world according to usage data providers:
aggregators
Note: strong focus on procedural matters and
project/research objectives
34
The world according to usage data providers:
aggregators
35
The world according to usage data providers:
Open Access publishers
Note: strong focus on technical matters related
to sharing of data
36
The world according to usage data providers:
Open Access publishers
37
Summary diagram of provider concerns
38
Lesson 5: size isn't everything
  • Return on investment analysis
  • Value generated relative to investment
  • Investment: correspondence
  • Value: usage events loaded
  • Extracted from each provider's correspondence
    with MESUR
  • Investment
  • Delay between first and last email
  • Number of emails
  • Number of bytes in emails
  • Timelines of email intensity
  • Value
  • Number of usage events loaded
  • (bytes won't work: different formats)
  • MESUR's efforts
  • 12 months
  • 550 email messages
  • Good results, but large variation
  • Usage data provided ranging from 100M to 500K

39
Return on investment (ROI)
Value: usage events loaded. Work: (1) days between
first contact, (2) emails sent, (3) bytes communicated.
ROI = value / work

     Days   Emails   Bytes   Events (M)   Type
1    167    47       39858   149.5        A
2    6      38       18185   22.5         OAP
3    178    93       53380   25.3         P
4    126    82       43427   15.1         I
5    190    43       29304   5.6          A
6    55     70       34209   3.4          I
7    114    41       39410   1.0          A
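ROI = value / work can be computed directly from the table above. A small Python sketch using events obtained per email sent as one concrete realization of "work" (the choice of denominator is mine, for illustration; days or bytes would work the same way):

```python
providers = [
    # (days, emails, bytes, events_in_millions, provider_type) -- rows above
    (167, 47, 39858, 149.5, "A"),
    (6,   38, 18185,  22.5, "OAP"),
    (178, 93, 53380,  25.3, "P"),
    (126, 82, 43427,  15.1, "I"),
    (190, 43, 29304,   5.6, "A"),
    (55,  70, 34209,   3.4, "I"),
    (114, 41, 39410,   1.0, "A"),
]

def roi(row, work_index=1):
    """ROI = value / work; by default, work = emails sent (column 1)."""
    return row[3] / row[work_index]

# Rank providers by usage events (millions) obtained per email sent.
ranked = sorted(providers, key=roi, reverse=True)
```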
40
Return on investment (ROI)
Survivor bias! Another way to look at it:
ROI = value / work
Value: usage data obtained. Work: asking.

Type           Asked   Positive   Negative   % positive
Publishers     14      8          6          57
Aggregators    3       3          0          100
Institutions   4       4          0          100
Totals         21      14         7
41
Return on investment: ETA?
[Email-intensity timelines: All, Publisher, Aggregator]
42
Return on investment: timelines
[Email-intensity timeline: Institution]
Note: same pattern, but much less traffic.
43
Structure of presentation.
  1. MESUR introduction and overview
  2. Aggregation: value proposition
  3. Data models
  4. Usage data representation
  5. Aggregation frameworks
  6. Usage data providers: technical and
    socio-cultural issues, a survey
  7. Sampling strategies: theoretical issues
  8. Discussion: way forward, roadmap

44
Lesson 6: sampling is (seriously) difficult
  • Usage data is a sample
  • Sample of whom?
  • Different communities
  • Different collections
  • Different interfaces
  • Who to aggregate from?
  • Different providers
  • Interfaces
  • Systems
  • Representation frameworks
  • Result: difficult choices

45
A tapestry of usage data providers
  • Each represents a different, and possibly
    overlapping, sample of the scholarly community.
  • Institutions
  • Institutional communities
  • Many collections
  • Aggregators
  • Many communities
  • Many collections
  • Publishers
  • Many communities
  • Publisher collection
  • Main players
  • Individual institutions
  • Aggregators
  • Publishers
  • We will hear from several in this workshop

46
A tapestry of usage data providers
Which providers to choose? Different problems:
  • A. Technical
  • Aggregation
  • Normalization
  • Deduplication
  • B. Theoretical
  • Who: sample characteristics?
  • What: resulting aggregate?
  • Why: sampling objective?
  • C. Socio-cultural
  • World views
  • Concerns
  • Business models

47
Structure of presentation.
  1. MESUR introduction and overview
  2. Aggregation: value proposition
  3. Data models
  4. Usage data representation
  5. Aggregation frameworks
  6. Usage data providers: technical and
    socio-cultural issues, a survey
  7. Sampling strategies: theoretical issues
  8. Discussion: way forward, roadmap

48
Conclusion
  • There exists considerable interest in aggregating
    item-level usage data.
  • This requires standardization on multiple levels:
  • Recording and processing
  • Field semantics: what do the various recorded
    usage items mean?
  • Requests: standardize on request types and
    semantics?
  • Standardized representation needs to be scalable
    yet minimize information loss
  • Aggregation
  • Serialization: a standardized serialization for
    resulting usage data
  • Transfer: a standard protocol for exposing and
    harvesting usage data
  • Privacy: protect identity and rights of
    providers, institutions and users
  • Although technical issues can be resolved,
    socio-cultural issues remain
  • Worldviews, business models and ownership shape
    aggregation strategies.
  • Greatly helped by existing standards and
    infrastructure for usage recording, processing
    and aggregation
  • Theoretical questions of sampling: MESUR's
    research project will provide insights.

49
Some relevant publications.
  • Marko A. Rodriguez, Johan Bollen and Herbert Van
    de Sompel. A Practical Ontology for the
    Large-Scale Modeling of Scholarly Artifacts and
    their Usage, In Proceedings of the Joint
    Conference on Digital Libraries, Vancouver, June
    2007
  • Johan Bollen and Herbert Van de Sompel. Usage
    Impact Factor: the effects of sample
    characteristics on usage-based impact metrics.
    Journal of the American Society for Information
    Science and Technology, 59(1), pages 001-014
    (arXiv: cs.DL/0610154).
  • Johan Bollen and Herbert Van de Sompel. An
    architecture for the aggregation and analysis of
    scholarly usage data. In Joint Conference on
    Digital Libraries (JCDL2006), pages 298-307,
    June 2006.
  • Johan Bollen and Herbert Van de Sompel. Mapping
    the structure of science through usage.
    Scientometrics, 69(2), 2006.
  • Johan Bollen, Marko A. Rodriguez, and Herbert
    Van de Sompel. Journal status. Scientometrics,
    69(3), December 2006 (arXiv: cs.DL/0601030).
  • Johan Bollen, Herbert Van de Sompel, Joan Smith,
    and Rick Luce. Toward alternative metrics of
    journal impact: a comparison of download and
    citation data. Information Processing and
    Management, 41(6): 1419-1440, 2005.