Co-developing access to the UK Web Archive

Title:

Co-developing access to the UK Web Archive

Description:

Co-developing access to the UK Web Archive. Helen Hockx-Yu, Head of Web Archiving, British Library. Ten years of archiving the UK Web Archive. Started web archiving in ...

Transcript and Presenter's Notes

Title: Co-developing access to the UK Web Archive


1
Co-developing access to the UK Web Archive
  • Helen Hockx-Yu
  • Head of Web Archiving, British Library

2
Ten years of archiving the UK Web Archive
  • Started web archiving in 2004, non-print Legal
    Deposit since April 2013
  • Three collections: over six billion resources and
    over 100TB of compressed data
  • Focus not just on content collection
  • Proactive development of access and use, through
    close engagement with researchers
  • User survey
  • Content selection and curation
  • Brain-storming sessions and workshops to
    formulate research questions
  • Research projects

3
JISC UK Web Domain Dataset 1996-2013
  • Funded by JISC to create a research collection of
    historical UK websites
  • Collaboration between the Internet Archive, JISC
    and the British Library
  • Copy of the subset of the Internet Archive's web
    collection that relates to the UK
  • c.300 million resources, 60TB in total
  • No local access possible through the Internet
    Archive
  • Can be used to generate secondary datasets

4
Co-design at every stage
  • Research use case articulated
  • Generic user requirements abstracted
  • Requirements refined following feedback
  • Iterative development cycles: develop -> user
    testing -> feedback -> develop

5
Use cases (generalised)
  • Full-text/facet search -> individual resource
  • Full-text/facet search -> analysis/visualisation
  • Search -> corpus creation -> annotation/curation
  • Corpus creation -> full-text search -> individual
    resource
  • Corpus -> search -> analysis/visualisation
  • Derived datasets -> take-away
  • Direct access to WARC/CDX -> take-away
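The "direct access to WARC/CDX" use case can be illustrated with a minimal sketch of reading a CDX index and counting captures per year. It assumes the common 11-field, space-separated layout (`N b a m s k r M S V g`); real CDX files declare their own layout in a header line, so the field order here is an assumption, and the names `CdxRecord`, `parse_cdx` and `captures_per_year` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class CdxRecord(NamedTuple):
    """One capture, assuming the 11-field 'N b a m s k r M S V g' layout."""
    urlkey: str      # canonicalised URL key
    timestamp: str   # 14-digit capture time, YYYYMMDDhhmmss
    original: str    # URL as crawled
    mimetype: str
    status: str
    digest: str
    redirect: str
    metatags: str
    length: str      # compressed record size
    offset: str      # offset of the record in the (W)ARC file
    filename: str    # (W)ARC file holding the record


def parse_cdx(lines: Iterable[str]) -> List[CdxRecord]:
    """Parse CDX lines, skipping blanks, the header and malformed rows."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("CDX"):
            continue  # blank line or ' CDX N b a ...' header
        fields = line.split(" ")
        if len(fields) != 11:
            continue  # layout differs from the assumed one
        records.append(CdxRecord(*fields))
    return records


def captures_per_year(records: Iterable[CdxRecord]) -> Counter:
    """Count captures by the year prefix of the timestamp."""
    return Counter(r.timestamp[:4] for r in records)
```

A derived dataset such as a per-year capture count is small enough to take away, while the `filename` and `offset` fields point back to the full record in the WARC file.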

6
High-level requirements
  • Query building
  • Corpus formation and handling
  • Annotation and curation
  • In-corpus analysis
  • Whole-dataset analysis

7
Prototype: Shine
  • Full-text search, with proximity options, and to
    exclude specified text strings
  • Apply and remove multiple facet filters to result
    sets
  • Content type, public suffix, domain, crawl year
  • Also available: postcode, links to public suffix,
    language, links to domains
  • Exclude single resources, or whole hosts from
    result sets
  • Save a query
  • Export basic query results, as CSV or similar
  • Available at http://webarchive.org.uk/shine
  • https://github.com/ukwa/shine
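The facet, exclusion and export features listed above can be sketched over in-memory records. This is not Shine's implementation or API: the record keys (`url`, `host`, `content_type`, `crawl_year`) and the function names `apply_facets` and `to_csv` are hypothetical, chosen only to show how the filtering steps combine.

```python
import csv
import io


def apply_facets(records, filters, excluded_hosts=frozenset(),
                 excluded_urls=frozenset()):
    """Keep records that match every facet filter and are not excluded.

    `filters` maps a facet name to the set of accepted values, e.g.
    {"crawl_year": {2007}}. An empty mapping keeps everything.
    """
    kept = []
    for rec in records:
        if rec["host"] in excluded_hosts or rec["url"] in excluded_urls:
            continue  # whole host or single resource excluded by the user
        if all(rec.get(facet) in allowed for facet, allowed in filters.items()):
            kept.append(rec)
    return kept


def to_csv(records, columns):
    """Serialise the chosen columns of a result set as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

Adding or removing an entry in `filters` corresponds to applying or removing a facet; saving a query would amount to storing the search terms together with `filters` and the two exclusion sets.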

8
Advanced Search

9
Ngram
  • Same search terms, different datasets
  • Broadly similar trends
  • Interesting to examine turning point
  • Not useful without understanding of scope
  • Visualisation not the end point
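A trend line of this kind can be sketched as the share of documents per crawl year that mention a phrase. This is an illustrative simplification, not the method behind the Shine graphs: the function name `phrase_trend` and the `(crawl_year, text)` input shape are assumptions. Dividing by the yearly total is one way to account for scope, since raw counts mostly reflect how much was crawled in a given year.

```python
from collections import Counter


def phrase_trend(docs, phrase):
    """Return {year: fraction of that year's documents containing phrase}.

    `docs` is an iterable of (crawl_year, text) pairs. Matching is a
    case-insensitive substring test, which is cruder than real tokenised
    n-gram matching.
    """
    phrase = phrase.lower()
    totals = Counter()
    hits = Counter()
    for year, text in docs:
        totals[year] += 1
        if phrase in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}
```

Keeping the per-year hit lists alongside the fractions would let a user move from a point on the graph back to the underlying pages, which is where the analysis actually starts.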

10
Pages mentioning Gordon Brown (2007)

11
Trends analysis

12
Access to data supporting trends

13
Next steps
  • Inclusion of the full JISC dataset: seamless
    interface to all 3 components of the UK Web Archive
  • Better support for corpus creation (e.g.
    combination of existing corpora)
  • Annotation and sharing of corpus
  • (standard) analysis and visualisation of corpus
  • Faceted search within user-defined corpus
  • (semantic) clustering of search results

14
Lessons learnt
  • A learning process for both
  • Not a choice between big data or small data
  • Macroscope of the UK web history
  • a single data point, .. both visualised at scale
    in the context of a billion other data points,
    and drilled down to its smallest compass
  • Context and paratext just as important
  • User expectation / assumption
  • Maximum transparency
  • Scale remains a challenge