The Million Book Challenge: data mining for scholarship

1
The Million Book Challenge: data mining for
scholarship
  • Alastair Dunning
  • Digitisation Programme Manager, JISC
  • 0203 006 6065
  • a.dunning_at_jisc.ac.uk

2
From Keyhole to Open Door
  • Many scholars still approach digitised data with
    the search/browse paradigm
  • All digitised resources are initially constructed
    in this way
  • E.g. British Library 19th-Century Newspapers:
    over 1m pages of text, billions of words;
    keyword searches tend to return lists of 100s of
    pages
  • Yet digitised resources can be analysed in their
    entirety

3
Open Door possibilities
  • Machine translation
  • Analogue to text, e.g. identifying footnotes
    within text, spotting the beginning and end of
    entries in encyclopedias and gazetteers
  • Information Extraction
  • (Semantic) recognition of people, places, dates,
    and organizations, citations etc.
  • → For scholars, new types of research in
    understanding primary (or secondary) sources

4
A Case Study: 17th-Cent. News
  • Thanks to Ian Gregory, Andrew Hardie (University
    of Lancaster) for this study
  • Lancaster Newsbooks Corpus
  • 800,000 words of 1650s English newsbooks
  • Every surviving newsbook from mid-Dec 1653 to end
    of May 1654
  • Freely available via AHDS Catalogue
  • Methodological and technical problems exist;
    skirted over here

5
1. Recognising geographies
  • Extracting individual mentions of place names
    from the corpus
  • The identification of proper nouns is
    accomplished via part-of-speech tagging, a
    well-established technology within linguistics

6
  • In the Patrick of Liverpoole, which we lately
    recovered from the Brest men of War, was one
    Walter Roche, who was to carry her to Brest,
    and he informed us, that there are these Ships
    following belonging to Brest, who do so vex us
    in these Seas, viz.
  • <p> In_II the_AT <em> Patrick_NP1 </em> of_IO
    <em> <reg orig="Liverpoole"> Liverpool_NP1 </reg>
    </em> ,_, which_DDQ we_PPIS2 lately_RR
    recovered_VVN from_II the_AT <em> Brest_JJT </em>
    men_NN2 of_IO War_NN1 ,_, was_VBDZ one_MC1 <em>
    Walter_NP1 Roche_NP1 </em> ,_, who_PNQS was_VBDZ
    to_TO carry_VVI her_PPHO1 to_II <em> Brest_NP1
    </em> ,_, and_CC he_PPHS1 informed_VVD us_PPIO2
    ,_, that_CST there_EX are_VBR these_DD2 Ships_NN2
    following_II belonging_VVG to_II <em> Brest_NP1
    </em> ,_, who_PNQS do_VD0 so_RR vex_VVI us_PPIO2
    in_II these_DD2 Seas_NN2 ,_, <em> viz._REX </em>
    </p>
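A minimal sketch of how proper nouns might be pulled out of text tagged in the word_TAG style of the sample above. The function name, the markup-stripping step and the choice to match tags beginning with "NP" are assumptions for illustration, not the Lancaster team's actual pipeline.

```python
import re

# Shortened copy of the tagged sample, in word_TAG format with inline markup.
SAMPLE = (
    '<p> In_II the_AT <em> Patrick_NP1 </em> of_IO '
    '<em> <reg orig="Liverpoole"> Liverpool_NP1 </reg> </em> ,_, '
    'which_DDQ we_PPIS2 lately_RR recovered_VVN from_II the_AT '
    '<em> Brest_JJT </em> men_NN2 of_IO War_NN1 ,_, was_VBDZ one_MC1 '
    '<em> Walter_NP1 Roche_NP1 </em> </p>'
)


def proper_nouns(tagged_text, tag_prefix="NP"):
    """Return the words whose part-of-speech tag starts with tag_prefix."""
    # Drop the angle-bracket markup so only word_TAG tokens remain.
    plain = re.sub(r"<[^>]*>", " ", tagged_text)
    nouns = []
    for token in plain.split():
        # Split on the last underscore: the word on the left, the tag on the right.
        word, sep, tag = token.rpartition("_")
        if sep and tag.startswith(tag_prefix):
            nouns.append(word)
    return nouns


print(proper_nouns(SAMPLE))
```

In this sketch "Brest" in "Brest men of War" is skipped because it carries the tag JJT rather than NP1, while personal names such as "Walter" and "Roche" are still returned; separating those from place names needs a further step.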

7
2. Extracting place names and assigning
co-ordinates
  • Proper nouns compared to a gazetteer
  • We chose http://www.world-gazetteer.com
  • Places outside Europe filtered out
  • SQL database relational join
  • Filters out non-place-name proper nouns
  • Problem: duplicate place names (e.g. Newcastle in
    Ireland)
  • Each instance of a place name is associated with
    (one or more) sets of coordinates
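The gazetteer lookup described above can be sketched as a relational join. This is an illustration only: the table and column names, the rough bounding box used for "Europe" and the approximate coordinates are all assumptions, and the data is a tiny hypothetical stand-in for a real gazetteer.

```python
import sqlite3

# Hypothetical gazetteer rows: (name, country, latitude, longitude).
# Coordinates are rough illustrative values, not taken from any real gazetteer.
GAZETTEER = [
    ("Liverpool", "UK", 53.41, -2.98),
    ("Brest", "France", 48.39, -4.49),
    ("Newcastle", "UK", 54.98, -1.61),
    ("Newcastle", "Ireland", 52.45, -9.06),
    ("Newcastle", "Australia", -32.93, 151.78),
]

# Assumed bounding box standing in for "Europe": (min, max) in degrees.
EUROPE_LAT = (35.0, 72.0)
EUROPE_LON = (-25.0, 45.0)


def geocode(words):
    """Join proper nouns against the gazetteer, keeping European matches only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE mentions (id INTEGER PRIMARY KEY, word TEXT);
        CREATE TABLE gazetteer (name TEXT, country TEXT, lat REAL, lon REAL);
        """
    )
    conn.executemany("INSERT INTO mentions (word) VALUES (?)", [(w,) for w in words])
    conn.executemany("INSERT INTO gazetteer VALUES (?, ?, ?, ?)", GAZETTEER)
    # The inner join drops proper nouns with no gazetteer entry (e.g. personal
    # names); a duplicated place name yields one row per candidate location.
    rows = conn.execute(
        """
        SELECT m.word, g.country, g.lat, g.lon
        FROM mentions AS m
        JOIN gazetteer AS g ON g.name = m.word
        WHERE g.lat BETWEEN ? AND ? AND g.lon BETWEEN ? AND ?
        ORDER BY m.id, g.country
        """,
        (*EUROPE_LAT, *EUROPE_LON),
    ).fetchall()
    conn.close()
    return rows


for row in geocode(["Patrick", "Liverpool", "Walter", "Roche", "Brest", "Newcastle"]):
    print(row)
```

Note that "Newcastle" comes back with two sets of coordinates, which is the duplicate place name problem noted above; the join alone cannot decide which one a given mention refers to.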

8
3. And on to GIS
  • Google Earth
  • ArcGIS
  • Density smoothing in ArcGIS
  • (GIS = Geographical Information System)
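To give a rough idea of what density smoothing does with geocoded mentions, here is a small sketch that spreads each point over a regular grid with a Gaussian kernel. It is not a reproduction of the ArcGIS tool: the kernel choice, the bandwidth, the grid size and the treatment of degrees as flat planar units are all simplifying assumptions.

```python
import math


def density_surface(points, lat_range, lon_range, steps=5, bandwidth=1.0):
    """Sum a Gaussian kernel centred on each weighted point over a grid.

    points: iterable of (lat, lon, weight), e.g. weight = number of mentions.
    Returns a steps x steps list of rows, ordered from lat_min to lat_max.
    """
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    grid = []
    for r in range(steps):
        lat = lat_min + (lat_max - lat_min) * r / (steps - 1)
        row = []
        for c in range(steps):
            lon = lon_min + (lon_max - lon_min) * c / (steps - 1)
            value = 0.0
            for p_lat, p_lon, weight in points:
                # Squared distance in degrees (planar approximation).
                dist_sq = (lat - p_lat) ** 2 + (lon - p_lon) ** 2
                value += weight * math.exp(-dist_sq / (2 * bandwidth ** 2))
            row.append(value)
        grid.append(row)
    return grid


# Hypothetical input: one place mentioned five times.
surface = density_surface([(53.0, -3.0, 5)], (51.0, 55.0), (-5.0, -1.0))
for row in surface:
    print(" ".join(f"{v:5.2f}" for v in row))
```

The grid cell nearest the place gets the highest value and the values fall away with distance, which is what turns a scatter of individual mentions into a smooth surface.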
9
4. Mapping by theme
  • What is being discussed in relation to each
    mention of each place-name?
  • We cannot tell just from the dates and
    co-ordinates
  • Solution: concordance + semantic tagging
  • USAS system (University of Lancaster Semantic
    Analysis System)
  • Finding all terms related to a theme, e.g. money,
    cash, sterling, pound.

10
Identifying thematic associations (a): semantic
tags in immediate context
  • <hit_word>Dunkirk</hit_word>
  • <text>DutchDiurn03</text>
  • of a rich Fleet from (tags: Z5 Z5 I1.1+ Z2 Z5)
  • Dunkirk (tag: Z2)
  • , consisting of about forty (tags: A1.8+ Z5 A13.4 N1)

11
Mapping war
  • Problems:
  • March: 18 mentions, 2 places
  • Munster: 12 mentions, 3 places
  • Newcastle: 5 mentions in west of Ireland
  • Manchester
  • Middleton: 63 mentions, General in a
    rebellion in Scotland
  • Whalley: 10 mentions, General in a regiment
    of horse

Tag G3 (warfare, etc.): 780 mentions
12
Mapping money and government
I1 (Money): 140 mentions
G1 (Government): 293 mentions
13
1m Books Challenge (I)
  • Such analysis could be done on any dataset
  • Concept developed by Greg Crane, Tufts
    University, Director of Perseus Project
  • Taken up by six funding bodies (from US, UK,
    Canada, Germany) to create an international
    grant competition; name t.b.c.
  • Competition to forge international partnerships
    to undertake the type of work highlighted in the
    case study

14
1m Books Challenge (II)
  • Will involve scholars, computer scientists,
    information managers and publishers
  • Competition is seeking to open up publishers'
    content to allow for this type of analysis
  • Competition due to open in January 2009;
    significant time built into the call to allow for
    relationships with publishers to be developed

15
Technical & Legal Issues (I)
  • Scholars need to have a copy of the corpus /
    dataset to be analysed
  • Difficulties in actually transferring large
    corpora
  • Obvious IPR risks: material could be passed on
  • Or publishers need to make entire corpus
    available online
  • Technically complex: requires powerful
    infrastructure; how does online content interact
    with analytical tools?

16
Technical & Legal Issues (II)
  • Experiments need to be repeatable
  • Peer review demands that other scholars have
    access to a corpus to review peers' analyses
  • Where are enriched datasets stored?
  • Proliferating number of enriched datasets
  • Demand for delivering enriched datasets (or parts
    of them) for research and teaching
  • Who owns the IPR in an enriched dataset?
  • Original publisher: yes. Researcher? Software,
    thesaurus and gazetteer creators?
  • How does this work for records, images, maps,
    audio, video?

17
Significant Challenges
  • Significant challenges exist for all stakeholders
  • But possibilities for exploiting investment in
    creating digital content
  • And potential for new avenues of research which
    scholars will wish to explore
  • Million Books Challenge will help explore some
    of these issues