Title: The Million Book Challenge: data mining for scholarship
Slide 1: The Million Book Challenge: data mining for scholarship
- Alastair Dunning
- Digitisation Programme Manager, JISC
- 0203 006 6065
- a.dunning_at_jisc.ac.uk
Slide 2: From Keyhole to Open Door
- Many scholars still approach digitised data with the search/browse paradigm
- All digitised resources are initially constructed in this way
- E.g. British Library 19th-Century Newspapers: over 1m pages of text, billions of words; keyword searches tend to reveal lists of 100s of pages
- Yet digitised resources can be analysed in their entirety
Slide 3: Open Door possibilities
- Machine translation
- Analogue to text: e.g. identifying footnotes within text, spotting the beginning and end of entries, encyclopedias, and gazetteers
- Information extraction
- (Semantic) recognition of people, places, dates, organizations, citations etc.
- For scholars: new types of research in understanding primary (or secondary) sources
Slide 4: A Case Study: 17th-Cent. News
- Thanks to Ian Gregory, Andrew Hardie (University of Lancaster) for this study
- Lancaster Newsbooks Corpus
- 800,000 words of 1650s English newsbooks
- Every surviving newsbook from mid-Dec 1653 to end of May 1654
- Freely available via AHDS Catalogue
- Methodological and technical problems exist; skirted over here
Slide 5: 1. Recognising geographies
- Extracting individual mentions of place names from the corpus
- The identification of proper nouns is accomplished via part-of-speech tagging, a well-established technology within linguistics
Slide 6
- In the Patrick of Liverpoole, which we lately recovered from the Brest men of War, was one Walter Roche, who was to carry her to Brest, and he informed us, that there are these Ships following belonging to Brest, who do so vex us in these Seas, viz.
- <p> In_II the_AT <em> Patrick_NP1 </em> of_IO
<em> <reg orig="Liverpoole"> Liverpool_NP1 </reg>
</em> ,_, which_DDQ we_PPIS2 lately_RR
recovered_VVN from_II the_AT <em> Brest_JJT </em>
men_NN2 of_IO War_NN1 ,_, was_VBDZ one_MC1 <em>
Walter_NP1 Roche_NP1 </em> ,_, who_PNQS was_VBDZ
to_TO carry_VVI her_PPHO1 to_II <em> Brest_NP1
</em> ,_, and_CC he_PPHS1 informed_VVD us_PPIO2
,_, that_CST there_EX are_VBR these_DD2 Ships_NN2
following_II belonging_VVG to_II <em> Brest_NP1
</em> ,_, who_PNQS do_VD0 so_RR vex_VVI us_PPIO2
in_II these_DD2 Seas_NN2 ,_, <em> viz._REX </em>
</p>
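The proper-noun step could be sketched roughly as follows. This is only an illustration, assuming tokens in the word_TAG form shown in the tagged passage, with markup in angle brackets. The function name proper_nouns and the rule of keeping tags that begin with NP are assumptions made for the example, not the project's actual code.

```python
import re

def proper_nouns(tagged_text, tag_prefix="NP"):
    """Return words whose part-of-speech tag starts with tag_prefix.

    Assumes whitespace-separated word_TAG tokens, as in the tagged passage;
    anything inside angle brackets is treated as markup and dropped.
    """
    text = re.sub(r"<[^>]+>", " ", tagged_text)  # strip <em>, <reg ...>, <p> etc.
    found = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")   # split on the last underscore
        if sep and tag.startswith(tag_prefix):
            found.append(word)
    return found

# Opening of the tagged passage, used as sample input.
sample = (
    '<p> In_II the_AT <em> Patrick_NP1 </em> of_IO <em> '
    '<reg orig="Liverpoole"> Liverpool_NP1 </reg> </em> ,_, which_DDQ '
    'we_PPIS2 lately_RR recovered_VVN from_II the_AT <em> Brest_JJT </em> '
    'men_NN2 of_IO War_NN1 </p>'
)
nouns = proper_nouns(sample)
print(nouns)
```

Note that in this passage the first "Brest" carries the tag JJT rather than NP1, so a filter on NP tags alone would miss that mention; later mentions tagged NP1 would be caught.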
Slide 7: 2. Extracting place names and assigning co-ordinates
- Proper nouns compared to a gazetteer
- We chose http://www.world-gazetteer.com
- Places outside Europe filtered out
- SQL database: relational join
- Filters out non-place-name proper nouns
- Problem: duplicate place names (e.g. Newcastle in Ireland)
- Each instance of a place name is associated with (one or more) sets of coordinates
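The gazetteer comparison can be illustrated with a small relational join. This is a sketch only, using Python's built-in sqlite3 module. The table and column names are hypothetical, and the coordinates are rough illustrative values, not data taken from world-gazetteer.com.

```python
import sqlite3

def match_places(proper_nouns, gazetteer_rows):
    """Join proper-noun mentions against a gazetteer, keeping European places.

    Table and column names here are hypothetical, chosen for the example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mention (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE gazetteer (name TEXT, region TEXT, lat REAL, lon REAL)"
    )
    conn.executemany(
        "INSERT INTO mention (name) VALUES (?)", [(n,) for n in proper_nouns]
    )
    conn.executemany("INSERT INTO gazetteer VALUES (?, ?, ?, ?)", gazetteer_rows)
    # The inner join drops proper nouns with no gazetteer entry (people, ships);
    # a name with several gazetteer entries yields one row per candidate place.
    rows = conn.execute(
        """SELECT m.id, m.name, g.lat, g.lon
           FROM mention AS m
           JOIN gazetteer AS g ON g.name = m.name
           WHERE g.region = 'Europe'
           ORDER BY m.id, g.lat"""
    ).fetchall()
    conn.close()
    return rows

# Rough, illustrative coordinates only.
gazetteer = [
    ("Liverpool", "Europe", 53.41, -2.98),
    ("Brest", "Europe", 48.39, -4.49),
    ("Newcastle", "Europe", 54.97, -1.61),   # north-east England
    ("Newcastle", "Europe", 52.45, -9.05),   # west of Ireland
    ("Newcastle", "Oceania", -32.93, 151.78),
]
rows = match_places(["Patrick", "Liverpool", "Roche", "Brest", "Newcastle"], gazetteer)
for row in rows:
    print(row)
```

The single "Newcastle" mention comes back with two candidate coordinate pairs, which is the duplicate place-name problem: the join alone cannot say which Newcastle the newsbook meant.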
Slide 8: 3. And on to GIS
Google Earth
ArcGIS
Density smoothing in ArcGIS
(GIS = Geographical Information System)
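To give a feel for what density smoothing does with the mapped mentions, here is a minimal sketch. It assumes a plain Gaussian kernel summed over a regular grid, with distances measured crudely in degrees. This is a stand-in for illustration and not a description of how ArcGIS computes its density surfaces; the function name, grid size and bandwidth are all assumptions.

```python
import math

def density_grid(points, lat_range, lon_range, steps, bandwidth):
    """Sum a Gaussian kernel around each (lat, lon) point over a steps x steps grid.

    Distances are taken directly in degrees, a crude simplification that
    ignores map projection.
    """
    lat0, lat1 = lat_range
    lon0, lon1 = lon_range
    grid = []
    for i in range(steps):
        lat = lat0 + (lat1 - lat0) * i / (steps - 1)
        row = []
        for j in range(steps):
            lon = lon0 + (lon1 - lon0) * j / (steps - 1)
            total = 0.0
            for p_lat, p_lon in points:
                d2 = (lat - p_lat) ** 2 + (lon - p_lon) ** 2
                total += math.exp(-d2 / (2 * bandwidth ** 2))
            row.append(total)
        grid.append(row)
    return grid

# Hypothetical mention locations: two close together, one further away.
points = [(53.4, -3.0), (53.5, -2.9), (48.4, -4.5)]
grid = density_grid(points, (48.0, 54.0), (-5.0, -2.0), steps=7, bandwidth=0.5)
```

Grid cells near the two clustered points receive a higher value than the cell near the isolated point, and cells far from every mention stay close to zero, which is what turns a scatter of dots into a readable surface.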
Slide 9: 4. Mapping by theme
- What is being discussed in relation to each mention of each place-name?
- We cannot tell just from the dates and co-ordinates
- Solution: concordance and semantic tagging
- USAS (UCREL Semantic Analysis System, developed at Lancaster University)
- Finding all terms related to a theme, e.g. money, cash, sterling, pound.
Slide 10: Identifying thematic associations (a): semantic tags in immediate context
- <hit_word>Dunkirk</hit_word>
- <text>DutchDiurn03</text>
- of a rich Fleet from: Z5 Z5 I11u Z2 Z5
- Dunkirk: Z2
- , consisting of about forty: A18u Z5 A134 N1
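The idea of looking for semantic tags in the immediate context of a place name can be sketched as below. This is an assumption-laden illustration: tokens are (word, tag) pairs, the tags are simplified to the top-level codes named elsewhere in these slides (I1 money, G1 government, G3 warfare), and the assignment of I1 to "rich" is a simplification made for the example rather than the tag string shown in the concordance line. The function name and window size are hypothetical.

```python
def theme_hits(tokens, place, theme, window=5):
    """Count mentions of `place` that have a `theme`-tagged word nearby.

    tokens: list of (word, semantic_tag) pairs in text order.
    window: number of tokens inspected on each side of the place name.
    """
    hits = 0
    for i, (word, _tag) in enumerate(tokens):
        if word != place:
            continue
        start = max(0, i - window)
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        if any(tag == theme for _word, tag in context):
            hits += 1
    return hits

# The Dunkirk concordance line, with simplified (assumed) tags.
tokens = [
    ("of", "Z5"), ("a", "Z5"), ("rich", "I1"), ("Fleet", "Z2"), ("from", "Z5"),
    ("Dunkirk", "Z2"),
    ("consisting", "A1"), ("of", "Z5"), ("about", "A13"), ("forty", "N1"),
]
money = theme_hits(tokens, "Dunkirk", "I1")
war = theme_hits(tokens, "Dunkirk", "G3")
print(money, war)
```

Counting such hits per place name, theme by theme, gives the figures that can then be mapped for a single topic such as money or warfare.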
Slide 11: Mapping war
- Problems
- March: 18 mentions, 2 places
- Munster: 12 mentions, 3 places
- Newcastle: 5 mentions in west of Ireland
- Manchester
- Middleton: 63 mentions, General in a rebellion in Scotland
- Whalley: 10 mentions, General in a regiment of horse
Tag G3 (warfare, etc.): 780 mentions
Slide 12: Mapping money and government
I1 Money: 140 mentions
G1 Government: 293 mentions
Slide 13: 1m Books Challenge (I)
- Such analysis could be done on any dataset
- Concept developed by Greg Crane, Tufts University, Director of Perseus Project
- Taken up by six funding bodies to create international grant competition (from US, UK, Canada, Germany); name t.b.c.
- Competition to forge international partnerships to undertake the type of work highlighted in the case study
Slide 14: 1m Books Challenge (II)
- Will involve scholars, computer scientists, information managers and publishers
- Competition is seeking to open up publishers' content to allow for this type of analysis
- Competition due to open in January 2009; significant time built into the call to allow for relationships with publishers to be developed
Slide 15: Technical and Legal Issues (I)
- Scholars need to have a copy of the corpus / dataset to be analysed
- Difficulties in actually transferring large corpora
- Obvious IPR risks: material could be passed on
- Or publishers need to make entire corpus available online
- Technically complex; requires powerful infrastructure: how does online content interact with analytical tools?
Slide 16: Technical and Legal Issues (II)
- Experiments need to be repeatable
- Peer review demands that other scholars have access to a corpus to review peers' analyses
- Where are enriched datasets stored?
- Proliferating number of enriched datasets
- Demand for delivering enriched datasets (or parts of them) for research and teaching
- Who owns the IPR in an enriched data set?
- Original publisher: yes. Researcher? Software, thesaurus and gazetteer creators?
- How does this work for records, images, maps, audio, video?
Slide 17: Significant Challenges
- Significant challenges exist for all stakeholders
- But possibilities for exploiting investment in creating digital content
- And potential for new avenues of research which scholars will wish to explore
- Million Books Challenge will help explore some of these issues