Title: Alan Tonge
Slide 1: SPECTRa-T Project
Semantic Web Data Repositories from Chemistry
e-Thesis Data Mining
Open Repositories 2008, Southampton University, 2 April 2008
Slide 2: Project Overview
Submission, Preservation and Exposure of Chemistry Teaching and Research Data in Theses
- 12-month project between University of Cambridge and Imperial College London to develop text- and data-mining tools to extract chemical data from e-theses
- Part of the JISC Digital Repositories programme
Slide 3: Background
Chemistry is an experimental science.
Synthetic Organic Chemistry is the basis of the Pharmaceutical and Agrochemical industries.
Where does the information to make this molecule come from?
Systematic Name: Ethyl 4,5-epoxy-hex-2-enolate
Molecular Formula: C8H12O3
Slide 4: Search chemical, patent and journal abstracting services, e.g.
- Chemical Abstracts (9000 journals - 12,000 structures/day)
- Beilstein (180 core journals)
- Patents (CAS, Derwent, MDL) (400,000 /annum)
Academic chemistry publications are largely derived from PhD theses.
- Perhaps 10K published per year worldwide
- A synthetic chemistry thesis contains 50-60 preparations; only 20% are published in detail
Slide 5
- List of: starting materials, reagents
- Recipe: reactions, conditions, work-up
- Product characterization: spectroscopic, physical properties
Slide 6: Sample preparation from synthetic chemistry thesis
Slide 7: The Problem
- 80% of (academic) synthetic preparations remain locked in theses
- Manual abstraction (cf. journals/patents) is not an option
The Solution
- OSCAR3: automatic high-throughput chemical name and chemical term recognition
- Open Source Chemistry Analysis Routines is an extensible Open Source framework which can identify much of the chemical terminology in electronic articles
- Semantic Web: deposit extracted terms in a searchable RDF triplestore
Slide 8: OSCAR Name recognition
1. Dictionary of chemical names/terms (ChEBI Ontology)
2. Rules: chemical suffix filters
3. Regular expressions to recognise data, formulae
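The three recognition layers listed above can be illustrated with a small sketch. This is not OSCAR3's implementation: the dictionary entries, suffix list and formula pattern below are all hypothetical stand-ins chosen for the example, and a real system would use the ChEBI ontology and far more careful rules.

```python
import re
from typing import Optional

# Hypothetical mini-dictionary; OSCAR3 itself draws on the ChEBI ontology.
DICTIONARY = {"benzene", "ethanol", "toluene"}

# Hypothetical suffix filters for names that are not in the dictionary.
SUFFIXES = ("ane", "ene", "yne", "ol", "one", "oate", "amine")

# Illustrative formula pattern: two or more element-like tokens with optional
# counts, and at least one digit somewhere (so "NMR" is not taken as a formula).
FORMULA_RE = re.compile(r"^(?=.*\d)(?:[A-Z][a-z]?\d*){2,}$")


def classify(token: str) -> Optional[str]:
    """Return which layer recognised the token, or None if none did."""
    word = token.strip(".,;()")
    if FORMULA_RE.match(word):
        return "formula"
    lowered = word.lower()
    if lowered in DICTIONARY:
        return "dictionary"
    if len(lowered) > 5 and lowered.endswith(SUFFIXES):
        return "suffix"
    return None


for tok in ["benzene", "cyclohexanone", "C8H12O3", "thesis", "NMR"]:
    print(tok, classify(tok))
```

Suffix rules of this kind produce false positives on ordinary words (for example words ending in "one"), which is one reason a dictionary and further filtering are layered around them.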
Slide 10: Input PDF (Legacy Format)
PDF is the de facto format for electronic document deposition in digital repositories.
PDF is a page description format optimized for human, not machine, readability. Problems in text extracted from PDF:
- irregular word order
- line-breaks: loss of continuous text, paragraphs difficult to identify
- loss of subscripts and superscripts
- non-printing characters
- erroneous character assignment with OCR
Slide 12: Programmatic modifications to
- Remove linebreaks from extended chemical names
- Remove text fragments derived from Figures and Tables
- Correct whitespace in chemical names
Diagram labels: PDF, UTF-8 text, OSCAR3, SAF XML, XSLT, RDF statements
OSCAR used as is on PDF e-theses:
- Gives 5000 terms / thesis (80% duplicates)
- Cannot identify chemical objects (spectra, assignments, properties)
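Two of the clean-up steps named on this slide (rejoining chemical names broken across lines, and correcting whitespace inside names) might look roughly like the sketch below. The function names and regular expressions are assumptions made for illustration, not the project's code.

```python
import re


def join_broken_names(text: str) -> str:
    """Rejoin a name split after a hyphen at a line end, e.g. 'but-\\n3-yne'."""
    # Assumption: a word character plus hyphen directly before a newline
    # marks a chemical name continued on the next line.
    return re.sub(r"(?<=\w-)\n[ \t]*(?=\w)", "", text)


def fix_name_whitespace(text: str) -> str:
    """Remove stray spaces after hyphens and between numeric locants."""
    text = re.sub(r"(?<=\w)-[ \t]+(?=\w)", "-", text)
    return re.sub(r"(?<=\d),[ \t]+(?=\d)", ",", text)


print(join_broken_names("1-Benzyloxy-but-\n3-yne"))
print(fix_name_whitespace("5, 6- epoxy-6-methyl"))
```

Rules this simple will also touch ordinary prose (hyphenated words at line ends, comma-separated lists of numbers), so in practice they would need to be restricted to spans already suspected of being chemical names.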
Slide 13: Input MS Office Open XML (docx)
- No information loss from the student's deposited thesis (written with MS software)
- Identification of experimental sections no longer a problem -> Chemical Objects
- Conversion of COs into Chemical Markup Language
Diagram labels: Extract chemical terms, RDF statements, OSCAR3, Link together, URI, DocX, Extract chemical objects, CML data files, Data Repository
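Part of why docx is easier to mine than PDF is that a .docx file is a ZIP archive whose main text lives in word/document.xml as explicit paragraph (w:p) and text (w:t) elements. The sketch below reads paragraphs with the Python standard library only. The helper make_demo_docx is hypothetical and builds a stripped-down archive for the demonstration; it is not a complete document that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List
from xml.sax.saxutils import escape

# WordprocessingML main namespace used by docx body elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(data: bytes) -> List[str]:
    """Return the non-empty paragraphs of a docx file given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        text = "".join(node.text or "" for node in para.iter("{%s}t" % W_NS))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def make_demo_docx(paragraphs: List[str]) -> bytes:
    """Hypothetical helper: minimal archive holding only word/document.xml."""
    body = "".join(
        "<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % escape(p) for p in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()


demo = make_demo_docx(["Experimental", "IR (film) 3446, 1672 cm-1"])
print(docx_paragraphs(demo))
```

Because paragraph boundaries survive intact, locating an experimental section becomes a matter of scanning paragraphs rather than reconstructing them from page layout.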
Slide 14: Sample preparation from synthetic chemistry thesis
Slide 15: CML Infra-Red ASSIGNMENTS
<cml:spectrum type="cml:ir">
  <cml:conditionList>
    <cml:condition title="the form of the IR spectrum" dictRef="cml:irform">film</cml:condition>
  </cml:conditionList>
  <cml:peakList>
    <cml:peak id="p1" xValue="3446" title="OH" />
    <cml:peak id="p2" xValue="3062" title="unassigned" />
    <cml:peak id="p3" xValue="3029" title="unassigned" />
    <cml:peak id="p4" xValue="2922" title="unassigned" />
    <cml:peak id="p5" xValue="1672" title="CO" />
    <cml:peak id="p6" xValue="1604" title="CC" />
    <cml:peak id="p7" xValue="1496" title="unassigned" />
    <cml:peak id="p8" xValue="1454" title="unassigned" />
    <cml:peak id="p9" xValue="1366" title="unassigned" />
    <cml:peak id="p10" xValue="1299" title="unassigned" />
    <cml:peak id="p11" xValue="1135" title="unassigned" />
    <cml:peak id="p12" xValue="1078" title="unassigned" />
    <cml:peak id="p13" xValue="974" title="unassigned" />
  </cml:peakList>
</cml:spectrum>
CML C-13 NMR ASSIGNMENTS
<cml:spectrum type="cml:cnmr">
  <cml:parameterList>
    <cml:parameter dictRef="cml:frequency" units="units:MHz">50</cml:parameter>
  </cml:parameterList>
  <cml:substanceList>
    <cml:substance ref="" />
  </cml:substanceList>
  <cml:peakList>
    <cml:peak xValue="198.6" integral="" peakMultiplicity="" title="CO" />
    <cml:peak xValue="198.5" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="145.0" integral="" peakMultiplicity="" title="C" />
    <cml:peak xValue="142.7" integral="" peakMultiplicity="" title="C" />
    <cml:peak xValue="137.3" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="136.7" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="129.1" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="128.6" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="126.7" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="124.0" integral="" peakMultiplicity="" title="aryl-C" />
    <cml:peak xValue="62.5" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="59.0" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="55.2" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="54.9" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="38.5" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="32.8" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="26.1" integral="" peakMultiplicity="" title="CH3" />
    <cml:peak xValue="26.0" integral="" peakMultiplicity="" title="CH3" />
  </cml:peakList>
</cml:spectrum>
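Once spectra are held as CML, the peak assignments can be read back with any XML parser. The sketch below uses Python's standard library on a shortened copy of the IR example. The slide fragments do not show a namespace declaration, so binding the cml: prefix to http://www.xml-cml.org/schema is an assumption here, as is the function name assigned_peaks.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the cml: prefix (not shown on the slide).
CML_NS = "http://www.xml-cml.org/schema"

# Shortened copy of the IR spectrum, with the assumed namespace declared.
SAMPLE = """<cml:spectrum xmlns:cml="http://www.xml-cml.org/schema" type="cml:ir">
  <cml:peakList>
    <cml:peak id="p1" xValue="3446" title="OH" />
    <cml:peak id="p2" xValue="3062" title="unassigned" />
    <cml:peak id="p5" xValue="1672" title="CO" />
  </cml:peakList>
</cml:spectrum>"""


def assigned_peaks(cml_text: str):
    """Return (xValue, title) pairs for peaks that carry an assignment."""
    root = ET.fromstring(cml_text)
    peaks = []
    for peak in root.iter("{%s}peak" % CML_NS):
        title = peak.get("title", "")
        if title and title != "unassigned":
            peaks.append((float(peak.get("xValue")), title))
    return peaks


print(assigned_peaks(SAMPLE))
```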
Slide 16: RDF - Resource Description Framework
A component of the Semantic Web, it is based upon the idea of making statements about resources/data in the form of a subject-predicate-object (or resource-property-value) expression (called a triple), e.g.
My_thesis has_chemical_entity 2,4-dinitrobenzene
The value of one property can in turn be used as the resource for another.
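The triple model, and the chaining of one statement's value into the next statement's resource, can be shown with plain tuples. In this sketch the first triple is the slide's own example; the second triple and its predicate has_molecular_formula are hypothetical, added only to show the chaining.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

TRIPLES: List[Triple] = [
    ("My_thesis", "has_chemical_entity", "2,4-dinitrobenzene"),
    # Hypothetical statement: the object above is reused as a subject here.
    ("2,4-dinitrobenzene", "has_molecular_formula", "C6H4N2O4"),
]


def objects(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow(triples: List[Triple], start: str, *predicates: str) -> List[str]:
    """Walk a chain of predicates, feeding each result into the next step."""
    current = [start]
    for predicate in predicates:
        current = [o for node in current for o in objects(triples, node, predicate)]
    return current


print(follow(TRIPLES, "My_thesis", "has_chemical_entity", "has_molecular_formula"))
```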
Slide 17: SPARQL QUERY
PREFIX st: <http://wwmm.ch.cam.ac.uk/spectra-t#>
PREFIX dcrdf: <http://purl.org/metadata/dublin_core#>
CONSTRUCT {
  ?thesis st:hasBicycloMoleculeAndHNMR ?chemical .
  ?thesis dcrdf:author ?author
}
WHERE {
  ?thesis dcrdf:creator ?author .
  ?thesis st:hasChemicalName ?annot .
  ?annot st:chemicalName ?chemical .
  ?annot st:hasHNMRSpectrum ?hnmr .
  FILTER regex(?chemical, ".bicyclo.") .
}
RDF TRIPLESTORE ENTRY
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcrdf="http://purl.org/metadata/dublin_core#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:spectra-t="http://wwmm.ch.cam.ac.uk/spectra-t#">
  <rdf:Description rdf:about="file:/C:/spectra-t-theses/Juergen_Harter.docx">
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>CDCl3</spectra-t:chemicalName>
        <spectra-t:hasSMILES>ClC([2H])(Cl)Cl</spectra-t:hasSMILES>
        <spectra-t:hasInChI>InChI=1/CHCl3/c2-1(3)4/h1H/i1D</spectra-t:hasInChI>
      </rdf:Description>
    </spectra-t:hasChemicalName>
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>1-Benzyloxy-but-3-yne</spectra-t:chemicalName>
        <spectra-t:hasSMILES>C#CCCOCC1=CC=CC=C1</spectra-t:hasSMILES>
        <spectra-t:hasInChI>InChI=1/C11H12O/c1-2-3-9-12-10-11-7-5-4-6-8-11/h1,4-8H,3,9-10H2</spectra-t:hasInChI>
        <spectra-t:hasHNMRSpectrum>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasHNMRSpectrum>
        <spectra-t:hasCMLMolecule>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasCMLMolecule>
        <spectra-t:hasPreparation>http://ch.cam.ac.uk:8182/1ea7f8cd07/preparation-0.sci.xml</spectra-t:hasPreparation>
      </rdf:Description>
    </spectra-t:hasChemicalName>
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>(3E,5S,6S)-8-(p-Methoxy-benzyloxy)-5,6-epoxy-6-methyl-oct-3-en-2-one</spectra-t:chemicalName>
        <spectra-t:hasHNMRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHNMRSpectrum>
        <spectra-t:hasIRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasIRSpectrum>
        <spectra-t:hasMassSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasMassSpectrum>
        <spectra-t:hasHRMSSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHRMSSpectrum>
        <spectra-t:hasPreparation>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/preparation-20.sci.xml</spectra-t:hasPreparation>
      </rdf:Description>
    </spectra-t:hasChemicalName>
  </rdf:Description>
</rdf:RDF>
RESULT
<rdf:Description rdf:about="file:/C:/spectra-t-articles/B207708F.docx">
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
</rdf:Description>
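To make the logic of the query on this slide concrete, the sketch below emulates its graph pattern over a small in-memory list of triples, using only the standard library. It is not how a triplestore evaluates SPARQL, and the sample data (thesis1, the author "A. Author", the blank-node labels) are invented for the illustration. The predicate names and the ".bicyclo." pattern follow the slide.

```python
import re

# Invented sample data shaped like the triplestore entry: a thesis links to
# annotation nodes, which carry a name and, optionally, an H NMR spectrum.
TRIPLES = [
    ("thesis1", "dcrdf:creator", "A. Author"),
    ("thesis1", "st:hasChemicalName", "_:a1"),
    ("_:a1", "st:chemicalName", "5-Acetyl-bicyclo[4.2.1]nona-4,7-diene"),
    ("_:a1", "st:hasHNMRSpectrum", "data-0.cml"),
    ("thesis1", "st:hasChemicalName", "_:a2"),
    ("_:a2", "st:chemicalName", "CDCl3"),
    ("thesis1", "st:hasChemicalName", "_:a3"),
    ("_:a3", "st:chemicalName", "5-Phenyl-bicyclo[4.2.1]nona-3,7-diene"),
]


def match(triples, s=None, p=None, o=None):
    """Triples agreeing with every term that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def construct(triples, pattern=r".bicyclo."):
    """Emulate the WHERE joins and FILTER, then emit the CONSTRUCT triples."""
    out = []
    for thesis, _, author in match(triples, p="dcrdf:creator"):
        for _, _, annot in match(triples, s=thesis, p="st:hasChemicalName"):
            has_hnmr = bool(match(triples, s=annot, p="st:hasHNMRSpectrum"))
            for _, _, chemical in match(triples, s=annot, p="st:chemicalName"):
                if has_hnmr and re.search(pattern, chemical):
                    out.append((thesis, "st:hasBicycloMoleculeAndHNMR", chemical))
                    out.append((thesis, "dcrdf:author", author))
    return out


for triple in construct(TRIPLES):
    print(triple)
```

Only the first annotation is returned: CDCl3 fails the name filter, and the phenyl compound matches the filter but has no H NMR spectrum attached in this sample data.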
Slide 18: Message to repository managers
- PDF is a limited format for data extraction from e-theses
- Docx allows chemical data object extraction (80% precision / recall)
Solutions
- Domain ontology development
- Make your e-theses public!
Caveats (Proof-of-concept)
- Single subject area (synthetic organic chemistry)
- Single institution docx (limited variation in document structure)
- Limited thesis availability
Slide 19: Acknowledgements
- Project Director: Peter Morgan, UL Cambridge
- Chemistry leads: Henry Rzepa, Peter Murray-Rust
- Developers: Jim Downing, Diana Stewart, Joe Townsend, Matt Harvey
- Project Manager: Alan Tonge
http://www.lib.cam.ac.uk/spectra-t/
Slide 20: SPECTRa Tools Workshop
Autumn 2008, Unilever Centre, Cambridge, UK
Contact: Peter Murray-Rust (pm286_at_cam.ac.uk), Peter Morgan (pbm2_at_cam.ac.uk)