Title: Project Prospect and the Semantic Web
1 Project Prospect and the Semantic Web
- Colin Batchelor
- Royal Society of Chemistry, Cambridge, UK
- batchelorc_at_rsc.org
2 Project Prospect and the Semantic Web
- Who we are
- What we've done
- Motivation
- Means
- The InChI and the Semantic Web
- Ontology development for chemistry
- RXNO and MOP
3 Who we are
5 Royal Society of Chemistry: Advancing the Chemical Sciences
- Learned and professional society
- Scientific publisher
- 25 journals, 8 databases and a growing book program
- 8000 articles yearly
- Covering a broad spectrum of chemical sciences
from systems biology (Molecular BioSystems) to
physical and theoretical chemistry (PCCP)
6 What we've done
10 The motivation
11 The motivation
- Scientific papers are formulaic and consistently structured (but not necessarily IMRD; see later)
- There may be infinitely many possible chemical compounds
- BUT
- Nomenclature is productive and susceptible to machine parsing
12 The means
13 The means: how publishing really works
14 Data capture
Editing and proof-reading
15 Enhanced HTML
Database
Text mining (Oscar)
Manual QA
Enhanced RSS
18 Regular polysemy
- where words stand for multiple things in a consistent way.
- Examples
- Brand names
- Grinding
- Figure-ground
- Exact-class-part polysemy in chemistry
- Peter Corbett, Colin Batchelor and Ann Copestake
(2008), Pyridines, pyridine and pyridine rings,
Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.
19 Regular polysemy
- Brand names
- Learning to buy a Renault and talk to BMW
- Grinding
- The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind
- vs.
- the trick was to serve squirrel fresh and not to leave it hanging like other game
20 Regular polysemy
- Figure-ground
- Audrey Hepburn painted the door (figure)
- Audrey Hepburn walked through the door (ground)
- The Incredible Hulk walked through the door
(ambiguous)
21 Imidazole
22 An imidazole
23 The imidazole side-chain/group/ring/etc.
24 Can ChEBI handle this?
- Imidazoles (!) (CHEBI:24780)
- Imidazole (CHEBI:16069)
- Imidazole ring: not yet
- Imidazolyl group: not yet (but methyl, benzyl, etc.)
- and there are no disambiguation cues
25 Disambiguation
- One Sense per Discourse (Gale et al. 1992)
- this doesn't hold at all
- One Sense per Collocation (Yarowsky 1993)
- matches our intuitions
26 Disambiguation: toy model
- CLASS
- w(-1): a, an, the, this
- w(0): plural (bit of a cheat, as not a collocation)
- PART
- w(-1): bridging, terminal
- w(+1): backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
- w(+1)w(+2): building block, protecting group, side chain
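The toy model can be read as a small rule-based classifier. The following is a minimal sketch in Python, reading the positional cues as the word immediately before the chemical name, the name itself, and the one or two words after it. The function name `classify_sense`, the naive trailing-"s" plural test, the choice to check PART cues before CLASS cues, and the fallback label `EXACT` are all illustrative assumptions, not the implementation described in the talk.

```python
# Cue lists taken from the slide; everything else here is an assumption.
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_PAIRS = {("building", "block"), ("protecting", "group"),
                   ("side", "chain")}


def classify_sense(tokens, i):
    """Guess whether the chemical name at tokens[i] denotes a PART,
    a CLASS, or (by default) an EXACT compound."""
    word = tokens[i].lower()
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    nxt2 = tokens[i + 2].lower() if i + 2 < len(tokens) else ""

    # PART cues are checked first (an assumption), so that
    # "the imidazole side chain" is not claimed by the determiner rule.
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_PAIRS:
        return "PART"
    # CLASS cues: a preceding determiner, or a (naively detected) plural.
    if prev in CLASS_PREV or word.endswith("s"):
        return "CLASS"
    return "EXACT"
```

For example, `classify_sense("the imidazole side chain".split(), 1)` falls under the PART rule, while `classify_sense("an imidazole".split(), 1)` falls under the CLASS rule.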
27 Why is this hard?
Coordination resolution
Part-of-speech ambiguity: is "tosylates" a noun or a verb?
28 Why is this hard?
- How many numbered compounds actually are named in a given paper?
- iloprost (1)
- tributyl-1-hexynylstannane (2)
- the desired 2-heptyne (3)
- methylPd(II) iodide 4 or 4'
- alkynylstannane 5
- the hypervalent stannate 6
- (alkynyl)(methyl)Pd(II) complex 7
- the desired methylalkyne 8
- compounds 9-14
- the stannyl precursors 15 and 16
- methylated compounds 17 and 18
- stannyl precursor 19
- iloprost methyl ester 20
- iloprost methyl ester is the real name, but you
need to know that iloprost is a monocarboxylic
acid!
29 Why is this hard?
- For compound names
- 60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)
- 20% PubChem
- 20% ChemDraw
- For compound numbers
- 70% author ChemDraw
- 30% editors
30 What are we marking up?
- Chemical compounds (InChI, ChEBI)
- Chemical classes and parts (ChEBI)
- Nanoparticles (in ChEBI from end of October)
- Chemical terms from the IUPAC Gold Book
- Name reactions (RXNO)
- Gene products: function, process, location (GO)
- Nucleotide and polypeptide sequence terms (SO)
- Cell types (CL)
31 InChI and the Semantic Web
32 What InChI is for
- Can represent complete molecules (may be ions or radicals) of fewer than 1024 heavy (non-H) atoms.
- (however)
- Cannot yet represent metal atom geometry.
- Cannot yet represent polymers.
- Cannot yet represent diradicals etc.
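As a rough illustration of the heavy-atom limit, the count can be read off the formula layer of an InChI string. This is a minimal sketch, assuming the formula is the second slash-separated field and that components are dot-separated with an optional leading multiplier; the helper names `heavy_atom_count` and `within_inchi_limit` are hypothetical, and this is not how the InChI software itself enforces the limit.

```python
import re


def heavy_atom_count(inchi):
    """Count non-hydrogen atoms using only the formula layer of an InChI."""
    layers = inchi.split("/")
    if len(layers) < 2:
        raise ValueError("no formula layer found")
    total = 0
    # Components look like "C15H22O9" or, with a multiplier, "2ClH".
    for component in layers[1].split("."):
        match = re.match(r"(\d*)(.*)", component)
        multiplier = int(match.group(1)) if match.group(1) else 1
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", match.group(2)):
            if element != "H":
                total += multiplier * (int(count) if count else 1)
    return total


def within_inchi_limit(inchi, limit=1024):
    """True if the molecule has fewer than `limit` heavy atoms."""
    return heavy_atom_count(inchi) < limit
```

For the formula layer `C15H22O9` this counts 15 carbons and 9 oxygens, 24 heavy atoms in total.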
33 What InChI is not for
- Classes of molecule
- Parts of molecule
- (these have been done in ChemBlast)
34 InChI in RDF
- (We don't like this.)
- We use the RSS content module. (As if articles contained molecules.)
- And we use info:inchi URIs.
- Look:
35 Some RDF
<content:items>
  <rdf:Bag>
    <rdf:li>
      <content:item rdf:about="info:inchi/InChI=1/C15H22O9/c1-8(16)19-6-15(7-20-9(2)17)12(21-10(3)18)11-13(24-15)23-14(4,5)22-11/h11-13H,6-7H2,1-5H3/t11?,12-,13+/m1/s1"/>
    </rdf:li>
    <rdf:li>
      <content:item rdf:about="info:inchi/InChI=1/C21H34O9/c1-6-9-14(22)25-12-21(13-26-15(23)10-7-2)18(27-16(24)11-8-3)17-19(30-21)29-20(4,5)28-17/h17-19H,6-13H2,1-5H3/t17?,18-,19+/m1/s1"/>
    </rdf:li>
  </rdf:Bag>
</content:items>
<content:items>
  <content:item><owl:Class rdf:ID="GO_0016298"><rdfs:label>lipase activity</rdfs:label></owl:Class></content:item>
</content:items>
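A consumer of such a feed could recover the compounds by collecting the identifiers of the content items. The following is a minimal sketch, assuming the items are serialized as `content:item` elements whose `rdf:about` attribute holds an `info:inchi/` URI, and assuming the usual namespace URIs for the RDF syntax and the RSS 1.0 content module. The function names and the shortened sample (methane) are illustrative and not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed here: RDF syntax and the RSS 1.0 content module.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
PREFIX = "info:inchi/"


def inchi_to_uri(inchi):
    """Wrap an InChI string in the info:inchi URI form used in the feed."""
    return PREFIX + inchi


def inchis_in_feed(xml_text):
    """Return the InChI strings named by content:item rdf:about attributes."""
    root = ET.fromstring(xml_text)
    found = []
    for item in root.iter("{%s}item" % CONTENT_NS):
        uri = item.get("{%s}about" % RDF_NS, "")
        if uri.startswith(PREFIX):
            found.append(uri[len(PREFIX):])
    return found


# Shortened, made-up sample with one compound (methane).
SAMPLE = """<content:items
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <rdf:Bag>
    <rdf:li>
      <content:item rdf:about="info:inchi/InChI=1/CH4/h1H4"/>
    </rdf:li>
  </rdf:Bag>
</content:items>"""
```

Items without an `info:inchi/` identifier, such as the ontology class entry, are simply skipped by this sketch.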
37 RXNO
- David Barden
- Colin Batchelor
- Celia Gitterman
38 RXNO: the name reaction ontology (1)
- Every chemist knows about famous chemists like Wittig, Cannizzaro, Diels, Alder, benzoin
- They're pretty unambiguous and well-suited to logical definitions
- But what organizing principle do we use?
39 RXNO: the name reaction ontology (2)
- Sort reactions by what they do to the skeleton of the molecule.
- Skeleton-changing reactions
- Joinings, cleavings, rearrangements, ring formation, ring expansion
- Skeleton-preserving reactions
- Additions, eliminations, substitutions, protections, deprotections
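The two-way split by effect on the skeleton can be written down as a small lookup table. This is only a sketch of the top of the hierarchy as listed on the slide; the singular labels, the dictionary layout and the function name `top_level_category` are illustrative assumptions and do not reflect how RXNO is actually encoded.

```python
# Top-level grouping from the slide; labels are normalized to the singular
# for illustration only.
RXNO_TOP_LEVEL = {
    "skeleton-changing": [
        "joining", "cleaving", "rearrangement",
        "ring formation", "ring expansion",
    ],
    "skeleton-preserving": [
        "addition", "elimination", "substitution",
        "protection", "deprotection",
    ],
}


def top_level_category(kind):
    """Return which top-level group a reaction kind belongs to."""
    for top, kinds in RXNO_TOP_LEVEL.items():
        if kind in kinds:
            return top
    raise KeyError(kind)
```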
40 RXNO: the name reaction ontology (3)
- Quality? Subjectivity?
- Get our curators to assign reactions to
categories without conferring, check percentage
agreement, discuss disagreements, improve
guidelines, iterate to convergence.
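The "check percentage agreement" step can be sketched as follows. This is a minimal illustration, assuming each curator's work is a mapping from reaction to category and that agreement is measured pairwise over the reactions both curators labelled. The function names and this particular definition of agreement are assumptions, since the talk does not say how the percentage was computed.

```python
from itertools import combinations


def percentage_agreement(assignments):
    """Mean pairwise agreement (in percent) between curators.

    `assignments` maps curator -> {reaction: category}. Only reactions
    labelled by both curators in a pair are compared.
    """
    agree = 0
    total = 0
    for a, b in combinations(assignments.values(), 2):
        for reaction in a.keys() & b.keys():
            total += 1
            if a[reaction] == b[reaction]:
                agree += 1
    if total == 0:
        raise ValueError("no reaction was labelled by two curators")
    return 100.0 * agree / total


def disagreements(assignments):
    """Reactions that received more than one category, for discussion."""
    labels = {}
    for per_curator in assignments.values():
        for reaction, category in per_curator.items():
            labels.setdefault(reaction, set()).add(category)
    return sorted(r for r, cats in labels.items() if len(cats) > 1)
```

One round of the loop would compute the percentage, discuss whatever `disagreements` returns, revise the guidelines, and repeat with a fresh set of independent assignments.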
43 RXNO: the name reaction ontology (4)
45 What do people say?
47 The spectroscopist's tale
- The enriched html version came as something of a
revelation and the current emphasis on links to,
and through biomolecular terminology was very
much a plus for us, since my colleagues and I are
a mix of physical and biological chemists who are
dabbling in inter-disciplinary waters. Given the
steadily increasing burden of keeping up with the
current literature and accessing earlier
publications - a fortiori when conventional
disciplinary boundaries are being crossed - the
ability to 'grow a tree' from current articles
(including one's own) is going to make 'targeted
sleuthing' a great deal easier.
- John Simons, Oxford
48 The high-throughput screener's tale
- An interesting opportunity particularly for
managers, students and beginners that are not
that deeply immersed in the detail and the
terminology. It further opens access to those who
want to explore areas they are not specialists
in. Great idea!
- Eberhard Krausz, MPI-CBG Dresden
49 Lastly
- My only criticism would be the need for a time warning: I spent 4 hours digging about, which generated at least six new research ideas, printed half a ream of paper and I missed my bus home. At least it was a new excuse my wife had not heard, so another first.
- An analytical chemist, The North.
51 Acknowledgements
- Royal Society of Chemistry
- Richard Kidd, Jeff White, David Barden, Celia Gitterman, Hilary Burch, the Informatics team
- University of Cambridge
- Peter Corbett, Simone Teufel, Ann Copestake, Peter Murray-Rust
- OBO
- Karen Eilbeck, Midori Harris, Jen Deegan, Jane Lomax, Chris Mungall, Barry Smith, the ChEBI team