Title: Strings and Things
1Strings and Things
- A brief history of chemical languages
John Bradshaw, Daylight CIS Inc,
Cambridge johnb_at_daylight.com
2Antoine Lavoisier
- ...we cannot improve the language of any
science, without, at the same time improving the
science itself... - 1787
ANTOINE LAURENT LAVOISIER Courtesy Edgar Fahs
Smith Memorial Collection, Dept. of Special
Collections, University of Pennsylvania Library
3Representing compounds as pictures
- Early chemical symbols and those created by Jean
Henri Hassenfratz and Pierre Auguste Adet to
complement the Methode de Nomenclature Chimique
(1787).
Courtesy Edgar Fahs Smith Memorial Collection,
Dept. of Special Collections, University of
Pennsylvania Library
4Representing compounds as pictures
5Representing compounds as character strings
- Jöns Jacob Berzelius
- ... chemical signs ought to be letters, for the
greater facility of writing, ... the printed
book... J.J. Berzelius Annals of Philosophy
(1813) 2, 443-454
- Compounds should be named by what they are, not
from their origins (e.g. Äpfelsäure, "apple acid")
- These names were not intended for labelling
bottles in the laboratory
6Representing compounds as character strings
- The chemical sign should be made up from the
initial letter of the Latin name of each
elementary substance, subject to rules
- C carbonicum
- Co cobaltum
- O oxygen
- Os osmium
- and digits to indicate the number of volumes in
a substance.
- There was no fixed (canonical) order for the
elements.
7Hill Order
- Hill notation is a standard way of writing the
formula for any chemical compound, i.e. it is
canonicalized (canons = rules)
- Example H3C(CH2)8CH2NH2
- Count all carbon atoms and record. C10
- Count all hydrogen atoms and record. C10H23
- Determine all other elements in compound. Count
each and record in alphabetical order. C10H23N
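The counting steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `hill_formula` is made up for this example, the element counts are tallied by hand from the slide's structure, and the alphabetical ordering for carbon-free compounds follows the usual Hill convention, which the slide does not cover.

```python
from collections import Counter

def hill_formula(counts):
    """Return a Hill-order formula string for a mapping of element -> count."""
    counts = Counter(counts)
    if "C" in counts:
        # Carbon first, hydrogen second, every other element alphabetically.
        order = ["C"]
        if "H" in counts:
            order.append("H")
        order += sorted(el for el in counts if el not in ("C", "H"))
    else:
        # Usual convention without carbon (assumption): strictly alphabetical.
        order = sorted(counts)
    return "".join(el + (str(counts[el]) if counts[el] > 1 else "") for el in order)

# Hand-tallied element counts for the slide's example, H3C(CH2)8CH2NH2.
example_counts = {"H": 23, "N": 1, "C": 10}
print(hill_formula(example_counts))  # C10H23N
```

Because the output does not depend on the order in which the elements were supplied, two people writing the same composition differently end up with the same string, which is the point of a canonical form.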
8Representing chemical structures on paper
- Archibald Scott Couper
- Lines for bonds
- Couper, A.S. (1858) Philosophical Magazine, 16,
104-116
9Representing chemical structures on paper
- The corrected constitutional formula for hydrated
glucose (C6H14O7)
- Couper, A.S. (1858) Philosophical Magazine, 16,
104-116
- Note that the oxygen is still drawn doubled up
10Representing chemical structures.
- Crum Brown, A., 1867, Transactions of the Royal
Society of Edinburgh, 24, 331-339
- Even had reduced graph representations which
treated the ring carbon atoms as equivalent.
11Representing chemical structures.
- Crum Brown, A., 1867, Transactions of the Royal
Society of Edinburgh, 24, 331-339
- Essentially a graph or topological representation
concentrating on the connectivity
12Linking activity to structure
- Crum Brown, A., Fraser, T., 1868, Transactions
of the Royal Society of Edinburgh, 24, 151, 693
- Changes in biological activity were linked to
small changes in structure.
- A very early example of linking information in
the biology domain to information in the chemical
domain
- Perhaps should be regarded as the birth of
Chemoinformatics
Φ = f(C), where Φ is a compound's
physiological action and C is a compound's
chemical constitution
13Representing stereochemistry
- Three-dimensional paper models used by Jacobus
Henricus van't Hoff in communicating his
stereo-chemical ideas.
Courtesy O. Bertrand Ramsay.
14Representing molecules on paper.
- Arthur Cayley
- "kenograms", which were alkane tree graphs;
these were used in the enumeration of the alkane
structural isomers.
- Cayley, A. (1874) Philosophical Magazine 47, 444
Dennis Rouvray's "The Origins of Chemical
Graph Theory", Mathematical Chemistry 1 (1991), 1-39
15Frederick Beilstein
- In the Handbuch the naming of compounds was an
integral part of their storage and retrieval from
indexes.
- This drive for efficient indexing dominated
chemical information for the next half century.
Frederick Beilstein. Courtesy Edgar Fahs Smith
Memorial Collection, Dept. of Special
Collections, University of Pennsylvania Library.
16Mechanisation of structure input
- With the increased use of the typewriter to
input data for indexing, there was a need to
express structures using only the characters
available on a standard keyboard.
17A primitive Chemical Information System
- The edge-notched card and puncher could be
regarded as the start of modern chemoinformatics.
- Records of structure and properties were selected
on the basis of a binary pattern on the card edge.
Courtesy Claire Schultz
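Selecting records by a binary edge pattern maps directly onto bit masks. The sketch below is only an analogy in Python: the property names, their positions on the card edge, and the card contents are all invented for illustration and do not come from any historical coding scheme.

```python
# Hypothetical assignment of properties to positions on the card edge.
NOTCH = {"aromatic ring": 0, "nitrogen": 1, "sulfur": 2, "halogen": 3}

def encode(properties):
    """Return the binary edge pattern (as an int) for a set of property names."""
    pattern = 0
    for name in properties:
        pattern |= 1 << NOTCH[name]
    return pattern

def select(cards, required):
    """Return the names of cards whose pattern contains every required position."""
    mask = encode(required)
    return [name for name, pattern in cards.items() if (pattern & mask) == mask]

# Made-up deck of three cards.
cards = {
    "card A": encode({"aromatic ring", "nitrogen"}),
    "card B": encode({"aromatic ring", "nitrogen", "sulfur"}),
    "card C": encode({"halogen"}),
}
print(select(cards, {"aromatic ring", "nitrogen"}))  # ['card A', 'card B']
```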
18Wiswesser line notation
- A-Z
- Only upper case to distinguish 1 (one) from l
(ell)
- worked well with early computers
- 0-9
- zero usually had a backslash overprinted(!) to
distinguish it from O
- <space> / -
19Wiswesser line notation.
20Wiswesser line notation.
21Wiswesser line notation.
22Wiswesser line notation.
23Line notations are part of the mainstream of
chemical indexing
- Members of IUPAC Commission, appointed to study
chemical notation systems, meeting informally at
MIT (1951).
Front row (left to right) E.W. Scott, Harriet
Geer, Alice Fitzmaurice, Madeline Berry, John
Fletcher. Second row E.H. Huntress, Karl
Heumann, William J. Wiswesser, F. Richter,
Charles Bernier, Howard Nutting. Back row P.E.
Verkade, G. Malcolm Dyson, James W. Perry, Austin
M. Patterson, I.B. Johns, Paul Arthur, Eric H.
Pietsch, Franz Leiss.
Courtesy Madeline M. (Berry) Henderson.
25Machine readable structure representation.
- Line Notations
- WLN
- L66J BMR DSWQ IN1&1
- SMILES
- c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
- ROSDAL
- 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
26Wiswesser Line Notation
- Note that WLN, unlike the other systems, does not
have bonds.
- It is a fragment/group based system.
- This reflected Wiswesser's view that molecular
orbital (MO) representations would replace the
valence bond (VB) model.
27Molecules in a computer.
- Represented as a graph
- Nodes are atoms
- Edges are bonds
- Searching
- Subgraph isomorphism
- Unique ordering
- Canonicalization
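The graph view above can be sketched in Python. This is a deliberately naive illustration rather than how any production system does it: the helper names `make_graph` and `find_subgraph` are invented here, hydrogens are left implicit, and the search is plain backtracking with none of the pruning real substructure-search engines rely on.

```python
def make_graph(atoms, bonds):
    """atoms: list of element symbols (nodes); bonds: list of (i, j, order) edges."""
    adj = {i: {} for i in range(len(atoms))}
    for i, j, order in bonds:
        adj[i][j] = order
        adj[j][i] = order
    return atoms, adj

def find_subgraph(query, target):
    """Return one mapping {query atom index: target atom index}, or None."""
    q_atoms, q_adj = query
    t_atoms, t_adj = target

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return dict(mapping)
        q = len(mapping)  # place query atoms in index order
        for t in range(len(t_atoms)):
            if t in mapping.values() or q_atoms[q] != t_atoms[t]:
                continue
            # Each bond from q to an already placed query atom must also be
            # present in the target, with the same bond order.
            if all(t_adj[t].get(mapping[n]) == order
                   for n, order in q_adj[q].items() if n in mapping):
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]
        return None

    return extend({})

# Ethanol as a hydrogen-suppressed graph: C-C-O.
ethanol = make_graph(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
c_o_single = make_graph(["C", "O"], [(0, 1, 1)])
c_o_double = make_graph(["C", "O"], [(0, 1, 2)])

print(find_subgraph(c_o_single, ethanol))  # {0: 1, 1: 2}
print(find_subgraph(c_o_double, ethanol))  # None
```

The same atoms and bonds could have been numbered in any order, which is why a unique ordering (canonicalization) is needed before two stored structures can be compared as plain strings.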
28CAOCI and CROSSBOW
- By making use of WLN, compounds from disparate
sources were brought together into one big
database.
- The structures were normalized and put in a
hierarchy by hand.
- Available on microfiche/hard copy
- Electronically, substructure searchable via
CROSSBOW (ICI)
- Connection tables needed to be generated to
allow it to be used in atom-by-atom searching.
29Line notations are dead
- Using a light pen and computer to draw a chemical
structure (1976).
Courtesy Chemical Abstracts
30Machine readable structure representation.
  -ISIS-  02110115552D
 24 26  0  0  0  0  0  0  0  0999 V2000
   -1.7931    0.6000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7984   -0.2273    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0851   -0.6459    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0786    1.0072    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3647    0.5965    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3692   -0.2315    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3450   -0.6497    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0642   -0.2368    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0648    0.5943    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3499    1.0046    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3375   -1.4750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.0542   -1.8875    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0504   -2.7160    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7662   -3.1326    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4807   -2.7176    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4790   -1.8859    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7668   -1.4772    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7792    1.0125    0.0000 S   0  0  3  0  0  0  0  0  0  0  0  0
   -2.5208   -0.6375    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
   -2.5250   -1.4667    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.2417   -0.2208    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4824    1.4383    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.3586    1.7304    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.3625    0.4292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
31Machine readable structure representation.
  2  3  1  0  0  0  0
 11 12  1  0  0  0  0
  5  6  1  0  0  0  0
 12 13  2  0  0  0  0
  3  6  2  0  0  0  0
 13 14  1  0  0  0  0
  6  7  1  0  0  0  0
 14 15  2  0  0  0  0
  1  2  2  0  0  0  0
 15 16  1  0  0  0  0
  7  8  2  0  0  0  0
 16 17  2  0  0  0  0
 17 12  1  0  0  0  0
  5  4  2  0  0  0  0
  9 18  1  0  0  0  0
  8  9  1  0  0  0  0
  2 19  1  0  0  0  0
  4  1  1  0  0  0  0
 19 20  1  0  0  0  0
  9 10  2  0  0  0  0
 19 21  1  0  0  0  0
 10  5  1  0  0  0  0
 18 22  2  0  0  0  0
 18 23  2  0  0  0  0
  7 11  1  0  0  0  0
 18 24  1  0  0  0  0
M END
32The rebirth of line notations
- With the increasing need to manipulate and store
large numbers of chemical structures, line
notations offer great advantages. In general they
occupy 1% of the disc space required by a
connection table.
- The newer line notations with a linguistic
structure, such as SMILES and SMARTS, can make use
of the character handling capabilities of modern
computers.
- Machine readable/writable
- Parse from left to right in a single pass
- Human understandable
- Have the capacity to code chemical concepts
- Can describe reactions and can be used to carry
out in silico chemistry
- Can be stored in regular RDBMSs or documents.
33Languages
- Languages with defined grammar and syntax are the
bedrock of the Daylight system.
- The lexical representations of the underlying
toolkit objects are expressed in these languages.
- NB All Daylight objects can be represented as
strings, which allows them to be transmitted
between processes without loss of information.
34SMILES
- SMILES contains the same information as might be
found in an extended connection table.
- The primary reason SMILES is more useful than a
connection table is that it is a linguistic
construct, rather than a computer data structure.
- SMILES is a true language, albeit with a simple
vocabulary (atom and bond symbols) and only a few
grammar rules.
- SMILES can be canonicalised, i.e. there is a
unique, universal name for a structure.
- SMILES representations of structure can in turn
be used as words in the vocabulary of other
languages designed for storage and retrieval of
chemical information, e.g. HTML, XML or query
languages such as SQL.
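As a small illustration of a SMILES string acting as a word inside SQL, the sketch below stores structures in an in-memory SQLite table keyed by their SMILES. The table and column names are invented for this example, and it assumes the strings were already canonicalised by some toolkit before being stored; nothing here performs canonicalization.

```python
import sqlite3

# Hypothetical schema: the (assumed canonical) SMILES string is the key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compound (smiles TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO compound VALUES (?, ?)",
    [("CCO", "ethanol"), ("c1ccccc1", "benzene")],
)

def lookup(smiles):
    """Exact-match lookup of a structure by its SMILES string."""
    row = conn.execute(
        "SELECT name FROM compound WHERE smiles = ?", (smiles,)
    ).fetchone()
    return row[0] if row else None

print(lookup("CCO"))  # ethanol
print(lookup("OCC"))  # None: same molecule, but not the stored spelling
```

The second lookup fails even though OCC and CCO describe the same molecule, which is why a unique name per structure matters once strings are used as database keys.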
35SMILES syntax
atom bond atom etc.
atom: [ <mass> symbol <chiral> <hcount> <sign<charge>> <class> ]
bond: <empty>  -  =  #  :  .
Common elements, in the organic subset
B,C,N,O,P,S,F,Cl,Br,I, in their lowest common
valence state(s), can be written without
brackets. If bonds are omitted, they default to
single or aromatic, as appropriate, for
juxtaposed atoms.
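To show how little machinery a single left-to-right pass needs, here is a minimal tokenizer sketch for a subset of SMILES. It is an illustration under stated assumptions, not the Daylight parser: the token categories and names are chosen for this example, bracket atoms are kept as opaque strings, and omitted bonds are not filled in with their single or aromatic defaults.

```python
import re

# Token categories for a SMILES subset (names are illustrative only).
TOKEN_SPEC = [
    ("bracket_atom", r"\[[^\]]+\]"),             # e.g. [NH4+], [13C]
    ("atom", r"Br|Cl|[BCNOPSFI]|[bcnops]"),      # organic subset, aliphatic or aromatic
    ("bond", r"[-=#:./\\]"),
    ("ring", r"%\d\d|\d"),                       # ring-closure digits
    ("branch", r"[()]"),
]
TOKEN = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(smiles):
    """Split a SMILES string into (kind, text) pairs in one left-to-right pass."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("CC(=O)O"))
print(tokenize("[NH4+].[Cl-]"))
```

Two-letter symbols are listed before the single letters so that Cl is read as chlorine rather than as carbon followed by an unknown character.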
36Example SMILES
37Analyzing Molecules
- Explicit properties
- Those required to completely specify the graph of
a molecule
- atoms and their properties
- bonds and their properties
- Derived properties
- Those properties derived from the graph
- These may alter explicit properties e.g. bonds
38Derived properties
39Derived properties
40Beyond the structure diagram
- A chemical language should be able to represent
more than a structure diagram as it is not
constrained by the printed page.
- SMARTS is such a language as it can express
chemical concepts.
41SMARTS
- In the SMILES language, there are two fundamental
types of symbols: atoms and bonds. Using these
SMILES symbols, one can specify a molecule's
graph (its "nodes" and "edges") and assign
"labels" to the components of the graph (that is,
say what type of atom each node represents, and
what type of bond each edge represents).
- The same is true in SMARTS: one uses atomic and
bond symbols to specify a graph. However, in
SMARTS the labels for the graph's nodes and edges
(its "atoms" and "bonds") are extended to include
"logical operators" and special atomic and bond
symbols; these allow SMARTS atoms and bonds to be
more general. For example, the SMARTS atomic
symbol [C,N] is an atom that can be aliphatic C
or aliphatic N; the SMARTS bond symbol "~"
(tilde) matches any bond.
42Example SMARTS
43Useful SMARTS
Rotatable bonds [!$(*#*)&!D1]-&!@[!$(*#*)&!D1]
Secondary amides [NH1;D2]-!@[#6X3]
H-donors [!#6;!H0]
H-acceptors [$([!#6;+0]);!$([F,Cl,Br,I]);!$([o,s,nX3]);!$([Nv5,Pv5,Sv4,Sv6])]
Isolating carbons [#6;!$(C(F)(F)F);!$(c(:[!c]):[!c]);!$([#6]=,#[!#6]);!$([#6;!+0])]
Stereo atoms [$([X4;!v6;!v5;H0,H1]),$([SX3]([#6])([#6])=O)]
Stereo bonds [CX3;!H2]=[CX3;!H2]
Stereo allenes [CX3;H0]=C=[CX3;H0,H1]
44Rotatable bonds [!$(*#*)&!D1]-&!@[!$(*#*)&!D1]
- An atom which is
- NOT triply bonded to another atom
- AND NOT 1-connected ( I.e. Not terminal )
- Bonded by
- A single bond
- AND NOT a ring bond
- to the same type of atom
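The rule spelled out above can be checked by hand on a small graph. The sketch below applies it in plain Python without any SMARTS engine, so it illustrates the logic rather than replacing pattern matching: the function name and input format are invented here, the graph is assumed to be hydrogen-suppressed, and bond orders are plain integers (aromatic bonds would need their own code).

```python
def rotatable_bonds(n_atoms, bonds):
    """bonds: list of (i, j, order). Return the (i, j) pairs that satisfy the rule."""
    adj = {i: {} for i in range(n_atoms)}
    for i, j, order in bonds:
        adj[i][j] = order
        adj[j][i] = order

    def eligible(a):
        # NOT triply bonded to another atom AND NOT 1-connected (terminal).
        return len(adj[a]) > 1 and 3 not in adj[a].values()

    def in_ring(i, j):
        # Bond i-j is a ring bond if j can still be reached from i
        # when the bond itself is ignored.
        seen, stack = {i}, [i]
        while stack:
            a = stack.pop()
            for b in adj[a]:
                if (a, b) in ((i, j), (j, i)) or b in seen:
                    continue
                if b == j:
                    return True
                seen.add(b)
                stack.append(b)
        return False

    return [(i, j) for i, j, order in bonds
            if order == 1 and eligible(i) and eligible(j) and not in_ring(i, j)]

# Butane, C-C-C-C: only the central bond qualifies.
print(rotatable_bonds(4, [(0, 1, 1), (1, 2, 1), (2, 3, 1)]))  # [(1, 2)]
```

The terminal C-C bonds are excluded because spinning a methyl group does not produce a distinct conformation, which is what the "NOT 1-connected" condition captures.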
45Berzelius revisited
- ... chemical signs ought to be letters, for the
greater facility of writing, ... the printed
book...
- J.J. Berzelius Annals of Philosophy (1813) 2,
443-454
- ...chemical structures and concepts should be
represented as character strings for the greater
facility of electronic storage, transmission,
searching and retrieval of information
- Mug01, Santa Fe
47Supplementary slides
48Machine reading of information
- Hans Peter Luhn demonstrating a mock-up of an IBM
card used in his scanner (1952).
Courtesy IBM.
49Handling large amounts of data
- IBM 101 Statistical Machine (ca. 1952).
Courtesy IBM.
50Handling large amounts of data
- REMINGTON RAND COUNTING SORTER 221 (ca. 1952)
Courtesy Unisys
51Rotatable bonds [!$(*#*)&!D1]-&!@[!$(*#*)&!D1]
- An atom which is
- NOT triply bonded to another atom
- AND NOT 1-connected ( I.e. Not terminal )
- Bonded by
- A single bond
- AND NOT a ring bond
- to the same type of atom
52Counting things
- Count matches to patterns defined in SMARTS
- Molecular formula
- H-donors
- H-acceptors
- Rotatable bonds
- Chiral centres
- Rings
- Fragments
53Example
- Molecular formula C13H22N4O3S
- H-donors 2
- H-acceptors 6
- Rotatable bonds 8
- Chiral centres 1
- Rings 1
- Fragments 6
54Estimating Measured Properties
- Any property which is an additive constitutive
property of a molecule can be calculated by
- counting the matches of the constituent patterns
- looking up the weight for each pattern
- summing the products of the count and individual
pattern weights.
- applying any correction factors
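The recipe above amounts to a weighted sum plus corrections. A minimal sketch follows; the function name, the pattern labels and all the numbers are placeholders invented for illustration, and the corrections are assumed to be simple additive terms, which the slide does not specify.

```python
def additive_estimate(counts, weights, corrections=()):
    """Sum count * weight over all patterns, then add any correction terms."""
    total = sum(n * weights[pattern] for pattern, n in counts.items())
    return total + sum(corrections)

# Placeholder data: two patterns with made-up weights and one made-up correction.
counts = {"pattern_a": 2, "pattern_b": 3}
weights = {"pattern_a": 1.5, "pattern_b": -0.5}
print(additive_estimate(counts, weights, corrections=[0.25]))  # 1.75
```

In practice the counts would come from matching SMARTS patterns against the molecule and the weights from a lookup table fitted to measured values.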
55Examples of properties to calculate
- Molecular Weight
- logP
- Parachor
- Molar Volume
- Molar Refractivity
- ...
56Molecular weight a simple example
- Molecular weight
- Molecular formula
- Σ(count(atom(i)) × atomic_weight(atom(i)))
- Accuracy depends on accuracy of atomic weights
(IUPAC)
- C13H22N4O3S
- 314.45 (average molecular weight)
- 314.141235 (accurate mass of commonest isotope)
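The sum above is easy to reproduce for the example formula. The sketch below is only illustrative: the two weight tables hold rounded values typed in for this example, so the results depend on those values and need not match the slide's figures to the last digit. With these tables the average weight lands within about 0.05 of the slide's value and the monoisotopic mass agrees to about three decimals.

```python
import re

# Rounded standard atomic weights (assumed values; the exact IUPAC table matters).
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}
# Rounded masses of the most abundant isotope of each element (assumed values).
MONOISOTOPIC = {"H": 1.007825, "C": 12.0, "N": 14.003074, "O": 15.994915, "S": 31.972071}

def parse_formula(formula):
    """Turn a flat formula such as 'C13H22N4O3S' into {element: count}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts

def weight(formula, table):
    """Sum count * per-atom weight over the elements of the formula."""
    return sum(n * table[el] for el, n in parse_formula(formula).items())

formula = "C13H22N4O3S"
print(round(weight(formula, ATOMIC_WEIGHT), 2))
print(round(weight(formula, MONOISOTOPIC), 4))
```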
57CLOGP A more complicated example
- Algorithmic definition of fragment
- Pattern NOT an isolating carbon
- Match the pattern to find all the fragments
- Look up the fragment value(s) (if they exist)
using the unique string(s) from the match.
- Accumulate the values for fragments and
non-fragments (isolating carbons).
- Correct for proximity
58CLOGP example
- 2 Cl 1.880
- guanidyl 1.930
- 2 C 0.390
- 6 c 0.780
- 7 H 1.589
- Proximity 0.984
- Total 1.727
59Estimating values for concepts
- Flexibility
- Ratio of number of rotatable bonds to total
number of bonds
- Rigidity
- Molecular similarity between original molecule
and molecules formed by breaking all rotatable
bonds
- Difficulty of synthesis
- Ratio of number of potential chiral centres
weighted for rings to total number of heavy atoms
in a molecule
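Of the three concepts, flexibility is the simplest to write down. The sketch below uses the rotatable-bond count of 8 from the earlier counting example; the total of 21 bonds is an assumption made here (a hydrogen-suppressed graph with 21 heavy atoms and one ring), not a figure given in the slides, and the function name is invented for illustration.

```python
def flexibility(n_rotatable, n_bonds):
    """Ratio of rotatable bonds to total bonds; 0.0 for a molecule with no bonds."""
    return n_rotatable / n_bonds if n_bonds else 0.0

# 8 rotatable bonds (from the counting example) over an assumed 21 bonds.
print(round(flexibility(8, 21), 2))  # 0.38
```

Rigidity and difficulty of synthesis need a similarity measure and a ring weighting, respectively, that the slides do not define, so they are not sketched here.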
60Example
- Flexibility 0.38
- Rigidity 0.3819
- Difficulty of synthesis 0.05
61Example
- Flexibility 0.38 (0.00)
- Rigidity 0.3819 (1.00)
- Difficulty of synthesis 0.05 (0.85)
- Figures in parentheses for morphine