1
Strings and Things
  • A brief history of chemical languages

John Bradshaw, Daylight CIS Inc.,
Cambridge, johnb@daylight.com
2
Antoine Lavoisier
  • ...we cannot improve the language of any
    science, without, at the same time improving the
    science itself...
  • 1787

ANTOINE LAURENT LAVOISIER Courtesy Edgar Fahs
Smith Memorial Collection, Dept. of Special
Collections, University of Pennsylvania Library
3
Representing compounds as pictures
  • Early chemical symbols and those created by Jean
    Henri Hassenfratz and Pierre Auguste Adet to
    complement the Méthode de nomenclature chimique
    (1787).

Courtesy Edgar Fahs Smith Memorial Collection,
Dept. of Special Collections, University of
Pennsylvania Library
4
Representing compounds as pictures
  • Dalton's symbols (1808)

5
Representing compounds as character strings
  • Jöns Jacob Berzelius
  • "... chemical signs ought to be letters, for the
    greater facility of writing, ... the printed
    book..." J.J. Berzelius, Annals of Philosophy
    (1813) 2, 443-454
  • Compounds should be named by what they are, not
    from their origins, e.g. Äpfelsäure (malic acid,
    literally "apple acid")
  • These names were not intended for labelling
    bottles in the laboratory

6
Representing compounds as character strings
  • The chemical sign should be made up from the
    initial letter of the Latin name of each
    elementary substance subject to rules
  • C carbonicum
  • Co cobaltum
  • O oxygen
  • Os osmium
  • and digits to indicate the number of volumes in
    a substance.
  • There was no fixed (canonical) order for the
    elements.

7
Hill Order
  • Hill notation is a standard way of writing the
    formula for any chemical compound, i.e. it is
    canonicalized (canons = rules)
  • Example: H3C(CH2)8CH2NH2
  • Count all carbon atoms and record: C10
  • Count all hydrogen atoms and record: C10H23
  • Determine all other elements in compound. Count
    each and record in alphabetical order. C10H23N
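
The counting steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the names count_atoms and hill_formula are made up for this example, the parser only understands element symbols, digits and parentheses, and the rule for carbon-free formulas (all elements alphabetical, hydrogen included) is the usual Hill convention rather than something stated on the slide.

```python
import re
from collections import Counter


def count_atoms(formula: str) -> Counter:
    """Count atoms in a condensed formula such as 'H3C(CH2)8CH2NH2'."""
    tokens = re.findall(r"[A-Z][a-z]?|\d+|[()]", formula)
    stack = [Counter()]   # one counter per open parenthesis level
    last = None           # element or group that a following digit multiplies
    for tok in tokens:
        if tok == "(":
            stack.append(Counter())
            last = None
        elif tok == ")":
            group = stack.pop()
            stack[-1].update(group)   # count the group once
            last = group
        elif tok.isdigit():
            if last is None:
                raise ValueError("multiplier with nothing to multiply")
            # The element or group was already counted once; add the rest.
            for sym, n in last.items():
                stack[-1][sym] += n * (int(tok) - 1)
            last = None
        else:
            stack[-1][tok] += 1
            last = Counter({tok: 1})
    return stack[0]


def hill_formula(formula: str) -> str:
    """Return the formula in Hill order: C, then H, then the rest alphabetically."""
    counts = count_atoms(formula)
    if "C" in counts:
        rest = sorted(s for s in counts if s not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)   # no carbon: strictly alphabetical
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in order)


print(hill_formula("H3C(CH2)8CH2NH2"))
```

For the slide's example this prints C10H23N, the same string reached by the manual count.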

8
Representing chemical structures on paper
  • Archibald Scott Couper
  • Lines for bonds
  • Couper, A.S. (1858) Philosophical Magazine, 16,
    104-116

9
Representing chemical structures on paper
  • The corrected constitutional formula for hydrated
    glucose (C6H14O7)
  • Couper, A.S. (1858) Philosophical Magazine, 16,
    104-116
  • Note that the oxygen is still drawn doubled up

10
Representing chemical structures.
  • Crum Brown, A., 1867, Transactions of the Royal
    Society of Edinburgh, 24, 331-339
  • He even had reduced graph representations which
    treated the ring carbon atoms as equivalent.

11
Representing chemical structures.
  • Crum Brown, A., 1867, Transactions of the Royal
    Society of Edinburgh, 24, 331-339
  • Essentially a graph or topological representation
    concentrating on the connectivity

12
Linking activity to structure
  • Crum Brown, A., Fraser, T., 1868, Transactions
    of the Royal Society of Edinburgh, 24, 151, 693
  • Changes in biological activity were linked to
    small changes in structure.
  • A very early example of linking information in
    the biology domain to information in the chemical
    domain
  • Perhaps this should be regarded as the birth of
    Chemoinformatics

Φ = f(C), where Φ is a compound's
physiological action and C is a compound's
chemical constitution
13
Representing stereochemistry
  • Three-dimensional paper models used by Jacobus
    Henricus van't Hoff in communicating his
    stereo-chemical ideas.

Courtesy O. Bertrand Ramsay.
14
Representing molecules on paper.
  • Arthur Cayley
  • "Kenograms", which were alkane tree graphs;
    these were used in the enumeration of the alkane
    structural isomers.
  • Cayley, A. (1874) Philosophical Magazine 47, 444

Dennis Rouvray, "The Origins of Chemical
Graph Theory", Mathematical Chemistry 1 (1991), 1-39
15
Frederick Beilstein
  • In the Handbuch the naming of compounds was an
    integral part of their storage and retrieval from
    indexes.
  • This drive for efficient indexing dominated
    chemical information for the next half century.

Frederick Beilstein. Courtesy Edgar Fahs Smith
Memorial Collection, Dept. of Special
Collections, University of Pennsylvania Library.
16
Mechanisation of structure input
  • With the increased use of the typewriter to
    input data for indexing, there was a need to
    express structures using only the characters
    available on a standard keyboard.

17
A primitive Chemical Information System
  • The edge-notched card and puncher could be
    regarded as the start of modern chemoinformatics.
  • Records of structure and properties were selected
    on the basis of a binary pattern on the card edge.

Courtesy Claire Schultz
18
Wiswesser line notation
  • A-Z
  • Only upper case, to distinguish 1 (one) from l
    (ell)
  • Worked well with early computers
  • 0-9
  • Zero usually had a backslash overprinted (!) to
    distinguish it from O
  • <space> / -

19
Wiswesser line notation.
  • L66J BMR DSWQ IN11

23
Line notations are part of the mainstream of
chemical indexing
  • Members of IUPAC Commission, appointed to study
    chemical notation systems, meeting informally at
    MIT (1951).

Front row (left to right): E.W. Scott, Harriet
Geer, Alice Fitzmaurice, Madeline Berry, John
Fletcher. Second row: E.H. Huntress, Karl
Heumann, William J. Wiswesser, F. Richter,
Charles Bernier, Howard Nutting. Back row: P.E.
Verkade, G. Malcolm Dyson, James W. Perry, Austin
M. Patterson, I.B. Johns, Paul Arthur, Eric H.
Pietsch, Franz Leiss.
Courtesy Madeline M. (Berry) Henderson.
25
Machine readable structure representation.
  • Line Notations
  • WLN
  • L66J BMR DSWQ IN11
  • SMILES
  • c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
  • ROSDAL
  • 1-5-105,10-1,1-11N-12-1712,3-18S-19O,1820O,1821O,8-22N-23,22-24

26
Wiswesser Line Notation
  • Note that WLN, unlike the other systems, does not
    have bonds.
  • It is a fragment/group based system.
  • This reflected Wiswesser's view that molecular
    orbital (MO) representations would replace the
    valence bond (VB) model.

27
Molecules in a computer.
  • Represented as a graph
  • Nodes are atoms
  • Edges are bonds
  • Searching
  • Subgraph isomorphism
  • Unique ordering
  • Canonicalization

28
CAOCI and CROSSBOW
  • By making use of WLN, compounds from disparate
    sources were brought together into one big
    database.
  • The structures were normalized and put in a
    hierarchy by hand.
  • Available on microfiche/hard copy
  • Electronically, substructure searchable via
    CROSSBOW (ICI)
  • Connection tables needed to be generated to
    allow it to be used in atom-by-atom searching.

29
Line notations are dead
  • Using a light pen and computer to draw a chemical
    structure (1976).

Courtesy Chemical Abstracts
30
Machine readable structure representation.
  -ISIS-  02110115552D
 24 26  0  0  0  0  0  0  0  0999 V2000
   -1.7931    0.6000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7984   -0.2273    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0851   -0.6459    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0786    1.0072    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3647    0.5965    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3692   -0.2315    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3450   -0.6497    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0642   -0.2368    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0648    0.5943    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3499    1.0046    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3375   -1.4750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.0542   -1.8875    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0504   -2.7160    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7662   -3.1326    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4807   -2.7176    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4790   -1.8859    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7668   -1.4772    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7792    1.0125    0.0000 S   0  0  3  0  0  0  0  0  0  0  0  0
   -2.5208   -0.6375    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
   -2.5250   -1.4667    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.2417   -0.2208    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4824    1.4383    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.3586    1.7304    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.3625    0.4292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
31
Machine readable structure representation.
  2  3  1  0  0  0  0
 11 12  1  0  0  0  0
  5  6  1  0  0  0  0
 12 13  2  0  0  0  0
  3  6  2  0  0  0  0
 13 14  1  0  0  0  0
  6  7  1  0  0  0  0
 14 15  2  0  0  0  0
  1  2  2  0  0  0  0
 15 16  1  0  0  0  0
  7  8  2  0  0  0  0
 16 17  2  0  0  0  0
 17 12  1  0  0  0  0
  5  4  2  0  0  0  0
  9 18  1  0  0  0  0
  8  9  1  0  0  0  0
  2 19  1  0  0  0  0
  4  1  1  0  0  0  0
 19 20  1  0  0  0  0
  9 10  2  0  0  0  0
 19 21  1  0  0  0  0
 10  5  1  0  0  0  0
 18 22  2  0  0  0  0
 18 23  2  0  0  0  0
  7 11  1  0  0  0  0
 18 24  1  0  0  0  0
M  END
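
The bond records of this connection table are already an edge list, so the "nodes are atoms, edges are bonds" view can be sketched directly. The Python below is only an illustration: it keeps the first three fields of each bond record (atom, atom, bond order), and the helper names read_bonds, adjacency and count_components are invented for this example. The ring count uses the standard graph relation rings = bonds - atoms + components.

```python
from collections import defaultdict

# Bond records from the connection table: first atom, second atom, bond order.
BOND_BLOCK = """
2 3 1
11 12 1
5 6 1
12 13 2
3 6 2
13 14 1
6 7 1
14 15 2
1 2 2
15 16 1
7 8 2
16 17 2
17 12 1
5 4 2
9 18 1
8 9 1
2 19 1
4 1 1
19 20 1
9 10 2
19 21 1
10 5 1
18 22 2
18 23 2
7 11 1
18 24 1
"""


def read_bonds(block):
    """Turn the text records into (atom, atom, order) tuples."""
    return [tuple(int(x) for x in line.split()[:3])
            for line in block.strip().splitlines()]


def adjacency(bonds):
    """Build an undirected graph: atom -> {neighbour: bond order}."""
    graph = defaultdict(dict)
    for a, b, order in bonds:
        graph[a][b] = order
        graph[b][a] = order
    return graph


def count_components(graph):
    """Count connected pieces with a depth-first walk."""
    seen, components = set(), 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph[node])
    return components


bonds = read_bonds(BOND_BLOCK)
graph = adjacency(bonds)
n_atoms, n_bonds = len(graph), len(bonds)
n_rings = n_bonds - n_atoms + count_components(graph)
print(n_atoms, n_bonds, n_rings)
```

The counts agree with the "24 26" in the header line of the table, and the three rings are the naphthalene pair plus the phenyl ring. Substructure search and canonicalization operate on this same graph; they are not attempted here.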
32
The rebirth of line notations
  • With the increasing need to manipulate and store
    large numbers of chemical structures, line
    notations offer great advantages. In general they
    occupy 1% of the disc space required by a
    connection table.
  • The newer line notations with a linguistic
    structure, such as SMILES and SMARTS, can make use
    of the character handling capabilities of modern
    computers.
  • Machine readable/writable
  • Parse from left to right in a single pass
  • Human understandable
  • Have the capacity to code chemical concepts
  • Can describe reactions and can be used to carry
    out in silico chemistry
  • Can be stored in regular RDBMSs or documents.

33
Languages
  • Languages with defined grammar and syntax are the
    bedrock of the Daylight system.
  • The lexical representations of the underlying
    toolkit objects are expressed in these languages.
  • NB: All Daylight objects can be represented as
    strings, which allows them to be transmitted
    between processes without loss of information.

34
SMILES
  • SMILES contains the same information as might be
    found in an extended connection table.
  • The primary reason SMILES is more useful than a
    connection table is that it is a linguistic
    construct, rather than a computer data structure.
  • SMILES is a true language, albeit with a simple
    vocabulary (atom and bond symbols) and only a few
    grammar rules.
  • SMILES can be canonicalised, i.e. there is a
    unique, universal name for a structure.
  • SMILES representations of structure can in turn
    be used as words in the vocabulary of other
    languages designed for storage and retrieval of
    chemical information, e.g. HTML, XML or query
    languages such as SQL.

35
SMILES syntax
SMILES: atom bond atom etc.
atom: [<mass> symbol <chiral> <hcount> <sign<charge>> <class>]
bond: <empty> | - | = | # | : | .
Common elements in the organic subset
(B, C, N, O, P, S, F, Cl, Br, I), in their lowest
common valence state(s), can be written without
brackets. If bonds are omitted, they default to
single or aromatic, as appropriate, for
juxtaposed atoms.
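
As a rough illustration of the single left-to-right pass, the Python sketch below splits a SMILES string into atom, bond, ring-closure and branch tokens. It is an assumption-laden toy, not the Daylight parser: the regular expression and the name tokenize are invented here, bracket atoms are kept as opaque strings, and stereo bond symbols are not handled. The example string is the slide-25 structure with the S=O double bonds written out, as in the connection table.

```python
import re

# One alternative per token class; "Cl" and "Br" are tried before single letters.
TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops])"
    r"|(?P<bond>[-=#:.])"
    r"|(?P<ring>%\d\d|\d)"
    r"|(?P<branch>[()])"
)


def tokenize(smiles):
    """Split a SMILES string into (kind, text) pairs in one pass."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3")
n_atom_tokens = sum(1 for kind, _ in tokens if kind == "atom")
print(n_atom_tokens)
```

The 24 atom tokens match the 24 atoms of the connection table for the same structure. A real parser would go on to resolve ring-closure digits and branches into bonds.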
36
Example SMILES
37
Analyzing Molecules
  • Explicit properties
  • Those required to completely specify the graph of
    a molecule
  • atoms and their properties
  • bonds and their properties
  • Derived properties
  • Those properties derived from the graph
  • These may alter explicit properties e.g. bonds

38
Derived properties
39
Derived properties
40
Beyond the structure diagram
  • A chemical language should be able to represent
    more than a structure diagram as it is not
    constrained by the printed page.
  • SMARTS is such a language as it can express
    chemical concepts.

41
SMARTS
  • In the SMILES language, there are two fundamental
    types of symbols: atoms and bonds. Using these
    SMILES symbols, one can specify a molecule's
    graph (its "nodes" and "edges") and assign
    "labels" to the components of the graph (that is,
    say what type of atom each node represents, and
    what type of bond each edge represents).
  • The same is true in SMARTS: one uses atomic and
    bond symbols to specify a graph. However, in
    SMARTS the labels for the graph's nodes and edges
    (its "atoms" and "bonds") are extended to include
    "logical operators" and special atomic and bond
    symbols; these allow SMARTS atoms and bonds to be
    more general. For example, the SMARTS atomic
    symbol [C,N] is an atom that can be aliphatic C
    or aliphatic N; the SMARTS bond symbol "~"
    (tilde) matches any bond.

42
Example SMARTS
43
Useful SMARTS
  • Rotatable bonds: [!$(*#*)&!D1]-&!@[!$(*#*)&!D1]
  • Secondary amides: NH1D2-!@6X3
  • H-donors: !6!H0
  • H-acceptors: (!60)!(F,Cl,Br,I)!(o,s,nX3)!(Nv5,Pv5,Sv4,Sv6)
  • Isolating carbons: 6!(C(F)(F)F)!(c(!c)!c)!(6,!6)!(6!0)
  • Stereo atoms: (X4!v6!v5H0,H1),(SX3(6)(6)O)
  • Stereo bonds: CX3!H2CX3!H2
  • Stereo allenes: CX3H0CCX3H0,H1
44
Rotatable bonds: [!$(*#*)&!D1]-&!@[!$(*#*)&!D1]
  • An atom which is
  • NOT triply bonded to another atom
  • AND NOT 1-connected (i.e. not terminal)
  • Bonded by
  • A single bond
  • AND NOT a ring bond
  • to the same type of atom
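
The reading of the pattern given above can be checked on a plain graph without any cheminformatics toolkit. The Python below is a hedged sketch: molecules are hydrogen-suppressed bond lists of (atom, atom, order) tuples, "1-connected" is taken to mean exactly one neighbour in that graph, and the function names (build_graph, in_ring, end_ok, rotatable_bonds) are hypothetical. It applies the slide's logic by hand rather than parsing SMARTS.

```python
from collections import defaultdict


def build_graph(bonds):
    """Undirected graph: atom -> {neighbour: bond order}."""
    graph = defaultdict(dict)
    for a, b, order in bonds:
        graph[a][b] = order
        graph[b][a] = order
    return graph


def in_ring(graph, a, b):
    """A bond is a ring bond if b is still reachable from a when the bond is ignored."""
    seen, stack = {a}, [a]
    while stack:
        node = stack.pop()
        for nbr in graph[node]:
            if (node, nbr) in ((a, b), (b, a)):
                continue              # do not walk along the bond being tested
            if nbr == b:
                return True
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return False


def end_ok(graph, atom):
    """Atom test: NOT triply bonded to another atom AND NOT 1-connected."""
    has_triple = any(order == 3 for order in graph[atom].values())
    return not has_triple and len(graph[atom]) != 1


def rotatable_bonds(bonds):
    """Single, non-ring bonds whose two end atoms both pass the atom test."""
    graph = build_graph(bonds)
    return [(a, b) for a, b, order in bonds
            if order == 1
            and not in_ring(graph, a, b)
            and end_ok(graph, a)
            and end_ok(graph, b)]


butane = [(1, 2, 1), (2, 3, 1), (3, 4, 1)]
print(rotatable_bonds(butane))
```

For butane only the central bond qualifies, because both outer bonds end in a terminal atom.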

45
Berzelius revisited
  • ... chemical signs ought to be letters, for the
    greater facility of writing, ... the printed
    book...
  • J.J. Berzelius Annals of Philosophy (1813) 2,
    443-454
  • ...chemical structures and concepts should be
    represented as character strings for the greater
    facility of electronic storage, transmission,
    searching and retrieval of information
  • Mug01, Santa Fe

47
Supplementary slides
48
Machine reading of information
  • Hans Peter Luhn demonstrating a mock-up of an IBM
    card used in his scanner (1952).

Courtesy IBM.
49
Handling large amounts of data
  • IBM 101 Statistical Machine (ca. 1952).

Courtesy IBM.
50
Handling large amounts of data
  • REMINGTON RAND COUNTING SORTER 221 (ca. 1952)

Courtesy Unisys
52
Counting things
  • Count matches to patterns defined in SMARTS
  • Molecular formula
  • H-donors
  • H-acceptors
  • Rotatable bonds
  • Chiral centres
  • Rings
  • Fragments

53
Example
  • Molecular formula C13H22N4O3S
  • H-donors 2
  • H-acceptors 6
  • Rotatable bonds 8
  • Chiral centres 1
  • Rings 1
  • Fragments 6

54
Estimating Measured Properties
  • Any property which is an additive constitutive
    property of a molecule can be calculated by
  • counting the matches of the constituent patterns
  • looking up the weight for each pattern
  • summing the products of the counts and individual
    pattern weights
  • applying any correction factors
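
The recipe above amounts to a weighted sum plus corrections. A minimal Python sketch follows; the function name additive_property and all pattern names and numbers in the example are hypothetical placeholders, not values from any published scheme.

```python
def additive_property(counts, weights, corrections=()):
    """Sum count * weight over all patterns, then add the correction terms."""
    missing = set(counts) - set(weights)
    if missing:
        raise KeyError(f"no weight for pattern(s): {sorted(missing)}")
    total = sum(n * weights[pattern] for pattern, n in counts.items())
    return total + sum(corrections)


# Hypothetical pattern counts and weights, chosen only to show the arithmetic.
counts = {"CH3": 2, "CH2": 2}
weights = {"CH3": 0.5, "CH2": 0.25}
value = additive_property(counts, weights, corrections=[-0.125])
print(value)
```

Here the result is 2 x 0.5 + 2 x 0.25 - 0.125 = 1.375. In practice the counts would come from matching SMARTS patterns against the molecule.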

55
Examples of properties to calculate
  • Molecular Weight
  • logP
  • Parachor
  • Molar Volume
  • Molar Refractivity
  • ...

56
Molecular weight: a simple example
  • Molecular weight
  • Molecular formula
  • Σ(count(atom(i)) × atomic_weight(atom(i)))
  • Accuracy depends on accuracy of atomic weights
    (IUPAC)
  • C13H22N4O3S
  • 314.45 (average molecular weight )
  • 314.141235 ( accurate mass of commonest isotope)
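
The sum over atoms can be sketched in a few lines of Python. The two weight tables are assumptions of this example (abridged standard atomic weights and masses of the most abundant isotopes, rounded), and parse_formula and mass are made-up helper names. With these rounded weights the average comes out close to 314.40 rather than the 314.45 quoted above, which presumably reflects a different weight table; the monoisotopic value agrees with the slide to about four decimal places.

```python
import re

# Assumed tables: abridged standard atomic weights and most-abundant-isotope masses.
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
                "O": 15.9949146, "S": 31.9720707}


def parse_formula(formula):
    """Read a flat formula such as 'C13H22N4O3S' into {symbol: count}."""
    counts = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    return counts


def mass(formula, table):
    """Sum of count(atom) * weight(atom) over the elements in the formula."""
    return sum(n * table[symbol] for symbol, n in parse_formula(formula).items())


average_mw = mass("C13H22N4O3S", AVERAGE)
accurate_mass = mass("C13H22N4O3S", MONOISOTOPIC)
print(round(average_mw, 2), round(accurate_mass, 4))
```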

57
CLOGP: a more complicated example
  • Algorithmic definition of fragment
  • Pattern NOT an isolating carbon
  • Match the pattern to find all the fragments
  • Look up the fragment value(s) ( if it exists )
    using the unique string(s) from the match.
  • Accumulate the values for fragments and
    non-fragments (isolating carbons).
  • Correct for proximity

58
CLOGP example
  • 2 Cl 1.880
  • guanidyl 1.930
  • 2 C 0.390
  • 6 c 0.780
  • 7 H 1.589
  • Proximity 0.984
  • Total 1.727

59
Estimating values for concepts
  • Flexibility
  • Ratio of number of rotatable bonds to total
    number of bonds
  • Rigidity
  • Molecular similarity between original molecule
    and molecules formed by breaking all rotatable
    bonds
  • Difficulty of synthesis
  • Ratio of number of potential chiral centres
    weighted for rings to total number of heavy atoms
    in a molecule

60
Example
  • Flexibility 0.38
  • Rigidity 0.3819
  • Difficulty of synthesis 0.05

61
Example
  • Flexibility 0.38 (0.00)
  • Rigidity 0.3819 (1.00)
  • Difficulty of synthesis 0.05 (0.85)
  • Figures in parentheses for morphine