Title: 2. Molecular Representations
1 2. Molecular Representations
2 Communicating Chemical Data
- Chemical data: text, numbers, and molecules
- Standard valence model of chemistry
- Discrete bonds represent shared electrons
- Codify into a reproducible representation
- A graph of atoms and bonds is the most commonly understood representation
3 2D Graph of Atoms / Bonds
- Labeled graph
- Nodes: atoms, labeled by symbol (C, N, O, H, ...)
- Edges: bonds, labeled by order (1, 2, 3, aromatic, ...)
- Organic compound shorthand
- Assumed carbons
- Implicit hydrogens
- Standard valence rules
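The labeled-graph idea and the implicit-hydrogen shorthand can be sketched in a few lines of Python. This is only an illustration: the valence table, the example molecule (acetamide) and the function name `implicit_hydrogens` are assumptions made for this sketch, and aromatic bonds and formal charges are ignored.

```python
# Standard valences for a few common organic elements (assumed table).
STANDARD_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

# Heavy-atom graph of acetamide: nodes carry symbols, edges carry bond orders.
atoms = {1: "C", 2: "C", 3: "O", 4: "N"}
bonds = {(1, 2): 1, (2, 3): 2, (2, 4): 1}


def implicit_hydrogens(atoms, bonds):
    """Return the hydrogens each atom needs to reach its standard valence."""
    used = {idx: 0 for idx in atoms}
    for (i, j), order in bonds.items():
        used[i] += order
        used[j] += order
    return {idx: STANDARD_VALENCE[sym] - used[idx] for idx, sym in atoms.items()}


h_counts = implicit_hydrogens(atoms, bonds)
print(h_counts)  # {1: 3, 2: 0, 3: 0, 4: 2}
```

The methyl carbon picks up three hydrogens and the nitrogen two, which is why the shorthand can leave them unwritten.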
4 Tractability
- Small, tree-like graphs
- Number of vertices is small (e.g. less than 50)
- Number of edges is small (average degree 2.3 or so)
- Tree-like
5 2D Data Formats
- Bond matrix formats exist, but their size is nAtoms^2
- Connection table
- List of nodes
- C1, C2, O3, N4
- List of edges
- 1-2, 2=3, 2-4
- SDFile, Mol2 formats
- Not human writeable
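To make the size difference concrete, the sketch below stores the slide's four-atom example as a connection table and expands it into a bond matrix. It assumes the middle edge (between atoms 2 and 3) is a double bond. The helper name `to_bond_matrix` is hypothetical.

```python
# Connection table: a list of nodes and a list of edges (1-based atom numbers).
nodes = ["C", "C", "O", "N"]                  # C1, C2, O3, N4
edges = [(1, 2, 1), (2, 3, 2), (2, 4, 1)]     # (atom, atom, bond order); 2-3 assumed double


def to_bond_matrix(n_atoms, edges):
    """Expand an edge list into a symmetric nAtoms x nAtoms bond-order matrix."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in edges:
        matrix[i - 1][j - 1] = order
        matrix[j - 1][i - 1] = order
    return matrix


bond_matrix = to_bond_matrix(len(nodes), edges)
matrix_cells = len(nodes) ** 2            # grows quadratically with atom count
table_entries = len(nodes) + len(edges)   # grows roughly linearly for tree-like graphs
print(bond_matrix)
print(matrix_cells, table_entries)
```

For sparse, tree-like molecular graphs the connection table stays close to linear in the number of atoms, while the matrix is mostly zeros.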
6 1D Line Notations
- Should be human parseable to facilitate communication without a computer
- Nomenclature
- IUPAC system: 2-amino-3-phenyl-propanoic acid
- Common names: phenylalanine
- SMILES: C(C(=O)O)(Cc1ccccc1)N
- Widely used, non-standardized
- InChI
- Recent, IUPAC-supported official standard
- Ex.: 1/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)
http://www.iupac.org/inchi/
7 IUPAC Nomenclature
- IUPAC Standard Naming Conventions
- propane
- propanoic acid
- 3-hydroxy-propanoic acid
- 2-amino-3-hydroxy-propanoic acid
- Unwieldy standard and inconsistent adoption
- Common names and abbreviations (Serine)
- Systematic bidirectional translation unreliable
8 SMILES Basics
- Connection tables as a character string
- Atoms: atomic symbols C, N, O, S, ...
- Bonds: single - (implicit), double =, triple #
- Examples
- CBr
- CO
- CN
- OCCN
http://www.daylight.com/smiles/
9 SMILES Basics
- Branching: parentheses
- Cycles: numerical annotations
- CCC(O)C
- CC(N)(N)O
- C1CCCC1
- N12CCCCC1CCCC2
- NCC(CN)N1CCCC1
- Extensions for
- Inorganic atoms, unusual valence, formal charges, stereochemistry, aromaticity, reactions, etc.
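The rules above (atoms, bond symbols, parentheses for branches, digits for ring closures) are enough to turn a simple SMILES string back into a connection table. The following is a minimal sketch for that small subset only, not a full SMILES parser. The function name `parse_smiles` is hypothetical, bracket atoms, charges and stereo marks are not handled, and aromatic (lowercase) atoms are simply joined by bonds of order 1.

```python
def parse_smiles(smiles):
    """Parse a small SMILES subset into (atoms, bonds).

    Supported: one-letter atoms plus Br and Cl, bond symbols - = #,
    parentheses for branches, single-digit ring closures.
    """
    orders = {"-": 1, "=": 2, "#": 3}
    atoms = []        # atom symbols; list index is the atom id
    bonds = []        # (atom id, atom id, bond order)
    stack = []        # branch points to return to on ")"
    ring_open = {}    # ring digit -> (atom id, explicit order or None)
    prev = None       # atom the next atom will bond to
    pending = None    # explicit bond order waiting for its second atom
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in orders:
            pending = orders[ch]
        elif ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch.isdigit():
            if ch in ring_open:                       # close the ring
                start, order = ring_open.pop(ch)
                bonds.append((start, prev, pending or order or 1))
            else:                                     # open a ring
                ring_open[ch] = (prev, pending)
            pending = None
        elif ch.isalpha():
            sym = smiles[i:i + 2] if smiles[i:i + 2] in ("Br", "Cl") else ch
            i += len(sym) - 1
            atoms.append(sym)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending or 1))
            prev = len(atoms) - 1
            pending = None
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
        i += 1
    return atoms, bonds


for example in ["CCC(O)C", "C1CCCC1", "N12CCCCC1CCCC2"]:
    atoms, bonds = parse_smiles(example)
    print(example, len(atoms), "atoms,", len(bonds), "bonds")
```

A chain or branched molecule has one bond fewer than atoms; each ring closure adds one more bond, which is how the digits encode cycles without repeating atoms.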
10 Canonical Representations
- Unique representation needed for rapid DB lookup and to check uniqueness
- Need to uniquely order the atoms of a molecule
- nAtoms! atom orderings possible
- Morgan Algorithm
- Label nodes by connectivity (heavy degree)
- Relax iteratively towards extended connectivity (EC) using neighbor values
- Use EC magnitude to decide on atom order
- EC tie-breaking by atom, bond distinctions
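The extended-connectivity step of the Morgan algorithm can be sketched as follows. This is a simplified illustration under stated assumptions. Each atom starts with its heavy-atom degree, values are replaced by the sum of neighbor values, and iteration stops once the number of distinct values no longer increases. Tie-breaking by atom and bond type, and the final ordering, are left out. The function name `morgan_ec` is hypothetical.

```python
def morgan_ec(n_atoms, bonds):
    """Return extended-connectivity (EC) values for atoms 0..n_atoms-1."""
    neighbors = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    ec = [len(nb) for nb in neighbors]       # initial label: heavy degree
    n_classes = len(set(ec))
    while True:
        # Relax: each atom takes the sum of its neighbors' current values.
        new_ec = [sum(ec[j] for j in nb) for nb in neighbors]
        new_classes = len(set(new_ec))
        if new_classes <= n_classes:         # no further discrimination gained
            return ec
        ec, n_classes = new_ec, new_classes


# Pentane skeleton: a chain of five carbons.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(morgan_ec(5, chain))  # [2, 3, 4, 3, 2]
```

Symmetry-equivalent atoms (the two chain ends, for instance) keep equal EC values, which is why a separate tie-breaking rule is still needed to fix a single atom order.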
11 Stereochemistry / Isomers
- Chemical handedness
- Same connectivity, but not superimposable
- Atoms with at least 4 distinct components
- Double bonds with distinct components at ends
- Specification by atom / bond labels
e.g. C[C@H](N)O vs. C[C@@H](N)O
e.g. O/C=C/N vs. O/C=C\N
12 3D Atomic Coordinates
- 2D graph only specifies connections
- 3D spatial coordinates (center, radius, surface)
- Largely unavailable
- Usually predicted
13 4D Conformers
- Molecules are relatively rigid w.r.t.
- Bond length
- Bond angles
- Single bonds are very flexible w.r.t. rotation
- More information with a collection of multiple, static 3D conformations
14 Molecular Surfaces
- For intermolecular interactions, the externally visible surface is most important
- Representations: orbitals, VDW radii, accessibility, tessellations
http://www.netsci.org/Science/Compchem/feature14e.html
15 Valence Model Limitations
- Bonds are non-existent
- Model of shared electron orbitals
- Difficulty modeling
- Aromaticity
- Resonance
- Tautomers
- Etc.
16 Structural Keys
- Motivated in part by rapid screening for functional group substructures
- Pre-compute presence of common / important substructures up front and record in bit-vector
- Example of structural keys
- Presence of atoms (C, N, O, S, Cl, Br, etc.)
- Ring systems
- Functional groups
- Aromatic, phenol, alcohol (ROH), ...
- Amine (RNH2), acid (RC(O)OH), ester, ...
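A structural-key bit-vector can be sketched with just the simplest keys from the list above: element presence and a ring flag. The key list, its order, and the function name `structural_keys` are assumptions for this sketch. The ring test assumes a single connected molecule, and functional-group keys would need real substructure matching, which is not attempted here.

```python
# Pre-defined keys: one bit per element, plus one bit for "contains a ring".
KEY_ELEMENTS = ["C", "N", "O", "S", "Cl", "Br"]


def structural_keys(atoms, bonds):
    """Return a bit-vector [C, N, O, S, Cl, Br, has_ring] for one molecule."""
    present = set(atoms)
    bits = [1 if element in present else 0 for element in KEY_ELEMENTS]
    # A connected graph with at least as many bonds as atoms contains a cycle.
    bits.append(1 if len(bonds) >= len(atoms) else 0)
    return bits


# Cyclopentane (C1CCCC1) and 2-aminoethanol (OCCN) as atom / bond lists.
cyclopentane = (["C"] * 5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
aminoethanol = (["O", "C", "C", "N"], [(0, 1), (1, 2), (2, 3)])

print(structural_keys(*cyclopentane))  # [1, 0, 0, 0, 0, 0, 1]
print(structural_keys(*aminoethanol))  # [1, 1, 1, 0, 0, 0, 0]
```

Because the keys are computed once up front, a substructure query can reject any molecule whose bit-vector lacks a bit the query requires before running an expensive graph match.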
17 SMILES Examples
18 SMILES Examples
19 SMILES Examples
21 Generalized Fingerprints
- Structural Keys
- Generalizes only in proportion to knowledge
- Sparsely populated
- Good screening filter will have thousands of keys, but each item generally only has a few dozen
- Generalized Fingerprints (Spectral Representations)
- No pre-defined patterns
- Record counts or presence/absence of substructures (e.g. labeled paths, trees, etc.)
- Fixed length (binary) vectors
- Fast algorithms
- Abstract, hard to trace back meaning of individual bits
22 Systematic Graph Features
- For chemical compounds
- atom/node labels
- A = {C, N, O, H, ...}
- bond/edge labels
- B = {s, d, t, ar, ...}
- Trace paths
- Depth-first search
(CsNsCdO)
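Tracing labeled paths by depth-first search can be sketched as below. The example molecule is a four-atom chain chosen so that the path CsNsCdO appears. The function name `labeled_paths` and the `max_bonds` parameter are hypothetical, and each path is emitted in both reading directions rather than being canonicalized.

```python
def labeled_paths(atoms, bonds, max_bonds):
    """Enumerate simple paths of up to max_bonds bonds as label strings."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))

    paths = set()

    def dfs(current, visited, text, depth):
        paths.add(text)
        if depth == max_bonds:
            return
        for nxt, label in neighbors[current]:
            if nxt not in visited:               # simple paths only
                dfs(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        dfs(start, {start}, atoms[start], 0)
    return paths


# C-N-C=O with bond labels s (single) and d (double).
atoms = ["C", "N", "C", "O"]
bonds = [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")]
paths = labeled_paths(atoms, bonds, max_bonds=3)
print(sorted(paths, key=len))
```

Limiting the depth keeps the number of paths manageable; for the small, tree-like graphs typical of molecules the enumeration stays cheap.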
23 Fingerprint Flowchart
- 0 Bonds
- O
- C
- N
- 1 Bond
- O=C
- CC
- CN
- 2 Bonds
- O=C-C
- C-C-N
- 3 Bonds
- O=C-C-N
Graph Feature Extractor
Random Number Generator
Hash Function
Modulo FP Size
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 1 0 0 0 0 1 1 1 0 1 0 1 0 0
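The flowchart (extract features, hash them, take the result modulo the fingerprint size, set bits) can be sketched as follows. This is a simplified illustration. It uses `zlib.crc32` as a stand-in hash and sets one bit per feature, and it omits the random-number-generator stage shown in the flowchart. The feature strings are written with explicit bond symbols, which is an assumption. The function name `hashed_fingerprint` is hypothetical.

```python
import zlib


def hashed_fingerprint(features, fp_size):
    """Fold a collection of feature strings into a fixed-length bit list."""
    bits = [0] * fp_size                      # start from an all-zero fingerprint
    for feature in features:
        # Hash the feature, then reduce it to a bit position (modulo FP size).
        index = zlib.crc32(feature.encode("utf-8")) % fp_size
        bits[index] = 1
    return bits


# Paths of 0 to 3 bonds, as listed on the slide.
features = ["O", "C", "N", "O=C", "C-C", "C-N", "O=C-C", "C-C-N", "O=C-C-N"]
fingerprint = hashed_fingerprint(features, fp_size=18)
print(" ".join(str(bit) for bit in fingerprint))
```

Different features can land on the same bit, so a set bit cannot be traced back to a single substructure. Real fingerprints use far more than 18 bits to keep such collisions rare.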
24 Other Fingerprint Representations
- Derived representations
- Information compression
- Example: Locality-Sensitive Hashing (LSH)
- Choose K random lines in high-dimensional space
- Project data points
- Bin coordinates
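The three LSH steps can be sketched with random projections. This is a minimal illustration under stated assumptions. The random lines are drawn with Gaussian components, bins have a fixed width, and the names `make_lsh`, `bin_width` and `seed` are hypothetical.

```python
import math
import random


def make_lsh(dim, k, bin_width, seed=0):
    """Build an LSH function from k random lines in dim-dimensional space."""
    rng = random.Random(seed)
    # Step 1: choose k random lines (direction vectors).
    lines = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

    def lsh_key(point):
        key = []
        for line in lines:
            # Step 2: project the data point onto the line.
            projection = sum(p * w for p, w in zip(point, line))
            # Step 3: bin the projected coordinate.
            key.append(math.floor(projection / bin_width))
        return tuple(key)

    return lsh_key


lsh_key = make_lsh(dim=8, k=4, bin_width=1.0)
fp_a = [0, 1, 1, 0, 1, 0, 0, 1]
print(lsh_key(fp_a))
```

Points that are close in the original space tend to fall into the same bins along most lines, so the short key acts as a compressed, derived representation of the longer fingerprint.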
25 Summary
- Rich set of representations
- 1D: SMILES, fingerprints
- 2D: graph of bonds
- 2.5D: surfaces
- 3D: coordinates
- 3.5D: conformers
- 4D: isomers, temporal evolution, etc.
26 Chemical Informatics
- Informatics must be able to deal with variable-size structured data or convert data to standard vectorial format
- Graphical Models
- (Recursive) Neural Networks
- ILP
- GA
- SGs
- Kernels
28 SMILES Examples
29 SMILES Examples
30 SMILES Examples