2. Molecular Representations - PowerPoint PPT Presentation

Slides: 31
Provided by: Sho57
Learn more at: https://ics.uci.edu

Transcript and Presenter's Notes

1
2. Molecular Representations
2
Communicating Chemical Data
  • Chemical data: text, numbers, and molecules
  • Standard valence model of chemistry
  • Discrete bonds represent shared electrons
  • Codify into a reproducible representation
  • Graph of atoms and bonds is the most commonly understood representation

3
2D Graph of Atoms / Bonds
  • Labeled graph
  • Nodes: atoms, labeled by symbol (C, N, O, H, ...)
  • Edges: bonds, labeled by order (1, 2, 3, aromatic, ...)
  • Organic compound shorthand
  • Assumed carbons
  • Implicit hydrogens
  • Standard valence rules
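
The shorthand above can be sketched in a few lines of Python. This is a minimal illustration, not a full valence model: the valence table is a simplified assumption, and the example molecule (acetamide, heavy atoms only) is chosen here for illustration.

```python
# Simplified standard valences (an assumption; real models handle charges etc.).
STANDARD_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

# Labeled graph for acetamide with hydrogens left implicit.
atoms = {1: "C", 2: "C", 3: "O", 4: "N"}     # node labels: atom symbols
bonds = {(1, 2): 1, (2, 3): 2, (2, 4): 1}    # edge labels: bond orders

def implicit_hydrogens(atoms, bonds):
    """Fill each atom's unused standard valence with hydrogens."""
    used = {idx: 0 for idx in atoms}
    for (a, b), order in bonds.items():
        used[a] += order
        used[b] += order
    return {idx: STANDARD_VALENCE[sym] - used[idx] for idx, sym in atoms.items()}

h_counts = implicit_hydrogens(atoms, bonds)
print(h_counts)  # {1: 3, 2: 0, 3: 0, 4: 2}
```

The result reads as CH3, C, O, NH2, which is why hydrogens rarely need to be drawn.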

4
Tractability
  • Small, Tree-Like, Graphs
  • Number of vertices is small (e.g. less than 50)
  • Number of edges is small (average degree 2.3 or
    so)
  • Tree-like
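
As a rough check of these properties, the sketch below computes the average degree and the number of independent rings for one molecule. The example (phenylalanine, heavy atoms only, arbitrary numbering) and the helper name `graph_stats` are illustrative assumptions.

```python
def graph_stats(n_atoms, bonds):
    """Return (average degree, independent ring count) for a connected graph."""
    n_bonds = len(bonds)
    avg_degree = 2 * n_bonds / n_atoms   # every bond contributes to two atoms
    rings = n_bonds - n_atoms + 1        # 0 for a tree; assumes one connected piece
    return avg_degree, rings

# Phenylalanine heavy-atom graph: 12 atoms, 12 bonds (numbering is arbitrary).
bonds = [(0, 1), (1, 2), (2, 3), (2, 4), (1, 5), (5, 6),
         (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 6)]

avg_degree, rings = graph_stats(12, bonds)
print(avg_degree, rings)  # 2.0 1
```

A ring count close to zero is what "tree-like" means in practice: removing one bond per ring leaves a tree.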

5
2D Data Formats
  • Bond matrix formats exist, but size is nAtoms^2
  • Connection table
  • List of nodes
  • C1, C2, O3, N4
  • List of edges
  • 1-2, 2=3, 2-4
  • SDFile, Mol2 Formats
  • Not human writeable
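
The two formats can be compared with a small sketch. It uses the four-atom example from this slide and assumes the second edge is a double bond (the bond symbol is missing in the transcript); the function name `to_bond_matrix` is hypothetical.

```python
# Connection table: a list of nodes and a list of edges.
atoms = ["C", "C", "O", "N"]                   # C1, C2, O3, N4
bonds = [(1, 2, 1), (2, 3, 2), (2, 4, 1)]      # (atom i, atom j, order), 1-based
                                               # 2=3 as a double bond is an assumption

def to_bond_matrix(n_atoms, bonds):
    """Expand a connection table into a symmetric nAtoms x nAtoms bond matrix."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bonds:
        matrix[i - 1][j - 1] = order
        matrix[j - 1][i - 1] = order
    return matrix

matrix = to_bond_matrix(len(atoms), bonds)
for row in matrix:
    print(row)
```

The connection table stores 3 edges, while the matrix stores 16 cells, most of them zero. That gap grows quadratically with molecule size.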

6
1D Line Notations
  • Should be human parseable to facilitate
    communication without computer module
  • Nomenclature
  • IUPAC system: 2-amino-3-phenylpropanoic acid
  • Common names: phenylalanine
  • SMILES: C(C(=O)O)(Cc1ccccc1)N
  • Widely used, non-standardized
  • InChI
  • Recent, IUPAC supported official standard
  • Ex.: InChI=1/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)

http://www.iupac.org/inchi/
7
IUPAC Nomenclature
  • IUPAC Standard Naming Conventions
  • propane
  • propanoic acid
  • 3-hydroxy-propanoic acid
  • 2-amino-3-hydroxy-propanoic acid
  • Unwieldy standard and inconsistent adoption
  • Common names and abbreviations (Serine)
  • Systematic bidirectional translation unreliable

8
SMILES Basics
  • Connection tables as a character string
  • Atoms: atomic symbols (C, N, O, S, ...)
  • Bonds: single - (implicit), double =, triple #
  • Examples
  • CBr
  • CO
  • CN
  • OCCN

http://www.daylight.com/smiles/
9
SMILES Basics
  • Branching: parentheses
  • Cycles: numerical annotations
  • CCC(O)C
  • CC(N)(N)O
  • C1CCCC1
  • N12CCCCC1CCCC2
  • NCC(CN)N1CCCC1
  • Extensions for inorganic atoms, unusual valence, formal charges,
    stereochemistry, aromaticity, reactions, etc.
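
To make the branching and ring-closure rules concrete, here is a minimal parser sketch that turns a SMILES string back into a connection table. It covers only a small subset (unbracketed atoms, the bonds - = #, parentheses, single-digit ring closures) and none of the extensions listed above. The function name `parse_smiles` and the simplification of recording implicit bonds as order 1, even between aromatic atoms, are assumptions made for illustration.

```python
import re

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[cnos]|[-=#]|[()]|\d")
ORDER = {"-": 1, "=": 2, "#": 3}

def parse_smiles(smiles):
    """Parse a small SMILES subset into (atoms, bonds) with 0-based indices."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unsupported SMILES syntax: " + smiles)
    atoms, bonds = [], []
    prev = None        # atom that the next atom attaches to
    stack = []         # branch points opened by "("
    open_rings = {}    # ring digit -> (atom index, explicit bond order or None)
    pending = None     # explicit bond order waiting for its second atom
    for tok in tokens:
        if tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()
        elif tok in ORDER:
            pending = ORDER[tok]
        elif tok.isdigit():
            if tok in open_rings:              # second occurrence closes the ring
                start, order = open_rings.pop(tok)
                bonds.append((start, prev, pending or order or 1))
            else:                              # first occurrence opens it
                open_rings[tok] = (prev, pending)
            pending = None
        else:                                  # an atom symbol
            atoms.append(tok)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, pending or 1))
            pending = None
            prev = idx
    return atoms, bonds

print(parse_smiles("C1CCCC1"))
# (['C', 'C', 'C', 'C', 'C'], [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (0, 4, 1)])
```

Parentheses only save and restore the attachment point, and a ring digit simply remembers an atom until the same digit appears again.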

10
Canonical Representations
  • Unique representation needed for rapid DB lookup
    and to check uniqueness
  • Need to uniquely order the atoms of a molecule
  • nAtoms! atom orderings possible
  • Morgan Algorithm
  • Label nodes by connectivity (heavy degree)
  • Relax iteratively towards extended connectivity
    (EC) using neighbor values
  • Use EC magnitude to decide on atom order
  • EC tie-breaking by atom, bond distinctions
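
The iterative relaxation can be sketched as follows. This is a simplified version of the classic scheme, assuming the usual stopping rule (stop when the number of distinct EC values no longer increases); the example graph (2-methylpentane, heavy atoms only) and the name `morgan_ec` are illustrative, and tie-breaking by atom and bond type is left out.

```python
def morgan_ec(adjacency):
    """Extended connectivity values for a graph given as atom -> neighbor list."""
    ec = {a: len(nbrs) for a, nbrs in adjacency.items()}   # start from heavy degree
    n_classes = len(set(ec.values()))
    while True:
        # Relax: each atom's new value is the sum of its neighbors' values.
        new_ec = {a: sum(ec[n] for n in nbrs) for a, nbrs in adjacency.items()}
        new_classes = len(set(new_ec.values()))
        if new_classes <= n_classes:     # no further discrimination gained
            return ec
        ec, n_classes = new_ec, new_classes

# 2-methylpentane: chain 0-1-2-3-4 with a methyl branch (atom 5) on atom 1.
adjacency = {0: [1], 1: [0, 2, 5], 2: [1, 3], 3: [2, 4], 4: [3], 5: [1]}

ec = morgan_ec(adjacency)
print(ec)  # {0: 3, 1: 4, 2: 5, 3: 3, 4: 2, 5: 3}
```

Atoms are then ordered by decreasing EC value. Atoms 0, 3, and 5 still tie here, which is where the atom and bond distinctions come in.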

11
Stereochemistry / Isomers
  • Chemical handedness
  • Same connectivity, but not superimposable
  • Atoms with at least 4 distinct components
  • Double bonds with distinct components at ends
  • Specification by atom / bond labels

e.g. C[C@H](N)O vs. C[C@@H](N)O
e.g. O/C=C/N vs. O/C=C\N
12
3D Atomic Coordinates
  • 2D graph only specifies connections
  • 3D spatial coordinates (center, radius, surface)
  • Largely unavailable
  • Usually predicted

13
4D Conformers
  • Molecules are relatively rigid w.r.t.
  • Bond length
  • Bond angles
  • Single bonds are very flexible w.r.t. rotation
  • More information with a collection of multiple, static 3D conformations

14
Molecular Surfaces
  • For intermolecular interactions, externally
    visible surface is most important
  • Representations: orbitals, VDW radii,
    accessibility, tessellations

http://www.netsci.org/Science/Compchem/feature14e.html
15
Valence Model Limitations
  • Discrete bonds do not physically exist
  • They are a model of shared electron orbitals
  • Difficulty modeling
  • Aromaticity
  • Resonance
  • Tautomers
  • Etc.

16
Structural Keys
  • Motivated in part by rapid screening for
    functional group substructures
  • Pre-compute presence of common / important
    substructures up front and record in bit-vector
  • Example of structural keys
  • Presence of atoms (C, N, O, S, Cl, Br, etc.)
  • Ring systems
  • Functional Groups
  • Aromatic, Phenol, Alcohol (ROH),
  • Amine (RNH2), Acid (RC(=O)OH), Ester, ...
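
A structural-key bit-vector can be sketched as a fixed list of yes/no tests. The key list below is a tiny, hypothetical one chosen for illustration (real key sets are far larger and use proper substructure matching), and the molecule is phenylalanine as a hand-written connection table with lowercase symbols marking aromatic atoms.

```python
# Each key is (name, test); the bit order is fixed by this list.
STRUCTURAL_KEYS = [
    ("nitrogen", lambda atoms, bonds: any(a in ("N", "n") for a in atoms)),
    ("oxygen", lambda atoms, bonds: any(a in ("O", "o") for a in atoms)),
    ("sulfur", lambda atoms, bonds: any(a in ("S", "s") for a in atoms)),
    ("halogen (Cl, Br)", lambda atoms, bonds: any(a in ("Cl", "Br") for a in atoms)),
    # Ring test assumes a single connected molecule.
    ("ring", lambda atoms, bonds: len(bonds) - len(atoms) + 1 > 0),
    ("aromatic atom", lambda atoms, bonds: any(a.islower() for a in atoms)),
    ("carbonyl C=O", lambda atoms, bonds: any(
        order == 2 and {atoms[i], atoms[j]} == {"C", "O"} for i, j, order in bonds)),
]

def structural_key_bits(atoms, bonds):
    """Pre-compute one bit per key: 1 if the feature is present, else 0."""
    return [1 if test(atoms, bonds) else 0 for _, test in STRUCTURAL_KEYS]

# Phenylalanine, heavy atoms only; bonds are (i, j, order) with 0-based indices.
atoms = ["C", "C", "O", "O", "C", "c", "c", "c", "c", "c", "c", "N"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (0, 4, 1), (4, 5, 1), (5, 6, 1),
         (6, 7, 1), (7, 8, 1), (8, 9, 1), (9, 10, 1), (5, 10, 1), (0, 11, 1)]

bits = structural_key_bits(atoms, bonds)
print(bits)  # [1, 1, 0, 0, 1, 1, 1]
```

A substructure query can be rejected quickly when it sets a bit that a candidate molecule lacks, before any expensive graph matching is attempted.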

17
SMILES Examples
21
Generalized Fingerprints
  • Structural Keys
  • Generalizes only in proportion to knowledge
  • Sparsely populated
  • A good screening filter will have thousands of
    keys, but each molecule generally sets only a few
    dozen
  • Generalized Fingerprints (Spectral
    Representations)
  • No pre-defined patterns
  • Record counts or presence/absence of
    substructures (e.g. labeled paths, trees, etc)
  • Fixed length (binary) vectors
  • Fast algorithms
  • Abstract; hard to trace back the meaning of individual
    bits

22
Systematic Graph Features
  • For chemical compounds
  • atom/node labels
  • A = {C, N, O, H, ...}
  • bond/edge labels
  • B = {s, d, t, ar, ...}
  • Trace Paths
  • Depth First Search

(CsNsCdO)
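
Path tracing by depth-first search can be sketched as below. The labels follow the slide's notation (s, d for single and double bonds). The example molecule O=C-C-N, the function name `enumerate_paths`, and the choice to keep one direction-independent string per path are assumptions for illustration.

```python
def enumerate_paths(atoms, bonds, max_bonds):
    """Collect labeled simple paths of up to max_bonds bonds via DFS."""
    adj = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        adj[i].append((j, label))
        adj[j].append((i, label))
    found = set()

    def dfs(path_atoms, tokens):
        # A path and its reverse are the same feature; keep the smaller string.
        found.add(min("".join(tokens), "".join(reversed(tokens))))
        if (len(tokens) - 1) // 2 == max_bonds:
            return
        for nbr, label in adj[path_atoms[-1]]:
            if nbr not in path_atoms:          # simple paths only
                dfs(path_atoms + [nbr], tokens + [label, atoms[nbr]])

    for start in range(len(atoms)):
        dfs([start], [atoms[start]])
    return found

atoms = ["O", "C", "C", "N"]                   # O=C-C-N
bonds = [(0, 1, "d"), (1, 2, "s"), (2, 3, "s")]

paths = enumerate_paths(atoms, bonds, max_bonds=3)
print(sorted(paths, key=lambda p: (len(p), p)))
# ['C', 'N', 'O', 'CdO', 'CsC', 'CsN', 'CsCdO', 'CsCsN', 'NsCsCdO']
```

Because molecular graphs are small and tree-like, the number of such paths stays manageable for modest path lengths.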
23
Fingerprint Flowchart
  • 0 Bonds
  • O
  • C
  • N
  • 1 Bond
  • O=C
  • C-C
  • C-N
  • 2 Bonds
  • O=C-C
  • C-C-N
  • 3 Bonds
  • O=C-C-N

[Figure: flowchart with stages Graph Feature Extractor, Random Number Generator, Hash Function, Modulo FP Size]
Fingerprint before: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Fingerprint after:  0 0 1 0 1 0 0 0 0 1 1 1 0 1 0 1 0 0
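
One possible reading of the flowchart is sketched below: each extracted feature is hashed, the hash seeds a random number generator, and the generated number modulo the fingerprint size picks the bit to set. The ordering of the stages, the use of MD5 as the hash, the `bits_per_feature` parameter, and the "=" in the feature strings are assumptions; the transcript does not specify them.

```python
import hashlib
import random

def fingerprint(features, fp_size=18, bits_per_feature=1):
    """Fold a set of feature strings into a fixed-length binary vector."""
    bits = [0] * fp_size                       # start from the all-zero vector
    for feature in features:
        # Hash function: a stable digest (Python's built-in hash() varies per run).
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "big")
        rng = random.Random(seed)              # random number generator, seeded by hash
        for _ in range(bits_per_feature):
            bits[rng.getrandbits(32) % fp_size] = 1   # modulo FP size
    return bits

# Paths of 0 to 3 bonds for O=C-C-N, written as plain strings.
features = ["O", "C", "N", "O=C", "C-C", "C-N", "O=C-C", "C-C-N", "O=C-C-N"]

fp = fingerprint(features)
print(fp)
```

Different features may collide on the same bit, so a set bit cannot be traced back to a single substructure. Any molecule containing a subset of these features sets a subset of these bits, which is what makes the vector usable as a screen.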
24
Other Fingerprint Representations
  • Derived Representations
  • Information Compression
  • Example: Locality-Sensitive Hashing (LSH)
  • Choose K random lines in high-dimensional space
  • Project data points
  • Bin coordinates
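
The three steps above can be sketched as follows. The Gaussian choice of random directions, the fixed bin width, and the names `make_lsh` and `signature` are assumptions for illustration, not details given in the slides.

```python
import math
import random

def make_lsh(dim, k, bin_width, seed=0):
    """Build a signature function from K random lines in dim-dimensional space."""
    rng = random.Random(seed)
    # Step 1: choose K random directions (Gaussian components are an assumption).
    lines = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

    def signature(point):
        # Step 2: project the point onto each line.
        projections = (sum(w * x for w, x in zip(line, point)) for line in lines)
        # Step 3: bin each projected coordinate.
        return tuple(math.floor(p / bin_width) for p in projections)

    return signature

signature = make_lsh(dim=18, k=4, bin_width=1.0)

fp_a = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]
fp_b = list(fp_a)
fp_b[0] = 1                                    # differs from fp_a in one bit

sig_a = signature(fp_a)
sig_b = signature(fp_b)
print(sig_a, sig_b)
```

Nearby points tend to land in the same bins, so the short signature can stand in for the full fingerprint when grouping similar molecules.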

25
Summary
  • Rich set of representations
  • 1D: SMILES, fingerprints
  • 2D: graph of bonds
  • 2.5D: surfaces
  • 3D: coordinates
  • 3.5D: conformers
  • 4D: isomers, temporal evolution, etc.

26
Chemical Informatics
  • Informatics must be able to deal with
    variable-size structured data or convert data to
    standard vectorial format
  • Graphical Models
  • (Recursive) Neural Networks
  • ILP
  • GA
  • SGs
  • Kernels
