Machine Learning for High-Throughput Biological Data

1
Machine Learning for High-Throughput Biological
Data
These notes were originally from the KDD 2006 tutorial
notes written by David Page, Dept. of
Biostatistics and Medical Informatics and Dept. of
Computer Sciences, University of
Wisconsin-Madison.
http://www.biostat.wisc.edu/page/PageKDD2006.ppt
2
Some Data Types We'll Discuss
  • Gene expression microarray
  • Single-nucleotide polymorphisms
  • Mass spectrometry proteomics and
    metabolomics
  • Protein-protein interactions (from
    co-immunoprecipitation)
  • High-throughput screening of potential drug
    molecules

3
Image from the DOE Human Genome
Program, http://www.ornl.gov/hgmis
4
How Microarrays Work
Probes (DNA)
Labeled Sample (RNA)
Hybridization
Gene Chip Surface
5
Two Views of Microarray Data
  • Data points are genes
  • Represented by expression levels across different
    samples (i.e., features = samples)
  • Goal: categorize new genes
  • Data points are samples (e.g., patients)
  • Represented by expression levels of different
    genes (i.e., features = genes)
  • Goal: categorize new samples

6
Two Ways to View The Data
7
Data Points are Genes
8
Data Points are Samples
9
Supervision: Add Class Values
10
Supervised Learning Task
  • Given: a set of microarray experiments, each done
    with mRNA from a different patient (same cell
    type from every patient). Each patient's expression
    values for each gene constitute the features, and
    the patient's disease constitutes the class
  • Do: learn a model that accurately predicts
    class based on features

11
Location in Task Space
12
Leukemia (Golub et al., 1999)
  • Classes: Acute Lymphoblastic Leukemia
    (ALL) and Acute Myeloid Leukemia (AML)
  • Approach: weighted voting (essentially naïve
    Bayes)
  • Cross-validated accuracy: of 34 samples,
    declined to predict 5, correct on other 29

13
Cancer vs. Normal
  • Relatively easy to predict accurately, because so
    much goes haywire in cancer cells
  • Primary barrier is noise in the data: impure RNA,
    cross-hybridization, etc.
  • Studies include breast, colon, prostate,
    lymphoma, and multiple myeloma

14
X-Val Accuracies for Multiple Myeloma (74 MM vs.
31 Normal)
15
More MM (300), Benign Condition MGUS (Hardin et
al., 2004)
16
ROC Curves: Cancer vs. Normal
17
ROC: Cancer vs. Benign (MGUS)
18
Work by Statisticians Outside of Standard
Classification/Clustering
  • Methods to better convert Affymetrix's low-level
    intensity measurements into expression levels,
    e.g., work by Speed, Wong, Irizarry
  • Methods to find differentially expressed genes
    between two samples, e.g. work by Newton and
    Kendziorski
  • But the following is most related

19
Ranking Genes by Significance
  • Some biologists don't want one predictive model,
    but a rank-ordered list of genes to explore
    further (with estimated significance)
  • For each gene we have a set of expression levels
    under our conditions, say cancer vs. normal
  • We can do a t-test to see if the mean expression
    levels are different under the two conditions,
    which yields a p-value
  • Multiple comparisons problem: if we repeat this
    test for 30,000 genes, some will pop up as
    significant just by chance alone
  • Could do a Bonferroni correction (multiply
    p-values by 30,000), but this is drastic and
    might eliminate all genes
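
The arithmetic behind the last two bullets can be sketched in Python. This is a minimal illustration, assuming that genes with no real difference produce p-values uniform on [0, 1]; the function names and the 0.05 threshold are illustrative choices, not part of the tutorial.

```python
import random

def bonferroni(p_values):
    """Bonferroni correction: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def count_significant(p_values, alpha=0.05):
    """Count p-values that fall below the significance threshold alpha."""
    return sum(1 for p in p_values if p < alpha)

# Simulate 30,000 genes with NO true difference: null p-values are uniform.
rng = random.Random(0)
null_p = [rng.random() for _ in range(30000)]

raw_hits = count_significant(null_p)                    # roughly 5% of 30,000
corrected_hits = count_significant(bonferroni(null_p))  # almost always none
print(raw_hits, corrected_hits)
```

About 5% of the null genes look significant before correction, while after correction a gene needs a raw p-value below roughly 0.05 / 30,000 to survive, which is why the correction can wipe out real signal as well.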

20
False Discovery Rate (FDR) (Storey and
Tibshirani, 2001)
  • Addresses multiple comparisons but is less
    extreme than Bonferroni
  • Replaces p-value by q-value: fraction of genes
    with this p-value or lower that really don't have
    different means in the two classes (false
    discoveries)
  • Publicly available in R as part of the Bioconductor
    package
  • Recommendation: use this in addition to your
    supervised data mining; your collaborators will
    want to see it
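
As a rough illustration of FDR-style adjustment, the sketch below computes Benjamini-Hochberg adjusted p-values. This is a swapped-in, simpler procedure rather than the Storey and Tibshirani q-value itself: the q-value additionally estimates the proportion of genes with no true difference, and Benjamini-Hochberg corresponds to assuming that proportion is 1. The function name is illustrative.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p = [0.9, 0.001, 0.5]
print(bh_adjust(p))
```

Unlike Bonferroni, only the smallest p-value is multiplied by the full number of tests; larger p-values are scaled by m divided by their rank, so the adjustment is less drastic.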

21
FDR Highlights Difficulties Getting Insight into
Cancer vs. Normal
22
Using Benign Condition Instead of Normal Helps
Somewhat
23
Question to Anticipate
  • You've run a supervised data mining algorithm on
    your collaborator's data, and you present an
    estimate of accuracy or an ROC curve (from X-val)
  • How did you adjust this for the multiple
    comparisons problem?
  • Answer: you don't need to, because you commit to a
    single predictive model before ever looking at
    the test data for a fold; this is only one
    comparison

24
Prognosis and Treatment
  • Features: same as for diagnosis
  • Rather than disease state, class value becomes
    life expectancy with a given treatment (or
    positive response vs. no response to given
    treatment)

25
Breast Cancer Prognosis (van 't Veer et al., 2002)
  • Classes: good prognosis (no metastasis within
    five years of initial diagnosis) vs. poor
    prognosis
  • Algorithm: ensemble of voters
  • Results: 83% cross-validated accuracy on 78
    cases

26
A Lesson
  • Previous work selected features to use in
    ensemble by looking at the entire data set
  • Should have repeated feature selection on each
    cross-val fold
  • Authors also chose ensemble size by seeing which
    size gave highest cross-val result
  • Authors corrected this in web supplement; accuracy
    went from 83% to 73%
  • Remember to tune parameters separately for each
    cross-val fold!
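
A minimal sketch of the lesson: feature selection is repeated inside every fold, using training data only. The scoring rule (difference of class means), the nearest-centroid classifier, and the toy data are assumptions made for illustration; they are not the ensemble-of-voters method used in the study.

```python
from statistics import mean

def select_features(X, y, k):
    """Return indices of the k features whose class means differ most."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(mean(pos) - mean(neg)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def predict(X_train, y_train, x, feats):
    """Nearest-centroid prediction using only the selected features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [mean(r[j] for r in rows) for j in feats]
        dist = sum((x[j] - c) ** 2 for j, c in zip(feats, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cross_val_accuracy(X, y, k_features, n_folds):
    """Cross-validation that repeats feature selection inside every fold."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        feats = select_features(X_tr, y_tr, k_features)  # training data only
        for i in test_idx:
            correct += predict(X_tr, y_tr, X[i], feats) == y[i]
    return correct / len(X)

# Toy data: feature 0 tracks the class, feature 1 is unrelated.
y = [0, 1, 0, 1, 0, 1, 0, 1]
X = [[10 * label, i % 3] for i, label in enumerate(y)]
print(cross_val_accuracy(X, y, k_features=1, n_folds=4))
```

The same pattern applies to any tuned parameter, such as ensemble size: choose it inside the fold from the training portion, never from the held-out cases.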

27
Prognosis with Specific Therapy (Rosenwald et
al., 2002)
  • Data set contains gene-expression patterns for
    160 patients with diffuse large B-cell lymphoma,
    receiving anthracycline chemotherapy
  • Class label is five-year survival
  • One test-train split: 80/80
  • True positive rate 60%, false negative rate
    39%

28
Some Future Directions
  • Using gene-chip data to select therapy: predict
    which therapy gives best prognosis for
    patient
  • Combining Gene Expression Data with Clinical Data
    such as Lab Results, Medical and Family History
  • Multiple relational tables, may benefit from
    relational learning

29
Unsupervised Learning Task
  • Given: a set of microarray experiments under
    different conditions
  • Do: cluster the genes, where a gene is described by
    its expression levels in different experiments

30
Location in Task Space
31
Example (Green: up-regulated, Red:
down-regulated)
[Figure axes: Genes vs. Experiments (Samples)]
32
Visualizing Gene Clusters (e.g., Sharan and
Shamir, 2000)
[Figure panels: Gene Cluster 1, size = 20; Gene Cluster 2, size = 43; x-axis: Time (10-minute intervals)]
33
Unsupervised Learning Task 2
  • Given: a set of microarray experiments (samples)
    corresponding to different conditions or
    patients
  • Do: cluster the experiments

34
Location in Task Space
35
Examples
  • Cluster samples from mice subjected to a variety
    of toxic compounds (Thomas et al., 2001)
  • Cluster samples from cancer patients, potentially
    to discover different subtypes of a cancer
  • Cluster samples taken at different time points

36
Some Biological Pathways
  • Regulatory pathways
  • Nodes are labeled by genes
  • Arcs denote influence on transcription
  • G1 codes for P1, P1 inhibits G2's transcription
  • Metabolic pathways
  • Nodes are metabolites, large biomolecules (e.g.,
    sugars, lipids, proteins and modified proteins)
  • Arcs from biochemical reaction inputs to outputs
  • Arcs labeled by enzymes (themselves proteins)

37
Metabolic Pathway Example
[Figure: the Krebs cycle (TCA cycle, citric acid cycle). Metabolites: Acetyl CoA, Oxaloacetate, Citrate, cis-Aconitate, Isocitrate, α-Ketoglutarate, Succinyl-CoA, Succinate, Fumarate, Malate. Enzymes: citrate synthase, aconitase, IDH, α-KGDH, succinate thiokinase, fumarase, MDH. Cofactors and by-products: H2O, HSCoA, NAD, NADH, CO2, FAD, FADH2, GDP, Pi, GTP.]
38
Regulatory Pathway (KEGG)
39
Using Microarray Data Only
  • Regulatory pathways
  • Nodes are labeled by genes
  • Arcs denote influence on transcription
  • G1 codes for P1, P1 inhibits G2's transcription
  • Metabolic pathways
  • Nodes are metabolites, large biomolecules (e.g.,
    sugars, lipids, proteins, and modified proteins)
  • Arcs from biochemical reaction inputs to outputs
  • Arcs labeled by enzymes (themselves proteins)

40
Supervised Learning Task 2
  • Given: a set of microarray experiments for the same
    organism under different conditions
  • Do: learn a graphical model that accurately
    predicts expression of some genes in terms of
    others

41
Some Approaches to Learning Regulatory Networks
  • Bayes Net Learning (started with Friedman &
    Halpern, 1999; we'll see more)
  • Boolean Networks (Akutsu, Kuhara, Maruyama &
    Miyano, 1998; Ideker, Thorsson & Karp, 2002)
  • Related Graphical Approaches (Tanay & Shamir,
    2001; Chrisman, Langley, Bay & Pohorille, 2003)

42
Bayesian Network (BN)
Note: direction of arrow indicates dependence, not
causality
43
Problem: Not Causality
[Figure: network with an arc between A and B]
A is a good predictor of B. But is A regulating
B? Ground truth might be:
[Figure: alternative network structures over A, B, and C]
Or a more complicated variant.
44
Approaches to Get Causality
  • Use knock-outs (Pe'er, Regev, Elidan and
    Friedman, 2001). But not available in most
    organisms.
  • Use time-series data and Dynamic Bayesian
    Networks (Ong, Glasner and Page, 2002). But even
    less data typically.
  • Use other data sources, e.g., sequences upstream of
    genes, where transcription regulators may bind.
    (Segal, Barash, Simon, Friedman and Koller, 2002;
    Noto and Craven, 2005)

45
A Dynamic Bayes Net
46
Problem: Not Enough Data Points to Construct
Large Network
  • Fortunate to get 100s of chips
  • But have 1000s of genes
  • E. coli: 4000
  • Yeast: 6000
  • Human: 30,000
  • Want to learn causal graphical model over 1000s
    of variables with 100s of examples (settings of
    the variables)

47
Advance: Module Networks (Segal, Pe'er, Regev,
Koller & Friedman, 2005)
  • Cluster genes by similarity over expression
    experiments
  • All genes in a cluster are tied together: same
    parents and CPDs
  • Learn structure subject to this tying together of
    genes
  • Iteratively re-form clusters and re-learn
    network, in an EM-like fashion

48
Problem: Data are Continuous but Models are
Discrete
  • Gene chips provide a real-valued mRNA measurement
  • Boolean networks and most practical Bayes net
    learning algorithms assume discrete variables
  • May lose valuable information by discretizing

49
Advance: Use of Dynamic Bayes Nets with
Continuous Variables (Segal, Pe'er, Regev, Koller &
Friedman, 2005)
  • Expression measurements used instead of
    discretized values (up, down, same)
  • Assume linear influence of parents on children
    (Michaelis-Menten assumption)
  • Work so far constructed the network from
    literature and learned parameters

50
Problem: Much Missing Information
  • mRNA from gene 1 doesn't directly alter level of
    mRNA from gene 2
  • Rather, the protein product from gene 1 may alter
    level of mRNA from gene 2 (e.g., transcription
    factor)
  • Activation of transcription factor might not
    occur by making more of it, but just by
    phosphorylating it (post-translational
    modification)

51
Example: Transcription Regulation
[Figure labels: Operon, Operon, DNA]
52
Approach: Measure More Stuff
  • Mass spectrometry (later) can measure protein
    rather than mRNA
  • Doesn't measure all proteins
  • Not very quantitative (presence/absence)
  • 2D gels can measure post-translational
    modifications, but still low-throughput because
    of current analysis
  • Co-immunoprecipitation (later), Yeast 2-Hybrids
    can measure protein interactions, but noisy

53
Another Way Around Limitations
  • Identify smaller part of the task that is a step
    toward a full regulatory pathway
  • Part of a pathway
  • Classes or groups of genes
  • Examples:
  • Chromatin remodelers
  • Predicting the operons in E. coli

54
Chromatin Remodelers and Nucleosomes (Segal et al.,
2006)
  • Previous DNA picture oversimplified
  • DNA double-helix is wrapped in further complex
    structure
  • DNA is accessible only if part of this structure
    is unwound
  • Can we predict what chromatin remodelers act on
    what parts of DNA, also what activates a
    remodeler?

55
The E. Coli Genome
56
Finding Operons in E. coli (Craven, Page,
Shavlik, Bockhorst and Glasner, 2000)
[Figure: an operon, showing the promoter, genes g1 to g5, and the terminator]
  • Given: known operons and other E. coli data
  • Do: predict all operons in E. coli
  • Additional sources of information
  • gene-expression data
  • functional annotation
57
Comparing Naive Bayes and Decision Trees (C5.0)
58
Using Only Individual Features
59
Single-Nucleotide Polymorphisms
  • SNPs: individual positions in DNA where
    variation is common
  • Roughly 2 million known SNPs in humans
  • New Affymetrix whole-genome scan measures 500,000
    of these
  • Easier/faster/cheaper to measure SNPs than to
    completely sequence everyone
  • Motivation

60
If We Sequenced Everyone
Susceptible to Disease D or Responds to Treatment
T
Not Susceptible or Not Responding
61
Example of SNP Data
62
Phasing (Haplotyping)
63
Advantages of SNP Data
  • A person's SNP pattern does not change with time or
    disease, so it can give more insight into
    susceptibility
  • Easier to collect samples (can simply use blood
    rather than affected tissue)

64
Challenges of SNP Data
  • Unphased: algorithms exist for phasing
    (haplotyping), but they make errors and
    typically need related individuals, dense
    coverage
  • Missing values are more common than in
    microarray data (though improving substantially,
    down to around 1-2% now)
  • Many more measurements. For example, the Affymetrix
    human SNP chip measures a half million SNPs.

65
Supervised Learning Task
  • Given: a set of SNP profiles, each from a
    different patient.
  • Phased: nucleotides at each SNP position on each
    copy of each chromosome constitute the features,
    and the patient's disease constitutes the class
  • Unphased: the unordered pair of nucleotides at each
    SNP position constitutes the features, and
    the patient's disease constitutes the class
  • Do: learn a model that accurately predicts
    class based on features
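
The difference between the phased and unphased feature representations can be sketched as follows. The two-SNP patients and the function names are hypothetical, chosen only to show that two different haplotype pairs can collapse to the same unphased features.

```python
def unphased_features(copy_a, copy_b):
    """Unordered pair of nucleotides at each SNP position (phase is discarded)."""
    return ["".join(sorted(pair)) for pair in zip(copy_a, copy_b)]

def phased_features(copy_a, copy_b):
    """Nucleotide at each SNP position on each chromosome copy, kept separate."""
    return list(copy_a) + list(copy_b)

# Two hypothetical patients with the same unordered genotypes but different phase.
p1 = ("AC", "GT")   # copy 1 carries A,C; copy 2 carries G,T
p2 = ("AT", "GC")   # copy 1 carries A,T; copy 2 carries G,C

print(unphased_features(*p1), unphased_features(*p2))
print(phased_features(*p1), phased_features(*p2))
```

Both patients yield the same unphased features, so a learner given unphased data cannot distinguish them, while the phased features differ.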

66
Waddell et al., 2005
  • Multiple Myeloma, Young (susceptible) vs. Old
    (less susceptible), 3000 SNPs, best at 64% accuracy
    (training)
  • SVM with feature selection (repeated on every
    fold of cross-validation): 72% accuracy, also
    naïve Bayes. Significantly better than chance.

67
Listgarten et al., 2005
  • SVMs from SNP data predict lung cancer
    susceptibility at 69% accuracy
  • Naïve Bayes gives similar performance
  • Best single SNP at less than 60% accuracy
    (training)

68
Lessons
  • Supervised data mining algorithms can predict
    disease susceptibility at rates better than
    chance and better than individual SNPs
  • Accuracies much lower than we see with microarray
    data, because we're predicting who will get
    disease, not who already has it

69
Future Directions
  • Pharmacogenetics: predicting drug response from
    SNP profile
  • Drug Efficacy
  • Adverse Reaction
  • Combining SNP data with other data types, such as
    clinical (history, lab tests) and microarray

70
Proteomics
  • Microarrays are useful primarily because mRNA
    concentrations serve as surrogate for protein
    concentrations
  • We would like to measure protein concentrations directly,
    but at present cannot do so in the same
    high-throughput manner
  • Proteins do not have obvious direct complements
  • Could build molecules that bind, but binding
    greatly affected by protein structure

71
Time-of-Flight (TOF) Mass Spectrometry (thanks to
Sean McIlwain)
  • Measures the time for an ionized particle,
    starting from the sample plate, to hit the
    detector
[Figure labels: Laser, Sample, Detector, V]
72
Time-of-Flight (TOF) Mass Spectrometry 2
  • Matrix-Assisted Laser Desorption-Ionization
    (MALDI)
  • Crystalloid structures made using proton-rich
    matrix molecule
  • Hitting crystalloid with laser causes molecules
    to ionize and fly towards detector
[Figure labels: Laser, Sample, Detector, V]
73
Time-of-Flight Demonstration 0
Sample Plate
74
Time-of-Flight Demonstration 1
Matrix Molecules
75
Time-of-Flight Demonstration 2
Protein Molecules
76
Time-of-Flight Demonstration 3
Laser
Detector
Positive Charge
10KV
77
Time-of-Flight Demonstration 4
Proton kicked off matrix molecule onto another
molecule
Laser pulsed directly onto sample
10KV
78
Time-of-Flight Demonstration 5
Lots of protons kicked off matrix ions, giving
rise to more positively charged molecules
10KV
79
Time-of-Flight Demonstration 6
The high positive potential under the sample plate
causes positively charged molecules to accelerate
towards the detector
10KV
80
Time-of-Flight Demonstration 7
Smaller-mass molecules hit the detector first, while
heavier ones are detected later
10Kv
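
Why lighter ions arrive first can be sketched with an idealized model, assuming each ion gains its full energy z·e·V at once and then drifts a fixed distance at constant speed. The 10 kV potential comes from the slides; the 1 m drift length and the example masses are assumptions for illustration only.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms

def flight_time(mass_da, charge=1, voltage=10_000.0, drift_length=1.0):
    """Idealized flight time in seconds: energy z*e*V becomes kinetic energy."""
    mass_kg = mass_da * DALTON
    speed = math.sqrt(2 * charge * ELEMENTARY_CHARGE * voltage / mass_kg)
    return drift_length / speed

for mass in (1_000, 4_000, 16_000):
    print(mass, flight_time(mass))
```

Under this model flight time grows with the square root of mass over charge, so quadrupling the mass doubles the flight time; this is why the spectrum is reported on an M/Z axis.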
81
Time-of-Flight Demonstration 8
The flight time is measured from when the laser is
pulsed until the molecule hits the detector
10KV
82
Time-of-Flight Demonstration 9
Experiment repeated a number of times, counting
frequencies of flight times
10KV
83
Example Spectra from a Competition by Lin et al.
at Duke
These are different fractions from the same
sample.
[Figure axes: Intensity vs. M/Z]
84
Trypsin-Treated Spectra
[Figure axes: Frequency vs. M/Z]
85
Many Challenges Raised by Mass Spectrometry Data
  • Noise: extra peaks from handling of sample, from
    machine and environment (electrical noise), etc.
  • M/Z values may not align exactly across spectra
    (resolution 0.1)
  • Intensities not calibrated across spectra;
    quantification is difficult
  • Cannot get all proteins; typically only several
    hundred. To improve odds of getting the ones we
    want, may fractionate our sample by 2D gel
    electrophoresis or liquid chromatography.

86
Challenges (Continued)
  • Better results if we partially digest proteins
    (break them into smaller peptides) first
  • Can be difficult to determine what proteins we
    have from spectrum
  • Isotopic peaks: C13 and N15 atoms in varying
    numbers cause multiple peaks for a single peptide

87
Handling Noise: Peak Picking
  • Want to pick peaks that are statistically
    significant relative to the noise signal

Want to use these as features in our learning
algorithms.
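
One simple way to pick peaks is sketched below: keep local maxima that rise well above a robust noise estimate. This heuristic (median plus k times the median absolute deviation) is an assumption chosen for illustration, not the method used in the tutorial, and the toy spectrum is made up.

```python
from statistics import median

def pick_peaks(intensities, k=5.0):
    """Return indices of local maxima that rise above median + k * MAD."""
    base = median(intensities)
    mad = median(abs(v - base) for v in intensities)
    threshold = base + k * mad
    peaks = []
    for i in range(1, len(intensities) - 1):
        v = intensities[i]
        if v > threshold and v > intensities[i - 1] and v >= intensities[i + 1]:
            peaks.append(i)
    return peaks

spectrum = [1, 2, 1, 2, 1, 30, 2, 1, 2, 12, 1, 2]
print(pick_peaks(spectrum))
```

The returned indices (or the corresponding M/Z values and intensities) can then serve as features for a learning algorithm.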
88
Many Supervised Learning Tasks
  • Learn to predict proteins from spectra, when the
    organism's proteome is known
  • Learn to identify isotopic distributions
  • Learn to predict disease from either proteins,
    peaks or isotopic distributions as features
  • Construct pathway models

89
Using Mass Spectrometry for Early Detection of
Ovarian Cancer (Petricoin et al., 2002)
  • Ovarian cancer is difficult to detect early, often
    leading to poor prognosis
  • Trained and tested on mass spectra from blood
    serum
  • 100 training cases, 50 with cancer
  • Held-out test set of 116 cases, 50 with cancer
  • 100% sensitivity, 95% specificity (63/66) on
    held-out test set
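
The reported rates follow directly from the counts on the slide, assuming all 50 cancer cases were detected and 63 of the 66 non-cancer cases were correctly called negative. A short check:

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that the test detects."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of actual negatives that the test correctly clears."""
    return true_neg / (true_neg + false_pos)

# Held-out test set from the slide: 116 cases, 50 with cancer, 66 without.
sens = sensitivity(true_pos=50, false_neg=0)
spec = specificity(true_neg=63, false_pos=3)
print(f"{sens:.1%} {spec:.1%}")
```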

90
Not So Fast
  • Data mining methodology seems sound
  • But Keith Baggerly argues that cancer samples
    were handled differently than normal samples, and
    perhaps data were preprocessed differently too
  • If we run cancer samples Monday and normals
    Wednesday, could get differences from machine
    breakdown or nearby electrical equipment that's
    running on Monday but not Wednesday
  • Lesson: tell collaborators they must randomize
    samples for the entire processing phase, and of
    course all our preprocessing must be the same
  • Debate is still raging; results not replicated in
    trials

91
Other Proteomics: 3D Structures
92
Other Proteomics: Interactions
Figure from Ideker et al., Science
292(5518):929-934, 2001
  • each node represents a gene product (protein)
  • blue edges show direct protein-protein
    interactions
  • yellow edges show interactions in which one
    protein binds to DNA and affects the expression
    of another

93
Protein-Protein Interactions
  • Yeast 2-Hybrid
  • Immunoprecipitation
  • Antibodies (immuno) are made by combinatorial
    combinations of certain proteins
  • Millions of antibodies can be made, to recognize
    a wide variety of different antigens
    (invaders), often by recognizing specific
    proteins

antibody
protein
94
Protein-Protein Interactions
95
Immunoprecipitation
antibody
96
Co-Immunoprecipitation
antibody
97
Many Supervised Learning Tasks
  • Learn to predict protein-protein interactions;
    protein 3D structures may be critical
  • Use protein-protein interactions in construction
    of pathway models
  • Learn to predict protein function from
    interaction data

98
ChIP-Chip Data
  • Immunoprecipitation can also be done to identify
    proteins interacting with DNA rather than other
    proteins
  • Chromatin immunoprecipitation (ChIP): grab sample
    of DNA bound to a particular protein
    (transcription factor)
  • ChIP-Chip: run this sample of DNA on a microarray
    to see which DNA was bound
  • Example of analysis of such new data: Keles et
    al., 2006

99
Metabolomics
  • Measures concentration of each low-molecular
    weight molecule in sample
  • These typically are metabolites, or small
    molecules produced or consumed by reactions in
    biochemical pathways
  • These reactions typically catalyzed by proteins
    (specifically, enzymes)
  • These data typically also come from mass spectrometry,
    though could also come from NMR

100
Lipomics
  • Analogous to metabolomics, but measuring
    concentrations of lipids rather than metabolites
  • Potentially help induce biochemical pathway
    information or to help disease diagnosis or
    treatment choice

101
To Design a Drug
  • Identify Target Protein: knowledge of
    proteome/genome; relevant biochemical pathways
  • Determine Target Site Structure: crystallography,
    NMR; difficult if membrane-bound
  • Synthesize a Molecule that Will Bind: imperfect
    modeling of structure; structures may change at
    binding
And even then...
102
Molecule Binds Target But May
  • Bind too tightly or not tightly enough.
  • Be toxic.
  • Have other effects (side-effects) in the body.
  • Break down as soon as it gets into the body, or
    may not leave the body soon enough.
  • Not get to where it should in the body
    (e.g., crossing blood-brain barrier).
  • Not diffuse from gut to bloodstream.

103
And Every Body is Different
  • Even if a molecule works in the test tube and
    works in animal studies, it may not work in
    people (will fail in clinical trials).
  • A molecule may work for some people but not
    others.
  • A molecule may cause harmful side-effects in some
    people but not others.

104
Typical Practice when Target Structure is Unknown
  • High-Throughput Screening (HTS): test many
    molecules (1,000,000) to find some that bind to
    target (ligands).
  • Infer (induce) shape of target site from 3D
    structural similarities.
  • Shared 3D substructure is called a pharmacophore.
  • Perfect example of a machine learning task with
    spatial target.

105
An Example of Structure Learning
Inactive
Active
106
Common Data Mining Approaches
  • Represent a molecule by thousands to millions of
    features and use standard techniques (e.g., KDD
    Cup 2001)
  • Represent each low-energy conformer by feature
    vector and use multiple-instance learning (e.g.,
    Jain et al., 1998)
  • Relational learning
  • Inductive logic programming (e.g., Finn et al.,
    1998)
  • Graph mining

107
Supervised Learning Task
  • Given a set of molecules, each labeled by
    activity -- binding affinity for target protein
    -- and a set of low-energy conformers for each
    molecule
  • Do Learn a model that accurately predicts
    activity (may be Boolean or real-valued)

108
ILP as Illustration: The Logical Representation
of a Pharmacophore
109
Background Knowledge I
  • Information about atoms and bonds in the
    molecules
  • atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
  • atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
  • atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
  • bond(m1,a1,a2,1).
  • bond(m1,a2,a3,1).

110
Background knowledge II
  • Definition of distance equivalence
  • dist(Drug,Atom1,Atom2,Dist,Error) :-
  • number(Error),
  • coord(Drug,Atom1,X1,Y1,Z1),
  • coord(Drug,Atom2,X2,Y2,Z2),
  • euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
  • Diff is Dist1-Dist,
  • absolute_value(Diff,E1),
  • E1 < Error.
  • euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D) :-
  • Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2,
  • D is sqrt(Dsq).
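
For readers less familiar with Prolog, a rough Python analogue of the distance test is sketched below. It assumes the last three arguments of the atm facts on the previous slide are x, y, z coordinates; the dictionary layout and the name dist_matches are illustrative and not part of the original background knowledge.

```python
import math

def euc_dist(p, q):
    """Euclidean distance between two 3D points given as (x, y, z) tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dist_matches(coords, atom1, atom2, dist, error):
    """True if the two atoms lie within `error` of the target distance `dist`."""
    return abs(euc_dist(coords[atom1], coords[atom2]) - dist) < error

# Coordinates of atoms a1 and a2 of molecule m1 (assumed to be x, y, z).
m1 = {
    "a1": (5.915800, -2.441200, 1.799700),
    "a2": (0.574700, -2.773300, 0.337600),
}
print(euc_dist(m1["a1"], m1["a2"]))
print(dist_matches(m1, "a1", "a2", dist=5.5, error=0.75))
```

The tolerance plays the same role as the Error argument in the Prolog clause: a learned pharmacophore states target distances, and a conformation matches if every pairwise distance falls within tolerance.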

111
Central Idea Generalize by searching a lattice
112
Conformational model
  • Conformational flexibility modelled as multiple
    conformations
  • Sybyl randomsearch
  • Catalyst

113
Pharmacophore description
  • Atom and site centred
  • Hydrogen bond donor
  • Hydrogen bond acceptor
  • Hydrophobe
  • Site points (limited at present)
  • User definable
  • Distance based

114
Example I: Dopamine agonists
  • Agonists taken from Martin data set on QSAR
    society web pages
  • Examples (5-50 conformations/molecule)

115
Pharmacophore identified
  • Molecule A has the desired activity if
  • in conformation B molecule A contains a
    hydrogen acceptor at C, and
  • in conformation B molecule A contains a basic
    nitrogen group at D, and
  • the distance between C and D is 7.05966 +/-
    0.75 Angstroms, and
  • in conformation B molecule A contains a
    hydrogen acceptor at E, and
  • the distance between C and E is 2.80871 +/-
    0.75 Angstroms, and
  • the distance between D and E is 6.36846 +/-
    0.75 Angstroms, and
  • in conformation B molecule A contains a
    hydrophobic group at F, and
  • the distance between C and F is 2.68136 +/-
    0.75 Angstroms, and
  • the distance between D and F is 4.80399 +/-
    0.75 Angstroms, and
  • the distance between E and F is 2.74602 +/-
    0.75 Angstroms.

116
Example II: ACE inhibitors
  • 28 angiotensin converting enzyme inhibitors taken
    from literature
  • D. Mayer et al., J. Comput.-Aided Mol. Design, 1,
    3-16, (1987)

117
ACE pharmacophore
  • Molecule A is an ACE inhibitor if
  • molecule A contains a zinc-site B,
  • molecule A contains a hydrogen acceptor C,
  • the distance between B and C is 7.899 +/-
    0.750 Angstroms,
  • molecule A contains a hydrogen acceptor D,
  • the distance between B and D is 8.475 +/-
    0.750 Angstroms,
  • the distance between C and D is 2.133 +/-
    0.750 Angstroms,
  • molecule A contains a hydrogen acceptor E,
  • the distance between B and E is 4.891 +/-
    0.750 Angstroms,
  • the distance between C and E is 3.114 +/-
    0.750 Angstroms,
  • the distance between D and E is 3.753 +/-
    0.750 Angstroms.

118
Pharmacophore discovered
Zinc site H-bond acceptor
119
Additional Finding
  • Original pharmacophore rediscovered plus one
    other
  • different zinc ligand position
  • similar to alternative proposed by Ciba-Geigy

120
Example III: Thermolysin inhibitors
  • 10 inhibitors for which crystallographic data is
    available in PDB
  • Conformationally challenging molecules
  • Experimentally observed superposition

121
Key binding site interactions
Asn112-NH
OC Asn112
S2
Arg203-NH
S1
OC Ala113
Zn
122
Interactions made by inhibitors
123
Pharmacophore Identification
  • Structures considered 1HYT 1THL 1TLP 1TMN 2TMN
    4TLN 4TMN 5TLN 5TMN 6TMN
  • Conformational analysis using Best conformer
    generation in Catalyst
  • 98-251 conformations/molecule

124
Thermolysin Results
  • 10 5-point pharmacophores identified, falling into
    2 groups (7/10 molecules)
  • 3 acceptors, 1 hydrophobe, 1 donor
  • 4 acceptors, 1 donor
  • Common core of Zn ligands, Arg203 and Asn112
    interactions identified
  • Correct assignments of functional groups
  • Correct geometry to 1 Angstrom tolerance

125
Thermolysin results
  • Increasing tolerance to 1.5 Angstroms finds a common
    6-point pharmacophore including one extra
    interaction

126
Example IV: Antibacterial peptides (Spatola et
al., 2000)
  • Dataset of 11 pentapeptides showing activity
    against Pseudomonas aeruginosa
  • 6 actives (< 64 mg/ml IC50)
  • 5 inactives

127
Pharmacophore Identified
  • A molecule M is active against Pseudomonas
    aeruginosa if it has a conformation B such that
  • M has a hydrophobic group C,
  • M has a hydrogen acceptor D,
  • the distance between C and D in conformation B
    is 11.7 Angstroms,
  • M has a positively-charged atom E,
  • the distance between C and E in conformation B
    is 4 Angstroms,
  • the distance between D and E in conformation B
    is 9.4 Angstroms,
  • M has a positively-charged atom F,
  • the distance between C and F in conformation B
    is 11.1 Angstroms,
  • the distance between D and F in conformation B
    is 12.6 Angstroms,
  • the distance between E and F in conformation B
    is 8.7 Angstroms.
  • Tolerance: 1.5 Angstroms
129
Clinical Databases of the Future (Dramatically
Simplified)
PatientID  Date    Physician  Symptoms      Diagnosis
P1         1/1/01  Smith      palpitations  hypoglycemic
P1         2/1/03  Jones      fever, aches  influenza

PatientID  Gender  Birthdate
P1         M       3/22/63

PatientID  Date    Lab Test       Result
P1         1/1/01  blood glucose  42
P1         1/9/01  blood glucose  45

PatientID  SNP1  SNP2  ...  SNP500K
P1         AA    AB    ...  BB
P2         AB    BB    ...  AA

PatientID  Date Prescribed  Date Filled  Physician  Medication  Dose  Duration
P1         5/17/98          5/18/98      Jones      prilosec    10mg  3 months
130
Final Wrap-up
  • Molecular biology is collecting lots and lots of
    data in the post-genome era
  • Opportunity to connect molecular-level
    information to diseases and treatment
  • Need analysis tools to interpret
  • Data mining opportunities abound
  • Hopefully this tutorial provided a solid start
    toward applying data mining to high-throughput
    biological data

131
Thanks To
  • Jude Shavlik
  • John Shaughnessy
  • Bart Barlogie
  • Mark Craven
  • Sean McIlwain
  • Jan Struyf
  • Arno Spatola
  • Paul Finn
  • Beth Burnside
  • Michael Molla
  • Michael Waddell
  • Irene Ong
  • Jesse Davis
  • Soumya Ray
  • Jo Hardin
  • John Crowley
  • Fenghuang Zhan
  • Eric Lantz

132
If Time Permits: some of my group's directions in
the area
  • Clinical Data (with Jesse Davis, Beth Burnside,
    M.D.)
  • Addressing another problem with current
    approaches to biological network learning (with
    Soumya Ray, Eric Lantz)

133
Using Machine Learning with Clinical Histories:
Example
  • We'll use the example of mammography to show some
    issues that arise
  • These issues arise here even with just one
    relational table
  • These issues are even more pronounced with data
    in multiple tables

134
Supervised Learning Task
  • Given: a database of mammogram abnormalities for
    different patients (same cell type from every
    patient). Radiologist-entered values describing
    the abnormality constitute the features, and
    the abnormality's biopsy result as benign or
    malignant constitutes the class
  • Do: learn a model that accurately predicts
    class based on features

135
Mammography Database (Davis et al., 2005; Burnside
et al., 2005)
136
Original Expert Structure
137
Level 1: Parameters
Given: features (node labels, or fields in
database), data, Bayes net structure. Learn:
probabilities. Note: probabilities needed are
Pr(Be/Mal), Pr(Shape | Be/Mal), Pr(Size | Be/Mal)
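
With the structure fixed, learning the parameters amounts to counting. The sketch below estimates one conditional probability table by maximum likelihood; the records, field names, and values are hypothetical, and a real system would typically add smoothing for unseen combinations.

```python
from collections import Counter

def estimate_cpt(records, child, parent):
    """Maximum-likelihood estimate of Pr(child | parent) by counting."""
    pair_counts = Counter((r[parent], r[child]) for r in records)
    parent_counts = Counter(r[parent] for r in records)
    return {
        (p, c): n / parent_counts[p]
        for (p, c), n in pair_counts.items()
    }

# Hypothetical abnormality records; field names and values are made up.
records = [
    {"label": "malignant", "shape": "irregular"},
    {"label": "malignant", "shape": "irregular"},
    {"label": "malignant", "shape": "round"},
    {"label": "benign", "shape": "round"},
]
cpt = estimate_cpt(records, child="shape", parent="label")
print(cpt[("malignant", "irregular")])
```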
138
Level 2: Structure
Given: features, data. Learn: Bayes
net structure and probabilities. Note: with this
structure, we now need Pr(Size | Shape, Be/Mal)
instead of Pr(Size | Be/Mal).
[Figure nodes: Be/Mal, Shape, Size]
139
Mammography Database
140
Mammography Database
141
Mammography Database
142
Level 3: Aggregates
Given: features, data, and background knowledge
(aggregation functions such as average, mode, max,
etc.). Learn: useful aggregate features, a
Bayes net structure that uses these features, and
probabilities. New features may use other
rows/tables.
[Figure nodes: Avg size this date, Be/Mal, Shape, Size]
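
An aggregate feature such as "average size this date" draws on other rows of the same table. A minimal sketch, with hypothetical field names and values:

```python
from collections import defaultdict
from statistics import mean

def add_avg_size_this_date(rows):
    """Attach the mean abnormality size per (patient, date) to every row."""
    sizes = defaultdict(list)
    for row in rows:
        sizes[(row["patient"], row["date"])].append(row["size"])
    return [
        {**row, "avg_size_this_date": mean(sizes[(row["patient"], row["date"])])}
        for row in rows
    ]

rows = [
    {"patient": "P1", "date": "5/02", "size": 3},
    {"patient": "P1", "date": "5/02", "size": 5},
    {"patient": "P1", "date": "5/04", "size": 6},
]
for row in add_avg_size_this_date(rows):
    print(row)
```

Each abnormality row gains a new column that a Bayes net can then use as an ordinary feature.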
143
Mammography Database
144
Mammography Database
145
Mammography Database
146
Level 4: View Learning
Given: features, data, and background knowledge
(aggregation functions and intensionally-defined
relations such as "increase" or "same
location"). Learn: useful new features defined by
views (equivalent to rules or SQL queries), Bayes
net structure, and probabilities.
[Figure nodes: Shape change in abnormality at this location, Increase in average size of abnormalities, Avg size this date, Be/Mal, Shape, Size]
147
Example of Learned Rule
is_malignant(A) IF
    'BIRADS_category'(A,b5),
    'MassPAO'(A,present),
    'MassesDensity'(A,high),
    'HO_BreastCA'(A,hxDCorLC),
    in_same_mammogram(A,B),
    'Calc_Pleomorphic'(B,notPresent),
    'Calc_Punctate'(B,notPresent).
148
ROC: Level 2 (TAN) vs. Level 1
149
Precision-Recall Curves
151
SAYU-View
  • Improved View Learning approach
  • SAYU: Score As You Use
  • For each candidate rule, add it to the Bayesian
    network and see if it improves the network's
    score
  • Only add a rule (new field for the view) if it
    improves the Bayes net
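
The control flow of the idea above can be sketched as a greedy loop. The scoring function here is a stand-in: in practice the score would come from retraining the Bayes net with the candidate rule as an extra feature and evaluating it, for example on held-aside data. The rule names and gain values are made up.

```python
def sayu_select(candidate_rules, score):
    """Greedily keep each rule only if adding it improves the model score."""
    kept = []
    best = score(kept)
    for rule in candidate_rules:
        trial = score(kept + [rule])
        if trial > best:
            kept.append(rule)
            best = trial
    return kept, best

# Stand-in scorer (hypothetical): each rule adds or subtracts a fixed amount.
GAIN = {"r1": 0.10, "r2": -0.02, "r3": 0.05}

def toy_score(rules):
    return 0.70 + sum(GAIN[r] for r in rules)

print(sayu_select(["r1", "r2", "r3"], toy_score))
```

The key design point is that a rule is judged by its effect on the final model rather than by its own accuracy in isolation.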

154
Clinical Databases of the Future (Dramatically
Simplified)
[Same tables as on slide 129.]
155
Another Problem with Current Learning of
Regulatory Models
  • Current techniques all use greedy heuristics
  • Bayes net learning algorithms use the sparse
    candidate approach: to be considered as a parent
    of gene 1, another gene 2 must be correlated with
    gene 1
  • CPDs are often represented as trees, learned with
    greedy tree learning algorithms
  • All can fall prey to functions such as
    exclusive-or; do these arise?

156
Skewing Example (Page & Ray, 2003; Ray & Page,
2004; Rosell et al., 2005; Ray & Page, 2005)
Drosophila survival based on gender and Sxl gene
activity
157
Hard Functions
  • Our definition: those functions for which no
    attribute has gain according to standard purity
    measures (GINI, Entropy)
  • NOTE: "hard" does not refer to size of
    representation
  • Example: n-variable odd parity
  • Many others
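
The definition can be checked directly for small parity functions. The sketch below computes information gain (the entropy-based purity measure) for each attribute of 3-variable odd parity over the full truth table; the helper names are illustrative.

```python
import math
from itertools import product

def entropy(labels):
    """Shannon entropy (bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def info_gain(examples, attr):
    """Information gain of splitting (inputs, label) examples on one attribute."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for x, y in examples if x[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# 3-variable odd parity over the full truth table (a uniform distribution).
parity = [(x, sum(x) % 2) for x in product((0, 1), repeat=3)]
print([info_gain(parity, a) for a in range(3)])
```

Every attribute has zero gain even though all three are relevant, so a greedy learner has no basis for choosing a split.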

158
Learning Hard Functions
  • Standard method of learning hard functions (e.g.,
    with decision trees): depth-k lookahead
  • O(m n^(2^(k+1) - 1)) for m examples in n variables
  • We devise a technique that allows learning
    algorithms to efficiently learn hard functions

159
Key Idea
  • Hard functions are not hard for all data
    distributions
  • We can skew the input distribution to simulate a
    different one
  • By randomly choosing preferred values for
    attributes
  • Accumulate evidence over several skews to select
    a split attribute
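
A single skew can be sketched as example reweighting. This is a simplified illustration under stated assumptions: each example is weighted by a product over attributes, with a higher factor when the attribute takes its preferred value. The function names, the fixed preferred values, and the strength of 0.75 are illustrative; the published algorithm chooses preferred values at random and combines evidence over several skews.

```python
import math
from itertools import product

def weighted_entropy(pairs):
    """Entropy (bits) of 0/1 labels given (label, weight) pairs."""
    total = sum(w for _, w in pairs)
    if total == 0:
        return 0.0
    p = sum(w for y, w in pairs if y == 1) / total
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def skewed_gain(examples, attr, preferred, strength=0.75):
    """Information gain of `attr` after reweighting examples toward `preferred`."""
    weighted = []
    for x, y in examples:
        w = 1.0
        for xi, fav in zip(x, preferred):
            w *= strength if xi == fav else 1 - strength
        weighted.append((x, y, w))
    total = sum(w for _, _, w in weighted)
    gain = weighted_entropy([(y, w) for _, y, w in weighted])
    for value in (0, 1):
        part = [(y, w) for x, y, w in weighted if x[attr] == value]
        gain -= sum(w for _, w in part) / total * weighted_entropy(part)
    return gain

# Target is x0 XOR x1; x2 is irrelevant. Full truth table = uniform data.
data = [(x, x[0] ^ x[1]) for x in product((0, 1), repeat=3)]
uniform = [skewed_gain(data, a, preferred=(1, 1, 1), strength=0.5) for a in range(3)]
skewed = [skewed_gain(data, a, preferred=(1, 1, 1), strength=0.75) for a in range(3)]
print(uniform)
print(skewed)
```

Under the uniform weighting no attribute shows gain, while under the skew the two relevant attributes show positive gain and the irrelevant one still shows none.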

160
Example: Uniform Distribution
161
Example: Skewed Distribution (Sequential
Skewing, Ray & Page, ICML 2004)