Title: Machine Learning for High-Throughput Biological Data
1. Machine Learning for High-Throughput Biological Data
These notes were originally from the KDD 2006 tutorial
notes, written by David Page, Dept. of Biostatistics and
Medical Informatics and Dept. of Computer Sciences,
University of Wisconsin-Madison.
http://www.biostat.wisc.edu/page/PageKDD2006.ppt
2. Some Data Types We'll Discuss
- Gene expression microarrays
- Single-nucleotide polymorphisms (SNPs)
- Mass spectrometry proteomics and metabolomics
- Protein-protein interactions (from co-immunoprecipitation)
- High-throughput screening of potential drug molecules
3. Image from the DOE Human Genome Program, http://www.ornl.gov/hgmis
4. How Microarrays Work
Probes (DNA)
Labeled Sample (RNA)
Hybridization
Gene Chip Surface
5. Two Views of Microarray Data
- Data points are genes
  - Represented by expression levels across different samples (i.e., features = samples)
  - Goal: categorize new genes
- Data points are samples (e.g., patients)
  - Represented by expression levels of different genes (i.e., features = genes)
  - Goal: categorize new samples
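The two views are simply transposes of one expression matrix; a minimal numpy sketch with made-up numbers:

```python
import numpy as np

# Hypothetical expression matrix: rows = genes, columns = samples.
expr = np.array([
    [2.1, 1.9, 8.0, 7.5],   # gene g1 across 4 samples
    [0.3, 0.4, 0.2, 0.5],   # gene g2
    [5.0, 4.8, 1.1, 0.9],   # gene g3
])

# View 1: data points are genes, features are samples.
genes_as_points = expr            # shape (3 genes, 4 features)

# View 2: data points are samples, features are genes.
samples_as_points = expr.T        # shape (4 samples, 3 features)

print(genes_as_points.shape)      # (3, 4)
print(samples_as_points.shape)    # (4, 3)
```

The same numbers support both tasks; only the choice of which axis indexes "data points" changes.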
6. Two Ways to View the Data
7. Data Points are Genes
8. Data Points are Samples
9. Supervision: Add Class Values
10. Supervised Learning Task
- Given: a set of microarray experiments, each done with mRNA from a different patient (same cell type from every patient). Patients' expression values for each gene constitute the features, and patients' disease constitutes the class.
- Do: learn a model that accurately predicts class based on features.
11. Location in Task Space
12. Leukemia (Golub et al., 1999)
- Classes: Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML)
- Approach: weighted voting (essentially naïve Bayes)
- Cross-validated accuracy: of 34 samples, declined to predict 5, correct on the other 29
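A sketch of Golub-style weighted voting: each gene gets a signal-to-noise weight and votes relative to a per-gene decision boundary, and predictions that are too close to call are declined. The decline threshold value and the small epsilon guards are illustrative choices, not taken from the paper:

```python
import numpy as np

def weighted_vote(train_X, train_y, x, decline_threshold=0.3):
    """Golub-style weighted voting (close to naive Bayes).

    train_X: (n_samples, n_genes), train_y: 0/1 labels, x: new profile.
    Returns 1, 0, or None (declines to predict)."""
    X0, X1 = train_X[train_y == 0], train_X[train_y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    sd0, sd1 = X0.std(axis=0) + 1e-9, X1.std(axis=0) + 1e-9
    a = (mu1 - mu0) / (sd1 + sd0)    # signal-to-noise weight per gene
    b = (mu1 + mu0) / 2.0            # per-gene decision boundary
    votes = a * (x - b)              # each gene casts a weighted vote
    v1 = votes[votes > 0].sum()      # total weight for class 1
    v0 = -votes[votes < 0].sum()     # total weight for class 0
    strength = abs(v1 - v0) / (v1 + v0 + 1e-9)
    if strength < decline_threshold:
        return None                  # too close to call: decline
    return 1 if v1 > v0 else 0

# Toy example: gene 0 separates the classes, the rest are noise.
rng = np.random.default_rng(0)
healthy = rng.normal(0, 1, (20, 5)); healthy[:, 0] -= 3.0
disease = rng.normal(0, 1, (20, 5)); disease[:, 0] += 3.0
X = np.vstack([healthy, disease])
y = np.array([0] * 20 + [1] * 20)
print(weighted_vote(X, y, disease.mean(axis=0)))   # 1
```

Declining when the vote margin is small is how "declined to predict 5" of the 34 samples arises.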
13. Cancer vs. Normal
- Relatively easy to predict accurately, because so much goes haywire in cancer cells
- Primary barrier is noise in the data: impure RNA, cross-hybridization, etc.
- Studies include breast, colon, prostate, lymphoma, and multiple myeloma
14. X-Val Accuracies for Multiple Myeloma (74 MM vs. 31 Normal)
15. More MM (300), Benign Condition MGUS (Hardin et al., 2004)
16. ROC Curves: Cancer vs. Normal
17. ROC: Cancer vs. Benign (MGUS)
18. Work by Statisticians Outside of Standard Classification/Clustering
- Methods to better convert Affymetrix's low-level intensity measurements into expression levels, e.g., work by Speed, Wong, Irizarry
- Methods to find differentially expressed genes between two samples, e.g., work by Newton and Kendziorski
- But the following is most related
19. Ranking Genes by Significance
- Some biologists don't want one predictive model, but a rank-ordered list of genes to explore further (with estimated significance)
- For each gene we have a set of expression levels under our conditions, say cancer vs. normal
- We can do a t-test to see if the mean expression levels are different under the two conditions: p-value
- Multiple comparisons problem: if we repeat this test for 30,000 genes, some will pop up as significant just by chance alone
- Could do a Bonferroni correction (multiply p-values by 30,000), but this is drastic and might eliminate all
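The multiple-comparisons problem is easy to demonstrate: run one t-test per gene on pure noise and plenty of genes look significant, until a Bonferroni correction wipes essentially all of them out. A sketch on hypothetical data (scipy assumed available; 1,000 genes stand in for 30,000):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes = 1000                      # small stand-in for ~30,000 genes
cancer = rng.normal(0.0, 1.0, (25, n_genes))
normal = rng.normal(0.0, 1.0, (25, n_genes))   # null: no gene truly differs

# One t-test per gene: do mean expression levels differ between conditions?
t, p = stats.ttest_ind(cancer, normal, axis=0)

print((p < 0.05).sum())             # dozens look "significant" by chance alone

# Bonferroni: multiply p-values by the number of tests (drastic).
p_bonf = np.minimum(p * n_genes, 1.0)
print((p_bonf < 0.05).sum())        # typically zero: may eliminate all
```

With 1,000 null tests at the 0.05 level, roughly 50 false positives are expected; Bonferroni removes them, but would also remove most genuinely differential genes in real data.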
20. False Discovery Rate (FDR): Storey and Tibshirani, 2001
- Addresses multiple comparisons but is less extreme than Bonferroni
- Replaces the p-value by a q-value: the fraction of genes with this p-value or lower that really don't have different means in the two classes (false discoveries)
- Publicly available in R as part of the Bioconductor package
- Recommendation: use this in addition to your supervised data mining; your collaborators will want to see it
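A minimal sketch of the q-value idea using the Benjamini-Hochberg step-up formula. Storey and Tibshirani's method additionally estimates pi0, the fraction of truly null genes, from the p-value histogram; here pi0 is left as a parameter:

```python
import numpy as np

def q_values(p, pi0=1.0):
    """BH-style q-values. pi0 is the assumed fraction of truly null
    genes (Storey's method estimates it; here it is a parameter)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    q = np.empty(m)
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for i in range(m - 1, -1, -1):
        idx = order[i]
        rank = i + 1                       # 1-based rank among sorted p's
        prev = min(prev, pi0 * m * p[idx] / rank)
        q[idx] = prev
    return q

print(q_values([0.001, 0.01, 0.02, 0.8]))
```

A gene's q-value estimates the FDR incurred if you call every gene with an equal or smaller p-value significant, which is far less drastic than multiplying every p-value by the full number of tests.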
21. FDR Highlights Difficulties Getting Insight into Cancer vs. Normal
22. Using Benign Condition Instead of Normal Helps Somewhat
23. Question to Anticipate
- You've run a supervised data mining algorithm on your collaborators' data, and you present an estimate of accuracy or an ROC curve (from X-val)
- How did you adjust this for the multiple comparisons problem?
- Answer: you don't need to, because you commit to a single predictive model before ever looking at the test data for a fold; this is only one comparison
24. Prognosis and Treatment
- Features: same as for diagnosis
- Rather than disease state, the class value becomes life expectancy with a given treatment (or positive response vs. no response to a given treatment)
25. Breast Cancer Prognosis (Van't Veer et al., 2002)
- Classes: good prognosis (no metastasis within five years of initial diagnosis) vs. poor prognosis
- Algorithm: ensemble of voters
- Results: 83% cross-validated accuracy on 78 cases
26. A Lesson
- Previous work selected the features to use in the ensemble by looking at the entire data set
- Should have repeated feature selection on each cross-val fold
- Authors also chose ensemble size by seeing which size gave the highest cross-val result
- Authors corrected this in a web supplement; accuracy went from 83% to 73%
- Remember to tune parameters separately for each cross-val fold!
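The lesson can be demonstrated directly: on random data with random labels (true predictability 50%), selecting features on the whole data set before cross-validation inflates the estimate, while per-fold selection stays honest. A sketch using a simple class-mean-difference filter and a nearest-centroid classifier (both stand-ins, not the paper's methods):

```python
import numpy as np

def top_k_features(X, y, k):
    # Rank features by absolute difference in class means (a simple filter).
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]

def centroid_predict(Xtr, ytr, Xte):
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - c0) ** 2).sum(axis=1)
    d1 = ((Xte - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_accuracy(X, y, k=10, folds=5, select_inside_fold=True):
    idx = np.arange(len(y))
    acc = []
    if not select_inside_fold:            # flawed protocol:
        feats = top_k_features(X, y, k)   # features chosen on ALL data
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        if select_inside_fold:            # correct protocol:
            feats = top_k_features(X[train], y[train], k)
        pred = centroid_predict(X[train][:, feats], y[train], X[test][:, feats])
        acc.append((pred == y[test]).mean())
    return np.mean(acc)

# Random data, random labels: any honest estimate should be near 50%.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))
y = rng.integers(0, 2, 60)
print(cv_accuracy(X, y, select_inside_fold=False))   # optimistically inflated
print(cv_accuracy(X, y, select_inside_fold=True))    # near chance
```

With thousands of candidate features and few samples, selecting on the full data leaks test-label information into the features, which is exactly why the paper's accuracy dropped from 83% to 73% after correction.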
27. Prognosis with Specific Therapy (Rosenwald et al., 2002)
- Data set contains gene-expression patterns for 160 patients with diffuse large B-cell lymphoma, receiving anthracycline chemotherapy
- Class label is five-year survival
- One test-train split: 80/80
- True positive rate: 60%; false negative rate: 39%
28. Some Future Directions
- Using gene-chip data to select therapy: predict which therapy gives the best prognosis for the patient
- Combining gene expression data with clinical data such as lab results, medical and family history
- Multiple relational tables; may benefit from relational learning
29. Unsupervised Learning Task
- Given: a set of microarray experiments under different conditions
- Do: cluster the genes, where a gene is described by its expression levels in different experiments
30. Location in Task Space
31. Example (green: up-regulated, red: down-regulated)
[Heat map: rows are genes, columns are experiments (samples)]
32. Visualizing Gene Clusters (e.g., Sharan and Shamir, 2000)
Gene Cluster 1, size 20
Gene Cluster 2, size 43
Time (10-minute intervals)
33. Unsupervised Learning Task 2
- Given: a set of microarray experiments (samples) corresponding to different conditions or patients
- Do: cluster the experiments
34. Location in Task Space
35. Examples
- Cluster samples from mice subjected to a variety of toxic compounds (Thomas et al., 2001)
- Cluster samples from cancer patients, potentially to discover different subtypes of a cancer
- Cluster samples taken at different time points
36. Some Biological Pathways
- Regulatory pathways
  - Nodes are labeled by genes
  - Arcs denote influence on transcription
  - G1 codes for P1, P1 inhibits G2's transcription
- Metabolic pathways
  - Nodes are metabolites, large biomolecules (e.g., sugars, lipids, proteins, and modified proteins)
  - Arcs from biochemical reaction inputs to outputs
  - Arcs labeled by enzymes (themselves proteins)
37. Metabolic Pathway Example (Krebs Cycle, TCA Cycle, Citric Acid Cycle)
[Diagram: Oxaloacetate + Acetyl CoA → Citrate (citrate synthase; H2O in, HSCoA out) → cis-Aconitate → Isocitrate (aconitase; H2O) → α-Ketoglutarate (IDH; NAD → NADH + CO2) → Succinyl-CoA (α-KGDH; NAD + HSCoA → NADH + CO2) → Succinate (succinate thiokinase; GDP + Pi → GTP, HSCoA out) → Fumarate (FAD → FADH2) → Malate (fumarase; H2O) → Oxaloacetate (MDH; NAD → NADH)]
38. Regulatory Pathway (KEGG)
39. Using Microarray Data Only
- Regulatory pathways
  - Nodes are labeled by genes
  - Arcs denote influence on transcription
  - G1 codes for P1, P1 inhibits G2's transcription
- Metabolic pathways
  - Nodes are metabolites, large biomolecules (e.g., sugars, lipids, proteins, and modified proteins)
  - Arcs from biochemical reaction inputs to outputs
  - Arcs labeled by enzymes (themselves proteins)
40. Supervised Learning Task 2
- Given: a set of microarray experiments for the same organism under different conditions
- Do: learn a graphical model that accurately predicts expression of some genes in terms of others
41. Some Approaches to Learning Regulatory Networks
- Bayes net learning (started with Friedman & Halpern, 1999; we'll see more)
- Boolean networks (Akutsu, Kuhara, Maruyama & Miyano, 1998; Ideker, Thorsson & Karp, 2002)
- Related graphical approaches (Tanay & Shamir, 2001; Chrisman, Langley, Bay & Pohorille, 2003)
42. Bayesian Network (BN)
Note: the direction of an arrow indicates dependence, not causality
43. Problem: Not Causality
A is a good predictor of B. But is A regulating B? Ground truth might be:
[Diagrams: B regulating A; a third gene C influencing both A and B; or a more complicated variant]
44. Approaches to Get Causality
- Use knock-outs (Pe'er, Regev, Elidan and Friedman, 2001). But not available in most organisms.
- Use time-series data and Dynamic Bayesian Networks (Ong, Glasner and Page, 2002). But even less data typically.
- Use other data sources, e.g., sequences upstream of genes, where transcription regulators may bind (Segal, Barash, Simon, Friedman and Koller, 2002; Noto and Craven, 2005)
45. A Dynamic Bayes Net
46. Problem: Not Enough Data Points to Construct Large Network
- Fortunate to get 100s of chips
- But have 1000s of genes
  - E. coli: 4,000
  - Yeast: 6,000
  - Human: 30,000
- Want to learn a causal graphical model over 1000s of variables with 100s of examples (settings of the variables)
47. Advance: Module Networks (Segal, Pe'er, Regev, Koller & Friedman, 2005)
- Cluster genes by similarity over expression experiments
- All genes in a cluster are tied together: same parents and CPDs
- Learn structure subject to this tying together of genes
- Iteratively re-form clusters and re-learn the network, in an EM-like fashion
48. Problem: Data are Continuous but Models are Discrete
- Gene chips provide a real-valued mRNA measurement
- Boolean networks and most practical Bayes net learning algorithms assume discrete variables
- May lose valuable information by discretizing
49. Advance: Use of Dynamic Bayes Nets with Continuous Variables (Segal, Pe'er, Regev, Koller & Friedman, 2005)
- Expression measurements used instead of discretized values (up, down, same)
- Assume linear influence of parents on children (Michaelis-Menten assumption)
- Work so far constructed the network from the literature and learned parameters
50. Problem: Much Missing Information
- mRNA from gene 1 doesn't directly alter the level of mRNA from gene 2
- Rather, the protein product from gene 1 may alter the level of mRNA from gene 2 (e.g., a transcription factor)
- Activation of a transcription factor might not occur by making more of it, but just by phosphorylating it (post-translational modification)
51. Example: Transcription Regulation
[Diagram: two operons on DNA]
52. Approach: Measure More Stuff
- Mass spectrometry (later) can measure protein rather than mRNA
  - Doesn't measure all proteins
  - Not very quantitative (presence/absence)
- 2D gels can measure post-translational modifications, but still low-throughput because of current analysis
- Co-immunoprecipitation (later) and Yeast 2-Hybrids can measure protein interactions, but noisy
53. Another Way Around Limitations
- Identify a smaller part of the task that is a step toward a full regulatory pathway
  - Part of a pathway
  - Classes or groups of genes
- Examples
  - Chromatin remodelers
  - Predicting the operons in E. coli
54. Chromatin Remodelers and Nucleosomes (Segal et al., 2006)
- Previous DNA picture oversimplified
- The DNA double-helix is wrapped in a further complex structure
- DNA is accessible only if part of this structure is unwound
- Can we predict which chromatin remodelers act on which parts of DNA, and also what activates a remodeler?
55. The E. coli Genome
56. Finding Operons in E. coli (Craven, Page, Shavlik, Bockhorst and Glasner, 2000)
[Diagram: genes g1-g5 along the genome, with a promoter and terminator delimiting an operon]
- Given: known operons and other E. coli data
- Do: predict all operons in E. coli
- Additional sources of information
  - gene-expression data
  - functional annotation
57. Comparing Naive Bayes and Decision Trees (C5.0)
58. Using Only Individual Features
59. Single-Nucleotide Polymorphisms
- SNPs: individual positions in DNA where variation is common
- Roughly 2 million known SNPs in humans
- New Affymetrix whole-genome scan measures 500,000 of these
- Easier/faster/cheaper to measure SNPs than to completely sequence everyone
- Motivation:
60. If We Sequenced Everyone
[Diagram contrasting people susceptible to disease D or responding to treatment T with those not susceptible or not responding]
61. Example of SNP Data
62. Phasing (Haplotyping)
63. Advantages of SNP Data
- A person's SNP pattern does not change with time or disease, so it can give more insight into susceptibility
- Easier to collect samples (can simply use blood rather than affected tissue)
64. Challenges of SNP Data
- Unphased: algorithms exist for phasing (haplotyping), but they make errors and typically need related individuals and dense coverage
- Missing values are more common than in microarray data (though improving substantially, down to around 1-2% now)
- Many more measurements. For example, the Affymetrix human SNP chip measures half a million SNPs.
65. Supervised Learning Task
- Given: a set of SNP profiles, each from a different patient
  - Phased: nucleotides at each SNP position on each copy of each chromosome constitute the features, and the patient's disease constitutes the class
  - Unphased: the unordered pair of nucleotides at each SNP position constitutes the features, and the patient's disease constitutes the class
- Do: learn a model that accurately predicts class based on features
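In the unphased case, a common trick for turning each unordered genotype pair into a numeric feature is to count copies of one allele (0, 1, or 2). A minimal sketch with invented data (a real chip yields hundreds of thousands of columns):

```python
def encode_genotype(genotype, counted_allele):
    """AA/AB/BB-style unphased genotype -> copies of counted_allele."""
    return sum(1 for allele in genotype if allele == counted_allele)

# Hypothetical patients, each with genotypes at SNP1..SNP3.
patients = {
    "P1": ["AA", "AB", "BB"],
    "P2": ["AB", "BB", "AA"],
}

# One numeric feature per SNP: the number of B alleles carried.
features = {pid: [encode_genotype(g, "B") for g in snps]
            for pid, snps in patients.items()}
print(features)   # {'P1': [0, 1, 2], 'P2': [1, 2, 0]}
```

Because the pair is unordered, AB and BA encode identically, which is exactly the information loss that phasing algorithms try to recover.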
66. Waddell et al., 2005
- Multiple myeloma, young (susceptible) vs. old (less susceptible), 3,000 SNPs, best at 64% accuracy (training)
- SVM with feature selection (repeated on every fold of cross-validation): 72% accuracy, as did naïve Bayes. Significantly better than chance.
67. Listgarten et al., 2005
- SVMs from SNP data predict lung cancer susceptibility at 69% accuracy
- Naïve Bayes gives similar performance
- Best single SNP at less than 60% accuracy (training)
68. Lessons
- Supervised data mining algorithms can predict disease susceptibility at rates better than chance and better than individual SNPs
- Accuracies are much lower than we see with microarray data, because we're predicting who will get the disease, not who already has it
69. Future Directions
- Pharmacogenetics: predicting drug response from a SNP profile
  - Drug efficacy
  - Adverse reaction
- Combining SNP data with other data types, such as clinical (history, lab tests) and microarray
70. Proteomics
- Microarrays are useful primarily because mRNA concentrations serve as a surrogate for protein concentrations
- We'd like to measure protein concentrations directly, but at present cannot do so in the same high-throughput manner
- Proteins do not have obvious direct complements
  - Could build molecules that bind, but binding is greatly affected by protein structure
71. Time-of-Flight (TOF) Mass Spectrometry (thanks to Sean McIlwain)
- Measures the time for an ionized particle, starting from the sample plate, to hit the detector
[Diagram: laser, sample plate at potential V, detector]
72. Time-of-Flight (TOF) Mass Spectrometry 2
- Matrix-Assisted Laser Desorption-Ionization (MALDI)
- Crystalloid structures made using a proton-rich matrix molecule
- Hitting the crystalloid with the laser causes molecules to ionize and fly towards the detector
[Diagram: laser, sample plate at potential V, detector]
73. Time-of-Flight Demonstration 0: sample plate
74. Time-of-Flight Demonstration 1: matrix molecules
75. Time-of-Flight Demonstration 2: protein molecules
76. Time-of-Flight Demonstration 3: laser, detector, positive charge, 10KV
77. Time-of-Flight Demonstration 4: laser pulsed directly onto sample; a proton is kicked off a matrix molecule onto another molecule
78. Time-of-Flight Demonstration 5: lots of protons are kicked off matrix ions, giving rise to more positively charged molecules
79. Time-of-Flight Demonstration 6: the high positive potential under the sample plate causes positively charged molecules to accelerate towards the detector
80. Time-of-Flight Demonstration 7: smaller-mass molecules hit the detector first, while heavier ones are detected later
81. Time-of-Flight Demonstration 8: the incident time is measured from when the laser is pulsed until the molecule hits the detector
82. Time-of-Flight Demonstration 9: the experiment is repeated a number of times, counting frequencies of flight-times
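The physics behind the demonstration can be written down directly: accelerating a charge q through potential V gives kinetic energy qV = 1/2 mv^2, so flight time grows as the square root of m/z. A sketch with a hypothetical one-meter linear instrument (drift length and idealized geometry are assumptions for illustration):

```python
import math

E = 1.602176634e-19           # elementary charge (C)
DALTON = 1.66053906660e-27    # dalton in kg

def flight_time(m_da, z, voltage=10e3, length=1.0):
    """Idealized linear TOF: q*V = 1/2 m v^2  =>  t = L * sqrt(m / (2 q V)).

    m_da: ion mass in daltons; z: charge state; voltage in volts (10 kV,
    matching the demonstration); length: drift length in meters."""
    m_kg = m_da * DALTON
    v = math.sqrt(2 * z * E * voltage / m_kg)   # velocity after acceleration
    return length / v

# Heavier ions arrive later; flight time scales as sqrt(m/z).
t_light = flight_time(1000, 1)    # 1 kDa peptide
t_heavy = flight_time(4000, 1)    # 4 kDa peptide
print(t_heavy / t_light)          # ~2.0: sqrt of the 4x mass ratio
```

This square-root relationship is why the recorded flight-time histogram can be relabeled as an m/z axis in the spectra that follow.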
83. Example Spectra from a Competition by Lin et al. at Duke
These are different fractions from the same sample.
[Plots: intensity vs. M/Z]
84. Trypsin-Treated Spectra
[Plot: frequency vs. M/Z]
85. Many Challenges Raised by Mass Spectrometry Data
- Noise: extra peaks from handling of the sample, from the machine and environment (electrical noise), etc.
- M/Z values may not align exactly across spectra (resolution 0.1)
- Intensities are not calibrated across spectra; quantification is difficult
- Cannot get all proteins: typically only several hundred. To improve the odds of getting the ones we want, may fractionate the sample by 2D gel electrophoresis or liquid chromatography.
86. Challenges (Continued)
- Better results if we partially digest proteins (break them into smaller peptides) first
- Can be difficult to determine what proteins we have from a spectrum
- Isotopic peaks: C-13 and N-15 atoms in varying numbers cause multiple peaks for a single peptide
87. Handling Noise: Peak Picking
- Want to pick peaks that are statistically significant relative to the noise signal
- Want to use these as features in our learning algorithms
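A naive peak-picking sketch: call a point a peak if it is a local maximum and rises several robust standard deviations above a median baseline. This is a simplified stand-in for real pipelines, which also smooth, correct the baseline, and align spectra; the synthetic spectrum is invented for illustration:

```python
import numpy as np

def pick_peaks(intensity, n_sigma=5.0, window=5):
    """Return indices that are local maxima and exceed a noise threshold."""
    baseline = np.median(intensity)
    noise = np.median(np.abs(intensity - baseline)) * 1.4826  # robust sigma
    threshold = baseline + n_sigma * noise
    half = window // 2
    peaks = []
    for i in range(half, len(intensity) - half):
        seg = intensity[i - half:i + half + 1]
        if intensity[i] == seg.max() and intensity[i] > threshold:
            peaks.append(i)
    return peaks

# Synthetic spectrum: flat Gaussian noise plus two true peaks.
rng = np.random.default_rng(2)
spec = rng.normal(10.0, 1.0, 500)
spec[120] += 25.0
spec[340] += 40.0
print(pick_peaks(spec))   # indices of the two injected peaks
```

The picked indices (or their intensities) then become the feature vector fed to a learning algorithm, in place of the raw 500-point spectrum.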
88. Many Supervised Learning Tasks
- Learn to predict proteins from spectra, when the organism's proteome is known
- Learn to identify isotopic distributions
- Learn to predict disease from either proteins, peaks, or isotopic distributions as features
- Construct pathway models
89. Using Mass Spectrometry for Early Detection of Ovarian Cancer (Petricoin et al., 2002)
- Ovarian cancer is difficult to detect early, often leading to poor prognosis
- Trained and tested on mass spectra from blood serum
- 100 training cases, 50 with cancer
- Held-out test set of 116 cases, 50 with cancer
- 100% sensitivity, 95% specificity (63/66) on the held-out test set
90. Not So Fast
- Data mining methodology seems sound
- But Keith Baggerly argues that cancer samples were handled differently than normal samples, and perhaps the data were preprocessed differently too
- If we run cancer samples Monday and normals Wednesday, we could get differences from a machine breakdown or nearby electrical equipment that's running on Monday but not Wednesday
- Lesson: tell collaborators they must randomize samples for the entire processing phase, and of course all our preprocessing must be the same
- Debate is still raging; results not replicated in trials
91. Other Proteomics: 3D Structures
92. Other Proteomics: Interactions
Figure from Ideker et al., Science 292(5518):929-934, 2001
- Each node represents a gene product (protein)
- Blue edges show direct protein-protein interactions
- Yellow edges show interactions in which one protein binds to DNA and affects the expression of another
93. Protein-Protein Interactions
- Yeast 2-Hybrid
- Immunoprecipitation
  - Antibodies (immuno) are made by combinatorial combinations of certain proteins
  - Millions of antibodies can be made, to recognize a wide variety of different antigens (invaders), often by recognizing specific proteins
[Diagram: antibody binding a protein]
94. Protein-Protein Interactions
95. Immunoprecipitation
96. Co-Immunoprecipitation
[Diagrams: an antibody pulling down its target protein, alone or with interacting partners]
97. Many Supervised Learning Tasks
- Learn to predict protein-protein interactions; protein 3D structures may be critical
- Use protein-protein interactions in construction of pathway models
- Learn to predict protein function from interaction data
98. ChIP-Chip Data
- Immunoprecipitation can also be done to identify proteins interacting with DNA rather than with other proteins
- Chromatin immunoprecipitation (ChIP): grab a sample of the DNA bound to a particular protein (transcription factor)
- ChIP-chip: run this sample of DNA on a microarray to see which DNA was bound
- Example of analysis of such new data: Keles et al., 2006
99. Metabolomics
- Measures the concentration of each low-molecular-weight molecule in a sample
- These typically are metabolites: small molecules produced or consumed by reactions in biochemical pathways
- These reactions are typically catalyzed by proteins (specifically, enzymes)
- These data are typically also mass spectrometry, though they could also come from NMR
100. Lipomics
- Analogous to metabolomics, but measuring concentrations of lipids rather than metabolites
- Could potentially help induce biochemical pathway information, or help with disease diagnosis or treatment choice
101. To Design a Drug
- Identify target protein (knowledge of proteome/genome; relevant biochemical pathways)
- Determine target site structure (crystallography, NMR; difficult if membrane-bound)
- Synthesize a molecule that will bind (imperfect modeling of structure; structures may change at binding)
- And even then...
102. Molecule Binds Target, But May...
- Bind too tightly or not tightly enough
- Be toxic
- Have other effects (side-effects) in the body
- Break down as soon as it gets into the body, or not leave the body soon enough
- Not get to where it should in the body (e.g., crossing the blood-brain barrier)
- Not diffuse from gut to bloodstream
103. And Every Body is Different
- Even if a molecule works in the test tube and works in animal studies, it may not work in people (it will fail in clinical trials)
- A molecule may work for some people but not others
- A molecule may cause harmful side-effects in some people but not others
104. Typical Practice when Target Structure is Unknown
- High-Throughput Screening (HTS): test many molecules (~1,000,000) to find some that bind to the target (ligands)
- Infer (induce) the shape of the target site from 3D structural similarities
- A shared 3D substructure is called a pharmacophore
- Perfect example of a machine learning task with a spatial target
105. An Example of Structure Learning
[Diagram: inactive vs. active molecules]
106. Common Data Mining Approaches
- Represent a molecule by thousands to millions of features and use standard techniques (e.g., KDD Cup 2001)
- Represent each low-energy conformer by a feature vector and use multiple-instance learning (e.g., Jain et al., 1998)
- Relational learning
  - Inductive logic programming (e.g., Finn et al., 1998)
  - Graph mining
107. Supervised Learning Task
- Given: a set of molecules, each labeled by activity (binding affinity for the target protein), and a set of low-energy conformers for each molecule
- Do: learn a model that accurately predicts activity (may be Boolean or real-valued)
108. ILP as Illustration: The Logical Representation of a Pharmacophore
109. Background Knowledge I
- Information about atoms and bonds in the molecules:
  atm(m1,a1,o,3,5.915800,-2.441200,1.799700).
  atm(m1,a2,c,3,0.574700,-2.773300,0.337600).
  atm(m1,a3,s,3,0.408000,-3.511700,-1.314000).
  bond(m1,a1,a2,1).
  bond(m1,a2,a3,1).
110. Background Knowledge II
- Definition of distance equivalence:
  dist(Drug,Atom1,Atom2,Dist,Error) :-
      number(Error),
      coord(Drug,Atom1,X1,Y1,Z1),
      coord(Drug,Atom2,X2,Y2,Z2),
      euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1),
      Diff is Dist1-Dist,
      absolute_value(Diff,E1),
      E1 < Error.
  euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D) :-
      Dsq is (X1-X2)^2 + (Y1-Y2)^2 + (Z1-Z2)^2,
      D is sqrt(Dsq).
111. Central Idea: Generalize by Searching a Lattice
112. Conformational Model
- Conformational flexibility modelled as multiple conformations
- Sybyl randomsearch
- Catalyst
113. Pharmacophore Description
- Atom- and site-centred
  - Hydrogen bond donor
  - Hydrogen bond acceptor
  - Hydrophobe
  - Site points (limited at present)
  - User definable
- Distance based
114. Example 1: Dopamine Agonists
- Agonists taken from the Martin data set on the QSAR Society web pages
- Examples (5-50 conformations/molecule)
115. Pharmacophore Identified
Molecule A has the desired activity if:
- in conformation B, molecule A contains a hydrogen acceptor at C, and
- in conformation B, molecule A contains a basic nitrogen group at D, and
- the distance between C and D is 7.05966 ± 0.75 Angstroms, and
- in conformation B, molecule A contains a hydrogen acceptor at E, and
- the distance between C and E is 2.80871 ± 0.75 Angstroms, and
- the distance between D and E is 6.36846 ± 0.75 Angstroms, and
- in conformation B, molecule A contains a hydrophobic group at F, and
- the distance between C and F is 2.68136 ± 0.75 Angstroms, and
- the distance between D and F is 4.80399 ± 0.75 Angstroms, and
- the distance between E and F is 2.74602 ± 0.75 Angstroms.
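A learned pharmacophore like the one above is just a conjunction of pairwise distance constraints, so testing whether one conformation matches it is straightforward. A sketch: the constraint values come from the rule above, but the feature names and 3D coordinates are invented for illustration:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def matches_pharmacophore(points, constraints, tol=0.75):
    """points: feature name -> 3D coordinate in one conformation.
    constraints: (name1, name2, target distance in Angstroms).
    True iff every pairwise distance is within tol of its target."""
    return all(abs(dist(points[a], points[b]) - d) <= tol
               for a, b, d in constraints)

# Distance constraints from the dopamine-agonist pharmacophore above.
constraints = [
    ("acceptor1", "basic_N",    7.05966),
    ("acceptor1", "acceptor2",  2.80871),
    ("basic_N",   "acceptor2",  6.36846),
    ("acceptor1", "hydrophobe", 2.68136),
    ("basic_N",   "hydrophobe", 4.80399),
    ("acceptor2", "hydrophobe", 2.74602),
]

# Hypothetical coordinates for one conformation (made up for illustration).
conf = {
    "acceptor1":  (0.0, 0.0, 0.0),
    "basic_N":    (7.0, 0.0, 0.0),
    "acceptor2":  (1.0, 2.6, 0.0),
    "hydrophobe": (2.6, 0.5, 0.0),
}
print(matches_pharmacophore(conf, constraints))   # True
```

A molecule is then active under the hypothesis if any of its low-energy conformations satisfies every constraint, mirroring the existentially quantified conformation B in the logical rule.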
116. Example II: ACE Inhibitors
- 28 angiotensin-converting enzyme inhibitors taken from the literature
- D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 3-16 (1987)
117. ACE Pharmacophore
Molecule A is an ACE inhibitor if:
- molecule A contains a zinc-site B,
- molecule A contains a hydrogen acceptor C,
- the distance between B and C is 7.899 ± 0.750 Å,
- molecule A contains a hydrogen acceptor D,
- the distance between B and D is 8.475 ± 0.750 Å,
- the distance between C and D is 2.133 ± 0.750 Å,
- molecule A contains a hydrogen acceptor E,
- the distance between B and E is 4.891 ± 0.750 Å,
- the distance between C and E is 3.114 ± 0.750 Å,
- the distance between D and E is 3.753 ± 0.750 Å.
118. Pharmacophore Discovered
[Diagram: zinc site and H-bond acceptors]
119. Additional Finding
- Original pharmacophore rediscovered, plus one other
  - different zinc ligand position
  - similar to an alternative proposed by Ciba-Geigy
120. Example III: Thermolysin Inhibitors
- 10 inhibitors for which crystallographic data are available in the PDB
- Conformationally challenging molecules
- Experimentally observed superposition
121. Key Binding Site Interactions
[Diagram: inhibitor contacts with Asn112-NH, O=C Asn112, O=C Ala113, Arg203-NH, the S1 and S2 pockets, and Zn]
122. Interactions Made by Inhibitors
123. Pharmacophore Identification
- Structures considered: 1HYT, 1THL, 1TLP, 1TMN, 2TMN, 4TLN, 4TMN, 5TLN, 5TMN, 6TMN
- Conformational analysis using Best conformer generation in Catalyst
- 98-251 conformations/molecule
124. Thermolysin Results
- 10 5-point pharmacophores identified, falling into 2 groups (7/10 molecules)
  - 3 acceptors, 1 hydrophobe, 1 donor
  - 4 acceptors, 1 donor
- Common core of Zn ligands, Arg203 and Asn112 interactions identified
- Correct assignments of functional groups
- Correct geometry to 1 Angstrom tolerance
125. Thermolysin Results (continued)
- Increasing tolerance to 1.5 Angstroms finds a common 6-point pharmacophore including one extra interaction
126. Example IV: Antibacterial Peptides (Spatola et al., 2000)
- Dataset of 11 pentapeptides showing activity against Pseudomonas aeruginosa
- 6 actives (< 64 mg/ml IC50)
- 5 inactives
127. Pharmacophore Identified
A molecule M is active against Pseudomonas aeruginosa if it has a conformation B such that:
- M has a hydrophobic group C,
- M has a hydrogen acceptor D,
- the distance between C and D in conformation B is 11.7 Angstroms,
- M has a positively-charged atom E,
- the distance between C and E in conformation B is 4 Angstroms,
- the distance between D and E in conformation B is 9.4 Angstroms,
- M has a positively-charged atom F,
- the distance between C and F in conformation B is 11.1 Angstroms,
- the distance between D and F in conformation B is 12.6 Angstroms,
- the distance between E and F in conformation B is 8.7 Angstroms.
Tolerance: 1.5 Angstroms
129. Clinical Databases of the Future (Dramatically Simplified)

Visits:
  PatientID  Date    Physician  Symptoms      Diagnosis
  P1         1/1/01  Smith      palpitations  hypoglycemic
  P1         2/1/03  Jones      fever, aches  influenza

Demographics:
  PatientID  Gender  Birthdate
  P1         M       3/22/63

Lab tests:
  PatientID  Date    Lab Test       Result
  P1         1/1/01  blood glucose  42
  P1         1/9/01  blood glucose  45

SNPs:
  PatientID  SNP1  SNP2  ...  SNP500K
  P1         AA    AB    ...  BB
  P2         AB    BB    ...  AA

Prescriptions:
  PatientID  Date Prescribed  Date Filled  Physician  Medication  Dose  Duration
  P1         5/17/98          5/18/98      Jones      prilosec    10mg  3 months
130. Final Wrap-up
- Molecular biology is collecting lots and lots of data in the post-genome era
- Opportunity to connect molecular-level information to diseases and treatment
- Need analysis tools to interpret it
- Data mining opportunities abound
- Hopefully this tutorial provided a solid start toward applying data mining to high-throughput biological data
131. Thanks To
- Jude Shavlik
- John Shaughnessy
- Bart Barlogie
- Mark Craven
- Sean McIlwain
- Jan Struyf
- Arno Spatola
- Paul Finn
- Beth Burnside
- Michael Molla
- Michael Waddell
- Irene Ong
- Jesse Davis
- Soumya Ray
- Jo Hardin
- John Crowley
- Fenghuang Zhan
- Eric Lantz
132. If Time Permits: Some of My Group's Directions in the Area
- Clinical data (with Jesse Davis, Beth Burnside, M.D.)
- Addressing another problem with current approaches to biological network learning (with Soumya Ray, Eric Lantz)
133. Using Machine Learning with Clinical Histories: Example
- We'll use the example of mammography to show some issues that arise
- These issues arise here even with just one relational table
- These issues are even more pronounced with data in multiple tables
134. Supervised Learning Task
- Given: a database of mammogram abnormalities for different patients. Radiologist-entered values describing the abnormality constitute the features, and the abnormality's biopsy result as benign or malignant constitutes the class.
- Do: learn a model that accurately predicts class based on features.
135. Mammography Database (Davis et al., 2005; Burnside et al., 2005)
136. Original Expert Structure
137. Level 1: Parameters
Given: features (node labels, or fields in the database), data, Bayes net structure.
Learn: probabilities. Note: the probabilities needed are Pr(Be/Mal), Pr(Shape | Be/Mal), Pr(Size | Be/Mal).
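With the structure fixed, Level 1 learning reduces to counting. A minimal sketch with invented abnormality records (real CPD estimation would also smooth the counts):

```python
from collections import Counter

# Hypothetical abnormality records: (shape, size, label).
records = [
    ("round", "small", "benign"), ("round", "small", "benign"),
    ("irregular", "large", "malignant"), ("irregular", "small", "malignant"),
    ("round", "large", "benign"), ("irregular", "large", "malignant"),
]

# Structure is fixed: Shape and Size each depend only on the class, so
# we only need Pr(class) and Pr(feature | class), estimated by counting.
label_counts = Counter(label for _, _, label in records)
n = len(records)
prior = {c: k / n for c, k in label_counts.items()}   # Pr(Be/Mal)

def cpd(feature_index):
    """Maximum-likelihood Pr(feature value | class) from counts."""
    counts = Counter((rec[feature_index], rec[2]) for rec in records)
    return {(v, c): k / label_counts[c] for (v, c), k in counts.items()}

p_shape = cpd(0)   # Pr(Shape | class)
p_size = cpd(1)    # Pr(Size | class)
print(prior["benign"])                  # 0.5
print(p_shape[("round", "benign")])     # 1.0
```

Moving to Level 2 changes only which conditional tables are counted, e.g. Pr(Size | Shape, class) once Shape becomes a parent of Size.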
138. Level 2: Structure
Given: features, data.
Learn: Bayes net structure and probabilities. Note: with this structure (Be/Mal pointing to Shape, and both Shape and Be/Mal pointing to Size), we will now need Pr(Size | Shape, Be/Mal) instead of Pr(Size | Be/Mal).
139. Mammography Database
140. Mammography Database
141. Mammography Database
142. Level 3: Aggregates
Given: features, data, background knowledge (aggregation functions such as average, mode, max, etc.).
Learn: useful aggregate features, a Bayes net structure that uses these features, and probabilities. New features may use other rows/tables.
[Example network: an "Avg size this date" node added alongside Be/Mal, Shape, and Size]
143. Mammography Database
144. Mammography Database
145. Mammography Database
146. Level 4: View Learning
Given: features, data, background knowledge (aggregation functions and intensionally-defined relations such as "increase" or "same location").
Learn: useful new features defined by views (equivalent to rules or SQL queries), Bayes net structure, and probabilities.
[Example network: "Shape change in abnormality at this location" and "Increase in average size of abnormalities" nodes added alongside "Avg size this date", Be/Mal, Shape, and Size]
147. Example of Learned Rule
is_malignant(A) IF
    'BIRADS_category'(A,b5),
    'MassPAO'(A,present),
    'MassesDensity'(A,high),
    'HO_BreastCA'(A,hxDCorLC),
    in_same_mammogram(A,B),
    'Calc_Pleomorphic'(B,notPresent),
    'Calc_Punctate'(B,notPresent).
148. ROC: Level 2 (TAN) vs. Level 1
149. Precision-Recall Curves
151. SAYU-View
- Improved view learning approach
- SAYU: Score As You Use
- For each candidate rule, add it to the Bayesian network and see if it improves the network's score
- Only add a rule (a new field for the view) if it improves the Bayes net
154. Clinical Databases of the Future (Dramatically Simplified)
[Repeat of slide 129: the visit, demographic, lab-test, SNP, and prescription tables]
155. Another Problem with Current Learning of Regulatory Models
- Current techniques all use a greedy heuristic
- Bayes net learning algorithms use the sparse candidate approach: to be considered as a parent of gene 1, another gene 2 must be correlated with gene 1
- CPDs are often represented as trees; use greedy tree learning algorithms
- All can fall prey to functions such as exclusive-or; do these arise?
156. Skewing Example (Page & Ray, 2003; Ray & Page, 2004; Rosell et al., 2005; Ray & Page, 2005)
Drosophila survival based on gender and Sxl gene activity
157. Hard Functions
- Our definition: those functions for which no attribute has gain according to standard purity measures (GINI, entropy)
- Note: "hard" does not refer to the size of the representation
- Example: n-variable odd parity
- Many others
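Odd parity is "hard" in exactly this sense: on the complete truth table (the uniform distribution), splitting on any single attribute yields zero information gain. A small check:

```python
from itertools import product
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def info_gain(examples, labels, attr):
    """Entropy reduction from splitting on one attribute."""
    n = len(labels)
    gain = entropy(labels)
    for value in {ex[attr] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# 3-variable odd parity over the complete truth table.
examples = list(product([0, 1], repeat=3))
labels = [x1 ^ x2 ^ x3 for x1, x2, x3 in examples]

# Splitting on any one variable leaves both subsets with a 50/50 class
# mix, so every attribute, relevant or not, has zero gain.
print([round(info_gain(examples, labels, a), 10) for a in range(3)])
# → [0.0, 0.0, 0.0]
```

Greedy learners that score attributes one at a time therefore have no signal to start from on such functions, which motivates the lookahead and skewing ideas that follow.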
158. Learning Hard Functions
- Standard method of learning hard functions (e.g., with decision trees): depth-k lookahead
- O(m·n^(2^(k+1)-1)) for m examples in n variables
- We devise a technique that allows learning algorithms to efficiently learn hard functions
159. Key Idea
- Hard functions are not hard for all data distributions
- We can skew the input distribution to simulate a different one, by randomly choosing preferred values for attributes
- Accumulate evidence over several skews to select a split attribute
160. Example: Uniform Distribution
161. Example: Skewed Distribution (Sequential Skewing, Ray & Page, ICML 2004)