Title: Mining the Genome
1Mining the Genome
- Filip Železný
- CVUT FEL, Prague
- Dept. of Cybernetics
- Gerstner Laboratory
2Intro
- Research at CVUT FEL Dept. of Cybernetics
- Nature Inspired Technologies
- machine learning
- evolutionary computation
- Agent Computing
- Robotics
- Computer Vision
- EU Projects (6 FP)
- 14 running in 2005, 9 new starting 2006
3Machine Learning basics
4Machine Learning Data Mining
- Supervised learning
- given examples and their class labels
- find a model for predicting class labels of new examples
- also called concept learning, predictive classification, ...
- Example
- Given
- Discover
IF size = small AND luxury = low THEN affordable
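The supervised learning setting above can be illustrated with a minimal sketch. It assumes a tiny made-up training set and uses a OneR-style learner (one attribute, majority label per value), which is a stand-in chosen for brevity, not the method used in the talk. The example data and the names `learn_one_rule` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training examples: (attribute values, class label).
examples = [
    ({"size": "small", "luxury": "low"}, "affordable"),
    ({"size": "small", "luxury": "high"}, "expensive"),
    ({"size": "large", "luxury": "low"}, "affordable"),
    ({"size": "large", "luxury": "high"}, "expensive"),
]

def learn_one_rule(examples):
    """Pick the attribute whose value -> majority-label table makes the fewest training errors."""
    best = None
    for attr in examples[0][0]:
        counts = defaultdict(Counter)
        for attrs, label in examples:
            counts[attrs[attr]][label] += 1
        # For each attribute value, predict the most frequent label.
        table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(label != table[attrs[attr]] for attrs, label in examples)
        if best is None or errors < best[0]:
            best = (errors, attr, table)
    return best[1], best[2]

attr, table = learn_one_rule(examples)

def predict(attrs):
    """Predict the class label of a new example with the learned model."""
    return table[attrs[attr]]

print(attr, predict({"size": "small", "luxury": "low"}))
```

On this toy data the learner selects `luxury`, because that attribute alone separates the two classes without error.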
5Machine Learning
Plethora of paradigms
Decision trees
Support Vector Machines
Artificial Neural Networks
Symbolic
Subsymbolic
Statistical
Learning = optimization in structure / parameter space
Learning = search: AI techniques employed (gradient descent, heuristic search)
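As a hedged illustration of learning as optimization in parameter space, the sketch below fits a line to a few points by gradient descent on the mean squared error. The data, the learning rate and the function name `fit_line` are assumptions made for the example.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1.
w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(round(w, 3), round(b, 3))
```

The parameter space here is just the pair (w, b); structure search (e.g. growing a decision tree) explores a discrete space instead and typically relies on heuristic search.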
6Relational Learning
What if examples have a structure?
Not an attribute tuple! The description is spread across multiple tables of a relational database.
7Relational Learning
- Relational learning
- Representing data and rules in relational logic (Prolog)
- Exploits background knowledge (e.g. charge)
- Inductive Logic Programming
carcinogenic(Compound) IF
    has_atom(Compound, Atom1) AND type(Atom1, carbon) AND
    charge(Atom1, Charge) AND Charge > 0.0133 AND
    has_atom(Compound, Atom2) AND double_bond(Atom1, Atom2)
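To show how such a rule is checked against data spread over several relations, here is a minimal sketch that evaluates the carcinogenicity rule above over small in-memory tables. All compounds, atoms and charges are made up for illustration; a real ILP system would evaluate the rule in Prolog rather than Python.

```python
# Hypothetical relational data: each relation is a separate "table".
has_atom = {("c1", "a1"), ("c1", "a2"), ("c2", "a3"), ("c2", "a4")}
atom_type = {"a1": "carbon", "a2": "oxygen", "a3": "carbon", "a4": "hydrogen"}
charge = {"a1": 0.02, "a2": -0.3, "a3": 0.001, "a4": 0.05}
double_bond = {("a1", "a2")}

def carcinogenic(compound):
    """True if some carbon atom of the compound has charge > 0.0133
    and is double-bonded to another atom of the same compound."""
    atoms = {a for c, a in has_atom if c == compound}
    return any(
        atom_type[a1] == "carbon"
        and charge[a1] > 0.0133
        and (a1, a2) in double_bond
        for a1 in atoms
        for a2 in atoms
    )

print(carcinogenic("c1"), carcinogenic("c2"))
```

The existential variables of the rule (the two atoms) become the nested loop: the rule covers a compound as soon as one binding satisfies every literal.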
8Applications of Interest
- Intersection of 3 hot fields
BIO technologies (genomics)
INFORMATION technologies (machine learning)
NANO technologies (microarray chips)
9A quick intro into computational genomics
10Background GENETICS
How does a cell know what to do?
11Chromosomes
Chromosomes get copied during mitosis. They carry the assembly instructions? How?
Chromosomes = proteins + DNA: where is the information?
12DNA
1953: Jim Watson and Francis Crick discover the DNA structure. That is where the information is.
4-symbol alphabet: Guanine, Adenine, Cytosine, Thymine
Double-helix pairing: C-G, A-T
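The pairing rule above means that one strand fully determines the other. A minimal sketch, with hypothetical helper names `complement` and `reverse_complement`:

```python
# Watson-Crick base pairing: C pairs with G, A pairs with T.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand):
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]

print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```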
13The CENTRAL DOGMA of Molecular Biology
- Gene = DNA subsequence
- Genes code for proteins
- Gene expression
- A piece of DNA is transcribed into RNA
- RNA is translated into a protein
- Proteins do the job
- - enzymes
- - building blocks
- - ...
14Protein Coding
[Figure labels: DNA strand, codon (3 bases), amino acid, protein]
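The codon-to-amino-acid mapping can be sketched in a few lines. The table below is deliberately partial (a handful of entries from the standard genetic code), the input sequence is made up, and the function names are hypothetical.

```python
# Partial codon table (standard genetic code); None marks a stop codon.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala", "UGG": "Trp",
    "UAA": None, "UAG": None, "UGA": None,
}

def transcribe(coding_dna):
    """Transcription, simplified: the mRNA equals the coding strand with T replaced by U."""
    return coding_dna.replace("T", "U")

def translate(mrna):
    """Read the mRNA codon by codon (3 bases each) until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]  # KeyError for codons missing from the partial table
        if amino_acid is None:
            break
        protein.append(amino_acid)
    return protein

print(translate(transcribe("ATGTTTGGCTAA")))  # ['Met', 'Phe', 'Gly']
```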
15Protein structures
resolution
16Secondary structure prediction
Two common secondary structures
β-sheet
α-helix
Primary structure determines secondary structure.
Computational problem: given the primary structure, predict whether it forms a β-sheet or an α-helix. NOBODY CAN DO THAT!
17Secondary structure prediction
- Secondary structure prediction with ILP
- Muggleton 1992
- Using ILP, obtained rules such as
alpha0(A,B) IF ... position(A,D,O) AND not_aromatic(O) AND small_or_polar(O) AND
    position(A,B,C) AND very_hydrophobic(C) AND not_aromatic(C) ... etc. (22 literals)
- Note the incorporation of background knowledge
- Accuracy 81%, best at the time
- Published in the journal Protein Engineering
18Sequencing the Human Genome
19The Genome project
- 1993-2003
- All human genes sequenced
- Celera vs. NIH race
- Challenge NOW
- annotate the genes
- discover functions
- interactions
- dynamic pathways
20Genomics research
- Traditional functional genomics research
- Hypothesis-driven
- e.g. a gene is suspected to be responsible for ...
- then tracing its expression in relevant tissues
- First hypothesize, then measure
21Gene Expression Microarrays
- Microarray chip
- Measures expression of tens of thousands of genes simultaneously (high-throughput)
- Pioneering technology (mid to late 90s)
- A grid carrying synthesized DNA probes
- Breakthrough in genomics research?
photo scan
22Genomics Research
- High-throughput approach to functional genomics
- Data-driven, unbiased: first measure, then hypothesize
- Might reveal never-thought-of relationships
Microarray data → Human analysis → Hypotheses: IMPOSSIBLE (TOO MUCH DATA)
Expression of almost the entire genome (tens of thousands of genes)
23Genomics Research through Machine Learning
- AI-based high-throughput functional genomics?
High-throughput screening
High-performance computing
Microarray data
Machine Learning
Hypotheses
Interpretation
24Genomics Research with AI
- This concept has recently been proven to work
- Golub et al., Science 286:531-537, 1999
- leukemia classification model (AML vs. ALL)
- voting of informative attributes (genes)
- Discovery of new classes (clustering)
- Ramaswamy et al., PNAS 98:15149-54, 2001
- Tumor classification
- 14 classes of cancer
- used Support Vector Machines
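The idea of "voting of informative attributes (genes)" can be sketched as follows. This is a simplified illustration loosely modelled on signal-to-noise weighted voting, not a reproduction of the published procedure: each gene gets a weight from the separation of its class means, and then votes according to how far a new sample lies from the midpoint of those means. Gene names and expression values are made up.

```python
from statistics import mean, stdev

# Hypothetical training data: gene -> expression values per class.
train = {
    "gene_a": {"AML": [5.0, 6.0, 5.5], "ALL": [1.0, 1.5, 2.0]},
    "gene_b": {"AML": [2.0, 2.5, 1.5], "ALL": [7.0, 6.0, 6.5]},
}

def gene_weights(train):
    """Per gene: (signal-to-noise weight, midpoint between the class means)."""
    params = {}
    for gene, by_class in train.items():
        m_aml, m_all = mean(by_class["AML"]), mean(by_class["ALL"])
        s_aml, s_all = stdev(by_class["AML"]), stdev(by_class["ALL"])
        params[gene] = ((m_aml - m_all) / (s_aml + s_all), (m_aml + m_all) / 2)
    return params

def classify(sample, params):
    """Sum the weighted votes; a positive total favours AML, otherwise ALL."""
    total = sum(w * (sample[g] - mid) for g, (w, mid) in params.items())
    return "AML" if total > 0 else "ALL"

params = gene_weights(train)
print(classify({"gene_a": 5.2, "gene_b": 2.1}, params))
print(classify({"gene_a": 1.2, "gene_b": 6.8}, params))
```

In practice only the most informative genes would be allowed to vote; gene selection is omitted here to keep the sketch short.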
25Interpretable classifiers
- Comprehensibility pursuit: rule-based models
- Models interpretable by biologists
- Our work
- D. Gamberger, N. Lavrac, F. Železný, J. Tolar: Journal of Biomedical Informatics 37(5):269-284, 2004
IF gene_20056 EXPRESSED AND gene_23984 NOT_EXPRESSED THEN cancer_class = AML
Class
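Applying a rule of this form to a sample is straightforward once expression levels are discretized. The sketch below assumes a single made-up cutoff (`THRESHOLD`) for calling a gene EXPRESSED; how the levels are actually discretized is not stated on the slide.

```python
# Hypothetical cutoff: a gene counts as EXPRESSED when its level exceeds this value.
THRESHOLD = 100.0

def discretize(levels, threshold=THRESHOLD):
    """Map raw expression levels to EXPRESSED (True) / NOT_EXPRESSED (False)."""
    return {gene: level > threshold for gene, level in levels.items()}

def classify(levels):
    """Apply the rule: gene_20056 EXPRESSED and gene_23984 NOT_EXPRESSED => AML."""
    expressed = discretize(levels)
    if expressed["gene_20056"] and not expressed["gene_23984"]:
        return "AML"
    return None  # the rule does not fire, so it makes no prediction

print(classify({"gene_20056": 250.0, "gene_23984": 12.0}))
print(classify({"gene_20056": 250.0, "gene_23984": 300.0}))
```

A biologist can read the condition directly, which is the point of preferring rules over opaque models here.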
26Exploiting Background knowledge
- Tons of genomic background knowledge available
- Relational learning would allow us to exploit it!
27Relational Genomic Data Mining
- Our current work
- Combining expression data with gene annotation data
Rule Based Model
28Relational Genomic Data Mining
- Example rule algorithmically discovered
- ... open end, no conclusions
expressed_in_all(Gene) IF
    has_location(Gene, integral_to_membrane) AND
    has_function(Gene, receptor_activity)
Expression of genes coding for proteins located
in the integral to membrane cell component, whose
functions include receptor activity, has a high
correlation with the BCR class of acute
lymphoblastic leukemia (ALL) and a low
correlation with other classes of ALL.
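A rule like the one above joins annotation relations with expression data. The sketch below computes which genes the rule body covers and what fraction of them carry the target property. The gene identifiers, annotations and the set of target genes are made up; the function names are hypothetical.

```python
# Hypothetical gene annotation relations (background knowledge).
has_location = {("g1", "integral_to_membrane"), ("g2", "integral_to_membrane"), ("g3", "nucleus")}
has_function = {("g1", "receptor_activity"), ("g2", "kinase_activity"), ("g3", "receptor_activity")}

# Hypothetical outcome of the expression analysis: genes with the target property.
target_genes = {"g1", "g3"}

def covered(genes):
    """Genes satisfying both literals in the rule body."""
    return {
        g for g in genes
        if (g, "integral_to_membrane") in has_location
        and (g, "receptor_activity") in has_function
    }

def precision(genes):
    """Fraction of covered genes that have the target property (0.0 if none are covered)."""
    cov = covered(genes)
    return len(cov & target_genes) / len(cov) if cov else 0.0

genes = {"g1", "g2", "g3"}
print(covered(genes), precision(genes))
```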