Title: Extracting information from microarray data
1Extracting information from microarray data
- Jaak Vilo
- European Bioinformatics Institute EMBL-EBI
- http://www.ebi.ac.uk/microarray/
- http://ep.ebi.ac.uk
EMBO course Dec. 6th, 2000
2Microarray Experiment
RT-PCR
DNA Chip
High glucose
RT-PCR
LASER
Low glucose
3Gene expression data
Treated sample labeled red (Cy5)
Control sample labeled green (Cy3)
Competitive hybridization onto chip
Red dot - gene overexpressed in treated sample
Green dot - gene underexpressed in treated sample
Yellow - equally expressed
Intensity - absolute level
red/green - ratio of expression: 2 - 2x overexpressed, 0.5 - 2x underexpressed
log2( red/green ) - log ratio: 1 - 2x overexpressed, -1 - 2x underexpressed
Spotted cDNA microarray, Stanford University (Yeast, 1997)
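The ratio and log-ratio conventions on this slide can be illustrated with a minimal sketch. The function name `log_ratio` and the intensity values are made up for illustration and do not come from the slides.

```python
import math

def log_ratio(red: float, green: float) -> float:
    """Return log2(red/green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical (red, green) intensities for three spots.
spots = [(2000.0, 1000.0), (500.0, 1000.0), (1000.0, 1000.0)]
for red, green in spots:
    print(f"ratio={red / green:.2f}  log2 ratio={log_ratio(red, green):+.2f}")
```

The log ratio is symmetric around 0, so 2x over- and underexpression get the same magnitude with opposite signs.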
4Expression data matrix
5Hierarchical clustering: Visualization
Pseudo-coloring of expression data matrix.
Method developed by Mike Eisen, Stanford University
6Components of Expression Profiler http://www.ebi.ac.uk/microarray/
External data, tools: pathways, function, etc.
Expression data
EPCLUST (cluster Expression profiles)
GENOMES: sequence, function, annotation
URLMAP: provide links
SPEXS (Sequence Pattern Exhaustive Search): novel patterns
PATMATCH: known patterns
8EPCLUST
- Cluster-analysis
- Hierarchical and K-means clustering
- Many distance measures
- Nearest neighbours
- Visualisation - many options
- Also clusters short sequences...
9Clustering methods
- Hierarchical clustering
- complete linkage
- average linkage
- single linkage
- Distance measures
- Euclidean
- Correlation based
- Rank correlation
- Manhattan
- ...
- Partition-based
- K-means
- Specify K
- Randomly select centers
- Assign genes to centers
- Recalculate centers to gravity center
- Iterate until stabilizes
- Can get to local minimum
- Fast for large datasets
- Initial selection of centers
Plus many, many more ...
10Cluster matrices
Hierarchical clustering
Cluster sequences
11K-means clustering
New centers - center of gravity for a cluster
Cluster - objects closest to a center
Start clustering by choosing K centers: randomly, most distant centers, more...
Iterate clustering step until no cluster changes
Deterministic, might get stuck in local minimum
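The K-means loop described on this slide can be sketched as follows. This is a minimal illustration, not EPCLUST code. The function name `kmeans`, the optional `initial_centers` argument (standing in for seeding with selected genes) and the toy data are all assumptions.

```python
import random

def kmeans(points, k, initial_centers=None, max_iter=100, seed=0):
    """Plain K-means with squared Euclidean distance; returns (assignment, centers)."""
    if initial_centers is None:
        # Random start: pick K distinct objects as the initial centers.
        initial_centers = random.Random(seed).sample(points, k)
    centers = [list(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Cluster = objects closest to a center.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no cluster changed: stabilized (possibly in a local minimum)
        assignment = new_assignment
        # New center = center of gravity of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centers = kmeans(points, 2, initial_centers=[points[0], points[3]])
print(assignment, centers)
```

Once the centers are fixed the procedure is deterministic, so different starting centers are the only way to escape a poor local minimum.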
12Data selection and analysis folder
13Hierarchical clustering output
Cut
GENOMES Yeast
Zoom
14 6200 genes, 80 exp.
Monitor size 1600x1200
Laptop 800x600
15 6200 genes, 80 exp.
Monitor size 1600x1200
Laptop 800x600
COLLAPSE
75 subtrees
Developed and implemented in Expression Profiler
in October 2000 by Jaak Vilo
16Running times of hierarchical clustering (in seconds, on an 850 MHz Intel, 1 GB RAM, SCSI Linux server)
17K-means clustering output
URLMAP
18More features
- Upload your own data
- Find most similar genes
- Find most distant genes
- Seed selected genes as K-means centers
- Cluster (short) sequences
- In future direct upload from databases
19Data upload from files
Matrix separator
Gene annotations: id, annotation
Annotations can contain <A HREF>links</A>
Sequence data
(Other types?)
20Some standard distance measures
Euclidean distance
Euclidean squared
Manhattan (city-block)
Average distance
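The formulas for these measures did not survive in the transcript, so the following is a sketch of the usual textbook definitions. In particular, taking "average distance" to be the square root of the mean squared difference is an assumption, and the function names are illustrative only.

```python
import math

def euclidean_squared(x, y):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(euclidean_squared(x, y))

def manhattan(x, y):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def average_distance(x, y):
    """Assumed definition: Euclidean distance on the mean squared difference."""
    return math.sqrt(euclidean_squared(x, y) / len(x))

x, y = [0.0, 0.0], [3.0, 4.0]
print(euclidean(x, y), euclidean_squared(x, y), manhattan(x, y), average_distance(x, y))
```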
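21Pearson correlation (centered)
$r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$
If means of each column are 0, then it becomes
$r(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$
This can be used directly; then it is uncentered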
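A minimal sketch of the centered and uncentered variants, written so that the centered correlation is literally the uncentered one applied to mean-subtracted vectors. The function names are illustrative.

```python
import math

def uncentered_correlation(x, y):
    """Cosine of the angle between x and y (no mean subtraction)."""
    denom = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if denom == 0:
        raise ValueError("correlation is undefined for an all-zero vector")
    return sum(a * b for a, b in zip(x, y)) / denom

def pearson_correlation(x, y):
    """Centered correlation: subtract each vector's mean, then correlate."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return uncentered_correlation([a - mean_x for a in x],
                                  [b - mean_y for b in y])

x, y = [1.0, 2.0, 3.0], [3.0, 2.0, 1.0]
print(pearson_correlation(x, y), uncentered_correlation(x, y))
```

The two differ whenever the means are not 0: the profiles above are perfectly anti-correlated after centering but still point in a similar direction as raw vectors.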
22Chord distance
Euclidean distance between two vectors whose length has been normalized to 1
[Figure: vectors f and g drawn in the x-y plane]
Legendre & Legendre, Numerical Ecology, 2nd ed.
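The definition on this slide translates directly into code. This is a minimal sketch, and the function name `chord_distance` is illustrative.

```python
import math

def chord_distance(x, y):
    """Euclidean distance between x and y after scaling both to length 1."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return math.sqrt(sum((a / norm_x - b / norm_y) ** 2 for a, b in zip(x, y)))

print(chord_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(chord_distance([1.0, 1.0], [2.0, 2.0]))   # same direction, different length
```

Because of the normalization, the distance depends only on the angle between the profiles, not on their magnitudes.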
23Rank correlation
Rank - smallest has rank 1, next 2, etc. Equal values have rank that is the average of their ranks
f    3 17 12  12  8
rank 1  5 3.5 3.5 2
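The ranking rule, including the averaging of tied ranks, can be sketched as below. Rank correlation is taken here to be the Pearson correlation of the two rank vectors, which is the usual Spearman definition. The function names are illustrative.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # sorted positions are 0-based, ranks 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

def rank_correlation(x, y):
    """Pearson correlation computed on the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    if den == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return num / den

print(average_ranks([3, 17, 12, 12, 8]))  # [1.0, 5.0, 3.5, 3.5, 2.0]
```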
24URLMAP - no need to cut & paste
KEGG
25GENOMES
- Started from the need to get upstream sequences
- Overview, annotation, function, links to other databases
- Extraction of upstream sequences relative to start codons of genes for sets of genes
- Querying capabilities
- I wish major databases offered all that...
26Automatic TF-binding site identification
- How to identify putative transcription factor binding sites on a genomic scale?
- Traditional approach (case by case basis)
- High-throughput methods (data mining)
27Organization of a typical yeast promoter
[Figure labels: URS, URS, TATA, I, RNA, Coding Region; distances 40 - 120 bp, 20 - 700 bp, 40 - 60 bp]
28Traditional approach for TF-binding site prediction
- Extract bits of DNA where binding occurs
- Carefully handpick coregulated genes (from literature, experiments, functional class)
- Identify conserved motifs in their upstreams
- Build complex models, use expensive search techniques (Gibbs, EM, alignments)
- Not easily scaleable for full genomes?
- Very good domain knowledge required
29High-throughput method: From expression data to regulatory signals
- Cluster the genes based on expression measurements for identifying potentially co-regulated sets of genes
- Extract upstream sequences of these genes
- Search for patterns over-represented in these clusters
- Assess the quality of findings using some statistical criteria
Brazma, Jonassen, Vilo, Ukkonen: Genome Research 8:1205-1215, 1998
Vilo, Brazma, Jonassen, Robinson, Ukkonen: ISMB-2000, AAAI Press (August 2000)
30Cluster of co-expressed genes, pattern discovery
in regulatory regions
600 basepairs
Expression profiles
Retrieve
Upstream regions
Pattern over-represented in cluster
31Problem of noise
- Gene expression measurement accuracy (about within a factor of 2 in 95% of cases)
- Clustering result depends on the choice of method and parameters used in each
- Does co-expression mean co-regulation?
32What questions we asked
- How to perform systematic discovery?
- How to assess the quality of the predictions?
- Do better clusters give better signals?
- Is method scaleable for larger genomes?
- Want to discover something unique for each cluster, not just features common to upstreams
33Cluster and pattern strengths
- Cluster strength: average silhouette value. How well each object is classified into its own cluster. Use average distances within cluster and compare them to next closest cluster for each object. Value between -1 .. 1 (well classified)
- Pattern strength: binomial distribution. Given probability of tails on a coin, how probable is it to observe k or more tails out of n trials. Number of tails = nr. of pattern occurrences.
34Silhouette value (Rousseeuw 1987)
$a_i$ - average distance to members in same cluster
$b_i$ - average distance to members in closest cluster
$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$
Assign goodness to each clustered object
Average silhouette over each cluster or over the clustering
Not a silver bullet
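The silhouette value can be computed directly from its definition. This is a minimal sketch using Euclidean distance. The function name, the toy data and the convention of returning 0 for singleton clusters are assumptions.

```python
import math

def silhouette_values(points, labels):
    """Return s_i = (b_i - a_i) / max(a_i, b_i) for every clustered object."""
    values = []
    for i, p in enumerate(points):
        # Distances from object i to all other objects, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            values.append(0.0)  # assumed convention: singleton or only one cluster
            continue
        a = sum(own) / len(own)                                 # same cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())   # closest other cluster
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
s = silhouette_values(points, labels)
print(s, sum(s) / len(s))  # per-object values and the average over the clustering
```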
35Computational experiment: clustering
- Yeast Saccharomyces cerevisiae, 6221 genes, 80 expression conditions for each (from P. Brown's lab)
- No single best clustering method: K-means, vary K = 2..1000, repeat 10x for each K. Total 1000 x K-means
- Calculated average Silhouette values for all clusters
- Select all unique clusters of size 20..100 (about 52,000)
- Could combine several methods, several distance measures
36Clustering
Hierarchical
Partitioning based: K-means clustering
Many more: SOM, fuzzy c-means, graph-based, ...
37Computational experiment: pattern discovery
- Upstream sequences of length 600bp from ORF start
- Analyze all upstreams of all 52K clusters with SPEXS looking for substrings only (one weekend, 10 PCs)
- Extract all patterns from upstreams of all clusters with probability less than 1 (binomial distribution, background probability is calculated simultaneously from all 6221 upstreams)
38Pattern selection criteria: Binomial distribution
Background - ALL upstream sequences: pattern occurs in 5 out of 25, p = 0.2
Cluster: pattern occurs 3 times in 6 sequences
P(pattern, 6) is the probability of having 3 or more matches in 6 sequences
P(pattern, 6) = 0.0989
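The number on this slide can be reproduced with a few lines. This is a sketch of the binomial tail probability; the function name `binomial_tail` is illustrative.

```python
from math import comb

def binomial_tail(k: int, n: int, p: float) -> float:
    """Probability of k or more successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Background: the pattern matches 5 of 25 upstream sequences.
background_p = 5 / 25
# Cluster: 3 or more matches among 6 sequences.
print(round(binomial_tail(3, 6, background_p), 4))  # 0.0989
```

The smaller this probability, the less likely it is that the pattern is as frequent in the cluster as it is by chance alone.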
40Pattern vs cluster strength
The pattern probability vs. the average
silhouette for the cluster
The same for randomised clusters
41The most improbable pattern from best clusters
42One example
Pattern GGTGGCAA was the best for this cluster (visualized here after hierarchical clustering within the cluster)
25 out of 40 ORFs belong to cytoplasmic degradation functional class (MIPS), mostly being proteasome subunits
43GGTGGCAA cluster (proteasome)
44GGTGGCAA is a binding site for RPN4
FEBS Lett 1999 Apr 30;450(1-2):27-34
Rpn4p acts as a transcription factor by binding to PACE, a nonamer box found upstream of 26S proteasomal and other genes in yeast.
Mannhaupt G, Schnall R, Karpov V, Vetter I, Feldmann H
Adolf-Butenandt-Institut der Ludwig-Maximilians-Universität München, Germany.
We identified a new, unique upstream activating sequence (5'-GGTGGCAAA-3') in the promoters of 26 out of the 32 proteasomal yeast genes characterized to date, which we propose to call proteasome-associated control element. By using the one-hybrid method, we show that the factor binding to the proteasome-associated control element is Rpn4p, a protein containing a C2H2-type finger motif and two acidic domains. ...
45GGTGGCAA - proteasome associated control element
YOR261C  YOR261C  RPN8   protein degradation             26S proteasome regulatory subunit    S0005787  1
YDL020C  YDL020C  RPN4   protein degradation, ubiquitin  26S proteasome subunit               S0002178  1
YDL007W  YDL007W  RPT2   protein degradation             26S proteasome subunit               S0002165  1
YDL147W  YDL147W  RPN5   protein degradation             26S proteasome subunit               S0002306  1
YOL038W  YOL038W  PRE6   protein degradation             20S proteasome subunit (alpha4)      S0005398  1
YKL145W  YKL145W  RPT1   protein degradation, ubiquitin  26S proteasome subunit               S0001628  1
YDL097C  YDL097C  RPN6   protein degradation             26S proteasome regulatory subunit    S0002255  1
YDR394W  YDR394W  RPT3   protein degradation             26S proteasome subunit               S0002802  1
YBR173C  YBR173C  UMP1   protein degradation, ubiquitin  20S proteasome maturation factor     S0000377  1
YER012W  YER012W  PRE1   protein degradation             20S proteasome subunit C11(beta4)    S0000814  1
YPR108W  YPR108W  RPN7   protein degradation             26S proteasome regulatory subunit    S0006312  1
YOR117W  YOR117W  RPT5   protein degradation             26S proteasome regulatory subunit    S0005643  1
YJL001W  YJL001W  PRE3   protein degradation             20S proteasome subunit (beta1)       S0003538  1
YPR103W  YPR103W  PRE2   protein degradation             20S proteasome subunit (beta5)       S0006307  1
YOR157C  YOR157C  PUP1   protein degradation             20S proteasome subunit (beta2)       S0005683  1
YGL048C  YGL048C  RPT6   protein degradation             26S proteasome regulatory subunit    S0003016  1
YHR200W  YHR200W  RPN10  protein degradation             26S proteasome subunit               S0001243  1
YML092C  YML092C  PRE8   protein degradation             20S proteasome subunit Y7 (alpha2)   S0004557  1
YIL075C  YIL075C  RPN2   tRNA processing                 26S proteasome subunit               S0001337  1
YMR314W  YMR314W  PRE5   protein degradation             20S proteasome subunit(alpha6)       S0004931  1
YGR253C  YGR253C  PUP2   protein degradation             20S proteasome subunit(alpha5)       S0003485  1
YGR135W  YGR135W  PRE9   protein degradation             20S proteasome subunit Y13 (alpha3)  S0003367  1
YFR004W  YFR004W  RPN11  transcription                   putative global regulator            S0001900  1
YOR259C  YOR259C  RPT4   protein degradation             26S proteasome regulatory subunit    S0005785  1
YFR052W  YFR052W  RPN12  protein degradation             26S proteasome regulatory subunit    S0001948  1
YFR050C  YFR050C  PRE4   protein degradation             proteasome subunit, B type           S0001946  1
YGL011C  YGL011C  SCL1   protein degradation             20S proteasome subunit YC7ALPHA/Y8   S0002979  1
YDR427W  YDR427W  RPN9   protein degradation             26S proteasome regulatory subunit    S0002835  1
YOR362C  YOR362C  PRE10  protein degradation             20S proteasome subunit C1 (alpha7)   S0005889  1
YBL041W  YBL041W  PRE7   protein degradation             20S proteasome subunit               S0000137  1
YER021W  YER021W  RPN3   protein degradation             26S proteasome regulatory subunit    S0000823  1
YER094C  YER094C  PUP3   protein degradation             20S proteasome subunit (beta3)       S0000896  1
YGR270W  YGR270W  YTA7   protein degradation             26S proteasome subunit ATPase        S0003502  1
YHR027C  YHR027C  RPN1   protein degradation             26S proteasome regulatory subunit    S0001069  1
YER047C  YER047C  SAP1   mating type switching           AAA family protein                   S0000849  1
YGR232W  YGR232W         unknown                         unknown                              S0003464  1
46SPEXS - Sequence Pattern EXhaustive Search
Jaak Vilo, 1998
- User-definable pattern language: substrings (oligos), group positions/wildcards, flexible wildcards (PROSITE)
- Fast exhaustive search over pattern language
- Lazy suffix tree construction-like algorithm
- Analyze multiple sets of sequences simultaneously
- Restrict search to most frequent patterns only (in each set)
- Report most frequent patterns, patterns over- or underrepresented in selected subsets, or patterns significant by various statistical criteria, e.g. by binomial distribution
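47Lazy construction of trie
- ATACATA
- 12345678
- O(n²)
- Kurtz, Giegerich
- Good in practice
[Figure: trie over ATACATA with position lists. First level: A - 1,3,5,7; T - 2,6; C - 4; end of sequence - 8. Deeper level: edge labels A, C, T with lists 8; 3,7; 2,6; 4]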
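The position lists in this figure can be reproduced with a small sketch of lazy trie expansion: a node stores the end positions of its pattern's occurrences, and its children are computed only when the node is expanded. This illustrates the idea only and is not the SPEXS implementation. The function names are assumptions, and the end-of-sequence marker (position 8 on the slide) is omitted.

```python
def extend(text, positions):
    """Split a node's occurrences by the next character.

    `positions` holds 1-based end positions of the node's pattern (0 for the root);
    the result maps each next character to the end positions of the longer pattern.
    """
    children = {}
    for pos in positions:
        if pos < len(text):  # an occurrence ending at the last character has no child
            children.setdefault(text[pos], []).append(pos + 1)
    return children

def frequent_substrings(text, min_count):
    """Expand only nodes with at least min_count occurrences (lazy construction)."""
    found = {}
    stack = [("", list(range(len(text))))]  # root: empty pattern before every position
    while stack:
        pattern, positions = stack.pop()
        for char, child_positions in extend(text, positions).items():
            if len(child_positions) >= min_count:
                found[pattern + char] = child_positions
                stack.append((pattern + char, child_positions))
    return found

text = "ATACATA"
print(extend(text, range(len(text))))  # first level: A, T and C with their positions
print(frequent_substrings(text, 2))
```

Restricting expansion to frequent nodes is what keeps the search practical even though the full trie can have O(n²) nodes.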
48SPEXS pattern discovery based on pattern trie. Vilo 1998
- Substrings
- Group characters
- Wildcard positions
- Restrictions on nr of each separately
- At least k occurrences
- Exact occurrences of each pattern
[Figure: pattern trie over ATACATA (positions 12345678); edge labels A, C, T and the group CT; node position lists 1,3,5,7 / 2,6 / 4 / 2,4,6 / 3,5,7]
49Results continued...
- Totally over 6000 interesting patterns
- Many from homologous upstreams => Remove
- 1498 most interesting substrings left
- How to summarize?
- Clustered these based on mutual similarity
- Found alignments, consensi, and profiles
- Of 62 clusters 48 had patterns matching SCPD
(experimentally mapped) binding site database
50Cluster and align patterns
Words (CLUSTER_0009):
TGACAGC TGACAGCT GTGACAGC TTGTGACAG AGTGACA ACAGTGACA CAGTGACA
AGTGACAT AGTGACATT GTGACATT GTGACAT TGTGACA TTGTGACA GTGACAG
TGTGACAG GTGACAC CAGACAG TGACCCT TGACCGA TTGACCG TTTGACC
aligned as
ALIGNMENT based on pattern TGAC
----TGACAGC-
----TGACAGCT
---GTGACAGC-
-TTGTGACAG--
--AGTGACA---
ACAGTGACA---
-CAGTGACA---
--AGTGACAT--
--AGTGACATT-
---GTGACATT-
---GTGACAT--
--TGTGACA---
-TTGTGACA---
---GTGACAG--
--TGTGACAG--
---GTGACAC--
----TGACCCT-
----TGACCGA-
---TTGACCG--
--TTTGACC---
NO ALIGNMENT FOR NO CAGACAG
PROFILE >/tmp/.12100.profile
Profile for sequences in file /tmp/.12100 based on pattern TGAC
-  19 16 10  4  0  0  0  0  0  6 13 19
A   1  0  5  0  0  0 20  0 16  0  1  0
C   0  2  0  0  0  0  0 20  4  2  3  0
G   0  0  0 14  0 20  0  0  0  8  0  0
T   0  2  5  2 20  0  0  0  0  4  3  1
CONSENSUS (not real one) >/tmp/.12100.consensus
Profile for sequences in file /tmp/.12100 based on pattern TGAC
CONSENSUS_PAT /tmp/.12100
Consensus pattern actatGTGACAGTCctat
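The profile on this slide is a per-column count of the characters in the aligned words. The sketch below recomputes it from the 20 aligned rows, with the row boundaries assumed to be every 12 characters of the alignment text. The function name `column_profile` is illustrative and this is not the tool that produced the slide.

```python
def column_profile(rows, alphabet="-ACGT"):
    """Count how often each symbol occurs in every column of an alignment."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all alignment rows must have the same length")
    return {sym: [sum(row[col] == sym for row in rows) for col in range(width)]
            for sym in alphabet}

# Aligned words from the slide (based on pattern TGAC), 12 columns each.
alignment = [
    "----TGACAGC-", "----TGACAGCT", "---GTGACAGC-", "-TTGTGACAG--",
    "--AGTGACA---", "ACAGTGACA---", "-CAGTGACA---", "--AGTGACAT--",
    "--AGTGACATT-", "---GTGACATT-", "---GTGACAT--", "--TGTGACA---",
    "-TTGTGACA---", "---GTGACAG--", "--TGTGACAG--", "---GTGACAC--",
    "----TGACCCT-", "----TGACCGA-", "---TTGACCG--", "--TTTGACC---",
]

profile = column_profile(alignment)
for symbol, counts in profile.items():
    print(symbol, " ".join(f"{c:2d}" for c in counts))
```

The four columns where a single base has count 20 are the anchoring pattern TGAC itself.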
51Align a group of patterns
ALIGNMENT based on pattern AGT
-----AGTGACA
---ACAGTGACA
----CAGTGACA
----CAGTGAC-
---ACAGTGAC-
---GCAGTGA--
---GCAGTG---
--AGCAGTGA--
-TTACAGTG---
TTTACAGTGA--
-TTACAGTGA--
--TACAGTG---
---ACAGTGA--
--TACAGTGA--
TTTACAGT----
TTTACAGTG---
PROFILE >/tmp/.5679.profile
Profile for sequences in file /tmp/.5679 based on pattern AGT
-  13 11  8  3  1  0  0  0  1  5 11 13
A   0  0  1 10  0 16  0  0  0 11  0  3
C   0  0  0  0 15  0  0  0  0  0  5  0
G   0  0  0  3  0  0 16  0 15  0  0  0
T   3  5  7  0  0  0  0 16  0  0  0  0
CONSENSUS (not real one) >/tmp/.5679.consensus
Profile for sequences in file /tmp/.5679 based on pattern AGT
CONSENSUS_PAT /tmp/.5679 tttAGCAGTGAca
Words:
AGTGACA ACAGTGACA CAGTGACA CAGTGAC ACAGTGAC GCAGTGA GCAGTG AGCAGTGA
TTACAGTG TTTACAGTGA TTACAGTGA TACAGTG ACAGTGA TACAGTGA TTTACAGT TTTACAGTG
52
-----------------ACCCAGACATCGGGCTTCCAC-
---------------ACACCCAGACATC-----------
---------------ACACCCAGACATC-----------
---------------GAACCCATACACT-----------
---------------ACACCCAGACCGCG----------
---------------GCACCCACACATTT----------
------------GCTAAACCCATGCACAGTGACT-----
-----------------ACCCAGACACGCTCGA------
-------------CTTCACCCTCATAC------------
---------------ACACCCCTTTTCT-----------
---------------GCACCCAGTCTT------------
---------------GCACCCAAACACCTGCATATTTGG
---------------GCACCCAATCACC-----------
---------------ACACCCAGACCTC-----------
---------------AAACCCACACAT------------
--------------TGCACCCATACCTT-----------
--------------AACACCCAAGCACAG----------
-ATCTCTCGCAACG-------------------------
ACCTCCGTACATTC-------------------------
ACACCTGGACACC--------------------------
ACATCCGTACAACGAGAACCCATACATTA----------

---TCCGTAC---    ACCCATAC--
-CATCCGTAC---    ACCCATACA-
--ATCCGTA----    ACCCATACAT
--ATCCGTACA--    -CCCATAC--
----CCGTAC---    -CCCATACA-
----CCGTACA--    --CCATACAT
---TCCGTACAT-    -CCCATACAT
--ATCCGTACAT-    --CCATACA-
---TCCGTACA--    -AACATAC--
----CCGTACAT-    --ACATACT-
---TCCGTA----    ---GATACT-
--ATCCGTAC---    --AGATACT-
----CCGTACC--
---ACCGTACC--
---ACCGTAC---
--CACCGTAC---
----CCGTACATT
----GCGTAG---
----GCGTAGG--
-----CGTAGG--
-CATCCGTA----
ACATCCGT-----
-CATCCGT-----
In SCPD
Discovered automatically
53
atatCTAGGCACTCaca  taGCGCAGTgacc  cgGTGGCAAACag  tcagaGACGGCTGGTActatttt
acatAGCAGGGGTctgaca  gcgagatgaacGATGAGACtagatg  aTGGATGCc  taTGCATGAAc
aTGGCGTATAc  gcTAGTATATATCgatagtggg  gaggaagAGTAGATGATGagtgaag  tagAGTAGATAAgaaaaa
gtaTAAATAGAGCtgct  ataagTGATGCCCGacacga  aCCTCAATATtgt  ATCCAAGAg
aaacaAAACAAAATcaacaata  tgtGTAAATCATtt  ataaaagtCAGTAAAAGAcggacaaaag  tgtTCGAAAGAGTt
attactgtaagAAAATTTTtgtcattt  gaatacgCAGGAAAGTgtgaa  ttccatATTCTTCGAACTgat  cggctctggctctgCTTTTTCTGTCatctgcc
tTTTTCTGCTTAc  ttagtagtcTGTCTATGGTCaatct  taaatATTTTGTGtaca  tacgCTGTGCTaac
tctcaTCTCATCCTtagcatc  aaTCTTCATGt  cctcGAACGTGCCATCtca  ataCGCCTAATAat
cgTACCTCTa  accacCCCCCTCGTaga  gTTACATCCTCGg  gACAGCTAc
actatGTGACAGTCctat  tttcACAGTGTATtcg  atATCTACACAt  tttGTCACAGATgg
tgcACATTGCCTtg  ataTCTGGTTCt  acaTCCGTACacgtt  agcaatcTAAGCGTAGtgaa
tATTACGTTAAgc  tctatAGAAGTATTAc  gtTAGTTACTTGAGca  ACTTTATTT
agTAACTTATCa  tACTCGCTTAAT  gaacagatacgAGCGCGctagatcagc  acatGTACGCcaa
GGTCGCTAc  tgtTAACGAATCGTTtaa  gaatTCCGTTTAagg  ttaCGAATAAGaaaa
aACAGATGAATCttc  tactcatCGACTCAcacgaa  tcCACGAAgctag  cgactgACGTACGATatctat
aCCACATACATt
54Summary about identified patterns
SCPD
Of 1498 patterns:
- 315 patterns match sites
- 73 patterns are matched by some site
- 1134 patterns do not match any site nor are matched by any
Of 109 factors with total 799 sites:
- 85 factors are matched by some of the patterns
- 19 factors match some patterns
- 24 factors do not have matches nor are matched
Of 498 unique sites of total 799 sites:
- 238 sites are matched by some of the patterns
- 21 sites match some patterns
- 252 sites do not have matches nor are matched
TRANSFAC
Of 1498 patterns:
- 297 patterns match sites
- 61 patterns are matched by some site
- 1174 patterns do not match any site nor are matched by any
Of 351 DB-entries with total 359 sites:
- 205 factors are matched by some of the patterns
- 22 factors match some patterns
- 134 factors do not have matches nor are matched
Of 334 unique sites of total 359 sites:
- 198 sites are matched by some of the patterns
- 16 sites match some patterns
- 127 sites do not have matches nor are matched
55PATMATCH
- Match your patterns against sequences
- Sequences - extracted from GENOMES
- Visualise matches along the sequence
- Visualise pattern by pattern if sequence has a match
- Order sequences according to hierarchical clustering order from EPCLUST
- Show clustering and upstream next to each other
56GGTGGCAA - proteasome associated control element
57PATMATCH - combine pattern matching with
expression data
58Global and Local Data Mining: Secondary Data Mining
- Find global structure by clustering
- Find local structure by pattern discovery
- Summarize the findings to the size feasible for humans to interpret
- Find most interesting rules
- Find explanations for the behavior
- Hypothesis generation for wet-lab
59Gene networks
promoter1 gene1
promoter2 gene2
promoter3 gene3
promoter4 gene4
DNA
transcription
transcription factors
RNA
translation
proteins
60A gene network
61Lac-Operon
Thomas Schlitt
62Thanks
- Alvis Brazma, EBI
- Inge Jonassen, University of Bergen
- Esko Ukkonen, University of Helsinki
- Thomas Schlitt, EBI
- Alan Robinson, EBI
- Lawrence Bower, EBI
63Why microarray
- Systems biology era
- Study the complexity of a biological system as a whole, e.g. all genes simultaneously
- Increases throughput
- Better general models
- SNP, antibody, protein arrays
- Proteomics ...
64Proteomics
- Proteome - complete set of PROteins encoded by the genOME
- Capture proteins, the end products
- Post-translational modifications, untranslated genes
- What is the correspondence to mRNA levels?
- Proteomics - 2D protein gels, MassSpec are expensive
65Some promises of MA data
- Hypothesis about the function of genes
- Tissue, or cancer biopsies (what is normal?)
- Cell lines (e.g. insert viruses)
- Animal model systems for diseases
- Toxicology and drug testing
- Molecular classification of diseases
(class discovery and class prediction)
66Using binomial distribution
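- The probability of a pattern occurring in exactly k out of n sequences is $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$, where p is the background probability of the pattern occurring in a sequence
- The probability of a pattern to occur k or more times in a set of n sequences (in a cluster) is $P(X \ge k) = \sum_{i=k}^{n} \binom{n}{i} p^i (1 - p)^{n - i}$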