Extracting information from microarray data

Slides: 67

Provided by: juham

1
Extracting information from microarray data

Jaak Vilo
European Bioinformatics Institute EMBL-EBI
http://www.ebi.ac.uk/microarray/
http://ep.ebi.ac.uk

EMBO course Dec. 6th, 2000
2
Microarray Experiment
RT-PCR
DNA Chip
High glucose
RT-PCR
LASER
Low glucose
3
Gene expression data
Treated sample labeled red (Cy5)
Control labeled green (Cy3)
Competitive hybridization onto chip
Red dot - gene overexpressed in treated sample
Green dot - gene underexpressed in treated sample
Yellow - equally expressed
Intensity - 'absolute' level
red/green - ratio of expression
  2 - 2x overexpressed
  0.5 - 2x underexpressed
log2( red/green ) - log ratio
  1 - 2x overexpressed
  -1 - 2x underexpressed
Spotted cDNA microarray, Stanford University (Yeast, 1997)
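The log ratio on this slide can be illustrated with a minimal Python sketch. The function name `log_ratio` and the intensity values are made up for illustration; they are not part of Expression Profiler.

```python
import math

def log_ratio(red, green):
    """Return log2(red/green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical spot intensities (treated = red, control = green)
print(log_ratio(2000.0, 1000.0))  # ratio 2   -> 2x overexpressed
print(log_ratio(500.0, 1000.0))   # ratio 0.5 -> 2x underexpressed
print(log_ratio(1000.0, 1000.0))  # ratio 1   -> equally expressed
```

The log scale makes over- and underexpression symmetric around 0, which the raw ratio (2 versus 0.5) is not.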
4
Expression data matrix
5
Hierarchical clustering Visualization
Pseudo-coloring of expression data matrix. Method
developed by Mike Eisen, Stanford University
6
Components of Expression Profiler
http://www.ebi.ac.uk/microarray/
External data, tools: pathways, function, etc.
Expression data
EPCLUST (cluster Expression profiles)
GENOMES: sequence, function, annotation
URLMAP: provide links
SPEXS (Sequence Pattern Exhaustive Search): novel patterns
PATMATCH: known patterns
7
(No Transcript)
8
EPCLUST

Cluster-analysis
Hierarchical and K-means clustering
Many distance measures
Nearest neighbours
Visualisation - many options
Also clusters short sequences...

9
Clustering methods

Hierarchical clustering
complete linkage
average linkage
single linkage
Distance measures
Euclidean
Correlation based
Rank correlation
Manhattan
...

Partition-based
K-means
Specify K
Randomly select centers
Assign genes to centers
Recalculate centers to gravity center
Iterate until stabilizes
Can get to local minimum
Fast for large datasets
Initial selection of centers

Plus many, many more ...
10
Cluster matrices
Hierarchical clustering
Cluster sequences
11
K-means clustering
Start clustering by choosing K centers: randomly, most distant centers, more...
Cluster - objects closest to a center
New centers - center of gravity for a cluster
Iterate clustering step until no cluster changes
Deterministic once the initial centers are chosen; might get stuck in a local minimum
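The K-means steps above can be sketched in plain Python. This is an illustrative sketch, not the EPCLUST implementation: the function name `kmeans`, the squared Euclidean distance, the random choice of initial centers and the `seed` parameter are all assumptions.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (sequences of numbers) into k groups.

    Returns (assignment, centers); assignment[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    # Start by choosing K of the objects as initial centers, at random
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Cluster = objects closest to a center (squared Euclidean distance)
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no cluster changed: converged, possibly to a local minimum
        assignment = new_assignment
        # New center = center of gravity of the cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
print(labels)
```

Because the result depends on the initial centers, rerunning with different seeds and comparing the clusterings is a common way to cope with local minima.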
12
Data selection and analysis folder
13
Hierarchical clustering output
Cut
GENOMES Yeast
Zoom
14
6200 genes, 80 exp.
Monitor size 1600x1200
Laptop 800x600
15
6200 genes, 80 exp.
Monitor size 1600x1200
Laptop 800x600
COLLAPSE
75 subtrees
Developed and implemented in Expression Profiler
in October 2000 by Jaak Vilo
16
Running times of hierarchical clustering
(In seconds, on an 850 MHz Intel, 1 GB RAM, SCSI Linux server)

17
K-means clustering output
URLMAP
18
More features

Upload your own data
Find most similar genes
Find most distant genes
Seed selected genes as K-means centers
Cluster (short) sequences
In future: direct upload from databases

19
Data upload from files
Matrix separator
Gene annotations: id, annotation
Annotations can contain <A HREF>links</A>
Sequence data
(Other types?)
20
Some standard distance measures
Euclidean distance
Euclidean squared
Manhattan (city-block)
Average distance
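The formulas on this slide did not survive in the transcript. As a sketch, the first three measures in their standard textbook form look as follows in Python, for two expression profiles f and g of equal length. The "average distance" is left out because its exact definition on the slide is not recoverable; the function names are illustrative.

```python
import math

def euclidean_squared(f, g):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(f, g))

def euclidean(f, g):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(euclidean_squared(f, g))

def manhattan(f, g):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(f, g))

f = [0.0, 0.0]
g = [3.0, 4.0]
print(euclidean(f, g), euclidean_squared(f, g), manhattan(f, g))
```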
21
Pearson correlation (centered)
If the means of both vectors are 0, it becomes the normalized dot product of the two vectors
This can be used directly on the data; then it is the uncentered correlation
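The slide formulas were lost in the transcript; the sketch below uses the standard definitions. The uncentered correlation is the dot product divided by the vector lengths, and the centered (Pearson) correlation is the same quantity after subtracting each vector's mean. Function names are illustrative, and vectors with zero length (or zero variance, for the centered form) are not handled.

```python
import math

def uncentered_correlation(f, g):
    """Dot product of f and g divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(f, g))
    norm_f = math.sqrt(sum(a * a for a in f))
    norm_g = math.sqrt(sum(b * b for b in g))
    return dot / (norm_f * norm_g)

def pearson_correlation(f, g):
    """Centered correlation: subtract the means, then use the uncentered form."""
    mean_f = sum(f) / len(f)
    mean_g = sum(g) / len(g)
    return uncentered_correlation([a - mean_f for a in f],
                                  [b - mean_g for b in g])

f = [1.0, 2.0, 3.0]
g = [3.0, 2.0, 1.0]
print(pearson_correlation(f, g))     # opposite trends
print(uncentered_correlation(f, g))  # still positive: all values are positive
```

A correlation r is usually turned into a distance for clustering, for example as 1 - r.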
22
Chord distance
Euclidean distance between two vectors whose
length has been normalized to 1
(Figure: vectors f and g in the x-y plane)
Legendre & Legendre, Numerical Ecology, 2nd ed.
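A minimal Python sketch of the chord distance as described on this slide: scale both vectors to length 1, then take the Euclidean distance. The function name is illustrative and zero vectors are not handled.

```python
import math

def chord_distance(f, g):
    """Euclidean distance between f and g after scaling both to length 1."""
    norm_f = math.sqrt(sum(a * a for a in f))
    norm_g = math.sqrt(sum(b * b for b in g))
    return math.sqrt(sum((a / norm_f - b / norm_g) ** 2 for a, b in zip(f, g)))

print(chord_distance([1.0, 2.0], [2.0, 4.0]))  # same direction
print(chord_distance([1.0, 0.0], [0.0, 5.0]))  # orthogonal vectors
```

Since both vectors have length 1, the squared chord distance equals 2(1 - cos), where cos is the cosine of the angle between f and g, so it ranges from 0 to 2.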
23
Rank correlation
Rank - smallest has rank 1, next 2, etc.
Equal values have rank that is the average of the ranks
f    = 3 17 12 12 8
rank = 1 5 3.5 3.5 2
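The ranking rule on this slide, including the averaging of tied ranks, can be sketched as below. The function name `average_ranks` is illustrative. Rank correlation is then the Pearson correlation computed on the ranks instead of the raw values.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the run order[i..j] of equal values
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks

print(average_ranks([3, 17, 12, 12, 8]))  # [1.0, 5.0, 3.5, 3.5, 2.0]
```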
24
URLMAP - no need to cut and paste
KEGG
25
GENOMES

Started from the need to get upstream sequences
Overview, annotation, function, links to other
databases
Extraction of upstream sequences, relative to
start codons of genes, for sets of genes
Querying capabilities
I wish major databases offered all that...
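Extracting an upstream region relative to a start codon can be sketched as follows. This is a hypothetical helper, not the GENOMES code: the function name, the 0-based coordinate convention and the strand handling are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def upstream(chromosome, start, strand, length=600):
    """Return up to `length` bases upstream of a start codon.

    chromosome: forward-strand sequence.
    start: 0-based forward-strand index of the first base of the start codon
           (for '-' strand genes, the forward base paired with that first base).
    strand: '+' or '-'.
    """
    if strand == "+":
        return chromosome[max(0, start - length):start]
    # For '-' strand genes the upstream region lies to the right on the
    # forward strand and must be read as its reverse complement.
    return reverse_complement(chromosome[start + 1:start + 1 + length])

print(upstream("AAACCCATGGGG", 6, "+", length=3))  # CCC
print(upstream("TTTCATGGGAAA", 5, "-", length=3))  # CCC
```

Regions are truncated at the chromosome ends rather than padded, so short results are possible for genes close to an end.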

26
Automatic TF-binding site identification

How to identify putative transcription factor
binding sites on a genomic scale?
Traditional approach (case by case basis)
High-throughput methods (data mining)

27
Organization of a typical yeast promoter
(Figure: promoter layout with two URS elements, TATA, I, the RNA start and the coding region; spacings labeled 40 - 120 bp, 20 - 700 bp and 40 - 60 bp)
28
Traditional approach for TF-binding site
prediction

Extract bits of DNA where binding occurs
Carefully handpick coregulated genes (from
literature, experiments, functional class)
Identify conserved motifs in their upstreams
Build complex models, use expensive search
techniques (Gibbs, EM, alignments)
Not easily scalable for full genomes?
Very good domain knowledge required

29
High-throughput method: from expression data to
regulatory signals

Cluster the genes based on expression
measurements for identifying potentially
co-regulated sets of genes
Extract upstream sequences to these genes
Search for patterns over-represented in these
clusters
Assess the quality of findings using some
statistical criteria

Brazma, Jonassen, Vilo, Ukkonen. Genome Research 8:1205-1215, 1998
Vilo, Brazma, Jonassen, Robinson, Ukkonen. ISMB-2000, AAAI Press (August 2000)
30
Cluster of co-expressed genes, pattern discovery
in regulatory regions
600 basepairs
Expression profiles
Retrieve
Upstream regions
Pattern over-represented in cluster
31
Problem of noise

Gene expression measurement accuracy (within about
a factor of 2 in 95% of cases)
Clustering result depends on the choice of method
and parameters used in each
Does co-expression mean co-regulation?

32
What questions we asked

How to perform systematic discovery?
How to assess the quality of the predictions?
Do better clusters give better signals?
Is the method scalable for larger genomes?
Want to discover something unique for each
cluster, not just features common to upstreams

33
Cluster and pattern strengths

Cluster strength: average silhouette value
How well each object is classified into its own
cluster. Use average distances within cluster and
compare them to next closest cluster for each
object. Value between -1 .. 1 (1 = well classified)
Pattern strength: binomial distribution
Given probability of tails on coin, how probable is it
to observe k or more tails out of n trials.
Number of tails = nr. of pattern occurrences.

34
Silhouette value (Rousseeuw 1987)
ai - average distance to members in same cluster
bi - average distance to members in closest cluster
Si = ( bi - ai ) / Max( bi, ai )
Assign goodness to each clustered object
Average silhouette over each cluster or over the clustering
Not a silver bullet
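The silhouette value can be sketched in Python as below, using a for the average distance to the object's own cluster and b for the average distance to the closest other cluster. The function name and the 1-D example data are illustrative. Assigning 0 to an object that is alone in its cluster follows Rousseeuw's convention; returning 0 when there is only one cluster is an assumption of this sketch.

```python
def silhouette_values(points, labels, dist):
    """Return the silhouette value of every object.

    points: list of objects; labels: cluster label per object;
    dist: function giving the distance between two objects.
    """
    values = []
    for i, p in enumerate(points):
        # Collect distances from object i to all other objects, per cluster
        per_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                per_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = per_cluster.pop(labels[i], None)
        if not own or not per_cluster:
            values.append(0.0)  # singleton cluster, or no other cluster exists
            continue
        a = sum(own) / len(own)                                # same cluster
        b = min(sum(d) / len(d) for d in per_cluster.values())  # closest cluster
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_values(points, [0, 0, 1, 1], lambda x, y: abs(x - y))
bad = silhouette_values(points, [0, 1, 0, 1], lambda x, y: abs(x - y))
print(sum(good) / len(good), sum(bad) / len(bad))
```

Averaging the values over one cluster gives the strength of that cluster; averaging over all objects gives a score for the whole clustering.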
35
Computational experiment: clustering

Yeast Saccharomyces cerevisiae, 6221 genes, 80
expression conditions for each (from P. Brown's
lab)
No single best clustering method: K-means, vary
K in 2..1000, repeat 10x for each K. Total 1000
x K-means
Calculated average silhouette values for all
clusters
Select all unique clusters of size 20..100
(about 52,000)
Could combine several methods, several distance
measures

36
Clustering
Hierarchical
Partitioning-based: K-means clustering
Many more: SOM, fuzzy c-means, graph-based, ...
37
Computational experiment: pattern discovery

Upstream sequences of length 600bp from ORF start
Analyze all upstreams of all 52K clusters with
SPEXS looking for substrings only (one weekend,
10 PC-s)
Extract all patterns from upstreams of all
clusters with probability less than 1
(binomial distribution, background probability is
calculated simultaneously from all 6221
upstreams)

38
Pattern selection criteria: binomial distribution
Background - ALL upstream sequences
5 out of 25, p = 0.2
Cluster
π occurs 3 times
P(π,6) is probability of having 3 or more matches in 6 sequences
P(π,6) = 0.0989
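The number on this slide can be reproduced with a short Python sketch of the binomial tail probability. The function name `binomial_tail` is illustrative; the values are the ones from the slide (background frequency 5 out of 25, so p = 0.2, and 3 or more matches in a cluster of 6 sequences).

```python
from math import comb

def binomial_tail(k, n, p):
    """Probability of k or more successes in n independent trials with success probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_background = 5 / 25  # pattern occurs in 5 of the 25 background sequences
print(round(binomial_tail(3, 6, p_background), 4))  # 0.0989
```

The smaller this probability, the less likely it is that the pattern is over-represented in the cluster by chance alone.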
39
(No Transcript)
40
Pattern vs cluster strength
The pattern probability vs. the average
silhouette for the cluster
The same for randomised clusters
41
The most improbable patterns from the best clusters
42
One example
Pattern GGTGGCAA was the best for this
cluster (visualized here after hierarchical
clustering within the cluster). 25 out of 40 ORFs
belong to the cytoplasmic degradation functional
class (MIPS), mostly being proteasome subunits
43
GGTGGCAA cluster (proteasome)
44
GGTGGCAA is a binding site for RPN4
FEBS Lett 1999 Apr 30;450(1-2):27-34. Rpn4p acts
as a transcription factor by binding to PACE, a
nonamer box found upstream of 26S proteasomal and
other genes in yeast. Mannhaupt G, Schnall R,
Karpov V, Vetter I, Feldmann H. Adolf-Butenandt-Institut
der Ludwig-Maximilians-Universität München,
Germany. We identified a new, unique upstream
activating sequence (5'-GGTGGCAAA-3') in the
promoters of 26 out of the 32 proteasomal yeast
genes characterized to date, which we propose to
call proteasome-associated control element. By
using the one-hybrid method, we show that the
factor binding to the proteasome-associated
control element is Rpn4p, a protein containing a
C2H2-type finger motif and two acidic domains.
...
45
GGTGGCAA - proteasome associated control element
YOR261C | YOR261C | RPN8 | protein degradation | 26S proteasome regulatory subunit | S0005787 | 1
YDL020C | YDL020C | RPN4 | protein degradation, ubiquitin | 26S proteasome subunit | S0002178 | 1
YDL007W | YDL007W | RPT2 | protein degradation | 26S proteasome subunit | S0002165 | 1
YDL147W | YDL147W | RPN5 | protein degradation | 26S proteasome subunit | S0002306 | 1
YOL038W | YOL038W | PRE6 | protein degradation | 20S proteasome subunit (alpha4) | S0005398 | 1
YKL145W | YKL145W | RPT1 | protein degradation, ubiquitin | 26S proteasome subunit | S0001628 | 1
YDL097C | YDL097C | RPN6 | protein degradation | 26S proteasome regulatory subunit | S0002255 | 1
YDR394W | YDR394W | RPT3 | protein degradation | 26S proteasome subunit | S0002802 | 1
YBR173C | YBR173C | UMP1 | protein degradation, ubiquitin | 20S proteasome maturation factor | S0000377 | 1
YER012W | YER012W | PRE1 | protein degradation | 20S proteasome subunit C11(beta4) | S0000814 | 1
YPR108W | YPR108W | RPN7 | protein degradation | 26S proteasome regulatory subunit | S0006312 | 1
YOR117W | YOR117W | RPT5 | protein degradation | 26S proteasome regulatory subunit | S0005643 | 1
YJL001W | YJL001W | PRE3 | protein degradation | 20S proteasome subunit (beta1) | S0003538 | 1
YPR103W | YPR103W | PRE2 | protein degradation | 20S proteasome subunit (beta5) | S0006307 | 1
YOR157C | YOR157C | PUP1 | protein degradation | 20S proteasome subunit (beta2) | S0005683 | 1
YGL048C | YGL048C | RPT6 | protein degradation | 26S proteasome regulatory subunit | S0003016 | 1
YHR200W | YHR200W | RPN10 | protein degradation | 26S proteasome subunit | S0001243 | 1
YML092C | YML092C | PRE8 | protein degradation | 20S proteasome subunit Y7 (alpha2) | S0004557 | 1
YIL075C | YIL075C | RPN2 | tRNA processing | 26S proteasome subunit | S0001337 | 1
YMR314W | YMR314W | PRE5 | protein degradation | 20S proteasome subunit (alpha6) | S0004931 | 1
YGR253C | YGR253C | PUP2 | protein degradation | 20S proteasome subunit (alpha5) | S0003485 | 1
YGR135W | YGR135W | PRE9 | protein degradation | 20S proteasome subunit Y13 (alpha3) | S0003367 | 1
YFR004W | YFR004W | RPN11 | transcription | putative global regulator | S0001900 | 1
YOR259C | YOR259C | RPT4 | protein degradation | 26S proteasome regulatory subunit | S0005785 | 1
YFR052W | YFR052W | RPN12 | protein degradation | 26S proteasome regulatory subunit | S0001948 | 1
YFR050C | YFR050C | PRE4 | protein degradation | proteasome subunit, B type | S0001946 | 1
YGL011C | YGL011C | SCL1 | protein degradation | 20S proteasome subunit YC7ALPHA/Y8 | S0002979 | 1
YDR427W | YDR427W | RPN9 | protein degradation | 26S proteasome regulatory subunit | S0002835 | 1
YOR362C | YOR362C | PRE10 | protein degradation | 20S proteasome subunit C1 (alpha7) | S0005889 | 1
YBL041W | YBL041W | PRE7 | protein degradation | 20S proteasome subunit | S0000137 | 1
YER021W | YER021W | RPN3 | protein degradation | 26S proteasome regulatory subunit | S0000823 | 1
YER094C | YER094C | PUP3 | protein degradation | 20S proteasome subunit (beta3) | S0000896 | 1
YGR270W | YGR270W | YTA7 | protein degradation | 26S proteasome subunit ATPase | S0003502 | 1
YHR027C | YHR027C | RPN1 | protein degradation | 26S proteasome regulatory subunit | S0001069 | 1
YER047C | YER047C | SAP1 | mating type switching | AAA family protein | S0000849 | 1
YGR232W | YGR232W | | unknown | unknown | S0003464 | 1
46
SPEXS - Sequence Pattern EXhaustive Search
Jaak Vilo, 1998

User-definable pattern language substrings
(oligos), group positions/wildcards, flexible
wildcards (PROSITE)
Fast exhaustive search over pattern language
Lazy suffix tree construction-like algorithm
Analyze multiple sets of sequences simultaneously
Restrict search to most frequent patterns only
(in each set)
Report most frequent patterns, patterns over- or
underrepresented in selected subsets, or
patterns significant by various statistical
criteria, e.g. by binomial distribution

47
Lazy construction of trie

ATACATA
12345678
O(n²)
Kurtz, Giegerich
Good in practice

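(Figure: trie for ATACATA, each node listing the end positions of its pattern. First level: A at 1,3,5,7; T at 2,6; C at 4; end of sequence at 8. Second level: TA at 3,7; AT at 2,6; AC at 4; A followed by the end of sequence at 8)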
48
SPEXS pattern discovery based on pattern trie.
Vilo 1998

Substrings
Group characters
Wildcard positions
Restrictions on nr of each separately
At least k occurrences
Exact occurrences of each pattern

(Figure: pattern trie for ATACATA, positions 12345678. A at 1,3,5,7; T at 2,6; C at 4; group character [CT] at 2,4,6; A following [CT] at 3,5,7)
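The idea of growing a pattern trie from position lists can be sketched in Python for the simplest pattern class, plain substrings. This is an illustrative sketch, not the SPEXS implementation: the function name `pattern_trie`, the support threshold counted in sequences, and the length limit are assumptions. Each pattern keeps the list of (sequence index, end position) pairs of its occurrences, and a pattern is extended only while it occurs in at least `min_seqs` sequences.

```python
def pattern_trie(sequences, min_seqs=1, max_len=8):
    """Return {pattern: [(sequence index, 1-based end position), ...]}.

    Patterns are substrings of length 1..max_len that occur in at least
    min_seqs different sequences.
    """
    # The empty pattern "ends" just before every character of every sequence
    start = [(s, i) for s, seq in enumerate(sequences) for i in range(len(seq))]
    found = {}
    stack = [("", start)]
    while stack:
        pattern, occurrences = stack.pop()
        if len(pattern) >= max_len:
            continue
        # Group the occurrences by the character that follows them
        children = {}
        for s, i in occurrences:
            if i < len(sequences[s]):
                children.setdefault(sequences[s][i], []).append((s, i + 1))
        for char, occ in children.items():
            # Expand a child only if it is frequent enough (lazy construction)
            if len({s for s, _ in occ}) >= min_seqs:
                found[pattern + char] = occ
                stack.append((pattern + char, occ))
    return found

trie = pattern_trie(["ATACATA"], min_seqs=1, max_len=2)
print([pos for _, pos in trie["A"]])   # [1, 3, 5, 7]
print([pos for _, pos in trie["AT"]])  # [2, 6]
print([pos for _, pos in trie["TA"]])  # [3, 7]
```

Pruning is safe because a longer pattern can never occur in more sequences than its prefix. Group characters and wildcards would add further child nodes by merging or skipping position lists, which this sketch leaves out.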
49
Results continued...

In total, over 6000 interesting patterns
Many from homologous upstreams => removed
1498 most interesting substrings left
How to summarize?
Clustered these based on mutual similarity
Found alignments, consensi, and profiles
Of 62 clusters, 48 had patterns matching the SCPD
(experimentally mapped) binding site database

50
Cluster and align patterns
Words aligned as
ALIGNMENT based on pattern TGAC
----TGACAGC-
----TGACAGCT
---GTGACAGC-
-TTGTGACAG--
--AGTGACA---
ACAGTGACA---
-CAGTGACA---
--AGTGACAT--
--AGTGACATT-
---GTGACATT-
---GTGACAT--
--TGTGACA---
-TTGTGACA---
---GTGACAG--
--TGTGACAG--
---GTGACAC--
----TGACCCT-
----TGACCGA-
---TTGACCG--
--TTTGACC---
NO ALIGNMENT FOR NO CAGACAG
CLUSTER_0009 TGACAGC TGACAGCT GTGACAGC TTGTGACAG
AGTGACA ACAGTGACA CAGTGACA AGTGACAT AGTGACATT
GTGACATT GTGACAT TGTGACA TTGTGACA GTGACAG
TGTGACAG GTGACAC CAGACAG TGACCCT TGACCGA TTGACCG
TTTGACC
PROFILE >/tmp/.12100.profile
Profile for sequences in file /tmp/.12100 based on pattern TGAC
-  19 16 10  4  0  0  0  0  0  6 13 19
A   1  0  5  0  0  0 20  0 16  0  1  0
C   0  2  0  0  0  0  0 20  4  2  3  0
G   0  0  0 14  0 20  0  0  0  8  0  0
T   0  2  5  2 20  0  0  0  0  4  3  1
CONSENSUS (not real one) >/tmp/.12100.consensus
Profile for sequences in file /tmp/.12100 based on pattern TGAC
CONSENSUS_PAT /tmp/.12100 Consensus pattern actatGTGACAGTCctat
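The profile on this slide is a per-column count of each symbol in the alignment. A minimal Python sketch, with the 20 aligned words of the TGAC alignment typed in as data (the function name `profile` is illustrative, and this is not the code that produced the slide):

```python
def profile(aligned_rows, alphabet="-ACGT"):
    """Count, for every alignment column, how often each symbol occurs."""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")
    return {
        symbol: [sum(row[col] == symbol for row in aligned_rows)
                 for col in range(width)]
        for symbol in alphabet
    }

# The 20 words aligned on TGAC, as listed on the slide
rows = [
    "----TGACAGC-", "----TGACAGCT", "---GTGACAGC-", "-TTGTGACAG--",
    "--AGTGACA---", "ACAGTGACA---", "-CAGTGACA---", "--AGTGACAT--",
    "--AGTGACATT-", "---GTGACATT-", "---GTGACAT--", "--TGTGACA---",
    "-TTGTGACA---", "---GTGACAG--", "--TGTGACAG--", "---GTGACAC--",
    "----TGACCCT-", "----TGACCGA-", "---TTGACCG--", "--TTTGACC---",
]
counts = profile(rows)
for symbol in "-ACGT":
    print(symbol, " ".join(f"{n:2d}" for n in counts[symbol]))
```

Columns where one base reaches the full count of 20 (the T, G, A and C of the anchoring pattern) form the core; the flanking columns show how strongly the neighbouring positions are conserved.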
51
Align a group of patterns
ALIGNMENT based on pattern AGT
-----AGTGACA
---ACAGTGACA
----CAGTGACA
----CAGTGAC-
---ACAGTGAC-
---GCAGTGA--
---GCAGTG---
--AGCAGTGA--
-TTACAGTG---
TTTACAGTGA--
-TTACAGTGA--
--TACAGTG---
---ACAGTGA--
--TACAGTGA--
TTTACAGT----
TTTACAGTG---
PROFILE >/tmp/.5679.profile
Profile for sequences in file /tmp/.5679 based on pattern AGT
-  13 11  8  3  1  0  0  0  1  5 11 13
A   0  0  1 10  0 16  0  0  0 11  0  3
C   0  0  0  0 15  0  0  0  0  0  5  0
G   0  0  0  3  0  0 16  0 15  0  0  0
T   3  5  7  0  0  0  0 16  0  0  0  0
CONSENSUS (not real one) >/tmp/.5679.consensus
Profile for sequences in file /tmp/.5679 based on pattern AGT
CONSENSUS_PAT /tmp/.5679 tttAGCAGTGAca
AGTGACA ACAGTGACA CAGTGACA CAGTGAC ACAGTGAC GCAGTGA
GCAGTG AGCAGTGA TTACAGTG TTTACAGTGA TTACAGTGA
TACAGTG ACAGTGA TACAGTGA TTTACAGT TTTACAGTG
52
-----------------ACCCAGACATCGGGCTTCCAC-
---------------ACACCCAGACATC-----------
---------------ACACCCAGACATC-----------
---------------GAACCCATACACT-----------
---------------ACACCCAGACCGCG----------
---------------GCACCCACACATTT----------
------------GCTAAACCCATGCACAGTGACT-----
-----------------ACCCAGACACGCTCGA------
-------------CTTCACCCTCATAC------------
---------------ACACCCCTTTTCT-----------
---------------GCACCCAGTCTT------------
---------------GCACCCAAACACCTGCATATTTGG
---------------GCACCCAATCACC-----------
---------------ACACCCAGACCTC-----------
---------------AAACCCACACAT------------
--------------TGCACCCATACCTT-----------
--------------AACACCCAAGCACAG----------
-ATCTCTCGCAACG-------------------------
ACCTCCGTACATTC-------------------------
ACACCTGGACACC--------------------------
ACATCCGTACAACGAGAACCCATACATTA----------

---TCCGTAC---
-CATCCGTAC---
--ATCCGTA----
--ATCCGTACA--
----CCGTAC---
----CCGTACA--
---TCCGTACAT-
--ATCCGTACAT-
---TCCGTACA--
----CCGTACAT-
---TCCGTA----
--ATCCGTAC---
----CCGTACC--
---ACCGTACC--
---ACCGTAC---
--CACCGTAC---
----CCGTACATT
----GCGTAG---
----GCGTAGG--
-----CGTAGG--
-CATCCGTA----
ACATCCGT-----
-CATCCGT-----

ACCCATAC--
ACCCATACA-
ACCCATACAT
-CCCATAC--
-CCCATACA-
--CCATACAT
-CCCATACAT
--CCATACA-
-AACATAC--
--ACATACT-
---GATACT-
--AGATACT-

(Figure labels: In SCPD; Discovered automatically)
53
atatCTAGGCACTCaca
taGCGCAGTgacc
cgGTGGCAAACag
tcagaGACGGCTGGTActatttt
acatAGCAGGGGTctgaca
gcgagatgaacGATGAGACtagatg
aTGGATGCc
taTGCATGAAc
aTGGCGTATAc
gcTAGTATATATCgatagtggg
gaggaagAGTAGATGATGagtgaag
tagAGTAGATAAgaaaaa
gtaTAAATAGAGCtgct
ataagTGATGCCCGacacga
aCCTCAATATtgt
ATCCAAGAg
aaacaAAACAAAATcaacaata
tgtGTAAATCATtt
ataaaagtCAGTAAAAGAcggacaaaag
tgtTCGAAAGAGTt
attactgtaagAAAATTTTtgtcattt
gaatacgCAGGAAAGTgtgaa
ttccatATTCTTCGAACTgat
cggctctggctctgCTTTTTCTGTCatctgcc
tTTTTCTGCTTAc
ttagtagtcTGTCTATGGTCaatct
taaatATTTTGTGtaca
tacgCTGTGCTaac
tctcaTCTCATCCTtagcatc
aaTCTTCATGt
cctcGAACGTGCCATCtca
ataCGCCTAATAat
cgTACCTCTa
accacCCCCCTCGTaga
gTTACATCCTCGg
gACAGCTAc
actatGTGACAGTCctat
tttcACAGTGTATtcg
atATCTACACAt
tttGTCACAGATgg
tgcACATTGCCTtg
ataTCTGGTTCt
acaTCCGTACacgtt
agcaatcTAAGCGTAGtgaa
tATTACGTTAAgc
tctatAGAAGTATTAc
gtTAGTTACTTGAGca
ACTTTATTT
agTAACTTATCa
tACTCGCTTAAT
gaacagatacgAGCGCGctagatcagc
acatGTACGCcaa
GGTCGCTAc
tgtTAACGAATCGTTtaa
gaatTCCGTTTAagg
ttaCGAATAAGaaaa
aACAGATGAATCttc
tactcatCGACTCAcacgaa
tcCACGAAgctag
cgactgACGTACGATatctat
aCCACATACATt
54
Summary about identified patterns

SCPD
Of 1498 patterns
  315 patterns match sites
  73 patterns are matched by some site
  1134 patterns do not match any site, nor are matched by any
Of 109 factors with total 799 sites
  85 factors are matched by some of the patterns
  19 factors match some patterns
  24 factors do not have matches, nor are matched
Of 498 unique sites of total 799 sites
  238 sites are matched by some of the patterns
  21 sites match some patterns
  252 sites do not have matches, nor are matched

TRANSFAC
Of 1498 patterns
  297 patterns match sites
  61 patterns are matched by some site
  1174 patterns do not match any site, nor are matched by any
Of 351 DB-entries with total 359 sites
  205 factors are matched by some of the patterns
  22 factors match some patterns
  134 factors do not have matches, nor are matched
Of 334 unique sites of total 359 sites
  198 sites are matched by some of the patterns
  16 sites match some patterns
  127 sites do not have matches, nor are matched
55
PATMATCH

Match your patterns against sequences
Sequences - extracted from GENOMES
Visualise matches along the sequence
Visualise pattern by pattern if sequence has a
match
Order sequences according to hierarchical
clustering order from EPCLUST
Show clustering and upstream next to each other
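Matching known patterns against a set of upstream sequences can be sketched with Python's `re` module. This is a hypothetical helper, not the PATMATCH code: the function name, the dictionary input and the use of regular-expression syntax for group characters (for example `[AC]`) are assumptions.

```python
import re

def find_matches(pattern, sequences):
    """Return {sequence name: [1-based start positions of all matches]}.

    pattern: a regular expression such as "GGTGGCAA" or "TGAC[AC]".
    sequences: dict mapping a sequence name to its DNA string.
    """
    # A lookahead makes overlapping occurrences visible as separate matches
    regex = re.compile("(?=(" + pattern + "))")
    return {
        name: [m.start() + 1 for m in regex.finditer(seq)]
        for name, seq in sequences.items()
    }

upstreams = {"gene_a": "TTGGTGGCAAAGG", "gene_b": "ACGTACGT"}  # made-up sequences
print(find_matches("GGTGGCAA", upstreams))  # {'gene_a': [3], 'gene_b': []}
```

Keeping the positions rather than a yes/no answer is what allows the matches to be drawn along each sequence, in whatever gene order the clustering provides.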

56
GGTGGCAA - proteasome associated control element
57
PATMATCH - combine pattern matching with
expression data
58
Global and Local Data Mining
Secondary Data Mining

Find global structure by clustering
Find local structure by pattern discovery
Summarize the findings to the size feasible for
humans to interpret
Find most interesting rules
Find explanations for the behavior
Hypothesis generation for wet-lab

59
Gene networks
promoter1 gene1
promoter2 gene2
promoter3 gene3
promoter4 gene4
DNA
transcription
transcription factors
RNA
translation
proteins
60
A gene network
61
Lac-Operon
Thomas Schlitt
62
Thanks
Alvis Brazma, EBI
Inge Jonassen, University of Bergen
Esko Ukkonen, University of Helsinki
Thomas Schlitt, EBI
Alan Robinson, EBI
Lawrence Bower, EBI
63
Why microarray

Systems biology era
Study the complexity of a biological system as a
whole, e.g. all genes simultaneously
Increases throughput
Better general models
SNP, antibody, protein arrays
Proteomics ...

64
Proteomics

Proteome - complete set of PROteins encoded by
the genOME
Capture proteins, the end products
Post-translational modifications, untranslated
genes
What is the correspondence to mRNA levels?
Proteomics - 2D protein gels, MassSpec are
expensive

65
Some promises of MA data

Hypothesis about the function of genes
Tissue, or cancer biopsies (what is normal?)
Cell lines (e.g. insert viruses)
Animal model systems for diseases
Toxicology and drug testing
Molecular classification of diseases
(class discovery and class prediction)

66
Using binomial distribution

The probability of a pattern occurring in exactly
k out of n sequences is
The probability of a pattern to occur k or more
times in a set of n sequences (in a cluster) is
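The two formulas did not survive in the transcript. Assuming the standard binomial model, with p the background probability that a single sequence contains the pattern and X the number of sequences (out of n) that contain it, they would read:

```latex
% Exactly k out of n sequences contain the pattern
P(X = k) = \binom{n}{k} \, p^{k} (1 - p)^{n - k}

% k or more out of n sequences contain the pattern
P(X \ge k) = \sum_{i = k}^{n} \binom{n}{i} \, p^{i} (1 - p)^{n - i}
```

With n = 6, k = 3 and p = 0.2 the second expression evaluates to about 0.0989.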
