Jacques van Helden (Jacques.van.Helden@ulb.ac.be)

1
Statistics for bioinformatics

Cuernavaca - Introduccion a bioinformatica

2
Combinatorial analysis

Statistics Applied to Bioinformatics

3
Problem - oligomers

How many oligomers contain exactly a single
occurrence of each monomer, for oligonucleotides
and oligopeptides, respectively ?

4
Permutations within a set - the factorial

How many distinct permutations can be made from a
set of x elements ?
x = 2: $2 \times 1 = 2$
x = 3: $3 \times 2 \times 1 = 6$
x = 4: $4 \times 3 \times 2 \times 1 = 24$
any x: $x \times (x-1) \times \dots \times 1 = x!$
The factorial x! represents the number of
possible permutations of x objects.
Solution to the problem of oligomers
There are $4! = 24$ distinct oligonucleotides with a
single occurrence of each nucleotide (A, C, G, T).
There are $20! \approx 2.4 \times 10^{18}$ distinct oligopeptides
with a single occurrence of each amino acid.
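
The two factorials quoted above can be checked with a minimal Python sketch. The variable names are illustrative and not part of the original slides.

```python
import math

# The number of orderings of x distinct monomers is x!
n_oligonucleotides = math.factorial(4)   # one each of A, C, G, T
n_oligopeptides = math.factorial(20)     # one each of the 20 amino acids

print(n_oligonucleotides)
print(f"{n_oligopeptides:.2e}")
```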

5
Problem - Selection of a subset of elements

A genome contains n = 6000 genes.
We select a series of genes in the following way:
Once a gene has been selected, it cannot be
selected again (no replacement).
We are not interested in the order of the
selection: if A and B were selected, we do not
consider whether A came out in first or in second
position.
How many possibilities do we have to select
1 gene ?
2 genes ?
3 genes ?
x genes ?

6
Selection of a subset of elements

Number of possible outcomes (ordered selections): $n (n-1) \cdots (n-x+1) = \frac{n!}{(n-x)!}$
n = size of the set
x = size of the subset
Possible permutations among the elements of a
subset: $x!$
Number of distinct selections (orderless): $C_n^x = \frac{n!}{x! \, (n-x)!}$
The coefficient $C_n^x$ represents the number of
distinct choices of x elements among n. For this
reason, it is called "Choose x among n". It is
also called the binomial coefficient (we will see
later why).
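
As a rough illustration of the gene-selection problem (n = 6000), the sketch below counts ordered and orderless selections for x = 1, 2, 3 with Python's standard `math.perm` and `math.comb`. The dictionary name `selections` is only illustrative.

```python
import math

n = 6000  # number of genes in the genome

selections = {}
for x in (1, 2, 3):
    ordered = math.perm(n, x)         # n! / (n - x)!  (order matters)
    orderings = math.factorial(x)     # permutations within one subset
    unordered = math.comb(n, x)       # n! / (x! (n - x)!)
    # Each orderless selection corresponds to x! ordered ones.
    assert ordered == unordered * orderings
    selections[x] = unordered
    print(x, unordered)
```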

7
Elements of Probabilities

Statistics Applied to Bioinformatics

8
Frequentist definition of probability

Trial
A random position is selected in the genome of
Mycoplasma genitalium.
Event
If the nucleotide at this position is an adenine
(A), the trial is considered a success.
1,000,000 successive trials were performed.
The frequency of success is plotted as a function
of the number of trials.
The frequency progressively converges towards the
value of 0.35.
The probability is the value that would be
obtained with an infinite number of trials.
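
The convergence described above can be imitated with a small simulation. This is only a sketch: it does not read the actual Mycoplasma genitalium genome, and instead assumes each trial succeeds with a fixed probability of 0.35. The function name and the seed are hypothetical.

```python
import random

def running_frequency(p_success, n_trials, seed=0):
    """Return the frequency of success after each of n_trials random trials."""
    rng = random.Random(seed)
    successes = 0
    freqs = []
    for i in range(1, n_trials + 1):
        if rng.random() < p_success:   # the trial is a success (an "A" is drawn)
            successes += 1
        freqs.append(successes / i)
    return freqs

freqs = running_frequency(0.35, 100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(freqs[n - 1], 4))
```

The printed frequencies fluctuate strongly for small numbers of trials and settle close to 0.35 as the number of trials grows.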

Selection of random nucleotides in a genome
9
Mutually exclusive events
$P(A \land B) = 0$
$P(A \lor B) = P(A) + P(B)$
where $\lor$ is the logical OR and $\land$ is the logical AND.

E.g. calculating degenerate nucleotide
probability
$P(W) = P(A \lor T) = P(A) + P(T)$
$P(S) = P(C \lor G) = P(C) + P(G)$

10
Complementary events

E.g. coding / non-coding sequences in
Mycoplasma genitalium
$P(\text{coding}) = 0.902$
$P(\text{coding} \lor \text{non-coding}) = 1 \Rightarrow P(\text{non-coding}) = 1 - P(\text{coding}) = 0.098$
Example: probability of the degenerate nucleotide
N (in this example we combine the properties of
complementarity and mutual exclusion)
$P(N) = P(A) + P(T) + P(C) + P(G) = 1$

11
Stochastic independence

Two events A and B are said to be stochastically
independent when
$P(A|B) = P(A|\lnot B) = P(A)$
$P(B|A) = P(B|\lnot A) = P(B)$
For stochastically independent events, the joint
probability is the product of probabilities
$P(A \land B) = P(A) \cdot P(B)$
E.g. calculating oligonucleotide probability
with a model of independent succession of
nucleotides
$P(GATAAG) = P(G) \cdot P(A) \cdot P(T) \cdot P(A) \cdot P(A) \cdot P(G)$
Note: this model is not appropriate for biological
sequences, where there are strong dependencies
between neighbouring nucleotides (see next slide).
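
A minimal sketch of the independence model applied to GATAAG. The equiprobable frequencies (0.25 each) are an assumption made for illustration, and `oligo_probability` is a hypothetical helper name.

```python
import math

def oligo_probability(oligo, freqs):
    """Probability of an oligonucleotide when positions are independent:
    the product of the single-nucleotide probabilities."""
    return math.prod(freqs[nt] for nt in oligo)

# Assumed for illustration: equiprobable nucleotides.
equiprobable = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

p = oligo_probability("GATAAG", equiprobable)
print(p)
```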

12
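Not necessarily independent events
General addition rule (the events need not be mutually exclusive):
$P(A \lor B) = P(A) + P(B) - P(A \land B)$
$P(A \lor B \lor C) = P(A) + P(B) + P(C) - P(A \land B) - P(A \land C) - P(B \land C) + P(A \land B \land C)$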
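
The two-event rule can be verified by brute-force enumeration. The sketch below uses a toy example that is not in the slides: the 16 equiprobable dinucleotides, with event A = "the first nucleotide is A" and event B = "the second nucleotide is A".

```python
from fractions import Fraction
from itertools import product

# Sample space: the 16 dinucleotides, assumed equiprobable.
dinucleotides = ["".join(pair) for pair in product("ACGT", repeat=2)]
p_each = Fraction(1, len(dinucleotides))

def probability(event):
    """Sum the probabilities of all outcomes for which the event holds."""
    return sum(p_each for d in dinucleotides if event(d))

def first_is_a(d):
    return d[0] == "A"

def second_is_a(d):
    return d[1] == "A"

p_a = probability(first_is_a)
p_b = probability(second_is_a)
p_and = probability(lambda d: first_is_a(d) and second_is_a(d))
p_or = probability(lambda d: first_is_a(d) or second_is_a(d))

# P(A or B) = P(A) + P(B) - P(A and B)
assert p_or == p_a + p_b - p_and
print(p_a, p_b, p_and, p_or)
```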
13
Probabilities - Exercises

Assuming independently and identically
distributed nucleotides, calculate the
probability of observing a hexanucleotide
containing exactly 4 As and 2 Cs, irrespective of
the relative positions of these residues.
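
One possible way to approach this exercise is to count the arrangements of the residues and multiply by the probability of any single arrangement. The sketch assumes equiprobable nucleotides (0.25 each), which the slide does not state explicitly; `prob_composition` is a hypothetical helper.

```python
import math

def prob_composition(n_a, n_c, p_a=0.25, p_c=0.25):
    """Probability that an oligonucleotide of length n_a + n_c contains
    exactly n_a As and n_c Cs, in any order (independent positions)."""
    arrangements = math.comb(n_a + n_c, n_a)   # ways to place the As
    return arrangements * p_a ** n_a * p_c ** n_c

p = prob_composition(4, 2)
print(math.comb(6, 4), p)
```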

14
Probabilities - Exercises

Assuming a DNA sequence with independently and
identically distributed nucleotides, calculate
the probabilities of the following
oligonucleotides.
A
AA
AAAA
AAAAAA
CCCCCC
CACACA
CANNTG
CACGTK
Calculate the probabilities of the same
oligonucleotides, assuming that nucleotides are
independently distributed, but have the following
prior probabilities:
$P(A) = 0.31$, $P(T) = 0.29$, $P(C) = 0.19$, $P(G) = 0.21$

15
Set comparisons

Statistics for bioinformatics

16
Methionine biosynthesis in S. cerevisiae
17
The "bio-ontologies"

Answer to the problem of inconsistencies in
database annotations
Controlled vocabulary
Hierarchical classification between the terms of
the controlled vocabulary
E.g. The Gene Ontology
molecular function ontology
process ontology
cellular component ontology

18
Gene ontology processes
19
Gene ontology molecular functions
20
Gene ontology cellular components
21
Fragment of the GO annotations for the yeast
Saccharomyces cerevisiae
22
Problem - selection within a set with classes

A given organism has 6,000 genes, among which 40
are involved in methionine metabolism.
A set of 10 genes are co-regulated in a
microarray experiment.
Among them, 6 are related to methionine
metabolism.
What would be the probability to observe such a
correspondence by chance alone ?

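[Figure: Venn diagram within the genome (6000 genes). Methionine only: 34 genes; methionine and co-regulated: 6 genes; co-regulated only: 4 genes.]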
23
Selection within a set with classes

Let us define
g = 6000: number of genes
m = 40: genes involved in methionine metabolism
n = 5960: genes not involved in methionine
metabolism
k = 10: number of genes in the cluster
x = 6: number of methionine genes in the cluster
We calculate the number of possibilities for the
following selections
C1: 10 distinct genes among 6,000: $C_1 = C_g^k = C_{6000}^{10}$
C2: 6 distinct genes among the 40 involved in
methionine: $C_2 = C_m^x = C_{40}^{6}$
C3: 4 genes among the 5960 which are not involved
in methionine: $C_3 = C_n^{k-x} = C_{5960}^{4}$
C4: 6 methionine and 4 non-methionine genes: $C_4 = C_2 \cdot C_3$
Probability to have exactly 6 methionine genes
within a selection of 10: $P(X = 6) = \frac{C_4}{C_1} = \frac{C_m^x \, C_n^{k-x}}{C_g^k}$
Probability to have at least 6 methionine genes
within a selection of 10: $P(X \ge 6) = \sum_{i=6}^{10} \frac{C_m^i \, C_n^{k-i}}{C_g^k}$
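
The counts C1 to C4 and the two probabilities can be evaluated directly with exact integer binomial coefficients. This is a sketch using the values defined on the slide; the lower-case names `c1` to `c4`, `p_exactly` and `p_at_least` are illustrative.

```python
from math import comb

g, m, k, x = 6000, 40, 10, 6
n = g - m                      # 5960 genes not involved in methionine metabolism

c1 = comb(g, k)                # 10 distinct genes among 6000
c2 = comb(m, x)                # 6 genes among the 40 methionine genes
c3 = comb(n, k - x)            # 4 genes among the 5960 other genes
c4 = c2 * c3                   # 6 methionine AND 4 non-methionine genes

p_exactly = c4 / c1
# "At least 6": sum over 6, 7, ..., 10 methionine genes in the cluster.
p_at_least = sum(comb(m, i) * comb(n, k - i) for i in range(x, k + 1)) / c1

print(f"P(X = 6)  = {p_exactly:.3e}")
print(f"P(X >= 6) = {p_at_least:.3e}")
```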

24
The hypergeometric distribution

The hypergeometric distribution represents the
probability of observing x successes in a sample
drawn without replacement
$P(X = x) = \frac{C_m^x \, C_n^{k-x}}{C_{m+n}^k}$
m = number of possible successes
n = number of possible failures
k = sample size
x = number of successes in the sample

The shape of the distribution depends on the
ratio between m and n
$m \ll n$: i-shaped
$m \approx n$: bell-shaped
$m \gg n$: j-shaped
The distribution is bounded on both sides ($0 \le x \le k$)
Statistical parameters
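
A minimal sketch of the hypergeometric probability mass function, using the m, n, k, x notation of this slide. The toy values (m = 30, n = 70, k = 10) are arbitrary and chosen only to check two standard properties: the probabilities sum to 1, and the mean equals k·m/(m+n).

```python
from math import comb

def hypergeom_pmf(x, m, n, k):
    """P(X = x): x successes in a sample of size k drawn without replacement
    from m possible successes and n possible failures."""
    if x < max(0, k - n) or x > min(k, m):
        return 0.0                      # outside the possible range
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)

m, n, k = 30, 70, 10                    # arbitrary toy values
pmf = [hypergeom_pmf(x, m, n, k) for x in range(k + 1)]

total = sum(pmf)
mean = sum(x * p for x, p in enumerate(pmf))
print(total, mean, k * m / (m + n))
```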

25
Multi-testing corrections

Statistics Applied to Bioinformatics

26
Bonferroni rule

Multi-testing
Assessing the significance of each gene on a chip
represents thousands of simultaneous tests. Let N
be the number of genes.
The risk of error (P-value) associated with each
gene will thus be challenged N times.
The significance thresholds used for single
testing (0.01, 0.001) are thus likely to return
many false positives.
Bonferroni rule
Adapt the threshold to the number of simultaneous
tests: a single-test threshold $\alpha$ becomes $\alpha / N$.

27
E-value

An alternative but equivalent way to treat the
problem of multi-testing is to calculate the
expected number of false positives (E-value) for
each observation: $\text{E-value} = P_{val} \cdot N$
One can then choose the threshold on the E-value
according to the number of false positives
considered acceptable.

28
Family-wise Error Rate (FWER)

Another correction for multiple testing consists
of estimating the probability of observing at least
one false positive in the whole set of tests.
This probability can be calculated quite easily
from the P-value (Pval).
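
Assuming the N tests are independent, a common way to write this probability is as the complement of "no false positive in any of the N tests". The formula on the original slide was not preserved in this transcript, so the expression below is the standard relation rather than a quotation.

```latex
\mathrm{FWER} = 1 - (1 - P_{val})^{N}
```

For small values of $P_{val}$, this is close to $N \cdot P_{val}$.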

29
False Discovery Rate (FDR)

Yet another approach is to consider, for a given
threshold on P-value, the False Discovery Rate,
i.e. the proportion of false predictions within a
set of predictions.

30
Summary - Multi-testing corrections

Bonferroni rule: adapt the significance threshold
E-value: expected number of false positives
FWER (family-wise error rate): probability of
observing at least one false positive
FDR (false discovery rate): estimated rate of
false positives among the predictions
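
The four corrections summarised above can be sketched in a few lines of Python. The function names are hypothetical. The FWER formula assumes independent tests, and the FDR estimate (E-value divided by the rank of the P-value) is an assumption in the spirit of the Benjamini-Hochberg procedure, since the slides do not say which estimator was used.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold so that the overall risk stays near alpha."""
    return alpha / n_tests

def e_value(p_value, n_tests):
    """Expected number of false positives at this P-value."""
    return p_value * n_tests

def fwer(p_value, n_tests):
    """Probability of at least one false positive (independent tests assumed)."""
    return 1.0 - (1.0 - p_value) ** n_tests

def fdr_estimates(p_values):
    """Assumed estimator: for each P-value, E-value / rank, capped at 1."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    fdr = [0.0] * n
    for rank, i in enumerate(order, start=1):
        fdr[i] = min(1.0, p_values[i] * n / rank)
    return fdr

n_genes = 6000
print(bonferroni_threshold(0.01, n_genes))
print(e_value(1e-4, n_genes))
print(fwer(1e-4, n_genes))
print(fdr_estimates([0.01, 0.04, 0.03, 0.5]))
```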

31
Compare-classes result
32
Application comparing many gene sets with many
gene sets

Statistics Applied to Bioinformatics

33
RegulonDB factor -> gene network

RegulonDB (Oct. 2005 version)
The graph represents the relationships between
factors and their target genes (factor -> gene
graph)
125 transcription factors
467 target genes
847 factor -> gene interactions
45 self-regulations
Note: CRP alone regulates 132 target genes.

Factor-gene graph
34
Application yeast protein complexes

High-throughput experiments led to the
identification of thousands of protein complexes
in the yeast Saccharomyces cerevisiae.
Note: these high-throughput experiments are
likely to contain a certain rate of noise.
Question
Can we associate specific functions with these
complexes ?
Approach
Compare the gene composition of each complex with
the functional classes defined in the Gene
Ontology.

35
Yeast complexes versus GO
36
Questions

Can we detect pairs of transcription factors
which co-regulate a significant number of genes ?
Are these factors (and their common target genes)
involved in particular biological functions ?

37
Regulons versus regulons
gene    factor  GI
alkA    Ada     1788383
ada     Ada     1788542
aidB    Ada     1790630
adiA    AdiY    1790558
agaZ    AgaR    1789520
agaS    AgaR    1789525
alsR    AlsR    1790527
alsI    AlsR    1790528
hyaA    AppY    1787206
appC    AppY    1787212
araB    AraC    1786249
araC    AraC    1786251
araJ    AraC    1786595
araF    AraC    1788211
araE    AraC    1789207
caiT    ArcA    1786224
lpdA    ArcA    1786307
betI    ArcA    1786505
betT    ArcA    1786506
cyoA    ArcA    1786635
dcuC    ArcA    1786839
gltA    ArcA    1786939
sdhC    ArcA    1786940
sucA    ArcA    1786945
cydA    ArcA    1786953
moeA    ArcA    1787049
focA    ArcA    1787132
hyaA    ArcA    1787206
ptsG    ArcA    1787343
ndh     ArcA    1787352
icdA    ArcA    1787381
hemA    ArcA    1787461
acnA    ArcA    1787531
tpx     ArcA    1787584
...
38
Regulons versus functional classes
39
Microarray groups (Gasch, 2000) versus Gene
Ontology
40
Exercises

Nucleotide frequencies were computed for a
given genome: $F(A) = F(T) = 0.35$; $F(G) = F(C) = 0.15$
What is the probability of observing the
motif GATWNNHT at a given position of this genome
?
W means A or T
H means not G
N means any nucleotide
Justify the answer in terms of combinations of
events.
We searched the orthologs of two genes (A and B)
in 80 bacterial genomes.
An ortholog was found for gene A in 50 genomes.
An ortholog was found for gene B in 60 genomes.
Among these, 40 genomes contain orthologs for
both A and B.
Can we consider that A and B are found in common
in a significant number of genomes ?
Give the name of the distribution that can be used
to model this type of situation.
Indicate the general formula that would allow you to
estimate this significance (the P-value).
Explain what each term of this formula
represents.
In the formula, replace each variable by the
appropriate number (you don't need to calculate
the final value).
What does the P-value represent ? How would you
interpret a P-value of 15 in terms of risks ?

41
Answer to the first question

Nucleotide frequencies were computed for a
given genome: $F(A) = F(T) = 0.35$; $F(G) = F(C) = 0.15$
What is the probability of observing the
motif GATWNNHT at a given position of this genome
?
$P(W) = P(A) + P(T) = 0.7$
$P(N) = 1$
$P(H) = 1 - P(G) = 0.85$
$P(GATWNNHT) = P(G) \cdot P(A) \cdot P(T)^2 \cdot P(W) \cdot P(N)^2 \cdot P(H)$
$= 0.15 \cdot 0.35 \cdot 0.35^2 \cdot 0.7 \cdot 0.85 = 0.00383$
Justify the answer in terms of combinations of
events.
$P(W) = P(A) + P(T)$ because A and T are mutually
exclusive at a given position
$P(N) = 1$ because A, C, G, T are complementary
events (there is no other possible event)
$P(H) = 1 - P(G)$ because H is the complementary
event of G (not G)
$P(GATWNNHT) = P(G) \cdot P(A) \cdot P(T)^2 \cdot P(W) \cdot P(N)^2 \cdot P(H)$ because
the nucleotides found at successive positions
are assumed to be mutually independent -> the joint
probability is the product of probabilities.
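
The same calculation can be sketched in Python for any motif written with IUPAC degenerate codes. The table below uses the standard IUPAC nucleotide codes; `motif_probability` is a hypothetical helper, and positions are assumed independent as in the answer above.

```python
import math

# Standard IUPAC nucleotide codes and the nucleotides they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def motif_probability(motif, freqs):
    """Product over positions of the summed probabilities of the
    (mutually exclusive) nucleotides allowed at each position."""
    return math.prod(sum(freqs[nt] for nt in IUPAC[code]) for code in motif)

freqs = {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15}
p = motif_probability("GATWNNHT", freqs)
print(round(p, 5))   # about 0.00383, as computed on the slide
```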

42
Answer to question 2

The hypergeometric distribution. The P-value is
calculated with the complementary CDF (the right tail
of the distribution):
$P(X \ge x) = \sum_{i=x}^{\min(n, n_1)} \frac{C_{n_1}^{i} \, C_{n_2}^{n-i}}{C_{n_1+n_2}^{n}}$
Terms
$n_1 = 60$: number of labelled elements (genomes with
an ortholog for gene B)
$n_2 = 20$: number of unlabelled elements (genomes
without an ortholog for gene B)
$n = 50$: the size of the selection (genomes with an
ortholog for gene A)
$x = 40$: the number of labelled elements in the
selection (genomes with orthologs for both A and
B)
$P(X \ge 40) = \sum_{i=40}^{50} \frac{C_{60}^{i} \, C_{20}^{50-i}}{C_{80}^{50}}$

We searched the orthologs of two genes (A and B)
in 80 bacterial genomes.
An ortholog was found for gene A in 50 genomes.
An ortholog was found for gene B in 60 genomes.
Among these, 40 genomes contain orthologs for
both A and B.
Can we consider that A and B are found in common
in a significant number of genomes ?
Give the name of the distribution that can be used
to model this type of situation.
Indicate the general formula that would allow you to
estimate this significance (the P-value).
Explain what each term of this formula
represents.
In the formula, replace each variable by the
appropriate number (you don't need to calculate
the final value).
What does the P-value represent ? How would you
interpret a P-value of 14 in terms of risks ?
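
The right tail of the hypergeometric distribution for the ortholog question can be evaluated with exact binomial coefficients. This is a sketch: the function name is hypothetical, and the arguments follow the terms of the answer (60 labelled genomes, 20 unlabelled genomes, a selection of 50, 40 labelled genomes in the selection). Swapping the roles of the two genes should give the same value, which the last line checks.

```python
from math import comb

def hypergeom_right_tail(x, n1, n2, n):
    """P(X >= x) when n elements are drawn without replacement from
    n1 labelled and n2 unlabelled elements."""
    total = comb(n1 + n2, n)
    upper = min(n, n1)               # largest possible number of labelled draws
    favourable = sum(comb(n1, i) * comb(n2, n - i) for i in range(x, upper + 1))
    return favourable / total

# 80 genomes: 60 labelled, 20 unlabelled, selection of 50, overlap of 40.
p_value = hypergeom_right_tail(40, 60, 20, 50)

# Same question with the roles of the two genes swapped.
p_swapped = hypergeom_right_tail(40, 50, 30, 60)

print(p_value, p_swapped)
```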