Title: Jacques van Helden Jacques'van'Heldenulb'ac'be
1Statistics for bioinformatics
- Cuernavaca - Introduccion a bioinformatica
2Combinatorial analysis
- Statistics Applied to Bioinformatics
3Problem - oligomers
- How many oligomers contain exactly a single
occurrence of each monomer, for oligonucleotides
and oligopeptides, respectively ?
4Permutations within a set - the factorial
- How many distinct permutations can be made from a
set of x elements ? - x 2 2
- x 3 32 6
- x 4 432 24
- any x x(x-1)...1 x!
- The factorial x! represents the number of
possible permutations between x objects. - Solution to the problem of oligomers
- There are 4!24 distinct oligonucleotides with a
single occurrence of each nucleotide (A, C, G, T) - There are 20!2.41018 distinct oligopeptides
with a single occurrence of each amino acid.
5Problem - Selection of a subset of elements
- A genome contains n6000 genes.
- We select a series of genes in the following way
- Once a gene has been selected once, it cannot be
selected anymore (no replacement) - We are not interested in the order of the
selection if A and B were selected, we do not
consider whether A came out in first or in second
position. - How many possibilities do we have to select
- 1 gene ?
- 2 genes ?
- 3 genes ?
- x genes ?
6Selection of a subset of elements
- Number of possible outcomes
- n size of the set
- x size of the subset
- Possible permutations among the elements of a
subset - Number of distinct selections (orderless).
- The coefficient Cxn represents the number of
distinct choices of x elements among n. For this
reason, it is called "Choose x among n". It is
also called binomial coefficient (we will see
later why).
7Elements of Probabilities
- Statistics Applied to Bioinformatics
8Frequential definition of probability
- Trial
- a random position is selected in the genome of
Mycoplasma genitalium - Event
- If the nucleotide at this position is a adenine
(A), the trial is considered as a success - 1,000,000 successive trials were performed
- The frequency of success is plotted as a function
of the number of trials - The frequency progressively converges towards the
value of 0.35 - The probability is the value that would be
obtained with an infinite number of trials
Selection of random nucleotides in a genome
9Mutually exclusive events
Where ? is the logical OR ? is the logical AND
- E.g. Calculating degenerate nucleotide
probability - P(W) P(A ? T) P(A) P(T)
- P(S) P(C ?G) P(C) P(G)
10Complementary events
- E.g. coding / non-coding sequences in
Mycobacterium genitalium - P(coding) 0.902
- P(coding ? non-coding) 1? P(non-coding) 1 -
P(coding) 0.098 - Example Probability of the degenerate nucleotide
N (in this example we combine the properties of
complementarity and mutual exclusion) - P(N) P(A) P(T) P(C) P(G) 1
11Stochastic independence
- Two events A and B are said stochastically
independent when - P(AB) P(A!B) P(A)
- P(BA) P(B!A) P(B)
- For stochastically independent events, the joint
probability is the product of probabilities - P(A ? B) P(A)P(B)
- E.g. calculating oligonucleotide probability
with a model of independent succession of
nucleotides - P(GATAAG) P(G) P(A) P(T) P(A) P(A)
P(G) - Note this is not appropriate for biological
sequences, where there are strong dependencies
between neighbour nucleotides (see next slide)
12Non necessarily independent events
P(A ? B) P(A) P(B) P(A ? B)
P(A ? B ? C) P(A) P(B) P(C) P(A ? B)
P(A ? C) P(B ? C) P(A ? B ? C)
13Probabilities - Exercises
- Assuming identically and independently
distributed nucleotides, calculate the
probability to observe an hexanucleotide
containing exactly 4 As and 2 Cs, irrespective of
the relative positions of these residues.
14Probabilities - Exercises
- Assuming a DNA sequence with independently and
identically distributed nucleotides, calculate
the probabilities of the following
oligonucleotides. - A
- AA
- AAAA
- AAAAAA
- CCCCCC
- CACACA
- CANNTG
- CACGTK
- Calculate the probabilities of the same
oligonucleotides, assuming that nucleotides are
independently distributed, but have the following
prior probabilities - P(A) 0.31 P(T) 0.29 P(C) 0.19 P(G)
0.21
15Set comparisons
- Statistics for bioinformatics
16Methionine Biosynthesis in S.cerevisiae
17The "bio-ontologies"
- Answer to the problem of inconsistencies in
database annotations - Controlled vocabulary
- Hierarchical classification between the terms of
the controlled vocabulary - E.g. The Gene Ontology
- molecular function ontology
- process ontology
- cellular component ontology
18Gene ontology processes
19Gene ontology molecular functions
20Gene ontology cellular components
21Fragment of the GO annotations for the yeast
Saccharomyces cerevisiae
22Problem - selection within a set with classes
- A given organism has 6,000 genes, among which 40
are involved in methionine metabolism. - A set of 10 genes are co-regulated in a
microarray experiment. - Among them, 6 are related to methionine
metabolism. - What would be the probability to observe such a
correspondence by chance alone ?
Methionine
Co-regulated
34
6
4
Genome (6000)
23Selection within a set with classes
- Let us define
- g 6000 number of genes
- m 40 genes involved in methionine metabolism
- n 5960 genes not involved in methionine
metabolism - k 10 number of genes in the cluster
- x 6 number of methionine genes in the cluster
- We calculate the number of possibilities for the
following selections - C1 10 distinct genes among 6,000
- C2 6 distinct genes among the 40 involved in
methionine - C3 4 genes among the 5960 which are not involved
in methionine - C4 6 methionine and 4 non-methionine genes
- Probability to have exactly 6 methionine genes
within a selection of 10 - Probability to have at least 6 methionine genes
within a selection of 10
24The hypergeometric distribution
- The hypergeometric distribution represents the
probability to observe x successes in a sampling
without replacement - m number of possible successes
- n number of possible failures
- k sample size
- x number of successes in the sample
- The shape of the distribution depends on the
ratio between m and n - m ltlt n i-shaped
- m n bell-shaped
- m gtgt n j-shaped
- The distribution is bounded on both sides (0 ? x
? k) - Statistical parameters
25Multi-testing corrections
- Statistics Applied to Bioinformatics
26Bonferoni rule
- Multi-testing
- Assessing the significance of each gene on a chip
represents thousands of simultaneous tests. Let N
be the number of genes. - The risk of error (P-value) associated to each
gene will thus be challenged N times. - The significance thresholds used for single
testing (0.01, 0.001) are thus likely to return
many false positive. - Bonferoni rule
- Adapt the threshold to the number of simultaneous
tests.
27E-value
- An alternative but equivalent way to treat the
problem of multi-testing is to calculate the
expected value for each observation. - One can then choose the E-value according to the
number of false positive considered as
acceptable.
28Family-wise Error Rate (FWER)
- Another correction for multiple testing consists
in estimating the probability to observe at least
one false positive in the whole set of tests.
This probability can be calculated quite easily
from the P-value (Pval).
29False Discovery Rate (FDR)
- Yet another approach is to consider, for a given
threshold on P-value, the False Discovery Rate,
i.e. the proportion of false predictions within a
set of predictions.
30Summary - Multi-testing corrections
- Bonferoni rule adapt significance threshold
- E-value expected number of false positives
- FWER Family-wise error rate probability to
observe at least one false positive - FDR False discovery rate estimated rate of
false positives among the predictions
31Compare-classes result
32Application comparing many gene sets with many
gene sets
- Statistics Applied to Bioinformatics
33RegulonDB factor -gt gene network
- RegulonDB (Oct. 2005 version)
- The graph represents the relationships between
factors and their target genes (factor -gt gene
graph) - 125 transcription factors
- 467 target genes
- 847 factor-gtgene interactions
- 45 self-regulations
- Note CRP alone regulates 132 target genes.
Factor-gene graph
34Application yeast protein complexes
- High-throughput experiments led to the
identification of thousands of protein complexes
in the yeast Saccharomyces cerevisiae. - Note These high-throughput experiments are
likely to contain a given rate of noise - Question
- can we associate specific functions to these
complexes ? - Approach
- Compare the gene composition of each complex with
the functional classes associated in the Gene
Ontology.
35Yeast complexes versus GO
36Questions
- Can we detect pairs of transcription factors
which co-regulate a significant number of genes ?
- Are these factors (and their common target genes)
invovled in particular biological functions ?
37Regulons versus regulons
gene factor GI alkA Ada 1788383 ada
Ada 1788542 aidB Ada 1790630 adiA
AdiY 1790558 agaZ AgaR 1789520 agaS
AgaR 1789525 alsR AlsR 1790527 alsI
AlsR 1790528 hyaA AppY 1787206 appC
AppY 1787212 araB AraC 1786249 araC
AraC 1786251 araJ AraC 1786595 araF
AraC 1788211 araE AraC 1789207 caiT
ArcA 1786224 lpdA ArcA 1786307 betI
ArcA 1786505 betT ArcA 1786506 cyoA
ArcA 1786635 dcuC ArcA 1786839 gltA
ArcA 1786939 sdhC ArcA 1786940 sucA
ArcA 1786945 cydA ArcA 1786953 moeA
ArcA 1787049 focA ArcA 1787132 hyaA
ArcA 1787206 ptsG ArcA 1787343 ndh
ArcA 1787352 icdA ArcA 1787381 hemA
ArcA 1787461 acnA ArcA 1787531 tpx
ArcA 1787584 ...
38Regulons versus functional classes
gene factor GI alkA Ada 1788383 ada
Ada 1788542 aidB Ada 1790630 adiA
AdiY 1790558 agaZ AgaR 1789520 agaS
AgaR 1789525 alsR AlsR 1790527 alsI
AlsR 1790528 hyaA AppY 1787206 appC
AppY 1787212 araB AraC 1786249 araC
AraC 1786251 araJ AraC 1786595 araF
AraC 1788211 araE AraC 1789207 caiT
ArcA 1786224 lpdA ArcA 1786307 betI
ArcA 1786505 betT ArcA 1786506 cyoA
ArcA 1786635 dcuC ArcA 1786839 gltA
ArcA 1786939 sdhC ArcA 1786940 sucA
ArcA 1786945 cydA ArcA 1786953 moeA
ArcA 1787049 focA ArcA 1787132 hyaA
ArcA 1787206 ptsG ArcA 1787343 ndh
ArcA 1787352 icdA ArcA 1787381 hemA
ArcA 1787461 acnA ArcA 1787531 tpx
ArcA 1787584 ...
39Microarray groups (Gasch, 2000) versus Gene
Ontology
40Exercises
- The nucleotides frequencies were computed for a
given genome F(A) F(T) 0.35 F(G) F(C)
0.15 - What is the probability to observe the following
motif GATWNNHT at a given position of this genome
? - W means A or T
- H means not G
- N means any nucleotide
- Justify the answer in term of combinations of
events. - We searched the orthologs of two genes (A and B)
in 80 bacterial genomes. - An ortholog was found for gene A in 50 genomes.
- An ortholog was found for gene B in 60 genomes.
- Among these, 40 genomes contain orthologs for
both A and B. - Can we consider that A and B are found in common
in a significant number of genomes ? - Give the name of the distribution which allows to
model this type of situation. - Indicate the general formula which would allow to
estimate this significance (the P-value). - Explain what each term of this formula
represents. - In the formula, replace each variable by the
appropriate number(you dont need to calculate
the final value). - What does the P-value represent ? How would you
interpret a P-value of 15 in terms of risks ?
41Answer to the first question
- The nucleotides frequencies were computed for a
given genome F(A) F(T) 0.35 F(G) F(C)
0.15 - What is the probability to observe the following
motif GATWNNHT at a given position of this genome
? - P(W) P(A) P(T) 0.7
- P(N) 1
- P(H) 1 - P(G) 0.85
- P(GATWNNHT) P(G)P(A)P(T)2P(W)P(N)2P(H)
0.150.350.3520.70.85 0.00383 - Justify the answer in term of combinations of
events. - P(W) P(A) P(T) because A and T are mutually
exclusive at a same position - P(N) 1 because A, C, G, T are complementary
events (there is no other possible event) - P(H) 1 - P(G) because H is the complementary
event of G (not G) - P(GATWNNHT) P(G)P(A)P(T)2P(W)P(N)2P(H) because
at the nucleotides found at successive positions
are mutually independent -gt the joined
probability is the product of probabilities.
42Answer to question 2
- The Hypergeometric distribution. The P-value is
calculated with the inverse CDF (the right tail
of the distribution). - See below.
- Terms
- n160 number of labelled elements (genomes with
an ortholog for gene A) - n220 number of unlabelled elements (genomes
without ortholog for gene A) - n50 the size of the selection (genomes with an
ortholog for gene B). - x40 the number of labelled elements in the
selection (genomes with orthologs for both A and
B) - See below.
- We searched the orthologs of two genes (A and B)
in 80 bacterial genomes. - An ortholog was found for gene A in 50 genomes.
- An ortholog was found for gene B in 60 genomes.
- Among these, 40 genomes contain orthologs for
both A and B. - Can we consider that A and B are found in common
in a significant number of genomes ? - Give the name of the distribution which allows to
model this type of situation. - Indicate the general formula which would allow to
estimate this significance (the P-value). - Explain what each term of this formula
represents. - In the formula, replace each variable by the
appropriate number(you dont need to calculate
the final value). - What does the P-value represent ? How would you
interpret a P-value of 14 in terms of risks ?