Jacques van Helden (Jacques.van.Helden@ulb.ac.be) - PowerPoint PPT Presentation

1
Statistics for bioinformatics
  • Cuernavaca - Introduccion a bioinformatica

2
Combinatorial analysis
  • Statistics Applied to Bioinformatics

3
Problem - oligomers
  • How many oligomers contain exactly a single
    occurrence of each monomer, for oligonucleotides
    and oligopeptides, respectively ?

4
Permutations within a set - the factorial
  • How many distinct permutations can be made from a
    set of x elements ?
  • x = 2: 2 × 1 = 2
  • x = 3: 3 × 2 × 1 = 6
  • x = 4: 4 × 3 × 2 × 1 = 24
  • any x: x × (x-1) × ... × 1 = x!
  • The factorial x! represents the number of
    possible permutations between x objects.
  • Solution to the problem of oligomers:
  • There are 4! = 24 distinct oligonucleotides with a
    single occurrence of each nucleotide (A, C, G, T).
  • There are 20! ≈ 2.4 × 10^18 distinct oligopeptides
    with a single occurrence of each amino acid.
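Both counts can be checked numerically. A minimal sketch using Python's standard library (the variable names are illustrative, not from the slides):

```python
import itertools
import math

nucleotides = ["A", "C", "G", "T"]

# x distinct objects can be arranged in x! different orders.
n_oligonucleotides = math.factorial(len(nucleotides))
n_oligopeptides = math.factorial(20)  # 20 amino acids

# Cross-check the small case by explicit enumeration.
enumerated = len(list(itertools.permutations(nucleotides)))

print(n_oligonucleotides, enumerated)  # 24 24
print(f"{n_oligopeptides:.1e}")        # 2.4e+18
```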

5
Problem - Selection of a subset of elements
  • A genome contains n = 6000 genes.
  • We select a series of genes in the following way:
  • Once a gene has been selected, it cannot be
    selected again (no replacement).
  • We are not interested in the order of the
    selection: if A and B were selected, we do not
    consider whether A came out in first or in second
    position.
  • How many possibilities do we have to select
  • 1 gene ?
  • 2 genes ?
  • 3 genes ?
  • x genes ?

6
Selection of a subset of elements
  • Number of possible outcomes (ordered selections
    without replacement): $n (n-1) \cdots (n-x+1) = \frac{n!}{(n-x)!}$
  • n = size of the set
  • x = size of the subset
  • Possible permutations among the elements of a
    subset: $x!$
  • Number of distinct selections (orderless):
    $C_n^x = \frac{n!}{x! \, (n-x)!}$
  • The coefficient $C_n^x$ represents the number of
    distinct choices of x elements among n. For this
    reason, it is called "Choose x among n". It is
    also called the binomial coefficient (we will see
    later why).
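For the genome of n = 6000 genes, the counts for x = 1, 2, 3 can be obtained as follows. This is a sketch relying on `math.perm` and `math.comb` (Python 3.8 or later); the dictionary name is illustrative.

```python
import math

n = 6000  # size of the set (genes in the genome)

selections = {}
for x in (1, 2, 3):
    ordered = math.perm(n, x)    # n! / (n - x)!  (order matters)
    orderless = math.comb(n, x)  # n! / (x! (n - x)!)  ("choose x among n")
    # Each orderless selection corresponds to x! ordered ones.
    assert ordered == orderless * math.factorial(x)
    selections[x] = orderless

print(selections)
```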

7
Elements of Probabilities
  • Statistics Applied to Bioinformatics

8
Frequential definition of probability
  • Trial
  • a random position is selected in the genome of
    Mycoplasma genitalium
  • Event
  • If the nucleotide at this position is an adenine
    (A), the trial is considered as a success
  • 1,000,000 successive trials were performed
  • The frequency of success is plotted as a function
    of the number of trials
  • The frequency progressively converges towards the
    value of 0.35
  • The probability is the value that would be
    obtained with an infinite number of trials
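The convergence can be imitated by simulation. The genome itself is not available here, so this sketch assumes a hypothetical random source in which each trial succeeds with probability 0.35; the function name and the seed are illustrative.

```python
import random

def success_frequencies(n_trials, p_success=0.35, seed=0):
    """Running frequency of successes after each of n_trials trials."""
    rng = random.Random(seed)  # fixed seed, so the run is reproducible
    successes = 0
    frequencies = []
    for trial in range(1, n_trials + 1):
        if rng.random() < p_success:  # "the selected nucleotide is an A"
            successes += 1
        frequencies.append(successes / trial)
    return frequencies

freqs = success_frequencies(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(freqs[n - 1], 4))
```

The frequency fluctuates strongly over the first trials and settles near the assumed probability as the number of trials grows.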

Selection of random nucleotides in a genome
9
Mutually exclusive events
P(A ∧ B) = 0, hence P(A ∨ B) = P(A) + P(B)
Where ∨ is the logical OR and ∧ is the logical AND
  • E.g. calculating degenerate nucleotide
    probability
  • P(W) = P(A ∨ T) = P(A) + P(T)
  • P(S) = P(C ∨ G) = P(C) + P(G)

10
Complementary events
  • E.g. coding / non-coding sequences in
    Mycoplasma genitalium
  • P(coding) = 0.902
  • P(coding ∨ non-coding) = 1 ⇒ P(non-coding) = 1 -
    P(coding) = 0.098
  • Example: probability of the degenerate nucleotide
    N (in this example we combine the properties of
    complementarity and mutual exclusion)
  • P(N) = P(A) + P(T) + P(C) + P(G) = 1

11
Stochastic independence
  • Two events A and B are said to be stochastically
    independent when
  • P(A|B) = P(A|¬B) = P(A)
  • P(B|A) = P(B|¬A) = P(B)
  • For stochastically independent events, the joint
    probability is the product of probabilities
  • P(A ∧ B) = P(A) · P(B)
  • E.g. calculating oligonucleotide probability
    with a model of independent succession of
    nucleotides
  • P(GATAAG) = P(G) · P(A) · P(T) · P(A) · P(A) ·
    P(G)
  • Note: this is not appropriate for biological
    sequences, where there are strong dependencies
    between neighbouring nucleotides (see next slide)
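The product rule for GATAAG can be written as a small function. This sketch assumes equiprobable nucleotides (0.25 each), which the slide does not specify; the function name is illustrative.

```python
import math

def oligo_probability(oligo, nucleotide_probs):
    """Probability of an oligonucleotide when successive nucleotides
    are treated as independent events."""
    return math.prod(nucleotide_probs[base] for base in oligo)

# Assumed prior probabilities (equiprobable nucleotides).
equiprobable = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

p_gataag = oligo_probability("GATAAG", equiprobable)
print(p_gataag)  # 0.25 ** 6
```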

12
Not necessarily independent events
P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
P(A ∨ B ∨ C) = P(A) + P(B) + P(C) - P(A ∧ B)
- P(A ∧ C) - P(B ∧ C) + P(A ∧ B ∧ C)
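Both formulas can be verified on a small finite example. The sample space and the three events below are hypothetical, chosen only so that the events overlap.

```python
from fractions import Fraction

# Hypothetical sample space: 12 equally likely outcomes, numbered 0..11.
space = set(range(12))
A = {n for n in space if n % 2 == 0}  # even outcomes
B = {n for n in space if n % 3 == 0}  # multiples of 3
C = {n for n in space if n < 4}       # outcomes 0..3

def prob(event):
    """Probability of an event as an exact fraction."""
    return Fraction(len(event), len(space))

# Two events: subtract the overlap, which would otherwise be counted twice.
lhs2 = prob(A | B)
rhs2 = prob(A) + prob(B) - prob(A & B)

# Three events: subtract pairwise overlaps, add back the triple overlap.
lhs3 = prob(A | B | C)
rhs3 = (prob(A) + prob(B) + prob(C)
        - prob(A & B) - prob(A & C) - prob(B & C)
        + prob(A & B & C))

print(lhs2, rhs2)  # 2/3 2/3
print(lhs3, rhs3)  # 3/4 3/4
```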
13
Probabilities - Exercises
  • Assuming independently and identically
    distributed nucleotides, calculate the
    probability of observing a hexanucleotide
    containing exactly 4 As and 2 Cs, irrespective of
    the relative positions of these residues.

14
Probabilities - Exercises
  • Assuming a DNA sequence with independently and
    identically distributed nucleotides, calculate
    the probabilities of the following
    oligonucleotides.
  • A
  • AA
  • AAAA
  • AAAAAA
  • CCCCCC
  • CACACA
  • CANNTG
  • CACGTK
  • Calculate the probabilities of the same
    oligonucleotides, assuming that nucleotides are
    independently distributed, but have the following
    prior probabilities:
  • P(A) = 0.31, P(T) = 0.29, P(C) = 0.19, P(G) = 0.21

15
Set comparisons
  • Statistics for bioinformatics

16
Methionine Biosynthesis in S.cerevisiae
17
The "bio-ontologies"
  • Answer to the problem of inconsistencies in
    database annotations
  • Controlled vocabulary
  • Hierarchical classification between the terms of
    the controlled vocabulary
  • E.g. The Gene Ontology
  • molecular function ontology
  • process ontology
  • cellular component ontology

18
Gene ontology processes
19
Gene ontology molecular functions
20
Gene ontology cellular components
21
Fragment of the GO annotations for the yeast
Saccharomyces cerevisiae
22
Problem - selection within a set with classes
  • A given organism has 6,000 genes, among which 40
    are involved in methionine metabolism.
  • A set of 10 genes are co-regulated in a
    microarray experiment.
  • Among them, 6 are related to methionine
    metabolism.
  • What would be the probability to observe such a
    correspondence by chance alone ?

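[Venn diagram: within the genome (6000 genes), the Methionine set and the Co-regulated set share 6 genes; 34 genes are methionine only and 4 are co-regulated only.]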
23
Selection within a set with classes
  • Let us define
  • g = 6000: number of genes
  • m = 40: genes involved in methionine metabolism
  • n = 5960: genes not involved in methionine
    metabolism
  • k = 10: number of genes in the cluster
  • x = 6: number of methionine genes in the cluster
  • We calculate the number of possibilities for the
    following selections
  • C1: 10 distinct genes among 6,000:
    $C_1 = C_g^k = C_{6000}^{10}$
  • C2: 6 distinct genes among the 40 involved in
    methionine: $C_2 = C_m^x = C_{40}^{6}$
  • C3: 4 genes among the 5960 which are not involved
    in methionine: $C_3 = C_n^{k-x} = C_{5960}^{4}$
  • C4: 6 methionine and 4 non-methionine genes:
    $C_4 = C_2 \cdot C_3$
  • Probability to have exactly 6 methionine genes
    within a selection of 10:
    $P(X = 6) = \frac{C_4}{C_1} = \frac{C_m^x \, C_n^{k-x}}{C_g^k}$
  • Probability to have at least 6 methionine genes
    within a selection of 10:
    $P(X \ge 6) = \sum_{i=6}^{10} \frac{C_m^i \, C_n^{k-i}}{C_g^k}$
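The four counts and the two probabilities can be computed with exact integer arithmetic. This is a sketch: the variable names follow the definitions above, and the ratio of counts is the standard hypergeometric reasoning (favourable selections over possible selections), reconstructed here because the slide's formulas are not in the transcript.

```python
import math

g, m, k, x = 6000, 40, 10, 6
n = g - m  # 5960 genes not involved in methionine metabolism

c1 = math.comb(g, k)      # 10 distinct genes among 6,000
c2 = math.comb(m, x)      # 6 genes among the 40 methionine genes
c3 = math.comb(n, k - x)  # 4 genes among the 5,960 other genes
c4 = c2 * c3              # 6 methionine AND 4 non-methionine genes

# Exactly 6 methionine genes: favourable / possible selections.
p_exactly_6 = c4 / c1

# At least 6: sum over 6, 7, ..., 10 methionine genes.
p_at_least_6 = sum(
    math.comb(m, i) * math.comb(n, k - i) for i in range(x, k + 1)
) / c1

print(f"P(X = 6)  = {p_exactly_6:.3e}")
print(f"P(X >= 6) = {p_at_least_6:.3e}")
```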

24
The hypergeometric distribution
  • The hypergeometric distribution represents the
    probability to observe x successes in a sampling
    without replacement
  • m = number of possible successes
  • n = number of possible failures
  • k = sample size
  • x = number of successes in the sample
  • The shape of the distribution depends on the
    ratio between m and n
  • m << n: i-shaped
  • m ≈ n: bell-shaped
  • m >> n: j-shaped
  • The distribution is bounded on both sides (0 ≤ x
    ≤ k)
  • Statistical parameters
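The three shapes can be illustrated by locating the mode of the distribution for different ratios between m and n. In this sketch the values of m, n and k are arbitrary examples, and the function names are illustrative.

```python
import math

def hypergeom_pmf(x, m, n, k):
    """P(X = x): x successes in a sample of size k drawn without
    replacement from m possible successes and n possible failures."""
    return math.comb(m, x) * math.comb(n, k - x) / math.comb(m + n, k)

def distribution(m, n, k):
    """Probabilities for x = 0, 1, ..., k."""
    return [hypergeom_pmf(x, m, n, k) for x in range(k + 1)]

k = 10
shapes = {
    "m << n": distribution(10, 1000, k),  # expected to be i-shaped
    "m = n": distribution(500, 500, k),   # expected to be bell-shaped
    "m >> n": distribution(1000, 10, k),  # expected to be j-shaped
}

modes = {}
for label, pmf in shapes.items():
    # The mode is the value of x with the highest probability.
    modes[label] = max(range(k + 1), key=lambda x: pmf[x])
    print(label, "mode at x =", modes[label])
```

An i-shaped distribution peaks at x = 0, a j-shaped one at x = k, and a bell-shaped one in between.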

25
Multi-testing corrections
  • Statistics Applied to Bioinformatics

26
Bonferroni rule
  • Multi-testing
  • Assessing the significance of each gene on a chip
    represents thousands of simultaneous tests. Let N
    be the number of genes.
  • The risk of error (P-value) associated with each
    gene will thus be challenged N times.
  • The significance thresholds used for single
    testing (0.01, 0.001) are thus likely to return
    many false positives.
  • Bonferroni rule
  • Adapt the threshold to the number of simultaneous
    tests.
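A minimal sketch of the adapted threshold, assuming the usual form of the rule (divide the single-test threshold by the number of tests). The slide's own formula is not in the transcript, and the P-values below are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold that keeps the overall risk near alpha."""
    return alpha / n_tests

n_genes = 6000  # N simultaneous tests (one per gene)
threshold = bonferroni_threshold(0.01, n_genes)

# Hypothetical P-values for three genes.
p_values = {"geneA": 2e-7, "geneB": 5e-4, "geneC": 0.009}
significant = [gene for gene, p in p_values.items() if p <= threshold]

print(f"{threshold:.2e}", significant)
```

With 6000 tests, only geneA passes, although all three P-values are below the single-test threshold of 0.01.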

27
E-value
  • An alternative but equivalent way to treat the
    problem of multi-testing is to calculate the
    expected value (E-value), i.e. the expected number
    of false positives, for each observation.
  • One can then choose the threshold on the E-value
    according to the number of false positives
    considered as acceptable.
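A sketch of the computation, assuming the usual definition E-value = N × P-value (the slide's formula is not in the transcript; the numbers are hypothetical):

```python
def e_value(p_value, n_tests):
    """Expected number of false positives when n_tests are performed
    and every test with a P-value this small is declared positive."""
    return p_value * n_tests

n_genes = 6000
print(e_value(0.001, n_genes))  # about 6 false positives expected
print(e_value(1e-6, n_genes))   # about 0.006 false positives expected
```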

28
Family-wise Error Rate (FWER)
  • Another correction for multiple testing consists
    of estimating the probability of observing at least
    one false positive in the whole set of tests.
    This probability can be calculated quite easily
    from the P-value (Pval).
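A sketch of that calculation, assuming N independent tests, in which case FWER = 1 - (1 - Pval)^N. The independence assumption and the function name are not from the slides.

```python
import math

def fwer(p_value, n_tests):
    """Probability of at least one false positive among n_tests
    independent tests, each with risk p_value."""
    # 1 - (1 - p)^N, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_tests * math.log1p(-p_value))

n_genes = 6000
print(round(fwer(0.001, n_genes), 4))  # close to 1: a false positive is almost certain
print(round(fwer(1e-6, n_genes), 4))   # close to the E-value (0.006) when it is small
```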

29
False Discovery Rate (FDR)
  • Yet another approach is to consider, for a given
    threshold on P-value, the False Discovery Rate,
    i.e. the proportion of false predictions within a
    set of predictions.
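One common way to estimate this proportion is to divide the expected number of false positives (N × P-value) by the number of predictions retained at that threshold. This is the ratio used in the Benjamini-Hochberg procedure. It is an assumption here, since the slide's formula is not in the transcript, and the P-values are hypothetical.

```python
def estimated_fdr(p_values):
    """For each P-value taken as threshold, estimate the FDR as
    (expected false positives) / (number of predictions)."""
    n_tests = len(p_values)
    ranked = sorted(p_values)
    estimates = []
    for rank, p in enumerate(ranked, start=1):
        expected_false_positives = n_tests * p
        # 'rank' tests have a P-value <= p, so 'rank' predictions are made.
        estimates.append((p, min(1.0, expected_false_positives / rank)))
    return estimates

p_values = [0.0001, 0.0004, 0.002, 0.03, 0.2, 0.5, 0.7, 0.9]  # hypothetical
fdr_table = estimated_fdr(p_values)
for p, fdr in fdr_table:
    print(f"threshold {p:<7} estimated FDR {fdr:.4f}")
```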

30
Summary - Multi-testing corrections
  • Bonferroni rule: adapt significance threshold
  • E-value: expected number of false positives
  • FWER (family-wise error rate): probability to
    observe at least one false positive
  • FDR (false discovery rate): estimated rate of
    false positives among the predictions

31
Compare-classes result
32
Application comparing many gene sets with many
gene sets
  • Statistics Applied to Bioinformatics

33
RegulonDB factor -> gene network
  • RegulonDB (Oct. 2005 version)
  • The graph represents the relationships between
    factors and their target genes (factor -> gene
    graph)
  • 125 transcription factors
  • 467 target genes
  • 847 factor -> gene interactions
  • 45 self-regulations
  • Note: CRP alone regulates 132 target genes.

Factor-gene graph
34
Application yeast protein complexes
  • High-throughput experiments led to the
    identification of thousands of protein complexes
    in the yeast Saccharomyces cerevisiae.
  • Note: these high-throughput experiments are
    likely to contain a certain rate of noise
  • Question
  • can we associate specific functions to these
    complexes ?
  • Approach
  • Compare the gene composition of each complex with
    the functional classes associated in the Gene
    Ontology.

35
Yeast complexes versus GO
36
Questions
  • Can we detect pairs of transcription factors
    which co-regulate a significant number of genes ?
  • Are these factors (and their common target genes)
    involved in particular biological functions ?

37
Regulons versus regulons
gene    factor  GI
alkA    Ada     1788383
ada     Ada     1788542
aidB    Ada     1790630
adiA    AdiY    1790558
agaZ    AgaR    1789520
agaS    AgaR    1789525
alsR    AlsR    1790527
alsI    AlsR    1790528
hyaA    AppY    1787206
appC    AppY    1787212
araB    AraC    1786249
araC    AraC    1786251
araJ    AraC    1786595
araF    AraC    1788211
araE    AraC    1789207
caiT    ArcA    1786224
lpdA    ArcA    1786307
betI    ArcA    1786505
betT    ArcA    1786506
cyoA    ArcA    1786635
dcuC    ArcA    1786839
gltA    ArcA    1786939
sdhC    ArcA    1786940
sucA    ArcA    1786945
cydA    ArcA    1786953
moeA    ArcA    1787049
focA    ArcA    1787132
hyaA    ArcA    1787206
ptsG    ArcA    1787343
ndh     ArcA    1787352
icdA    ArcA    1787381
hemA    ArcA    1787461
acnA    ArcA    1787531
tpx     ArcA    1787584
...
38
Regulons versus functional classes
gene    factor  GI
alkA    Ada     1788383
ada     Ada     1788542
aidB    Ada     1790630
adiA    AdiY    1790558
agaZ    AgaR    1789520
agaS    AgaR    1789525
alsR    AlsR    1790527
alsI    AlsR    1790528
hyaA    AppY    1787206
appC    AppY    1787212
araB    AraC    1786249
araC    AraC    1786251
araJ    AraC    1786595
araF    AraC    1788211
araE    AraC    1789207
caiT    ArcA    1786224
lpdA    ArcA    1786307
betI    ArcA    1786505
betT    ArcA    1786506
cyoA    ArcA    1786635
dcuC    ArcA    1786839
gltA    ArcA    1786939
sdhC    ArcA    1786940
sucA    ArcA    1786945
cydA    ArcA    1786953
moeA    ArcA    1787049
focA    ArcA    1787132
hyaA    ArcA    1787206
ptsG    ArcA    1787343
ndh     ArcA    1787352
icdA    ArcA    1787381
hemA    ArcA    1787461
acnA    ArcA    1787531
tpx     ArcA    1787584
...
39
Microarray groups (Gasch, 2000) versus Gene
Ontology
40
Exercises
  • The nucleotide frequencies were computed for a
    given genome: F(A) = F(T) = 0.35, F(G) = F(C) =
    0.15
  • What is the probability to observe the following
    motif GATWNNHT at a given position of this genome
    ?
  • W means A or T
  • H means not G
  • N means any nucleotide
  • Justify the answer in terms of combinations of
    events.
  • We searched the orthologs of two genes (A and B)
    in 80 bacterial genomes.
  • An ortholog was found for gene A in 50 genomes.
  • An ortholog was found for gene B in 60 genomes.
  • Among these, 40 genomes contain orthologs for
    both A and B.
  • Can we consider that A and B are found in common
    in a significant number of genomes ?
  • Give the name of the distribution that models
    this type of situation.
  • Indicate the general formula that would allow you
    to estimate this significance (the P-value).
  • Explain what each term of this formula
    represents.
  • In the formula, replace each variable by the
    appropriate number (you don't need to calculate
    the final value).
  • What does the P-value represent ? How would you
    interpret a P-value of 15 in terms of risks ?

41
Answer to the first question
  • The nucleotide frequencies were computed for a
    given genome: F(A) = F(T) = 0.35, F(G) = F(C) =
    0.15
  • What is the probability to observe the following
    motif GATWNNHT at a given position of this genome
    ?
  • P(W) = P(A) + P(T) = 0.7
  • P(N) = 1
  • P(H) = 1 - P(G) = 0.85
  • P(GATWNNHT) = P(G) · P(A) · P(T)^2 · P(W) · P(N)^2 · P(H)
    = 0.15 × 0.35 × 0.35^2 × 0.7 × 1^2 × 0.85 = 0.00383
  • Justify the answer in terms of combinations of
    events.
  • P(W) = P(A) + P(T) because A and T are mutually
    exclusive at a same position
  • P(N) = 1 because A, C, G, T are complementary
    events (there is no other possible event)
  • P(H) = 1 - P(G) because H is the complementary
    event of G (not G)
  • P(GATWNNHT) = P(G) · P(A) · P(T)^2 · P(W) · P(N)^2 · P(H)
    because the nucleotides found at successive
    positions are mutually independent -> the joint
    probability is the product of probabilities.
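The same calculation can be sketched in code. Each degenerate symbol is expanded into the nucleotides it stands for (sum over mutually exclusive events), then the positions are multiplied (independent events). The table covers only a subset of the IUPAC codes, and the function name is illustrative.

```python
import math

# Nucleotide frequencies given in the exercise.
freqs = {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15}

# Subset of the IUPAC degenerate codes: symbol -> nucleotides it matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT",   # A or T
    "S": "CG",   # C or G
    "K": "GT",   # G or T
    "H": "ACT",  # not G
    "N": "ACGT", # any nucleotide
}

def motif_probability(motif, freqs):
    """Probability of a (possibly degenerate) motif at a given position,
    assuming independent successive nucleotides."""
    return math.prod(sum(freqs[base] for base in IUPAC[symbol])
                     for symbol in motif)

p_motif = motif_probability("GATWNNHT", freqs)
print(round(p_motif, 5))  # 0.00383
```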

42
Answer to question 2
  • The hypergeometric distribution. The P-value is
    calculated with the complementary CDF (the right
    tail of the distribution).
  • See below.
  • Terms
  • n1 = 60: number of labelled elements (genomes with
    an ortholog for gene B)
  • n2 = 20: number of unlabelled elements (genomes
    without ortholog for gene B)
  • n = 50: the size of the selection (genomes with an
    ortholog for gene A).
  • x = 40: the number of labelled elements in the
    selection (genomes with orthologs for both A and
    B)
  • See below.
  • We searched the orthologs of two genes (A and B)
    in 80 bacterial genomes.
  • An ortholog was found for gene A in 50 genomes.
  • An ortholog was found for gene B in 60 genomes.
  • Among these, 40 genomes contain orthologs for
    both A and B.
  • Can we consider that A and B are found in common
    in a significant number of genomes ?
  • Give the name of the distribution that models
    this type of situation.
  • Indicate the general formula that would allow you
    to estimate this significance (the P-value).
  • Explain what each term of this formula
    represents.
  • In the formula, replace each variable by the
    appropriate number (you don't need to calculate
    the final value).
  • What does the P-value represent ? How would you
    interpret a P-value of 14 in terms of risks ?
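The right tail for this exercise can be evaluated numerically. This is a sketch using the standard hypergeometric formula with the counts from the exercise (60 labelled genomes, 20 unlabelled, a selection of 50, and 40 labelled genomes in the selection); the helper name is illustrative.

```python
import math

n1 = 60  # labelled elements
n2 = 20  # unlabelled elements
n = 50   # size of the selection
x = 40   # labelled elements observed in the selection

def hypergeom_pmf(i):
    """P(X = i): exactly i labelled elements in the selection."""
    return math.comb(n1, i) * math.comb(n2, n - i) / math.comb(n1 + n2, n)

# Right tail P(X >= x). X cannot exceed min(n, n1).
p_value = sum(hypergeom_pmf(i) for i in range(x, min(n, n1) + 1))

# Sanity check: the probabilities over all feasible values sum to 1.
total = sum(hypergeom_pmf(i) for i in range(max(0, n - n2), min(n, n1) + 1))

print(f"P(X >= {x}) = {p_value:.4f}")
```

The expected overlap by chance alone is 50 × 60 / 80 = 37.5 genomes, so an observed overlap of 40 is only slightly above expectation.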