CSE182-L9 - PowerPoint PPT Presentation

Slides: 40
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

Title: CSE182-L9


1
CSE182-L9
  • Gene Finding (DNA signals)
  • Genome Sequencing and assembly

2
An HMM for Gene structure
3
Gene Finding via HMMs
  • Gene finding can be interpreted as a dynamic programming (d.p.)
    approach that threads the genomic sequence through
    the states of a gene HMM.
  • Exon states: Einit (initial), Efin (final), Emid (internal)
  • Other states: I (intron), IG (intergenic)

[Figure: gene HMM with states IG, I, Einit, Emid, Efin. Note: not all links are shown here.]
4
Generalized HMMs, and other refinements
  • A probabilistic model for each of the states (e.g.,
    exon, splice site) needs to be described.
  • In standard HMMs, the time spent in a state follows a
    geometric distribution (the discrete analogue of an
    exponential distribution).
  • Many states of the gene structure HMM violate this.
    The solution is to model these states using
    generalized HMMs.
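
The duration claim above can be checked with a minimal sketch. It assumes a state that loops back to itself with a fixed probability; the value `p = 0.9` is a hypothetical choice, not taken from the slides.

```python
def duration_pmf(self_loop_prob, d):
    """P(staying exactly d steps) in a state with a fixed self-loop probability."""
    return self_loop_prob ** (d - 1) * (1.0 - self_loop_prob)

p = 0.9  # hypothetical self-loop probability
pmf = [duration_pmf(p, d) for d in range(1, 6)]

# Each extra step multiplies the probability by the same factor p,
# so the distribution always decays from d = 1 and has no interior peak.
ratios = [pmf[k + 1] / pmf[k] for k in range(len(pmf) - 1)]
mean_duration = 1.0 / (1.0 - p)
```

A distribution of this shape is always largest at a duration of one step, which is why states whose lengths peak at some typical size need an explicit duration model.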

5
Length distributions of Introns & Exons
6
Generalized HMM for gene finding
  • Each state also emits a duration for which it
    will cycle in the same state. The time is
    generated according to a random process that
    depends on the state.

7
Forward algorithm for gene finding
[Figure: a segment of the sequence from position i to position j, emitted in state qk]
Duration Prob.: Probability that you stayed in
state qk for j-i+1 steps
Emission Prob.: Probability that you emitted
Xi..Xj in state qk (given by the 5th order
Markov model)
Forward Prob.: Probability that you emitted i
symbols and ended up in state qk
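
One way to combine the three quantities above is sketched below, under stated assumptions: emissions are independent per symbol (a simplification of the 5th order Markov model on the slide), durations are given as explicit tables, and the names `ghmm_forward`, `start`, `trans`, `duration`, and `emit` are hypothetical. The two-state toy model is invented for illustration only.

```python
def ghmm_forward(x, states, start, trans, duration, emit):
    """Forward table for a generalized HMM.

    F[k][j] = probability of emitting x[:j] with a segment of state k ending at j.
    """
    n = len(x)
    F = {k: [0.0] * (n + 1) for k in states}
    for j in range(1, n + 1):
        for k in states:
            total = 0.0
            for i in range(1, j + 1):            # segment covers x[i-1:j]
                p_dur = duration[k].get(j - i + 1, 0.0)
                if p_dur == 0.0:
                    continue
                p_emit = 1.0                     # assumption: independent symbols
                for sym in x[i - 1:j]:
                    p_emit *= emit[k][sym]
                if i == 1:
                    p_prev = start[k]            # segment opens the sequence
                else:
                    p_prev = sum(F[l][i - 1] * trans[l][k] for l in states)
                total += p_prev * p_dur * p_emit
            F[k][j] = total
    return F

# Toy model (hypothetical): exon-like state E alternates with intron-like state I.
states = ["E", "I"]
start = {"E": 1.0, "I": 0.0}
trans = {"E": {"E": 0.0, "I": 1.0}, "I": {"E": 1.0, "I": 0.0}}
duration = {"E": {1: 0.5, 2: 0.5}, "I": {1: 1.0}}
uniform = {b: 0.25 for b in "ACGT"}
emit = {"E": uniform, "I": uniform}

F = ghmm_forward("AC", states, start, trans, duration, emit)
total = sum(F[k][2] for k in states)
```

Compared with a standard HMM, the extra loop over the segment start `i` is the cost of modelling durations explicitly.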
8
De novo Gene prediction Summary
  • Various signals distinguish coding regions from
    non-coding
  • HMMs are a reasonable model for Gene structures,
    and provide a uniform method for combining
    various signals.
  • Further improvement may come from improved signal
    detection

9
DNA Signals
  • Coding versus non-coding
  • Splice Signals
  • Translation start

10
DNA signal example
  • The donor site marks the junction where an exon
    ends, and an intron begins.
  • For gene finding, we are interested in computing
    a probability
  • D_i = Prob[Donor site at position i]
  • Approach: Collect a large number of donor sites,
    align, and look for a signal.

11
PWMs
Position: -3 -2 -1 +1 +2 +3 +4 +5 +6
AAGGTGAGT
CCGGTAAGT
GAGGTGAGG
TAGGTAAGG
  • Fixed length for the splice signal.
  • Each position is generated independently
    according to a distribution
  • Figure shows data from > 1200 donor sites
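
A minimal sketch of a position weight matrix, built from only the four example donor sites listed on this slide (a real matrix would use the full collection of sites). The helper names `build_pwm` and `pwm_probability` are hypothetical, and no pseudocounts are applied.

```python
from collections import Counter

donor_sites = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]

def build_pwm(seqs, alphabet="ACGT"):
    """One frequency table per position, estimated from aligned sequences."""
    pwm = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        pwm.append({b: counts[b] / len(seqs) for b in alphabet})
    return pwm

def pwm_probability(pwm, seq):
    """Probability of seq when each position is generated independently."""
    p = 1.0
    for column, base in zip(pwm, seq):
        p *= column[base]
    return p

pwm = build_pwm(donor_sites)
p_site = pwm_probability(pwm, "AAGGTGAGT")
```

With so few sequences, any base unseen at a position gets probability zero, so a practical version would likely add pseudocounts.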

12
Improvements to signal detection
  • Pr[GGTA is a donor site]?
  • 0.5 × 0.5
  • Pr[CGTA is a donor site]?
  • 0.5 × 0.5
  • Is something wrong with this explanation?

Observed sites: GGTA GGTA GGTA GGTA CGTG CGTG CGTG CGTG
13
MDD
  • PWMs do not capture correlations between
    positions
  • Many position pairs in the Donor signal are
    correlated

14
Maximal Dependence Decomposition
  • Choose the position i which has the highest
    correlation score.
  • Split the sequences into two sets: those which have the
    consensus at position i, and the remaining.
  • Recurse until <terminating conditions>
  • Stop if the set of sequences is small enough
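
One split step of the procedure above can be sketched as follows. The slides do not say how the correlation score is computed; this sketch assumes a chi-square statistic between "consensus at position i?" and the base at every other position, summed over those positions. The function names and the eight toy sequences are illustrative assumptions.

```python
from collections import Counter

def consensus(seqs, i):
    """Most frequent base at position i (ties go to the first one seen)."""
    return Counter(s[i] for s in seqs).most_common(1)[0][0]

def chi_square(seqs, i, j, alphabet="ACGT"):
    """Chi-square statistic between 'consensus at i?' and the base at j."""
    cons = consensus(seqs, i)
    n = len(seqs)
    stat = 0.0
    for is_cons in (True, False):
        row = [s for s in seqs if (s[i] == cons) == is_cons]
        for b in alphabet:
            col_total = sum(1 for s in seqs if s[j] == b)
            expected = len(row) * col_total / n
            if expected > 0:
                observed = sum(1 for s in row if s[j] == b)
                stat += (observed - expected) ** 2 / expected
    return stat

def dependence_scores(seqs):
    """For each position i, the summed statistic against all other positions."""
    length = len(seqs[0])
    return [sum(chi_square(seqs, i, j) for j in range(length) if j != i)
            for i in range(length)]

def mdd_split(seqs):
    """Pick the most dependent position and split on its consensus base."""
    scores = dependence_scores(seqs)
    i = max(range(len(scores)), key=scores.__getitem__)
    cons = consensus(seqs, i)
    return i, [s for s in seqs if s[i] == cons], [s for s in seqs if s[i] != cons]

toy_sites = ["GGTA"] * 4 + ["CGTG"] * 4
scores = dependence_scores(toy_sites)
split_pos, with_cons, rest = mdd_split(toy_sites)
```

On the toy data the first and last positions are perfectly correlated, so the split separates the GGTA sites from the CGTG sites, and a separate model can then be fitted to each subset.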

15
MDD for Donor sites
16
Gene prediction Summary
  • Various signals distinguish coding regions from
    non-coding
  • HMMs are a reasonable model for Gene structures,
    and provide a uniform method for combining
    various signals.
  • Further improvement may come from improved signal
    detection

17
How many genes do we have?
Nature
Science
18
Alternative splicing
19
Comparative methods
  • Gene prediction is harder with alternative
    splicing.
  • One approach might be to use comparative methods
    to detect genes
  • Given a similar mRNA/protein (from another
    species, perhaps?), can you find the best parse
    of a genomic sequence that matches that target
    sequence
  • Yes, with a variant on alignment algorithms that
    penalize separately for introns, versus other
    gaps.
  • There is a genome sequencing project for a
    different Hirudo species. You could compare the
    Hirudo ESTs against the genome to do gene finding.

20
Comparative gene finding tools
  • Procrustes/Sim4: mRNA vs. genomic
  • Genewise: proteins versus genomic
  • CEM: genomic versus genomic
  • Twinscan: Combines comparative and de novo
    approach.
  • Mass Spec related?
  • Later in the class we will consider mass
    spectrometry data.
  • Can we use this data to identify genes in
    eukaryotic genomes? (Research project)

21
Databases
  • RefSeq and other databases maintain sequences of
    full-length transcripts/genes.
  • We can query using sequence.

22
Course
Gene finding
  • Sequence Comparison (BLAST & other tools)
  • Protein Motifs
  • Profiles/Regular Expression/HMMs
  • Discovering protein coding genes
  • Gene finding HMMs
  • DNA signals (splice signals)
  • How is the genomic sequence itself obtained?

ESTs
Protein sequence analysis
23
Silly Quiz
  • Who are these people, and what is the occasion?

24
Genome Sequencing and Assembly
25
DNA Sequencing
  • DNA is double-stranded
  • The strands are separated, and a polymerase is
    used to copy the second strand.
  • Special bases terminate this process early.

26
Sequencing
  • A break at T is shown here.
  • Measuring the lengths using electrophoresis
    allows us to get the position of each T
  • The same can be done with every nucleotide.
    Fluorescent labeling can help separate different
    nucleotides
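
The idea of reading positions from fragment lengths can be illustrated with a small sketch. It is a simplification: it ignores that the copied strand is the complement of the template, and the sequence `GATTACA` and the function names are hypothetical.

```python
def terminated_lengths(template, base):
    """Lengths of copies that stop at each occurrence of `base` (1-based)."""
    return [pos + 1 for pos, b in enumerate(template) if b == base]

def read_sequence(lanes):
    """Rebuild the sequence by sorting all fragment lengths across the four bases."""
    by_length = {length: base
                 for base, lengths in lanes.items()
                 for length in lengths}
    return "".join(by_length[n] for n in sorted(by_length))

template = "GATTACA"  # hypothetical example sequence
lanes = {base: terminated_lengths(template, base) for base in "ACGT"}
recovered = read_sequence(lanes)
```

Here `lanes["T"]` holds the lengths 3 and 4, which are exactly the positions of T in the example.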

27
  • Automated detectors read the terminating bases.
  • The signal decays after 1000 bases.

28
Sequencing Genomes Clone by Clone
  • Clones are constructed to span the entire length
    of the genome.
  • These clones are ordered and oriented correctly
    (Mapping)
  • Each clone is sequenced individually

29
Shotgun Sequencing
  • Shotgun sequencing of clones was considered
    viable
  • However, researchers in 1999 proposed shotgunning
    the entire genome.

30
Library
  • Create vectors of the sequence and introduce them
    into bacteria. As bacteria multiply you will have
    many copies of the same clone.

31
Sequencing
32
Questions
  • Algorithmic: How do you put the genome back
    together from the pieces? Will be discussed in
    the next lecture.
  • Statistical?
  • Ex: Let G be the length of the genome, and L be
    the length of a fragment. How many fragments do
    you need to sequence?
  • The answer to the statistical questions had
    already been given in the context of mapping, by
    Lander and Waterman.
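
A rough way to approach the fragment-count question is through coverage. The sketch below assumes coverage is defined as c = NL/G for N fragments, and that fragments start uniformly at random, so that a given base is missed with probability about e^(-c). Both are standard modelling assumptions rather than statements from the slides, and the numbers are hypothetical.

```python
import math

def fragments_needed(genome_length, fragment_length, coverage):
    """Number of fragments N such that N * L / G reaches the target coverage."""
    return math.ceil(coverage * genome_length / fragment_length)

def expected_uncovered_fraction(coverage):
    """Poisson approximation: chance that a given base is hit by no fragment."""
    return math.exp(-coverage)

# Hypothetical numbers: a 3 Gb genome, 500-base fragments, 10x coverage.
n_fragments = fragments_needed(3_000_000_000, 500, 10)
missed = expected_uncovered_fraction(10)
```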

33
Lander-Waterman Statistics
[Figure: an island, with clone length L and genome length G marked]
34
LW statistics questions
  • As the coverage c increases, more and more areas
    of the genome are likely to be covered. Ideally,
    you want to see 1 island.
  • Q1 What is the expected number of islands?
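  • Ans: N exp(-cσ), where N is the number of clones,
    c = NL/G is the coverage, and σ = 1 - T/L for a
    minimum detectable overlap T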
  • The number increases at first, and gradually
    decreases.
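
The rise and fall of the island count can be seen numerically. This sketch assumes c = NL/G and σ = 1 - T/L; the genome and clone sizes are hypothetical.

```python
import math

def expected_islands(n_clones, clone_length, genome_length, min_overlap=0):
    """Expected number of islands, N * exp(-c * sigma)."""
    c = n_clones * clone_length / genome_length
    sigma = 1.0 - min_overlap / clone_length
    return n_clones * math.exp(-c * sigma)

G, L = 1_000_000, 1_000  # hypothetical genome and clone lengths
few = expected_islands(500, L, G)     # c = 0.5
peak = expected_islands(1_000, L, G)  # c = 1
many = expected_islands(5_000, L, G)  # c = 5
```

With no required overlap the count peaks near c = 1, because at low coverage almost every new clone starts its own island, while at high coverage new clones mostly merge existing ones.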

35
Analysis: Expected Number of Islands
  • Computing expected # islands.
  • Let X_i = 1 if an island ends at position i, X_i = 0
    otherwise.
  • Number of islands = Σ_i X_i
  • Expected # islands = E(Σ_i X_i) = Σ_i E(X_i)

36
Prob. of an island ending at i
[Figure: a clone of length L ending at position i; T is the overlap needed to join two clones]
  • E(X_i) = Prob(island ends at pos. i)
  • = Prob(clone began at position i-L+1
  • AND no clone began in the next L-T positions)
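
The probability above can be turned into the island count with a short derivation. It assumes each position independently starts a clone with probability α = N/G; the symbols α and σ are introduced here for convenience and are not on the slide.

```latex
\begin{align*}
\alpha &= \frac{N}{G}, \qquad c = \frac{NL}{G}, \qquad \sigma = 1 - \frac{T}{L} \\
E(X_i) &= \alpha\,(1-\alpha)^{L-T} \approx \alpha\, e^{-\alpha (L-T)} = \alpha\, e^{-c\sigma} \\
E\Big(\sum_i X_i\Big) &= \sum_{i=1}^{G} E(X_i) \approx G\,\alpha\, e^{-c\sigma} = N e^{-c\sigma}
\end{align*}
```

The approximation in the second line holds when α is small, which is the usual case since the genome is much longer than the number of clones.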

37
LW statistics
  • Pr[Island contains exactly j clones]?
  • Consider an island that has already begun. With
    probability e^(-cσ), it will never be continued.
    Therefore
  • Pr[Island contains exactly j clones] = (1 - e^(-cσ))^(j-1) e^(-cσ)
  • Expected # j-clone islands = N e^(-2cσ) (1 - e^(-cσ))^(j-1)
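
The argument above says that each clone in an island is the last one with the same probability, so the clone count per island is geometric. A minimal numerical check, assuming that stopping probability is e^(-cσ) and using a hypothetical coverage of c = 2 with σ = 1:

```python
import math

def prob_j_clones(j, c, sigma=1.0):
    """Geometric distribution: j - 1 continuations followed by one stop."""
    q = math.exp(-c * sigma)  # probability the island is not continued
    return (1.0 - q) ** (j - 1) * q

c = 2.0  # hypothetical coverage
js = range(1, 200)
probs = [prob_j_clones(j, c) for j in js]
total = sum(probs)
mean_clones = sum(j * p for j, p in zip(js, probs))
```

The probabilities sum to about 1 and the mean comes out close to e^(cσ), which matches dividing N clones among N e^(-cσ) islands.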

38
Expected # of clones in an island
Why?
39
Expected length of an island