CSE182-L9 - PowerPoint PPT Presentation

Slides: 40

Provided by: vineet50

Learn more at: https://cseweb.ucsd.edu

Transcript and Presenter's Notes

1
CSE182-L9

Gene Finding (DNA signals)
Genome Sequencing and assembly

2
An HMM for Gene structure
3
Gene Finding via HMMs

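Gene finding can be interpreted as a dynamic programming (d.p.)
approach that threads the genomic sequence through
the states of a gene HMM.
States: Einit, Emid, Efin (initial, internal, and final exons),
I (intron), IG (intergenic)

[Figure: state diagram connecting IG, Einit, I, Emid, and Efin. Note: not all links are shown.]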
4
Generalized HMMs, and other refinements

A probabilistic model for each of the states (e.g.,
exon, splice site) needs to be described.
In standard HMMs, the time spent in a state follows
a geometric distribution (the discrete analogue of
the exponential).
This is violated by many states of the gene
structure HMM. The solution is to model these states
using generalized HMMs.
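
To make the duration claim concrete: a state with self-transition probability p is left with probability 1 - p at every step, so the time spent in it is geometric. The following is a minimal sketch; the function name and the value p = 0.9 are illustrative assumptions, not taken from the lecture.

```python
def geometric_duration_pmf(p_stay, d):
    """P(stay exactly d steps) for a state with self-loop probability p_stay."""
    if d < 1:
        return 0.0
    # d - 1 self-transitions followed by one exit from the state
    return (p_stay ** (d - 1)) * (1.0 - p_stay)

p_stay = 0.9  # assumed self-transition probability (illustrative)
pmf = [geometric_duration_pmf(p_stay, d) for d in range(1, 6)]
# Each probability is p_stay times the previous one,
# so the most likely duration is always 1.
```

Whatever p is chosen, the implied length distribution decreases monotonically from length 1, which is a poor fit for any state whose typical length is well above 1.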

5
Length distributions of Introns & Exons
6
Generalized HMM for gene finding

Each state also emits a duration for which it
will cycle in the same state. The time is
generated according to a random process that
depends on the state.

7
Forward algorithm for gene finding
[Figure: a segment of the sequence, from position i to position j, spent in state q_k]
Duration Prob.: probability that you stayed in
state $q_k$ for $j-i+1$ steps
Emission Prob.: probability that you emitted
$X_i \ldots X_j$ in state $q_k$ (given by the 5th-order
Markov model)
Forward Prob.: probability that you emitted i
symbols and ended up in state $q_k$
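
The three quantities above combine into a forward recurrence over segments rather than single symbols. Below is a hedged sketch of that recurrence. The function signature, the cap `max_len` on segment length, and the toy model in the usage lines (uniform emissions instead of a 5th-order Markov model, durations uniform over 1..3) are all assumptions made for illustration.

```python
def ghmm_forward(x, states, init, trans, duration, emission, max_len):
    """F[j][k]: probability of emitting x[:j] with a segment of state k
    ending exactly at prefix length j."""
    n = len(x)
    F = [{k: 0.0 for k in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for k in states:
            total = 0.0
            for d in range(1, min(j, max_len) + 1):
                i = j - d  # the segment is x[i:j], which has d symbols
                seg = duration(k, d) * emission(k, x[i:j])
                if i == 0:
                    # first segment of the sequence
                    total += init[k] * seg
                else:
                    # previous segment ended at i in some state l
                    total += seg * sum(F[i][l] * trans[l][k] for l in states)
            F[j][k] = total
    return F

# Toy model (all values are illustrative assumptions)
states = ["E", "I"]
init = {"E": 0.5, "I": 0.5}
trans = {"E": {"E": 0.0, "I": 1.0}, "I": {"E": 1.0, "I": 0.0}}

def duration(state, d):
    return 1.0 / 3.0 if 1 <= d <= 3 else 0.0

def emission(state, segment):
    return 0.25 ** len(segment)

F = ghmm_forward("ACGT", states, init, trans, duration, emission, max_len=3)
```

Because each cell sums over all admissible segment lengths, the cost grows with `max_len`; capping the segment length is one common way to keep the computation tractable.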
8
De novo Gene Prediction: Summary

Various signals distinguish coding regions from
non-coding
HMMs are a reasonable model for Gene structures,
and provide a uniform method for combining
various signals.
Further improvement may come from improved signal
detection

9
DNA Signals

Coding versus non-coding
Splice Signals
Translation start

10
DNA signal example

The donor site marks the junction where an exon
ends, and an intron begins.
For gene finding, we are interested in computing
the probability
$D_i = \Pr[\text{donor site at position } i]$
Approach: collect a large number of donor sites,
align them, and look for a signal.

11
PWMs
Position:  -3 -2 -1 +1 +2 +3 +4 +5 +6
            A  A  G  G  T  G  A  G  T
            C  C  G  G  T  A  A  G  T
            G  A  G  G  T  G  A  G  G
            T  A  G  G  T  A  A  G  G

Fixed length for the splice signal.
Each position is generated independently
according to a position-specific distribution.
Figure shows data from > 1200 donor sites.
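
As a rough illustration of a position weight matrix, the sketch below estimates per-position frequencies from the four example donor sites on this slide and scores a candidate by log-odds. The pseudocount of 1 and the uniform background of 0.25 are assumptions, and four sequences are far too few for a real model.

```python
import math

DONOR_SITES = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]

def build_pwm(sites, pseudocount=1.0):
    """Per-position nucleotide frequencies, smoothed with a pseudocount."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in "ACGT"})
    return pwm

def log_odds_score(pwm, seq, background=0.25):
    """Sum of per-position log-odds against a uniform background."""
    return sum(math.log2(pwm[i][b] / background) for i, b in enumerate(seq))

pwm = build_pwm(DONOR_SITES)
consensus_like = log_odds_score(pwm, "AAGGTGAGT")  # one of the training sites
unrelated = log_odds_score(pwm, "CCCCCCCCC")       # arbitrary non-site
```

The score is a sum over positions, which is exactly the independence assumption stated above.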

12
Improvements to signal detection

Pr[GGTA is a donor site]?
0.5 × 0.5
Pr[CGTA is a donor site]?
0.5 × 0.5
Is something wrong with this explanation?

Training sequences: GGTA GGTA GGTA GGTA CGTG CGTG CGTG CGTG
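
The problem can be checked directly. The sketch below (function names are illustrative) fits independent per-position frequencies to the eight sequences and evaluates both candidates.

```python
def position_frequencies(seqs):
    """Unsmoothed frequency of each nucleotide at each position."""
    freqs = []
    for pos in range(len(seqs[0])):
        col = {}
        for s in seqs:
            col[s[pos]] = col.get(s[pos], 0) + 1
        freqs.append({b: n / len(seqs) for b, n in col.items()})
    return freqs

def pwm_probability(freqs, seq):
    """Product of per-position frequencies (independence assumption)."""
    p = 1.0
    for i, b in enumerate(seq):
        p *= freqs[i].get(b, 0.0)
    return p

training = ["GGTA"] * 4 + ["CGTG"] * 4
freqs = position_frequencies(training)
p_seen = pwm_probability(freqs, "GGTA")    # half of the training set
p_unseen = pwm_probability(freqs, "CGTA")  # never occurs in the training set
```

Both candidates receive the same probability, 0.5 × 0.5, even though CGTA never appears among the training sequences: the first and last positions are perfectly correlated, and a per-position model cannot see that.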
13
MDD

PWMs do not capture correlations between
positions
Many position pairs in the Donor signal are
correlated

14
Maximal Dependence Decomposition

Choose the position i which has the highest
correlation score.
Split the sequences into two sets: those which have
the consensus at position i, and the remaining ones.
Recurse until <terminating conditions>.
Stop if the set of sequences is small enough.
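
The recursion above can be sketched as follows. This is a simplified illustration, not the published procedure: the correlation score is taken to be a sum of chi-square statistics between "consensus at i or not" and the nucleotide at every other position, and the terminating conditions (`min_size`, or no remaining dependence at all) stand in for a proper significance test.

```python
from collections import Counter

def consensus(seqs, pos):
    """Most common nucleotide at a position."""
    return Counter(s[pos] for s in seqs).most_common(1)[0][0]

def chi_square(seqs, i, j):
    """Chi-square statistic: (consensus at i?) versus nucleotide at j."""
    cons = consensus(seqs, i)
    n = len(seqs)
    table = Counter((s[i] == cons, s[j]) for s in seqs)
    row = Counter(s[i] == cons for s in seqs)
    col = Counter(s[j] for s in seqs)
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (table[(r, c)] - expected) ** 2 / expected
    return stat

def mdd(seqs, positions, min_size=4):
    """Return the decomposition as a nested dict."""
    if len(seqs) < min_size or len(positions) < 2:
        return {"leaf": seqs}
    scores = {i: sum(chi_square(seqs, i, j) for j in positions if j != i)
              for i in positions}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return {"leaf": seqs}  # no dependence left to exploit
    cons = consensus(seqs, best)
    match = [s for s in seqs if s[best] == cons]
    rest = [s for s in seqs if s[best] != cons]
    remaining = [p for p in positions if p != best]
    return {"split_on": best, "consensus": cons,
            "match": mdd(match, remaining, min_size),
            "rest": mdd(rest, remaining, min_size)}

training = ["GGTA"] * 4 + ["CGTG"] * 4
tree = mdd(training, [0, 1, 2, 3])
```

On this toy set the tree splits once, separating the GGTA sequences from the CGTG sequences; a separate PWM could then be fitted to each leaf.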

15
MDD for Donor sites
16
Gene Prediction: Summary

Various signals distinguish coding regions from
non-coding
HMMs are a reasonable model for Gene structures,
and provide a uniform method for combining
various signals.
Further improvement may come from improved signal
detection

17
How many genes do we have?
Nature
Science
18
Alternative splicing
19
Comparative methods

Gene prediction is harder with alternative
splicing.
One approach might be to use comparative methods
to detect genes
Given a similar mRNA/protein (from another
species, perhaps?), can you find the best parse
of a genomic sequence that matches that target
sequence?
Yes, with a variant of alignment algorithms that
penalizes introns separately from other
gaps.
There is a genome sequencing project for a
different Hirudo species. You could compare the
Hirudo ESTs against the genome to do gene finding.

20
Comparative gene finding tools

Procrustes/Sim4: mRNA vs. genomic
Genewise: proteins vs. genomic
CEM: genomic vs. genomic
Twinscan: combines comparative and de novo
approaches.
Mass Spec related?
Later in the class we will consider mass
spectrometry data.
Can we use this data to identify genes in
eukaryotic genomes? (Research project)

21
Databases

RefSeq and other databases maintain sequences of
full-length transcripts/genes.
We can query using sequence.

22
Course
Gene finding

Sequence Comparison (BLAST & other tools)
Protein Motifs
Profiles/Regular Expression/HMMs
Discovering protein coding genes
Gene finding HMMs
DNA signals (splice signals)
How is the genomic sequence itself obtained?

ESTs
Protein sequence analysis
23
Silly Quiz

Who are these people, and what is the occasion?

24
Genome Sequencing and Assembly
25
DNA Sequencing

DNA is double-stranded
The strands are separated, and a polymerase is
used to copy the second strand.
Special bases terminate this process early.

26
Sequencing

A break at T is shown here.
Measuring the lengths using electrophoresis
allows us to get the position of each T
The same can be done with every nucleotide.
Fluorescent labeling can help separate different
nucleotides

Automated detectors read the terminating bases.
The signal decays after 1000 bases.
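
In an idealized form, the read-out step amounts to sorting terminated fragments by length and recording which base ended each one. The sketch below assumes noise-free data with exactly one terminating base per length; the function name and the example lengths are made up for illustration.

```python
def read_from_fragment_lengths(lengths_by_base):
    """Rebuild the copied sequence from terminated-fragment lengths.

    lengths_by_base maps each base to the fragment lengths at which
    copying stopped on that base.
    """
    calls = {}
    for base, lengths in lengths_by_base.items():
        for n in lengths:
            calls[n] = base  # a fragment of length n ends in this base
    return "".join(calls[n] for n in sorted(calls))

# Hypothetical fragment lengths observed for each terminating base
fragments = {"A": [1, 5], "C": [2], "G": [3, 6], "T": [4]}
sequence = read_from_fragment_lengths(fragments)
```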

28
Sequencing Genomes Clone by Clone

Clones are constructed to span the entire length
of the genome.
These clones are ordered and oriented correctly
(Mapping)
Each clone is sequenced individually

29
Shotgun Sequencing

Shotgun sequencing of clones was considered
viable
However, researchers in 1999 proposed shotgunning
the entire genome.

30
Library

Create vectors of the sequence and introduce them
into bacteria. As bacteria multiply you will have
many copies of the same clone.

31
Sequencing
32
Questions

Algorithmic: How do you put the genome back
together from the pieces? This will be discussed in
the next lecture.
Statistical?
Ex.: Let G be the length of the genome, and L be
the length of a fragment. How many fragments do
you need to sequence?
The answer to the statistical questions had
already been given, in the context of mapping, by
Lander and Waterman.
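
One way to put numbers on the example question, under the usual simplifying assumption that fragments start uniformly at random: with N fragments the coverage is c = NL/G, and a given base is left uncovered with probability about e^(-c). The function name and the example values below are illustrative.

```python
import math

def fragments_needed(genome_len, frag_len, covered_fraction):
    """Fragments needed so that about covered_fraction of bases are covered,
    using the approximation P(base uncovered) = exp(-c), c = N * L / G."""
    coverage = -math.log(1.0 - covered_fraction)
    return math.ceil(coverage * genome_len / frag_len)

# Illustrative values: a 1 Mb genome, 500 bp fragments, 99% of bases covered
n = fragments_needed(1_000_000, 500, 0.99)
```

This ignores end effects and says nothing about how many separate pieces the covered region falls into.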

33
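Lander-Waterman Statistics
[Figure: clones of length L placed along a genome of length G; a maximal set of overlapping clones forms an island.]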
34
LW statistics questions

As the coverage c increases, more and more areas
of the genome are likely to be covered. Ideally,
you want to see 1 island.

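Q1: What is the expected number of islands?
Ans: $N e^{-c\sigma}$, where N is the number of clones,
$c = NL/G$ is the coverage, and $\sigma = 1 - T/L$ for a
minimum detectable overlap T.
The number increases at first, and then gradually
decreases.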
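
The rise and fall can be seen numerically. The sketch below assumes the Lander-Waterman form N·exp(-cσ) with c = NL/G and σ = 1 - T/L; reading the symbol lost from the transcript as σ is an assumption, as are the example genome and clone sizes.

```python
import math

def expected_islands(n_clones, clone_len, genome_len, min_overlap=0):
    """Expected number of islands, N * exp(-c * sigma)."""
    c = n_clones * clone_len / genome_len    # coverage
    sigma = 1.0 - min_overlap / clone_len    # fraction of a clone not needed for overlap
    return n_clones * math.exp(-c * sigma)

G, L = 100_000, 1_000  # illustrative genome and clone lengths
few = expected_islands(50, L, G)     # c = 0.5
peak = expected_islands(100, L, G)   # c = 1
many = expected_islands(500, L, G)   # c = 5
```

With no minimum overlap the expected count peaks at c = 1: below that, each new clone tends to start a new island, and above it new clones mostly merge existing ones.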

35
Analysis Expected Number Islands

Computing the expected number of islands.
Let $X_i = 1$ if an island ends at position i, $X_i = 0$
otherwise.
Number of islands $= \sum_i X_i$
Expected number of islands $= E(\sum_i X_i) = \sum_i E(X_i)$

36
Prob. of an island ending at i
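[Figure: a clone of length L ending at position i; T marks the overlap region]

$E(X_i) = \Pr[\text{island ends at position } i]$
$= \Pr[\text{clone began at position } i-L+1$
$\text{ AND no clone began in the next } L-T \text{ positions}]$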
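
A sketch of how this probability leads to the expected island count, assuming each genome position independently starts a clone with probability $\alpha = N/G$ (N clones in total), with coverage $c = NL/G$ and $\sigma = 1 - T/L$; these symbols are assumptions chosen to match the Lander-Waterman notation:

```latex
\begin{align*}
E(X_i) &= \alpha\,(1-\alpha)^{L-T}
        \approx \alpha\, e^{-\alpha (L-T)}
        = \frac{N}{G}\, e^{-c\sigma},
        \qquad \text{since } \alpha (L-T) = \frac{NL}{G}\Bigl(1-\frac{T}{L}\Bigr) = c\sigma, \\
E\Bigl(\sum_i X_i\Bigr) &= \sum_{i=1}^{G} E(X_i) \approx N e^{-c\sigma}.
\end{align*}
```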

37
LW statistics

Pr[island contains exactly j clones]?
Consider an island that has already begun. With
probability $e^{-c\sigma}$, it will never be continued.
Therefore
Pr[island contains exactly j clones]
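
Under the reasoning above, and assuming each clone is independently the last one of its island with probability $e^{-c\sigma}$, the number of clones in an island would be geometric. This is a sketch of that consequence rather than a formula recovered from the slide:

```latex
\[
\Pr[\text{island contains exactly } j \text{ clones}]
  = \bigl(1 - e^{-c\sigma}\bigr)^{\,j-1}\, e^{-c\sigma}, \qquad j = 1, 2, \ldots
\]
```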