Title: Computational Biology
1Computational Biology
- The advent of genomic sequencing has led to a new
era in biological studies, one involving aspects
of informational technologies to expedite
analysis of biological data.
2BIOS 480 Goals
- Provide a comprehensive understanding of current
methods in biological sequence analysis
- Assess challenges and approaches in new
bioinformatics-related disciplines
- Support in-depth, hands-on experience in design
and implementation of bioinformatics tools (BIOS
482)
3Grades
- The primary assessment tools in this course will
be participation, as exhibited by in-class
discussion (40%), and performance on assigned
homework (30%).
- A final exam testing your comprehensive knowledge
of the material will count for 20% of your final
grade.
- www.uwp.edu/barber/bioinformatics/BIOS480.htm
has lectures and important materials for this
class and BIOS 482
4Introduction to
- Sequence alignments
- Gene prediction
- Motifs
- Phylogeny
- Genomics
- Proteomics
- Metabolomics
5The advent of genome sequencing brought
bioinformatics into its own
- Yeah, but now that the human genome is done,
isn't genomics done?
- No
6 Capillary and Slab gel electrophoresis use a
modified Sanger technology with
fluorescent dyes
Typical reads of 500-750 nt on an hour
timescale. Variation depending on sequencer.
7Microfabricated Capillary Arrays
- Etch a glass chip with T-shaped channels that are
7 cm long and µm-scale in depth and width; this
allows a 96-well chip that would be capable of
150,000 bases/h
- Miniaturization is one booming field driving
bioinformatics
8Free Solution Electrophoresis
- Possibly will improve separation time (no matrix)
without losing read length
- Label DNA molecules with a friction-increasing
molecule such as streptavidin
- Currently can read 100 bp, a long way to go
9Who needs electrophoresis?
- Pyrosequencing
- MALDI-TOF Mass Spectrometry
- Sequencing by Hybridization
- Massively Parallel Signature Sequencing
- A testimony to innovative molecular biology
- Single molecule methods
10Pyrosequencing
- Real-time sequencing measuring release of PPi
during DNA synthesis
- Has been of particular use for SNP analysis
11Put the sequencing reactions through a mass
spectrometer
Spectra of the C- and G- terminated
oligonucleotides
Current limit: 100 bp. Facilitated by sensitivity
and high-throughput loading
12Potential innovations in DNA sequencing
- Sequencing by hybridization
- Cot-based analysis
- http://www.msstate.edu/research/mgel/cbcs/cbcs.htm
- Chip-based analysis
- http://www.hyseq.com/content/131.php
- http://citeseer.nj.nec.com/context/471959/0
- Linear Read
- http://www.usgenomics.com/about/index.shtml
13Cot analysis
15Growth in genomic technology
- U.S. Genomics's technology platform, the
GeneEngine, has two components: (1)
nanotechnology systems for positioning DNA so
that it can be read linearly (broadly termed
DNA Delivery Mechanism(s)) and (2) detection
technologies that allow the reading of
information from the DNA Delivery Mechanism(s).
(FRET-based??)
16The future looks bright, but what about right now?
17Overview of Genomic Sequencing
Original DNA
- Break DNA into random fragments (8-10X Coverage)
- Amplify fragments in a vector and sequence
500-700 bases in from each end
Base calling is performed by Phred software
http://www.phrap.org/
http://www.genome.org/cgi/reprint/8/3/175.pdf
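To make the "8-10X coverage" figure concrete, here is a minimal sketch of the arithmetic, assuming the usual definition of average coverage as (number of reads x read length) / genome length. The genome size and read length in the example are illustrative values, not taken from the slides, and the function names are hypothetical.

```python
import math


def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average coverage: total sequenced bases divided by genome length."""
    return num_reads * read_length / genome_length


def reads_needed(genome_length: int, read_length: int, target_coverage: float) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    # Assumed example: a 5 Mb genome sequenced with 600-base reads.
    genome, read_len = 5_000_000, 600
    for target in (8, 10):
        print(target, "x coverage needs", reads_needed(genome, read_len, target), "reads")
```

Average coverage says nothing about how evenly the reads are spread, so some regions will still be covered less deeply than others.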
18Overview of Genomic Sequencing
Original DNA
- Break DNA into random fragments (8-10X Coverage)
- Amplify fragments in a vector and sequence
500-700 bases in from each end
- Assemble fragments of sequence that have been
read
Contig 1
Contig 2
19Phred Software
- Calls bases in four phases
- Predicting peaks (ideal locations)
- Locating observed peaks
- Matching observed to predicted
- Finding missing peaks
- http://www.genome.org/cgi/reprint/8/3/186.pdf
- http://www.genome.org/cgi/reprint/8/3/175.pdf
20Errors in Sequencing Reads
- Each base call is assigned a quality score
- q = -10 x log10(p), where p is the estimated
error probability of the base call. Higher
quality scores correspond to lower error
probabilities
- Errors are associated with the peak vicinity;
the following parameters are used in error
probability determination on a TRAINING SET
- Peak spacing
- Uncalled/called ratio (two window sizes)
- Peak resolution
- These result in a look-up table inherent to the
Phred software
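The quality-score formula can be checked with a few lines of Python. This is a minimal sketch of the conversion q = -10 x log10(p) and its inverse only; it does not reproduce Phred's look-up table, and the function names are hypothetical.

```python
import math


def error_prob_to_quality(p: float) -> float:
    """Quality score q = -10 * log10(p) for an error probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def quality_to_error_prob(q: float) -> float:
    """Inverse conversion: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # An error probability of 1 in 10, 1 in 100 and 1 in 1000.
    for p in (0.1, 0.01, 0.001):
        print(p, round(error_prob_to_quality(p)))
```

Each step of 10 in quality corresponds to a tenfold drop in the error probability, so q = 20 means 1 error in 100 calls and q = 30 means 1 error in 1000.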
21Common Sources of Sequencing Errors
- The first fifty or so peaks of a trace are noisy
and unevenly spaced due to anomalous migration of
short DNA fragments, and unreacted dye-primer and
dye-terminator molecules.
- Near the end of the trace, peaks become less
evenly spaced due to less accurate trace
processing and less well resolved as diffusion
effects increase; the number of labeled molecules
also decreases.
- Compressions are most common in GC-rich regions,
when bases near the end of a single-stranded
fragment bind to a complementary region, forming a
hairpin (which migrates more rapidly than expected)
- The dye-terminator sequencing method helps resolve
compressions, but has its own problems: about 85%
of high-quality dye-terminator errors resulted
from a missing G peak following an A, or a
missing A following a T (Ewing and Green, 1998).
22Assembly of large DNA sequences
- Several assembly programs exist and can be run
with different degrees of success: Phrap, TIGR
Assembler, CAP, STROLL, etc.
23Overlap-layout-consensus
- Most fragment assembly algorithms include the
following three steps:
- Overlap. Finding potentially overlapping
fragments.
- Layout. Finding the order of fragments.
- Consensus. Deriving the DNA sequence from the
layout.
24Overlap
- The overlap problem is to find the best match
between the suffix of one sequence and the prefix
of another.
- If there are no sequencing errors, simply find the
longest suffix of one string that exactly matches
the prefix of another string.
- Since errors are infrequent, the common practice is
to use a filtration method and filter out pairs of
fragments that do not share a significantly long
common substring.
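The two ideas on this slide can be sketched in Python as follows. This is a minimal illustration, not the method of any particular assembler: the first function handles only the error-free case (longest exact suffix-prefix match), and the second is a simple stand-in for filtration that asks whether two fragments share any exact substring of length k. The function names and the choice of k are assumptions.

```python
def longest_exact_overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def share_kmer(a: str, b: str, k: int) -> bool:
    """Filtration test: do a and b share an exact common substring of length k?"""
    kmers_of_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers_of_a for i in range(len(b) - k + 1))


if __name__ == "__main__":
    a, b = "ACGTTGCA", "TGCAGGT"
    print(longest_exact_overlap(a, b))  # suffix TGCA of a matches prefix TGCA of b
    print(share_kmer(a, b, 4))
```

Pairs that fail the cheap `share_kmer` test can be skipped, so the more expensive error-tolerant alignment is only run on pairs that have a chance of overlapping.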
25Layout
- Many algorithms select a pair of fragments with
the best overlap at every step.
- The score of an overlap is either the similarity
score or a more involved probabilistic score.
- The selected pair of fragments with the best
overlap score is checked for consistency.
- If this check is passed, the two fragments are
merged.
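A greedy layout of this kind can be sketched as below. It is a toy version under stated assumptions: the overlap score is simply the length of the exact suffix-prefix match, the consistency check is reduced to a minimum overlap length (`min_len`, a hypothetical parameter), and sequencing errors, repeats, reverse complements and fragments contained inside other fragments are all ignored.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest exact suffix(a)/prefix(b) match of at least min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the best overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_pair = 0, None
        for a, b in permutations(contigs, 2):
            k = overlap(a, b, min_len)
            if k > best_len:
                best_len, best_pair = k, (a, b)
        if best_pair is None:
            break  # no pair passes the (simplified) consistency check
        a, b = best_pair
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[best_len:])  # merge, writing the overlap only once
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))
```

Because the best pair is chosen locally at every step, the result is not guaranteed to be the best overall layout; a repeat can cause two fragments from different parts of the genome to be merged.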
26Layout
- At later stages of the algorithm, collections
of fragments (contigs) rather than individual
fragments are merged.
- The difficulty with the layout step is deciding
whether two fragments with a good overlap really
overlap (i.e. their differences are caused by
sequencing errors) or represent a repeat in a
genome (i.e. their differences are caused by
mutations).
- Use additional scaffolding measures: physical
mapping, optical mapping,
http://schwartzlab.biotech.wisc.edu/omm/omm.html
27Consensus
- The simplest way to build the consensus is to
report the most frequent character in the
substring layout that is (implicitly) constructed
after the layout step is completed.
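The majority-vote consensus can be sketched as follows, assuming the layout is given as equal-length rows in which a gap character marks columns a fragment does not cover. The row representation, the gap symbol, the "N" placeholder for uncovered columns and the function name are all assumptions made for illustration.

```python
from collections import Counter


def consensus(layout: list[str], gap: str = "-") -> str:
    """Report the most frequent non-gap character in each column of the layout."""
    width = len(layout[0])
    if any(len(row) != width for row in layout):
        raise ValueError("all layout rows must have the same length")
    result = []
    for column in zip(*layout):
        counts = Counter(base for base in column if base != gap)
        # Ties go to the base seen first in the column; uncovered columns get N.
        result.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(result)


if __name__ == "__main__":
    layout = [
        "ACGTAC------",
        "---TACGGA---",
        "------GGTTTC",  # contains one miscalled base (T instead of A)
        "-----CGGATT-",
    ]
    print(consensus(layout))
```

In the example the miscalled base is outvoted by the two other fragments covering that column, which is why deeper coverage gives a more reliable consensus.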
28The Human Touch
- Consed: A Graphical Tool for Editing Phrap
Assemblies.
29Some definitions
- Heuristics: A term in computer science that
refers to "guesses" made by a program to obtain
approximately accurate results. Typically, these
are used to increase the speed of a program
greatly at the cost of potentially yielding
suboptimal results. BLAST and FASTA use
heuristics based on knowledge of how sequences
evolve.
- Greedy algorithm: The idea behind a greedy
algorithm is to perform a single procedure in the
recipe over and over again until it can't be done
any more and see what kind of results it will
produce.