Title: Dr. Wishart
1Dr. Wishart
- Office hours
- Generally available right after class
- Best time to catch me is after 4:00 pm in my office in Athabasca Hall
- Usually around until 6:00 pm
- Arranging an appointment is always best (email)
- Email responses take 1-2 days
2Where to get your notes
- http://redpoll.pharmacy.ualberta.ca/
- Look under Courses
3General Course Outline
- Bioinformatics introduction
- Bioinformatics Databases
- Sequence Alignment
- Protein Feature ID
- Computational microbiology
- Peptide/Protein Analysis
- Protein Structure (Xray)
4General Course Outline
- Protein Structure (NMR)
- Mass Spectrometry 1
- Mass Spectrometry 2
- Proteomics
- Systems Biology
- Enzymology/Systems Biology
- Protein Structure (Xray)
5Introduction to Bioinformatics
- Microbiology 343
- David Wishart Rm. 3-41 Ath
- david.wishart_at_ualberta.ca
6Objectives Outline
- Definitions and roles of bioinformatics
- DNA sequencing (foundation to bioinformatics)
- Genomes and Genomics
- Gene finding in prokaryotes
- From genes to proteins
7Bioinformatics
Definition - A field of information technology which endeavours to improve the storage, management and analysis of biological, medical and pharmaceutical data.
A blend of information technology and biotechnology.
8Bioinformatics - Converting Data to Knowledge
Data
Knowledge
9Bioinformatics Software
10What's the Appeal?
- No spills or smells
- No need for a lab coat
- No need to get hands or clothes messy
- Provides a faster or alternative route to create
hypotheses, perform difficult experiments, avoid
unnecessary experiments, compare, visualize and
analyze data, make predictions, see what is
unseeable and handle a growing tidal wave of
both data and knowledge
11Bioinformatics
Genomics
Proteomics
Bioinformatics
12High Throughput DNA Sequencing
13Shotgun Sequencing
Isolate Chromosome
Shear DNA into Fragments
Clone into Seq. Vectors
Sequence
14Principles of DNA Sequencing
Primer
DNA fragment
Amp
pBR322
Tet
Ori
Denature with heat to produce ssDNA
Klenow + ddNTP + dNTP + primers
15The Secret to Sanger Sequencing
16Principles of DNA Sequencing
3' Template 5'
G C A T G C
5' Primer
ddG
GddC
GCddA
GCAddT
GCATddG
GCATGddC
17Principles of DNA Sequencing
[Gel figure: lanes G, T, C, A; bands run from short to long and read G C A T G C]
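The read-out on the last two slides can be mimicked in a few lines. This is only a toy sketch, not a model of the real chemistry: it assumes the newly synthesized strand is GCATGC, that termination happens once at every position, and the function names `sanger_fragments` and `read_gel` are made up for illustration.

```python
def sanger_fragments(new_strand):
    """Return one chain-terminated fragment for every position of the new strand."""
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]

def read_gel(fragments):
    """Sort fragments short to long and read the terminal (dideoxy) base of each."""
    return "".join(f[-1] for f in sorted(fragments, key=len))

fragments = sanger_fragments("GCATGC")

# One "lane" per ddNTP: the fragment lengths that end in that base.
lanes = {base: sorted(len(f) for f in fragments if f[-1] == base) for base in "GCAT"}
print(lanes)

# Reading the bands from shortest to longest recovers the sequence.
print(read_gel(fragments))  # GCATGC
```

Each fragment's length gives the position and its terminating ddNTP gives the base, which is why sorting by size is enough to read the sequence.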
18Capillary Electrophoresis
Separation by Electro-osmotic Flow
19Multiplexed CE with Fluorescent detection
ABI 3700
96x700 bases
20Shotgun Sequencing
Assembled Sequence
Sequence Chromatogram
Send to Computer
21Shotgun Sequencing
- Very efficient process for small-scale (10 kb) sequencing (preferred method)
- First applied to whole genome sequencing in 1995 (H. influenzae)
- Now standard for all prokaryotic genome sequencing projects
- Successfully applied to D. melanogaster
- Moderately successful for H. sapiens
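The "Send to Computer" and "Assembled Sequence" steps depend on finding overlaps between reads. The sketch below is a minimal greedy overlap merger, assuming error-free reads from one strand and no repeats. Real assemblers are far more elaborate, and the reads and function names here are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more: leave the remaining contigs separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads

# Hypothetical reads covering a 20-base fragment, given in shuffled order.
reads = ["TGAAGTTCGA", "ATGGCGTACC", "GTACCTGAAG"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGAAGTTCGA']
```

Repetitive sequence breaks this kind of greedy merging, because a repeat gives several equally good overlaps. That is one reason large, repeat-rich genomes are harder to assemble than small prokaryotic ones.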
22The Finished Product
GATTACAGATTACAGATTACAGATTACAGATTACAG ATTACAGATTACA
GATTACAGATTACAGATTACAGA TTACAGATTACAGATTACAGATTACA
GATTACAGAT TACAGATTAGAGATTACAGATTACAGATTACAGATT AC
AGATTACAGATTACAGATTACAGATTACAGATTA CAGATTACAGATTAC
AGATTACAGATTACAGATTAC AGATTACAGATTACAGATTACAGATTAC
AGATTACA GATTACAGATTACAGATTACAGATTACAGATTACAG ATTA
CAGATTACAGATTACAGATTACAGATTACAGA TTACAGATTACAGATTA
CAGATTACAGATTACAGAT
23Sequenced Genomes
http://www.genomenewsnetwork.org/
24Genomes to Date
- 8 vertebrates (human, mouse, rat, fugu, dog, chimp)
- 3 plants (Arabidopsis, rice, poplar)
- 2 insects (fruit fly, mosquito)
- 2 nematodes (C. elegans, C. briggsae)
- 1 sea squirt
- 4 parasites (Plasmodium, Guillardia)
- 4 fungi (S. cerevisiae, S. pombe)
- 200 bacteria and archaebacteria
- 2000 viruses
25Prokaryotes
- Simple gene structure
- Small genomes (0.5 to 10 million bp)
- No introns (uninterrupted)
- Genes are called Open Reading Frames or ORFs (include start and stop codons)
- High coding density (>90%)
- Some genes overlap (nested)
- Some genes are quite short (<60 bp)
26Prokaryotic Gene Structure
ORF (open reading frame)
TATA box
Stop codon
Start codon
ATGACAGATTACAGATTACAGATTACAGGATAG
Frame 1
Frame 2
Frame 3
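The three forward reading frames of the example sequence can be listed with a short helper. This is a small sketch; `codons` is an illustrative name, and frames are numbered 1 to 3 as on the slide.

```python
def codons(seq, frame):
    """Split seq into complete codons, starting at offset frame - 1 (frame is 1, 2 or 3)."""
    start = frame - 1
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]

seq = "ATGACAGATTACAGATTACAGATTACAGGATAG"
for frame in (1, 2, 3):
    print("Frame", frame, " ".join(codons(seq, frame)))
```

Only frame 1 begins with the start codon ATG and ends with the stop codon TAG, so only that frame reads the sequence as one uninterrupted ORF.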
27Simple Gene Finding
- Scan forward strand until a start codon is found
- Staying in same frame, scan in groups of three until a stop codon is found
- If # of codons between start and end is greater than 50, identify as gene and go to last start codon and proceed with step 1
- If # of codons between start and end is less than 50, go back to last start codon and go to step 1
- At end of chromosome, repeat process for reverse complement
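The steps above can be sketched in Python. This is one possible reading of the slide, not a reference implementation. It assumes ATG is the only start codon and the standard stops TAA, TAG and TGA, and it counts the start codon but not the stop. Minus-strand coordinates refer to the reverse-complemented sequence. The function names are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

def scan_strand(seq, min_codons=50):
    """Return (start, end) pairs of ORFs on one strand; end is exclusive."""
    orfs = []
    pos = seq.find("ATG")                       # step 1: find a start codon
    while pos != -1:
        # Step 2: stay in frame and read codons until a stop codon appears.
        for i in range(pos + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                n_codons = (i - pos) // 3       # start codon counted, stop codon not
                if n_codons > min_codons:       # step 3: long enough, call it a gene
                    orfs.append((pos, i + 3))
                break
        # Steps 3 and 4: go back to the last start codon and resume the scan after it.
        pos = seq.find("ATG", pos + 1)
    return orfs

def find_genes(seq, min_codons=50):
    """Scan the forward strand, then the reverse complement (step 5)."""
    seq = seq.upper()
    genes = [("+", s, e) for s, e in scan_strand(seq, min_codons)]
    genes += [("-", s, e) for s, e in scan_strand(reverse_complement(seq), min_codons)]
    return genes

# The short example ORF from the gene structure slide, with a lowered threshold.
print(find_genes("ATGACAGATTACAGATTACAGATTACAGGATAG", min_codons=5))  # [('+', 0, 33)]
```

Because the scan resumes right after each start codon, an in-frame ATG inside a long ORF is reported as a second, nested candidate with the same stop. A practical tool would usually keep only the longest one.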
28Advanced Gene Finding
- Identify all ORFs (open reading frames) > 200 bases on both strands using normal and alternate start/stop codons
- Find high scoring -10, -35 and RBS sites at 5' ends of putative ORFs
- Find high scoring rho terminators at 3' ends of putative ORFs
- Exclude ORFs without identified signals at 5' or 3' ends
29Key Prokaryotic Gene Signals
- Alternate start codons
- RNA polymerase promoter site (-10, -35 site or TATA box)
- Shine-Dalgarno sequence (Ribosome binding site-RBS)
- Stem-loop (rho-independent) terminators
- High GC content (CpG islands)
30Alternate Start Codons (E. coli)
Class I: ATG (Met), GTG (Val), TTG (Leu)
Class IIa: CTG (Leu), ATT (Ile), ATA (Ile), ACG (Thr)
31-10, -35 Site (RNA pol Promoter)
Position:  -36 -35 -34 -33 -32 ... -13 -12 -11 -10  -9  -8
Consensus:   T   T   G   A   C ...   T   A   t   A   A   T
32RBS (Shine Dalgarno Seq)
Position:  -13 -12 -11 -10  -9  -8 ...  -1   0   1   2   3   4
Consensus:   G   G   G   G   G   G ...   n   A   T   G   n   C
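One simple way to look for "high scoring" -10, -35 or RBS sites is to slide the consensus along the upstream region and count matching positions. The sketch below does only that. It is a plain match count, not a weight matrix, and the upstream sequence and function names are invented for illustration.

```python
def consensus_score(window, consensus):
    """Fraction of non-'n' consensus positions that the window matches."""
    pairs = [(w, c) for w, c in zip(window.upper(), consensus.upper()) if c != "N"]
    return sum(w == c for w, c in pairs) / len(pairs)

def best_site(seq, consensus):
    """Slide the consensus along seq; return (best score, offset of the best window)."""
    k = len(consensus)
    scores = [(consensus_score(seq[i:i + k], consensus), i)
              for i in range(len(seq) - k + 1)]
    return max(scores, key=lambda s: s[0])  # the first window wins on ties

# Hypothetical upstream region containing a -35 box and a -10 box.
upstream = "GC" + "TTGAC" + "AGCCGTACGTTCGGA" + "TATAAT" + "GC"

print(best_site(upstream, "TTGAC"))   # -35 consensus from the slide
print(best_site(upstream, "TATAAT"))  # -10 consensus from the slide
print(consensus_score("AGGAGG", "GGGGGG"))  # a partial match to the RBS consensus
```

A gene finder would also check that the best -35, -10 and RBS hits sit at plausible distances from each other and from the start codon. This sketch does not attempt that.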
33Terminator Stem-loops
34More Sophisticated Methods
RBS site
promoter site
HMM
35Really Sophisticated Methods
- GLIMMER
- http://www.tigr.org/software/glimmer/
- Uses interpolated Markov models (IMM)
- Requires training of sample genes
- Takes about 1 minute/genome
- GeneMark.hmm
- http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
- Available as a web server
- Uses hidden Markov models (HMM)
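To give a feel for what "training on sample genes" means, here is a minimal fixed-order Markov chain. It is deliberately much simpler than the interpolated models in GLIMMER or the hidden Markov models in GeneMark.hmm, and it is not their algorithm. The training sequence, the order of 2 and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=2):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts

def log_likelihood(seq, counts, order=2):
    """Log-probability of seq under the model, with add-one smoothing over A, C, G, T."""
    total = 0.0
    for i in range(order, len(seq)):
        ctx = counts.get(seq[i - order:i], {})
        total += math.log((ctx.get(seq[i], 0) + 1) / (sum(ctx.values()) + 4))
    return total

# Hypothetical "known gene" used as training data.
model = train_markov(["ATGGCGGCGGCGGCGTAA"])

gene_like = log_likelihood("GCGGCGGCG", model)
unlike = log_likelihood("ATATATATA", model)
print(gene_like > unlike)  # True: the first candidate resembles the training genes
```

A gene finder built on this idea would train one model on coding sequence and another on non-coding sequence, then compare the two scores for each candidate ORF.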
36Glimmer Performance
37What Next?
- Raw DNA sequence -> Gene sequences
- Gene seqs -> Protein sequences
- Gene + Protein seqs -> Databases
- Gene + Protein info -> Databases
- Most protein and DNA sequence data is entered into GenBank through XXX
- Next Lecture: Databases
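The "Gene seqs -> Protein sequences" step is a lookup in the genetic code. The sketch below builds the standard codon table and translates the example ORF from the gene structure slide. It assumes an ATG start and does not handle alternate start codons, and `translate` is an illustrative name.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(orf):
    """Translate an ORF codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGACAGATTACAGATTACAGATTACAGGATAG"))  # MTDYRLQITG
```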
38Sample Exam Question
- Describe an algorithm or sketch a flowchart for gene finding in prokaryotes
- What are the key features of a prokaryotic ORF?
- Following is a gene sequence; identify and label all major features