Title: Course Outline
1Course Outline
2What Happens when HGP is Completed ?
- Known gene number, location, function and
regulation. - Chromosome structure and organization.
- Non-coding DNA types, amount, distribution and
function. - Gene expression, protein synthesis, and
post-translation events, - protein interactions and networks.
- Evolutionary conservation among organisms,
(structure and function). - Correlation of SNPs with health and disease.
- Genes involved in complex traits and multi-gene
diseases.
3The Next Step
Locate all the genes and describe their
function. This will probably take another 15-20
years !
4Reminder Genome Browsers and Gene Prediction
http//genome.ucsc.edu/
5Eukaryotes vs Prokaryotes
Typical human bacterial cells drawn to
scale.
- Eukaryotic cells are
- characterized by
- membrane-bound
- compartments,
- which are absent
- in prokaryotes.
BIOS Scientific Publishers Ltd, 1999
6Gene Prediction is Different for Eukaryotes and
Prokaryotes
- Prokaryotic genes
- Eukaryotic genes
http//csbl.bmb.uga.edu/resources/slides/gene-find
ing-2.ppt275,4,Review
7Eukaryotes Splice Signals
8The operon model of prokaryotic gene regulation
- Small genomes, high gene density -
- Example Haemophilus influenza genome
- 85 genic.
- Groups of genes coding for related
- proteins are arranged in units known as
- operons.
- No introns - One gene, one protein.
- Open reading frames - One ORF per gene, ORFs
begin with start, end with stop codon.
http//www.emc.maricopa.edu/faculty/farabee/BIOBK/
BioBookGENCTRL.html
9Gene finding is simple !
http//tardigrada.cap.ed.ac.uk/teaching/genomics/t
ech2/img2.htm
10Prokaryotic Gene Prediction Tools
- Glimmer
- http//cbcb.umd.edu/software/glimmer/
- GeneMark
- http//exon.gatech.edu/genemark/genemark_prok_gms_
plus.cgi - ORNL Annotation Pipeline
- http//compbio.ornl.gov/GP3/pro.html
- FramePlot - protein-coding region prediction tool
for high GC-content bacteria - http//www.nih.go.jp/jun/cgi-bin/frameplot.pl
11Gene Prediction in Eukaryotes
- Complex gene structure.
- Large genomes (0.1 to 3 billion bases).
- Exons and introns (interrupted).
- Low coding density (lt30).
- 2-3 in humans, 25 in Fugu, 60 in yeast
- Alternate splicing (40-60 of all genes).
- Considerable number of pseudogenes.
NATURE BIOTECHNOLOGY VOLUME 25 NUMBER 8 AUGUST
2007
12Gene Prediction in Eukaryotes
13 Non-coding In Genomes
http//fig.cox.miami.edu/cmallery/150/gene/noncod
ing.genes.jpg
14Non-Protein Coding Gene Tools
- tRNA
- tRNA-ScanSE (http//www.genetics.wustl.edu/eddy/tR
NAscan-SE/) - FAStRNA (http//bioweb.pasteur.fr/seqanal/interfac
es/fastrna.html) - snoRNA
- snoRNA database (http//rna.wustl.edu/snoRNAdb/)
- microRNA
- Sfold (http//www.bioinfo.rpi.edu/applications/sfo
ld/index.pl) - SIRNA (http//bioweb.pasteur.fr/seqanal/interfaces
/sirna.html)
http//harlequin.jax.org/GenomeAnalysis/GeneFindin
g04.ppt257,38,Non-protein Coding Gene Tools and
Information
15Approaches to Gene Finding
- Direct Look for something that looks like one
gene (homology) - Homology based gene prediction - Exact or near-exact matches (similarity searches)
of - EST, cDNA, or proteins from the same, or closely
related - organisms (comparative genomics).
- Indirect
- Look for something that looks like all genes (ab
initio). - Hybrid
- Combine homology and ab initio.
16(No Transcript)
17(No Transcript)
18General Things to Remember about
(Protein-Coding) Gene Prediction Software
- Work best on genes that are reasonably similar to
something seen previously. - Finds protein coding regions far better than
non-coding regions. First and last exons are
difficult to annotate because they contain UTRs.
Small genes are not statistically significant and
therefore hard to predict. - In the absence of external (direct) information,
alternative forms will not be identified. - It is imperfect ! (Its biology, after all).
19Finding Eukaryotic Genes Computationally
- Gene finding based on homology evidence BLAST,
FASTA, BLAT etc. - Content-based Methods
- CpG islands, GC content, hexamer repeats,
composition statistics, codon frequencies - Feature-based Methods
- donor sites, acceptor sites, promoter sites,
start/stop codons, polyA signals, feature lengths - Similarity-based Methods
- sequence homology (DNA, protein), EST searches
- Pattern-based
- HMMs, Artificial Neural Networks
- Most effective is a combination of all the above !
20Content-Based Methods
- CpG islands (in and near approximately 40 of
promoters of mammalian genes (about 70 in human
promoters) - Very abundant near gene start site
- High GC content found in 5 ends of genes
- Codon Bias
- Some codons are strongly preferred in coding
regions, others are not - Positional Bias
- 3rd base tends to be G/C rich in coding regions
21Feature-Based Methods
- Based on identifying gene signals (promoter
elements, splice sites, start/stop codons, polyA
sites, etc.) - Wide range of methods
- Consensus sequences
- Weight matrices
- Neural networks
- Decision trees
- Hidden Markov Models (HMMs)
22Eukaryotic Genomic Hints that a Gene is Nearby
- PolII RNA promoter elements
- GC box, TATA box, CCAAT region.
- Kozak consensus sequence (eukaryotic ribosome
binding site-RBS consensus (gcc)gccRccAUGG,
where R is a purine (adenine or guanine) three
bases upstream of the start codon (AUG). - Splicing signals donor, acceptor.
- Termination signal.
- Polyadenylation signal.
- Promoter elements.
23Pol II Promoter Elements
- 5 cap region/signal
- (a modified guanine nucleotide that has been
added to the "front" or 5' end of a eukaryotic
messenger RNA shortly after the start of
transcription, important for ribosome
recognition). - nCAGTnG
- TATA box (30 bp upstream)
- - TATAAA
- CCAAT box (80 bp upstream)
- - TAGCCAATG
- GC box (200 bp upstream)
- - GGGCGG
- None of these are essential for gene expression
- Each of these may have more than one copy, except
the TATA box, to produce greater effect - There are other promoter elements
24Control of Gene Expression Promoter Prediction
25How does it Works Motif Identification
- Exon-Intron Borders Splice Sites
Exon Intron
Exon gaggcatcagGTttgtagactgtgtttcAG
tgcacccact ccgccgctgaGTgagccgtgtc
tattctAGgacgcgcggg tgtgaattagGTaagaggtt
atatctccAGatggagatca ccatgaggagGTgagtg
ccattatttccAGgtatgagacg
Splice site Splice site
Motif Extraction Programs at http//www-btls.jst.g
o.jp/ Tools for genome analysis
http//www-btls.jst.go.jp/cgi-bin/Tools/index.cgi?
langen
http//www.seas.gwu.edu/simhaweb/cs177/spring2004
/zeeberglecture1Part2.ppt317,25,How it works I
Motif identification
26Prediction of Splice Junction Sites
Splice site prediction tools - 3 splice site
CAG/GT 5 splice
site MAG/GTRAGT
(M is A or C R is A or G). http//l25.itba.mi.cn
r.it/webgene/wwwspliceview.html
http//dot.imgen.bcm.tmc.edu9331/seq-search/gene
-search.html Splice predictor
http//bioinformatics.iastate.edu/cgi-bin/sp.cgi
Splice site prediction by Neural Network
http//www.fruitfly.org/seq_tools/splice.html
27Prediction of Translation Start Site
- Translation start ATG
- How to predict a translation start
ATG
GCCATGGCGA .. ACGATGCTGT . GACATGGTAC
AGGATGGGCT GCGATGTGGC
AUG codon finder tool http//www.cbs.dtu.dk/
services/NetStart/ http//l25.itba.mi.cnr.it/web
gene/wwwaug.html
28Polyadenylation Termination Signals
- Polyadenylation signal
- AA/TTAAA
- Located 20 bp upstream of poly-A cleavage site
- Termination Signal
- AGTGTTCA
- Located 30 bp downstream of poly-A cleavage site
29Why Polyadenylation is Really Useful
Complementary Base Pairing
AAAAAAAAAAA TTTTTTTTTTT
30Links for Gene Finding Software
POLY-A signal prediction - 3 end of most
eukaryotic mRNAs (60-200 residues),
post-transcription added http//dot.imgen.bcm.tmc
.edu9331/seq-search/gene-search.html
http//l25.itba.mi.cnr.it/genebin/wwwHC_POLYA
http//genomic.sanger.ac.uk/gf/gf.shtml http//ru
lai.cshl.org/tools/polyadq/polyadq_form.html
Repeated elements (Repeat Masker) http//ftp.ge
nome.washington.edu/RM/RepeatMasker.html
http//l25.itba.mi.cnr.it/genebin/wwwrepeat.pl
GC rich areas, http//rulai.cshl.org/tools/CpG_p
romoter/
31The Annotation Pipeline
- Mask repeats using RepeatMasker
(http//www.repeatmasker.org/). - Run sequence through several gene prediction
programs. - Validation
- Take predicted genes and do similarity search
against ESTs and genes from other organisms. - Do similarity search for non-coding sequences to
find ncRNA.
32Eukaryotic Gene Prediction Tools
- Genscan (ab initio), GenomeScan (hybrid)
- (http//genes.mit.edu/)
- (http//genes.mit.edu/genomescan.html)
- Twinscan (hybrid)
- (http//genes.cs.wustl.edu/)
- FGENESH (ab initio)
- (http//www.softberry.com/berry.phtml?topicgfind)
- GeneMark.hmm (ab initio)
- (http//opal.biology.gatech.edu/GeneMark/eukhmm.cg
i) - MZEF (ab initio)
- (http//rulai.cshl.org/tools/genefinder/)
- GrailEXP (hybrid)
- (http//grail.lsd.ornl.gov/grailexp/)
- GeneID (hybrid)
- (http//www1.imim.es/software/geneid/geneid.html
) - FirstEF
- - http//rulai.cshl.org/tools/FirstEF/
33Gene Finding Software Promoter Prediction
Promoter Prediction McPromoter
http//genes.mit.edu/McPromoter.html Human
Promoter Prediction http//www.softberry.com/berr
y.phtml?topicfpromgroupprogramssubgrouppromot
er Consite tool http//asp.ii.uib.no8090/cgi-bi
n/CONSITE/consite?rmt_input_single DNA Motif
search (in TRANSFAC) http//motif.genome.jp/
TFBind http//tfbind.ims.u-tokyo.ac.jp/
TFSearch http//www.cbrc.jp/research/db/TFSEARCH
.html http//molsun1.cbrc.aist.go.jp/research/db/
TFSEARCH.html Promoter prediction - DNA region
that RNA polymerase binds before initiating
transcription (TATA box prediction) http//www.fr
uitfly.org/seq_tools/promoter.html Transcription
factor binding sites prediction http//www.cbil.u
penn.edu/tess/index.html
34Example - gene finding tool http//genes.mit.e
du/genomescan.html
35When selecting ORF, you can blast the amino
acid sequence and get a protein.
http//www.ncbi.nlm.nih.gov/gorf/gorf.html
6 frames
36GENSCAN output for sequence
Optimal exon
Initial exon
Internal exon
Terminal exon
Single Exon gene
Sub-Optimal exon
http//genes.mit.edu/
37Gene Prediction Pipeline - Example
- Get a genomic sequence from genome browser
- BRCA1 genomic NG_005905 Fasta format.
- Apply to gene prediction tools provided
- GENSCAN http//genes.mit.edu/GENSCAN.html
- ORFinder http//www.ncbi.nlm.nih.gov/gorf/go
rf.html - Also, worth trying
- GeneBuilder http//l25.itba.mi.cnr.it/webgene/g
enebuilder.html - Compare results.
- Try CONSITE example for Transcription factor
recognition of promoter region - (http//asp.ii.uib.no8090/cgi-bin/CONSITE/consite
?rmt_input_single)
Step-by-step practical exercise for gene
annotation http//www1.imim.es/eblanco/seminars/
docs/Reus2005/exercise1/index.html
38Annotation of Putative Genes
Annotation category
- Matches known protein sequence
- Strong similarity to protein sequence
- Similar to known protein
- Similar to unknown protein
- Similar to EST (i.e., putative protein)
- No EST or protein matches (i.e., hypothetical
protein)
39Sequencing and Assembling a Genome
- To sequence a genome, the first task is to cut it
into many small, overlapping pieces. - Then clone each piece.
http//media.wiley.com/assets/1312/61/chapter05.pp
t317,15,Sequencing and Assembling a Genome (I)
40Sequencing and Assembling a Genome
- Each piece must be sequenced.
- Sequencing machines cannot do an entire sequence
at once. - They can only produce short sequences smaller
than 1 Kb. - These pieces are called reads.
- It is necessary to assemble the reads into
contigs.
http//media.wiley.com/assets/1312/61/chapter05.pp
t322,16,Sequencing and Assembling a Genome (II)
41Cleaning DNA Sequences
- In order to sequence genomes, DNA sequences are
often cloned in a vector (plasmid, YAC, or
cosmide). - Sequences of the vector can be mixed with your
DNA sequence. - Before working with your DNA sequence, you should
always clean it with VecScreen
http//www.ncbi.nlm.nih.gov/VecScreen/VecScreen.ht
ml
42Gene Assembly
Usually genes are found in fragments, and need to
be assembled Simple example input
ACCGT CGTGC TTAC TACCGT
Output --ACCGT-- ----CGTGC
TTAC----- -TACCGT--
Spaces are ignored. Fragments are
assembled according to overlapping areas.
___________ TTACCGTGC
A multiple alignment of the fragments result in
consensus sequence.
Assembled piece 9 bases
43Sequence Assembly - Real life is more complicated
Errors in sequencing Example
Output --ACCGT--
----CGTGC TTAC-----
-TGCCGT--
Input ACCFT CGTGC TTAC TGCCGT
TTACCGTGC
The consensus is still correct because of
majority voting.
Insertion error Example
Output
--ACC-GT-- ----CAGTGC TTAC------ -TACC-GT--
Input ACCGT CAGTGC TTAC TACCGT
TTACC?GTGC
44Gene Assembly - Real life is more complicated
Deletion error Example Input ACCGT Output
--ACCGT-- CGTGC
----CGTGC TTAC TTAC----- TACGT
-TAC_GT--
Repeats
TTACCGTGC
x
x
x
Lack of coverage
Target DNA
Uncovered area
45Sequence Assembly
CAP3 http//pbil.univ-lyon1.fr/cap3.php http//bi
oweb.pasteur.fr/seqanal/interfaces/cap3.html
http//deepc2.psi.iastate.edu/aat/cap/cap.html
Example sequences (seq) http//genome.cs.mtu.edu/
cap/data/
46Compute a Restriction Map
NEBcutter V2.0 http//tools.neb.com/NEBcutter2/in
dex.php http//bioinformatics.org/sms2/rest_map.ht
ml WatCut An on-line tool for restriction
analysis, silent mutation scanning, and SNP-RFLP
analysis http//watcut.uwaterloo.ca/watcut/watcut
/template.php Sequence Extractor generates a
clickable restriction map and PCR primer map of a
DNA sequence. Protein translations and
intron/exon boundaries are also shown
http//bioinformatics.org/seqext/
47Perform PCR Using a Computer
- Polymerase Chain Reaction (PCR) is a method for
amplifying DNA. - PCR is used for many applications, including
- Gene cloning
- Forensic analysis and paternity tests
- PCR amplifies the DNA between two anchors called
PCR primers. - Hydrogen bonding of single-stranded nucleic acids
is referred to as "annealing two complementary
sequences will form hydrogen bonds between their
complementary bases (G to C, and A to T or U) and
form a stable double-stranded molecule.
primers
http//media.wiley.com/assets/1312/61/chapter05.pp
t309,6,Making PCR with a Computer
48Degenerate Primers
Degenerate Primers are used for amplification of
sequences from different organisms a set of
primers which have a number of options at several
positions in the sequence so as to allow
annealing to and amplification of a variety of
related sequences.
http//www.mcb.uct.ac.za/pcroptim.htm http//www.s
cielo.br/img/revistas/mioc/v100n6/a08fig01.gif
49PCR Primer Design 10 Rules
1. Primer length should be 18-22 bases in
length This length is long enough for adequate
specificity, and short enough for primers to bind
easily to the template at the annealing
temperature. 2. Base composition determines
DNA stability (40-60 GC content higher GC
makes more stable DNA) 3. GC clamp Primers
should end (3') in a G or C, or CG or GC this
promotes specific binding 4. Runs of three or
more Cs or Gs at the 3'-ends of primers may
promote mis-priming at G or C-rich sequences
(because of stability of annealing), and should
be avoided. http//www.premierbiosoft.com/tech
_notes/PCR_Primer_Design.html
50PCR Primer Design 10 Rules
5. Primer Tms (melting temp.) between 52-58oC
are preferred definition the temperature at
which one half of the DNA duplex will dissociate
to become single stranded and indicates the
duplex stability. Annealing temp
determines reaction specificity. 6. Secondary
structure 3'-ends of primers should not be
complementary (ie. base pair), as otherwise
primer self dimers will be synthesized
preferentially to any other product 7.
Secondary structure Cross dimers formed
between pairs of primers. 8. Target sequence
Amplicon length (usually 100-500 bps). 9. Check
cross homology with related genes and possible
pseudo-genes. Amplicon location (distance from
3'end). 10. Intron spanning.
http//www.premierbiosoft.com/tech_notes/PCR_Prime
r_Design.html
51Primer Design Tools
Primer3 http//biotools.umassmed.edu/bioapps/pr
imer3_www.cgi PrimerQuest (also option for
RT-PCR) http//www.idtdna.com/Scitools/Applicatio
ns/PrimerQuest PCR primers based upon
multialignments (PriFi use demo) http//cgi-www.
daimi.au.dk/cgi-chili/PriFi/main?config.x101conf
ig.y30 PCR primers designed from protein
multiple sequence alignments http//bioinformatics
.weizmann.ac.il/blocks/codehop.html GeneFisher
input single or multiple sequence(s), either
nucleotide or amino acid. For multiple sequences
a multiple alignment will be calculated (or
upload already aligned sequences).
http//bibiserv.techfak.uni-bielefeld.de/genefishe
r2/submission.html (example sequences in
web-site).
52Example
53Special Features Primer Design
Create overlapping PCR products in large
sequences. http//www2.eur.nl/fgg/kgen/primer/Over
lapping_Primers.html Creating primers around
exons in genomic DNA. http//www2.eur.nl/fgg/kgen/
primer/Genomic_Primers.html Creating primers
around SNPs in genomic DNA. http//www2.eur.nl/fg
g/kgen/primer/SNP_Primers.html Creating primers
around the Open Reading Frame of cDNAs.
http//www2.eur.nl/fgg/kgen/primer/cDNA_Primers.h
tml
54Examine Primer Properties
NetPrimer http//www.premierbiosoft.com/netprime
r/netprlaunch/netprlaunch.html Blast primer
sequence against NCBI and provide primer
propertieshttp//www.idtdna.com/analyzer/Applica
tions/OligoAnalyzer/ OligoCalculator
Oligonucleotide properties calculator http//www.
basic.northwestern.edu/biotools/oligocalc.html
OligoAnalyzer http//www.idtdna.com/analyzer/Ap
plications/OligoAnalyzer/
55Other Primer Utilities
UCSC In-Silico PCR UCSC http//genome.ucsc.ed
u/cgi-bin/hgPcr?commandstart Electronic PCR
(e-PCR) is computational procedure that is used
to identify sequence tagged sites(STSs), within
DNA sequences http//www.ncbi.nlm.nih.gov/sutils/
e-pcr/ In silico simulation of molecular
biology experiments http//insilico.ehu.es/
http//insilico.ehu.es/PCR/
56Primers Design Pipeline
- Use accurate sequence data as much as possible
! - Restrict primer search to regions that best
reflect goals - Locate candidate primers
- Discard candidate primers that show
undesirable self-hybridization - Verify the site-specificity of the primer
57Primers Design Pipeline (example)
- Question
- Hybrid cells are formed containing human BRCA1
cDNA in mouse cells. - We want to identify cells which had incorporated
the human cDNA, using PCR primers. - Steps
- Look for mouse and human BRCA1 in NCBI/GENE
databse. - Locate BRCA1 cDNA sequence (is there only 1
transcript ?). - If there are more than 1 transcripts, compare
between them (lets take only 3 for this example)
and find the region that would identify all
transcripts (hint use http//www.ebi.ac.uk/Tools
/clustalw/index.html). - Find primer-pairs for PCR (http//frodo.wi.mit.e
du/primer3/input.htm). - Check primer specificity (human only, not
mouse) - UCSC http//genome.ucsc.edu/cgi-bin/hgPcr?comman
dstart and/or - NCBI (BLAST) http//genome.ucsc.edu/cgi-bin/hgPc
r?commandstart - Check primer properties (http//www.basic.north
western.edu/biotools/oligocalc.html)
58Bioinformatics - Past and Present
ORTHOLOG GENES (Taxonomy)
SEQUENCE ALIGNMENT
CODING REGIONS
CONSERVED DOMAINS
SEQUENCES LITERATURE
3-D STRUCTURE
GENE FAMILIES
SIGNAL PEPTIDE
MUTATIONS POLYMORPHISM
GENOME MAPS
CELLULAR LOCATION
59Bioinformatics - Present and Future
ORTHOLOG GENES (Taxonomy)
SEQUENCE ALIGNMENT
CODING REGIONS
CONSERVED DOMAINS
GENE EXPRESSION, GENES FUNCTION, DRUG PERSONAL
THERAPY
3-D STRUCTURE
GENE FAMILIES
SIGNAL PEPTIDE
MUTATIONS POLYMORPHISM
GENOME MAPS
CELLULAR LOCATION
60Are We Done ?
Now this is not the end. It is not even the
beginning of the end. But it is perhaps, the
end of the beginning. Winston Churchill, 1942
(3 years into WW2)
http//www.globecartoon.com/neweconomy/10.html
61Thank-you !
Dr. Metsada Pasmanik-Chor Bioinformatics Unit,
001 Sherman Bldg. Faculty of Life Science,
TAU Tel x 6992 E-mail metsada_at_bioinfo.tau.ac.il
Bioinfo. Unit webpage http//www.tau.ac.il/lifes
ci/bioinfo/