Title: Genome Annotation and Pathway Mining
1Genome Annotation and Pathway Mining
2Genome Annotation
- Various related (but distinct) questions
- Does genome contain a homologue of gene X?
(gene-by-gene)
- Does genome contain homologues of genes
involved in pathway X? (pathway mining)
- What genes are there? (whole-genome)
- What genes are being transcribed? (experimental)
3By-gene
- Look for (high-scoring) alignments of protein to
genome
- Various tools for doing this
- TBLASTN - aligns protein to DNA translated in 6 frames
- (limited handling of gaps)
- GeneWise - allows frameshifts and introns
- (better handling of gaps)
- Exonerate - GeneWise replacement
- (best handling of gaps)
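A minimal sketch of the 6-frame translation that translated searches rely on. It assumes the standard genetic code; the function names are illustrative and not taken from any of the tools above.

```python
# Standard genetic code, built from the NCBI table-1 string (bases ordered T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse-complement a DNA string; symbols other than A/C/G/T are kept as-is."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from position 0; codons with non-ACGT symbols become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

A protein query can then be compared against each of the six translated frames; stop codons appear as `*`.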
4How Exonerate works
High-scoring Segment Pairs (HSPs)
Dynamic programming to fill in the gaps between
HSPs
Exonerate's scoring scheme uses finite-state
machine theory
5Finite state machines
- Most dynamic programming algorithms for
pattern-matching or alignment can be formulated
as finite-state machines
- E.g. the regular expression for matching the MCB
binding site, /ACGCGT/ (equivalent to /.*ACGCGT.*/)
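To make the regular-expression/state-machine link concrete, here is a small sketch that searches for the MCB site in two ways: with Python's `re` module, and with a hand-built deterministic finite-state machine. The helper names `build_dfa` and `fsm_search` are illustrative assumptions.

```python
import re

MCB = re.compile(r"ACGCGT")               # matches anywhere via search()
MCB_PADDED = re.compile(r".*ACGCGT.*")    # same language, matched with fullmatch()


def build_dfa(motif, alphabet="ACGT"):
    """State k means: the last k symbols read equal the first k symbols of motif."""
    dfa = {}
    for state in range(len(motif)):
        for symbol in alphabet:
            seen = motif[:state] + symbol
            # Next state = length of the longest motif prefix that is a suffix of `seen`.
            k = min(len(motif), len(seen))
            while k > 0 and not seen.endswith(motif[:k]):
                k -= 1
            dfa[(state, symbol)] = k
    return dfa


def fsm_search(seq, motif="ACGCGT"):
    """Return True if the machine reaches its accepting state while reading seq."""
    dfa = build_dfa(motif)
    state = 0
    for symbol in seq:
        state = dfa.get((state, symbol), 0)  # unknown symbols (e.g. N) reset to the start
        if state == len(motif):
            return True
    return False
```

For sequences without newlines, `MCB.search(s)`, `MCB_PADDED.fullmatch(s)` and `fsm_search(s)` should agree on whether the site is present.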
6Finite state machines
Pairwise alignment scoring schemes can also be
specified as finite-state machines
This one is Needleman-Wunsch
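As a rough illustration of Needleman-Wunsch global alignment, the sketch below fills the usual dynamic-programming table and returns only the optimal score. The match, mismatch and linear gap values are arbitrary assumptions, not parameters from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )
    return score[-1][-1]
```

The three options inside `max` correspond to the three transitions (match, delete, insert) of the state machine.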
7Finite state machines
Adding padding states to Needleman-Wunsch gives
us Smith-Waterman
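A sketch of the Smith-Waterman local alignment score under assumed scoring values. In dynamic-programming terms, the padding shows up as the extra `0` option (start a fresh alignment anywhere) and as taking the best cell anywhere in the table (stop anywhere).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                0,                  # unaligned prefix: restart the alignment here
                prev[j - 1] + sub,  # extend with a match/mismatch
                prev[j] + gap,      # gap in b
                cur[j - 1] + gap,   # gap in a
            )
            cur.append(cell)
            best = max(best, cell)  # unaligned suffix: alignment may end at any cell
        prev = cur
    return best
```

Only two rows are kept in memory because the score, not the alignment itself, is returned.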
8Finite state machines
Adding a state to track whether the last column
used a gap gives us Gotoh
GeneWise and Exonerate add extra states to track
frameshifts, introns, etc.
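A sketch of the Gotoh idea for global alignment: three tables, one per state, so that opening a gap can cost more than extending one. The penalties are assumed values, and a gap of length k is charged gap_open + (k - 1) * gap_extend.

```python
NEG_INF = float("-inf")


def gotoh(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score with affine gap penalties."""
    n, m = len(a), len(b)
    # M: last column aligns two symbols; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Staying in a gap state extends the gap; entering it opens a new one.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Tracking further situations (frameshift, intron, and so on) follows the same pattern: one more table per extra state.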
9Pathway mining
- Gather representative protein sequences for
enzymes in your pathway
- Repeat the by-gene analysis for each gene
- Small numbers of genes: can do manually, e.g.
using remote web-based services
- Larger numbers of genes: use local tools,
scripting (e.g. Perl)
- Sift results for biological relevance
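For larger gene sets, the "repeat the by-gene analysis" step could be scripted roughly as follows. This sketch only builds one Exonerate command line per protein and does not run anything. The options shown (`--model protein2genome`, `--query`, `--target`, `--bestn`) are as I understand them from the Exonerate documentation and should be checked against the installed version; the file names are hypothetical.

```python
def exonerate_commands(protein_fastas, genome_fasta, best_n=1):
    """Return one exonerate argument list per query protein file (nothing is executed)."""
    commands = []
    for query in protein_fastas:
        commands.append([
            "exonerate",
            "--model", "protein2genome",   # protein query vs genomic DNA target
            "--query", str(query),
            "--target", str(genome_fasta),
            "--bestn", str(best_n),        # report only the top-scoring hits per query
        ])
    return commands


# Hypothetical file names, for illustration only.
commands = exonerate_commands(["purA.fa", "purB.fa"], "genome.fa")
```

Each list could then be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where Exonerate is installed, and the output sifted afterwards.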
10Pathway mining
- E.g. Can my organism synthesize purine?
- Diagram shows purine biosynthesis and salvage in
yeast
11How to gather protein search set?
- Pick your pathway description resource (GO,
MetaCyc, EcoCyc, etc.)
- Use most appropriate for task at hand
- GO is best for eukaryotes, so-so for prokaryotes
- You may need more than one
- Find all relevant nodes/terms
- Map terms to sequences
- Ease of this step depends on quality of
annotation
- Again, Perl scripting could be useful
- Again, sift results for biological relevance
12Example: GO
- Find your term of interest
- E.g. purine biosynthesis
- Find all descendants of this term
- The GO path list is useful for this step
- Map GO terms to sequences
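The three steps above can be sketched on a toy ontology. The term names, links and protein identifiers below are made up for illustration and are not copied from GO; in practice they would be read from the ontology and annotation files.

```python
from collections import deque

# Toy child -> parents map (illustrative only).
PARENTS = {
    "purine biosynthesis": ["biosynthesis"],
    "purine nucleotide biosynthesis": ["purine biosynthesis"],
    "IMP biosynthesis": ["purine nucleotide biosynthesis"],
    "regulation of purine biosynthesis": ["purine biosynthesis"],
    "pyrimidine biosynthesis": ["biosynthesis"],
}

# Toy term -> annotated sequences map (hypothetical protein names).
ANNOTATIONS = {
    "IMP biosynthesis": ["protA", "protB"],
    "purine nucleotide biosynthesis": ["protC"],
    "regulation of purine biosynthesis": ["protR"],
    "pyrimidine biosynthesis": ["protP"],
}


def descendants(term, parents=PARENTS):
    """Breadth-first search for every term below `term` in the hierarchy."""
    children = {}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, []).append(child)
    found, queue = set(), deque([term])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


def sequences_for(term):
    """Sequences annotated to the term itself or to any of its descendants."""
    terms = descendants(term) | {term}
    return sorted({seq for t in terms for seq in ANNOTATIONS.get(t, [])})
```

Note that the result for "purine biosynthesis" includes the regulator `protR`, which is why the output still needs sifting.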
13Why you need to sift results
- One of the child terms of purine biosynthesis
in GO is regulation of purine biosynthesis
- But, regulation typically involves proteins like
transcription factors, kinases, etc. - all of
which can regulate other things too!
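One crude way to start sifting is to set aside terms whose names mark them as regulatory, so they can be reviewed by hand rather than trusted automatically. The name-prefix rule below is an assumed heuristic, not a GO feature, and it will miss regulatory terms that are named differently.

```python
REGULATION_PREFIXES = (
    "regulation of",
    "positive regulation of",
    "negative regulation of",
)


def is_regulatory(term_name):
    """Heuristic: does the term name start with a regulation prefix?"""
    return term_name.lower().startswith(REGULATION_PREFIXES)


def sift_terms(term_names):
    """Split term names into (kept, flagged_for_review)."""
    kept = [t for t in term_names if not is_regulatory(t)]
    flagged = [t for t in term_names if is_regulatory(t)]
    return kept, flagged
```

Flagged terms are not discarded outright: a regulator may still be relevant, but its presence alone says little about whether the pathway exists.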
14Other classification schemes for genes and
pathways
- GO/Reactome
- KEGG
- Kyoto Encyclopedia of Genes and Genomes
- COG
- Clusters of Orthologous Groups
- E.C. numbers
- E.C. = Enzyme Commission
15Whole-genome
- Start with a set of gene predictions
- Use a gene prediction tool
- E.g. for prokaryotic genomes: GLIMMER
- E.g. for eukaryotic genomes: GENSCAN, SNAP
- Convert into protein sequences
- Scripting, scripting, scripting (e.g. Perl)
- Annotate each protein
- C.f. by-gene approach (look for high-scoring
alignments to annotated protein database)
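The "convert into protein sequences" step might look like the sketch below: splice the predicted exons together, reverse-complement for minus-strand genes, and translate. It assumes the standard genetic code and 0-based, half-open exon coordinates; real gene predictors each have their own output format that would need parsing first.

```python
# Standard genetic code, built from the NCBI table-1 string (bases ordered T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def predicted_protein(genome, exons, strand="+"):
    """Translate a gene prediction given as (start, end) exon coordinates.

    Coordinates are 0-based, half-open, and refer to the forward strand.
    """
    cds = "".join(genome[start:end] for start, end in sorted(exons)).upper()
    if strand == "-":
        cds = reverse_complement(cds)
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )
    # Drop a single terminal stop codon if the prediction includes one.
    return protein[:-1] if protein.endswith("*") else protein
```

The resulting proteins can then be annotated with the same searches used in the by-gene approach.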
16GENSCAN (Burge and Karlin, 1997)
- Basic transcriptional, translational and splicing
signals
- Donor and acceptor splice sites
- Poly-adenylation signal
- Length distributions and compositional features
of introns, exons and intergenic regions
- Different parameterizations of the program for
regions of different GC-content
17Homology-based gene prediction
- Use statistical profiles of
- Whole proteins, or
- protein domains
- Databases of such profiles exist
- INTERPRO - includes PFAM, SMART, PROSITE, etc.
- These databases use different methods
- For purposes of automation, may be easier to
narrow down on one method
- Preferred method is Hidden Markov Models
18Hidden Markov Model Profiles
Leucine-rich repeat
Like a weight-matrix profile, but also models the
position-specific probability of deleting or
inserting amino acids
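To illustrate the weight-matrix part of this comparison only, here is a sketch that builds a log-odds position weight matrix from a gap-free alignment and scores a sequence of the same length. It deliberately has no insert or delete states, so it is not a profile HMM; the uniform background and pseudocount are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column (uniform background)."""
    background = 1.0 / len(alphabet)
    matrix = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        matrix.append({
            sym: math.log2(((column.count(sym) + pseudocount) / total) / background)
            for sym in alphabet
        })
    return matrix


def score(matrix, seq):
    """Sum of per-position log-odds; seq is assumed to be as long as the matrix."""
    return sum(column[sym] for column, sym in zip(matrix, seq))
```

A profile HMM extends this by adding, at each position, probabilities for skipping the column or for inserting extra residues.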
19Profile HMM databases
- PFAM, SMART are databases of Profile HMMs
- Both use the HMMER program
- Can download entire HMM database and run a search
against your proteins
- Each HMM corresponds to one protein domain
- GeneWise program also allows you to run a protein
HMM against a DNA sequence (6-frame translation)
- GO has mappings to profile databases like
InterPro (and hence PFAM/SMART) as well as protein
sequence databases like UniProt
20Gene prediction tools
- Many gene prediction programs are also based on
Hidden Markov Models
- but a different kind of HMM
- Not the profile HMMs used by HMMER
- HMMs are essentially probabilistic regular
expressions or finite-state machines
- In gene prediction tools, they are used to
recognise statistical patterns e.g. hexamer
usage, specific motifs associated with introns,
etc.
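As a toy illustration of "hexamer usage", the sketch below counts overlapping hexamers in coding and non-coding training sequences and scores a new sequence by summed log-odds. This is a simplified stand-in, not the model used by any particular gene finder; the add-one smoothing and function names are assumptions.

```python
import math
from collections import Counter


def hexamer_frequencies(seqs, k=6):
    """Count every overlapping k-mer across the training sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def log_odds(seq, coding, noncoding, k=6, pseudocount=1.0):
    """Sum of log2(P(k-mer | coding) / P(k-mer | non-coding)) over seq."""
    n_kmers = 4 ** k
    coding_total = sum(coding.values()) + pseudocount * n_kmers
    noncoding_total = sum(noncoding.values()) + pseudocount * n_kmers
    seq = seq.upper()
    total = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        p_coding = (coding[kmer] + pseudocount) / coding_total
        p_noncoding = (noncoding[kmer] + pseudocount) / noncoding_total
        total += math.log2(p_coding / p_noncoding)
    return total
```

A positive score suggests the sequence looks more like the coding training set; an HMM-based gene finder combines signals of this kind with explicit states for exons, introns and intergenic regions.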
21Phylogeny
- Phylogenetic analysis is one approach to more
specific predictions
- Idea is to infer a phylogenetic tree relating
your unknown proteins (X, Y, Z, etc.) with annotated
proteins (A, B, C, D) in the same family
(Figure: phylogenetic tree with leaves A, X, B, C, Y, Z, D)
22Why Phylogeny Matters
Different subfamilies of a particular protein
family can have quite different functions
(substrate specificities, variations in
structure, different binding pockets or
interaction domains, or completely new roles)
23Experimental data
- Evidence of transcription
- Expressed Sequence Tags (ESTs)
- Complementary DNA sequences (cDNAs)
- Gene- or Genome-tiling microarrays
- Other transcript-detecting nanotech
- Evidence of function
- E.g. experimental evidence that a bacterium needs
nucleotides in its diet
24Genome-genome comparison
- Increasingly popular as sequencing gets cheaper
- Get a holistic view of how evolution has
progressed
- Genome alignment approach
- May need to make several local alignments due to
rearrangements
- Analyse alignments for regions of high/low
conservation, characteristic of genes, etc.
- Gene complement approach
- Predict genes, annotate them, compare functions
OR
- Predict genes, align to find closest homologues,
then assign and compare functions
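Once genes in two genomes have been assigned functions, the gene complement comparison reduces to set operations. A minimal sketch, assuming each genome is summarized as a mapping from gene identifier to a function label (the identifiers and labels used in practice would come from the annotation step):

```python
def compare_gene_complements(functions_a, functions_b):
    """Compare the sets of functions annotated in two genomes.

    Each argument maps gene identifier -> function label.
    """
    a, b = set(functions_a.values()), set(functions_b.values())
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }
```

Functions found in only one genome are candidates for lineage-specific gain or loss, subject to the usual caveat that a missing annotation is not proof of a missing gene.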
25Comparative genomics
VISTA genome conservation browser
26Comparative genomics
MAUVE genome rearrangement browser
27Comparative genomics
JGI IMG (Integrated Microbial Genomes) Portal
Abundance Profiles
28Summary: pathway mining
(make gene predictions)
Identify biological processes
Gather representative protein sequences
Gather statistical profiles of protein domains
Search genome (or predicted protein products)
using Exonerate, GeneWise, BLAST, etc.
Post-process results - e.g. by visual inspection
or phylogenetic analysis