Genome Annotation and Pathway Mining - PowerPoint PPT Presentation

About This Presentation

Title:

Genome Annotation and Pathway Mining

Slides: 29

Provided by: ianho9

Transcript and Presenter's Notes

Title: Genome Annotation and Pathway Mining

1
Genome Annotation and Pathway Mining

BioE131

2
Genome Annotation

Various related (but distinct) questions
Does genome contain a homologue of gene X?
(gene-by-gene)
Does genome contain homologues of genes
involved in pathway X? (pathway mining)
What genes are there? (whole-genome)
What genes are being transcribed? (experimental)

3
By-gene

Look for (high-scoring) alignments of protein to
genome
Various tools for doing this
TBLASTN - translates genomic DNA in 6 frames
(limited handling of gaps)
GeneWise - allows frameshifts and introns
(better handling of gaps)
Exonerate - GeneWise replacement
(best handling of gaps)
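
The six-frame translation mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the standard genetic code and an unambiguous ACGT sequence; the function names are made up for this example and are not part of any of the tools listed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

A protein query would then be compared against each of the six translated frames; the real tools also score gaps, frameshifts and introns, which this sketch does not attempt.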

4
How Exonerate works
High-scoring Segment Pairs (HSPs)
Dynamic programming to fill in the gaps between
HSPs
Exonerate's scoring scheme uses finite state
machine theory
5
Finite state machines

Most dynamic programming algorithms for
pattern-matching or alignment can be formulated
as finite-state machines
E.g. the regular expression for matching the MCB
binding site /ACGCGT/ (equivalent
to /.*ACGCGT.*/)
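
As a rough illustration of the regular-expression example above, the motif ACGCGT can be turned into an explicit finite-state machine whose state is the number of motif characters matched so far. The construction below is one common way to do this (a sketch, not necessarily how any particular tool is implemented); `build_dfa` and `find_sites` are names invented for this example.

```python
MOTIF = "ACGCGT"


def build_dfa(motif, alphabet="ACGT"):
    """Build a transition table: transitions[state][base] -> next state.

    State k means "the last k characters read equal the first k of the motif".
    """
    transitions = []
    for state in range(len(motif) + 1):
        row = {}
        for base in alphabet:
            candidate = motif[:state] + base
            # Longest suffix of what we have read that is also a motif prefix.
            k = min(len(motif), len(candidate))
            while k > 0 and candidate[-k:] != motif[:k]:
                k -= 1
            row[base] = k
        transitions.append(row)
    return transitions


def find_sites(dna, motif=MOTIF):
    """Return 0-based start positions of every occurrence of the motif."""
    dfa = build_dfa(motif)
    state, hits = 0, []
    for pos, base in enumerate(dna.upper()):
        # Characters outside the alphabet (e.g. N) reset the machine.
        state = dfa[state].get(base, 0)
        if state == len(motif):
            hits.append(pos - len(motif) + 1)
    return hits
```

The machine reads each base exactly once, which is the same left-to-right scan a regular-expression engine performs for /ACGCGT/.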

6
Finite state machines
Pairwise alignment scoring schemes can also be
specified as finite-state machines
This one is Needleman-Wunsch
7
Finite state machines
Adding padding states to Needleman-Wunsch gives
us Smith-Waterman
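
The relationship between the two algorithms can be sketched as a single dynamic-programming routine with a `local` switch. Allowing the score to restart at zero plays the role of the padding states (free unaligned sequence before and after the match). The scoring values below are arbitrary assumptions chosen for illustration, and only the score is returned, not the alignment.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Score of the best alignment of a and b with a linear gap penalty.

    local=False: Needleman-Wunsch (global alignment).
    local=True:  Smith-Waterman (best-scoring local alignment).
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # Restarting at zero is what makes the alignment local.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]
```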
8
Finite state machines
Adding a state to track whether the last column
used a gap gives us Gotoh
Genewise and Exonerate add extra states to track
frameshifts, introns, etc.
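
A minimal sketch of the Gotoh idea: three score tables, one per state, so that opening a gap and extending a gap can be charged differently. The penalties are illustrative assumptions, and the extra frameshift and intron states used by GeneWise and Exonerate are not modelled here.

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score with affine gaps.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    M: last column aligns two residues; X: gap in b; Y: gap in a.
    """
    n, m = len(a), len(b)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Staying in a gap state extends the gap; entering it opens one.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these penalties one gap of length three (cost -5) is preferred to three separate gaps (cost -9), which is the behaviour the extra state is there to capture.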
9
Pathway mining

Gather representative protein sequences for
enzymes in your pathway
Repeat the by-gene analysis for each gene
Small numbers of genes: can do manually, e.g.
using remote web-based services
Larger numbers of genes: use local tools,
scripting (e.g. Perl)
Sift results for biological relevance
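
For the "larger numbers of genes" case, the scripting can be as simple as generating one search command per query protein. The sketch below uses Python rather than Perl and only builds Exonerate command lines; it assumes Exonerate is installed and that each query protein sits in its own FASTA file. The file names are hypothetical.

```python
import shlex


def exonerate_commands(protein_fastas, genome_fasta, model="protein2genome"):
    """Build one Exonerate command (as an argument list) per query protein file."""
    return [
        ["exonerate", "--model", model, "--query", query, "--target", genome_fasta]
        for query in protein_fastas
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in exonerate_commands(["ade1.fa", "ade2.fa"], "genome.fa"):
        print(shlex.join(cmd))
```

Each argument list can be handed to `subprocess.run` to execute the search; the output still has to be sifted for biological relevance afterwards.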

10
Pathway mining

E.g. 'Can my organism synthesize purine?'
Diagram shows purine biosynthesis and salvage in
yeast

11
How to gather protein search set?

Pick your pathway description resource (GO,
MetaCyc, EcoCyc...)
Use most appropriate for task at hand
GO is best for eukaryotic, so-so for prokaryotes
You may need more than one
Find all relevant nodes/terms
Map terms to sequences
Ease of this step depends on quality of
annotation
Again, Perl scripting could be useful
Again, sift results for biological relevance

12
Example: GO

Find your term of interest
E.g. purine biosynthesis
Find all descendants of this term
The GO path list is useful for this step
Map GO terms to sequences
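
The "find all descendants, then map terms to sequences" steps can be sketched as a graph traversal. The term hierarchy and protein names below are a toy example invented for illustration, not real GO content; in practice they would be loaded from the ontology and annotation files.

```python
from collections import deque

# Toy term hierarchy (parent -> children); illustrative only, not real GO data.
CHILDREN = {
    "purine biosynthesis": ["purine nucleotide biosynthesis",
                            "regulation of purine biosynthesis"],
    "purine nucleotide biosynthesis": ["AMP biosynthesis", "GMP biosynthesis"],
    "regulation of purine biosynthesis": [],
    "AMP biosynthesis": [],
    "GMP biosynthesis": [],
}

# Toy term -> protein annotations; the protein names are hypothetical.
TERM_TO_PROTEINS = {
    "purine nucleotide biosynthesis": ["protein_1"],
    "AMP biosynthesis": ["protein_2"],
    "GMP biosynthesis": ["protein_2", "protein_3"],
    "regulation of purine biosynthesis": ["protein_4"],
}


def descendants(term, children=CHILDREN):
    """Return every term reachable below `term` (breadth-first traversal)."""
    seen, queue = set(), deque([term])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def proteins_for(term, skip=lambda t: False):
    """Collect proteins annotated to `term` or any descendant not skipped."""
    terms = {term} | descendants(term)
    return sorted({p for t in terms if not skip(t)
                   for p in TERM_TO_PROTEINS.get(t, [])})
```

The optional `skip` predicate is one way to drop terms that are unlikely to be relevant, for example `proteins_for("purine biosynthesis", skip=lambda t: t.startswith("regulation of"))`.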

13
Why you need to sift results

One of the child terms of 'purine biosynthesis'
in GO is 'regulation of purine biosynthesis'
But, regulation typically involves proteins like
transcription factors, kinases, etc. - all of
which can regulate other things too!

14
Other classification schemes for genes and
pathways

GO/Reactome
KEGG
Kyoto Encyclopedia of Genes and Genomes
COG
Clusters of Orthologous Groups
E.C. numbers
E.C. = Enzyme Commission

15
Whole-genome

Start with a set of gene predictions
Use a gene prediction tool
E.g. for prokaryotic genomes: GLIMMER
E.g. for eukaryotic genomes: GENSCAN, SNAP
Convert into protein sequences
Scripting, scripting, scripting... Perl
Annotate each protein
C.f. by-gene approach (look for high-scoring
alignments to annotated protein database)

16
GENSCAN (Burge and Karlin, 1997)

Basic transcriptional, translational and splicing
signals
Donor and acceptor splice sites
Poly-adenylation signal
Length distributions and compositional features
of introns, exons and intergenic regions
Different parameterizations of the program for
regions of different GC-content

17
Homology-based gene prediction

Use statistical profiles of
Whole proteins, or
protein domains
Databases of such profiles exist
INTERPRO - includes PFAM, SMART, PROSITE...
These databases use different methods
For purposes of automation, may be easier to
settle on one method
Preferred method is Hidden Markov Models

18
Hidden Markov Model Profiles
Leucine-rich repeat
Like a weight-matrix profile, but also models the
position-specific probability of deleting or
inserting amino acids
19
Profile HMM databases

PFAM, SMART are databases of Profile HMMs
Both use the HMMER program
Can download entire HMM database and run a search
against your proteins
Each HMM corresponds to one protein domain
GeneWise program also allows you to run a protein
HMM against a DNA sequence (6-frame translation)
GO has mappings to profile databases like
InterPro (and hence PFAM/SMART) as well as protein
sequence databases like UniProt

20
Gene prediction tools

Many gene prediction programs are also based on
Hidden Markov Models
but a different kind of HMM
Not the profile HMMs used by HMMER
HMMs are essentially probabilistic regular
expressions or finite state machines
In gene prediction tools, they are used to
recognise statistical patterns e.g. hexamer
usage, specific motifs associated with introns,
etc.
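
To make the "probabilistic finite state machine" idea concrete, here is a sketch of the Viterbi algorithm on a two-state HMM. The states, probabilities and the GC-rich/AT-rich emission model are toy assumptions chosen for illustration; real gene finders use many more states and richer statistics such as hexamer usage.

```python
import math

# Toy model: N = non-coding (AT-rich), C = coding (GC-rich). Illustrative values only.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}


def viterbi(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return the most probable state path for a non-empty ACGT string."""
    scores = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []
    for sym in obs[1:]:
        new_scores, ptr = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this position.
            prev = max(states, key=lambda p: scores[p] + math.log(trans[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans[prev][s])
                             + math.log(emit[s][sym]))
            ptr[s] = prev
        scores = new_scores
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

For example, `"".join(viterbi("AAAAGGGG"))` labels the first four bases as N and the last four as C under these toy parameters.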

21
Phylogeny

Phylogenetic analysis is one approach to more
specific predictions
Idea is to infer a phylogenetic tree relating
your unknown proteins (X, Y, Z...) with annotated
proteins (A,B,C,D) in the same family

(Figure: phylogenetic tree with leaves, in order, A, X, B, C, Y, Z, D)
22
Why Phylogeny Matters
Different subfamilies of a particular protein
family can have quite different functions
(substrate specificities, variations in
structure, different binding pockets or
interaction domains, or completely new roles)
23
Experimental data

Evidence of transcription
Expressed Sequence Tags (ESTs)
Complementary DNA sequences (cDNAs)
Gene- or Genome-tiling microarrays
Other transcript-detecting nanotech
Evidence of function
E.g. experimental evidence that a bacterium needs
nucleotides in its diet

24
Genome-genome comparison

Increasingly popular as sequencing gets cheaper
Get a holistic view of how evolution has
progressed
Genome alignment approach
May need to make several local alignments due to
rearrangements
Analyse alignments for regions of high/low
conservation, characteristic of genes, etc.
Gene complement approach
Predict genes, annotate them, compare functions
OR
Predict genes, align to find closest homologues,
then assign and compare functions
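
For the genome alignment approach, "regions of high/low conservation" can be found with a simple sliding window over a pairwise alignment. This is a minimal sketch assuming two already-aligned rows of equal length with '-' for gaps; the window and step sizes are arbitrary defaults, and identity is only one of several possible conservation measures.

```python
def window_identity(row_a, row_b, window=10, step=5):
    """Return (start, fraction identical) for each window of an alignment.

    Columns containing a gap never count as identical.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    results = []
    for start in range(0, len(row_a) - window + 1, step):
        columns = zip(row_a[start:start + window], row_b[start:start + window])
        same = sum(1 for x, y in columns if x == y and x != "-")
        results.append((start, same / window))
    return results
```

Windows with unusually high identity are candidates for conserved features such as genes; the thresholds would need to be chosen per genome pair.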

25
Comparative genomics
VISTA genome conservation browser
26
Comparative genomics
MAUVE genome rearrangement browser
27
Comparative genomics
JGI IMG (Integrated Microbial Genomes) Portal
Abundance Profiles
28
Summary: pathway mining
(make gene predictions)
Identify biological processes
Gather representative protein sequences
Gather statistical profiles of protein domains
Search genome (or predicted protein products)
using Exonerate, GeneWise, BLAST, etc.
Post-process results - e.g. by visual inspection
or phylogenetic analysis
