Genome Annotation and Pathway Mining

Provided by: ianho9

1
Genome Annotation and Pathway Mining
  • BioE131

2
Genome Annotation
  • Various related (but distinct) questions
  • Does genome contain a homologue of gene X?
    (gene-by-gene)
  • Does genome contain homologues of genes
    involved in pathway X? (pathway mining)
  • What genes are there? (whole-genome)
  • What genes are being transcribed? (experimental)

3
By-gene
  • Look for (high-scoring) alignments of protein to
    genome
  • Various tools for doing this:
    • TBLASTN - aligns a protein query to DNA translated
      in 6 frames (limited handling of gaps)
    • GeneWise - allows frameshifts and introns
      (better handling of gaps)
    • Exonerate - GeneWise replacement
      (best handling of gaps)
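
The 6-frame translation step can be sketched in a few lines of Python. This is only an illustration of the idea, not how any BLAST program is implemented. The function names are invented for this example and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Map frame labels ('+1'..'+3', '-1'..'-3') to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six protein strings can then be compared to the query protein. Stop codons appear as `*`.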

4
How Exonerate works
Finds High-scoring Segment Pairs (HSPs)
Uses dynamic programming to fill in the gaps between
HSPs
Exonerate's scoring scheme uses finite state
machine theory
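
To make the idea of an HSP concrete, here is a deliberately simplified sketch: exact word matches are used as seeds and extended without gaps while the residues stay identical. This is an assumption-laden toy, not Exonerate's actual heuristic (real tools score substitutions and stop on a score drop-off); `find_hsps` and its parameter `k` are names made up for this example.

```python
def find_hsps(query, target, k=4):
    """Return sorted (query_start, target_start, length) ungapped exact matches."""
    # Index every k-letter word of the target by its start position.
    index = {}
    for j in range(len(target) - k + 1):
        index.setdefault(target[j:j + k], []).append(j)

    hsps = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            # Extend the seed to the left along the same diagonal.
            qs, ts = i, j
            while qs > 0 and ts > 0 and query[qs - 1] == target[ts - 1]:
                qs -= 1
                ts -= 1
            # Extend the seed to the right.
            qe, te = i + k, j + k
            while qe < len(query) and te < len(target) and query[qe] == target[te]:
                qe += 1
                te += 1
            hsps.add((qs, ts, qe - qs))
    return sorted(hsps)
```

The regions between consecutive HSPs are then small enough to be joined by full dynamic programming.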
5
Finite state machines
  • Most dynamic programming algorithms for
    pattern-matching or alignment can be formulated
    as finite-state machines
  • E.g. the regular expression for matching the MCB
    binding site: /ACGCGT/ (equivalent to
    /.*ACGCGT.*/)
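
The link between a regular expression and a finite-state machine can be shown with the ACGCGT pattern. The sketch below uses Python's `re` module for the regular expression and, next to it, a hand-built deterministic automaton whose state is "how many letters of the pattern have been matched so far". The helper names (`build_dfa`, `dfa_matches`, `find_mcb_sites`) are invented for this illustration.

```python
import re

MCB = re.compile("ACGCGT")


def find_mcb_sites(dna):
    """Start positions of non-overlapping ACGCGT matches, via the regex engine."""
    return [m.start() for m in MCB.finditer(dna)]


def build_dfa(pattern, alphabet="ACGT"):
    """Transition table: table[state][symbol] -> next state."""
    table = []
    for state in range(len(pattern) + 1):
        row = {}
        for symbol in alphabet:
            seen = pattern[:state] + symbol
            # Next state = longest pattern prefix that is a suffix of what was seen.
            k = min(len(pattern), len(seen))
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[symbol] = k
        table.append(row)
    return table


def dfa_matches(dna, pattern="ACGCGT"):
    """True if the automaton reaches its accepting state while reading dna."""
    table = build_dfa(pattern)
    state = 0
    for base in dna:
        state = table[state].get(base, 0)  # unknown letters reset the machine
        if state == len(pattern):
            return True
    return False
```

Both routes give the same yes/no answer for any DNA string; the automaton just makes the states explicit.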

6
Finite state machines
Pairwise alignment scoring schemes can also be
specified as finite-state machines
This one is Needleman-Wunsch
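
As a concrete counterpart to the state-machine view, the Needleman-Wunsch recurrence can be written directly as dynamic programming. A minimal score-only sketch follows; the scoring parameters (match, mismatch, gap) are arbitrary example values, not values from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per letter.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # letter of a against a gap
            left = score[i][j - 1] + gap    # letter of b against a gap
            score[i][j] = max(diag, up, left)
    return score[-1][-1]
```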
7
Finite state machines
Adding padding states to Needleman-Wunsch gives
us Smith-Waterman
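
In dynamic-programming terms, the padding states correspond to letting the alignment start and end anywhere: scores are never allowed to drop below zero, and the answer is the best cell rather than the corner cell. A minimal score-only sketch, again with example scoring values that are assumptions:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 option lets a new local alignment start at this cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```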
8
Finite state machines
Adding a state to track whether the last column
used a gap gives us Gotoh
GeneWise and Exonerate add extra states to track
frameshifts, introns, etc.
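
The extra "was the last column a gap?" state is what makes affine gap penalties possible: opening a gap can cost more than extending one. A minimal score-only sketch of a Gotoh-style global alignment with three score tables, one per state. The penalty values are illustrative assumptions, with `gap_open` charged for the first gapped column and `gap_extend` for each further one.

```python
NEG_INF = float("-inf")


def gotoh(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best global alignment score with affine gap penalties."""
    n, m = len(a), len(b)
    # M: last column is a match/mismatch; X: letter of a vs gap; Y: gap vs letter of b.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these example values a single gap of length two costs 4, whereas two separate gaps of length one would cost 6, so the model prefers one longer gap.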
9
Pathway mining
  • Gather representative protein sequences for
    enzymes in your pathway
  • Repeat the by-gene analysis for each gene
  • Small numbers of genes: can do manually, e.g.
    using remote web-based services
  • Larger numbers of genes: use local tools,
    scripting (e.g. Perl)
  • Sift results for biological relevance
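
The scripted version of this loop might look like the sketch below, written in Python rather than Perl. It assumes search hits are already available as tab-separated text in the standard 12-column BLAST tabular layout (query, subject, % identity, length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The gene names, contig names and numbers in `EXAMPLE` are invented, and the E-value cutoff is an arbitrary example.

```python
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-5):
    """Best-scoring hit per query protein, ignoring hits above the E-value cutoff."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(COLUMNS, line.split("\t")))
        evalue, bitscore = float(hit["evalue"]), float(hit["bitscore"])
        if evalue > max_evalue:
            continue
        query = hit["qseqid"]
        if query not in best or bitscore > best[query][1]:
            best[query] = (hit["sseqid"], bitscore, evalue)
    return best


def missing_enzymes(pathway, hits):
    """Pathway members with no acceptable hit; candidates for manual follow-up."""
    return [enzyme for enzyme in pathway if enzyme not in hits]


# Invented example rows, for illustration only.
EXAMPLE = "\n".join("\t".join(row) for row in [
    ["ADE1", "contig_7", "62.5", "300", "100", "3", "1", "300", "5000", "5900", "1e-90", "310"],
    ["ADE1", "contig_2", "30.0", "60", "40", "2", "10", "70", "100", "280", "0.002", "45"],
    ["ADE2", "contig_3", "48.0", "200", "95", "4", "1", "200", "900", "1500", "1e-40", "150"],
    ["ADE2", "contig_9", "55.0", "250", "100", "5", "1", "250", "10", "760", "1e-60", "210"],
    ["ADE4", "contig_1", "25.0", "40", "30", "0", "5", "45", "30", "150", "3.0", "22"],
])
```

Calling `missing_enzymes(["ADE1", "ADE2", "ADE4"], best_hits(EXAMPLE))` lists the pathway members that still need attention. The automated filter does not replace the final sifting for biological relevance.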

10
Pathway mining
  • E.g. "Can my organism synthesize purine?"
  • Diagram shows purine biosynthesis and salvage in
    yeast

11
How to gather protein search set?
  • Pick your pathway description resource (GO,
    MetaCyc, EcoCyc, etc.)
    • Use the most appropriate for the task at hand
    • GO is best for eukaryotes, so-so for prokaryotes
    • You may need more than one
  • Find all relevant nodes/terms
  • Map terms to sequences
    • Ease of this step depends on quality of
      annotation
    • Again, Perl scripting could be useful
  • Again, sift results for biological relevance

12
Example: GO
  • Find your term of interest
    • E.g. "purine biosynthesis"
  • Find all descendants of this term
    • The GO path list is useful for this step
  • Map GO terms to sequences
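
The "find all descendants, then map to sequences" steps amount to a graph traversal followed by a lookup. A minimal sketch, assuming the ontology has already been loaded into a plain parent-to-children dictionary. The tiny `CHILDREN` and `ANNOTATIONS` tables below are invented toy data, not real GO content, and the protein names are placeholders.

```python
from collections import deque


def descendants(children, root):
    """All terms reachable from root by following parent -> child links."""
    seen = set()
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:  # terms can have several parents
                seen.add(child)
                queue.append(child)
    return seen


def proteins_for_terms(terms, annotations):
    """Union of the proteins annotated to any of the given terms."""
    return {protein for term in terms for protein in annotations.get(term, ())}


# Toy data for illustration only.
CHILDREN = {
    "purine biosynthesis": ["purine nucleotide biosynthesis",
                            "regulation of purine biosynthesis"],
    "purine nucleotide biosynthesis": ["IMP biosynthesis", "AMP biosynthesis"],
}
ANNOTATIONS = {
    "IMP biosynthesis": ["enzymeA", "enzymeB"],
    "AMP biosynthesis": ["enzymeC"],
    "regulation of purine biosynthesis": ["regulatorR"],
}
```

`proteins_for_terms(descendants(CHILDREN, "purine biosynthesis"), ANNOTATIONS)` then gives the candidate search set, which still includes the regulator.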

13
Why you need to sift results
  • One of the child terms of "purine biosynthesis"
    in GO is "regulation of purine biosynthesis"
  • But regulation typically involves proteins like
    transcription factors, kinases, etc. - all of
    which can regulate other things too!
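
One crude way to start the sifting is to set aside terms whose names mark them as regulatory, so they can be reviewed by hand instead of being searched blindly. The name-prefix rule below is a heuristic assumed for this sketch, not a GO feature, and `split_regulatory` is an invented helper name.

```python
REGULATION_PREFIXES = ("regulation of",
                       "positive regulation of",
                       "negative regulation of")


def split_regulatory(term_names):
    """Split term names into (keep, review) lists based on their prefix."""
    keep, review = [], []
    for name in term_names:
        if name.lower().startswith(REGULATION_PREFIXES):
            review.append(name)
        else:
            keep.append(name)
    return keep, review
```

Terms in the review list are not necessarily wrong, but hits to their proteins say little about whether the pathway itself is present.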

14
Other classification schemes for genes and
pathways
  • GO/Reactome
  • KEGG
    • Kyoto Encyclopedia of Genes and Genomes
  • COG
    • Clusters of Orthologous Groups
  • E.C. numbers
    • E.C. = Enzyme Commission

15
Whole-genome
  • Start with a set of gene predictions
    • Use a gene prediction tool
    • E.g. for prokaryotic genomes: GLIMMER
    • E.g. for eukaryotic genomes: GENSCAN, SNAP
  • Convert into protein sequences
    • Scripting, scripting, scripting (e.g. Perl)
  • Annotate each protein
    • Cf. by-gene approach (look for high-scoring
      alignments to annotated protein database)
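
To give a feel for the first step, here is a naive open-reading-frame scan: an ATG followed by an in-frame stop codon. It is a stand-in for illustration only. Real predictors such as GLIMMER or GENSCAN use statistical models, and this sketch looks at the forward strand only; `find_orfs` and `min_codons` are names invented for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Forward-strand ORFs as (start, end) with end exclusive, stop codon included."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon since the last stop in this frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

Each predicted interval would then be translated to protein and annotated by alignment, as in the by-gene approach.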

16
GENSCAN (Burge & Karlin, 1997)
  • Basic transcriptional, translational and splicing
    signals
    • Donor and acceptor splice sites
    • Poly-adenylation signal
  • Length distributions and compositional features
    of introns, exons and intergenic regions
  • Different parameterizations of the program for
    regions of different GC content

17
Homology-based gene prediction
  • Use statistical profiles of
    • whole proteins, or
    • protein domains
  • Databases of such profiles exist
    • InterPro - includes PFAM, SMART, PROSITE, etc.
  • These databases use different methods
    • For purposes of automation, may be easier to
      narrow down on one method
    • Preferred method is Hidden Markov Models

18
Hidden Markov Model Profiles
Leucine-rich repeat
Like a weight-matrix profile, but also models the
position-specific probability of deleting or
inserting amino acids
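
The weight-matrix half of that comparison is easy to sketch: count residues per alignment column, add pseudocounts, and score a candidate by summing log-odds against a uniform background. A profile HMM would add insert and delete states on top of this, which the sketch does not attempt. The three-column toy alignment, the pseudocount and the uniform background are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """One dict of log2-odds scores per column of an ungapped alignment."""
    background = 1.0 / len(alphabet)
    total = len(aligned) + pseudocount * len(alphabet)
    pwm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        pwm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                    for aa in alphabet})
    return pwm


def score(pwm, seq):
    """Sum of position-specific scores; seq must be as long as the matrix."""
    return sum(column[aa] for column, aa in zip(pwm, seq))


# Toy alignment for illustration only.
TOY_PWM = build_pwm(["LKL", "LRL", "LKI"])
```

Sequences resembling the alignment, such as `"LKL"`, score above zero, while unrelated ones such as `"GGG"` score below it.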
19
Profile HMM databases
  • PFAM, SMART are databases of Profile HMMs
  • Both use the HMMER program
  • Can download entire HMM database and run a search
    against your proteins
  • Each HMM corresponds to one protein domain
  • GeneWise program also allows you to run a protein
    HMM against a DNA sequence (6-frame translation)
  • GO has mappings to profile databases like
    InterPro (and hence PFAM/SMART) as well as protein
    sequence databases like UniProt

20
Gene prediction tools
  • Many gene prediction programs are also based on
    Hidden Markov Models
  • but a different kind of HMM
  • Not the profile HMMs used by HMMER
  • HMMs are essentially probabilistic regular
    expressions or finite state machines
  • In gene prediction tools, they are used to
    recognise statistical patterns e.g. hexamer
    usage, specific motifs associated with introns,
    etc.
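
The "probabilistic finite state machine" view can be illustrated with a two-state toy HMM that labels each base as lying in an AT-rich or a GC-rich region, decoded with the Viterbi algorithm. The states, probabilities and names below are invented for this sketch and are far simpler than any real gene-finder model.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable state path for the observed symbols (log-space Viterbi)."""
    scores = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this position.
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans[prev][s])
                          + math.log(emit[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy parameters for illustration only.
STATES = ["AT_rich", "GC_rich"]
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
         "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
EMIT = {"AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}
```

A real gene predictor uses the same machinery with states for exons, introns, splice sites and so on, and with emissions that capture patterns such as hexamer usage.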

21
Phylogeny
  • Phylogenetic analysis is one approach to more
    specific predictions
  • Idea is to infer a phylogenetic tree relating
    your unknown proteins (X, Y, Z, ...) with annotated
    proteins (A, B, C, D) in the same family

[Figure: phylogenetic tree with leaves A, X, B, C, Y, Z, D]
22
Why Phylogeny Matters
Different subfamilies of a particular protein
family can have quite different functions
(substrate specificities, variations in
structure, different binding pockets or
interaction domains, or completely new roles)
23
Experimental data
  • Evidence of transcription
    • Expressed Sequence Tags (ESTs)
    • Complementary DNA sequences (cDNAs)
    • Gene- or genome-tiling microarrays
    • Other transcript-detecting nanotech
  • Evidence of function
    • E.g. experimental evidence that a bacterium needs
      nucleotides in its diet

24
Genome-genome comparison
  • Increasingly popular as sequencing gets cheaper
  • Get a holistic view of how evolution has
    progressed
  • Genome alignment approach
    • May need to make several local alignments due to
      rearrangements
    • Analyse alignments for regions of high/low
      conservation, characteristic of genes, etc.
  • Gene complement approach
    • Predict genes, annotate them, compare functions
      OR
    • Predict genes, align to find closest homologues,
      then assign and compare functions
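
Once genes have been predicted and annotated, the first variant of the gene complement approach reduces to set comparisons over the assigned functions. A minimal sketch, assuming each genome is represented as a gene-to-function dictionary; the gene identifiers and function labels in the toy data are invented.

```python
def compare_complements(genome_a, genome_b):
    """Functions shared by both genomes and functions unique to each."""
    funcs_a, funcs_b = set(genome_a.values()), set(genome_b.values())
    return {
        "shared": sorted(funcs_a & funcs_b),
        "only_a": sorted(funcs_a - funcs_b),
        "only_b": sorted(funcs_b - funcs_a),
    }


# Toy annotations for illustration only.
GENOME_A = {"a001": "purine biosynthesis", "a002": "glycolysis", "a003": "glycolysis"}
GENOME_B = {"b001": "glycolysis", "b002": "nitrogen fixation"}
```

A function that appears only in one genome is a hypothesis to check, since it may reflect a missed gene prediction rather than a real difference.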

25
Comparative genomics
VISTA genome conservation browser
26
Comparative genomics
MAUVE genome rearrangement browser
27
Comparative genomics
JGI IMG (Integrated Microbial Genomes) Portal
Abundance Profiles
28
Summary: pathway mining
  • (Make gene predictions)
  • Identify biological processes
  • Gather representative protein sequences
  • Gather statistical profiles of protein domains
  • Search genome (or predicted protein products)
    using Exonerate, GeneWise, BLAST, etc.
  • Post-process results - e.g. by visual inspection
    or phylogenetic analysis