Comparative Genomics - PowerPoint PPT Presentation

Provided by: michael325
Learn more at: http://www.mscs.mu.edu

Transcript and Presenter's Notes

Title: Comparative Genomics


1
Comparative Genomics
  • Tuesday, November 5, 2002
  • Michael Thomas

2
What can homology tell us?
  • The identity of the gene in another organism
  • The identity of nearby genes
  • The function of the gene (if annotated)
  • Suggestions of how the gene might be causing
    disease

3
Godzilla (VISTA)
  • Hypothesis: areas with high sequence similarity
    are likely to contain genetic elements
  • Godzilla automatically finds an orthologue for
    your input sequence and produces a VISTA plot
  • Example: Rat BAC ggqj (AC097115)
  • For alignment, uses the AVID program
  • Quickly aligns 100s of kb
  • Can handle sequence in draft format
  • Uses an HMM-like algorithm to find strong anchors
    from a collection of maximal matches
  • Uses the VISTA sequence alignment visualization
    tool
  • Allows easy visualization of areas with high
    similarity
  • Visualization is scalable: you can zoom
    in/out
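
The idea of plotting similarity along an alignment can be sketched in a few lines. This is only an illustration, not the VISTA or AVID code: it assumes the two sequences are already aligned (equal length, '-' for gaps), and the function names, the window size, and the 75% cutoff are assumptions chosen for the example.

```python
def window_identity(aligned_a, aligned_b, window=100, step=1):
    """Percent identity in sliding windows over a pairwise alignment.

    aligned_a and aligned_b are the two rows of an alignment (same length,
    '-' marks a gap). Returns a list of (window_start, percent_identity).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # 1 where the column is an identical, non-gap pair; 0 otherwise
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    profile = []
    for start in range(0, len(matches) - window + 1, step):
        identical = sum(matches[start:start + window])
        profile.append((start, 100.0 * identical / window))
    return profile


def conserved_windows(profile, threshold=75.0):
    """Keep only windows whose identity is at or above the threshold (assumed cutoff)."""
    return [(start, pct) for start, pct in profile if pct >= threshold]


if __name__ == "__main__":
    profile = window_identity("ACGTACGT", "ACGTTTTT", window=4, step=4)
    print(profile)
    print(conserved_windows(profile))
```

Windows that pass the cutoff are the candidate regions for conserved genetic elements; zooming in or out corresponds to changing the window and step.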

4
(No Transcript)
5
(No Transcript)
6
myGodzilla output
7
Gene: CARP (cardiac ankyrin repeat protein)
8
Genome Analysis
  • Traditional genetics: understand the role of a
    given gene and design experiments to test the
    prediction
  • Genomics: explain the role of and interactions
    among all genetic elements (and intergenic
    sequences) of an organism
  • Genomics: use bioinformatics tools and
    comparisons with other organisms to:
  • Reveal evolutionary relationships
  • Understand the function of genes and gene
    families
  • Understand the role of polymorphism
  • Understand the role of genes in disease and other
    complex traits
  • Understand the relationship between gene
    expression and phenotype
  • Etc., etc.

9
Why do we annotate genomes?
  • If we find a gene involved in a disease, we want
    to know what it's normally doing and how it's
    changed to become a disease gene
  • If we want to treat or cure a disease, we need to
    know the system that is perturbed by the disease
    gene - we might be able to treat other genes in
    the pathway
  • If we find the gene in a model organism (like the
    rat), then we need to know what the homolog is in
    humans
  • If we find the gene in a model organism, we need
    to know if it's doing the same thing in humans
  • If we DON'T know what gene is implicated in a
    disease, we can annotate ALL the genes in a
    region and find candidates for further study

10
Comparative Genomics
  • What makes one organism different from all other
    organisms?
  • Molecular biology
  • Physiology
  • Pathogenesis
  • Epidemiology
  • Genetics

11
Genome Informatics
  • Annotation and Analysis
  • Data Handling
  • Metabolic Reconstruction
  • Comparative Genomics
  • Functional Genomics

12
Sequence Annotation
  • ORF identification
  • Frameshift resolution
  • Genome map construction
  • Functional assignments
  • Metabolic pathway assignment
  • Metabolic pathway reconstruction
  • Comparative analysis

13
Annotation Problems
  • Problems with existing sequence databases
  • Incomplete datasets
  • Skewed datasets
  • Incorrectly annotated records
  • Annotations based on experimental vs. predicted
    data
  • Nomenclature differences
  • Transitive errors in gene function predictions
  • Functional predictions for hypothetical genes

14
Reminder: Orthologues and Paralogues
(Figure: globin gene tree along a time axis t with
ticks 0 to 3. Labels: early globin gene, first
duplication event, second duplication event, alpha
chain, beta chain, frog alpha, human alpha, frog
beta, human beta, orthologues, paralogues.)
15
M. genitalium - M. pneumoniae Gene Order
(Figure axes: M. genitalium gene position,
M. pneumoniae gene position)
16
M. genitalium - U. urealyticum Gene Order
(Figure axes: M. genitalium gene position,
U. urealyticum gene position)
17
All-to-all self-comparisons
  • Used to identify gene families
  • Protein alignments involving all proteins in a
    single organism (the proteome)
  • High scoring hits with the same domain structure
    are most likely paralogues
  • High scoring hits with slightly different domain
    structures may be paralogues, but it is difficult
    to tell due to common, conserved domains that have
    complicated histories
  • Cluster analysis can help sort this out
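
One simple form of cluster analysis on all-to-all hits is single-linkage clustering: any two proteins joined by a significant hit end up in the same family. The sketch below is a minimal illustration, assuming hits are given as (query, subject, E-value) tuples; the function name and the 1e-5 cutoff are assumptions, not values from the slides.

```python
from collections import defaultdict


def cluster_families(hits, max_evalue=1e-5):
    """Single-linkage clustering of all-to-all protein hits.

    hits: iterable of (query_id, subject_id, evalue) tuples.
    Returns a list of families (sorted lists of protein ids), largest first.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        root_q, root_s = find(query), find(subject)  # registers both ids
        if query != subject and evalue <= max_evalue:
            parent[root_q] = root_s  # merge the two families

    families = defaultdict(set)
    for protein in parent:
        families[find(protein)].add(protein)
    return sorted(
        (sorted(members) for members in families.values()),
        key=lambda members: (-len(members), members),
    )


if __name__ == "__main__":
    example_hits = [
        ("p1", "p2", 1e-30),
        ("p2", "p3", 1e-12),
        ("p4", "p4", 0.0),   # self hit, ignored for merging
        ("p4", "p5", 0.5),   # too weak to join p4 and p5
    ]
    print(cluster_families(example_hits))
```

Single linkage is the easiest rule to state, but it chains proteins together through any shared domain, which is exactly the multi-domain problem noted above; stricter rules (for example, requiring matching domain structure) would split such chains.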

18
Between-proteome comparisons
  • Used to identify orthologues
  • Protein alignments involving a search of one
    protein from species A against the proteome of a
    species B
  • High scoring reciprocal best hits with the same
    domain structure are most likely orthologues
  • share common ancestry
  • likely to have the same function
  • Function likely to be more essential
    (replication, etc)
  • Genes are not unique to either organism
  • E-value should be <0.01 and the alignment should
    stretch over >60% of each protein
  • High scoring hits with slightly different domain
    structures may be homologous, but it is difficult
    to tell due to common, conserved domains that have
    complicated histories
  • Cluster analysis can help sort this out
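
The reciprocal-best-hit rule can be sketched as follows. This is an illustration under stated assumptions: the E-value cutoff is read as an upper bound of 0.01, the coverage cutoff as more than 60% of each protein, and the `Hit` record with its field names is hypothetical rather than the output format of any particular search tool.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    bitscore: float
    query_cov: float    # fraction of the query covered by the alignment
    subject_cov: float  # fraction of the subject covered by the alignment


def best_hits(hits, max_evalue=0.01, min_coverage=0.60):
    """Return {query: subject} for the highest-scoring hit that passes the filters."""
    best = {}
    for hit in hits:
        if hit.evalue >= max_evalue:
            continue
        if min(hit.query_cov, hit.subject_cov) <= min_coverage:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, **filters):
    """Pairs (a, b) where b is a's best hit in proteome B and a is b's best hit in A."""
    a_best = best_hits(a_vs_b, **filters)
    b_best = best_hits(b_vs_a, **filters)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


if __name__ == "__main__":
    a_vs_b = [
        Hit("a1", "b1", 1e-50, 200, 0.90, 0.95),
        Hit("a1", "b2", 1e-10, 80, 0.70, 0.70),
        Hit("a2", "b2", 1e-20, 120, 0.80, 0.85),
        Hit("a3", "b3", 1e-40, 150, 0.40, 0.90),  # fails the coverage filter
    ]
    b_vs_a = [
        Hit("b1", "a1", 1e-48, 198, 0.95, 0.90),
        Hit("b2", "a1", 1e-10, 80, 0.70, 0.70),
        Hit("b2", "a2", 1e-20, 120, 0.85, 0.80),
        Hit("b3", "a3", 1e-40, 150, 0.90, 0.40),  # fails the coverage filter
    ]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

The sketch does not check domain structure; that comparison would be a further filter applied to the pairs it returns.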

19
Worm v. fly sequences
20
The Virtual Comparative Map (VCMap) Project
Goals for annotation of genome sequence:
  1. Homology analysis
  2. EST prediction
  3. UniGene/EST scoring
  4. Visualization
  5. Clickable image map generation
21
Rat-Human-Mouse VCMap
Virtual map with the rat backbone in the middle
and the human and mouse maps on either side. Red
markers in the rat map are RH framework markers.
Blue numbers in the human and mouse maps are
chromosome numbers. The conserved regions in the
rat map are highlighted in the same colors as in
the human and mouse maps.
22
Query Result for Psma3
23
Anchors between 480 cR and 570 cR in rat chr.06
24
Predictions between 480 cR and 570 cR in rat
chr.06
25
VCMap with QTL
Virtual comparative map with rat-human virtual
map (left 2), rat RH map (red) and QTL map (blue).
A phenotype of interest (RF_HS_24HR_CLEAR_CREAT)
is highlighted. Following the flanking markers
(D17Rat9 and D17Rat16), a conserved region on
human chromosome 6 was found.
26
Bin Prediction
A 'bin' prediction is made by predicting the
location of the homologous UniGenes of the
backbone species, based upon the mapped borders
(breakpoints) of the homologous region of the
reference (non-backbone) species. The position of
the predicted UniGene is defined by the first
flanking UniGene anchor and its position, and the
second flanking UniGene anchor and its position.
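
A rough sketch of the bin idea: find the two anchors that flank the gene in the reference species and report the interval between their positions on the backbone map. This is not the VCMap implementation; the `Anchor` record, the function name, the anchor names, and the positions are all hypothetical, and the sketch assumes every anchor lies within one conserved segment.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    name: str
    backbone_pos: float   # position on the backbone map (e.g. cR on a rat RH map)
    reference_pos: float  # position of the homologue in the reference species


def predict_bin(anchors, gene_reference_pos):
    """Return (first_anchor, second_anchor, low, high) for the predicted bin.

    The bin is the backbone interval between the two anchors that flank the
    gene in the reference species; None if the gene is not flanked.
    """
    ordered = sorted(anchors, key=lambda anchor: anchor.reference_pos)
    for first, second in zip(ordered, ordered[1:]):
        if first.reference_pos <= gene_reference_pos <= second.reference_pos:
            low, high = sorted((first.backbone_pos, second.backbone_pos))
            return (first.name, second.name, low, high)
    return None


if __name__ == "__main__":
    anchors = [
        Anchor("anchorA", 480.0, 10.0),
        Anchor("anchorB", 570.0, 25.0),
        Anchor("anchorC", 520.0, 18.0),
    ]
    print(predict_bin(anchors, 20.0))
```

The bin is only as narrow as the spacing of the flanking anchors, so denser anchors give tighter predictions.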
27
Stream prediction
A 'stream' prediction, either upstream or
downstream of the gene, is based upon the
breakpoints in the backbone species and the
position of the homologous UniGene in the
reference (non-backbone) species. Grl and
Rn.12583 (red) in the rat backbone are homologous
to the syntenic region borders (NR3C1 and
Hs.75639) on human Chr. 5. The predicted streams
Rn.52607 and Rn.18446 are homologous to Hs.172148
and KIAA0438, which lie downstream/upstream of the
nearest syntenic anchors (NR3C1 and Hs.75639). If
the prediction is correct, the breakpoints and
syntenic region will be extended to the new anchors
(Rn.52607/Hs.172148, Rn.18446/KIAA0438).
28
What's missing?
What is the gene doing? What systems or pathways
is the gene involved with? How is the gene
causing a disease?
29
Top twenty families from Pfam - the Protein
Family Database

Name          acc number  seed  full   av. len  av. id  structure  Description
GP120         PF00516     24    24877  148 aa   54      1gc1       Envelope glycoprotein GP120
zf-C2H2       PF00096     197   14973  23 aa    35      1zaa       Zinc finger, C2H2 type
LRR           PF00560     3732  12110  23 aa    27      1bnh       Leucine Rich Repeat
rvt           PF00078     179   11488  160 aa   70      1hmv       Rev. transcriptase (DNA polymerase)
cytochrome_b  PF00033     9     10802  151 aa   67      3bcc       Cytochrome b(N-terminal)/b6/petB
rvp           PF00077     53    10526  95 aa    87      1ida       Retroviral aspartyl protease
WD40          PF00400     1930  9117   37 aa    19      1gp2       WD domain, G-beta repeat
ig            PF00047     113   8581   63 aa    19      8fab       Immunoglobulin domain
ank           PF00023     1219  6452   31 aa    26      1awc       Ankyrin repeat
COX1          PF00115     24    6421   228 aa   46      1occ       Cyt. C Quinol oxidase polypeptide I
pkinase       PF00069     67    6334   247 aa   23      1apm       Protein kinase domain
RuBisCO       PF00016     17    6076   290 aa   83      3rub       Ribulose bisphosphate carboxylase
Cytochrom.bC  PF00032     10    6020   89 aa    74      1bcc       Cytochrome b(C-terminal)/b6/petD
RuBisCO_N     PF02788     17    5909   121 aa   86      3rub       Ribulose bisphosphate carboxylase N
PPR           PF01535     563   5679   32 aa    19      -          PPR repeat
oxidored_q1   PF00361     33    5678   218 aa   32      -          NADH-Ubiquinone/plastoquinone
EGF           PF00008     87    5086   33 aa    34      1apo       EGF-like domain
HCV_NS1       PF01560     10    4639   75 aa    47      -          Hepatitis C virus non-structural protein E2/NS1
ABC_tran      PF00005     63    4338   183 aa   26      1b0u       ABC transporter
efhand        PF00036     972   4213   28 aa    24      1osa       EF hand

  • There are over 3360 protein families in Pfam
  • Pfam is searched by the CDD
  • Pfam provides information about protein function

30
A disease gene involving a metabolic pathway
Phenylketonuria
  • Disease is identified, mapped, cloned
  • Heritable component
  • One gene out of dozens involved with this pathway
  • All genes in the pathway are expressed in similar
    tissues at similar times
  • What does this have to do with annotation??

31
(No Transcript)
32
Metabolic Pathway Reconstruction
  • Role assignment
  • Extract metabolic pathways from genomes
  • Navigation and analysis
  • Pathway editing

33
Evaluating Gene Prediction Engines
  • Sensitivity
  • How many exons or genes were correctly
    identified?
  • TP/(TP+FN)
  • For exons, it ranges from 0.27 to 0.70
  • For genes it ranges from 0.02 to 0.40
  • Specificity
  • How many exon/gene predictions are true?
  • TP/(TP+FP)
  • For exons it ranges from 0.29 to 0.57
  • For genes it ranges from 0.05 to 0.30
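
The two measures can be computed as below, taking the formulas as TP/(TP+FN) for sensitivity and TP/(TP+FP) for specificity. The sketch assumes features are compared as exact (start, end) pairs, so a prediction counts as a true positive only if both boundaries match; the function name and the example coordinates are made up for illustration.

```python
def evaluate_predictions(predicted, actual):
    """Return (sensitivity, specificity) for a set of predicted features.

    predicted, actual: iterables of hashable features, e.g. (start, end) exons.
    Sensitivity = TP / (TP + FN); specificity = TP / (TP + FP), as on the slide.
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # predicted and real
    fn = len(actual - predicted)   # real but missed
    fp = len(predicted - actual)   # predicted but not real
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    actual_exons = {(100, 200), (300, 400), (500, 600), (700, 800)}
    predicted_exons = {(100, 200), (300, 400), (900, 950)}
    sn, sp = evaluate_predictions(predicted_exons, actual_exons)
    print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

In this example two of four real exons are found (sensitivity 0.50) and two of three predictions are real (specificity about 0.67).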

34
Comparison of 2 knowledge-based gene prediction
engines: Grail, GenScan
  • GenScan
  • Hidden Markov Model (HMM)
  • Predicts entire gene structure, not just features
  • With any single gene prediction engine,
    predictive ability is poor
  • Grail
  • Neural net
  • Predicts gene features
  • Codon usage
  • Base composition
  • Splice site characteristics
  • PolyA signals
  • Di-, tri-, hexa-nucleotide frequencies
  • Translation signals
  • Transcription signals
  • Size distributions

35
MetaGene: Combining the results of several gene
prediction engines
These lines make it easier to visually compare
the engines and predicted features. These lines
can be turned off or on
The accuracy of prediction at the nucleotide level,
if known features are available
Drag this handle to change the threshold
The predicted features
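
The slides do not say how MetaGene combines the engines, so the following is only one plausible way to do it: count, for each nucleotide, how many engines predict a feature there, and keep the stretches where the count reaches a threshold. The function name, the engine names, and the half-open (start, end) interval convention are assumptions for the example.

```python
def consensus_regions(predictions, length, threshold):
    """Regions where at least `threshold` engines predict a feature.

    predictions: dict mapping engine name -> list of (start, end) intervals,
                 half-open, in sequence coordinates 0..length.
    Returns a list of (start, end) consensus intervals.
    """
    votes = [0] * length
    for intervals in predictions.values():
        covered = set()  # each engine votes at most once per nucleotide
        for start, end in intervals:
            covered.update(range(max(start, 0), min(end, length)))
        for pos in covered:
            votes[pos] += 1

    regions, region_start = [], None
    for pos, count in enumerate(votes):
        if count >= threshold and region_start is None:
            region_start = pos
        elif count < threshold and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, length))
    return regions


if __name__ == "__main__":
    example = {
        "engine_a": [(10, 30)],
        "engine_b": [(20, 40)],
        "engine_c": [(25, 35), (50, 60)],
    }
    for threshold in (1, 2, 3):
        print(threshold, consensus_regions(example, 70, threshold))
```

Raising the threshold keeps only the nucleotides most engines agree on, which mirrors dragging a threshold handle: fewer predicted features, but each backed by more engines.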
36
Structure and transcription of a Eukaryotic gene