Title: Comparative Genomics
1Comparative Genomics
- Tuesday, November 5, 2002
- Michael Thomas
2What can homology tell us?
- The identity of the gene in another organism
- The identity of nearby genes
- The function of the gene (if annotated)
- Suggestions of how the gene might be causing
disease
3Godzilla (VISTA)
- Hypothesis: areas with high sequence similarity are likely to contain genetic elements
- Godzilla automatically finds an orthologue for your input sequence and produces a VISTA plot
- Example: Rat BAC ggqj (AC097115)
- For alignment, uses the AVID program
- Quickly aligns 100s of kb
- Can handle sequence in draft format
- Uses an HMM-like algorithm to find strong anchors from a collection of maximal matches
- Uses the VISTA sequence alignment visualization tool
- Allows easy visualization of areas with high similarity
- Visualization is scalable: allows you to zoom in/out
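The idea of scanning an alignment for areas of high similarity can be sketched as a sliding-window percent-identity calculation. This is only an illustration, not the AVID or VISTA implementation: the function names, the window and step sizes, and the 75% threshold are assumptions chosen for the example.

```python
def window_identity(aln_a, aln_b, window=100, step=20):
    """Percent identity of two aligned sequences in sliding windows.

    aln_a and aln_b are rows of a pairwise alignment (same length,
    '-' for gaps). Gap columns are counted as mismatches.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    if not aln_a:
        return []
    points = []
    last_start = max(len(aln_a) - window, 0)
    for start in range(0, last_start + 1, step):
        a = aln_a[start:start + window]
        b = aln_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        points.append((start, 100.0 * matches / len(a)))
    return points


def conserved_windows(points, threshold=75.0):
    """Window starts whose identity meets the (assumed) threshold."""
    return [start for start, pct in points if pct >= threshold]


if __name__ == "__main__":
    # Toy alignment; real input would be hundreds of kb.
    pts = window_identity("ACGTACGTAC", "ACGTACGTAA", window=4, step=2)
    for start, pct in pts:
        print(start, pct)
```

Plotting the returned points with window start on the x-axis and percent identity on the y-axis gives a curve of the same general kind as a VISTA plot.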
6myGodzilla output
7Gene: CARP (cardiac ankyrin repeat protein)
8Genome Analysis
- Traditional genetics: understand the role of a given gene and design experiments to test the prediction
- Genomics: explain the role of, and interactions among, all genetic elements (and intergenic sequences) of an organism
- Genomics: use bioinformatics tools and comparisons with other organisms to:
- Reveal evolutionary relationships
- Understand the function of genes and gene families
- Understand the role of polymorphism
- Understand the role of genes in disease and other complex traits
- Understand the relationship between gene expression and phenotype
- Etc.
9Why do we annotate genomes?
- If we find a gene involved in a disease, we want to know what it normally does and how it has changed to become a disease gene
- If we want to treat or cure a disease, we need to know the system that is perturbed by the disease gene; we might be able to target other genes in the pathway
- If we find the gene in a model organism (like the rat), then we need to know what the homolog is in humans
- If we find the gene in a model organism, we need to know whether it is doing the same thing in humans
- If we DON'T know what gene is implicated in a disease, we can annotate ALL the genes in a region and find candidates for further study
10Comparative Genomics
- What makes one organism different from all other organisms?
- Molecular biology
- Physiology
- Pathogenesis
- Epidemiology
- Genetics
11Genome Informatics
- Annotation and Analysis
- Data Handling
- Metabolic Reconstruction
- Comparative Genomics
- Functional Genomics
12Sequence Annotation
- ORF identification
- Frameshift resolution
- Genome map construction
- Functional assignments
- Metabolic pathway assignment
- Metabolic pathway Reconstruction
- Comparative analysis
13Annotation Problems
- Problems with existing sequence databases
- Incomplete datasets
- Skewed datasets
- Incorrectly annotated records
- Annotations based on experimental vs. predicted data
- Nomenclature differences
- Transitive errors in gene function predictions
- Functional predictions for hypothetical genes
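14Reminder: Orthologues and Paralogues
[Figure: globin gene tree drawn along a time axis t (0 to 3). An early globin gene gives rise, through the first and second duplication events, to the alpha chain (frog alpha, human alpha) and the beta chain (frog beta, human beta). Frog alpha and human alpha are marked as orthologues; human alpha and human beta are marked as paralogues.]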
15M. genitalium - M. pneumoniae Gene Order
[Plot: M. genitalium gene position vs. M. pneumoniae gene position]
16M. genitalium - U. urealyticum Gene Order
[Plot: M. genitalium gene position vs. U. urealyticum gene position]
17All-to-all self-comparisons
- Used to identify gene families
- Protein alignments involving all proteins in a single organism (the proteome)
- High-scoring hits with the same domain structure are most likely paralogues
- High-scoring hits with slightly different domain structures may be paralogues, but it is difficult to tell because common, conserved domains have complicated histories
- Cluster analysis can help sort this out
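One simple form of cluster analysis on all-to-all results is single-linkage clustering: any two proteins joined by a significant hit end up in the same family. The sketch below assumes the hits are already available as (query, subject, E-value) tuples, and the 1e-5 cut-off is an illustrative choice rather than a value from the slides. It does not check domain structure, so multi-domain proteins can still merge unrelated families.

```python
def gene_families(proteins, hits, max_evalue=1e-5):
    """Group proteins into candidate families by single linkage.

    proteins: iterable of protein IDs from one proteome.
    hits: iterable of (query, subject, evalue) from an all-to-all
          search; every ID is assumed to appear in `proteins`.
    Returns a sorted list of families (each a sorted list of IDs);
    proteins with no significant partner are left out.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for query, subject, evalue in hits:
        # Ignore self-hits and weak hits.
        if query != subject and evalue <= max_evalue:
            parent[find(query)] = find(subject)

    families = {}
    for p in parent:
        families.setdefault(find(p), []).append(p)
    return sorted(sorted(m) for m in families.values() if len(m) > 1)


if __name__ == "__main__":
    ids = ["a", "b", "c", "d", "e"]
    toy_hits = [("a", "b", 1e-20), ("b", "c", 1e-10),
                ("d", "e", 0.5), ("a", "a", 0.0)]
    print(gene_families(ids, toy_hits))
```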
18Between-proteome comparisons
- Used to identify orthologues
- Protein alignments involving a search of one protein from species A against the proteome of species B
- High-scoring reciprocal best hits with the same domain structure are most likely orthologues:
- Share common ancestry
- Likely to have the same function
- Function likely to be more essential (replication, etc.)
- Genes are not unique to either organism
- E-value should be <0.01 and the alignment should stretch over >60% of each protein
- High-scoring hits with slightly different domain structures may be homologous, but it is difficult to tell because common, conserved domains have complicated histories
- Cluster analysis can help sort this out
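The reciprocal-best-hit test can be sketched as follows, assuming the intended cut-offs are an E-value below 0.01 and an alignment covering more than 60% of each protein. The `Hit` record and its field names are hypothetical; real search output would have to be parsed into this shape first. Domain structure is not checked here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One pairwise hit (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float
    query_cov: float    # fraction of the query covered by the alignment
    subject_cov: float  # fraction of the subject covered
    score: float


def best_hits(hits, max_evalue=0.01, min_cov=0.6):
    """Best-scoring subject per query among hits passing both filters."""
    best = {}
    for h in hits:
        if h.evalue >= max_evalue:
            continue
        if min(h.query_cov, h.subject_cov) <= min_cov:
            continue
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=0.01, min_cov=0.6):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    ab = best_hits(a_vs_b, max_evalue, min_cov)
    ba = best_hits(b_vs_a, max_evalue, min_cov)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)


if __name__ == "__main__":
    a_vs_b = [Hit("A1", "B1", 1e-30, 0.9, 0.9, 200.0),
              Hit("A2", "B2", 1e-20, 0.7, 0.7, 150.0),
              Hit("A3", "B3", 1e-20, 0.3, 0.9, 150.0)]
    b_vs_a = [Hit("B1", "A1", 1e-30, 0.9, 0.9, 200.0),
              Hit("B2", "A2", 1e-20, 0.7, 0.7, 150.0),
              Hit("B3", "A3", 1e-20, 0.9, 0.3, 150.0)]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data, A3 and B3 are each other's only hit but are rejected because the alignment covers too little of A3.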
19Worm v. fly sequences
20The Virtual Comparative Map (VCMap) Project: goals for annotation of genome sequence
1. Homology analysis
2. EST prediction
3. UniGene/EST scoring
4. Visualization
5. Clickable image map generation
21Rat-Human-Mouse VCMap
Virtual map with the rat backbone in the middle and the human and mouse maps on either side. Red markers in the rat map are RH framework markers. Blue numbers in the human and mouse maps are chromosome numbers. Conserved regions in the rat map are highlighted in the same colors as the corresponding human and mouse maps.
22Query Result for Psma3
23Anchors between 480 cR and 570 cR in rat chr.06
24Predictions between 480 cR and 570 cR in rat chr.06
25VCMap with QTL
Virtual comparative map with the rat-human virtual map (left two), the rat RH map (red), and the QTL map (blue). A phenotype of interest (RF_HS_24HR_CLEAR_CREAT) is highlighted. Following the flanking markers (D17Rat9 and D17Rat16), a conserved region in human chromosome 6 was found.
26Bin Prediction
A 'bin' prediction is made by predicting the
location of the homologous UniGenes of the
backbone species, based upon the mapped borders
(breakpoints) of the homologous region of the
reference (non-backbone) species. The position of
the predicted UniGene is defined by the first
flanking UniGene anchor and its position, and the
second flanking UniGene anchor and its position.
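The bin idea can be illustrated with a small sketch: given anchors that are mapped in both species, a gene mapped only in the reference species is placed in the backbone interval bounded by its two flanking anchors. The tuple layout, function name, and units are assumptions for the example and are not taken from the VCMap software.

```python
from bisect import bisect_left


def bin_prediction(anchors, ref_position):
    """Predict the backbone interval ('bin') for an unmapped homologue.

    anchors: list of (name, ref_pos, backbone_pos) for anchors inside
             ONE conserved region (hypothetical layout and units).
    ref_position: position of the gene in the reference species.
    Returns a dict with the two flanking anchors and the backbone
    interval, or None if the gene is not flanked on both sides.
    """
    ordered = sorted(anchors, key=lambda a: a[1])
    ref_positions = [a[1] for a in ordered]
    i = bisect_left(ref_positions, ref_position)
    if i == 0 or i == len(ordered):
        return None  # outside the anchored region: not a bin case
    first, second = ordered[i - 1], ordered[i]
    # The region may be inverted between species, so order the ends.
    low, high = sorted((first[2], second[2]))
    return {
        "first_anchor": first[0],
        "second_anchor": second[0],
        "backbone_interval": (low, high),
    }


if __name__ == "__main__":
    toy_anchors = [("g1", 10.0, 480.0), ("g2", 20.0, 500.0),
                   ("g3", 30.0, 570.0)]
    print(bin_prediction(toy_anchors, 25.0))
```

A gene lying outside the outermost anchors returns None here.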
27Stream prediction
A 'stream' prediction, either upstream or downstream of the gene, is based upon the breakpoints in the backbone species and the position of the homologous UniGene in the reference (non-backbone) species. Grl and Rn.12583 (red) in the rat backbone are homologous to the syntenic region borders (NR3C1 and Hs.75639) on human Chr. 5. The predicted streams Rn.52607 and Rn.18446 are homologous to Hs.172148 and KIAA0438, which lie downstream/upstream of the nearest syntenic anchors (NR3C1 and Hs.75639). If the prediction is correct, the breakpoints and syntenic region will be extended to the new anchors (Rn.52607/Hs.172148, Rn.18446/KIAA0438).
28What's missing?
What is the gene doing? What systems or pathways is the gene involved in? How is the gene causing a disease?
29Top twenty families from Pfam - the Protein Family Database
Name          Acc. number  Seed  Full   Av. len  Av. id (%)  Structure  Description
GP120         PF00516      24    24877  148 aa   54          1gc1       Envelope glycoprotein GP120
zf-C2H2       PF00096      197   14973  23 aa    35          1zaa       Zinc finger, C2H2 type
LRR           PF00560      3732  12110  23 aa    27          1bnh       Leucine Rich Repeat
rvt           PF00078      179   11488  160 aa   70          1hmv       Rev. transcriptase (DNA polymerase)
cytochrome_b  PF00033      9     10802  151 aa   67          3bcc       Cytochrome b(N-terminal)/b6/petB
rvp           PF00077      53    10526  95 aa    87          1ida       Retroviral aspartyl protease
WD40          PF00400      1930  9117   37 aa    19          1gp2       WD domain, G-beta repeat
ig            PF00047      113   8581   63 aa    19          8fab       Immunoglobulin domain
ank           PF00023      1219  6452   31 aa    26          1awc       Ankyrin repeat
COX1          PF00115      24    6421   228 aa   46          1occ       Cyt. C Quinol oxidase polypeptide I
pkinase       PF00069      67    6334   247 aa   23          1apm       Protein kinase domain
RuBisCO       PF00016      17    6076   290 aa   83          3rub       Ribulose bisphosphate carboxylase
Cytochrom.bC  PF00032      10    6020   89 aa    74          1bcc       Cytochrome b(C-terminal)/b6/petD
RuBisCO_N     PF02788      17    5909   121 aa   86          3rub       Ribulose bisphosphate carboxylase N
PPR           PF01535      563   5679   32 aa    19          -          PPR repeat
oxidored_q1   PF00361      33    5678   218 aa   32          -          NADH-Ubiquinone/plastoquinone
EGF           PF00008      87    5086   33 aa    34          1apo       EGF-like domain
HCV_NS1       PF01560      10    4639   75 aa    47          -          Hepatitis C virus non-structural protein E2/NS1
ABC_tran      PF00005      63    4338   183 aa   26          1b0u       ABC transporter
efhand        PF00036      972   4213   28 aa    24          1osa       EF hand
- There are over 3360 protein families in Pfam
- Pfam is searched by the CDD (Conserved Domain Database)
- Pfam provides information about protein function
30A disease gene involving a metabolic pathway: Phenylketonuria
- Disease is identified, mapped, cloned
- Heritable component
- One gene out of dozens involved with this pathway
- All genes in the pathway are expressed in similar tissues at similar times
- What does this have to do with annotation?
32Metabolic Pathway Reconstruction
- Role assignment
- Extract metabolic pathways from genomes
- Navigation and analysis
- Pathway editing
33Evaluating Gene Prediction Engines
- Sensitivity
- How many exons or genes were correctly identified?
- TP/(TP+FN)
- For exons, it ranges from 0.27 to 0.70
- For genes, it ranges from 0.02 to 0.40
- Specificity
- How many exon/gene predictions are true?
- TP/(TP+FP)
- For exons, it ranges from 0.29 to 0.57
- For genes, it ranges from 0.05 to 0.30
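As a worked example of the two measures, sensitivity is TP/(TP+FN) and specificity, in the gene-prediction sense used here, is TP/(TP+FP). The sketch below scores predicted exons against annotated exons. Requiring both exon boundaries to match exactly is an assumption made for the example, as are the function names.

```python
def sensitivity(tp, fn):
    """Fraction of real features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp, fp):
    """Fraction of predictions that are real: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def evaluate_exons(predicted, annotated):
    """Score predicted exons against annotated exons.

    Both arguments are sets of (start, end) tuples. An exon counts as
    a true positive only if both boundaries match exactly.
    Returns (sensitivity, specificity).
    """
    tp = len(predicted & annotated)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    return sensitivity(tp, fn), specificity(tp, fp)


if __name__ == "__main__":
    predicted = {(1, 100), (200, 300), (400, 500)}
    annotated = {(1, 100), (200, 300), (600, 700), (800, 900)}
    sn, sp = evaluate_exons(predicted, annotated)
    print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

In the toy data, two of four annotated exons are found and two of three predictions are correct.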
34Comparison of 2 knowledge-based gene prediction engines: Grail, GenScan
- GenScan
- Hidden Markov Model (HMM)
- Predicts entire gene structure, not just features
- With any single gene prediction engine,
predictive ability is poor
- Grail
- Neural net
- Predicts gene features
- Codon usage
- Base composition
- Splice site characteristics
- PolyA signals
- Di-, tri-, hexa-nucleotide frequencies
- Translation signals
- Transcription signals
- Size distributions
35MetaGene: Combining the results of several gene prediction engines
These lines make it easier to visually compare the engines and the predicted features. They can be turned on or off.
Prediction accuracy at the nucleotide level, if known features are available
Drag this handle to change threshold
The predicted features
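One way to combine several engines is a per-nucleotide vote with an adjustable threshold. The sketch below is an assumption about how such a combination could work and is not a description of MetaGene itself. Engine names, the interval convention (0-based, end exclusive), and the default threshold are illustrative.

```python
def consensus_regions(length, predictions, threshold=2):
    """Regions predicted as coding by at least `threshold` engines.

    length: sequence length in nucleotides.
    predictions: dict mapping engine name -> list of (start, end)
                 intervals, 0-based with end exclusive.
    Returns a list of merged (start, end) consensus intervals.
    """
    votes = [0] * length
    for intervals in predictions.values():
        covered = set()
        for start, end in intervals:
            covered.update(range(max(start, 0), min(end, length)))
        for pos in covered:  # each engine votes once per position
            votes[pos] += 1

    regions = []
    region_start = None
    for pos, count in enumerate(votes):
        if count >= threshold and region_start is None:
            region_start = pos
        elif count < threshold and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, length))
    return regions


if __name__ == "__main__":
    toy = {
        "grail": [(2, 10)],
        "genscan": [(5, 15)],
        "other_engine": [(8, 12), (17, 19)],  # hypothetical third engine
    }
    for t in (1, 2, 3):
        print(t, consensus_regions(20, toy, threshold=t))
```

Raising the threshold keeps only regions supported by more engines, trading sensitivity for specificity.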
36Structure and transcription of a Eukaryotic gene