Title: Integrative Genomics
1. Integrative Genomics
- Anil Jegga
- Biomedical Informatics
- CCHMC
- Anil.Jegga_at_cchmc.org
- http://anil.cchmc.org
2. Two Separate Worlds...
Medical Informatics
Bioinformatics & the "omes"
PubMed
Proteome
Disease Database
Patient Records
OMIM Clinical Synopsis
Clinical Trials
382 "omes" so far, and there is the UNKNOME too:
genes with no known function (http://omics.org/index.php/Alphabetically_ordered_list_of_omics)
With Some Data Exchange
3. Motivation
To correlate diseases with the anatomical parts
affected, the genes/proteins involved, and the
underlying physiological processes (interactions,
pathways, processes). In other words, to bring
the disciplines of Medical Informatics (MI) and
Bioinformatics (BI) together (Biomedical
Informatics, BMI) to support personalized or
tailor-made medicine.
How can we integrate multiple types of genome-scale
data across experiments and phenotypes in order
to find genes associated with diseases?
4. Model Organism Databases: Common Issues
- Heterogeneous Data Sets - Data Integration
- From Genotype to Phenotype
- Experimental and Consensus Views
- Incorporation of Large Datasets
- Whole genome annotation pipelines
- Large scale mutagenesis/variation projects (dbSNP)
- Computational vs. Literature-based Data Collection and Evaluation (MedLine)
- Data Mining
- extraction of new knowledge
- testable hypotheses (Hypothesis Generation)
5. Support Complex Queries
- Show me all genes involved in brain development that are expressed in the Central Nervous System.
- Show me all genes involved in brain development in human and mouse that also show iron ion binding activity.
- For this set of genes, what aspects of function and/or cellular localization do they share?
- For this set of genes, what mutations are reported to cause pathological conditions?
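The first two queries above can be sketched as filters over annotated gene records. This is only an illustration: the gene IDs, the annotations, and the `query` helper are hypothetical, hand-made stand-ins, not data or an API from any of the databases named in these slides.

```python
# Hypothetical, hand-made annotations for illustration only (not curated data).
GENES = {
    "GENE_A": {"species": "human",
               "process": {"brain development"},
               "function": {"iron ion binding"},
               "expressed_in": {"central nervous system"}},
    "GENE_B": {"species": "mouse",
               "process": {"brain development"},
               "function": {"DNA binding"},
               "expressed_in": {"central nervous system"}},
    "GENE_C": {"species": "human",
               "process": {"heart development"},
               "function": {"iron ion binding"},
               "expressed_in": {"heart"}},
    "GENE_D": {"species": "mouse",
               "process": {"brain development"},
               "function": {"iron ion binding"},
               "expressed_in": {"liver"}},
}


def query(genes, process=None, function=None, tissue=None, species=None):
    """Return sorted gene IDs that match every criterion that is not None."""
    hits = []
    for gene_id, ann in genes.items():
        if process is not None and process not in ann["process"]:
            continue
        if function is not None and function not in ann["function"]:
            continue
        if tissue is not None and tissue not in ann["expressed_in"]:
            continue
        if species is not None and ann["species"] not in species:
            continue
        hits.append(gene_id)
    return sorted(hits)


# Query 1: brain development genes expressed in the central nervous system.
cns_genes = query(GENES, process="brain development",
                  tissue="central nervous system")

# Query 2: brain development genes in human and mouse with iron ion binding.
iron_genes = query(GENES, process="brain development",
                   function="iron ion binding", species={"human", "mouse"})
```

The point of the sketch is that such questions only become simple filters once process, function, expression and species annotations sit in one integrated, consistently named structure.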
6. Bioinformatic Data: 1978 to present
- DNA sequence
- Gene expression
- Protein expression
- Protein Structure
- Genome mapping
- SNPs and Mutations
- Metabolic networks
- Regulatory networks
- Trait mapping
- Gene function analysis
- Scientific literature
- and others..
7. Human Genome Project: Data Deluge
No. of Human Gene Records currently in NCBI:
29,413 (excluding pseudogenes, mitochondrial genes
and obsolete records). Includes 460 microRNAs.
NCBI Human Genome Statistics as of February 12,
2008
8. The Gene Expression Data Deluge
Till 2000: 413 papers on microarrays!
Problems Deluge! Allison DB, Cui X, Page GP,
Sabripour M. 2006. Microarray data analysis: from
disarray to consolidation and consensus. Nat Rev
Genet. 7(1): 55-65.
9. Information Deluge...
- 3 scientific journals in 1750
- Now: >120,000 scientific journals!
- >500,000 medical articles/year
- >4,000,000 scientific articles/year
- >16 million abstracts in PubMed, derived from >32,500 journals
10. Data-driven Problems...
- Generally, the names refer to some feature of the mutant phenotype
- Dickie's small eye (Theiler et al., 1978, Anat Embryol (Berl), 155: 81-86) is now Pax6
- Gleeful: "This gene encodes a C2H2 zinc finger transcription factor with high sequence similarity to vertebrate Gli proteins, so we have named the gene gleeful (Gfl)." (Furlong et al., 2001, Science 293: 1632)
What's in a name!
Rose is a rose is a rose is a rose!
Gene Nomenclature
- Accelerin
- Antiquitin
- Bang Senseless
- Bride of Sevenless
- Christmas Factor
- Cockeye
- Crack
- Draculin
- Dickie's small eye
- Fidgetin
- Gleeful
- Knobhead
- Lunatic Fringe
- Mortalin
- Orphanin
- Profilactin
- Sonic Hedgehog
Disease names
- Mobius Syndrome with Poland's Anomaly
- Werner's syndrome
- Down's syndrome
- Angelman's syndrome
- Creutzfeldt-Jakob disease
11. Rose is a rose is a rose is a rose... Not Really!
What is a cell?
- any small compartment
- (biology) the basic structural and functional unit of all organisms; they may exist as independent units of life (as in monads) or may form colonies or tissues as in higher plants and animals
- a device that delivers an electric current as the result of a chemical reaction
- a small unit serving as part of or as the nucleus of a larger political movement
- cellular telephone: a hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver
- a small room in which a monk or nun lives
- a room where a prisoner is kept
Image sources: somewhere from the internet
12. Foundation Model Explorer
14. The REAL Problems
- COLORECTAL CANCER: 3-BP DEL, SER45DEL
- COLORECTAL CANCER: SER33TYR
- PILOMATRICOMA, SOMATIC: SER33TYR
- HEPATOBLASTOMA, SOMATIC: THR41ALA
- DESMOID TUMOR, SOMATIC: THR41ALA
- PILOMATRICOMA, SOMATIC: ASP32GLY
- OVARIAN CARCINOMA, ENDOMETRIOID TYPE, SOMATIC: SER37CYS
- HEPATOCELLULAR CARCINOMA, SOMATIC: SER45PHE
- HEPATOCELLULAR CARCINOMA, SOMATIC: SER45PRO
- MEDULLOBLASTOMA, SOMATIC: SER33PHE
Many disease states are complex because of the many
genes involved (alleles, ethnicity, gene families, etc.),
environmental effects (lifestyle, exposure,
etc.), and the interactions among them.
15. The REAL Problems
16. Integrative Genomics - what is it? Another
buzzword or a meaningful concept useful for
biomedical research?
Acquisition, Integration, Curation, and Analysis
of biological data
Integrative Genomics: the study of complex
interactions between genes, organism and
environment, the triple helix of biology.
Gene <-> Organism <-> Environment
It is definitely beyond the buzzword stage:
universities now have programs named 'Integrated Genomics.'
17. Methods for Integration
- Link-driven federations
- Explicit links between databanks.
- Warehousing
- Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.
- Others: Semantic Web, etc.
18. Link-driven Federations
- Creates explicit links between databanks
- Query, get interesting results, and use web links to reach related data in other databanks
- Examples: NCBI-Entrez, SRS
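A minimal sketch of the link-following idea, assuming a toy in-memory federation. The databank names, record IDs and the `follow_links` helper are hypothetical and do not reflect the real Entrez or SRS data models; each record simply carries explicit (databank, record ID) links to records held elsewhere.

```python
# Toy federation: three independent "databanks" whose records carry
# explicit cross-links. All IDs and contents are made up for illustration.
DATABANKS = {
    "gene": {
        "G1": {"symbol": "GENE_A",
               "links": [("disease", "D1"), ("literature", "P1")]},
    },
    "disease": {
        "D1": {"name": "example disorder",
               "links": [("literature", "P2")]},
    },
    "literature": {
        "P1": {"title": "Paper about GENE_A", "links": []},
        "P2": {"title": "Paper about example disorder", "links": []},
    },
}


def follow_links(databanks, bank, record_id, target_bank):
    """Return the records in target_bank directly linked from one record."""
    record = databanks[bank][record_id]
    return [databanks[linked_bank][linked_id]
            for linked_bank, linked_id in record["links"]
            if linked_bank == target_bank]


# Start from a gene hit, hop to the linked disease, then to its literature.
diseases = follow_links(DATABANKS, "gene", "G1", "disease")
papers = follow_links(DATABANKS, "disease", "D1", "literature")
```

Nothing is copied between databanks here; the user navigates from hit to hit, which is why this style needs good knowledge of where the links lead.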
19. http://www.ncbi.nlm.nih.gov/Database/datamodel/
24. Querying Entrez-Gene
26. Link-driven Federations
- Advantages
- complex queries
- Fast
- Disadvantages
- require good knowledge
- syntax based
- terminology problem not solved
27. Data Warehousing
Data is downloaded, filtered, integrated and
stored in a warehouse. Answers to queries are
taken from the warehouse.
- Advantages
- Good for very specific, task-based queries and studies.
- Since it is custom-built and usually expert-curated, relatively less error-prone
- Disadvantages
- Can become quickly outdated; needs constant updates.
- Limited functionality, e.g., one disease-based or one system-based.
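The download, filter, integrate and store steps can be sketched with Python's standard `sqlite3` module. This is a sketch under stated assumptions: the table layout, the rows and the helper names `build_warehouse` and `genes_for_disease` are hypothetical, and the "download" is replaced by in-memory lists.

```python
import sqlite3


def build_warehouse(gene_rows, association_rows):
    """Load already-downloaded rows into one local in-memory warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT)")
    conn.execute("CREATE TABLE gene_disease (gene_id TEXT, disease TEXT)")
    conn.executemany("INSERT INTO gene VALUES (?, ?)", gene_rows)
    # Filter step: keep only associations that point at a known gene.
    known_ids = {gene_id for gene_id, _ in gene_rows}
    clean = [row for row in association_rows if row[0] in known_ids]
    conn.executemany("INSERT INTO gene_disease VALUES (?, ?)", clean)
    conn.commit()
    return conn


def genes_for_disease(conn, disease):
    """Answer a query from the warehouse only, without any remote source."""
    rows = conn.execute(
        "SELECT g.symbol FROM gene AS g "
        "JOIN gene_disease AS d ON g.gene_id = d.gene_id "
        "WHERE d.disease = ? ORDER BY g.symbol",
        (disease,),
    )
    return [row[0] for row in rows]


# Made-up rows standing in for two downloaded sources.
gene_rows = [("G1", "GENE_A"), ("G2", "GENE_B")]
association_rows = [("G1", "example disorder"),
                    ("G2", "example disorder"),
                    ("G9", "example disorder")]  # G9 is unknown and filtered out

warehouse = build_warehouse(gene_rows, association_rows)
symbols = genes_for_disease(warehouse, "example disorder")
```

Because answers come from the local copy, they are only as fresh as the last load, which is the "quickly outdated" disadvantage listed above.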
28. http://concise-scanner.cchmc.org
Sequence Context
To identify putative gene targets of
transcription factors
List of Transcription Factor Binding Sites
30. GenomeTrafac Tracks
31. http://polydoms.cchmc.org
32. No Integrative Genomics is Complete without Ontologies
- Unified Medical Language System (UMLS)
33. The 3 Gene Ontologies
- Molecular Function: elemental activity/task
- the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
- What a product does, precise activity
- Biological Process: biological goal or objective
- broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
- Biological objective, accomplished via one or more ordered assemblies of functions
- Cellular Component: location or complex
- subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
- "is located in" ("is a subcomponent of")
http://www.geneontology.org
34. Example Gene Product: hammer
Function (what)                  Process (why)
Drive a nail into wood           Carpentry
Drive a stake into soil          Gardening
Smash a bug                      Pest Control
A performer's juggling object    Entertainment
http://www.geneontology.org
35. GO term associations: Evidence Codes
- ISS: Inferred from sequence or structural similarity
- IDA: Inferred from direct assay
- IPI: Inferred from physical interaction
- TAS: Traceable author statement
- IMP: Inferred from mutant phenotype
- IGI: Inferred from genetic interaction
- IEP: Inferred from expression pattern
- ND: no data available
http://www.geneontology.org
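One practical use of evidence codes is to keep only annotations with the kind of support a study requires. The sketch below assumes annotations are available as (gene, GO term, evidence code) tuples; the gene names, the tuple layout and the `filter_by_evidence` helper are hypothetical. Treating IDA, IPI, IMP, IGI and IEP as the experimentally supported codes follows the usual GO grouping.

```python
# Evidence codes listed on the slide, with their meanings.
EVIDENCE_CODES = {
    "ISS": "Inferred from sequence or structural similarity",
    "IDA": "Inferred from direct assay",
    "IPI": "Inferred from physical interaction",
    "TAS": "Traceable author statement",
    "IMP": "Inferred from mutant phenotype",
    "IGI": "Inferred from genetic interaction",
    "IEP": "Inferred from expression pattern",
    "ND": "No data available",
}

# Codes from the list above that are backed by an experiment.
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}


def filter_by_evidence(annotations, allowed):
    """Keep (gene, term, code) annotations whose code is in `allowed`."""
    unknown = {code for _, _, code in annotations} - set(EVIDENCE_CODES)
    if unknown:
        raise ValueError(f"unknown evidence codes: {sorted(unknown)}")
    return [a for a in annotations if a[2] in allowed]


# Made-up annotations for illustration only.
annotations = [
    ("GENE_A", "DNA repair", "IDA"),
    ("GENE_A", "nucleus", "ISS"),
    ("GENE_B", "carbohydrate binding", "IMP"),
    ("GENE_C", "purine metabolism", "ND"),
]

experimental_only = filter_by_evidence(annotations, EXPERIMENTAL)
```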
36. What can researchers do with GO?
- Access gene product functional information
- Find how much of a proteome is involved in a process/function/component in the cell
- Map GO terms and incorporate manual annotations into own databases
- Provide a link between biological knowledge and
- gene expression profiles
- proteomics data
- Getting the GO and GO_Association Files
- Data Mining
- My Favorite Gene
- By GO
- By Sequence
- Analysis of Data
- Clustering by function/process
- Other Tools
And how?
37. http://www.geneontology.org/
Gene list enrichment analysis tools (DAVID,
FatiGO, ToppGene)
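Gene list enrichment asks whether an annotation term appears in a gene list more often than chance would predict. A minimal sketch using a one-sided hypergeometric test is given below; this is a common choice for the task, but it is an assumption here that it illustrates the idea, and the tools named above may use different or modified statistics. The function name and the numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a list of n
    genes drawn without replacement from N genes, K of which carry the term.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= n):
        raise ValueError("need 0 <= K <= N, 0 <= n <= N and 0 <= k <= n")
    upper = min(n, K)  # the overlap cannot exceed either set size
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Made-up numbers: 20,000 background genes, 200 annotated with a term,
# a list of 50 genes of which 5 carry the term.
p_value = hypergeom_enrichment_p(k=5, n=50, K=200, N=20000)
```

In practice many terms are tested at once, so the resulting p-values would also need a multiple-testing correction, which this sketch leaves out.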
38. Open biomedical ontologies
http://obo.sourceforge.net/
39. Unified Medical Language System Knowledge Server (UMLSKS)
http://umlsks.nlm.nih.gov/kss/
- The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.
- The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain.
- The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word.
40. Unified Medical Language System: Metathesaurus
- More than 1 million biomedical concepts
- About 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases and expert systems.
- The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together.
- Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.
- Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus.
- Uses
- linking between different clinical or biomedical vocabularies
- information retrieval from databases with human assigned subject index terms and from free-text information sources
- linking patient records to related information in bibliographic, full-text, or factual databases
- natural language processing and automated indexing research
41. UMLSKS Semantic Network
- Complexity reduced by grouping concepts according to the semantic types that have been assigned to them.
- There are currently 15 semantic groups that provide a partition of the UMLS Metathesaurus for 99.5% of the concepts.
Group  Group name                    Example semantic type
ACTI   Activities & Behaviors        T053 Behavior
ANAT   Anatomy                       T024 Tissue
CHEM   Chemicals & Drugs             T195 Antibiotic
CONC   Concepts & Ideas              T170 Intellectual Product
DEVI   Devices                       T074 Medical Device
DISO   Disorders                     T047 Disease or Syndrome
GENE   Genes & Molecular Sequences   T085 Molecular Sequence
GEOG   Geographic Areas              T083 Geographic Area
LIVB   Living Beings                 T005 Virus
OBJC   Objects                       T073 Manufactured Object
OCCU   Occupations                   T091 Biomedical Occupation or Discipline
ORGA   Organizations                 T093 Health Care Related Organization
PHEN   Phenomena                     T038 Biologic Function
PHYS   Physiology                    T040 Organism Function
PROC   Procedures                    T061 Therapeutic or Preventive Procedure
42. UMLSKS Semantic Navigator
43. Disease Gene Identification and Prioritization
Hypothesis: The majority of genes that impact or
cause disease share membership in any of several
functional relationships, OR functionally similar
or related genes cause similar phenotypes.
- Functional Similarity: Common/shared
- Gene Ontology term
- Pathway
- Phenotype
- Chromosomal location
- Expression
- Cis regulatory elements (Transcription factor binding sites)
- miRNA regulators
- Interactions
- Other features...
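"Common/shared" features can be turned into a number by measuring the overlap of two genes' annotation sets, one feature type at a time. The sketch below uses the Jaccard index for this; that choice, the feature names and the example annotations are assumptions made for illustration, not the scoring scheme used by any tool in these slides.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def functional_similarity(gene_x, gene_y,
                          features=("go", "pathway", "phenotype")):
    """Per-feature Jaccard overlap of two annotated genes, plus the mean."""
    scores = {f: jaccard(gene_x.get(f, ()), gene_y.get(f, ()))
              for f in features}
    scores["mean"] = sum(scores[f] for f in features) / len(features)
    return scores


# Made-up annotations for two hypothetical genes.
gene_x = {"go": {"DNA repair", "nucleus"}, "pathway": {"pathway_1"}}
gene_y = {"go": {"nucleus", "telomere"}, "pathway": {"pathway_1"},
          "phenotype": {"small eye"}}

similarity = functional_similarity(gene_x, gene_y)
```

Missing annotations count as empty sets here, so sparsely annotated genes score low regardless of their true function.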
44. Background, Problems & Issues
- Most of the common diseases are multi-factorial and modified by genetically and mechanistically complex polygenic interactions and environmental factors.
- High-throughput genome-wide studies, like linkage analysis and gene expression profiling, tend to be most useful for classification and characterization but do not provide sufficient information to identify or prioritize specific disease causal genes.
45. Background, Problems & Issues
- Since multiple genes are associated with the same or similar disease phenotypes, it is reasonable to expect the underlying genes to be functionally related.
- Such functional relatedness (common pathway, interaction, biological process, etc.) can be exploited to aid in the finding of novel disease genes. For example, genetically heterogeneous hereditary diseases such as Hermansky-Pudlak syndrome and Fanconi anaemia have been shown to be caused by mutations in different interacting proteins.
46. PPI - Predicting Disease Genes
- Direct protein-protein interactions (PPI) are one of the strongest manifestations of a functional relation between genes.
- Hypothesis: Interacting proteins lead to the same or similar disease phenotypes when mutated.
- Several genetically heterogeneous hereditary diseases have been shown to be caused by mutations in different interacting proteins, for example Hermansky-Pudlak syndrome and Fanconi anaemia. Hence, protein-protein interactions might in principle be used to identify potentially interesting disease gene candidates.
47. Prioritize candidate genes among the interacting partners of the disease-related genes
- Training sets: disease-related genes
- Test sets: interacting partners of the training genes
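The training and test sets can be sketched as follows: collect the interacting partners of the known disease genes and rank them. Ranking by the number of distinct training genes a candidate interacts with is a deliberately simple stand-in chosen for illustration; it is not the scoring method used by ToppGene. The gene names and the `rank_partners` helper are hypothetical.

```python
from collections import Counter


def rank_partners(interactions, training_genes):
    """Rank non-training genes by how many training genes they interact with."""
    training = set(training_genes)
    # Normalise each pair so (a, b) and (b, a) are counted only once.
    pairs = {tuple(sorted(pair)) for pair in interactions}
    counts = Counter()
    for a, b in pairs:
        if a in training and b not in training:
            counts[b] += 1
        elif b in training and a not in training:
            counts[a] += 1
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up interaction list: T1 and T2 are known disease genes (training set).
interactions = [("T1", "C1"), ("C1", "T2"), ("T1", "C2"),
                ("T1", "T2"), ("C2", "C3"), ("C1", "T1")]

ranking = rank_partners(interactions, training_genes={"T1", "T2"})
```

Genes such as C3 that only touch other candidates never enter the test set, which mirrors the restriction to direct interacting partners.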
48. [Figure] 15, 342, 2469
49. ToppGene General Schema
http://toppgene.cchmc.org
50. ToppGene - Data Sources
- Gene Ontology: GO and NCBI Entrez Gene
- Mouse Phenotype: MGI (used for the first time for human disease gene prioritization)
- Pathways: KEGG, BioCarta, BioCyc, Reactome, GenMAPP, MSigDB
- Domains: UniProt (Pfam, InterPro, etc.)
- Interactions: NCBI Entrez Gene (BioGRID, Reactome, BIND, HPRD, etc.)
- PubMed IDs: NCBI Entrez Gene
- Expression: GEO
- Cytoband: MSigDB
- Cis-Elements: MSigDB
- miRNA Targets: MSigDB
New features added
51. ToppGene web server (http://toppgene.cchmc.org)
- For functional enrichment analysis
55. ToppGene web server (http://toppgene.cchmc.org)
- For candidate gene prioritization
58. Limitations
- General limitations of any training-test strategy
- Prior knowledge of disease-gene associations.
- Assumption that the disease genes yet to be discovered will be consistent with what is already known about a disease.
- Depends on the accuracy and completeness of the functional annotations.
- Only one-fifth of the known human genes have pathway or phenotype annotations, and there are still more than 40% of genes whose functions are not defined!
Chen et al., 2007, BMC Bioinformatics
59. And the Benefits of Integrative Genomics...
- To unravel the connection between genotype and phenotype
- Systematically identify novel phenotype-genotype relationships.
- Encourage standard terminology usage.
- Saves time for researchers.
- Hypothesis generator.
- Paves the way for prognosis, diagnosis, and personalized medicine (adverse drug reactions, etc.).
- Deeper understanding of disease and an enhanced integration of medicine with biology.
- Increasing knowledge of the genes associated with diseases will allow researchers to address more complicated issues, including the relative contributions to disease of genes in the core biological set shared by all species and those encoding proteins specific to humans; how sequence features (such as conservation and polymorphism) relate to disease characteristics; and how protein function relates to the outcome of clinical treatment
- And MANY MORE...
60. Take-home messages
- Networks and integration of databases are keys to success in Bioinformatics.
- Combining computation and data integration into a single cohesive whole will increase the efficiency of the research effort
- by reducing the serendipitous, hit-and-miss nature of empirical research, and
- by providing valuable clues to biomedical researchers on their choice of experiments, given the limitations of funds, manpower and time.
- Users have to know what is available, how to access it (and what the limitations are), and how to use the resources they are offered.
61. PubMed
OMIM
62. Acknowledgement
- Jing Chen
- Siva Gowrisankar
- Vivek Kaimal
- Mrunal Deshmukh
- Huan Xu
63. Thank You!
http://sbw.kgi.edu/