Integrative Genomics - PowerPoint PPT Presentation

About This Presentation
Title:

Integrative Genomics

Slides: 65
Provided by: anilC1
Transcript and Presenter's Notes

Title: Integrative Genomics


1
Integrative Genomics
  • Anil Jegga
  • Biomedical Informatics
  • CCHMC
  • Anil.Jegga_at_cchmc.org
  • http://anil.cchmc.org

2
Two Separate Worlds..
Medical Informatics
Bioinformatics: the "omes"
PubMed
Proteome
Disease Database
Patient Records
OMIM Clinical Synopsis
Clinical Trials
382 "omes" so far, and there is the UNKNOME too
(genes with no known function):
http://omics.org/index.php/Alphabetically_ordered_list_of_omics
With Some Data Exchange
3
Motivation
To correlate diseases with anatomical parts
affected, the genes/proteins involved, and the
underlying physiological processes (interactions,
pathways, processes). In other words, bringing
the disciplines of Medical Informatics (MI) and
BioInformatics (BI) together (Biomedical
Informatics - BMI) to support personalized or
tailor-made medicine.
How can we integrate multiple types of genome-scale
data across experiments and phenotypes in order
to find genes associated with diseases?
4
Model Organism Databases: Common Issues
  • Heterogeneous Data Sets - Data Integration
  • From Genotype to Phenotype
  • Experimental and Consensus Views
  • Incorporation of Large Datasets
    • Whole genome annotation pipelines
    • Large scale mutagenesis/variation projects
      (dbSNP)
  • Computational vs. Literature-based Data
    Collection and Evaluation (MedLine)
  • Data Mining
    • extraction of new knowledge
    • testable hypotheses (Hypothesis Generation)

5
Support Complex Queries
  • Show me all genes involved in brain development
    that are expressed in the Central Nervous System.
  • Show me all genes involved in brain development
    in human and mouse that also show iron ion
    binding activity.
  • For this set of genes, what aspects of function
    and/or cellular localization do they share?
  • For this set of genes, what mutations are
    reported to cause pathological conditions?
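
The queries above can be sketched as filters over an annotation table. This is only an illustration under stated assumptions: the gene symbols (GENE_A and so on), the record layout, and the helper names find_genes and shared_functions are hypothetical and do not come from any real database.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class GeneRecord:
    symbol: str
    species: str
    processes: FrozenSet[str]     # Biological Process annotations
    functions: FrozenSet[str]     # Molecular Function annotations
    expressed_in: FrozenSet[str]  # tissues with reported expression


# Hypothetical toy data; the symbols are placeholders, not real genes.
GENES = [
    GeneRecord("GENE_A", "human", frozenset({"brain development"}),
               frozenset({"iron ion binding"}),
               frozenset({"central nervous system"})),
    GeneRecord("GENE_B", "mouse", frozenset({"brain development"}),
               frozenset({"DNA binding"}),
               frozenset({"central nervous system", "liver"})),
    GeneRecord("GENE_C", "human", frozenset({"purine metabolism"}),
               frozenset({"iron ion binding"}),
               frozenset({"liver"})),
]


def find_genes(genes, process=None, function=None, tissue=None, species=None):
    """Return symbols of genes matching every criterion that is not None.

    `species` is a set of species names, e.g. {"human", "mouse"}.
    """
    hits = []
    for g in genes:
        if process is not None and process not in g.processes:
            continue
        if function is not None and function not in g.functions:
            continue
        if tissue is not None and tissue not in g.expressed_in:
            continue
        if species is not None and g.species not in species:
            continue
        hits.append(g.symbol)
    return hits


def shared_functions(genes, symbols):
    """Molecular functions common to every gene in the given set of symbols."""
    selected = [g.functions for g in genes if g.symbol in symbols]
    return frozenset.intersection(*selected) if selected else frozenset()
```

For example, find_genes(GENES, process="brain development", tissue="central nervous system") corresponds to the first query, and shared_functions answers the "what do they share?" question for a chosen gene set.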



6
Bioinformatic Data: 1978 to Present
  • DNA sequence
  • Gene expression
  • Protein expression
  • Protein Structure
  • Genome mapping
  • SNPs & Mutations
  • Metabolic networks
  • Regulatory networks
  • Trait mapping
  • Gene function analysis
  • Scientific literature
  • and others..

7
Human Genome Project - Data Deluge
Number of human gene records currently in NCBI:
29,413 (excluding pseudogenes, mitochondrial genes,
and obsolete records). Includes 460 microRNAs.
Source: NCBI Human Genome Statistics, as of
February 12, 2008
8
The Gene Expression Data Deluge
Until 2000: 413 papers on microarrays!
Problems Deluge! Allison DB, Cui X, Page GP,
Sabripour M. 2006. Microarray data analysis: from
disarray to consolidation and consensus. Nat Rev
Genet. 7(1): 55-65.
9
Information Deluge..
  • 3 scientific journals in 1750
  • Now - >120,000 scientific journals!
  • >500,000 medical articles/year
  • >4,000,000 scientific articles/year
  • >16 million abstracts in PubMed derived from
    >32,500 journals

10
Data-driven Problems..
  • Generally, the names refer to some feature of the
    mutant phenotype
  • Dickie's small eye (Theiler et al., 1978, Anat
    Embryol (Berl), 155: 81-86) is now Pax6
  • Gleeful: "This gene encodes a C2H2 zinc finger
    transcription factor with high sequence
    similarity to vertebrate Gli proteins, so we have
    named the gene gleeful (Gfl)." (Furlong et al.,
    2001, Science 293: 1632)

What's in a name!
Rose is a rose is a rose is a rose!
Gene Nomenclature
  • Disease names
    • Mobius Syndrome with Poland's Anomaly
    • Werner's syndrome
    • Down's syndrome
    • Angelman's syndrome
    • Creutzfeldt-Jakob disease
  • Gene names
    • Accelerin
    • Antiquitin
    • Bang Senseless
    • Bride of Sevenless
    • Christmas Factor
    • Cockeye
    • Crack
    • Draculin
    • Dickie's small eye
    • Fidgetin
    • Gleeful
    • Knobhead
    • Lunatic Fringe
    • Mortalin
    • Orphanin
    • Profilactin
    • Sonic Hedgehog

11
Rose is a rose is a rose is a rose.. Not Really!
What is a cell?
  • any small compartment
  • (biology) the basic structural and functional
    unit of all organisms; cells may exist as
    independent units of life (as in monads) or may
    form colonies or tissues as in higher plants and
    animals
  • a device that delivers an electric current as the
    result of a chemical reaction
  • a small unit serving as part of or as the nucleus
    of a larger political movement
  • cellular telephone: a hand-held mobile
    radiotelephone for use in an area divided into
    small sections, each with its own short-range
    transmitter/receiver
  • a small room in which a monk or nun lives
  • a room where a prisoner is kept

Image sources: somewhere on the internet
12
Foundation Model Explorer
13
(No Transcript)
14
  • COLORECTAL CANCER 3-BP DEL, SER45DEL
  • COLORECTAL CANCER SER33TYR
  • PILOMATRICOMA, SOMATIC SER33TYR
  • HEPATOBLASTOMA, SOMATIC THR41ALA
  • DESMOID TUMOR, SOMATIC THR41ALA
  • PILOMATRICOMA, SOMATIC ASP32GLY
  • OVARIAN CARCINOMA, ENDOMETRIOID TYPE, SOMATIC
    SER37CYS
  • HEPATOCELLULAR CARCINOMA SOMATIC SER45PHE
  • HEPATOCELLULAR CARCINOMA SOMATIC SER45PRO
  • MEDULLOBLASTOMA, SOMATIC SER33PHE

The REAL Problems
Many disease states are complex because they
involve many genes (alleles, ethnicity, gene
families, etc.), environmental effects (lifestyle,
exposure, etc.), and the interactions among them.
15
The REAL Problems
16
Integrative Genomics - what is it? Another
buzzword or a meaningful concept useful for
biomedical research?
Acquisition, Integration, Curation, and Analysis
of biological data
Integrative Genomics: the study of complex
interactions between genes, organism and
environment, the triple helix of biology.
Gene <-> Organism <-> Environment
It is definitely beyond the buzzword stage:
universities now have programs named
'Integrated Genomics.'
17
Methods for Integration
  • Link-driven federations
    • Explicit links between databanks.
  • Warehousing
    • Data is downloaded, filtered, integrated and
      stored in a warehouse. Answers to queries are
      taken from the warehouse.
  • Others: Semantic Web, etc.

18
Link-driven Federations
  • Creates explicit links between databanks
  • Query, get interesting results, and use web links
    to reach related data in other databanks
  • Examples: NCBI Entrez, SRS

19
http://www.ncbi.nlm.nih.gov/Database/datamodel/
20
http://www.ncbi.nlm.nih.gov/Database/datamodel/
21
http://www.ncbi.nlm.nih.gov/Database/datamodel/
22
http://www.ncbi.nlm.nih.gov/Database/datamodel/
23
http://www.ncbi.nlm.nih.gov/Database/datamodel/
24
Querying Entrez-Gene
25
(No Transcript)
26
Link-driven Federations
  • Advantages
    • Complex queries
    • Fast
  • Disadvantages
    • Requires good knowledge
    • Syntax based
    • Terminology problem not solved

27
Data Warehousing
Data is downloaded, filtered, integrated and
stored in a warehouse. Answers to queries are
taken from the warehouse.
  • Advantages
    • Good for very specific, task-based queries and
      studies.
    • Since it is custom-built and usually
      expert-curated, relatively less error-prone.
  • Disadvantages
    • Can quickly become outdated; needs constant
      updates.
    • Limited functionality, e.g., restricted to one
      disease or one system.
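
A minimal sketch of the warehousing idea (download, filter, integrate, store, then answer queries locally), using Python's built-in sqlite3 module. The table names, gene IDs, symbols, and disease labels are hypothetical placeholders standing in for data downloaded from two separate databanks.

```python
import sqlite3

# Hypothetical rows standing in for files downloaded from two databanks.
GENE_ROWS = [("G1", "GENE_A"), ("G2", "GENE_B"), ("G3", "GENE_C")]
DISEASE_ROWS = [("G1", "disease X"), ("G2", "disease X"),
                ("G2", "disease Y"), ("G9", "disease Z")]


def build_warehouse(gene_rows, disease_rows):
    """Load both sources into one local (in-memory) warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT)")
    conn.execute("CREATE TABLE gene_disease (gene_id TEXT, disease TEXT)")
    conn.executemany("INSERT INTO gene VALUES (?, ?)", gene_rows)
    # Filter step: drop associations that refer to genes we do not hold.
    known_ids = {gene_id for gene_id, _ in gene_rows}
    kept = [row for row in disease_rows if row[0] in known_ids]
    conn.executemany("INSERT INTO gene_disease VALUES (?, ?)", kept)
    conn.commit()
    return conn


def genes_for_disease(conn, disease):
    """Answer a query from the warehouse only, without contacting the sources."""
    rows = conn.execute(
        "SELECT g.symbol FROM gene AS g "
        "JOIN gene_disease AS d ON g.gene_id = d.gene_id "
        "WHERE d.disease = ? ORDER BY g.symbol",
        (disease,),
    )
    return [row[0] for row in rows]
```

The sketch also shows the main disadvantage: the warehouse only knows what was loaded at build time, so it has to be rebuilt whenever the sources change.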

28
http://concise-scanner.cchmc.org
Sequence Context
To identify putative gene targets of
transcription factors
List of Transcription Factor Binding Sites
29
(No Transcript)
30
GenomeTrafac Tracks
31
http://polydoms.cchmc.org
32
No Integrative Genomics is Complete without
Ontologies
  • Gene Ontology (GO)
  • Unified Medical Language System (UMLS)

33
The 3 Gene Ontologies
  • Molecular Function: elemental activity/task
    • the tasks performed by individual gene products;
      examples are carbohydrate binding and ATPase
      activity
    • What a product does, precise activity
  • Biological Process: biological goal or objective
    • broad biological goals, such as DNA repair or
      purine metabolism, that are accomplished by
      ordered assemblies of molecular functions
    • Biological objective, accomplished via one or
      more ordered assemblies of functions
  • Cellular Component: location or complex
    • subcellular structures, locations, and
      macromolecular complexes; examples include
      nucleus, telomere, and RNA polymerase II
      holoenzyme
    • "is located in" ("is a subcomponent of")

http://www.geneontology.org
34
Example Gene Product: hammer
Function (what)                  Process (why)
Drive a nail into wood           Carpentry
Drive a stake into soil          Gardening
Smash a bug                      Pest Control
A performer's juggling object    Entertainment
http://www.geneontology.org
35
GO Term Associations: Evidence Codes
  • ISS: Inferred from sequence or structural
    similarity
  • IDA: Inferred from direct assay
  • IPI: Inferred from physical interaction
  • TAS: Traceable author statement
  • IMP: Inferred from mutant phenotype
  • IGI: Inferred from genetic interaction
  • IEP: Inferred from expression pattern
  • ND: No data available

http://www.geneontology.org
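
A small sketch of how evidence codes can be used to filter GO annotations. The annotation tuples and gene symbols are hypothetical. Treating IDA, IPI, IMP, IGI, and IEP as the "experimental" subset follows the Gene Ontology's own grouping of evidence codes; the filtering policy itself is an assumption made for illustration.

```python
# Evidence codes listed on the slide.
EVIDENCE_CODES = {
    "ISS": "Inferred from sequence or structural similarity",
    "IDA": "Inferred from direct assay",
    "IPI": "Inferred from physical interaction",
    "TAS": "Traceable author statement",
    "IMP": "Inferred from mutant phenotype",
    "IGI": "Inferred from genetic interaction",
    "IEP": "Inferred from expression pattern",
    "ND": "No data available",
}

# Codes backed by a direct experiment (assumed filtering policy).
EXPERIMENTAL = frozenset({"IDA", "IPI", "IMP", "IGI", "IEP"})


def filter_annotations(annotations, allowed=EXPERIMENTAL):
    """Keep (gene, GO term, evidence code) tuples whose code is allowed.

    Raises ValueError if an annotation carries a code not in EVIDENCE_CODES.
    """
    kept = []
    for gene, term, code in annotations:
        if code not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {code!r}")
        if code in allowed:
            kept.append((gene, term, code))
    return kept


# Hypothetical annotations; gene symbols are placeholders.
ANNOTATIONS = [
    ("GENE_A", "carbohydrate binding", "IDA"),
    ("GENE_A", "DNA repair", "ISS"),
    ("GENE_B", "ATPase activity", "IMP"),
    ("GENE_C", "nucleus", "ND"),
]
```

Calling filter_annotations(ANNOTATIONS) keeps only the IDA and IMP rows, which is one way to restrict an analysis to experimentally supported annotations.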
36
What can researchers do with GO?
  • Access gene product functional information
  • Find how much of a proteome is involved in a
    process/function/component in the cell
  • Map GO terms and incorporate manual annotations
    into own databases
  • Provide a link between biological knowledge and
    • gene expression profiles
    • proteomics data

And how?
  • Getting the GO and GO_Association Files
  • Data Mining
    • My Favorite Gene
    • By GO
    • By Sequence
  • Analysis of Data
    • Clustering by function/process
  • Other Tools
37
http://www.geneontology.org/
Gene list enrichment analysis tools (DAVID,
FatiGO, ToppGene)
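
Gene list enrichment analysis asks whether a GO term appears in a gene list more often than expected by chance. A common way to score this is a one-sided hypergeometric test, sketched below with the standard library only. This is a generic illustration; the tools named above may use different or modified statistics, and the function name and parameters are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, overlap):
    """P(X >= overlap) for a hypergeometric draw without replacement.

    population: number of genes in the background set
    annotated:  background genes carrying the term
    sample:     size of the gene list being tested
    overlap:    genes in the list that carry the term
    """
    if not 0 <= annotated <= population:
        raise ValueError("annotated must be between 0 and population")
    if not 0 <= sample <= population:
        raise ValueError("sample must be between 0 and population")
    upper = min(annotated, sample)
    if overlap > upper:
        return 0.0
    total = comb(population, sample)
    # Sum the probability of every outcome at least as extreme as observed.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(max(overlap, 0), upper + 1)
    )
    return tail / total
```

A small p-value suggests the term is over-represented in the list. When many terms are tested at once, a multiple-testing correction is normally applied as well; that step is not shown here.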
38
Open biomedical ontologies
http://obo.sourceforge.net/
39
Unified Medical Language System Knowledge Server
(UMLSKS): http://umlsks.nlm.nih.gov/kss/
  • The UMLS Metathesaurus contains information about
    biomedical concepts and terms from many
    controlled vocabularies and classifications used
    in patient records, administrative health data,
    bibliographic and full-text databases, and expert
    systems.
  • The Semantic Network, through its semantic types,
    provides a consistent categorization of all
    concepts represented in the UMLS Metathesaurus.
    The links between the semantic types provide the
    structure for the Network and represent important
    relationships in the biomedical domain.
  • The SPECIALIST Lexicon is an English language
    lexicon with many biomedical terms, containing
    syntactic, morphological, and orthographic
    information for each term or word.

40
Unified Medical Language System: Metathesaurus
  • Over 1 million biomedical concepts
  • About 5 million concept names from more than 100
    controlled vocabularies and classifications (some
    in multiple languages) used in patient records,
    administrative health data, bibliographic and
    full-text databases and expert systems.
  • The Metathesaurus is organized by concept or
    meaning. Alternate names for the same concept
    (synonyms, lexical variants, and translations)
    are linked together.
  • Each Metathesaurus concept has attributes that
    help to define its meaning, e.g., the semantic
    type(s) or categories to which it belongs, its
    position in the hierarchical contexts from
    various source vocabularies, and, for many
    concepts, a definition.
  • Customizable: Users can exclude vocabularies that
    are not relevant for specific purposes or not
    licensed for use in their institutions.
    MetamorphoSys, the multi-platform Java install
    and customization program distributed with the
    UMLS resources, helps users to generate
    pre-defined or custom subsets of the
    Metathesaurus.
  • Uses
    • linking between different clinical or biomedical
      vocabularies
    • information retrieval from databases with
      human-assigned subject index terms and from
      free-text information sources
    • linking patient records to related information in
      bibliographic, full-text, or factual databases
    • natural language processing and automated
      indexing research
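
A toy sketch of the "organized by concept" idea: alternate names are linked to one concept identifier, so two differently spelled names can be recognized as the same thing. The concept IDs, the normalization rules, and the function names are hypothetical; the real Metathesaurus uses its own identifiers and far richer lexical matching.

```python
import re


def normalize(name):
    """Lowercase, drop apostrophes, and collapse punctuation and whitespace."""
    name = name.lower().replace("'", "")
    return re.sub(r"[^a-z0-9]+", " ", name).strip()


# Hypothetical concept table: one ID per meaning, many names per ID.
CONCEPTS = {
    "CONCEPT_1": ["Pax6", "Dickie's small eye"],
    "CONCEPT_2": ["Down syndrome", "Down's syndrome", "Trisomy 21"],
}


def build_index(concepts):
    """Map every normalized name to its concept ID."""
    index = {}
    for concept_id, names in concepts.items():
        for name in names:
            index[normalize(name)] = concept_id
    return index


def same_concept(index, name_a, name_b):
    """True if both names are known and resolve to the same concept."""
    id_a = index.get(normalize(name_a))
    id_b = index.get(normalize(name_b))
    return id_a is not None and id_a == id_b
```

With index = build_index(CONCEPTS), same_concept(index, "Dickies small eye", "PAX6") resolves both names to the same concept despite the different spelling and case.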

41
UMLSKS Semantic Network
  • Complexity reduced by grouping concepts according
    to the semantic types that have been assigned to
    them.
  • There are currently 15 semantic groups that
    provide a partition of the UMLS Metathesaurus for
    99.5% of the concepts.

Group|Group name|Example type ID|Example type name
ACTI|Activities & Behaviors|T053|Behavior
ANAT|Anatomy|T024|Tissue
CHEM|Chemicals & Drugs|T195|Antibiotic
CONC|Concepts & Ideas|T170|Intellectual Product
DEVI|Devices|T074|Medical Device
DISO|Disorders|T047|Disease or Syndrome
GENE|Genes & Molecular Sequences|T085|Molecular Sequence
GEOG|Geographic Areas|T083|Geographic Area
LIVB|Living Beings|T005|Virus
OBJC|Objects|T073|Manufactured Object
OCCU|Occupations|T091|Biomedical Occupation or Discipline
ORGA|Organizations|T093|Health Care Related Organization
PHEN|Phenomena|T038|Biologic Function
PHYS|Physiology|T040|Organism Function
PROC|Procedures|T061|Therapeutic or Preventive Procedure
42
UMLSKS Semantic Navigator
43
Disease Gene Identification and Prioritization
Hypothesis: The majority of genes that impact or
cause disease share membership in any of several
functional relationships, OR functionally similar
or related genes cause similar phenotypes.
  • Functional Similarity: common/shared
    • Gene Ontology term
    • Pathway
    • Phenotype
    • Chromosomal location
    • Expression
    • Cis-regulatory elements (transcription factor
      binding sites)
    • miRNA regulators
    • Interactions
    • Other features..

44
Background, Problems & Issues
  • Most of the common diseases are multi-factorial
    and modified by genetically and mechanistically
    complex polygenic interactions and environmental
    factors.
  • High-throughput genome-wide studies, like linkage
    analysis and gene expression profiling, tend to
    be most useful for classification and
    characterization but do not provide sufficient
    information to identify or prioritize specific
    disease-causing genes.

45
Background, Problems & Issues
  • Since multiple genes are associated with same or
    similar disease phenotypes, it is reasonable to
    expect the underlying genes to be functionally
    related.
  • Such functional relatedness (common pathway,
    interaction, biological process, etc.) can be
    exploited to aid in finding novel disease
    genes. For example, genetically heterogeneous
    hereditary diseases such as Hermansky-Pudlak
    syndrome and Fanconi anaemia have been shown to
    be caused by mutations in different interacting
    proteins.

46
PPI - Predicting Disease Genes
  • Direct protein-protein interactions (PPI) are one
    of the strongest manifestations of a functional
    relation between genes.
  • Hypothesis: Interacting proteins lead to the same
    or similar disease phenotypes when mutated.
  • Several genetically heterogeneous hereditary
    diseases have been shown to be caused by
    mutations in different interacting proteins, for
    example Hermansky-Pudlak syndrome and Fanconi
    anaemia. Hence, protein-protein interactions
    might in principle be used to identify
    potentially interesting disease gene candidates.

47
  • Prioritize candidate genes among the interacting
    partners of the disease-related genes
  • Training sets: disease-related genes
  • Test sets: interacting partners of the training
    genes
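
The training/test split above can be sketched as follows: collect the interacting partners of known disease genes and rank them. Ranking by the number of distinct disease genes a candidate interacts with is a deliberately simple stand-in chosen for illustration; it is not the scoring method used by ToppGene. The gene labels (D1, C1, and so on) are hypothetical.

```python
def rank_candidates(interactions, training_genes):
    """Rank interacting partners of the training (disease) genes.

    interactions:   iterable of (gene, gene) pairs, treated as undirected
    training_genes: known disease-related genes
    Returns (candidate, number of distinct training partners) pairs, highest
    count first, with ties broken alphabetically.
    """
    seeds = set(training_genes)
    partners = {}  # candidate -> set of training genes it interacts with
    for a, b in interactions:
        for gene, other in ((a, b), (b, a)):
            if gene in seeds and other not in seeds:
                partners.setdefault(other, set()).add(gene)
    scored = ((candidate, len(linked)) for candidate, linked in partners.items())
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical interaction list: D* are training genes, C* are candidates.
INTERACTIONS = [
    ("D1", "C1"), ("D2", "C1"), ("D1", "C2"),
    ("C2", "C3"), ("D1", "D2"), ("C1", "D1"),
]
```

Here rank_candidates(INTERACTIONS, {"D1", "D2"}) places C1 ahead of C2, while C3 is never considered because it has no direct interaction with a training gene.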

48
  • Example: Breast cancer

[Figure values: 15, 342, 2469]
49
ToppGene General Schema
http://toppgene.cchmc.org
50
TOPPGene - Data Sources
  • Gene Ontology: GO and NCBI Entrez Gene
  • Mouse Phenotype: MGI (used for the first time for
    human disease gene prioritization)
  • Pathways: KEGG, BioCarta, BioCyc, Reactome,
    GenMAPP, MSigDB
  • Domains: UniProt (Pfam, InterPro, etc.)
  • Interactions: NCBI Entrez Gene (BioGRID,
    Reactome, BIND, HPRD, etc.)
  • PubMed IDs: NCBI Entrez Gene
  • Expression: GEO
  • Cytoband: MSigDB
  • Cis-Elements: MSigDB
  • miRNA Targets: MSigDB

New features added
51
  • ToppGene web server (http://toppgene.cchmc.org)
  • For functional enrichment analysis

52
  • ToppGene web server (http://toppgene.cchmc.org)
  • For functional enrichment analysis

53
  • ToppGene web server (http://toppgene.cchmc.org)
  • For functional enrichment analysis

54
  • ToppGene web server (http://toppgene.cchmc.org)
  • For functional enrichment analysis

55
  • ToppGene web server (http://toppgene.cchmc.org)
  • For candidate gene prioritization

56
  • ToppGene web server (http://toppgene.cchmc.org)
  • For candidate gene prioritization

57
  • ToppGene web server (http://toppgene.cchmc.org)
  • For candidate gene prioritization

58
Limitations
  • General limitations of any training-test
    strategy
    • Prior knowledge of disease-gene associations.
    • Assumption that the disease genes yet to be
      discovered will be consistent with what is
      already known about a disease.
  • Depends on the accuracy and completeness of the
    functional annotations.
  • Only one-fifth of the known human genes have
    pathway or phenotype annotations, and the
    functions of more than 40% of genes are still
    not defined!

Chen et al., 2007, BMC Bioinformatics
59
And the Benefits of Integrative Genomics.
  • To unravel the connection between genotype and
    phenotype - Systematically identify novel
    phenotype-genotype relationships.
  • Encourage standard terminology usage.
  • Saves time for researchers.
  • Hypotheses generator.
  • Paves way for prognosis, diagnosis, and
    personalized medicine (adverse drug reactions,
    etc.).
  • Deeper understanding of disease and an enhanced
    integration of medicine with biology.
  • Increasing knowledge of the genes associated with
    diseases will allow researchers to address more
    complicated issues, including the relative
    contributions to disease of genes in the core
    biological set shared by all species and those
    encoding proteins specific to humans; how
    sequence features (such as conservation and
    polymorphism) relate to disease characteristics;
    and how protein function relates to the outcome
    of clinical treatment.
  • And MANY MORE..

60
Take-home messages
  • Networks and integration of databases are keys to
    success in Bioinformatics.
  • Integrating data and computation into a single
    cohesive whole will increase the efficiency of
    research effort
    • by reducing the serendipitous, hit-and-miss
      nature of empirical research, and
    • by providing valuable clues to biomedical
      researchers on their choice of experiments,
      given limited funds, manpower, and time.
  • Users have to know what is available, how to
    access it (and what the limitations are), and how
    to use the resources they are offered.

61
PubMed
OMIM
62
Acknowledgement
  • Jing Chen
  • Siva Gowrisankar
  • Vivek Kaimal
  • Mrunal Deshmukh
  • Huan Xu

63
Thank You!
http://sbw.kgi.edu/
64
(No Transcript)