Title: Medical Informatics
1Medical Informatics
Bioinformatics
2Biomedical informatics: The broad discipline
concerned with the study and application of
computer science, information science,
informatics, cognitive science and human-computer
interaction in the practice of biological
research, biomedical science, medicine and
healthcare. Bioinformatics, clinical informatics,
and public health informatics or medical
informatics can be considered sub-domains
within biomedical informatics.
Bioinformatics: The merger of biotechnology and
information technology with the goal of revealing
new insights and principles in biology, OR the
science of managing and analyzing biological data
using advanced computing techniques. Especially
important in analyzing genomic research data.
Health Informatics or Medical Informatics:
The intersection of information science, computer
science, and health care. It deals with the
resources, devices, and methods required to
optimize the acquisition, storage, retrieval, and
use of information in health and biomedicine.
Wikipedia
3Mining Bio-Medical Mountains: How Computer Science
can help Biomedical Research and Health Sciences
Anil Jegga, Division of Biomedical Informatics,
Cincinnati Children's Hospital Medical Center
(CCHMC), Department of Pediatrics, University of
Cincinnati
http://anil.cchmc.org
Anil.Jegga_at_cchmc.org
4Algorithm: A fixed procedure embodied in a
computer program.
Base: One of the molecules that form DNA and RNA
molecules.
Base pair: Two nitrogenous bases (adenine and
thymine or guanine and cytosine) held together by
weak bonds. Two strands of DNA are held together
in the shape of a double helix by the bonds
between base pairs.
Wikipedia
5Nucleotide: A subunit of DNA or RNA consisting of
a nitrogenous base (adenine, guanine, thymine, or
cytosine in DNA; adenine, guanine, uracil, or
cytosine in RNA), a phosphate molecule, and a
sugar molecule (deoxyribose in DNA and ribose in
RNA). Thousands of nucleotides are linked to form
a DNA or RNA molecule.
Genome: All the genetic material in the
chromosomes of a particular organism; its size is
generally given as its total number of base pairs.
Genomics: The study of genes and their function.
Functional Genomics: The study of genes, their
resulting proteins, and the role played by the
proteins in the body's biochemical processes.
Wikipedia
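The base-pairing rule in the glossary (adenine with thymine, guanine with cytosine, and uracil in place of thymine in RNA) can be sketched in a few lines of Python. This is only an illustration added for readers who program; the function names are chosen here and are not from the slides.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(DNA_PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return complement(strand)[::-1]


def transcribe(strand: str) -> str:
    """Write a DNA sequence in the RNA alphabet: uracil (U) replaces thymine (T)."""
    return strand.upper().replace("T", "U")


if __name__ == "__main__":
    fragment = "CACACTTG"
    print(complement(fragment))
    print(reverse_complement(fragment))
    print(transcribe(fragment))
```

The two strands of a double helix run in opposite directions, which is why the complementary strand is usually reported reversed.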
6Two Separate Worlds..
Medical Informatics
Bioinformatics the omes
PubMed
Proteome
Disease Database
Patient Records
OMIM Clinical Synopsis
Clinical Trials
382 "omes" so far, and there is the UNKNOME too:
genes with no function known.
http://omics.org/index.php/Alphabetically_ordered_list_of_omics
With Some Data Exchange
7The genome is our Genetic Blueprint
- Nearly every human cell contains 23 pairs of
chromosomes: 1 to 22, and
- XY or XX
- XY: Male
- XX: Female
- The length of chromosomes 1 to 22, X, and Y
together is 3.2 billion bases.
8The Genome is Who We Are on the inside!
- Chromosomes consist of DNA
- molecular strings of A, C, G, T
- base pairs, A-T, C-G
- Genes
- DNA sequences that encode proteins
- less than 3% of the human genome
Information coded in DNA
95000 bases per page..
CACACTTGCATGTGAGAGCTTCTAATATCTAAATTAATGTTGAATCATT
ATTCAGAAACAGAGAGCTAACTGTTATCCCATCCTGACTTTATTCTTTAT
GAGAAAAATACAGTGATTCCAAGTTACCAAGTTAGTGCTGCTTGCTTTAT
AAATGAAGTAATATTTTAAAAGTTGTGCATAAGTTAAAATTCAGAAATAA
AACTTCATCCTAAAACTCTGTGTGTTGCTTTAAATAATCAGAGCATCTGC
TACTTAATTTTTTGTGTGTGGGTGCACAATAGATGTTTAATGAGATCCTG
TCATCTGTCTGCTTTTTTATTGTAAAACAGGAGGGGTTTTAATACTGGAG
GAACAACTGATGTACCTCTGAAAAGAGAAGAGATTAGTTATTAATTGAAT
TGAGGGTTGTCTTGTCTTAGTAGCTTTTATTCTCTAGGTACTATTTGATT
ATGATTGTGAAAATAGAATTTATCCCTCATTAAATGTAAAATCAACAGGA
GAATAGCAAAAACTTATGAGATAGATGAACGTTGTGTGAGTGGCATGGTT
TAATTTGTTTGGAAGAAGCACTTGCCCCAGAAGATACACAATGAAATTCA
TGTTATTGAGTAGAGTAGTAATACAGTGTGTTCCCTTGTGAAGTTCATAA
CCAAGAATTTTAGTAGTGGATAGGTAGGCTGAATAACTGACTTCCTATCA
TTTTCAGGTTCTGCGTTTGATTTTTTTTACATATTAATTTCTTTGATCCA
CATTAAGCTCAGTTATGTATTTCCATTTTATAAATGAAAAAAAATAGGCA
CTTGCAAATGTCAGATCACTTGCCTGTGGTCATTCGGGTAGAGATTTGTG
GAGCTAAGTTGGTCTTAATCAAATGTCAAGCTTTTTTTTTTCTTATAAAA
TATAGGTTTTAATATGAGTTTTAAAATAAAATTAATTAGAAAAAGGCAAA
TTACTCAATATATATAAGGTATTGCATTTGTAATAGGTAGGTATTTCATT
TTCTAGTTATGGTGGGATATTATTCAGACTATAATTCCCAATGAAAAAAC
TTTAAAAAATGCTAGTGATTGCACACTTAAAACACCTTTTAAAAAGCATT
GAGAGCTTATAAAATTTTAATGAGTGATAAAACCAAATTTGAAGAGAAAA
GAAGAACCCAGAGAGGTAAGGATATAACCTTACCAGTTGCAATTTGCCGA
TCTCTACAAATATTAATATTTATTTTGACAGTTTCAGGGTGAATGAGAAA
GAAACCAAAACCCAAGACTAGCATATGTTGTCTTCTTAAGGAGCCCTCCC
CTAAAAGATTGAGATGACCAAATCTTATACTCTCAGCATAAGGTGAACCA
GACAGACCTAAAGCAGTGGTAGCTTGGATCCACTACTTGGGTTTGTGTGT
GGCGTGACTCAGGTAATCTCAAGAATTGAACATTTTTTTAAGGTGGTCCT
ACTCATACACTGCCCAGGTATTAGGGAGAAGCAAATCTGAATGCTTTATA
AAAATACCCTAAAGCTAAATCTTACAATATTCTCAAGAACACAGTGAAAC
AAGGCAAAATAAGTTAAAATCAACAAAAACAACATGAAACATAATTAGAC
ACACAAAGACTTCAAACATTGGAAAATACCAGAGAAAGATAATAAATATT
TTACTCTTTAAAAATTTAGTTAAAAGCTTAAACTAATTGTAGAGAAAAAA
CTATGTTAGTATTATATTGTAGATGAAATAAGCAAAACATTTAAAATACA
AATGTGATTACTTAAATTAAATATAATAGATAATTTACCACCAGATTAGA
TACCATTGAAGGAATAATTAATATACTGAAATACAGGTCAGTAGAATTTT
TTTCAATTCAGCATGGAGATGTAAAAAATGAAAATTAATGCAAAAAATAA
GGGCACAAAAAGAAATGAGTAATTTTGATCAGAAATGTATTAAAATTAAT
AAACTGGAAATTTGACATTTAAAAAAAGCATTGTCATCCAAGTAGATGTG
TCTATTAAATAGTTGTTCTCATATCCAGTAATGTAATTATTATTCCCTCT
CATGCAGTTCAGATTCTGGGGTAATCTTTAGACATCAGTTTTGTCTTTTA
TATTATTTATTCTGTTTACTACATTTTATTTTGCTAATGATATTTTTAAT
TTCTGACATTCTGGAGTATTGCTTGTAAAAGGTATTTTTAAAAATACTTT
ATGGTTATTTTTGTGATTCCTATTCCTCTATGGACACCAAGGCTATTGAC
ATTTTCTTTGGTTTCTTCTGTTACTTCTATTTTCTTAGTGTTTATATCAT
TTCATAGATAGGATATTCTTTATTTTTTATTTTTATTTAAATATTTGGTG
ATTCTTGGTTTTCTCAGCCATCTATTGTCAAGTGTTCTTATTAAGCATTA
TTATTAAATAAAGATTATTTCCTCTAATCACATGAGAATCTTTATTTCCC
CCAAGTAATTGAAAATTGCAATGCCATGCTGCCATGTGGTACAGCATGGG
TTTGGGCTTGCTTTCTTCTTTTTTTTTTAACTTTTATTTTAGGTTTGGGA
GTACCTGTGAAAGTTTGTTATATAGGTAAACTCGTGTCACCAGGGTTTGT
TGTACAGATCATTTTGTCACCTAGGTACCAAGTACTCAACAATTATTTTT
CCTGCTCCTCTGTCTCCTGTCACCCTCCACTCTCAAGTAGACTCCGGTGT
CTGCTGTTCCATTCTTTGTGTCCATGTGTTCTCATAATTTAGTTCCCCAC
TTGTAAGTGAGAACATGCAGTATTTTCTAGTATTTGGTTTTTTGTTCCTG
TGTTAATTTGCCCAGTATAATAGCCTCCAGCTCCATCCATGTTACTGCAA
AGAACATGATCTCATTCTTTTTTATAGCTCCATGGTGTCTATATACCACA
TTTTCTTTATCTAAACTCTTATTGATGAGCATTGAGGTGGATTCTATGTC
TTTGCTATTGTGCATATTGCTGCAAGAACATTTGTGTGCATGTGTCTTTA
TGGTAGAATGATATATTTTCTTCTGGGTATATATGCAGTAATGCGATTGC
TGGTTGGAATGGTAGTTCTGCTTTTATCTCTTTGAGGAATTGCCATGCTG
CTTTCCACAATAGTTGAACTAACTTACACTCCCACTAACAGTGTGTAAGT
GTTTCCTTTTCTCCACAACCTGCCAGCATCTGTTATTTTTTGACATTTTA
ATAGTAGCCATTTTAACTGGTATGAAATTATATTTCATTGTGGTTTTAAT
TTGCATTTCTCTAATGATCAGTGATATTGAGTTTGTTTTTTTTCACATGC
TTGTTGGCTGCATGTATGTCTTCTTTTAAAAAGTGTCTGTTCATGTACTT
TGCCCACATTTTAATGGGGTTGTTTTTCTCTTGTAAATTTGTTTAAATTC
CTTATAGGTGCTGGATTTTAGACATTTGTCAGACGCATAGTTTGCAAATA
GTTTCTCCCATTCTGTAGGTTGTCTGTTTATTTTGTTAATAGTTTCTTTT
GCTATGCAGAAGCTCTTAATAAGTTTAATGAGATCCTGATATGTTAGGCT
TTGTGTCCCCACCCAAATCTCATCTTGAATTATATCTCCATAATCACCAC
ATGGAGAGACCAGGTGGAGGTAATTGAATCTGGGGGTGGTTTCACCCATG
CTGTTCTTGTGATAGTGAATGAGTTCTCACGAGATCTAATGGTTTTATGA
GGGGCTCTTCCCAGCTTTGCCTGGTACTTCTCCTTCCTGCCGCTTTGTGA
AAAAGGTGCATTGCGTCCCTTTCACCTTCTTCTATAATTGTAAGTTTCCT
GAGGCCTTCCCAGCCATGCTGAACTTCAAGTCAATTAAACCTTTTTCTTT
ATAAATTACTCAGTCTCTGGTGGTTCTTTATAGCAGTGTGAAAATGGACT
AATGAAGTTCCCATTTATGAATTTTTGCTTTTGTTGCAATTGCTTTTGAC
ATCTTAGTCATGAAATCCTTGCCTGTTCTAAGTACAGGACGGTATTGCCT
AGGTTGTCTTCCAGGGTTTTTCTAATTTTGTGTTTTGCATTTAAGTGTTT
AATCCATCTTGAGTTGATTTTTGTATATTGTGTAAGGAAGGGGTCCAGTT
TCAATCTTTTGCATATGGCTAGTTAGTTATCCCAGTACCATTTATTGAAA
AGACAGTCTTTTCCCCATCGCTCGTTTTTGTCAGTTTTATTGATGATCAG
ATAATCATAGCTGTGTGGCTTTATTTCTGGGTTCTTTATTCTGTTCTATT
GGTTTATGTCCCTGTTTTTGTGCCAGTACCATGCTGTTTTGGTTAACATA
GCCCTGTAGTATAGTTTGAGGTCAGATAGCCTGATGCTTCCAGCTTTGTT
CTTTTTCTTAAGATTGCCTTGGCTATTTGGCCTCTTTTTTGGTTCCACAT
GAATTTTAAAACAGTTGTTTCTAGTTTTTGAAGAATGTCATTGGTAGTTT
GATAGAAATAGCATTTAATCTGTAAATTGATTTGTGCAGTATGGCCTTTT
AATGATATTGATTCTTCCTATCCATGAGCATGATATGTTTTCCATTTTGT
TTGTATCCTCTCTGATTTCTTTGTGCAGTGTTTTGTAATTCTCATTGTAG
AGATTTTTCACCTCCCTGGTTAGTTGTATTTTACCCTAGATATTTTATTC
TTTTTGTGAAAATTGTGAATGGGATTGCCTTCCTGATTTGACTGCCAGCT
TGGTTACTGTTGGTTTATAGAAATGCTAGTGATTTTTGTACATTGATTTT
CTTTCTAAAACTTTGCTGAAGTTTTTTTTATTAGCAGAAGGAGCTTTGGG
GCTGAGACTATGGGGTTTTCTAGATATAGAATCATGTCAGCTTCAAATAG
GGATAATTTTACTTCCTCTCTTCCTATTTGGATGCCCTTTATTTCTTTCT
CTTGCCTGATTACTCTGGCTGGGATTTCCTATGTTGAATAGGAGTCATGA
GAGAGGGCATCAAATCTACACATATCAAATACTAACCTTGAATGTCTAGA
T
10How much data make up the human genome?
- 3 pallets with 40 boxes per pallet x 5000 pages
per box x 5000 bases per page = 3,000,000,000
bases!
- To get an accurate sequence requires
- 6-fold coverage!
- Now imagine shredding 18 pallets and reassembling!
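The pallet arithmetic on this slide can be checked directly. The short Python sketch below only restates the slide's own numbers; the variable names are chosen here for illustration.

```python
# Numbers taken from the slide.
PALLETS = 3
BOXES_PER_PALLET = 40
PAGES_PER_BOX = 5000
BASES_PER_PAGE = 5000
COVERAGE = 6  # each base is read about six times to get an accurate sequence

# One printed copy of the genome.
total_bases = PALLETS * BOXES_PER_PALLET * PAGES_PER_BOX * BASES_PER_PAGE

# Six-fold coverage means six copies' worth of paper and bases to reassemble.
pallets_at_coverage = PALLETS * COVERAGE
bases_sequenced = total_bases * COVERAGE

if __name__ == "__main__":
    print(f"{total_bases:,} bases in one copy")
    print(f"{pallets_at_coverage} pallets at {COVERAGE}-fold coverage")
    print(f"{bases_sequenced:,} bases to sequence and reassemble")
```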
11Human Genome Project: Initial Stages
- Most of the initial phases were primarily focused
on improving and speeding up the technology to
sequence and analyze DNA.
- Scientists all around the world worked to make
detailed maps of our chromosomes and sequence
model organisms, like worm, fruit fly, and mouse.
Image Courtesy Google Images
12Overwhelming Challenges
- First there was the Assembly
- The DNA sequence is so long that no
technology can read it all at once, so it was
broken into pieces.
- There were millions of clones (small sequence
fragments).
- The assembly process included finding where
the pieces overlapped in order to put the draft
together.
3,200,000 piece puzzle anyone?
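The idea of "finding where the pieces overlapped" can be illustrated with a toy greedy assembler in Python. This is a simplified sketch, not the method used by the Human Genome Project: it assumes error-free fragments from one strand and no repeats, and the function names and `min_len` parameter are chosen here for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Three overlapping reads cut from the first 22 bases shown on the
    # "5000 bases per page" slide, given in scrambled order.
    reads = ["GTGAGAGCTTC", "CACACTTGCATG", "TGCATGTGAGAG"]
    print(greedy_assemble(reads))
```

Comparing every pair of fragments is quadratic in the number of fragments, which is one reason assembling millions of real fragments needed far more sophisticated methods.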
14The Completion of the Human Genome Sequence
- On June 26, 2000, President Clinton, with J.
Craig Venter and Francis Collins, announces
completion of "the first survey of the entire
human genome" (80% working draft).
- Publication of 90 percent of the sequence in the
February 2001 issue of the journal Nature.
- Completion of 99.99% of the genome as finished
sequence in July 2003.
Image Courtesy Google Images
15But the Project is not Done
Human Genome is finally Sequenced!!!
- Next there is the Annotation
- The sequence is like a topographical map; the
annotation would include cities, towns, schools,
libraries and coffee shops!
- So, where are the genes?
- How do genes function?
- How do we use this information for scientific
understanding?
- How does it benefit or improve health care?
16What do genes do anyway?
- As per current estimate, we only have 27,000
genes! That means each gene has to do a lot!
- Genes make proteins that make up nearly all we
are (bones, muscles, hair, eyes, etc.).
- Almost everything that happens in our bodies
happens because of proteins (walking, digestion,
fighting disease).
Image Courtesy Google Images
17Of Mice and Men: It's all in the genes
- Humans and mice have about the same number of
genes. So why are we so different from each
other? How is this possible?
Did you say cheese?
Mmm, Cheese!
- While one human gene can make many different
proteins, a mouse gene can probably make only a
few!
Image Courtesy Google Images
18Genes are important
- By selecting different pieces of a gene, your
body can make many kinds of proteins. (This
process is called alternative splicing.)
- If a gene is expressed, that means it is turned
on and it will make proteins.
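To see how "selecting different pieces of a gene" multiplies the number of products, here is a small Python sketch. It models only one simplified form of alternative splicing (skipping internal exons while keeping the first and last); the exon sequences are made up and the function name is chosen here for illustration.

```python
from itertools import combinations


def exon_skipping_isoforms(exons: list[str]) -> list[str]:
    """Enumerate transcripts that keep the first and last exon and any
    in-order subset of the internal exons (exon skipping only).

    Expects at least two exons.
    """
    first, *internal, last = exons
    isoforms = []
    # Start with the full-length transcript, then skip more and more exons.
    for keep in range(len(internal), -1, -1):
        for chosen in combinations(internal, keep):
            isoforms.append("".join((first, *chosen, last)))
    return isoforms


if __name__ == "__main__":
    # Four made-up exons: two fixed ends and two optional internal exons.
    toy_gene = ["ATG", "GCC", "TTA", "TAA"]
    for transcript in exon_skipping_isoforms(toy_gene):
        print(transcript)
```

Under this simplified model a gene with n exons can yield up to 2^(n-2) transcripts, which is how a modest number of genes can encode a much larger number of proteins.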
19What we've learned from our genome so far
- There are a relatively small number of human
genes, fewer than 30,000, but they have a complex
architecture that we are only beginning to
understand and appreciate.
- We know where 85% of genes are in the sequence.
- We don't know where the other 15% are because we
haven't seen them "on" (they may only be
expressed during fetal development).
- We only know what about 50% of our genes do so
far.
- So it is relatively easy to locate genes in the
genome, but it is hard to figure out what they
do.
20How do scientists find genes?
- The genome is so large that useful information is
hard to find.
- Researchers use a computational microscope to
help them search the genome.
- Just as you would use Google to find something
on the internet, researchers can use the Genome
Browser to find information in the human genome.
Image Courtesy Google Images
21The Continuing Project
- Finding the complete set of genes and annotating
the entire sequence. Annotation is like
detailing: scientists annotate sequence by
listing what has been learned experimentally and
computationally about its function.
- Proteomics is the study of the structure and
function of groups of proteins. Proteins are really
important, but we don't really understand how
they work.
- Comparative Genomics is the process of comparing
different genomes in order to better understand
what they do and how they work, such as comparing
humans, chimpanzees, and mice, which are all
mammals but all quite different.
Image Courtesy Google Images
22Who works on this stuff anyway?
- Biologists and Chemists understand the physical
sciences; they take biology and chemistry classes.
- Computer Scientists program the computers (the
same people who make video games!); they take math
and computer classes.
- Computer Engineers try to build better, faster,
smarter computers; they take math, physics and
computer classes.
- Social Scientists try to understand how this new
information and technology will impact our
lives; they take sociology and philosophy classes.
23How can I work on this project, or something like
it?
- Read about it online at http://www.genome.gov,
or in Nature, Science, or other scientific
magazines.
- Take classes in biology, chemistry, mathematics
and physics at high school.
- Go to college and get a degree in science,
engineering, mathematics, or social sciences.
24Bioinformatics Opportunities
Ph.D.: Director/Professor - University, Company
(Pharmaceutical), National Laboratory, Research
Foundation
M.S. (M.A.): Research Staff - Company/University,
National Laboratory, Research Foundation;
Teaching - Community College, Public Schools
B.S. (B.A.): Entry-Level - Company, National
Laboratory; Teaching - Private Schools
Majors: Bioinformatics, Biochemistry, Biology,
Computer Science, Computer Engineering,
Mathematics, Physics, Linguistics (Education,
Sociology, Philosophy, Psychology, Community
Studies). A research degree in any of these majors
will take you far!
25now. The number 1 FAQ
How much biology should I know??
No simple or straight-forward answer
unfortunately!
But the mantra is: Take the classes and interact
routinely with biologists, OR work with the
biologists or the biological data.
High School Senior Summer Internship:
http://www.cincinnatichildrens.org/ed/research/undergrad/hs/default.htm
Summer Undergraduate Research Fellowship:
http://www.cincinnatichildrens.org/ed/research/undergrad/surf/default.htm
26But I want to start with some basics..
- http://www.ncbi.nlm.nih.gov/Education
- http://www.ebi.ac.uk/2can/
- http://www.genome.gov/Education/
- http://genomics.energy.gov/
- Books
- Introduction to Bioinformatics by Teresa Attwood,
David Parry-Smith
- A Primer of Genome Science by Gibson G and Muse
SV
- Bioinformatics: A Practical Guide to the Analysis
of Genes and Proteins, Second Edition by Andreas
D. Baxevanis, B. F. Francis Ouellette
- Algorithms on Strings, Trees, and Sequences:
Computer Science and Computational Biology by Dan
Gusfield
- Bioinformatics: Sequence and Genome Analysis by
David W. Mount
- Discovering Genomics, Proteomics, and
Bioinformatics by A. Malcolm Campbell and Laurie
J. Heyer
27Biological Challenges - Computer Engineers
- Post-genomic Era and the goal of bio-medicine
- to develop a quantitative understanding of how
living things are built from the genome that
encodes them.
- Deciphering the genome code
- Identifying unknown genes and assigning function
by computational analysis of genomic sequence
- Identifying the regulatory mechanisms
- Identifying their role in normal
development/states vs disease states
28Biological Challenges - Computer Engineers
- Data Deluge: exponential growth of data silos and
different data types
- Human-computer interaction specialists need to
work closely with academic and clinical
biomedical researchers, not only to manage the
data deluge but also to convert information into
knowledge.
- Biological data is very complex and interlinked!
- Creating information systems that allow
biologists to seamlessly follow these links
without getting lost in a sea of information is a
huge opportunity for computer scientists.
29Biological Challenges - Computer Engineers
A major goal in molecular biology is Functional
Genomics: the study of the relationships among
genes in DNA and their function in normal and
disease states
- Networks, networks, and networks!
- Each gene in the genome is not an independent
entity. Multiple genes interact to perform a
specific function.
- Environmental influences: genotype-environment
interaction
- Integrating genomic and biochemical data together
into quantitative and predictive models of
biochemistry and physiology
- Computer scientists, mathematicians, and
statisticians will ALL be an integral and
critical part of this effort.
30Informatics: Biologists' Expectations
- Representation, Organization, Manipulation,
Distribution, Maintenance, and Use of
information, particularly in digital form.
- Functional aspect of bioinformatics:
Representation, Storage, and Distribution of
data.
- Intelligent design of data formats and databases
- Creation of tools to query those databases
- Development of user interfaces or visualizations
that bring together different tools to allow the
user to ask complex questions or put forth
testable hypotheses.
31Informatics: Biologists' Expectations
- Developing analytical tools to discover knowledge
in data
- Levels at which biological information is used
- comparing sequences to predict the function of a
newly discovered gene
- breaking down known 3D protein structures into
bits to find patterns that can help predict how
the protein folds
- modeling how proteins and metabolites in a cell
work together to make the cell function.
32Finally... What does informatics mean to
biologists?
- The ultimate goal of analytical bioinformaticians
is to develop predictive methods that allow
biomedical researchers and scientists to model
the function and phenotype of an organism based
only on its genomic sequence. This is a grand
goal, and one that will be approached only in
small steps, by many scientists from different
but allied disciplines working cohesively.
33Biology Data Structures
- Four broad categories
- Strings: To represent DNA, RNA, amino acid
sequences of proteins
- Trees: To represent the evolution of various
organisms (Taxonomy) or structured knowledge
(Ontologies)
- Sets of 3D points and their linkages: To
represent protein structures
- Graphs: To represent metabolic, regulatory, and
signaling networks or pathways
34Biology Data Structures
- Biologists are also interested in
- Substrings
- Subtrees
- Subsets of points and linkages, and
- Subgraphs.
Beware: Biological data is often characterized by
huge size, the presence of laboratory errors
(noise), duplication, and sometimes unreliability.
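The four categories, and the "sub-" versions biologists ask about, map onto everyday Python types. The sketch below is one possible representation, not a standard; the coordinates, bonds, and gene names are placeholders invented for illustration.

```python
from dataclasses import dataclass, field

# 1. Strings: a DNA sequence, and a substring (motif) search.
dna = "CACACTTGCATGTGAGAGCTTC"
motif_position = dna.find("GCATG")  # index of the first match, or -1 if absent


# 2. Trees: a tiny taxonomy as nested nodes.
@dataclass
class TreeNode:
    name: str
    children: list["TreeNode"] = field(default_factory=list)

    def leaves(self) -> list[str]:
        """Names of all leaf nodes below (or at) this node, left to right."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


mammals = TreeNode("Mammalia", [
    TreeNode("Primates", [TreeNode("Human"), TreeNode("Chimpanzee")]),
    TreeNode("Rodentia", [TreeNode("Mouse")]),
])
primates = mammals.children[0]  # a subtree

# 3. Sets of 3D points and their linkages: atoms with coordinates plus bonds.
#    The coordinates are placeholders, not a real structure.
atoms = {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "C": (2.0, 1.4, 0.0)}
bonds = {("N", "CA"), ("CA", "C")}

# 4. Graphs: a regulatory network as an adjacency mapping (placeholder genes).
network = {"geneA": {"geneB", "geneC"}, "geneB": {"geneC"}, "geneC": set()}


def subgraph(graph: dict[str, set[str]], nodes) -> dict[str, set[str]]:
    """Keep only the chosen nodes and the edges that run between them."""
    keep = set(nodes)
    return {node: graph[node] & keep for node in graph if node in keep}


if __name__ == "__main__":
    print(motif_position)
    print(primates.leaves())
    print(subgraph(network, ["geneA", "geneC"]))
```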
35Support Complex Queries: A typical demand
- Get me all genes involved in or associated with
brain development that are differentially
expressed in the Central Nervous System.
- Get me all genes involved in brain development in
human and mouse that also show iron ion binding
activity.
- For this set of genes, what aspects of function
and/or cellular localization do they share?
- For this set of genes, what mutations are
reported to cause pathological conditions?
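Queries like these are, at heart, filters and set intersections over annotated genes. The Python sketch below shows the shape of the second and third queries on a toy table; every gene name and annotation in it is a placeholder invented for illustration, not data from any real database.

```python
# Hypothetical annotation table (placeholder genes and annotations).
GENES = {
    "GENE_A": {"species": "human", "process": {"brain development"},
               "function": {"iron ion binding"}, "location": {"nucleus"}},
    "GENE_B": {"species": "mouse", "process": {"brain development"},
               "function": {"iron ion binding", "DNA binding"},
               "location": {"nucleus", "cytoplasm"}},
    "GENE_C": {"species": "human", "process": {"brain development"},
               "function": {"DNA binding"}, "location": {"nucleus"}},
    "GENE_D": {"species": "human", "process": {"heart development"},
               "function": {"iron ion binding"}, "location": {"membrane"}},
}


def select_genes(genes, species, process, function):
    """Genes from the given species annotated with both the process and the function."""
    return sorted(
        name for name, ann in genes.items()
        if ann["species"] in species
        and process in ann["process"]
        and function in ann["function"]
    )


def shared(genes, names, key):
    """Annotations of one kind (e.g. 'location') common to every listed gene."""
    sets = [genes[name][key] for name in names]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    hits = select_genes(GENES, {"human", "mouse"}, "brain development", "iron ion binding")
    print(hits)
    print(shared(GENES, hits, "location"))
```

The hard part in practice is not the filter itself but getting the annotations from different databases into one consistent vocabulary, which is what the following slides discuss.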
36Model Organism Databases Common Issues
- Heterogeneous Data Sets - Data Integration
- From Genotype to Phenotype
- Experimental and Consensus Views
- Incorporation of Large Datasets
- Whole genome annotation pipelines
- Large scale mutagenesis/variation projects
(dbSNP)
- Computational vs. Literature-based Data
Collection and Evaluation (MedLine)
- Data Mining
- extraction of new knowledge
- testable hypotheses (Hypothesis Generation)
37Human Genome Project Data Deluge
No. of Human Gene Records currently in NCBI:
29,413 (excluding pseudogenes, mitochondrial genes
and obsolete records). Includes 460 microRNAs.
NCBI Human Genome Statistics as of February 12,
2008
38The Gene Expression Data Deluge
Till 2000: 413 papers on microarray!
Problems: Deluge! Allison DB, Cui X, Page GP,
Sabripour M. 2006. Microarray data analysis: from
disarray to consolidation and consensus. Nat Rev
Genet. 7(1): 55-65.
39Information Deluge..
- 3 scientific journals in 1750
- Now: >120,000 scientific journals!
- >500,000 medical articles/year
- >4,000,000 scientific articles/year
- >16 million abstracts in PubMed derived from
>32,500 journals
40Data-driven Problems..
- Generally, the names refer to some feature of the
mutant phenotype
- Dickie's small eye (Theiler et al., 1978, Anat
Embryol (Berl), 155: 81-86) is now Pax6
- Gleeful: "This gene encodes a C2H2 zinc finger
transcription factor with high sequence
similarity to vertebrate Gli proteins, so we have
named the gene gleeful (Gfl)." (Furlong et al.,
2001, Science 293: 1632)
What's in a name!
Rose is a rose is a rose is a rose!
Gene Nomenclature
- Disease names
- Mobius Syndrome with Poland's Anomaly
- Werner's syndrome
- Down's syndrome
- Angelman's syndrome
- Creutzfeldt-Jakob disease
- Accelerin
- Antiquitin
- Bang Senseless
- Bride of Sevenless
- Christmas Factor
- Cockeye
- Crack
- Draculin
- Dickie's small eye
- Fidgetin
- Gleeful
- Knobhead
- Lunatic Fringe
- Mortalin
- Orphanin
- Profilactin
- Sonic Hedgehog
41Rose is a rose is a rose is a rose.. Not Really!
What is a cell?
- any small compartment
- (biology) the basic structural and functional
unit of all organisms; they may exist as
independent units of life (as in monads) or may
form colonies or tissues as in higher plants and
animals
- a device that delivers an electric current as a
result of chemical reaction
- a small unit serving as part of or as the nucleus
of a larger political movement
- cellular telephone: a hand-held mobile
radiotelephone for use in an area divided into
small sections, each with its own short-range
transmitter/receiver
- small room in which a monk or nun lives
- a room where a prisoner is kept
Image Sources Somewhere from the internet and
Google Images
42Foundation Model Explorer
43- COLORECTAL CANCER 3-BP DEL, SER45DEL
- COLORECTAL CANCER SER33TYR
- PILOMATRICOMA, SOMATIC SER33TYR
- HEPATOBLASTOMA, SOMATIC THR41ALA
- DESMOID TUMOR, SOMATIC THR41ALA
- PILOMATRICOMA, SOMATIC ASP32GLY
- OVARIAN CARCINOMA, ENDOMETRIOID TYPE, SOMATIC
SER37CYS
- HEPATOCELLULAR CARCINOMA SOMATIC SER45PHE
- HEPATOCELLULAR CARCINOMA SOMATIC SER45PRO
- MEDULLOBLASTOMA, SOMATIC SER33PHE
The REAL Problems
Many disease states are complex because of many
genes (alleles, ethnicity, gene families, etc.),
environmental effects (lifestyle, exposure,
etc.), and their interactions.
44The REAL Problems
45Methods for Integration
- Link driven federations
- Explicit links between databanks.
- Warehousing
- Data is downloaded, filtered, integrated and
stored in a warehouse. Answers to queries are
taken from the warehouse.
- Others: Semantic Web, etc.
46Link-driven Federations
- Creates explicit links between databanks
- query, get interesting results, and use web links
to reach related data in other databanks
- Examples: NCBI-Entrez, SRS
47http://www.ncbi.nlm.nih.gov/Database/datamodel/
52Link-driven Federations
- Advantages
- complex queries
- Fast
- Disadvantages
- require good knowledge
- syntax based
- terminology problem not solved
53Data Warehousing
Data is downloaded, filtered, integrated and
stored in a warehouse. Answers to queries are
taken from the warehouse.
- Advantages
- Good for very-specific, task-based queries and
studies.
- Since it is custom-built and usually
expert-curated, relatively less error-prone.
- Disadvantages
- Can become quickly outdated; needs constant
updates.
- Limited functionality: e.g., restricted to one
disease or one system.
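The warehousing pattern (download, filter, integrate, store, then answer queries locally) can be sketched with Python's built-in `sqlite3` module. The two "source" extracts, the table layout, and the function names below are all assumptions made for illustration; no real databank is queried.

```python
import sqlite3

# Hypothetical extracts "downloaded" from two separate sources (placeholders).
GENE_SOURCE = [("G1", "GENE1", "7"), ("G2", "GENE2", "11")]
DISEASE_SOURCE = [("G1", "disease X"), ("G1", "disease Y"), ("G3", "disease Z")]


def build_warehouse() -> sqlite3.Connection:
    """Load both sources into one in-memory database."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT, chromosome TEXT)"
    )
    con.execute("CREATE TABLE gene_disease (gene_id TEXT, disease TEXT)")
    con.executemany("INSERT INTO gene VALUES (?, ?, ?)", GENE_SOURCE)
    # Filter step: drop disease links that point to genes we do not hold.
    known_ids = {row[0] for row in GENE_SOURCE}
    links = [row for row in DISEASE_SOURCE if row[0] in known_ids]
    con.executemany("INSERT INTO gene_disease VALUES (?, ?)", links)
    con.commit()
    return con


def diseases_for(con: sqlite3.Connection, symbol: str) -> list[str]:
    """Answer a query from the warehouse alone, without contacting the sources."""
    rows = con.execute(
        "SELECT d.disease FROM gene g "
        "JOIN gene_disease d ON g.gene_id = d.gene_id "
        "WHERE g.symbol = ? ORDER BY d.disease",
        (symbol,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    warehouse = build_warehouse()
    print(diseases_for(warehouse, "GENE1"))
```

The sketch also shows the stated disadvantage: once the sources change, the warehouse holds stale copies until `build_warehouse` is run again.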
54Algorithms in Bioinformatics
- Finding similarities among strings
- Detecting certain patterns within strings
- Finding similarities among parts of spatial
structures (e.g. motifs)
- Constructing trees
- Phylogenetic or taxonomic trees: evolution of an
organism
- Ontologies: structured/hierarchical
representation of knowledge
- Classifying new data according to previously
clustered sets of annotated data
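The first two items (similarity among strings, patterns within strings) can be illustrated with two classic building blocks in Python: Levenshtein edit distance by dynamic programming, and a scan for every occurrence of a motif. These are textbook stand-ins chosen for illustration; production sequence comparison relies on scored alignments and indexed search rather than these plain versions.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b` (Levenshtein distance)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def motif_positions(sequence: str, motif: str) -> list[int]:
    """Start index of every occurrence of `motif`, overlaps included."""
    last_start = len(sequence) - len(motif)
    return [i for i in range(last_start + 1) if sequence.startswith(motif, i)]


if __name__ == "__main__":
    print(edit_distance("GATTACA", "GACTACA"))
    print(motif_positions("ATATAT", "ATA"))
```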
55Algorithms in Bioinformatics
- Reasoning about microarray data and the
corresponding behavior of pathways
- Predictions of deleterious effects of changes in
DNA sequences
- Computational linguistics: NLP/Text-mining of
published literature or patient records
- Graph Theory: Gene regulatory networks,
functional networks, etc.
- Visualization and GUIs (networks, application
front ends, etc.)
56Disease Gene Identification and Prioritization
Hypothesis: Functionally similar or related genes
cause similar disease.
- Functional Similarity: Common/shared features
- Gene Ontology term
- Pathway
- Phenotype
- Chromosomal location
- Expression
- Cis regulatory elements (Transcription factor
binding sites)
- miRNA regulators
- Interactions
- Other features..
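One simple way to turn "shared features" into a ranking is to score each candidate gene by how much its feature set overlaps that of a known disease gene. The Python sketch below uses the Jaccard index for this; it is an illustrative choice, not necessarily the scoring used in the work presented here, and all feature labels and gene names are placeholders.

```python
def jaccard(a: set, b: set) -> float:
    """Shared features divided by all features seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(known_features: set, candidates: dict[str, set]) -> list[str]:
    """Candidate genes ordered from most to least similar to the known gene.

    Ties are broken alphabetically so the order is reproducible.
    """
    scored = [(jaccard(known_features, feats), gene) for gene, feats in candidates.items()]
    return [gene for _score, gene in sorted(scored, key=lambda item: (-item[0], item[1]))]


if __name__ == "__main__":
    # Placeholder features of a known disease gene and three candidates.
    known = {"GO:brain development", "pathway:P1", "chr:3"}
    candidates = {
        "CAND1": {"GO:brain development", "pathway:P1"},
        "CAND2": {"chr:3", "expr:liver"},
        "CAND3": {"expr:liver"},
    }
    print(rank_candidates(known, candidates))
```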
57PPI - Predicting Disease Genes
- Direct protein-protein interactions (PPI) are one
of the strongest manifestations of a functional
relation between genes.
- Hypothesis: Interacting proteins lead to the same
or similar diseases when mutated.
- Several genetically heterogeneous hereditary
diseases are shown to be caused by mutations in
different interacting proteins, e.g.,
Hermansky-Pudlak syndrome and Fanconi anaemia.
Hence, protein-protein interactions might in
principle be used to identify potentially
interesting disease gene candidates.
58- Prioritize candidate genes in the interacting
partners of the disease-related genes
- Training sets: disease-related genes
- Test sets: interacting partners of the training
genes
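The training-set and test-set idea can be sketched as follows: start from known disease genes, collect their interaction partners, and rank each partner by how many distinct disease genes it touches. This counting rule is a deliberately simple assumption for illustration, and the gene labels (D1, C1, ...) are placeholders.

```python
def rank_by_interactions(interactions, disease_genes):
    """Rank non-disease genes by the number of distinct disease genes they interact with.

    `interactions` is an iterable of undirected (gene, gene) pairs.
    Returns (candidate, count) tuples, highest count first, ties alphabetical.
    """
    disease = set(disease_genes)
    partners_of = {}  # candidate -> set of disease genes it interacts with
    for a, b in interactions:
        # An interaction is undirected, so check it in both orientations.
        for known, partner in ((a, b), (b, a)):
            if known in disease and partner not in disease:
                partners_of.setdefault(partner, set()).add(known)
    ranked = ((gene, len(known)) for gene, known in partners_of.items())
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    ppi = [("D1", "C1"), ("D2", "C1"), ("D1", "C2"), ("C2", "C3"), ("D1", "D2")]
    print(rank_by_interactions(ppi, {"D1", "D2"}))
```

Genes that only interact with other candidates (C3 here) never enter the ranking, which mirrors the slide's restriction of the test set to direct partners of the training genes.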
5915
342
2469
63PubMed
OMIM
64http://sbw.kgi.edu/