Title: http://creativecommons.org/licenses/by-sa/2.5/ca/
1http://creativecommons.org/licenses/by-sa/2.5/ca/
http://tinyurl.com/3cw4ql
2CAN SCIENTISTS CURE CANCER WITH COMPUTERS?
- February 12th, 2008
- Francis Ouellette francis_at_oicr.on.ca
- Associate Director, Informatics and Biocomputing,
Ontario Institute for Cancer Research
3CAN SCIENTISTS CURE CANCER WITH COMPUTERS?
4Take two bytes and call me in the morning!
5CAN SCIENTISTS CURE CANCER WITHout COMPUTERS?
6Byte my Genes
- Using computers to understand our DNA
7Bioinformatics
- Computational biology
- Biocomputing
- Theoretical biology
- Biometry
- Statistical Genomics
8What is Bioinformatics?
9Bioinformatics is about integrating biological
themes together with the help of computer tools
and biological databases, and gaining new
knowledge about the system in study.
10National Center for Biotechnology Information
(NCBI)
http://ncbi.nlm.nih.gov
11Computers
Laboratory
Maytag cycle
12The problem: not reinventing the wheel!
- Pegasys: a workflow management tool
- Atlas: a data warehouse
- Already available: Apollo, NCBI toolkit
Apollo
Atlas
Pegasys
gamexml
parser
ASN.1
http://www.fruitfly.org/annot/apollo/
(BDGP-EBI)
FASTA file
13http://www.cytoscape.org/
14BLAST Result
- Basic
- Local
- Alignment
- Search
- Tool
15Comparative Analysis in Biology
Jim Ostell
Human
Dog
16http://upload.wikimedia.org/wikipedia/en/5/5b/Evolution_pl.png
Created by Jerry Crimson Mann 06:25, 2 August 2005 (UTC).
17Comparative Analysis of Genes
Jim Ostell
Human  638 RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC 697
Yeast  657 RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC 716
E.coli 584 RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA 642
Colon cancer gene sequence
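The comparison on this slide can be made concrete by counting identical columns in the alignment. The following is a minimal sketch, not the method used to produce the slide: the three fragments are copied from the slide, while the function names and the choice to divide by the full alignment length (gap column included) are assumptions made for illustration.

```python
# Aligned fragments copied from the slide; "-" marks an alignment gap.
ALIGNED = {
    "Human":  "RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC",
    "Yeast":  "RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC",
    "E.coli": "RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA",
}


def percent_identity(a: str, b: str) -> float:
    """Percentage of alignment columns where both sequences share a residue.

    Assumption: the denominator is the full alignment length, gaps included.
    Other conventions (e.g. ignoring gap columns) give slightly different values.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def conserved_columns(seqs) -> str:
    """Return the residue where every sequence agrees, and '.' elsewhere."""
    return "".join(
        col[0] if len(set(col)) == 1 and col[0] != "-" else "."
        for col in zip(*seqs)
    )


if __name__ == "__main__":
    names = list(ALIGNED)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            pid = percent_identity(ALIGNED[first], ALIGNED[second])
            print(f"{first} vs {second}: {pid:.1f}% identical")
    print(conserved_columns(ALIGNED.values()))
```

Run as a script, this prints one identity value per pair and a line showing which columns are conserved in all three organisms.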
18Mark Boguski, NCBI
Comparative Analysis of Genomes
"Anything that is true for E. coli is true for the elephant." Jacques Monod, 1972
19Comparative Genomics Humans vs Rodents
Chris Ponting http://www.stats.ox.ac.uk/hein/HumanGenome/Ponting1.ppt
Human and mouse c-kit mutations show similar
phenotypes. The utility of mouse as a biomedical
model for human disease is enhanced when
mutations in orthologous genes give similar
phenotypes in both organisms. In a visually
striking example of this, the same pattern of
hypopigmentation is seen in (a) a patient with
the piebald trait and (b) a mouse with dominant
spotting, both resulting from heterozygous
mutations of the c-kit proto-oncogene.
20Why is there Bioinformatics?
Fiona Brinkman
Sequencing technology!
- Lots of new sequences being added
- Automated sequencers
- Genome Projects
- EST sequencing
- Microarray studies
- Proteomics
- Metagenomics (the functional and sequence-based analysis of the collective microbial genomes contained in an environmental sample)
- Whole genome sequencing and WGAS (whole genome association studies)
- Patterns in datasets that can only be analyzed using computers
21High Throughput sequencing
John McPherson
- Illumina/Solexa GA
- 25-35 bases
- 40,000,000 - 60,000,000 reads
- 1,500,000,000 bases (0.25x genome coverage)
- 3 day run time
- OICR has two of these
- 4 on order.
22Next-generation Sequencing
John McPherson
- Applied Biosystems SOLiD
- 25-35 bases
- 80,000,000 reads
- 2,500,000,000 bases (0.4x genome coverage)
- 3 day run time
- OICR has two of these 4 on order
23Ramp up April 2008
John McPherson
24 billion nucleotides in 3 days ≈ 1 human genome/day
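The throughput figures on the sequencing slides can be checked with simple arithmetic. This is a sketch under one explicit assumption: the slides do not state the genome size behind their coverage values, but "0.25x" and "0.4x" are consistent with a denominator of about 6 billion bases (a diploid human genome), so that value is used here. The constant and function names are illustrative only.

```python
# Figures quoted on the slides.
ILLUMINA_BASES_PER_RUN = 1_500_000_000
SOLID_BASES_PER_RUN = 2_500_000_000
RAMP_UP_BASES_PER_RUN = 24_000_000_000
RUN_DAYS = 3

# Assumption (not stated on the slides): about 6 billion bases, i.e. a
# diploid human genome, which reproduces the quoted 0.25x and 0.4x values.
ASSUMED_GENOME_BASES = 6_000_000_000


def coverage(total_bases: float, genome_bases: float = ASSUMED_GENOME_BASES) -> float:
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_bases


def bases_per_day(total_bases: float, days: float) -> float:
    """Sequencing throughput in bases per day."""
    return total_bases / days


if __name__ == "__main__":
    print(f"Illumina/Solexa GA: {coverage(ILLUMINA_BASES_PER_RUN):.2f}x per run")
    print(f"SOLiD: {coverage(SOLID_BASES_PER_RUN):.2f}x per run")
    daily = bases_per_day(RAMP_UP_BASES_PER_RUN, RUN_DAYS)
    print(f"Ramp-up: {daily:,.0f} bases/day, {coverage(daily):.2f}x per day")
```

At 1x coverage, the ramp-up rate of 8 billion bases per day corresponds to roughly one assumed genome per day, which matches the slide's claim.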
24Genomes
Year  Sequence                                 Number of base pairs
1971  First published DNA sequence             12
1977  φX174                                    5,375
1982  λ                                        48,502
1992  Saccharomyces cerevisiae Chromosome III  316,613
1995  Haemophilus influenzae                   1,830,138
1996  Saccharomyces cerevisiae                 12,068,000
1998  Caenorhabditis elegans                   97,000,000
2000  Drosophila melanogaster                  120,000,000
2001  Homo sapiens (draft)                     2,600,000,000
2003  Homo sapiens                             2,850,000,000
25- GenBank doubles every 14 months
(from the National Center for Biotechnology Information)
That is a shorter doubling time than Moore's law (computer power doubling every 20 months!)
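The gap between the two doubling times compounds quickly. The sketch below assumes idealized exponential growth and uses the 14-month and 20-month figures from the slide; the function name is illustrative.

```python
GENBANK_DOUBLING_MONTHS = 14  # from the slide
MOORE_DOUBLING_MONTHS = 20    # from the slide


def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months`, assuming steady exponential doubling."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    for years in (1, 5, 10):
        months = 12 * years
        data = growth_factor(months, GENBANK_DOUBLING_MONTHS)
        compute = growth_factor(months, MOORE_DOUBLING_MONTHS)
        print(f"{years:>2} years: data x{data:,.1f}, computing x{compute:,.1f}, "
              f"ratio {data / compute:,.1f}")
```

Over 140 months, for example, the data grows by a factor of 2^10 = 1024 while computing power grows by 2^7 = 128, so the data outpaces the hardware eightfold.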
26About Sequences ...
ACGT
271000 base pairs
GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTC
GCTTGCGAAA GCATCGAGTACCGCTACAGAGCCAACCCGGTGGACAAAC
TCGAAGTCATTGTGGACCGAA TGAGGCTCAATAACGAGATTAGCGACCT
CGAAGGCCTGCGCAAATATTTCCACTCCTTCC CGGGTGCTCCTGAGTTG
AACCCGCTTAGAGACTCCGAAATCAACGACGACTTCCACCAGT GGGCCC
AGTGTGACCGCCACACTGGACCCCATACCACTTCTTTTTGTTATTCTTAA
ATAT GTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCC
ATTCACGTGATCTCA GCCAGTTGTGGCGCCACACTTTTTTTTCCATAAA
AATCCTCGAGGAAAAGAAAAGAAAAA AATATTTCAGTTATTTAAAGCAT
AAGATGCCAGGTAGATGGAACTTGTGCCGTGCCAGAT TGAATTTTGAAA
GTACAATTGAGGCCTATACACATAGACATTTGCACCTTATACATATAC A
CACAAGACAAAACCAAAAAAAATATGACTCTACAAGAATCTGATAAATTT
GCTACCAAG GCCATTCATGCCGGTGAACATGTGGACGTTCACGGTTCCG
TGATCGAACCCATTTCTTTG TCCACCACTTTCAAACAATCTTCTCCAGC
TAACCCTATCGGTACTTACGAATACTCCAGA TCTCAAAATCCTAACAGA
GAGAACTTGGAAAGAGCAGTTGCCGCTTTAGAGAACGCTCAA TACGGGT
TGGCTTTCTCCTCTGGTTCTGCCACCACCGCCACAATCTTGCAATCGCTT
CCT CAGGGCTCCCATGCGGTCTCTATCGGTGATGTGTACGGTGGTACCC
ACAGATACTTCACC AAAGTCGCCAACGCTCACGGTGTGGAAACCTCCTT
CACTAACGATTTGTTGAACGATCTA CCTCAATTGATAAAGGAAAACACC
AAATTGGTCTGGATCGAAACCCCAACCAACCCAACT
282,000 base pairs
GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTC
GCTTGCGAAA GCATCGAGTACCGCTACAGAGCCAACCCGGTGGACAAAC
TCGAAGTCATTGTGGACCGAA TGAGGCTCAATAACGAGATTAGCGACCT
CGAAGGCCTGCGCAAATATTTCCACTCCTTCC CGGGTGCTCCTGAGTTG
AACCCGCTTAGAGACTCCGAAATCAACGACGACTTCCACCAGT GGGCCC
AGTGTGACCGCCACACTGGACCCCATACCACTTCTTTTTGTTATTCTTAA
ATAT GTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCC
ATTCACGTGATCTCA GCCAGTTGTGGCGCCACACTTTTTTTTCCATAAA
AATCCTCGAGGAAAAGAAAAGAAAAA AATATTTCAGTTATTTAAAGCAT
AAGATGCCAGGTAGATGGAACTTGTGCCGTGCCAGAT TGAATTTTGAAA
GTACAATTGAGGCCTATACACATAGACATTTGCACCTTATACATATAC A
CACAAGACAAAACCAAAAAAAATATGACTCTACAAGAATCTGATAAATTT
GCTACCAAG GCCATTCATGCCGGTGAACATGTGGACGTTCACGGTTCCG
TGATCGAACCCATTTCTTTG TCCACCACTTTCAAACAATCTTCTCCAGC
TAACCCTATCGGTACTTACGAATACTCCAGA TCTCAAAATCCTAACAGA
GAGAACTTGGAAAGAGCAGTTGCCGCTTTAGAGAACGCTCAA TACGGGT
TGGCTTTCTCCTCTGGTTCTGCCACCACCGCCACAATCTTGCAATCGCTT
CCT CAGGGCTCCCATGCGGTCTCTATCGGTGATGTGTACGGTGGTACCC
ACAGATACTTCACC AAAGTCGCCAACGCTCACGGTGTGGAAACCTCCTT
CACTAACGATTTGTTGAACGATCTA CCTCAATTGATAAAGGAAAACACC
AAATTGGTCTGGATCGAAACCCCAACCAACCCAACT
GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCT
CGCTTGCGAAA GCATCGAGTACCGCTACAGAGCCAACCCGGTGGACAAA
CTCGAAGTCATTGTGGACCGAA TGAGGCTCAATAACGAGATTAGCGACC
TCGAAGGCCTGCGCAAATATTTCCACTCCTTCC CGGGTGCTCCTGAGTT
GAACCCGCTTAGAGACTCCGAAATCAACGACGACTTCCACCAGT GGGCC
CAGTGTGACCGCCACACTGGACCCCATACCACTTCTTTTTGTTATTCTTA
AATAT GTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGC
CATTCACGTGATCTCA GCCAGTTGTGGCGCCACACTTTTTTTTCCATAA
AAATCCTCGAGGAAAAGAAAAGAAAAA AATATTTCAGTTATTTAAAGCA
TAAGATGCCAGGTAGATGGAACTTGTGCCGTGCCAGAT TGAATTTTGAA
AGTACAATTGAGGCCTATACACATAGACATTTGCACCTTATACATATAC
ACACAAGACAAAACCAAAAAAAATATGACTCTACAAGAATCTGATAAATT
TGCTACCAAG GCCATTCATGCCGGTGAACATGTGGACGTTCACGGTTCC
GTGATCGAACCCATTTCTTTG TCCACCACTTTCAAACAATCTTCTCCAG
CTAACCCTATCGGTACTTACGAATACTCCAGA TCTCAAAATCCTAACAG
AGAGAACTTGGAAAGAGCAGTTGCCGCTTTAGAGAACGCTCAA TACGGG
TTGGCTTTCTCCTCTGGTTCTGCCACCACCGCCACAATCTTGCAATCGCT
TCCT CAGGGCTCCCATGCGGTCTCTATCGGTGATGTGTACGGTGGTACC
CACAGATACTTCACC AAAGTCGCCAACGCTCACGGTGTGGAAACCTCCT
TCACTAACGATTTGTTGAACGATCTA CCTCAATTGATAAAGGAAAACAC
CAAATTGGTCTGGATCGAAACCCCAACCAACCCAACT
29What about size?
base pairs                              x 2,000    cm of paper
2,000          Small gene               1
5,000          Small virus              2.5
1,000,000      Small bacterial genome   500        5
5,000,000      Large bacterial genome   2,500      25
13,000,000     Yeast genome             6,500      65
180,000,000    Fruit fly genome         90,000     900
3,000,000,000  Human genome             1,500,000  1,500
Printed out, all of the nucleotide sequences at the NCBI would now make a stack 9.5 km high
30Top Ten Challenges for Bioinformatics
Chris Burge, Ewan Birney, Jim Fickett. Genome
Technology, issue No. 17, January, 2002
- Precise, predictive model of transcription initiation and termination: ability to predict where and when transcription will occur in a genome
- Precise, predictive model of RNA splicing/alternative splicing: ability to predict the splicing pattern of any primary transcript in any tissue
- Precise, quantitative models of signal transduction pathways: ability to predict cellular responses to external stimuli
- Determining effective protein-DNA, protein-RNA and protein-protein recognition codes
- Accurate ab initio protein structure prediction
- Rational design of small molecule inhibitors of proteins
- Mechanistic understanding of protein evolution: understanding exactly how new protein functions evolve
- Mechanistic understanding of speciation: molecular details of how speciation occurs
- Continued development of effective gene ontologies: systematic ways to describe the functions of any gene or protein
- Education: development of appropriate bioinformatics curricula for secondary, undergraduate and graduate education
311- Precise, predictive model of transcription initiation and termination: ability to predict where and when transcription will occur in a genome
http://tinyurl.com/2t9c6y
- Understanding the parts list is critical for biologists to plan their experiments and to grasp the context of the biological problem they are working with.
- Understanding how these parts differ between healthy and cancerous cells is also critical.
- Knowing what these parts are is obviously very important.
9- Continued development of effective gene ontologies: systematic ways to describe the functions of any gene or protein
32http://tinyurl.com/2t9c6y
- 1- Precise, predictive model of transcription initiation and termination: ability to predict where and when transcription will occur in a genome
334- Determining effective protein-DNA, protein-RNA and protein-protein recognition codes
- Formalizing data is something bioinformatics people like to do.
- There are hundreds of databases that catalogue protein-protein interactions, as well as protein-RNA, protein-DNA and protein-small molecule interactions.
- Understanding and capturing this information for healthy and cancerous cells is also necessary.
34Christopher Hogue
35Christopher Hogue
36Christopher Hogue and Gary Bader
373- Precise, quantitative models of signal transduction pathways: ability to predict cellular responses to external stimuli
- Pathways are the end product of gene expression: they are the result of complexes coming together, networks of all of the cell's parts and their coordinated orchestration into the expression of a biological state.
- When pathways break down, the cells die or become very sick.
- Cancer can be studied by studying the pathways of the cell gone bad.
- Formalization of pathway data is very complicated, but is being done.
- There are several database projects whose goal is to represent our biological knowledge of pathways.
38Reactome http://reactome.org/
- Reactome aims to develop a curated resource of core pathways and reactions in human biology.
- Understanding these in normal and cancerous cells will provide insights into the biology of cancer.
- Databases like this one are labor intensive and require the input of bioinformaticians and biologists alike.
40Pathways are inter-linked
Signalling pathway
Genetic network
STIMULUS
Metabolic pathway
4110- Education: development of appropriate bioinformatics curricula for secondary, undergraduate and graduate education
- In the last 5 years there have been a number of new programs, new courses and several workshops in bioinformatics offered here in Ontario, in Canada and world-wide.
- There is still a critical need for many bioinformaticians.
- We need to continue supporting many of the existing programs.
4210- Development of appropriate bioinformatics curricula for secondary, undergraduate and graduate education
- In the last 5 years a number of programs, courses and workshops have been established in Ontario, Canada and the world.
- There is still a shortage of skilled bioinformatics people world-wide.
- There is still a need for bioinformatics workshops
- http://bioinformatics.ca
43http://bioinformatics.ca
44New 2008 CBW workshops
http://bioinformatics.ca
- Putting the Web to Work: Tools to Accelerate Life Science Research
- Interpreting Gene Lists from OMICS Studies
- Informatics on High Throughput Sequencing Data
- Systems Network Biology
- Essential Statistics in Biology: Getting the Numbers Right
46Doing Science in a reproducible, predictable,
repeatable, efficient way will require
- Open Source
- Public and private sector
- New business models
- Open Access
- All biomolecular data
- Clinical data
- Scientific publications
- Methods will need to be represented and
delivered in a way that will allow anybody to
reproduce, use and modify.
47Open Access to Data
- DNA sequences in GenBank
- It is now part of the scientific culture and expectations to submit DNA and protein sequences to GenBank. The same is now expected for gene expression, protein structures and protein interactions.
- What motivated people to put their sequences in GenBank was not the good of humankind, but rather the requirements of publishers and funding agencies.
48Open Access has to be mandated
- It is now, by
- CIHR
- Genome Canada
- NIH
- Wellcome/MRC
- OICR
- Also needs buy-in from the universities and all provincial funding agencies
- By the Presidents and Deans of academic and publicly funded research institutions.
49Open Access of Publications: Definition
- An open access publication is a peer reviewed
publication that can be downloaded free online to
any user worldwide.
50How do Grantees Make Pubs Open Access?
- Publish in an Open Access journal or a journal that has delayed Open Access (e.g. within 6 months).
- Publish anywhere, but self-archive. Put the peer-reviewed manuscript in PubMed Central and/or an institutional repository within (e.g.) 6 months of publication.
51http://www.cihr-irsc.gc.ca/e/32005.html
52http://bioinformatics.ca/links_directory/
The Bioinformatics Links Directory features
curated links to molecular resources, tools and
databases.
53Bioinformatics Links Directory
54Things not on the Top 10 list
- Whole Genome Association studies
- Systems biology and data integration
- High throughput genome sequencing and the consequences of that data
- Information technology challenges
- Health informatics
55Rationale - OICR Blueprint
61OiCR
62PowerPoint slides taken from these people
- Fiona Brinkman, SFU
- Mark Boguski, NCBI/NIH
- Jim Ostell, NCBI/NIH
- Andy Baxevanis, NHGRI/NIH
- Christopher Hogue
- Gary Bader, University of Toronto
- Chris Ponting, Oxford University
63http://www.oicr.on.ca/research/ouellette.htm
64Funding provided by the Government of Ontario