Title: Bioinformatics Resources for Computer Science Educators
1Bioinformatics Resources for Computer Science Educators
- Debra T. Burhans, Ph.D.
- Director, Bioinformatics Program
- Canisius College
- burhansd_at_canisius.edu
- Gary R. Skuse, Ph.D.
- Director, Bioinformatics Program
- Rochester Institute of Technology
- gary_at_bioinformatics.rit.edu
- SIGCSE Workshop, March 2004
2Outline
- General resources
- NCBI
- Bioinformatics data and software
- Other useful resources
- Summary
3General Resources
4Comprehensive Web Sites
- NCBI: National Center for Biotechnology Information
- http://www.ncbi.nih.gov/
- Bioinformatics.ca: central bioinformatics site in Canada
- http://www.bioinformatics.ca/
- EBI: European Bioinformatics Institute
- http://www.ebi.ac.uk/
- TIGR: The Institute for Genomic Research
- http://www.tigr.org/
- DDBJ: DNA Data Bank of Japan
- http://www.ddbj.nig.ac.jp/
- Bioinformatics.org: resources in bioinformatics
- http://www.bioinformatics.org/
5Comprehensive Web Sites
- Canadian Bioinformatics Resources
- http://cbrmain.cbr.nrc.ca:8080/cbr/servlet/ListCLAppsServlet?type=8&lang=eng
- CCR at SUNY Buffalo (some public, some member resources)
- http://bioinformatics.ccr.buffalo.edu
- National Library of Medicine
- http://www.nlm.nih.gov/nlmhome.html
- Bio-IT World
- http://www.bio-itworld.com
- National Science Digital Library
- http://www.nsdl.org
- CITIDEL
- http://www.citidel.org/
- Pevsner, Bioinformatics and Functional Genomics, Wiley 2003 (book), ch. 1 URLs overview
- http://www.bioinfbook.org/chapt1.htm
6NCBI
7Entrez
8Just how much information?
- GenBank, the primary gene sequence database at the NCBI (National Center for Biotechnology Information at the National Institutes of Health), Release 135, April 2003, comprises:
- 24,027,936 records
- 31,099,264,455 nucleotides
- 120,000 species
- 114 GB
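As a rough back-of-the-envelope check on these figures, the short Python sketch below derives the average record length and the storage used per nucleotide. It assumes that 1 GB means 10^9 bytes, which the slide does not state, and the function names are illustrative only.

```python
# Figures quoted above for GenBank Release 135 (April 2003).
RECORDS = 24_027_936
NUCLEOTIDES = 31_099_264_455
SIZE_GB = 114


def average_record_length(nucleotides, records):
    """Mean number of nucleotides per GenBank record."""
    return nucleotides / records


def bytes_per_nucleotide(size_gb, nucleotides, bytes_per_gb=10**9):
    """Storage per base. bytes_per_gb=10**9 is an assumption, not a stated fact."""
    return size_gb * bytes_per_gb / nucleotides


if __name__ == "__main__":
    avg = average_record_length(NUCLEOTIDES, RECORDS)
    per_base = bytes_per_nucleotide(SIZE_GB, NUCLEOTIDES)
    print(f"about {avg:.0f} nucleotides per record")
    print(f"about {per_base:.2f} bytes per nucleotide")
```

Under that assumption the average record holds roughly 1,300 nucleotides, and the release uses between 3 and 4 bytes per base, presumably because each record carries annotation as well as raw sequence.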
9Genbank Growth
10Usage Statistics
July 2001
11Bioinformatics Data and Software
12What sort of data?
- Sequence
- DNA/gene (genome)
- Amino acid/protein (proteome)
- Structure
- Protein structure
- Expression
- Gene and protein expression
- Interaction
- Biological pathways (metabolic pathways, metabolome)
- Evolutionary biology
- Molecular phylogenetics
- Disease patterns and inheritance
- OMIM (Online Mendelian Inheritance in Man)
- Biomedical literature (biobibliome)
13Sequence
14Sequence Data - DNA
- DNA is double stranded
- Think of DNA sequence information as a complex language
- We know the alphabet (A, C, G, T)
- We know that subsequences correspond to functional components of the genome
- Genes (average size is 3000 bases)
- Regulatory sequences
- We don't know how to identify all of these components: we don't know all of the words of the language
- Complex code with a 4 letter alphabet
15DNA Sequence
- 1 aaaaaggaag cgttcgccga gatcgcagcg gctgcgccgg ggtatgcgga acgggctcgt
- 61 gtggctgctg caccccgcgc tgcccggcac cttgcgctcc atcctcggcg cccgcccgcc
- 121 gcccgcgaag cgactgtgtg gattcccaaa acagacttac agcacaatga gtaatccggc
- 181 catccagaga atagaagacc aaattgtcaa gtctcctgaa gacaaacggg aataccgtgg
- 241 actagagctg gccaatggca tcaaagtgct tctcatcagc gatcccacca cagacaagtc
- 301 ctcagcggcc ctcgatgtgc acataggttc actgtcagac cctccaaata ttcctggctt
- 361 aagtcatttt tgtgaacata tgctgttttt gggaaccaag aaatatccta aagaaaatga
- 421 atatagccag tttctcagtg aacatgctgg aagttcaaat gcattcacca gtggagaaca
- 481 caccaattat tatttcgatg tttcccatga acacttggaa ggagccctgg acaggtttgc
- 541 gcagtttttc ctgtgcccct tgtttgatgc aagttgtaaa gacagagagg tgaacgctgt
- 601 cgattcagaa catgagaaga atgtgatgaa cgatgcctgg agactcttcc agctggaaaa
- 661 ggctacgggg aaccccaaac accccttcag caaatttggg acaggaaaca aatatactct
- 721 agagactcgg cccaaccaag aaggcatcga cgtaagggaa gagctcttga aatttcactc
- 781 tacgtattat tcgtccaatc tgatggcgat ttgtgtttta ggtcgagaat ccttagacga
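Sequence listings like the one on this slide mix position numbers and spacing with the bases themselves, so a first programming task is to recover the plain sequence. The Python sketch below is one possible way to do that, assuming each line holds an optional bullet, a position number, and blocks of bases; the function names are hypothetical.

```python
def parse_numbered_sequence(text):
    """Join numbered sequence lines into one uppercase string.

    Each line is expected to look like '61 gtggctgctg caccccgcgc ...',
    optionally preceded by a '-' bullet as in the slide transcript.
    """
    bases = []
    for line in text.splitlines():
        # Drop any leading bullet characters and whitespace.
        fields = line.lstrip("- \t").split()
        # Drop the leading position number if there is one.
        if fields and fields[0].isdigit():
            fields = fields[1:]
        bases.extend(fields)
    return "".join(bases).upper()


def base_counts(sequence):
    """Count occurrences of each letter of the DNA alphabet."""
    return {base: sequence.count(base) for base in "ACGT"}


if __name__ == "__main__":
    # First 30 bases of the slide's sequence, followed by a made-up short line.
    example = "- 1 aaaaaggaag cgttcgccga gatcgcagcg\n- 31 gctgcgccgg"
    sequence = parse_numbered_sequence(example)
    print(sequence)
    print(base_counts(sequence))
```

Students can extend the same idea to count codons or to check that the position numbers agree with the number of bases read so far.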
16Reading Frames
From http://www.ebi.ac.uk/help/frames_frame.html
17ORFs
- Open reading frames
- A random piece of DNA has 6 different reading frames associated with it: 3 in the forward direction and 3 in the reverse
- Different reading frames produce different amino acid sequences
- ORF finder (NCBI)
- http://www.ncbi.nih.gov/gorf/gorf.html
- Good student exercise! Write an ORF finding program
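One possible starting point for the student exercise is sketched below in Python. It is a minimal sketch under stated assumptions: an ORF is taken to run from an ATG start codon to the first in-frame stop codon (TAA, TAG or TGA), nested start codons are ignored, and coordinates are reported on whichever strand is being read. The NCBI ORF finder may apply different rules, and the function names here are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=1):
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns tuples (strand, frame, start, end, orf_sequence). start and end
    are 0-based, end-exclusive offsets on the strand that was read.
    min_codons counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGCC"):
        print(orf)
```

Raising min_codons filters out the very short ORFs that occur by chance in random sequence, which is a natural follow-up question for students.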
18Sequence Data - Protein
- A protein is a sequence of amino acids
- There are 20 amino acids
- Table to the right lists one and three letter abbreviations with names (http://bioinformatics.org/tutorial/1-3.html)
- Protein sequences are represented using the one letter code
- Web site about biology and alphabets at Brandeis
- http://ocelot.bio.brandeis.edu/pages/classes/InterpGenes/Project/bit8.htm
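The one letter and three letter codes map directly onto a lookup table. The Python sketch below shows one way to hold the 20 standard amino acids in a dictionary and expand a one letter sequence; the function name is illustrative, and ambiguity codes such as X or B are deliberately not handled.

```python
# One letter code -> three letter abbreviation for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(protein):
    """Expand a one letter protein sequence into hyphen-joined three letter codes.

    Raises KeyError for any letter outside the 20 standard amino acids.
    """
    return "-".join(THREE_LETTER[residue] for residue in protein.upper())


if __name__ == "__main__":
    print(to_three_letter("MNISG"))
```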
19Protein Sequence
- >gi|42718017|ref|NP_976037.1| retinoblastoma binding protein 8 isoform b; CTBP-interacting protein; retinoblastoma-interacting myosin-like [Homo sapiens]
MNISGSSCGSPNSADTSSDFKDLWTKLKECHDREV
QGLQVKVTKLKQERILDAQRLEEFFTKNQQLREQQ
KVLHETIKVLEDRLRAGLCDRCAVTEEHMRKKQQEFENIRQQNLKLITEL
MNERNTLQEENKKLSEQLQQ KIENDQQHQAAELECEEDVIPDSPITAFS
FSGVNRLRRKENPHVRYIEQTHTKLEHSVCANEMRKVSKSS
THPQHNPNENEILVADTYDQSQSPMAKAHGTSSYTPDKSSFNLATVVAET
LGLGVQEESETQGPMSPLGD ELYHCLEGNHKKQPFEESTRNTEDSLRFS
DSTSKTPPQEELPTRVSSPVFGATSSIKSGLDLNTSLSPSL
LQPGKKKHLKTLPFSNTCISRLEKTRSKSEDSALFTHHSLGSEVNKIIIQ
SSNKQILINKNISESLGEQN RTEYGKDSNTDKHLEPLKSLGGRTSKRKK
TEEESEHEVSCPQASFDKENAFPFPMDNQFSMNGDCVMDKP
LDLSDRFSAIQRQEKSQGSETSKNKFRQVTLYEALKTIPKGFSSSRKASD
GNCTLPKDSPGEPCSQECII LQPLNKCSPDNKPSLQIKEENAVFKIPLR
PRESLETENVLDDIKSAGSHEPIKIQTRSDHGGCELASVLQ
LNPCRTGKIKSLQNNQDVSFENIQWSIDPGADLSQYKMDVTVIDTKDGSQ
SKLGGETVDMDCTLVSETVL LKMKKQEQKGEKSSNEERKMNDSLEDMFD
RTTHEEYESCLADSFSQAADEEEELSTATKKLHTHGDKQDK
VKQKAFVEPYFKGDESIMQICQQKKEKRNWLPAQDTDSATFHPTHQRIFG
KLVFLPLRLVWKEVILRKIL ILVLVQKDVSLTTQYFLQKARSRRHRR
20Software/Computing Tools
- Sequence alignment (Paul's talk)
- BLAST (NCBI)
- http://www.ncbi.nlm.nih.gov/BLAST/
- FASTA
- http://fasta.bioch.virginia.edu/
- ClustalW (EBI): multiple alignment
- http://www.ebi.ac.uk/clustalw/
- EBI sequence analysis tools
- http://www.ebi.ac.uk/Tools/sequence.html
- Gene Boy
- http://www.dnai.org/geneboy/index.html
21Software/Computing Tools
- Gene Finding/Gene structure
- GLIMMER (primarily bacteria and archaea)
- http://www.tigr.org/salzberg/glimmer.html
- Database searching, profile building for protein sequence analysis
- HMMER (Hidden Markov Models)
- http://hmmer.wustl.edu/
22Data and Formats
- Data storage and formatting
- FASTA format (raw sequence)
- GenBank records (e.g. fly database)
- Can display in many different formats
- XML (e.g. retinoblastoma binding protein)
- There are many good problems in parsing and database design that can be illustrated using this data
- Pevsner Chapter 2: sequence data URLs overview
- http://www.bioinfbook.org/chapt2.htm
- NCBI FTP site: includes data repository and tools
- http://www.ncbi.nlm.nih.gov/Ftp/index.html
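As an example of the parsing problems mentioned on this slide, the Python sketch below reads FASTA-formatted text, in which each record is a header line beginning with ">" followed by one or more lines of sequence. It is a minimal sketch rather than a complete parser: the function name is hypothetical, and it does not validate the sequence alphabet.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    demo = ">seq1 demo record\nACGT\nTTAA\n>seq2\nGG\n"
    for name, sequence in parse_fasta(demo):
        print(name, len(sequence))
```

The same records could then be loaded into a database table, which connects the parsing exercise to the database design problems the slide refers to.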
23Structure
24Protein Representation
- Proteins have structure at different levels
- Primary (sequence)
- Secondary (local folding)
- Tertiary (global folding)
- Quaternary (interactions)
- Protein Structure viewing tools
- CN3D
- ftp://ftp.ncbi.nih.gov/cn3d/
- Rasmol
- http://www.chemistry.wustl.edu/edudev/rasdir.html
- Protein Explorer
- http://molvis.sdsc.edu/protexpl/frntdoor.htm
25Protein Structure Prediction
- This is a critical problem that attracts the efforts of many laboratories around the world
- Protein structures can be studied directly using x-ray crystallography
- There are many more protein sequences than known structures for them
- See Paul Craig's slides from the RIT workshop on Predicting and Visualizing Protein Structure for more information
- Protein data is available in a variety of formats including flat files whose data can be input to a 3-d modeling program
26Expression
27Expression Data
- The context of a cell (e.g. tissue type, stage of growth of an organism, etc.) determines its pattern of gene and protein expression
- Expression patterns may be measured using microarrays
- Each spot on a microarray attracts and binds particular sequences
- The amount of sequence bound to a spot can be quantified (though there are problems with this)
- Data is available in a variety of formats, for example
- Spreadsheet
- Image
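To make the spreadsheet format concrete, the Python sketch below reads tab-delimited expression data and computes a log2 ratio for each spot. Everything about the layout is an assumption for illustration: the column names gene, sample and reference are hypothetical, and real microarray exports differ between platforms.

```python
import csv
import io
import math


def log_ratios(tsv_text):
    """Map each gene to log2(sample / reference) from tab-delimited text.

    The column names 'gene', 'sample' and 'reference' are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    ratios = {}
    for row in reader:
        sample = float(row["sample"])
        reference = float(row["reference"])
        # Skip spots with non-positive intensity, where the log is undefined.
        if sample > 0 and reference > 0:
            ratios[row["gene"]] = math.log2(sample / reference)
    return ratios


if __name__ == "__main__":
    demo = "gene\tsample\treference\ngeneA\t200\t100\ngeneB\t50\t100\n"
    print(log_ratios(demo))
```

A log2 ratio of 1 means the spot is twice as bright in the sample as in the reference, and a ratio of -1 means half as bright, which keeps increases and decreases symmetric.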
28Microarray Data
29Microarray Data
30Microarray Data in Spreadsheet
31Resources for Expression Data - 1
- NCBI Gene Expression Omnibus
- http://www.ncbi.nlm.nih.gov/geo/
- Microarray data and tools at EBI
- http://www.ebi.ac.uk/microarray/
- Stanford Microarray database
- http://genome-www5.stanford.edu/
- Affymetrix
- http://www.affymetrix.com/index.affx
- Wake Forest Gene Expression Technology Group Links
- http://www.wfubmc.edu/physpharm/genetech/genetechlinks.html
32Resources for Expression Data - 2
- Gene Expression Page at EBI
- http://industry.ebi.ac.uk/alan/MicroArray/
- Microarray links from U Berlin
- http://www.bioinf.mdc-berlin.de/schober/ArrayLinks.htm
- Rockefeller University Gene Array Resources
- http://www.rockefeller.edu/genearray/software
33Interaction
34Biological Pathways
- Determine genes that are expressed together
- Determine how different proteins interact in
complex metabolic pathways
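One simple way to look for genes that are expressed together is to correlate their expression profiles across conditions. The Python sketch below uses the Pearson correlation coefficient for this; the choice of Pearson correlation, the 0.9 threshold, and the function names are assumptions made for illustration, not a method prescribed by the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above the threshold."""
    names = sorted(profiles)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if pearson(profiles[first], profiles[second]) >= threshold:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    # Made-up expression levels for three genes over four conditions.
    demo = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
    print(coexpressed_pairs(demo))
```

Comparing every pair of genes takes time quadratic in the number of genes, which is a useful discussion point when the data set holds thousands of genes.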
35Pathways Resources
- BIND Database/BluePrint
- http://www.blueprint.org/bind/bind.php
- Comprehensive Web site with links to pathways resources
- http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html
36Evolutionary Biology
37Comparative Genomics
- Comparative Genomics is the analysis of molecular data from multiple species.
- Biological applications: the field of systematics and the tools of molecular biology have combined to form molecular phylogenetics.
- Biologists use molecular phylogenetics to reconstruct evolutionary trees based on DNA or protein sequences.
- Comprehensive page at Penn State with links
- http://posnania.biotec.psu.edu/tools/resources.html#phylogeny
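Distance-based tree reconstruction starts from a table of pairwise distances between sequences. The Python sketch below computes the simplest such measure, the proportion of differing positions between two aligned sequences of equal length. It assumes the sequences are already aligned, uses made-up example data, and leaves out the tree-building step itself (for example UPGMA or neighbor joining); the function names are hypothetical.

```python
def p_distance(a, b):
    """Proportion of positions at which two aligned sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def distance_matrix(sequences):
    """Return {(name1, name2): distance} for every ordered pair of names."""
    names = list(sequences)
    return {
        (first, second): p_distance(sequences[first], sequences[second])
        for first in names
        for second in names
    }


if __name__ == "__main__":
    # Made-up aligned fragments, for illustration only.
    demo = {"speciesA": "ACGTACGT", "speciesB": "ACGTACGA", "speciesC": "TCGTTCGA"}
    for pair, distance in sorted(distance_matrix(demo).items()):
        print(pair, distance)
```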
39Disease Patterns and Inheritance
40The link to medicine
- By understanding the genetic code we gain an understanding of disease
- By understanding how we are related to other organisms we can understand better how model organisms relate to us and which model organisms might be appropriate to use when studying disease
- If we know exactly what has caused disease (mutations) we might be able to fix it (gene therapy)
- Subtyping of diseases based on expression patterns has already improved disease treatment for leukemia
- This is a thriving and important area of bioinformatics research
41Your genome and health
- Richard A. Young from the Whitehead Institute imagines a health-care system in which, shortly after a baby is born, doctors take a tiny piece of tissue and test its genes to predict the baby's medical future. (Boston Globe, February 17, 2004, Carlene Hempel)
- OMIM: Online Mendelian Inheritance in Man, database maintained at Johns Hopkins University and available at NCBI
- http://www.ncbi.nih.gov/entrez/query.fcgi?db=OMIM
- NCBI Genes and Disease: link off main page (right hand side)
- http://www.ncbi.nih.gov
42Biomedical Literature
43Literature Resources
- Electronic repository of biomedical journals, including abstracts and (for many articles) full text
- Available through PubMed at NCBI
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
- The MedLine database currently comprises nearly 500 files of 30,000 lines each (baseline data approx. 40 GB)
- Can be downloaded from the National Library of Medicine (NLM)
- Database resource page at the NLM
- http://www.nlm.nih.gov/databases/databases.html
- NCBI Bookshelf: on-line access and downloadable full text
- http://www.ncbi.nih.gov/entrez/query.fcgi?db=Books
44Other Useful Resources
45Ethical, Legal and Social Implications (ELSI)
- ELSI page: Human Genome Project (DOE)
- http://www.ornl.gov/sci/techresources/Human_Genome/elsi/elsi.shtml
- ELSI Institute, Dartmouth College
- http://www.dartmouth.edu/ethics/programs.html
46Ontology Resources
- Ontologies: terminologies arranged hierarchically
- Allow for standardization of terms
- Gene Ontology Consortium
- http://www.geneontology.org/
- Open Biological Ontologies
- http://obo.sourceforge.net/
- UMLS at the National Library of Medicine
- http://www.nlm.nih.gov/research/umls/umlsmain.html
- Robert Stevens, U Manchester: Ontology Page
- http://www.cs.man.ac.uk/stevensr/ontology.html
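A hierarchy of terms can be stored as a mapping from each term to its more general parent terms, which lets a program answer questions such as "what are all the broader categories of this term?". The Python sketch below is a toy illustration of that data structure. The terms and links are invented for the example and are not taken from the Gene Ontology or any other real ontology, and the function name is hypothetical.

```python
# Toy hierarchy: each term maps to its more general parent terms.
# These terms are made up for illustration; they are not real ontology data.
PARENTS = {
    "protein kinase": ["kinase"],
    "kinase": ["enzyme"],
    "enzyme": ["molecular function"],
    "binding protein": ["molecular function"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("protein kinase", PARENTS)))
```

Because each term maps to a list of parents, the same code also works when a term has more than one parent, so the hierarchy need not be a strict tree.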
47Programming Language Resources
- Perl is very popular
- Perl, Python and Java have special modules for bioinformatics
- Active State: Perl, Python and other languages
- http://www.activestate.com/
- CPAN Perl Archive
- http://www.cpan.org/
- BioPerl
- http://www.bioperl.org/
- BioPython
- http://www.biopython.org/
- BioJava
- http://www.biojava.org/
48Educational Resources
- Human Genome Project at DOE
- http://www.doegenomes.org/
- NCBI Education Site
- http://www.ncbi.nih.gov/Education/index.html
- Geospiza
- http://www.geospiza.com/outreach/
- EBI 2can
- http://www.ebi.ac.uk/2can/home.html
- Dolan DNA Learning Center
- http://www.dnalc.org
49Academic Programs
- Bio-IT World overview
- http://www.bio-itworld.com/careers/biotrain/index.html
- Check this site out. If you find omissions or errors, email them to Bio-IT World: help to create a comprehensive list of bioinformatics programs that is complete and correct
50Summary
- Enormous amount of data
- Multitude of formats
- An important problem is translation among formats
- Proprietary vs. open source
- Communities of biologists are banding together to create important web-based resources
- There are a large number of resources on the WWW
- There are research issues involved for computer scientists and biologists
51Lincoln Stein on Bioinformatics
- Lincoln Stein's keynote at the O'Reilly Bioinformatics Technology Conference was provocatively titled "Bioinformatics: Gone in 2012." Despite the title, Stein is optimistic about the future for people doing bioinformatics. But he explained that "the field of bioinformatics will be gone by 2012. The field will be doing the same thing but it won't be considered a field."