Title: Computers and Programming for Biologists
1Computers and Programming for Biologists
2What is Bioinformatics?
- The use of information technology to collect,
analyze, and interpret biological data.
- An ad hoc collection of computing tools that are
used by molecular biologists to manage research
data.
- Computational algorithms
- Database schema
- Statistical methods
- Data visualization tools
3The Human Genome Project
4A Genome Revolution in Biology and Medicine
- We are in the midst of a "Golden Era" of biology
- The Human Genome Project has produced a huge
storehouse of data that will be used to change
every aspect of biological research and medicine
- The revolution is about treating biology as an
information science, not about specific
biochemical technologies.
5The job of the biologist is changing
As more biological information becomes available
and laboratory equipment becomes more automated
...
- The biologist will spend more time using
computers
- on experimental design and data analysis (and
less time doing tedious lab biochemistry)
- Biology will become a more quantitative science
(think how the periodic table affected chemistry)
6What are the Tools?
- Alignment
- Similarity string matching
- Pattern search
- Hash tables and substitution matrices
- Clustering
- Genome assembly and annotation
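The "hash tables" item can be illustrated with a minimal Python sketch, assuming a simple word (k-mer) index: every k-letter word of one sequence is stored in a dictionary, so that words of a query can be looked up directly instead of scanning the whole sequence. The function names and the word size k=4 are illustrative assumptions, not taken from any particular search tool.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map each k-letter word in `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


def find_seed_hits(query, index, k):
    """Return (query_pos, subject_pos) pairs for every shared k-letter word."""
    hits = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        # .get() avoids adding empty entries to the defaultdict
        for s_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


subject = "GATGCCATAGAGCTGTAGTCGTACCCT"
index = build_kmer_index(subject, k=4)
hits = find_seed_hits("TAGTCG", index, k=4)
print(hits)
```

Each hit is a short exact match that a similarity search could then try to extend into a longer alignment.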
7Align by hand
- GATGCCATAGAGCTGTAGTCGTACCCT <
- > CTAGAGAGC-GTAGTCAGAGTGTCTTTGAGTTCC
Somebody should make a computer program for this
kind of thing
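As one way a program might do this, here is a minimal Python sketch of global alignment by dynamic programming (the Needleman-Wunsch approach). The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and the gap character in the second slide sequence is removed before aligning.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an end-to-end alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per letter
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two letters
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one best alignment
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


first = "GATGCCATAGAGCTGTAGTCGTACCCT"
second = "CTAGAGAGC-GTAGTCAGAGTGTCTTTGAGTTCC".replace("-", "")
total, aligned_first, aligned_second = global_align(first, second)
print(total)
print(aligned_first)
print(aligned_second)
```

Real tools use substitution matrices and separate gap-opening and gap-extension penalties; the flat scores here only keep the sketch short.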
8Global vs. Local Alignments
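A global alignment pairs two sequences end to end, while a local alignment reports only the best-matching region. The following is a minimal Python sketch of the local case (the Smith-Waterman approach), assuming illustrative scores of match +2, mismatch -1, gap -2; scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new alignment
                              score[i - 1][j - 1] + sub,  # align two letters
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
            if score[i][j] > best:
                best, best_i, best_j = score[i][j], i, j
    # Trace back from the best cell until the score falls to zero
    top, bottom = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical toy sequences: only the shared "ACG" core aligns well
print(local_align("TTTACGTTT", "GGACGGG"))
```

Local alignment is the natural choice when only part of each sequence is related, for example a shared domain inside two otherwise different proteins.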
9BLAST Algorithm
10 >ZFISH9GNL-TI fi72b02.y1
Length = 724
Score = 307 bits (786), Expect = 8e-82
Identities = 145/200 (72%), Positives = 166/200 (82%), Gaps = 1/200 (0%)
Frame = +3

Query  45   VLLKEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK-DGEKGQYTHK  103
            +L+KE+R++LPVSV+EYQVGQLYSVAEASKN              PYEK DGEKGQYTHK
Sbjct  123  MLIKEFRIVLPVSVEEYQVGQLYSVAEASKNETGGGDGVEVLKNEPYEKEDGEKGQYTHK  302

Query  104  IYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITNEYMKEDFLIKIETWHKPDLG  163
            IY LQSKVP+FVR+LAP  AL IHEKAWNAYPYCRTV+TNEYMK++FLI IETWHKPDLG
Sbjct  303  IYRLQSKVPSFVRLLAPSSALIIHEKAWNAYPYCRTVLTNEYMKDNFLIMIETWHKPDLG  482

Query  164  TQENVHKLEPEAWKHVEAVYIDIADRSQVLSKDYKAEEDPAKFKSIKTGRGPLGPNWKQE  223
             QENVH L+ E WK VE ++IDIADRSQV +KDYK +EDPA FKS KTGRGPLGP+WK+E
Sbjct  483  EQENVHNLDSERWKQVEVIHIDIADRSQVDTKDYKPDEDPATFKSQKTGRGPLGPDWKKE  662

Query  224  LVNQKDCPYMCAYKLVTVKF  243
            L  ++DCP+MCAYK VTV F
Sbjct  663  LPQKRDCPHMCAYKXVTVNF  722
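The "Identities" figure in a report like this is a count of alignment columns where query and subject carry the same residue. A minimal Python sketch of that count, applied to the last block of the alignment above (Query 224-243 against Sbjct 663-722); the function name is an illustrative assumption.

```python
def count_identities(aligned_query, aligned_subject):
    """Return (identical_columns, total_columns) for two aligned strings."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"  # a gap never counts as an identity
    )
    return identical, len(aligned_query)


query_block = "LVNQKDCPYMCAYKLVTVKF"
sbjct_block = "LPQKRDCPHMCAYKXVTVNF"
identical, columns = count_identities(query_block, sbjct_block)
print(f"{identical}/{columns} identical ({100 * identical / columns:.0f}%)")
```

The "Positives" figure additionally counts columns whose two residues score above zero in the substitution matrix, which needs a matrix lookup rather than a simple equality test.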
12Clustering (Phylogenetics)
13Genome Assembly
14Raw Genome Data
15UCSC
16The Challenge of New Data Types
- Gene expression microarrays
- thousands of genes, imprecise measurements
- huge images, private file formats
- Proteomics
- high-throughput Mass Spec
- protein chips, protein-protein interactions
- Genotyping
- thousands of alleles, thousands of individuals
17cDNA spotted microarrays
19High-Throughput Genotyping
20Bioinformatics: Beyond Using Websites
- You can do a lot of sophisticated bioinformatics
using public websites
- But at some point you may be faced with a LOT of
data
- thousands of searches, annotations, etc.
- The only solution is to have your own
bioinformatics computer, database, and custom
programs.
- This needs more processor power and more hard drive
space than a typical desktop personal computer
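As an illustration of what a small "custom program" for bulk data might look like, here is a minimal Python sketch that reads many sequences in FASTA format and reports the length and GC fraction of each. The record names and the second sequence are hypothetical examples, and the data is read from an in-memory string so the sketch runs without any files.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize(handle):
    """Return a list of (name, length, GC fraction rounded to 3 places)."""
    rows = []
    for name, seq in read_fasta(handle):
        gc = sum(seq.upper().count(base) for base in "GC")
        rows.append((name, len(seq), round(gc / len(seq), 3) if seq else 0.0))
    return rows


# Hypothetical input; a real run would open a file instead of a string
example = io.StringIO(
    ">seq1 example record\n"
    "GATGCCATAGAGCTGTAGTCGTACCCT\n"
    ">seq2\n"
    "ACGT\n"
    "ACGT\n"
)
rows = summarize(example)
for row in rows:
    print(*row, sep="\t")
```

The same loop works unchanged for thousands of records, which is the point at which a website form stops being practical.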
23Bioinformatics Requires Powerful Computers
- One definition of bioinformatics is "the use of
computers to analyze biological problems."
- As biological data sets have grown larger and
biological problems have become more complex, the
requirements for computing power have also grown.
- Computers that can provide this power generally
use the Unix operating system, so you must learn
Unix to be a computational biologist
24Stable and Efficient
- Unix is very stable
- computers running Unix almost never crash
- Unix is very efficient
- it gets maximum number crunching power out of
your processor (and multiple processors)
- it can smoothly manage extremely large amounts of
data
- it can give a new life to otherwise obsolete Macs
and PCs
- Most new bioinformatics software is created for
Unix
- it's easy for the programmers
25Open Source Bioinformatics
- Almost all of the bioinformatics software that
you need to do complex analyses is free for Unix
computers
- The Open Source software ethic is very strong
among biologists
- Bioinformatics.org
- Bioperl.org
- Open-bio.org
- New algorithms generally appear first as free
software (a publication requirement)
26Free Software
- Linux operating system, MySQL database
- Perl - programming language
- Blast and Fasta - similarity search
- Clustal - multiple alignment
- Phylip - phylogenetics
- Phred/Phrap/Consed - sequence assembly and SNP
detection
- EMBOSS - a complete sequence analysis package
created by the EMBL (like GCG)
27Computer Hardware is not Free
- However, you can build a powerful Linux cluster
for $20-50K (depending on how much power you
need)
- The real cost is for a person to manage the
machines, install the software, and train
scientists to use it.
- Small schools can join together or affiliate with
a larger neighbor.
28Do Biologists have to become Programmers?
- No, but it can give you a big advantage.
- More and more of biology is becoming
computer-aided design of experiments, automated equipment,
and computational analysis of the results.
- I just want to say one word to you ...
Databases
29Why teach bioinformatics in undergraduate
education?
- Demand for trained graduates from the biomedical
industry
- Bioinformatics is essential to understand current
developments in all fields of biology
- We need to educate an entire new generation of
scientists, health care workers, etc.
- Use bioinformatics to enhance the teaching of
other subjects: genetics, evolution, biochemistry
30Genomics in Medical Education
- "The explosion of information about the new
genetics will create a huge problem in health
education. Most physicians in practice have had
not a single hour of education in genetics and
are going to be severely challenged to pick up
this new technology and run with it."
- Francis Collins
31Becoming a Unix Power User
- Learn more Unix commands
- Use the shell to execute simple programs
- Write scripts - automate repetitive tasks
- Download and install the latest bioinformatics
software
- Drive your system manager crazy or get your own
Unix machine
- (Linux on an Intel machine or Mac OS X)
32BioPerl
- Why re-invent the wheel?
- Lots of common bioinformatics tasks have already
been programmed as modules in Perl.
- Grab sequences from GenBank, extract e-values and
annotation from Blast results, etc.
- Download from www.bioperl.org
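To show the kind of task such modules take care of, here is a minimal sketch of pulling scores and e-values out of plain-text BLAST output. It is written in Python with only the standard library, not with BioPerl, and it assumes the classic "Score = ... bits (...), Expect = ..." line layout; the second hit in the example text is hypothetical. A dedicated parser from a maintained toolkit is likely to be more robust than this regular expression.

```python
import re

# Assumed layout of the score line in text-format BLAST reports
SCORE_LINE = re.compile(
    r"Score\s*=\s*([\d.]+)\s+bits\s+\((\d+)\),\s+"
    r"Expect(?:\(\d+\))?\s*=\s*([\d.eE+-]+)"
)


def extract_scores(report_text):
    """Return a list of (bit_score, raw_score, e_value) tuples."""
    results = []
    for bits, raw, expect in SCORE_LINE.findall(report_text):
        # Some reports abbreviate 1e-100 as "e-100"
        if expect.lower().startswith("e"):
            expect = "1" + expect
        results.append((float(bits), int(raw), float(expect)))
    return results


report = """
 Score = 307 bits (786), Expect = 8e-82
 Identities = 145/200 (72%), Positives = 166/200 (82%), Gaps = 1/200 (0%)

 Score = 42.4 bits (98), Expect = 0.003
"""
scores = extract_scores(report)
for bits, raw, e_value in scores:
    print(bits, raw, e_value)
```

A filter such as `[s for s in scores if s[2] < 1e-10]` would then keep only the strongest hits.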
33Resources
- Notes for Lincoln Stein's course on
Genome Informatics
- http://stein.cshl.org/genome_informatics/index.html
- BioPerl.org http://bio.perl.org/
- PERL for biologists (Kurt Stüber)
- http://caliban.mpiz-koeln.mpg.de/stueber/perl/
- Why Biologists Want to Program Computers
- by James Tisdall http://www.oreilly.com/news/perlbio_1001.html
34Resources for Bio-Computing
35Stuart M. Brown, Ph.D.
stuart.brown_at_med.nyu.edu
www.med.nyu/rcr
Bioinformatics: A Biologist's Guide to
Biocomputing and the Internet
Essentials of Medical Genomics