Title: CS 177 Introduction to Bioinformatics Fall 2005
1. CS 177 Introduction to Bioinformatics, Fall 2005
- Instructor: Anna Panchenko (hcnap2003_at_yahoo.com)
- Instructor: Tom Madej (tom_ncbi_at_yahoo.com)
2. Lecture 1: Introduction
- Instructors
- Course goals
- Grading policy
- Motivating problem
- Course overview
- Molecular basis of cellular processes
- Historical timeline
3. Course Goals
- The student will be introduced to the fundamental
problems and methods of bioinformatics.
- The student will become thoroughly familiar with
on-line public bioinformatics databases and their
available software tools.
- The student will acquire a background knowledge
of biological systems so as to be able to
interpret the results of database searches, etc.
- The student will also acquire a general
understanding of how important bioinformatics
algorithms/software tools work, and how the
databases are organized.
4. Grading Policy
- Homework: 50% (weekly assignments)
- Mid-term exam: 20%
- Final exam: 30%
All examinations, papers, and other graded work
products and assignments are to be completed in
conformance with The George Washington
University Code of Academic Integrity.
5. Important!
- Please get computer accounts for Tompkins 405 by
filling out a form in the TA room on the 4th
floor of Tompkins.
- Office hours: AP is available before class, TM is
available after class. If you want to see AP or
TM before class, please ask in advance.
- We will also accept questions by email, although
we may not be able to reply immediately.
6. Homework
- Homework assignments are due by the start of the
next class period (3:30 pm Monday).
- For an assignment turned in up to one week late:
20% penalty.
- Homework more than one week late: no credit!
- Assignments/exams are to be done individually; no
copying of assignments is allowed!
8. NCBI Books
- NCBI home page: http://www.ncbi.nlm.nih.gov
- Follow the Books link.
- 45 books available (currently).
- Many specialty topics.
- Also useful general references.
- Searchable!
- Exercise: search the books with "phylogenetic
tree".
9. What is Bioinformatics?
- A merger of biology, computer science, and
information technology.
- Enables the discovery of new biological insights
and unifying principles.
- Born from necessity, because of the massive
amount of information required to describe
biological organisms and processes.
10. Severe Acute Respiratory Syndrome (SARS)
- SARS is a respiratory illness caused by a
previously unrecognized coronavirus that first
appeared in Southern China in Nov. 2002.
- Between Nov. 2002 and July 2003, there were 8,098
cases worldwide and 774 fatalities (WHO).
- The global outbreak was over by late July 2003.
A few new cases have arisen sporadically since
then in China.
- There is currently no vaccine or cure available.
14. Fig. 2 from Rota et al.
15. Phylogenetic analysis of coronavirus proteins
Fig. 2 from Rota et al.
17. Conserved motifs in coronavirus S proteins
Fig. 2 from Rota et al.
18. Exercise!
- Look up the SARS genome on the NCBI website:
www.ncbi.nlm.nih.gov
- Notice that you get 2 hits on the Genome
database!
19. The (ever expanding) Entrez System
20. NCBI Databases
- Databases are indexed for quick and efficient
searching.
- Databases are cross-linked to each other.
21. Exercise!
- Search the Entrez Protein database with the
keyword "interleukin".
- Follow the link, then look at the different
report formats.
- Also try a search of Protein with "interleukin
AND human[orgn]".
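The Protein search in this exercise can also be issued programmatically. The sketch below only builds the request URL for NCBI's E-utilities ESearch service; E-utilities is not mentioned on the slides, so treat it as an assumed counterpart to the web search form. The helper name `build_esearch_url` and the `retmax` default are illustrative choices. The query uses the Entrez organism field tag written in bracket syntax, `human[orgn]`.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint (assumed here as the
# programmatic equivalent of the Entrez web search used in the exercise).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for the given Entrez database and query term.

    build_esearch_url is a hypothetical helper; db, term and retmax are
    standard ESearch parameters.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# The two searches from the exercise.
keyword_url = build_esearch_url("protein", "interleukin")
human_url = build_esearch_url("protein", "interleukin AND human[orgn]")

print(keyword_url)
print(human_url)
```

Opening either URL in a browser should return an XML list of matching record identifiers; the code itself makes no network request.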
22. Course Overview
- Lecture 1: Introduction
  - Instructors
  - Grading policy
  - Motivating problem
  - Course overview
  - Molecular basis of cellular processes
  - Historical timeline
23. Course Overview (cont.)
- Lecture 2: General principles of DNA/RNA
structure and stability
  - Physico-chemical properties of nucleic acids
  - RNA folding and structure prediction
  - Gene identification
  - Genome analysis
- Lecture 3: General principles of protein
structure and stability
  - Physico-chemical properties of proteins
  - Prediction of protein secondary structure
  - Protein domains and prediction of domain
  boundaries
  - Protein structure-function relationships
24. Course Overview (cont.)
- Lecture 4: Sequence alignment algorithms
  - The alignment problem
  - Pairwise sequence alignment algorithms
  - Multiple sequence alignment algorithms
  - Sequence profiles and profile alignment methods
  - Alignment statistics
25. Course Overview (cont.)
- Lecture 5: Computational aspects of protein
structure, part I
  - Protein folding problem
  - Problem of protein structure prediction
  - Homology modeling
  - Protein design
  - Prediction of functionally important sites
- Lecture 6: Computational aspects of protein
structure, part II
  - Structure-structure alignment algorithms
  - Significance of structure-structure similarity
  - Protein structure classification
26. Course Overview (cont.)
- Lecture 7: Bioinformatics databases
  - Sequence and sequence alignment formats, data
  exchange
  - Public sequence databases
  - Sequence retrieval and examples
  - Public protein structure databases
  - Lab exercises
- Lecture 8: Bioinformatics database search tools
  - Sequence database search tools
  - Structure database search tools
  - Assessment of results, ROC analysis
  - Lab exercises
27. Course Overview (cont.)
- Lecture 9: Phylogenetic analysis, part I
  - Molecular basis of evolution
  - Taxonomy and phylogenetics
  - Phylogenetic trees and phylogenetic inference
  - Software tools for phylogenetic analysis
- Lecture 10: Phylogenetic analysis, part II
  - Accuracies and statistical tests of phylogenetic
  trees
  - Genome comparisons
  - Protein structure evolution
28. Course Overview (cont.)
- Lecture 11: Experimental techniques for
macromolecular analysis
  - Sequencing, PCR
  - Protein crystallography
  - Mass spectroscopy
  - Microarrays
  - RNA interference
29. Course Overview (cont.)
- Lecture 12: Systems biology
  - Genomic circuits
  - Modeling complex integrated circuits
  - Protein-protein interaction
  - Metabolic networks
- Lecture 13: Review
30. Molecular Biology Background
- Cells: general structure/organization
- Molecules that make up cells
- Cellular processes: what makes the cell alive
31. Two Cell Organizations
- Prokaryotes: lack a nucleus, have a simpler internal
structure, and are generally much smaller
- Eukaryotes: have a nucleus (containing DNA) and
various organelles
34. Selected organelles
- Nucleus: contains chromosomes/DNA
- Mitochondria: generate energy for the cell,
contain mitochondrial DNA
- Ribosomes: where translation from mRNA to
proteins takes place (protein synthesis machinery)
- Lysosomes: where protein degradation takes place
35. Cells can become specialized
36. Three domains of life
- Prokarya
  - Bacteria
  - Archaea
- Eukarya
  - Eukaryotes
37. Universal phylogenetic tree
Fig. 1 from N.R. Pace, Science 276 (1997) 734-740.
38. Molecules in the cell
- Proteins: catalyze reactions, form structures,
control membrane permeability, cell signaling,
recognize/bind other molecules, control gene
function
- Nucleic acids: DNA and RNA encode information
about proteins
- Lipids: make up biomembranes
- Carbohydrates: energy sources, energy storage,
constituents of nucleic acids and surface
membranes
- Other small molecules: e.g. ATP, water, ions,
etc.
39. Exercise!
- Retrieve a protein structure from the SARS
coronavirus from the NCBI website; you can use
www.ncbi.nlm.nih.gov/Structure/
- Look at the structure for the SARS protease
using Cn3D.
40. The Central Dogma of Molecular Biology
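The central dogma describes the flow of information from DNA to RNA to protein. As a rough illustration, the sketch below transcribes a DNA coding strand into mRNA and translates it with the standard genetic code. It is a deliberate simplification: it assumes the input is the coding strand, starts reading at the first base, and ignores splicing and other processing. The function names and the example sequence are illustrative and are not taken from the slides.

```python
# Standard genetic code (NCBI translation table 1), with codons enumerated
# over the base order T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """DNA coding strand -> mRNA: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein (one-letter codes), reading from the first base
    in steps of three and stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")  # table is keyed by DNA letters
        residue = CODON_TABLE[codon]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


dna = "ATGGCCTGGTAA"          # made-up example sequence
mrna = transcribe(dna)
print(mrna)                   # AUGGCCUGGUAA
print(translate(mrna))        # MAW
```

Here ATG codes for methionine (M), GCC for alanine (A), TGG for tryptophan (W), and TAA is a stop codon.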
43. Timeline
- 1859: Darwin publishes On the Origin of Species.
- 1865: Mendel's experiments with peas show that
hereditary traits are passed on to offspring in
discrete units.
- 1869: Miescher isolates DNA.
- 1895: Röntgen discovers X-rays.
- 1902: Sutton proposes the chromosome theory of
heredity.
44. Timeline (cont.)
- 1911: Morgan and co-workers establish the
chromosome theory of heredity, working with fruit
flies.
- 1943: Astbury observes the first X-ray pattern of
DNA.
- 1944: Avery, MacLeod, and McCarty show that DNA
transmits heritable traits (not proteins!).
- 1951: Pauling and Corey predict the structure of
the alpha-helix and beta-sheet.
45. Timeline (cont.)
- 1953: Watson and Crick propose the double helix
model for DNA based on X-ray data from Franklin
and Wilkins.
- 1955: Sanger announces the sequence of the first
protein to be analyzed, bovine insulin.
- 1955: Kornberg and co-workers isolate the enzyme
DNA polymerase (used for copying DNA, e.g. in
PCR).
- 1958: The first integrated circuit is constructed
by Kilby at Texas Instruments.
46. Timeline (cont.)
- 1960: Perutz and Kendrew obtain the first X-ray
structures of proteins (hemoglobin and
myoglobin).
- 1961: Brenner, Jacob, and Meselson discover that
mRNA transmits the information from the DNA in
the nucleus to the cytoplasm.
- 1965: Dayhoff starts the Atlas of Protein Sequence
and Structure.
- 1966: Nirenberg, Khorana, Ochoa and colleagues
crack the genetic code!
- 1970: The Needleman-Wunsch algorithm for sequence
comparison is published.
47. Timeline (cont.)
- 1972: Dayhoff develops the Protein Sequence
Database (PSD).
- 1972: Berg and colleagues create the first
recombinant DNA molecule.
- 1973: Cohen invents DNA cloning.
- 1975: Sanger and others (Maxam, Gilbert) invent
rapid DNA sequencing methods.
48. Timeline (cont.)
- 1980: The first complete genome sequence for an
organism (bacteriophage ΦX174) is published. The
genome consists of 5,386 bases coding 9 proteins.
- 1981: The Smith-Waterman algorithm for sequence
alignment is published.
- 1981: IBM introduces its Personal Computer to the
market.
- 1982: The GenBank sequence database is created at
Los Alamos National Laboratory.
49. Timeline (cont.)
- 1983: Mullis and co-workers describe the PCR
reaction.
- 1985: The FASTP algorithm is published by Lipman
and Pearson.
- 1986: The SWISS-PROT database is created.
- 1986: The Human Genome Initiative is announced by
DOE.
- 1988: The National Center for Biotechnology
Information (NCBI) is established at the National
Library of Medicine in Bethesda.
50. Timeline (cont.)
- 1992: Human Genome Sciences, in Gaithersburg, MD,
is founded by Haseltine.
- 1992: The Institute for Genomic Research (TIGR) is
established by Venter in Rockville, MD.
- 1995: The Haemophilus influenzae genome is
sequenced (1.8 Mb).
- 1996: Affymetrix produces the first commercial DNA
chips.
51. Timeline (cont.)
- 1988: The FASTA algorithm for sequence comparison
is published by Pearson and Lipman.
- 1990: Official launch of the Human Genome Project.
- 1990: The BLAST program, by Altschul et al., is
published.
- 1991: The CERN research institute in Geneva
announces the creation of the protocols which
make up the World Wide Web.
52. Timeline (cont.)
- 1996: The yeast genome is sequenced, the first
complete eukaryotic genome.
- 1996: Human DNA sequencing begins.
- 1997: The E. coli genome is sequenced (4.6 Mb,
approx. 4k genes).
- 1998: The C. elegans genome is sequenced (97 Mb,
approx. 20k genes), the first genome of a
multicellular organism.
53. Timeline (cont.)
- 1998: Venter founds Celera in Rockville, MD.
- 1998: The Swiss Institute of Bioinformatics is
established in Geneva.
- 1999: The HGP completes the first human chromosome
(no. 22).
- 2000: The Drosophila genome is completed.
54. Timeline (cont.)
- 2000: Human chromosome no. 21 is completed.
- 2001: A draft of the entire human genome (3,000
Mb) is published.
- 2003: The Human Genome is completed! Approx.
30,000 genes (estimated).