CS 177 Introduction to Bioinformatics Fall 2004 - PowerPoint PPT Presentation

1
CS 177 Introduction to Bioinformatics, Fall 2004
  • Instructor Anna Panchenko (panch_at_ncbi.nlm.nih.gov)
  • Instructor Tom Madej (madej_at_ncbi.nlm.nih.gov)
  • Co-Instructor Rahul Simha (simha_at_gwu.edu)

2
  • Lecture 1: Introduction
      • Instructors
      • Course goals
      • Grading policy
      • Motivating problem
      • Course overview
      • Molecular basis of cellular processes
      • Historical timeline

3
Course Goals
  • The student will be introduced to the fundamental
    problems and methods of bioinformatics.
  • The student will become thoroughly familiar with
    on-line public bioinformatics databases and their
    available software tools.
  • The student will acquire a background knowledge
    of biological systems so as to be able to
    interpret the results of database searches, etc.
  • The student will also acquire a general
    understanding of how important bioinformatics
    algorithms/software tools work, and how the
    databases are organized.

4
Grading Policy
  • Homework: 50%, weekly assignments
  • Final exam: 50%

All examinations, papers, and other graded work
products and assignments are to be completed in
conformance with The George Washington
University Code of Academic Integrity.
6
Optional Texts
P.E. Bourne and H. Weissig (2003), Structural
Bioinformatics, Wiley & Sons.
7
What is Bioinformatics?
  • A merger of biology, computer science, and
    information technology.
  • Enables the discovery of new biological insights
    and unifying principles.
  • Born from necessity, because of the massive
    amount of information required to describe
    biological organisms and processes.

8
Severe Acute Respiratory Syndrome (SARS)
  • SARS is a respiratory illness caused by a
    previously unrecognized coronavirus that first
    appeared in southern China in Nov. 2002.
  • Between Nov. 2002 and July 2003, there were 8,098
    cases worldwide and 774 fatalities (WHO).
  • The global outbreak was over by late July 2003.
    A few new cases have arisen sporadically since
    then in China.
  • There is currently no vaccine or cure available.

12
Fig. 2 from Rota et al.
13
Phylogenetic analysis of coronavirus proteins
Fig. 2 from Rota et al.
15
Conserved motifs in coronavirus S proteins.
Fig. 2 from Rota et al.
16
  • Exercise!
  • Look up the SARS genome on the NCBI website
    www.ncbi.nlm.nih.gov

17
The (ever expanding) Entrez System
18
Course Overview
  • Lecture 1: Introduction
      • Instructors
      • Grading policy
      • Motivating problem
      • Course overview
      • Molecular basis of cellular processes
      • Historical timeline

19
  • Lecture 2: General principles of DNA/RNA
    structure and stability
      • Physico-chemical properties of nucleic acids
      • RNA folding and structure prediction
      • Gene identification
      • Genome analysis
  • Lecture 3: General principles of protein
    structure and stability
      • Physico-chemical properties of proteins
      • Prediction of protein secondary structure
      • Protein domains and prediction of domain
        boundaries
      • Protein structure-function relationships

20
  • Lecture 4: Sequence alignment algorithms
      • The alignment problem
      • Pairwise sequence alignment algorithms
      • Multiple sequence alignment algorithms
      • Sequence profiles and profile alignment methods
      • Alignment statistics

21
  • Lecture 5: Computational aspects of protein
    structure, part I
      • Protein folding problem
      • Problem of protein structure prediction
      • Homology modeling
      • Protein design
      • Prediction of functionally important sites
  • Lecture 6: Computational aspects of protein
    structure, part II
      • Structure-structure alignment algorithms
      • Significance of structure-structure similarity
      • Protein structure classification

22
  • Lecture 7: Bioinformatics databases
      • Sequence and sequence alignment formats, data
        exchange
      • Public sequence databases
      • Sequence retrieval and examples
      • Public protein structure databases
      • Lab exercises
  • Lecture 8: Bioinformatics database search tools
      • Sequence database search tools
      • Structure database search tools
      • Assessment of results, ROC analysis
      • Lab exercises

23
  • Lecture 9: Phylogenetic analysis, part I
      • Molecular basis of evolution
      • Taxonomy and phylogenetics
      • Phylogenetic trees and phylogenetic inference
      • Software tools for phylogenetic analysis
  • Lecture 10: Phylogenetic analysis, part II
      • Accuracies and statistical tests of phylogenetic
        trees
      • Genome comparisons
      • Protein structure evolution

24
  • Lecture 11: Experimental techniques for
    macromolecular analysis
      • Sequencing, PCR
      • Protein crystallography
      • Mass spectroscopy
      • Microarrays
      • RNA interference

25
  • Lecture 12: Systems biology
      • Genomic circuits
      • Modeling complex integrated circuits
      • Protein-protein interaction
      • Metabolic networks

Lectures 13, 14: To be decided
26
Molecular Biology Background
  • Cells: general structure/organization
  • Molecules that make up cells
  • Cellular processes: what makes the cell alive

27
Two Cell Organizations
  • Prokaryotes: lack a nucleus, simpler internal
    structure, generally much smaller
  • Eukaryotes: with a nucleus (containing DNA) and
    various organelles

28
CS 177 DNA, RNA, protein overview
31
Selected organelles
  • Nucleus: contains chromosomes/DNA
  • Mitochondria: generate energy for the cell,
    contain mitochondrial DNA
  • Ribosomes: where translation from mRNA to
    proteins takes place (protein synthesis machinery)
  • Lysosomes: where protein degradation takes place

32
Cells can become specialized
33
Three domains of life
  • Prokarya
      • Bacteria
      • Archaea
  • Eukarya
      • Eukaryotes

34
Universal phylogenetic tree.
Fig. 1 from N.R. Pace, Science 276 (1997)
734-740.
35
Molecules in the cell
  • Proteins: catalyze reactions, form structures,
    control membrane permeability, cell signaling,
    recognize/bind other molecules, control gene
    function
  • Nucleic acids: DNA and RNA encode information
    about proteins
  • Lipids: make up biomembranes
  • Carbohydrates: energy sources, energy storage,
    constituents of nucleic acids and surface
    membranes
  • Other small molecules: e.g. ATP, water, ions,
    etc.

36
The Central Dogma of Molecular Biology
39
  • Exercise!
  • Retrieve a protein structure from the SARS
    coronavirus from the NCBI website; you can use
    www.ncbi.nlm.nih.gov/Structure/
  • Look at the structure for the SARS protease
    using Cn3D.

40
Timeline
  • 1859 Darwin publishes On the Origin of Species
  • 1865 Mendel's experiments with peas show that
    hereditary traits are passed on to offspring in
    discrete units.
  • 1869 Miescher isolates DNA.
  • 1895 Röntgen discovers X-rays.
  • 1902 Sutton proposes the chromosome theory of
    heredity.

41
Timeline (cont.)
  • 1911 Morgan and co-workers establish the
    chromosome theory of heredity, working with fruit
    flies.
  • 1943 Astbury observes the first X-ray pattern of
    DNA.
  • 1944 Avery, MacLeod, and McCarty show that DNA
    transmits heritable traits (not proteins!).
  • 1951 Pauling and Corey predict the structure of
    the alpha-helix and beta-sheet.

42
Timeline (cont.)
  • 1953 Watson and Crick propose the double helix
    model for DNA based on X-ray data from Franklin
    and Wilkins.
  • 1955 Sanger announces the sequence of the first
    protein to be analyzed, bovine insulin.
  • 1955 Kornberg and co-workers isolate the enzyme
    DNA polymerase (used for copying DNA, e.g. in
    PCR).
  • 1958 The first integrated circuit is constructed
    by Kilby at Texas Instruments.

43
Timeline (cont.)
  • 1960 Perutz and Kendrew obtain the first X-ray
    structures of proteins (hemoglobin and
    myoglobin).
  • 1961 Brenner, Jacob, and Meselson discover that
    mRNA transmits the information from the DNA in
    the nucleus to the cytoplasm.
  • 1965 Dayhoff starts the Atlas of Protein Sequence
    and Structure.
  • 1966 Nirenberg, Khorana, Ochoa and colleagues
    crack the genetic code!
  • 1970 The Needleman-Wunsch algorithm for sequence
    comparison is published.

44
Timeline (cont.)
  • 1972 Dayhoff develops the Protein Sequence
    Database (PSD).
  • 1972 Berg and colleagues create the first
    recombinant DNA molecule.
  • 1973 Cohen invents DNA cloning.
  • 1975 Sanger and others (Maxam, Gilbert) invent
    rapid DNA sequencing methods.

45
Timeline (cont.)
  • 1980 The first complete gene sequence for an
    organism (bacteriophage ΦX174) is published. The
    genome consists of 5,386 bases coding 9 proteins.
  • 1981 The Smith-Waterman algorithm for sequence
    alignment is published.
  • 1981 IBM introduces its Personal Computer to the
    market.
  • 1982 The GenBank sequence database is created at
    Los Alamos National Laboratory.

46
Timeline (cont.)
  • 1983 Mullis and co-workers describe the PCR
    reaction.
  • 1985 The FASTP algorithm is published by Lipman
    and Pearson.
  • 1986 The SWISS-PROT database is created.
  • 1986 The Human Genome Initiative is announced by
    DOE.
  • 1988 The National Center for Biotechnology
    Information (NCBI) is established at the National
    Library of Medicine in Bethesda.

47
Timeline (cont.)
  • 1992 Human Genome Sciences, in Gaithersburg, MD,
    is founded by Haseltine.
  • 1992 The Institute for Genomic Research (TIGR) is
    established by Venter in Rockville, MD.
  • 1995 The Haemophilus influenzae genome is
    sequenced (1.8 Mb).
  • 1996 Affymetrix produces the first commercial DNA
    chips.

48
Timeline (cont.)
  • 1988 The FASTA algorithm for sequence comparison
    is published by Pearson and Lipman.
  • 1990 Official launch of the Human Genome Project.
  • 1990 The BLAST program by Altschul et al., is
    published.
  • 1991 The CERN research institute in Geneva
    announces the creation of the protocols which
    make up the World Wide Web.

49
Timeline (cont.)
  • 1996 The yeast genome is sequenced, the first
    complete eukaryotic genome.
  • 1996 Human DNA sequencing begins.
  • 1997 The E. coli genome is sequenced (4.6 Mb,
    approx. 4k genes).
  • 1998 The C. elegans genome is sequenced (97 Mb,
    approx. 20k genes), the first genome of a
    multicellular organism.

50
Timeline (cont.)
  • 1998 Venter founds Celera in Rockville, MD.
  • 1998 The Swiss Institute of Bioinformatics is
    established in Geneva.
  • 1999 The HGP completes the first human chromosome
    (no. 22).
  • 2000 The Drosophila genome is completed.

51
Timeline (cont.)
  • 2000 Human chromosome no. 21 is completed.
  • 2001 A draft of the entire human genome (3,000
    Mb) is published.
  • 2003 The Human Genome is completed! Approx.
    30,000 genes (estimated).

52
DNA, RNA, protein overview
  • Questions about the genome in an organism: How
    much DNA, how many nucleotides? How many genes
    are there? What types of proteins appear to be
    coded by these genes?
  • Questions about the proteome: What proteins are
    present? Where are they? When are they present,
    and under what conditions?
DNA   RNA    Mutations Amino acids, protein
structure
53
DNA overview
  • DNA: deoxyribonucleic acid
  • 4 bases: A, T, C, G
  • Purines (C5N4H4): adenine (A), guanine (G)
  • Pyrimidines (C4N2H4): thymine (T), cytosine (C)
  • Nucleoside: base + sugar (deoxyribose)
  • Nucleotide: base + sugar + phosphate
  • Numbering of carbons?
54
Linking nucleotides
  • Hydrogen bonds: N-H···N and N-H···O
  • Linking nucleotides: the 3'-OH of one nucleotide
    is linked to the 5'-phosphate of the next
    nucleotide
  • What next?
[Figure labels: thymine, adenine, cytosine, guanine]
55
Base pairing
  • Base pairing (Watson-Crick): A/T (2 hydrogen
    bonds), G/C (3 hydrogen bonds)
  • Always pairing a purine and a pyrimidine yields a
    constant width
  • DNA base composition: A + G = T + C (Chargaff's
    rule)
[Figure labels: base pairs A-T, C-G, A-T, T-A, C-G]
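A minimal Python sketch of the base-composition rule, assuming strict Watson-Crick pairing in a double strand. The function name duplex_composition is made up for illustration; the example strand is the one used on the DNA conventions slide.

```python
from collections import Counter

# Watson-Crick partners in DNA
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def duplex_composition(strand):
    """Count the bases in a double strand built from one strand and its partner."""
    partner = "".join(COMPLEMENT[base] for base in strand)
    return Counter(strand + partner)

counts = duplex_composition("ATCGCAATCAGCTAGGTT")
# In the duplex, A matches T and G matches C, so purines equal pyrimidines.
purines = counts["A"] + counts["G"]
pyrimidines = counts["T"] + counts["C"]
print(counts, purines == pyrimidines)
```

The equality holds for any input strand here because every base is counted together with its partner; it need not hold for a single strand on its own.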
57
DNA conventions
1. DNA is a right-handed helix
2. The 5' end is to the left by convention

5'-ATCGCAATCAGCTAGGTT-3'   sense (forward)
3'-TAGCGTTAGTCGATCCAA-5'   antisense (reverse)
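The strand convention can be checked with a short Python sketch. The helper names complement and reverse_complement are illustrative, not taken from the course material.

```python
# Translation table for Watson-Crick partners in DNA
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Partner strand, written in the same left-to-right order (so it reads 3'->5')."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Partner strand rewritten 5'->3', the usual way of reporting the antisense strand."""
    return complement(seq)[::-1]

sense = "ATCGCAATCAGCTAGGTT"
print(complement(sense))          # antisense, aligned under the sense strand (3'->5')
print(reverse_complement(sense))  # the same antisense strand, written 5'->3'
```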
58
DNA structure
Some more facts
1. Forces stabilizing DNA structure:
   Watson-Crick H-bonding and base stacking
   (planar aromatic bases overlap geometrically and
   electronically → energy gain)
2. Genomic DNAs are large molecules:
   Escherichia coli: 4.7 x 10^6 bp, 1 mm contour length
   Human: 3.2 x 10^9 bp, 1 m contour length
3. Some DNA molecules (plasmids) are circular and
   have no free ends: mtDNA, bacterial DNA (only
   one circular chromosome)
4. An average gene of 1000 bp can code for an average
   protein of about 330 amino acids
5. Percentage of non-coding DNA varies greatly
   among organisms

Organism        Base pairs       Genes               Non-coding DNA
small virus     4 x 10^3         3                   very little
typical virus   3 x 10^5         200                 very little
bacterium       5 x 10^6         3000                10 - 20%
yeast           1 x 10^7         6000                > 50%
human           3.2 x 10^9       30,000?             99%
amphibians      < 80 x 10^9      ?                   ?
plants          < 900 x 10^9     23,000 - >50,000    > 99%
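The size figures above can be reproduced with a rough Python estimate. It assumes a rise of 0.34 nm per base pair (the usual B-DNA value, not stated on the slide) and 3 bases per codon; the function names are illustrative.

```python
RISE_PER_BP_NM = 0.34  # assumed axial rise per base pair in B-DNA, in nanometres

def contour_length_m(base_pairs):
    """Approximate end-to-end length of a stretched-out DNA molecule, in metres."""
    return base_pairs * RISE_PER_BP_NM * 1e-9

def max_codons(gene_bp):
    """Upper bound on the number of codons that fit in a gene of the given length."""
    return gene_bp // 3

print(contour_length_m(4.7e6) * 1e3, "mm for E. coli")  # on the order of 1 mm
print(contour_length_m(3.2e9), "m for human")           # on the order of 1 m
print(max_codons(1000), "codons in a 1000 bp gene")
```

The codon count is only an upper bound on protein length, since a stop codon and any untranslated bases reduce it slightly.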
59
3 major types of RNA
  • messenger RNA (mRNA): template for protein
    synthesis
  • transfer RNA (tRNA): adaptor molecules that
    decode the genetic code
  • ribosomal RNA (rRNA): catalyzes the synthesis of
    proteins
RNA: ribonucleic acid
4 bases: A, U, C, G
60
Base interactions in RNA
  • Base pairing: U/A/(T) (2 hydrogen bonds), G/C
    (3 hydrogen bonds)
  • RNA base composition: A + G ≠ U + C; Chargaff's
    rule does not apply (RNA usually prevails as a
    single strand)
  • RNA structure:
    - usually single stranded
    - many self-complementary regions → RNA commonly
      exhibits an intricate secondary structure
      (relatively short, double helical segments
      alternated with single stranded regions)
    - complex tertiary interactions fold the RNA into
      its final three-dimensional form
    - the folded RNA molecule is stabilized by
      interactions (e.g. hydrogen bonds and base
      stacking)
61
RNA structure
Primary structure
formed by unpaired nucleotides
Secondary structure
  • double helical RNA (A-form with 11 bp per turn)
  • duplex bridged by a loop of unpaired nucleotides
  • nucleotides not forming Watson-Crick base pairs
  • unpaired nucleotides in one strand, other strand
    has contiguous base pairing
  • three or more duplexes separated by
    single-stranded regions
  • tertiary interaction between bases of hairpin
    loop and outside bases
62
RNA structure
Primary structure, secondary structure, tertiary structure
[Figure: panels labeled A-G]
63
RNA structure
How to predict RNA secondary/tertiary structure?
  • Probing RNA structure experimentally
    - physical methods (single crystal X-ray
      diffraction, electron microscopy)
    - chemical and enzymatic methods
    - mutational analysis (introduction of specific
      mutations to test change in some function or
      protein-RNA interaction)
  • Thermodynamic prediction of RNA structure
    - RNA molecules comply with the laws of
      thermodynamics, therefore it should be possible
      to deduce RNA structure from its sequence by
      finding the conformation with the lowest free
      energy
    - Pros: only one sequence required; no difficult
      experiments; does not rely on alignments
    - Cons: thermodynamic data are experimentally
      determined, but not always accurate; possible
      interactions of RNA with solvent, ions, and
      proteins
  • Comparative determination of RNA structure
    - basic assumption: the secondary structure of a
      functional RNA will be conserved in the
      evolution of the molecule (at least more
      conserved than the primary structure); when a
      set of homologous sequences has a certain
      structure in common, this structure can be
      deduced by comparing the structures possible
      from their sequences
    - Pros: very powerful in finding secondary
      structure; relatively easy to use; only
      sequences required; not affected by
      interactions of the RNA and other molecules
    - Cons: a large number of sequences to study is
      preferred; structure constraints in fully
      conserved regions cannot be inferred; extremely
      variable regions cause problems with alignment
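To give a feel for prediction from a single sequence, here is a small Python sketch of Nussinov-style base-pair maximization. This is a deliberately simplified stand-in, not the free-energy minimization described above: it counts Watson-Crick pairs instead of summing measured energies. The function name nussinov and the min_loop parameter (minimum number of unpaired bases in a hairpin loop) are assumptions for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}  # Watson-Crick only

def nussinov(seq, min_loop=3):
    """Return (max number of nested base pairs, dot-bracket structure)."""
    n = len(seq)
    dp = [[0] * n for _ in range(n)]  # dp[i][j]: best pair count within seq[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Trace back one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to hold a pair
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (dp[0][n - 1] if n else 0), "".join(structure)

print(nussinov("GGGAAACCC"))
```

Production tools score stacking and loop energies rather than counting pairs, which is where the "thermodynamic data not always accurate" caveat comes in.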
64
Amino acids/proteins
The central dogma of modern biology: DNA → RNA → protein
Getting from DNA to protein: two parts
1. Transcription, in which a short portion of
   chromosomal DNA is used to make an RNA
   molecule small enough to leave the nucleus.
2. Translation, in which the RNA code is
   used to assemble the protein at the
   ribosome.
The genetic code
- The code problem: 4 nucleotides in RNA, but 20
  amino acids in proteins
- Bases are read in groups of 3 (a codon)
- The code consists of 64 codons (4^3 = 64)
- All codons are used in protein synthesis: 20
  amino acids, 3 stop codons
- AUG (methionine) is the start codon (also used
  internally)
- The code is non-overlapping and punctuation-free
- The code is degenerate (but NOT ambiguous):
  each amino acid is specified by at least one
  codon
- The code is universal (virtually all organisms
  use the same code)
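The properties of the code listed above can be checked with a short Python sketch. It builds the standard genetic code from a compact 64-letter string in U, C, A, G order (one-letter amino acid codes, "*" for stop). The translate helper is illustrative and simply reads from the first base; it does not search for a start codon.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Read non-overlapping codons from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

degeneracy = Counter(CODON_TABLE.values())  # number of codons per amino acid
print(len(CODON_TABLE), "codons;", degeneracy["*"], "stop codons")
print(translate("AUGGCCUAA"))  # Met-Ala, then a stop codon
```

Because each codon maps to exactly one entry, the table is unambiguous; because most amino acids appear several times in the string, it is degenerate.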
65
The genetic code
  • Amino acids with a single codon: methionine and
    tryptophan
  • Amino acids with four codons: five (proline,
    threonine, valine, alanine, glycine)
  • Start codon: AUG