Title: Eric C. Rouchka, D.Sc.
1- Eric C. Rouchka, D.Sc.
- Vogt Building, Room 205
- (502) 852-0467
- eric.rouchka_at_uofl.edu
- http://kbrin.a-bldg.louisville.edu/CECS694/
2Course Overview
- Syllabus
- Structure of Class
- Expectations
3Contact Information
- INSTRUCTOR
- Dr. Eric Rouchka
- Phone: 852-0467 or 852-3835
- Email: eric.rouchka_at_louisville.edu
- OFFICE HOURS
- Vogt Building
- Room 205
- T, Th 1:30-3:00 pm
- or by appointment
- http://kbrin.a-bldg.louisville.edu/CECS694/
4Required Texts
- Bioinformatics: Sequence and Genome Analysis.
David Mount. 2001. ISBN 0-87969-608-7.
- Biological Sequence Analysis: Probabilistic
models of proteins and nucleic acids. R. Durbin,
S. Eddy, A. Krogh and G. Mitchison. 1998. ISBN
0-521-62971-3.
- In addition, a number of journal articles will be
handed out in class.
5Required Texts
Image Source: http://www.amazon.com/
6Other Bioinformatics Books
Image Source: http://www.amazon.com/
7Other Reference Books
Image Source: http://www.amazon.com/
8Tentative Schedule of Topics
- Overview of molecular biology
- Pairwise sequence alignment
- Multiple sequence alignment
- Sequence Databases
- Database searching
- Construction of phylogenetic trees
- RNA secondary structure prediction
- Microarray image analysis
- Sequence assembly techniques
- Gene Prediction
- Protein Folding Prediction
9Course Assignments
- 4-5 written homework assignments, 3-4 programming
assignments, a midterm test, a final project,
and bioinformatics seminars.
- Homework assignments must be turned in at the
beginning of class on the date they are due.
Late homework assignments will not be accepted,
since the solutions will be posted to the course
website.
- Programming assignments are due at the beginning
of class on their due date. The programs
may be written in the language of your choice.
Late programming assignments will be accepted,
with a 10% per day deduction for a maximum of two
days.
- Reading assignments from the two selected texts
and journal articles will be assigned.
10Grading
- Programming Projects (3-4): 25% of final grade
- Homework (4-5): 15% of final grade
- Midterm Test: 25% of final grade
- Final Project: 25% of final grade
- One page seminar reports (3): 10% of final grade
- Final grades will be given using a plus/minus
scale. The cutoffs for grades will be roughly as
follows:
- 90-100: A
- 80-89: B
- 70-79: C
- 60-69: D
- 0-59: F
11Class Structure
- Introduction of a Topic
- Description of algorithms
- Available tools
- Make sure to ask questions!
12What is Bioinformatics/ Computational Biology?
- Bioinformatics: collection and storage of
biological information
- Computational biology: development of algorithms
and statistical models to analyze biological data
- The terms Bioinformatics and Computational Biology
will be used interchangeably
13What is Bioinformatics?
Source: http://ccb.wustl.edu/
14Why should I care?
- SmartMoney ranks Bioinformatics as #1 among next
Hot Jobs
- Business Week 50 Masters of Innovation
- Jobs available, exciting research potential
- Important information waiting to be decoded!
http://smartmoney.com/consumer/index.cfm?storyworking-june02
15Why is bioinformatics hot?
- Supply/demand: few people adequately trained in
both biology and computer science
- Genome sequencing, microarrays, etc. lead to large
amounts of data to be analyzed
- Leads to important discoveries
- Saves time and money
16What skills are needed?
- Well-grounded in one of the following areas:
- Computer science
- Molecular biology
- Statistics
- Working knowledge and appreciation of the others!
17Where Can I Learn More?
- ISCB: http://www.iscb.org/
- NCBI: http://ncbi.nlm.nih.gov/
- http://www.bioinformatics.org/
- Journals
- Conferences (ISMB, RECOMB, PSB)
18Overview of Molecular Biology
- Cells
- Chromosomes
- DNA
- RNA
- Amino Acids
- Proteins
- Genome/Transcriptome/Proteome
19Cells
- Complex system enclosed in a membrane
- Organisms are unicellular (bacteria, baker's
yeast) or multicellular
- Humans
- 60 trillion cells
- 320 cell types
Example Animal Cell: www.ebi.ac.uk/microarray/biology_intro.htm
20Organisms
- Classified into two types
- Eukaryotes: contain a membrane-bound nucleus and
organelles (plants, animals, fungi)
- Prokaryotes: lack a true membrane-bound nucleus
and organelles (single-celled, includes bacteria)
- Not all single-celled organisms are prokaryotes!
21Chromosomes
- In eukaryotes, nucleus contains one or several
double stranded DNA molecules organized as
chromosomes
- Humans
- 22 pairs of autosomes
- 1 pair of sex chromosomes
Human Karyotype: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html
22Image source: www.biotec.or.th/Genome/whatGenome.html
23What is DNA?
- DNA: Deoxyribonucleic Acid
- Single stranded molecule (oligomer,
polynucleotide): chain of nucleotides
- 4 different nucleotides
- Adenine (A)
- Cytosine (C)
- Guanine (G)
- Thymine (T)
24Nucleotide Bases
- Purines (A and G)
- Pyrimidines (C and T)
- Difference is in base structure
Image Source: www.ebi.ac.uk/microarray/biology_intro.htm
25DNA
- Can be thought of as an alphabet with 4
characters
- 4 letter alphabet with sufficiently long words
contains information to create complex organisms
- Not unlike a computer with a small alphabet
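To make the alphabet analogy concrete, here is a minimal Python sketch. It counts how many distinct DNA words of a given length exist (4 to the power of the length) and packs a sequence into an integer at two bits per base. The particular bit assignment in BASE_TO_BITS is an arbitrary choice made for illustration, not a standard.

```python
# Hypothetical 2-bit code for the four DNA letters (illustrative assumption).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def count_words(length: int) -> int:
    """Number of distinct DNA words of the given length: 4 ** length."""
    return 4 ** length


def encode(sequence: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in sequence:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


print(count_words(3))      # 64
print(count_words(10))     # 1048576
print(bin(encode("GTA")))  # 0b101100
```

With only four letters, the number of possible words grows very quickly with length, in the same way that long strings of binary digits can represent arbitrary programs and data.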
26DNA polynucleotides(oligomers)
- Different nucleotides are strung together to form
polynucleotides
- Ends of the polynucleotide are different
- A directionality is present
- Convention is to label the coding strand from 5'
to 3'
http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookDNAMOLGEN.html
27Single Strand Polynucleotide
- Example polynucleotide
- 5' G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3'
28Double Stranded DNA
- DNA can be single-stranded or double-stranded
- Double stranded DNA: second strand is the
reverse complement strand
- Reverse complement runs in opposite direction and
bases are complementary
- Complementary bases
- A, T
- C, G
29Double Stranded Sequence
- Example double stranded polynucleotide
- 5' G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3'
- 3' C←A←T←T←T←C←A←G←G←G←C←A←A←T←C←G 5'
http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookDNAMOLGEN.html
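The reverse complement of the example strand can be computed with a short Python sketch. The function name and the assumption of uppercase A, C, G, T input are illustrative choices.

```python
# Complementary base pairs from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the reverse complement of a DNA strand written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


top = "GTAAAGTCCCGTTAGC"            # example strand, 5' to 3'
bottom = reverse_complement(top)

print(bottom)        # GCTAACGGGACTTTAC  (second strand read 5' to 3')
print(bottom[::-1])  # CATTTCAGGGCAATCG  (same strand written 3' to 5', as on the slide)
```

Applying the function twice returns the original strand, which is a quick sanity check on any implementation.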
30Double Stranded DNA
Source unknown
31Double Helix
- Two complementary DNA strands form a stable DNA
double helix
- Spring 2003 marked the 50th anniversary of its
discovery
Image source: www.ebi.ac.uk/microarray/biology_intro.htm
32RNA
- Ribonucleic Acid
- Similar to DNA
- Thymine (T) is replaced by uracil (U)
- RNA can be
- Single stranded
- Double stranded
- Hybridized with DNA
33RNA
- RNA is generally single stranded
- Forms secondary or tertiary structures
- RNA folding will be discussed later
- Important in a variety of ways, including protein
synthesis
34RNA secondary structure
- E. coli RNase P RNA secondary structure
Image source: www.mbio.ncsu.edu/JWB/MB409/lecture/lecture05/lecture05.htm
35mRNA
- Messenger RNA
- Linear molecule encoding genetic information
copied from DNA molecules
- Transcription: process in which DNA is copied
into an RNA molecule
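As a rough illustration of transcription at the sequence level, the sketch below assumes the input is the coding strand written 5' to 3'. Under that assumption the mRNA has the same sequence with uracil (U) in place of thymine (T). The function name and the example sequence are hypothetical.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding strand (5' to 3').

    Assumes the coding (sense) strand is given, so the RNA is the same
    sequence with T replaced by U.
    """
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGTAAAGTCC"))  # AUGGUAAAGUCC
```

If only the template strand is available, one option is to take its reverse complement first and then apply the same T to U substitution.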
36mRNA processing
- Eukaryotic genes can be pieced together
- Exons: coding regions
- Introns: non-coding regions
- mRNA processing removes introns, splices exons
together
- Processed mRNA can be translated into a protein
sequence
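Splicing can be sketched as string slicing once exon coordinates are known. The sequence and the exon positions below are invented for illustration, and coordinates are assumed to be 0-based, half-open (start, end) pairs.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments of a pre-mRNA, dropping everything else (introns)."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical transcript: exon 1 + intron + exon 2.
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "GAAUAA"
exons = [(0, 6), (18, 24)]

print(splice(pre_mrna, exons))  # AUGGCCGAAUAA
```

Finding the exon boundaries in real sequence is the hard part and is the subject of gene prediction; this sketch only shows what happens after they are known.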
37mRNA Processing
Image source: http://departments.oxy.edu/biology/Stillman/bi221/111300/processing_of_hnrnas.htm
38ESTs
- Expressed Sequence Tags
- Basically sequence of processed mRNA
39tRNA
- Transfer RNA
- Well-defined three-dimensional structure
- Critical for creation of proteins
40tRNA structure
Source: http://www.tulane.edu/biochem/nolan/lectures/rna/frames/trnabtx2.htm
41tRNA
- Amino acid attached to each tRNA
- Determined by 3 base anticodon sequence
(complementary to mRNA)
- Translation: process in which the nucleotide
sequence of the processed mRNA is used in order
to join amino acids together into a protein with
the help of ribosomes and tRNA
42Genetic Code
- 4 possible bases (A, C, G, U)
- 3 bases in the codon
- 4 × 4 × 4 = 64 possible codon sequences
- Start codon: AUG
- Stop codons: UAA, UAG, UGA
- 61 codons to code for amino acids (AUG as well)
- 20 amino acids: redundancy in genetic code
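The counts above can be checked with a small Python sketch of the standard genetic code. The table is built from a 64-letter string of one-letter amino acid codes in U, C, A, G order, with "*" marking stop codons. The translate helper is a simplified assumption: it starts at the first AUG it finds and ignores everything a real ribosome or gene finder would consider.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon reached
            break
        protein.append(amino_acid)
    return "".join(protein)


stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(len(CODON_TABLE))                 # 64
print(stop_codons)                      # ['UAA', 'UAG', 'UGA']
print(len(CODON_TABLE) - len(stop_codons))  # 61
print(len(set(AMINO_ACIDS) - {"*"}))    # 20
print(translate("GGAUGGCCGAAUAAGC"))    # MAE
```

Because 61 codons map onto 20 amino acids, most amino acids are encoded by more than one codon, which is the redundancy the slide refers to.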
4320 Amino Acids
- Glycine (G, GLY)
- Alanine (A, ALA)
- Valine (V, VAL)
- Leucine (L, LEU)
- Isoleucine (I, ILE)
- Phenylalanine (F, PHE)
- Proline (P, PRO)
- Serine (S, SER)
- Threonine (T, THR)
- Cysteine (C, CYS)
- Methionine (M, MET)
- Tryptophan (W, TRP)
- Tyrosine (Y, TYR)
- Asparagine (N, ASN)
- Glutamine (Q, GLN)
- Aspartic acid (D, ASP)
- Glutamic Acid (E, GLU)
- Lysine (K, LYS)
- Arginine (R, ARG)
- Histidine (H, HIS)
44Amino Acids
- building blocks for proteins (20 different)
- vary by side chain groups
- Hydrophilic amino acids are water soluble
- Hydrophobic are not
- Linked via a single chemical bond (peptide bond)
- Peptide: short linear chain of amino acids (< 30)
- Polypeptide: long chain of amino acids (which
can be upwards of 4000 residues long).
45Proteins
- Polypeptides having a three dimensional
structure.
- Primary: sequence of amino acids constituting the
polypeptide chain
- Secondary: local organization into secondary
structures such as α helices and β sheets
- Tertiary: three dimensional arrangements of the
amino acids as they react to one another due to
the polarity and resulting interactions between
their side chains
- Quaternary: number and relative positions of the
protein subunits
46Protein Structure
Image source: www.ebi.ac.uk/microarray/biology_intro.html
47Central Dogma
Image source unknown
48Central Dogma
49What is a Gene?
- the physical and functional unit of heredity that
carries information from one generation to the
next
- DNA sequence necessary for the synthesis of a
functional protein or RNA molecule
50Genome
- chromosomal DNA of an organism
- number of chromosomes and genome size vary
quite significantly from one organism to another
- Genome size and number of genes do not
necessarily determine organism complexity
51Genome Comparison
52Transcriptome
- complete collection of all possible mRNAs
(including splice variants) of an organism.
- regions of an organism's genome that get
transcribed into messenger RNA.
- transcriptome can be extended to include all
transcribed elements, including non-coding RNAs
used for structural and regulatory purposes.
53Proteome
- the complete collection of proteins that can be
produced by an organism.
- can be studied either as static (sum of all
proteins possible) or dynamic (all proteins found
at a specific time point) entity
54Brief History of Sequencing
- Discovery of Complementary Bases
- Erwin Chargaff, 1950
- Discovery of DNA Double Helix
- 1953: only 50 years ago
- James Watson
- Francis Crick
- Rosalind Franklin
Image: www.simr.org.uk/pages/biotechnology/biotechnology_2.html
55History Of Genetic Code
- Genetic Code Completely uncovered (1965)
- Marshall Nirenberg
56Genetic Code
- 4 possible bases (A, C, G, U)
- 4 × 4 × 4 = 64 possible codon sequences
- Start codon: AUG
- Stop codons: UAA, UAG, UGA
- 61 codons to code for amino acids (AUG as well)
- 20 amino acids: redundancy in genetic code
57Brief History of Sequencing
- First Protein Sequence
- 1955: Bovine Insulin (Fred Sanger)
- First DNA Sequence
- 1965: yeast alanine tRNA (77 bases)
- Development of DNA sequencing
- Maxam-Gilbert and Sanger Methods (1977)
58Sanger Sequencing Method
- (Quicktime Movie)
- Source: Molecular Cell Biology
59Improving Sanger's Method
- Dideoxynucleotides fluorescently labeled (1986)
- Reaction cut by ¼
- Sequencing Automated by machine (1986)
- Laser detects fluorescence
60Image Source: plantbio.berkeley.edu/bruns/tour3.html
62Genetic Mapping
- Sex-linked genes studied since early 1900s
- Gene mapping takes off in late 1970s
- David Botstein (RFLPs, 1978)
- 1979: 579 genes mapped
- 2003: 30,000 genes mapped
- Mapping of Huntington's Disease (first disease
gene)
- Triplet repeat
- 1983
- Nancy Wexler
63Mapping of Markers
- Sequence Tagged Sites (STS)
- Sequences occurring only once in the human genome
- Help to map locations
- 52,000 STS in Humans
- 1 every 62,000 bases
64Cloning Techniques
- Plasmid Cloning Introduced (1973)
- Region of Interest duplicated by inclusion
- YAC Chromosomes described (1987)
- BACs introduced (1992)
- 30,000 to 100,000 bases can be cloned
65Hierarchical (Clone-based) Approach
- Know location of 30,000-100,000 bp region
- Break into 500-700 bp fragments
- Sequence Fragments
- Assemble based on similarity
- 8-10x coverage
- Current Price: $0.09 / base
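The relationship between region size, fragment length, and coverage can be sketched with simple arithmetic: coverage is the total number of sequenced bases divided by the length of the region. The function names below are illustrative, and the 600 bp read length is an assumed midpoint of the 500-700 bp range.

```python
import math


def reads_needed(region_bp: int, read_bp: int, coverage: float) -> int:
    """Number of reads needed so that total read bases = coverage * region length."""
    return math.ceil(coverage * region_bp / read_bp)


def achieved_coverage(num_reads: int, read_bp: int, region_bp: int) -> float:
    """Average number of times each base of the region is sequenced."""
    return num_reads * read_bp / region_bp


# A 100,000 bp clone, 600 bp reads, 8x coverage.
print(reads_needed(100_000, 600, 8))   # 1334
print(reads_needed(100_000, 600, 10))  # 1667
```

This is an average only. Because fragments land at random positions, some bases are covered many more times than others, which is one reason 8-10x is used rather than 1x.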
66Hierarchical (clone-based) approach
- generate overlapping set of clones
- select a minimum tiling path
- shotgun sequence each clone
67Hierarchical (clone-based) approach
- MINUS
- map generation requires resources, time and money
- Some regions not cloned
- PLUS
- easier to assemble smaller pieces
- less chance for assembly error
68Shotgun Sequencing Approach
- Developed 1991 (TIGR)
- Craig Venter, Hamilton Smith
- Break genome into millions of pieces
- Sequence each piece
- Reassemble into full genomes
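The reassembly step can be illustrated with a toy greedy overlap assembler. This is a teaching sketch under strong assumptions (error-free reads, all from the same strand, no repeats) and is not how any of the production assemblers work. The reads are hand-picked pieces of the 16-base example strand used earlier, and min_len is an arbitrary minimum overlap.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:  # nothing left to merge
            break
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = ["CCGTTAGC", "GTAAAGTC", "AGTCCCGT"]
print(greedy_assemble(reads))  # ['GTAAAGTCCCGTTAGC']
```

With repeated sequence, several merges can look equally good and a greedy choice may join the wrong pieces, which is why repeats are a central difficulty for shotgun assembly.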
69Whole Genome Shotgun Approach
- reads generated directly from a whole-genome
library
- assemble the genome all at once
70Whole Genome Shotgun Approach
- MINUS
- more prone to assembly error
- computationally intensive
- cannot effectively handle repeats
- PLUS
- Less overhead time up front
71Base calling and Assembly Software
- PHRED and PHRAP Developed (1988)
- PHRED: Base calling software
- PHRAP: Assists in assembly of sequenced data
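Base callers such as PHRED attach a quality score to each called base. The widely used Phred convention defines the score as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below shows only that conversion; the function names are illustrative, and it says nothing about how PHRED estimates P from the trace data.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Quality score Q = -10 * log10(P) for an error probability P (0 < P <= 1)."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


print(round(phred_quality(0.001)))  # 30
print(round(phred_quality(0.01)))   # 20
```

A score of 20 therefore corresponds to about 1 expected error in 100 base calls, and a score of 30 to about 1 in 1,000.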
72Available Assemblers
- SEQAID (Peltola et al., 1984)
- CAP (Huang, 1992)
- PHRAP (Green, 1994)
- TIGR Assembler (Sutton et al., 1995)
- AMASS (Kim et al., 1999)
- CAP3 (Huang and Madan, 1999)
- Celera Assembler (Myers et al., 2000)
- EULER (Pevzner et al., 2001)
- ARACHNE (Batzoglou et al., 2002)
73History of Genome Projects
- First Genome Sequence
- ΦX174 phage: 5,386 bp, 9 proteins (1980)
- Haemophilus influenzae sequenced
- First non-viral genome (1.8 Mb) (1995)
74History of Genome Projects
- Saccharomyces cerevisiae sequenced
- First eukaryotic genome (12.1 Mb) (1996)
- Caenorhabditis elegans sequence released
- First animal genome (200 Mb) (1998)
75History of Genome Projects
- Arabidopsis thaliana sequence released
- First publicly available plant genome (1999)
- Rough Draft of Human Genome Reported (2001)
- Finished 2003
76Human Genome Project
- Began in 1990 (US DOE, 15 years)
- Identify all genes in human DNA
- Determine sequence of human genome
- Develop faster sequencing technologies
- Develop tools for data analysis
- ELSI (ethical, legal, and social issues)
77Microbial Genomes
- 122 Complete Genomes in CMR
- http://www.tigr.org/tigr-scripts/CMR2/CMR_Content.spl
78Genomes
- Fruit Fly
- Mouse
- Rat
- Rice
- Zebra fish
- Puffer fish
- Chicken
- Dog
- Frog
79Growth of GenBank
- 1982: 600,000 Bases
- 2002: 28.5 Billion Bases
Image source: www.ncbi.nlm.nih.gov
80Other Notables
- Dayhoff ATLAS Database of Proteins (1960s)
- Sequence Comparison Algorithms
- 1970, Needleman-Wunsch (global alignment)
- Protein Databank
- Brookhaven PDB (1973)
81Other Notables
- NMR for protein structure identification (1980)
- IntelliGenetics Founded
- DNA and Protein sequence analysis (1980)
82Other Notables
- Smith-Waterman algorithm
- Local sequence alignment (1981)
- GenBank Database created (1982)
- Genetics Computer Group Founded
- GCG suite (1982)
- PCR First Described (1985)
83Other Notables
- FASTP Algorithm
- Protein database searching (1985)
- SWISS-PROT
- Protein Database (1986)
84Other Notables
- PERL Programming Language
- Allows for sequence manipulation (1987)
- NCBI Established (1988)
- Human Genome Initiative (1988)
85Other Notables
- FASTA Program released (1988)
- DNA and Protein sequence database searches
- BLAST Program released (1990)
- Allows for quick database searches
- Informax Founded (1990)
- Human Genome Project Begins (1990)
86Other Notables
- Creation and Use of ESTs Described (1991)
- Incyte Pharmaceuticals Founded (1991)
- TIGR Established (1992)
- Shotgun sequencing methods
87Other Notables
- Affymetrix founded (1993)
- PRINTS protein motif database (1994)
88Other Notables
- First Commercial Microarray chips produced (1996)
- Dolly Cloned (1997)
- Capillary Sequencing machines introduced (1997)
89Other Notables
- Celera Genomics Formed (1998)
90More Detailed Histories
- http://www.netsci.org/Science/Bioinform/feature06.html
- http://www.dhgp.de/intro/history/history.html
91Microarrays
- Microarray
- New Technology (less than 10 years old)
- Allows study of thousands of genes at same time
- Study genes under different conditions
- Glass slide of DNA molecules
- Molecule: string of bases
- uniquely identifies gene or unit to be studied
92Microarray Image Analysis
- Microarrays detect gene interactions: 4 colors
- Green: high control
- Red: high sample
- Yellow: equal
- Black: none
- Problem is to quantify image signals
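One common way to turn the two channel intensities of a spot into a number is the log2 ratio of sample (red) to control (green). The sketch below maps that ratio back to the four colors listed above. The background cutoff, the two-fold threshold, and the pseudo-count of 1 are all arbitrary assumptions chosen for illustration; real image analysis also has to locate spots and subtract local background.

```python
import math


def classify_spot(sample: float, control: float,
                  background: float = 100.0, fold: float = 2.0) -> str:
    """Label a spot by comparing red (sample) and green (control) intensities."""
    if sample < background and control < background:
        return "black"  # no signal in either channel
    # Pseudo-count of 1 avoids division by zero for empty channels.
    log_ratio = math.log2((sample + 1.0) / (control + 1.0))
    threshold = math.log2(fold)
    if log_ratio >= threshold:
        return "red"     # higher in sample
    if log_ratio <= -threshold:
        return "green"   # higher in control
    return "yellow"      # roughly equal


print(classify_spot(5000, 500))   # red
print(classify_spot(500, 5000))   # green
print(classify_spot(3000, 2800))  # yellow
print(classify_spot(20, 30))      # black
```

Working with the log ratio makes a two-fold increase and a two-fold decrease symmetric around zero, which is convenient for later statistical analysis.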