Title: Bioinformatics
1. Bioinformatics
- Science Honors Program - Computer Modeling and
Visualization in Chemistry
2. What is Bioinformatics?
- It sits at the intersection of biotechnology and computer science, and is used to analyze the enormous amount of sequence and structural data generated over the past decades.
- It provides computational tools to mine this enormous amount of data.
3. Bioinformatics is multidisciplinary
Bioinformatics draws on:
- Mathematics/computer science
- Genomics
- Molecular biology
- Biomedicine
- Biophysics
- Molecular evolution
- Ethical, legal, and social implications
4.
- Biological data
- Huge data sets
- Complexity of biological systems
5. What are we trying to find out?
KEGG: Kyoto Encyclopedia of Genes and Genomes.
A grand challenge in the post-genomic era is a complete computer representation of the cell and the organism, which will enable computational prediction of higher-level complexity of cellular processes and organism behaviors from genomic information. http://www.genome.jp/kegg/
6. Where Does the Data Come From?
- Primary protein structure determination: a variety of chemical techniques
- Nucleic acid sequencing
- PCR (polymerase chain reaction)
- 3D structure
  - X-ray crystallography
  - Nuclear magnetic resonance (NMR)
7. Biological information: from genes to proteins
Gene (DNA) -> Transcription -> RNA -> Translation -> Protein -> Protein folding
Related fields: genomics, molecular biology, structural biology, biophysics
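The DNA-to-RNA step in this flow can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is the coding (sense) strand written 5' to 3'. The helper names `transcribe` and `reverse_complement` and the short input fragment are hypothetical choices for this example, not part of the original slides.

```python
# Watson-Crick base pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the opposite DNA strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand):
    """Return the mRNA matching a coding-strand DNA sequence (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    dna = "ATGAGGCTG"  # illustrative fragment only
    print(transcribe(dna))
    print(reverse_complement(dna))
```

The mRNA has the same sequence as the coding strand with uracil (U) in place of thymine (T); the cell actually reads the complementary template strand to produce it.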
8. Eukaryotic Genome DNA
Structure
Nucleotides (bases): Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
Sequence data: strings of letters
Triplet codons: the genetic code
20 amino acids (A, L, V, S, etc.)
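To make the idea of triplet codons concrete, here is a small Python sketch that translates a DNA string into one-letter amino acid codes using the standard genetic code. It assumes the sequence is a coding strand already in the correct reading frame; the function name `translate` and the example input are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGAGGCTGCTGACC"))
```

Each group of three bases maps to one amino acid, so 64 possible codons encode 20 amino acids plus stop signals; several codons therefore code for the same amino acid.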
9. Three-dimensional protein structure: atomic coordinates in 3D space
Measured in Angstroms (Å)
Conversion into metric units: 1 Angstrom = 1 x 10^-8 cm = 0.1 nm
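The unit conversion can be checked with a short Python sketch. It also shows how a distance between two atoms follows from their 3D coordinates. The function names and the coordinates below are hypothetical values chosen for illustration, not data from a real structure.

```python
import math

ANGSTROM_IN_CM = 1e-8  # 1 Angstrom = 10^-8 cm
ANGSTROM_IN_NM = 0.1   # 1 Angstrom = 0.1 nm


def angstrom_to_cm(value):
    """Convert a length from Angstroms to centimeters."""
    return value * ANGSTROM_IN_CM


def angstrom_to_nm(value):
    """Convert a length from Angstroms to nanometers."""
    return value * ANGSTROM_IN_NM


def distance(atom_a, atom_b):
    """Euclidean distance between two (x, y, z) coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(atom_a, atom_b)))


if __name__ == "__main__":
    # Made-up coordinates in Angstroms, for illustration only.
    d = distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
    print(d, "Angstrom =", angstrom_to_nm(d), "nm")
```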
10. Proteins: Prediction of biochemical function
- Relationships between DNA or amino acid sequence, 3D structure, and protein function
- Use of this knowledge for prediction of function, molecular modelling, and design (e.g., new therapies)
Example DNA sequence:
CGCCAGCTGGACGGGCACACCATGAGGCTGCTGACCCTCCTGGGCCTTCTG
Example amino acid sequence:
TDQAAFDTNIVTLTRFVMEQGRKARGTGEMTQLLNSLCTAVKAI
STAVRKAGIAHLYGIAGSTNVTGDQVKKLDVLSNDLVINVLKSSFATCVL
VTEEDKNAIIVEPEKRGKYVVCFDPLDGSSNIDCLVSIGTIFGIYRKNST
DEPSEKDALQPGRNLVAAGYALYGSATML
11. DNA Sequence -> Gene -> Protein Sequence -> Function
12. Sequence Databases
- Protein Sequences
  - ExPASy Molecular Biology Server (SWISS-PROT): http://expasy.ch
  - Protein Information Resource (PIR): http://pir.georgetown.edu
  - Protein Research Foundation (PRF): http://www.prf.or.jp/en
  - NR from NCBI: www.ncbi.nlm.nih.gov
- OMIM: Online Mendelian Inheritance in Man. Genetic diseases. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
13. Other Databases
- GenBank (Nucleotide): www.ncbi.nlm.nih.gov/Genbank
- NCBI Entrez: an integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others. http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
- PDB: Protein Data Bank. Protein structures. www.rcsb.org/pdb
- KEGG: Kyoto Encyclopedia of Genes and Genomes. http://www.genome.jp
- PubMed: literature references. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
- NCBI primers: bottom of page. This page has education resources. http://www.ncbi.nih.gov/Education
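Databases like these can also be queried from a program rather than a browser. As a hedged sketch, the Python snippet below builds a request URL for NCBI's Entrez E-utilities `efetch` service, which is not mentioned on the slides and is brought in here as an assumption. The helper name `build_efetch_url` is hypothetical, and the accession is only an example identifier; check the current NCBI documentation before relying on the parameters.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service (an assumption of this
# sketch; it does not appear in the slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nucleotide", rettype="fasta"):
    """Build a URL that asks efetch for one record as plain text."""
    params = {
        "db": db,            # which Entrez database to query
        "id": accession,     # record identifier, e.g. an accession number
        "rettype": rettype,  # record format, here FASTA
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_efetch_url("NM_000518")  # example accession only
    print(url)
    # To download the record, one could pass `url` to
    # urllib.request.urlopen(url).read().decode() (needs network access).
```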
14. Genome sequencing and analysis (genomics)
Genomics generates a vast amount of DNA sequence data. Sophisticated algorithms are used to predict gene regions. Only 3% of the vertebrate genome codes for proteins.
- GenBank holds sequences from over 800 organisms. There are currently 113 complete genomes.
- The completion of a "working draft" of the human genome was announced in June 2000.
- Estimates range from 38,000 to 120,000 genes (40,000).
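Real gene-prediction programs are far more sophisticated than anything that fits on a slide, but the basic idea of scanning a sequence for candidate coding regions can be sketched with a naive open reading frame (ORF) search. The Python example below is a simplified stand-in, not the method used by any actual gene finder. It looks only at the forward strand, and the function name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for each ATG...stop stretch.

    Scans the three forward reading frames. `min_codons` counts the
    codons before the stop codon, including the initial ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

A realistic predictor would also scan the reverse strand and, in eukaryotes, account for introns and use statistical models of coding sequence.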
15. Explore the following databases
- KEGG: Kyoto Encyclopedia of Genes and Genomes
  - Main Map: http://www.genome.jp/kegg/pathway/map/map01100.html
  - Main Page: http://www.genome.jp/kegg/
- GenBank
  - What is it? An annotated collection of all publicly available DNA sequences
  - www.ncbi.nlm.nih.gov/Genbank
- PDB: The Protein Data Bank
  - What is it? A database of 3D structures of proteins (and some DNA or other molecules)
  - www.rcsb.org/pdb
16. What is homology?
- Homology: common ancestry
- Why homology?
  - It is much easier to get a DNA or protein sequence than to experimentally determine the structure or function of a biological molecule.
  - Databases of sequences (DNA, protein, RNA) are expanding far more rapidly than structural or functional databases.
  - The goal is to develop computational methods that infer biologically relevant information from sequence alone.
  - E.g., if we know that two protein sequences are homologous, then we can infer that the two proteins may share the same protein fold, active site, and even function.
- In simple terms, much of today's class involves Darwinian evolution discussed at the microscopic genetic level.
17. Stochastic Evolutionary Forces Act on Genomes
- Forces that alter a genetic sequence
  - Random mutation
  - Natural selection
  - Genetic drift
- Comparison of protein sequences can be used to infer evolutionary events that happened possibly billions of years ago.
- If we find a protein sequence homologous to a given protein, often from a very divergent organism, then a common ancestral protein must have existed.
- Homologous proteins share similar 3-D structure.
- Homologous proteins can have seemingly very different sequences.
18. Evolutionary Tree
19. Modes of Evolution
Globin Family Evolution
Orthologous: differ because of speciation
Paralogous: differ because of gene duplication
20. Orthologous Sequences: Cytochrome c Family
21. Sequence Alignment
Optimal alignments of human myoglobin and human hemoglobin (alpha chain)
Algorithms: Needleman-Wunsch, Smith-Waterman
Heuristic algorithms:
- Pairwise alignment: BLAST (www.ncbi.nlm.nih.gov)
- Multiple sequence alignment: ClustalW (www.ebi.ac.uk), PSI-BLAST
22. Structural Alignment
Comparison of dihydrofolate reductases from Mycobacterium tuberculosis (1DF7) and Escherichia coli (1DRE) with 41% sequence identity. After the sequences of these two proteins were aligned, the alpha-carbons of the backbone were structurally aligned.
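Two numbers are commonly used to summarize a comparison like this one: the percent sequence identity of the aligned sequences and the root-mean-square deviation (RMSD) between matched alpha-carbon positions. The Python sketch below shows one way to compute each. It assumes the alignment and the superposition have already been done, and it divides identity by the full alignment length (other conventions exist). The function names and the toy inputs are hypothetical, not values from 1DF7 or 1DRE.

```python
import math


def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical residues ('-' is a gap)."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal length")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def rmsd(coords_a, coords_b):
    """RMSD between matched (x, y, z) positions that are already superposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    total = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    print(percent_identity("ACDE-G", "ACEEFG"))
    print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
```

Finding the rotation and translation that minimize the RMSD is a separate step that this sketch does not attempt.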
23. Amino Acid Similarity Matrix
A similarity matrix incorporates information about the likelihood that one amino acid will be mutated into another over evolutionary time. Shown here is the PAM250 matrix. Another common one is BLOSUM50.
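Entries in matrices of this kind are generally log-odds scores: they compare how often two amino acids are seen aligned in related proteins with how often that pairing would occur by chance. The Python sketch below shows that calculation in its simplest form. The frequencies used are made-up numbers for illustration, not values from PAM250 or BLOSUM50, and the function name, scale and logarithm base are assumptions (published matrices use their own scaling and rounding).

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=1.0, base=2):
    """Log-odds score for aligning amino acids a and b.

    pair_freq: observed frequency of a aligned with b in related proteins
    freq_a, freq_b: background frequencies of a and b
    """
    expected = freq_a * freq_b  # chance of the pair if independent
    return scale * math.log(pair_freq / expected, base)


if __name__ == "__main__":
    # Hypothetical frequencies, chosen only to show the sign of the score.
    print(log_odds_score(0.02, 0.1, 0.1))   # seen more often than chance
    print(log_odds_score(0.005, 0.1, 0.1))  # seen less often than chance
```

A positive score means the substitution is observed more often than expected by chance (conservative changes), and a negative score means it is observed less often.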
24. Sequence Alignment: Needleman-Wunsch
25. Global Alignment
Needleman-Wunsch Algorithm
26. Global Alignment Example
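As a small worked example of global alignment, here is a minimal Python sketch of the Needleman-Wunsch dynamic-programming approach. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -1) rather than a PAM or BLOSUM matrix; those values and the function name `needleman_wunsch` are choices made for this illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("GAT", "GT")
    print(score)
    print(top)
    print(bottom)
```

Because the alignment is global, every residue of both sequences appears in the output, with gaps ('-') inserted where needed.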
27. Local Alignment
Smith-Waterman Algorithm
28. Local Alignment Example
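As a small worked example of local alignment, here is a minimal Python sketch of the Smith-Waterman approach. It assumes a simple scoring scheme (match +2, mismatch -1, linear gap penalty -1) instead of a PAM or BLOSUM matrix; those values, the function name `smith_waterman`, and the toy sequences are choices made for this illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new alignment here
                          H[i - 1][j - 1] + s,  # extend diagonally
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until the score drops to 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("AAACGTAAA", "CCCCGTCCC"))
```

Unlike a global alignment, scores are never allowed to fall below zero, so the result covers only the best-matching region and ignores the unrelated flanking residues.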
29. Multiple Sequence Alignment
30. Gene Doping
31. Gene Therapy
32. Gene Doped Mice
HEAVY WORKOUT. This rat, injected with a muscle-enhancing gene, boosts its strength by lugging weights up a ladder. http://www.sciencenews.org/articles/20041030/bob9.asp