Title: Outline
1Introduction
2Outline
- Topics for this class
- Course logistics
- A very little bit of background
- Course topic overview
- 482/682 will be noticeably different from previous years.
- The instructor has changed, and the two instructors specialize in different areas of bioinformatics.
3COURSE LOGISTICS
4Course staff
- Instructor: Bin Ma (DC 3345, binma_at_uwaterloo.ca, http://www.cs.uwaterloo.ca/binma)
- TA: Xi Han
- Course webpage: monod.uwaterloo.ca/cs482
- Prerequisites: For undergraduates, the two most important prereqs are CS 341 and STAT 231.
5Marking
- For undergrads:
- 4 assignments (40%)
- In-class midterm (20%)
- Final exam or final project (40%)
- The midterm will happen on 27 October.
- For grad students:
- 4 assignments (40%)
- 1 final project, done by yourself (60%)
- A proposal (due Nov. 1)
- A final report.
- A presentation in class.
- Undergraduates can do a project too, in place of the final exam. It will earn 40 marks.
6Textbooks, notes
- Textbook: R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999, ISBN 0521629713.
- This is a classic book in this area.
- Another book that is useful, although not required, is:
- Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997, ISBN 0521585198.
- Many other books are either too specialized or low quality.
- Much material lacks text support.
- Notes:
- Notes serve as an outline of the material lectured. They cannot replace the lectures.
- Notes will appear on the web soon after they are presented in class, with corrections (!)
7BRIEF REVIEW OF BIOLOGY
8A brief review of biology
- Modern molecular biology studies a few types of biologically important molecules: DNA, RNA, protein, lipid, glycan.
- Bioinformatics has mostly studied DNA, then RNA and protein, and less lipid and glycan.
- The first three have their primary structures as sequences.
9DNA
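[Figure: double-stranded DNA, with the 5' and 3' ends of the two antiparallel strands labeled]
The G-C base pair (three hydrogen bonds) is stronger than the A-T base pair (two hydrogen bonds).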
10DNA
- Three reasons for DNA's popularity in bioinformatics:
- It is the most important information-carrying molecule: it passes information to children.
- It is responsible for many genetic diseases.
- It is the simplest to model in a computer.
- DNA is modeled as a string over {A,C,G,T}.
- In bioinformatics, "sequence" is more often used than "string". Why?
- Its data is the cheapest to obtain.
- It is predicted that a human's complete genome (3 Gbp) can be sequenced for <1000 dollars in a day in the near future.
- Bioinformatics played a key role.
- Google donated an X-prize (http://www.xprize.org/).
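The string model of DNA described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names (`is_dna`, `reverse_complement`, `gc_content`) are hypothetical, not part of the course material, and the example sequence is an arbitrary string over the four letters.

```python
# Sketch: DNA as a string over the alphabet {A, C, G, T}.
# All names below are illustrative assumptions, not course code.

DNA_ALPHABET = frozenset("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")  # A pairs with T, C pairs with G


def is_dna(seq: str) -> bool:
    """Return True if seq is a non-empty string over {A, C, G, T}."""
    return len(seq) > 0 and set(seq) <= DNA_ALPHABET


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of positions that are G or C (the stronger base pair)."""
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    s = "ACCGATTGAGCCGTACC"
    print(is_dna(s))               # True
    print(reverse_complement(s))   # GGTACGGCTCAATCGGT
    print(gc_content(s))           # 10 of the 17 letters are G or C
```

Because the two strands are complementary, storing one strand as a string is enough: the other can always be recomputed with `reverse_complement`.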
11RNA
- RNA was less studied before but is now becoming more and more important.
- The structure is important to RNA's function. It is not a simple string anymore.
12Protein
The primary structure is a sequence over 20 common amino acids. It folds into a complex 3D structure.
13Protein
- Protein is the most important molecule for the living of an organism:
- Proteins are structural components.
- They participate in almost all chemical reactions in cells as enzymes (catalysts). They allow the organism to react to the environment through sophisticated signaling pathways.
- They are directly responsible for most diseases (genetic or not) and are the main drug targets for diseases including Alzheimer's and cancer.
- Protein has become extremely popular in bioinformatics:
- Post-genome era
- Genomics vs. proteomics
- Hard to study, partially because structure is significant to the function.
- And it was more expensive to get the data until recently.
14An example
HER2 is a proto-oncogene found on chromosome 17. It encodes a protein that functions as a cell membrane receptor.
Normal epithelial cells express low levels of HER2 receptor on the cell surface, while some types of breast cancer cells overexpress this gene. This signals the tumor cells to proliferate (grow).
15An example
16Read more by yourself
- If you do not have much biology background, read the following articles (and other related articles) from Wikipedia:
- Protein, DNA, RNA, gene, genome, genetic code, tRNA.
- We will briefly review the necessary biology knowledge when needed.
17COURSE TOPICS
- Keywords: algorithm, sequence, phylogeny, protein sequencing
18Keyword 1: algorithm
- This is a bioinformatics course focusing on biological sequence analysis algorithms.
- How bioinformatics is used in biology:
- Sample → data → software → discovery
- Bioinformatics research cycle:
- biological problem → math model → algorithm → software → biology
- Normally the data is too large or the model is too complex, so an efficient algorithm is needed.
- Polynomial time is no longer good enough.
- Sometimes even linear time is not good enough.
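A rough calculation shows why polynomial time can be too slow at genome scale. The sketch below uses the 3 Gbp human genome size mentioned earlier in these slides. The machine speed of 10^9 simple operations per second and the workload of one million linear scans are assumptions made up for this illustration.

```python
# Back-of-the-envelope running times on a human-genome-sized input.
# ASSUMPTIONS (not from the slides): a machine that performs 1e9 simple
# operations per second, and a hypothetical workload of 1e6 linear scans.

GENOME_LENGTH = 3_000_000_000       # about 3 Gbp
OPS_PER_SECOND = 1_000_000_000      # assumed machine speed
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def seconds_needed(operations: int) -> float:
    """Time to execute the given number of operations at the assumed speed."""
    return operations / OPS_PER_SECOND


linear = seconds_needed(GENOME_LENGTH)                     # one pass over the genome
quadratic = seconds_needed(GENOME_LENGTH ** 2)             # an O(n^2) algorithm
many_scans = seconds_needed(GENOME_LENGTH * 1_000_000)     # 1e6 linear passes

if __name__ == "__main__":
    print(f"linear:     {linear:.0f} seconds")
    print(f"quadratic:  {quadratic / SECONDS_PER_YEAR:.0f} years")
    print(f"1e6 scans:  {many_scans / SECONDS_PER_DAY:.1f} days")
```

Under these assumptions a single linear pass takes about 3 seconds, while a quadratic algorithm would need roughly 285 years. Repeating even a linear pass a million times takes about a month, which is why linear time is sometimes still too slow.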
19An example: role of bioinformatics
[Diagram: protein sample → (mass spectrometry) → data → (bioinformatics) → protein information]
- Interesting protein information includes:
- Protein identity
- Protein quantity
- PTMs (post-translational modifications) on proteins
- These are useful for disease study and drug development.
20Keyword 2: sequence
- Fundamental information storage method in living cells: DNA sequences.
- Central dogma of molecular biology: DNA → RNA → protein
- Hence, to understand an organism, it helps to start out by understanding DNA sequences.
- We can treat DNA sequences as strings.
- ACCGATTGAGCCGTACC
- So we're going to spend most of the course learning about algorithms for strings and sequences.
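The central dogma can be mimicked with plain string operations. The sketch below is a simplification under stated assumptions: the DNA string is treated as the coding strand, translation starts at the first letter rather than at a start codon, and any trailing one or two letters are ignored. The function names are hypothetical. The codon table is the standard genetic code written in the usual compact form (bases ordered T, C, A, G, with `*` for a stop codon).

```python
# Sketch of DNA -> RNA -> protein on strings (illustrative names only).

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """DNA coding strand -> mRNA: same letters, with T replaced by U."""
    return dna.replace("T", "U")


def translate(rna: str) -> str:
    """mRNA -> one-letter amino acid string, reading codons from position 0."""
    dna_like = rna.replace("U", "T")  # the table above is keyed by DNA letters
    return "".join(
        CODON_TABLE[dna_like[i:i + 3]] for i in range(0, len(dna_like) - 2, 3)
    )


if __name__ == "__main__":
    dna = "ACCGATTGAGCCGTACC"      # the example string from the slide
    rna = transcribe(dna)
    print(rna)                     # ACCGAUUGAGCCGUACC
    print(translate(rna))          # TD*AV (the third codon, UGA, is a stop)
```

In a cell, translation would end at the stop codon. The sketch keeps translating only to show that every codon maps to a symbol.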
21Keyword 3: phylogeny
- Darwin's theory of evolution told us that all species share the same ancestor.
- Knowing only the currently living species, especially their DNA sequences, reconstruct this evolutionary tree.
- Without digging up fossils.
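One common starting point for rebuilding a tree from present-day sequences is a table of pairwise distances. Distance-based methods such as UPGMA and neighbor joining begin from such a table. The sketch below is not necessarily the approach this course takes. It computes Hamming distances between already-aligned, equal-length sequences and reports the closest pair, which a clustering method would typically join first. The species names and sequences are made-up toy data.

```python
# Sketch: pairwise distances between aligned sequences (toy data, hypothetical names).
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_table(seqs: dict) -> dict:
    """Map each unordered pair of names to the Hamming distance of their sequences."""
    return {
        (p, q): hamming(seqs[p], seqs[q])
        for p, q in combinations(sorted(seqs), 2)
    }


def closest_pair(seqs: dict) -> tuple:
    """Return the pair of names whose sequences differ the least."""
    table = distance_table(seqs)
    return min(table, key=table.get)


if __name__ == "__main__":
    toy = {  # invented sequences, for illustration only
        "species_a": "ACCGATTGAG",
        "species_b": "ACCGATTGAC",
        "species_c": "ACGGTTTGAC",
        "species_d": "TCGGTTAGAC",
    }
    for pair, d in distance_table(toy).items():
        print(pair, d)
    print(closest_pair(toy))   # ('species_a', 'species_b')
```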
22Keyword 4: protein sequencing
- We will also talk about protein sequencing.
- Proteins are the construction material and the controls of a living organism. They determine the phenotype (as opposed to the genotype).
- Consider genes as source code and proteins as running programs (processes).
- We will study how to read the sequence information of a protein from a biological sample.
- Very interesting algorithms.
They have the same genome!
23Bioinformatics General Topics
- This is not a general course in bioinformatics, which has become a very broad area:
- Genome sequencing
- Sequence comparison
- Gene prediction and annotation
- Gene expression and biomarkers
- Motif finding
- Regulatory networks
- Protein structure comparison and prediction (CS 483/683)
- Protein-protein interaction
- Protein identification and quantification with mass spectrometry
- RNA structure, RNA gene prediction, RNAi
- Glycans and lipids
- Genetic variations: SNPs, alternative splicing, and diseases
- Phylogeny
- Genome evolution
- Medical/cell image processing, molecular simulation (bioinformatics? health informatics?)
- DNA computing (a different area from bioinformatics)
24Specific topics
- Pairwise alignment: Which parts of two sequences are surprisingly similar to each other, if they've been evolving away from each other?
- Phylogenetic reconstruction: How do I build evolutionary trees? How do I know they're the right ones?
- Multiple alignment: Like pairwise alignment, only with multiple sequences. How to make this useful in the context of evolution?
- Gene finding: Which part of a DNA sequence is actually part of the process of producing proteins?
- Protein sequencing: How to identify the protein sequence from biological samples (wet lab → data)?
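As a preview of the pairwise alignment question, here is a minimal sketch of local alignment scoring using the Smith-Waterman recurrence with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -1) are arbitrary choices for illustration, and the function name is hypothetical. The course may use different scoring schemes and will likely also recover the alignment itself, not just its score.

```python
# Sketch: best local alignment score of two strings (Smith-Waterman, linear gaps).
# Scoring parameters are illustrative assumptions, not course-specified values.


def local_alignment_score(a: str, b: str,
                          match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Return the highest score of any alignment between a substring of a
    and a substring of b. Uses O(len(a) * len(b)) time and O(len(b)) memory."""
    prev = [0] * (len(b) + 1)   # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 lets an alignment start fresh anywhere, which makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTTT", "GGACGGG"))  # 3, from the shared "ACG"
    print(local_alignment_score("AAAA", "TTTT"))         # 0, nothing in common
```

The table has one cell per pair of positions, so the running time is quadratic in the sequence lengths.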
25Summary
- We talked about:
- course logistics
- basic biology (Wikipedia is a good resource)
- course topics
- Next time: sequence alignment