Title: Introduction to Bioinformatics 20120
1Introduction to Bioinformatics 20120
- Gianluca Pollastri
- office CS A1.07
- email gianluca.pollastri_at_ucd.ie
2Credits
- Richard Lathrop and Pierre Baldi's Bioinformatics courses at the University of California, Irvine.
3Course overview
- Context: DNA, RNA, proteins
- Resources: GenBank, PDB, etc.
- Algorithms for sequence comparison.
- Phylogenetics.
- Structural bioinformatics: protein structure prediction.
4Lecture notes
- http://gruyere.ucd.ie/2007_courses/20120/
- confidential..
5Recommended/useful readings
- No book is actually required
- Introduction to Bioinformatics (Lesk)
- Introduction to Computational Molecular Biology (Setubal, Meidanis)
- Bioinformatics: The Machine Learning Approach (Baldi, Brunak)
6- CS 20120, Introduction to Bioinformatics
- Assignment 1, 29 January 2007
- 10% of the overall mark
- To hand in by midnight of February 12
- 1. Identify your favourite pet.
- 2. Get the protein sequence for one of its genes on
- a. http://www.ncbi.nlm.nih.gov/entrez/
- 3. BLAST your sequence against UniProt at
- a. http://www.ebi.ac.uk/blast2/index.html?UniProt
- 4. If you get fewer than 6 results from 6 different organisms, go back to 2 and choose another protein.
- 5. Select 6 sequences returned by BLAST, from 6 different organisms (ticking the appropriate boxes and downloading them in FASTA format will give you the right input format for the next step).
- 6. Run ClustalW on them using the page below (be patient, it might take time)
- a. http://www.ebi.ac.uk/clustalw/index.html
- 7. Draw a phylogenetic tree for your guide tree (.dnd) using an online viewer, e.g.
- a. http://bioweb.pasteur.fr/seqanal/interfaces/drawtree.html
- 8. Email me (gianluca.pollastri_at_ucd.ie)
- a. your protein sequence UniProt record
7Public tools summary
- DNA/RNA databases: GenBank/EMBL (3-way consortium)
- Protein sequences: UniProt (SWISS-PROT, TrEMBL, etc.)
- Protein structures: Protein Data Bank (PDB)
- A massive number of portals, servers, boutique databases, etc., some good, some a bit less so.
- To sort out the above, a lot of servers benchmarking other servers (e.g. EVA, LiveBench, etc.)
8Some goals of Bioinformatics
- Understand biology based on sequences
- Interrelate sequence, structure, expression (presence/absence), function: understand the system
- Use current sequence data to travel back in time..
- Use all this knowledge to produce technologies (for health, agriculture, etc.)
9More specific goals
- Given a sequence, find sequences in a database that are similar to it (sequence comparison).
- Given a structure, find structures in a database that are similar to it (structure comparison).
- Given a sequence, find its structure (protein structure prediction).
- Given a structure, find a sequence whose structure is similar to it (protein synthesis).
10Data size
- 10^11 letters in DNA repositories: a decent-sized hard disk
- 6 complete years of issues of the NY Times (which has notoriously large weekend supplements..)
- A formidable increase rate..
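To make the "decent-sized hard disk" claim concrete, here is a rough back-of-the-envelope sketch. It assumes the slide's figure is 10^11 letters; the function name `storage_bytes` and the two encodings (one byte per letter as plain text, two bits per letter for the four-letter DNA alphabet) are illustrative assumptions, not something stated in the slides.

```python
def storage_bytes(n_letters, bits_per_letter):
    """Bytes needed to store n_letters at a fixed number of bits per letter."""
    return n_letters * bits_per_letter // 8


# Assumed reading of the slide's figure: 10^11 letters of DNA.
N_LETTERS = 10**11

# Plain text: 1 byte (8 bits) per letter.
plain = storage_bytes(N_LETTERS, 8)
# Packed: 2 bits are enough to tell A, C, G and T apart.
packed = storage_bytes(N_LETTERS, 2)

print(plain / 1e9, "GB as plain text")
print(packed / 1e9, "GB packed at 2 bits per letter")
```

Under these assumptions the raw letters take on the order of 100 GB as plain text, or about 25 GB when packed.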
12Computer Science needed
- Given the size and nature of molecular biology data, a set of specific computer science technologies are especially crucial for bioinformatics:
- Fast algorithms for comparing strings, 3D structures
- Efficient data structures
- Data mining/machine learning
13Sequence Comparison
14Sequence comparison
- Most important primitive operation in computational biology. Almost everything else we can do relies on it.
- Similarity between two sequences (DNA, protein): how much they look alike (are they likely to be evolutionarily related?).
- Alignment: how we place a sequence vs another to maximise matches between the two, often to measure similarity.
15Sequence comparison (2)
- Similarity between two sequences:
- number of identical letters in the same positions
- OK, but this works only if all letters are equally dissimilar..
- evolutionary distance
- great, but how do we compute that?
- something in between?
- any ideas?
- DADLAKKNNCIACHQVETKVVGPALKDIAAKYADKDDAATYLAGKIKGGSSGVWGQIPMPPNVNVSDADAKALADWILTLK
- ...LYAEKACAGCHSTDSRLVGPSYKGLFGSTRGVIADENYIRKSILQPTAQVVKGYPMPSQGQLSDDEINALIEYIKTLK
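The simplest measure listed above, counting identical letters in the same positions, can be sketched as follows. This is only an illustration: the function names are hypothetical, the sequences must already have the same length, and every mismatch is treated as equally bad, which is exactly the limitation the slide points out.

```python
def identity_count(a, b):
    """Number of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


def identity_fraction(a, b):
    """Fraction of identical positions (0.0 for two empty sequences)."""
    return identity_count(a, b) / len(a) if a else 0.0


# Toy example (not from the slides): 3 of the 4 positions agree.
print(identity_count("ACGT", "ACGA"), identity_fraction("ACGT", "ACGA"))
```

Note that this compares position i with position i only, so a single insertion or deletion shifts everything after it and destroys the score. That is the motivation for alignments with gaps.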
16Scenario 1
- 2 sequences over the same alphabet (e.g. both proteins, or both DNA), roughly the same length. Small differences. We want to find them.
ACCTGGGCTACGTGACTTA-AACT
ACCTG-GCTACGAGACTTATAACT
     d             i
17Scenario 1 (2)
- E.g. multiple labs sequencing the same gene, or protein. There may be small differences due to natural variations and sequencing errors.
ACCTGGGCTACGTGACTTA-AACT
ACCTG-GCTACGAGACTTATAACT
     d             i
18Scenario 2
- 2 sequences over the same alphabet. We want to
figure out if the suffix of one is similar to a
prefix of another
..ACCCGACCTGGGCTACGTGACTTA-AA
       ACCTG-GCTACGAGACTTATAACTTCAA
...
19Scenario 3
- Same as 2 (compare the end of one sequence vs the beginning of the other) but now we have hundreds of sequences.
- We also know that many of these sequences are likely to be unrelated, i.e. many/most pairs of sequences don't match. Two problems in one: finding which sequences match, and quantifying/qualifying the match.
- E.g. genome sequencing: DNA is cut in different places by enzymes, sequenced, and then reassembled.
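A minimal sketch of the suffix/prefix idea behind scenarios 2 and 3. It is deliberately simplified: it only finds exact overlaps (no mismatches or gaps, unlike the example in scenario 2), and the names `longest_overlap`, `all_overlaps` and the `min_len` threshold are assumptions made for illustration.

```python
def longest_overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len letters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_overlaps(reads, min_len=3):
    """Map (i, j) -> overlap length for every ordered pair that overlaps."""
    found = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = longest_overlap(a, b, min_len)
                if k:
                    found[(i, j)] = k
    return found


# Toy reads (not from the slides): only read 0 -> read 1 overlaps.
reads = ["ACCCGACCTG", "ACCTGGCT", "TTTT"]
print(all_overlaps(reads))
```

Comparing every pair is quadratic in the number of sequences, and most pairs yield nothing. That is the "two problems in one" of scenario 3: deciding which pairs match at all, then measuring how well.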
20Scenario 4
- Find substrings of two sequences that are similar
to each other. All the surrounding stuff in both
sequences can be different.
..ACCCGACCTGGGCTACGTGACTTA-AAGGACGC
  TTTGTACCTG-GCTACGAGACTTATAACTTCAA
       -----...------
21Scenario 4
- E.g. finding a common motif in two regions,
finding a common domain (functional unit) between
two different proteins.
..ACCCGACCTGGGCTACGTGACTTA-AAGGACGC
  TTTGTACCTG-GCTACGAGACTTATAACTTCAA
       -----...------
22Scenario 5
- Same as 4, but now we have 1 sequence A vs thousands of sequences. Most of these sequences are NOT similar to A. Two problems in one: find which sequences are similar, and gauge their similarity to A.
- We have a motif and want to find all the sequences that include it
- We have a protein domain and want to find all the proteins that include it
- We have a gene and ...
- In general, when we are trying to find biological similarity we are in scenario 4 or 5
23Global sequence comparison
- The two sequences below look similar. We are looking for an algorithm to detect this, and align the sequences optimally (including gaps).
ACCTGGGCTACGTGACTTA-AACT
ACCTG-GCTACGAGACTTATAACT
     d             i
24Global sequence comparison (2)
- We want to compare whole sequences: we are not interested in small regions of local similarity in the middle of irrelevant/random/unrelated information.
- E.g. we have a gene A, and want to compare it to a list of genes L. We are sure that A and all sequences in L are actual genes.
- We may not know much or anything about A, but we have information about the elements of L. If A looks like an element B of L then some of what we know about B may be transferred to A.
25Alignments
- We want to insert arbitrary spaces (gaps) in both sequences so that we end up with the max number of matches (same letter in the same position)
- random deletions and insertions are common during evolution; without allowing gaps we wouldn't be able to detect many evolutionary relationships
- Spaces at the end of the sequences are OK
- perhaps a sequence is longer than the other one, and insertions at the end of a sequence tend to be more neutral than insertions in the middle
- Only illegal (silly) thing: two spaces in the same position.
26OK
---TGGGCTACGTGACTTA-AACT
ACCTG-GCTACGAGACTTATAACT
27NO! (silly)
---TGGGCTA-CGTGACTTA-AACT
ACCTG-GCTA-CGAGACTTATAACT
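The rule for a legal alignment (two rows of equal length, never a gap facing a gap) can be checked mechanically. A minimal sketch, assuming gaps are written as "-" and using a hypothetical function name:

```python
def is_valid_alignment(row1, row2, gap="-"):
    """True if the rows have equal length and no column holds two gaps."""
    if len(row1) != len(row2):
        return False
    return all(not (x == gap and y == gap) for x, y in zip(row1, row2))


# The "OK" alignment: gaps never face each other.
ok = is_valid_alignment("---TGGGCTACGTGACTTA-AACT",
                        "ACCTG-GCTACGAGACTTATAACT")

# The "NO! (silly)" alignment: both rows have a gap in the same column.
silly = is_valid_alignment("---TGGGCTA-CGTGACTTA-AACT",
                           "ACCTG-GCTA-CGAGACTTATAACT")

print(ok, silly)
```

A column of two gaps carries no information: deleting it gives an alignment of the same two sequences with the same matches, so nothing is lost by forbidding it.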
28Visualising sequence similarity using dotplots: example
- ACCTGGGCCACGT
- ACCAGGCTACGA
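A dotplot for these two sequences can be sketched in a few lines of text output. This is a minimal version that marks every pair of identical letters; the function name and the choice of "*" and "." as symbols are assumptions, and real dotplot tools usually filter the noise with a sliding window, which this sketch does not do.

```python
def dotplot(x, y, hit="*", miss="."):
    """One text row per letter of y, one column per letter of x.

    Cell (i, j) is a hit when y[i] == x[j].
    """
    return ["".join(hit if yc == xc else miss for xc in x) for yc in y]


x = "ACCTGGGCCACGT"  # first sequence from the slide (columns)
y = "ACCAGGCTACGA"   # second sequence from the slide (rows)

print("  " + x)
for letter, row in zip(y, dotplot(x, y)):
    print(letter + " " + row)
```

Runs of hits along a diagonal correspond to stretches where the two sequences agree; a jump from one diagonal to a neighbouring one suggests an insertion or deletion.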
31Score
- How do we score an alignment?
- For instance:
- match: +1
- mismatch: -1
- gap: -2
- (Not at all the only way, e.g. for proteins there are much better ways, as we will see)
32Score
ACCTGGGCTACGTGACTTA-AACT
ACCAG-GCTACGAGACTTATAGCT
- 19 matches, 3 mismatches, 2 gaps -> score = 19x1 + 3x(-1) + 2x(-2) = 12
33Score (2)
...TGGGCTACGTGACTTA-AACT
ACCAG-GCTACGAGACTTATAGCT
- 16 matches, 3 mismatches, 2 gaps, 3 gaps at the beginning of a seq -> score = 16x1 + 3x(-1) + 2x(-2) + 3x0 = 9
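The two worked scores can be reproduced with a short scoring function. This is a sketch under the scheme used in the examples (+1 match, -1 mismatch, -2 gap); the function name and the optional `end_gap` parameter, which gives gaps at the start or end of a sequence their own score, are assumptions made for illustration. The leading gaps are written as "-" here rather than ".".

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2, end_gap=None):
    """Score two aligned rows column by column.

    end_gap is the score of a gap lying in a leading or trailing run of
    gaps; by default such gaps cost the same as internal gaps.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    if end_gap is None:
        end_gap = gap

    def end_columns(row):
        # Columns covered by the leading and trailing runs of gaps.
        lead = len(row) - len(row.lstrip("-"))
        trail = len(row) - len(row.rstrip("-"))
        return set(range(lead)) | set(range(len(row) - trail, len(row)))

    ends = end_columns(row1) | end_columns(row2)
    total = 0
    for i, (x, y) in enumerate(zip(row1, row2)):
        if x == "-" and y == "-":
            raise ValueError("two gaps in the same column")
        if x == "-" or y == "-":
            total += end_gap if i in ends else gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# First example: 19 matches, 3 mismatches, 2 gaps.
s1 = score_alignment("ACCTGGGCTACGTGACTTA-AACT",
                     "ACCAG-GCTACGAGACTTATAGCT")
# Second example: the 3 gaps at the beginning cost nothing.
s2 = score_alignment("---TGGGCTACGTGACTTA-AACT",
                     "ACCAG-GCTACGAGACTTATAGCT", end_gap=0)
print(s1, s2)  # 12 9
```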
34Algorithms for sequence comparison
- Generating all possible alignments and picking the best one: impossibly slow.
- Dynamic programming (here "programming" has nothing to do with computers): solving a problem by splitting it dynamically into subparts.
- We build up a solution based on similarity between prefixes of the two sequences..
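A minimal sketch of the prefix idea, assuming the standard global-alignment recurrence (usually called Needleman-Wunsch; the slides do not name the algorithm here) and the example scores of +1 match, -1 mismatch, -2 gap. Cell (i, j) of the table holds the best score for aligning the first i letters of one sequence with the first j letters of the other, and is computed from three neighbouring cells. The function name is hypothetical, and for simplicity end gaps are penalised like any other gap.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = best score for aligning a[:i] with b[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # a[:i] against nothing: i gaps
    for j in range(1, cols):
        table[0][j] = j * gap  # nothing against b[:j]: j gaps
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            table[i][j] = max(
                table[i - 1][j - 1] + (match if same else mismatch),
                table[i - 1][j] + gap,   # a[i-1] faces a gap
                table[i][j - 1] + gap,   # b[j-1] faces a gap
            )
    return table[-1][-1]


print(global_alignment_score("ACGT", "AGT"))  # 3 matches, 1 gap
```

The table has (len(a) + 1) x (len(b) + 1) cells and each is filled in constant time, so the cost grows with the product of the two lengths instead of with the number of possible alignments. Recovering the alignment itself, not just its score, needs an extra traceback step that this sketch leaves out.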