Title: Bioinformatics Overview: Problems and Algorithms
1Bioinformatics Overview: Problems and Algorithms
- Debra T. Burhans, Ph.D.
- Director, Bioinformatics Program
- Canisius College
- burhansd_at_canisius.edu
- SIGCSE Workshop, February 2005
2Outline
- Overview of bioinformatics
- Problems and Algorithms
3Overview of Bioinformatics
4What is Bioinformatics?
- There are many definitions of bioinformatics
- The field is so new that it is still in the process of being defined
- Bioinformatics involves the application of computational techniques to the representation and analysis of biological data
- Bioinformatics is intertwined with a number of different sciences
- According to the National Institutes of Health, bioinformatics is research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
5Bioinformatics is Interdisciplinary
- Biology: complex systems, source of interesting problems
- Chemistry: biochemistry underlies molecular biology
- Physics: physical phenomena at molecular level
- Mathematics: modeling
- Statistics: understand large data sets, biostatistics
- Computer Science: formulate and solve problems, information representation, integration, storage, management
- Informatics
- Medicine/Health Professions: patient information, disease data
6Bio Informatics
- Biologists are generating an enormous amount of data with new high-throughput laboratory technologies
- A single experiment can now yield thousands of data points
  - Sequencing experiments
  - Microarrays (gene and protein expression arrays)
- There are thousands of journals, most of which are available electronically
- There is no way for a scientist to analyze/understand this data without the aid of computational tools and statistical analyses
- Bio: source of data and problems
- Informatics: storing, retrieving, analyzing, understanding data with the use of computers
7A Few Problems and Algorithms
8Sequences
- There are three different sequence types of interest
- DNA
  - 4-letter alphabet: A, C, G, T; a 5th letter, N, is added for unknown
  - Nucleotides: adenine, cytosine, guanine, thymine
- RNA
  - A, C, G, U
  - Like DNA but uracil instead of thymine
- Protein
  - 20-letter alphabet: A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V; a 21st letter, X, is added to represent an unknown aa
  - Amino acids (aa): Alanine (A), Arginine (R), Asparagine (N), Aspartic acid (D), Cysteine (C), Glutamic acid (E), Glutamine (Q), Glycine (G), Histidine (H), Isoleucine (I), Leucine (L), Lysine (K), Methionine (M), Phenylalanine (F), Proline (P), Serine (S), Threonine (T), Tryptophan (W), Tyrosine (Y), Valine (V), Unknown (X)
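The three alphabets can be written down directly as sets, which makes a simple first exercise. This is a minimal sketch; the names `DNA`, `RNA`, `PROTEIN`, and `is_valid` are illustrative and not taken from any library.

```python
# Alphabets as listed on the slide, including the "unknown" letters N and X.
DNA = set("ACGT") | {"N"}
RNA = set("ACGU")
PROTEIN = set("ARNDCEQGHILKMFPSTWYV") | {"X"}


def is_valid(sequence, alphabet):
    """Return True if every letter of the sequence belongs to the alphabet."""
    return set(sequence.upper()) <= alphabet


print(is_valid("ACGTN", DNA))   # DNA with one unknown base
print(is_valid("ACGU", DNA))    # contains uracil, so not DNA
```

Note that a string such as ACGT is valid in both the DNA and protein alphabets, so a validity check alone cannot tell the sequence types apart.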
9The Central Dogma involves Mappings
- DNA -> RNA
  - RNA is transcribed from DNA with the following nucleotide match-ups: A->U, T->A, C->G, G->C
- RNA -> Protein
  - Protein is synthesized from RNA as groups of three letters of the RNA code (codons) are mapped onto single amino acids
  - There are only 20 amino acids yet 64 possible codons, so the code is redundant
- Good problems arise with regular expressions
  - Create a regular expression that describes the set of codons that code for each amino acid
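Both mappings are easy to try out in code. The sketch below applies the slide's nucleotide match-ups to a DNA template strand and gives regular expressions for a few entries of the standard genetic code. Strand orientation is ignored for simplicity, only four entries are shown, and the names `transcribe`, `CODON_PATTERNS`, and `codons_matching` are made up for this example.

```python
import re
from itertools import product

# Template-strand DNA base -> RNA base, as on the slide.
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_dna):
    """Map each DNA base of the template strand to its RNA partner."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_dna)


# Regular expressions over RNA codons (standard genetic code, partial list).
CODON_PATTERNS = {
    "Met": re.compile(r"AUG"),
    "Ile": re.compile(r"AU[UCA]"),
    "Leu": re.compile(r"CU[ACGU]|UU[AG]"),
    "Stop": re.compile(r"UA[AG]|UGA"),
}


def codons_matching(pattern):
    """List, in sorted order, all of the 64 codons that the pattern matches."""
    all_codons = ("".join(c) for c in product("ACGU", repeat=3))
    return sorted(c for c in all_codons if pattern.fullmatch(c))


print(transcribe("TACGGA"))                    # AUGCCU
print(codons_matching(CODON_PATTERNS["Leu"]))  # the six leucine codons
```

Enumerating all 64 codons against each pattern is a handy way for students to check that their expressions neither miss nor over-match codons.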
10Sequence Alignment
- Problem of matching sequences (DNA, RNA, protein)
- Helps to identify unknown sequences by comparing them to large databases of sequences whose function may be understood
- Alignments may be global or local
- Alignments may be gapped or ungapped
- Alignments are rarely perfect in biology
11Simple Alignment Problem
- What is the relationship between two sequences, for example:
  - ACTTA
  - AGGTACTAGACTTATTATATACTTAACTATATACTTAAAA
- Overlap and containment
- In biology the problem is much more complex due to insertions, deletions and changes in individual bases
  - Could be mutations (biological)
  - Could be errors in sequencing or data entry
- Alignments are scored: number of matches, mismatches
- Pairwise vs. multiple alignments (consensus sequence)
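For the two example sequences, containment can be checked with an exact search, and an ungapped alignment at any offset can be scored by counting matches and mismatches. A minimal sketch follows; the function names are illustrative.

```python
def find_occurrences(pattern, text):
    """Return every 0-based offset at which pattern occurs exactly in text."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def count_matches(pattern, text, offset):
    """Ungapped comparison at a fixed offset: (matches, mismatches)."""
    window = text[offset:offset + len(pattern)]
    matches = sum(a == b for a, b in zip(pattern, window))
    return matches, len(window) - matches


short = "ACTTA"
long_seq = "AGGTACTAGACTTATTATATACTTAACTATATACTTAAAA"

print(find_occurrences(short, long_seq))   # [9, 20, 32]
print(count_matches(short, long_seq, 4))   # (3, 2): ACTAG vs ACTTA
```

The short sequence is contained three times in the long one. Offset 4 shows an imperfect placement of the kind that insertions, deletions, and base changes produce in real data.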
12Alignment with Gaps
- Insertions and deletions in sequences (indels) lead to the notion of a gapped alignment
- In addition to match and mismatch scores, include a gap penalty
  - A C G T - - T
  - A - T T T T T
- With match score of 2, mismatch score of -2, and gap penalty of -1, this alignment scores 2 - 1 - 2 + 2 - 1 - 1 + 2 = 1
- The scoring parameters are adjusted to reflect the underlying biology
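The column-by-column scoring of a gapped alignment can be sketched as below, using the slide's parameters as defaults. The function name `score_alignment` is illustrative, and each gap column is charged the same penalty, which is an assumption (many tools charge differently for opening and extending a gap).

```python
def score_alignment(row1, row2, match=2, mismatch=-2, gap=-1):
    """Score two equal-length aligned rows, where '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # column with a gap in either row
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


print(score_alignment("ACGT--T", "A-TTTTT"))  # 1
```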
13Multiple Sequence Alignment
Alignment of protein sequences
14A look at scoring matrices
- BLOSUM62
  - Created from multiple sequence alignments
  - Larger-numbered matrices reflect more closely related sequences
- PAM250
  - 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid (aa) residues
  - A PAM matrix is a matrix of similarity scores for all possible pairs of residues (protein)
  - The matrix was derived from aa replacements occurring in related proteins
15Local vs. Global Alignment
- Global alignment is concerned with aligning two sequences end to end
- Local alignment seeks the highest scoring alignment between subsequences
- BLAST: Basic Local Alignment Search Tool
  - The primary tool available to score alignment
  - Parameters can be set in web interface
  - Did you BLAST your sequence?
  - Many flavors, including
    - Nucleotide-nucleotide
    - Protein-protein
    - Protein-nucleotide
16BLAST Exercise (handout)
17Dynamic Programming
- Alignments can be computed using dynamic programming
  - Needleman-Wunsch (global alignment)
  - Smith-Waterman (local alignment)
- Good programming exercise
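As a starting point for the exercise, here is a minimal sketch of the score-only part of a Smith-Waterman style local alignment. It assumes a simple linear gap penalty and borrows the match, mismatch, and gap values 2, -2, and -1 as defaults; the function name is illustrative and the traceback is left out.

```python
def smith_waterman_score(seq1, seq2, match=2, mismatch=-2, gap=-1):
    """Best local alignment score between any pair of subsequences."""
    best = 0
    prev = [0] * (len(seq2) + 1)      # previous row of the DP matrix
    for i in range(1, len(seq1) + 1):
        curr = [0] * (len(seq2) + 1)  # current row; column 0 stays 0
        for j in range(1, len(seq2) + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # A cell never drops below 0, so an alignment may start anywhere.
            curr[j] = max(0,
                          prev[j - 1] + s,    # diagonal: match/mismatch
                          curr[j - 1] + gap,  # gap in seq1
                          prev[j] + gap)      # gap in seq2
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACTTA", "AGGTACTAGACTTATTATATACTTAACTATATACTTAAAA"))  # 10
```

Because ACTTA occurs exactly in the longer sequence, the best local score is five matches at 2 points each. Only two rows are kept in memory, which is enough when the alignment itself is not needed.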
18Alignment Matrix
- Choose a scoring metric, e.g.
  - Score 1 for match
  - Score 0 for gap penalty
  - Score 0 for mismatch
- Three steps in dynamic programming
  - Initialization
  - Matrix fill (scoring)
  - Traceback (alignment)
- For each position, $M_{i,j}$ is defined to be the maximum score at position $i,j$, i.e.
  $M_{i,j} = \max \{\, M_{i-1,j-1} + S_{i,j},\; M_{i,j-1} + w,\; M_{i-1,j} + w \,\}$
  - $M_{i-1,j-1} + S_{i,j}$: match/mismatch in the diagonal
  - $M_{i,j-1} + w$: gap in sequence 1
  - $M_{i-1,j} + w$: gap in sequence 2
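The three steps can be sketched as follows for global alignment, using the example metric above (1 for a match, 0 for a mismatch, 0 for a gap) as defaults. This is an illustrative Needleman-Wunsch style implementation with a linear gap penalty `gap` standing in for w. The function name and the tie-breaking order in the traceback are assumptions.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=0):
    """Return (score, aligned_seq1, aligned_seq2) for a global alignment."""
    n, m = len(seq1), len(seq2)

    def s(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # 1. Initialization: first row and column hold cumulative gap penalties.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * gap
    for j in range(1, m + 1):
        M[0][j] = j * gap

    # 2. Matrix fill: apply the recurrence to every cell.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(i, j),  # diagonal
                          M[i][j - 1] + gap,          # gap in sequence 1
                          M[i - 1][j] + gap)          # gap in sequence 2

    # 3. Traceback: walk from the bottom-right corner back to the origin.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(i, j):
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1])
            i -= 1; j -= 1
        elif j > 0 and M[i][j] == M[i][j - 1] + gap:
            row1.append("-"); row2.append(seq2[j - 1])
            j -= 1
        else:
            row1.append(seq1[i - 1]); row2.append("-")
            i -= 1
    return M[n][m], "".join(reversed(row1)), "".join(reversed(row2))


score, top, bottom = needleman_wunsch("ACGTT", "ATTTTT")
print(score)
print(top)
print(bottom)
```

With this particular metric the score simply counts matched positions. Passing negative mismatch and gap values gives results that better reflect the biology.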
19Fragment Reassembly
- Shotgun sequencing involves chopping up DNA into small pieces, sequencing those pieces, then figuring out how they all fit together
- The original structure is reconstructed via fragment reassembly
- This was the technique used by J. Craig Venter to revolutionize the sequencing of the human genome
- Only possible due to computational power
20Fragment Reassembly Example
- Reassemble a set of sequences into a single sequence
  - ACCT
  - CTTAG
  - TAGTAGTAG
  - AGGTC
- Construct overlap graph
- Problem: repetitive regions may be collapsed
  - Not minimal superstring!
- In reality the sequences are hundreds of bases long and there are thousands of them
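The overlap graph for the four example fragments can be built by computing, for every ordered pair, the longest suffix of one fragment that is a prefix of the other. The sketch below also merges fragments greedily by largest overlap. The greedy strategy is a swapped-in simplification chosen for illustration, not necessarily the method intended on the slide, and all function names are hypothetical.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(fragments):
    """Edges (a, b) -> overlap length, for every pair with a nonzero overlap."""
    edges = {}
    for a, b in permutations(fragments, 2):
        k = overlap(a, b)
        if k > 0:
            edges[(a, b)] = k
    return edges


def greedy_assemble(fragments):
    """Repeatedly merge the pair with the largest overlap (ties: first found)."""
    frags = list(fragments)
    while len(frags) > 1:
        k, a, b = max(((overlap(a, b), a, b)
                       for a, b in permutations(frags, 2)),
                      key=lambda t: t[0])
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[k:])  # glue b onto a, dropping the shared part
    return frags[0]


fragments = ["ACCT", "CTTAG", "TAGTAGTAG", "AGGTC"]
print(overlap_graph(fragments))
print(greedy_assemble(fragments))  # ACCTTAGTAGTAGGTC
```

The fragment TAGTAGTAG hints at the repeat problem: an assembler that always prefers the shortest result can collapse repeated regions, so the shortest superstring is not necessarily the true sequence.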
21Pioneers are Revolutionizing Biology
Leaving colleagues and rivals to comb through the
finished human code in search of individual
genes, he has decided to sequence the genome of
Mother Earth. "My greatest success is that I
managed to get hated by both worlds," Venter told
me on St. Barts.
What separates him from your average 58-year-old
nude beachcomber is that he's in the midst of a
scientific enterprise as ambitious as anything
he's ever done.
22Pioneers are Revolutionizing Biology
He wanted to play God, so he cracked the human
genome. Now he wants to play Darwin and collect
the DNA of everything on the planet.
In March of 2004, he announced that his Sargasso
team had discovered at least 1,800 new species
and more than 1.2 million new genes.
23Gene Finding
- Many programs and approaches are used
  - http://www.binf.ku.dk/users/krogh/genefinding.html
  - HMMs (hidden Markov models)
- Search for particular features associated with genes
  - Start codon: ATG
  - Stop codons: TAA, TAG, TGA
  - ORFs (open reading frames)
  - Areas of high complexity (a lot of nucleotide variation)
  - Splice sites
  - Protein binding sites
  - etc.
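The start and stop codon features are enough for a toy ORF scanner. The sketch below checks only the three reading frames of the given strand and reports each stretch that runs from the first ATG to the next in-frame stop codon. It ignores the reverse strand, introns, and minimum-length filters, and the function name is illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna):
    """Return (start, end, sequence) for each ORF in the three forward frames.

    start is the 0-based index of the A in ATG; end is one past the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                    # close it at the stop codon
    return orfs


print(find_orfs("CCATGAAATAGGG"))  # [(2, 11, 'ATGAAATAG')]
```

Real gene finders combine many such signals, often in a probabilistic model, rather than relying on ORFs alone.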
24Phylogenetic Trees
- Relationships between species or sequences
25The Tree of Life
26Ribosomal Small Subunit RNA Tree of Life
[Figure: tree of life based on small-subunit ribosomal RNA, divided into three domains: Bacteria, Archaea, and Eukarya. Leaf labels: Gram-positive bacteria, Green sulfur bacteria, Flavobacteria, Purple bacteria, Cyanobacteria, Thermotoga, Thermus, Aquifex, Methanobacterium, Methanococcus, Methanopyrus, Thermococcus, Thermoproteus, Archaeoglobus, Pyrodictium, Halococcus, Sulfolobus, pJP27, pSL17, pJP78, pSL12, Dinoflagellates, Trypanosoma, Entamoebae, Slime molds, Brown algae, Green algae, Red algae, Animals, Euglena, Diatoms, Ciliates, Fungi, Giardia, Plants]
27Conclusion
- Lots of interesting data and real problems
- Can be incorporated into CS courses at all levels