Title: Bio Informatics and Machine Intelligence
1Bio Informatics and Machine Intelligence
- Nicholas Flann
- Computer Science
- (based on a bioinformatics class of Larry Hunter)
2Overview
3Molecular biology databases
- There is a tremendous amount of information about bio-molecules in publicly available databases.
- Computational tools are needed to efficiently find and process the data
4Example of data growth
5Data about Databases
- Nucleic Acids Research publishes an annual database issue. The 2003 issue lists 348 editorially selected databases
- Small excerpt from the A's
- AARSDB: Aminoacyl-tRNA synthetase sequences
- ABCdb: ABC transporters
- AceDB: C. elegans, S. pombe, and human sequences and genomic information
- ACTIVITY: Functional DNA/RNA site activity
- ALFRED: Allele frequencies and DNA polymorphisms
6What can be discovered about a gene by a database
search?
- Evolutionary information: homologous genes, taxonomic distributions, etc.
- Genomic information: chromosomal location, regulatory regions, shared domains, etc.
- Structural information: associated protein structures, fold types, structural domains
- Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
- Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
7Searching for information about genes and their
products
- Gene and gene product databases are often organized by sequence
- Genomic sequence encodes all traits of an organism.
- Gene products are uniquely described by their sequences.
- Similar sequences among bio-molecules indicate both similar function and an evolutionary relationship
- Macromolecular sequences provide biologically meaningful keys for searching databases
8Searching sequence databases
- Many kinds of input sequences
- Could be amino acid or nucleotide sequence
- Genomic or mRNA/cDNA or protein sequence
- Complete or fragmentary sequences
- Goal is to retrieve a set of similar sequences
10BLAST Searching with a sequence
- Goal is to find other sequences that are more similar to the query than would be expected by chance.
- Can start with a nucleotide or amino acid sequence, and search for either (or both)
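The idea of "more similar than would be expected by chance" can be illustrated with a shuffle test. This is only a sketch under stated assumptions: it is not how BLAST computes its statistics, the scoring (count of identical positions at a fixed offset) is deliberately crude, and the names `identity_score` and `empirical_p_value` are invented for this example.

```python
import random


def identity_score(a, b):
    """Count positions where the two sequences carry the same letter."""
    return sum(1 for x, y in zip(a, b) if x == y)


def empirical_p_value(query, subject, n_shuffles=200, seed=0):
    """Estimate how often a shuffled subject scores at least as well as the real one.

    Shuffling keeps the letter composition of the subject but destroys its order,
    which gives a crude model of a "chance" match.
    """
    rng = random.Random(seed)
    observed = identity_score(query, subject)
    letters = list(subject)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if identity_score(query, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_shuffles + 1)


# First 40 bases of the example sequence, compared against itself.
query = "atgcacttgagcagggaagaaatccacaaggactcaccag"
p_self = empirical_p_value(query, query)
```

A small value of `p_self` says the observed match is very unlikely to arise from letter composition alone.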
12Example sequence
- atgcacttgagcagggaagaaatccacaaggactcaccagtctcctggtc
tgcagagaagacagaatcaacatgagcacagcaggaaaagtaatcaaatg
caaagcagctgtgctatgggagttaaagaaacccttttccattgaggagg
tggaggttgcacctcctaaggcccatgaagttcgtattaagatggtggct
gtaggaatctgtggcacagatgaccacgtggttagtggtaccatggtgac
cccacttcctgtgattttaggccatgaggcagccggcatcgtggagagtg
ttggagaaggggtgactacagtcaaaccaggtgataaagtcatcccactc
gctattcctcagtgtggaaaatgcagaatttgtaaaaacccggagagcaa
ctactgcttgaaaaacgatgtaagcaatcctcaggggaccctgcaggatg
gcaccagcaggttcacctgcaggaggaagcccatccaccacttccttggc
atcagcaccttctcacagtacacagtggtggatgaaaatgcagtagccaa
aattgatgcagcctcgcctctagagaaagtctgtctcattggctgtggat
tttcaactggttatgggtctgcagtcaatgttgccaaggtcaccccaggc
tctacctgtgctgtgtttggcctgggaggggtcggcctatctgctattat
gggctgtaaagcagctggggcagccagaatcattgcggtggacatcaaca
aggacaaatttgcaaaggccaaagagttgggtgccactgaatgcatcaac
cctcaagactacaagaaacccatccaggaggtgctaaaggaaatgactga
tggaggtgtggatttttcatttgaagtcatcggtcggcttgacaccatga
tggcttccctgttatgttgtcatgaggcatgtggcacaagtgtcatcgta
ggggtacctcctgattcccaaaacctctcaatgaaccctatgctgctact
gactggacgtacctggaagggagctattcttggtggctttaaaagtaaag
aatgtgtcccaaaacttgtggctgattttatggctaagaagttttcattg
gatgcattaataacccatgttttaccttttgaaaaaataaatgaaggatt
tgacctgcttcactctgggaaaagtatccgtaccattctgatgttttgag
acaatacagatgttttcccttgtggcagtcttcagcctcctctaccctac
atgatctggagcaacagctgggaaatatcattaattctgctcatcacaga
ttttatcaataaattacatttgggggctttccaaagaaatggaaattgat
gtaaaattatttttcaagcaaatgtttaaaatccaaatgagaactaaata
aagtgttgaacatcagctggggaattgaagccaataaaccttccttctta
accatt
29The complexity of database search
- Several algorithms for exact methods
- exhaustive (linear time, constant space)
- indexed (log time, linear space)
- hash tables (constant time, linear space)
- Inexact search methods
- Dynamic programming (Smith-Waterman)
- BLAST
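The trade-off between the exhaustive and hash-table methods listed above can be sketched as follows. The toy database, the word length `k=6`, and the function names are assumptions made for illustration; the index only answers queries of exactly `k` letters.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every length-k substring to the (sequence id, offset) pairs where it occurs.

    Building the index costs time and space linear in the database size.
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def exhaustive_search(sequences, word):
    """Linear scan: look for an exact occurrence of `word` in every sequence."""
    return sorted(seq_id for seq_id, seq in sequences.items() if word in seq)


def indexed_search(index, word):
    """Hash lookup: expected constant time per query once the index exists."""
    return sorted({seq_id for seq_id, _ in index.get(word, [])})


toy_db = {"seq1": "atgcacttgagc", "seq2": "ttgagcaggg", "seq3": "cccaaattt"}
toy_index = build_kmer_index(toy_db, k=6)
hits_scan = exhaustive_search(toy_db, "ttgagc")
hits_hash = indexed_search(toy_index, "ttgagc")
```

Both searches return the same hits; the index pays its cost once, up front, in memory.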
30Pairwise Sequence Alignment
- An alignment is a mapping from one sequence to another, identifying elements that are likely to have arisen from a common ancestor
- Alignments are NOT exact matches.
31Similarity
- Similarity can be defined by counting positions that are identical between two sequences
- Gaps (insertions/deletions) can be important
abcdef   abcdef   abcdef
abceef   acdef    a-cdef
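Counting identical positions, and the effect of a gap on that count, can be sketched as below. The function name `identity` and the choice of "-" as the gap character are assumptions for this example; the inputs are the three pairs from the slide, with the ungapped "acdef" padded at the end so both strings have equal length.

```python
def identity(aligned_a, aligned_b, gap="-"):
    """Return (matches, fraction) for two aligned strings of equal length.

    A column counts as a match only if both letters are equal and not a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return matches, matches / len(aligned_a)


substitution = identity("abcdef", "abceef")   # one substituted letter
shifted = identity("abcdef", "acdef-")        # deletion, no gap placed
gapped = identity("abcdef", "a-cdef")         # same deletion, gap placed
```

Without the gap, a single deletion shifts every later letter out of register and similarity collapses; placing the gap recovers it.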
32How many alignments?
- Without gaps, there are N+M-1 possible alignments (one per relative offset) between sequences of length N and M
- Once we start allowing gaps, there are many possible arrangements to consider
abcbcd   abcbcd   abcbcd
abc--d   a--bcd   ab--cd
- Allowing every possible pairing between elements, the number of possible alignments grows exponentially with the sequence lengths.
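How quickly the number of gapped alignments grows can be made concrete with a small count. This sketch assumes one particular way of counting: every alignment is a path through an (N+1) x (M+1) grid using a pairing step, a gap in the first sequence, or a gap in the second (these counts are known as Delannoy numbers). Other counting conventions give different totals, and `count_gapped_alignments` is a name invented here.

```python
def count_gapped_alignments(n, m):
    """Count grid paths from (0, 0) to (n, m) using right, down, and diagonal steps."""
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Last column of the alignment is a gap in one sequence,
            # a gap in the other, or a pairing of the two last letters.
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[n][m]


counts = [count_gapped_alignments(k, k) for k in range(1, 6)]
```

Under this convention the count for two sequences of equal length grows by roughly a factor of five to six per added letter, which is why enumerating alignments is not practical.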
33Finding the optimal alignment
- Dynamic programming identifies optimal alignments in time proportional to the product of the lengths of the sequences
34Dynamic programming
- The key idea is to start aligning the sequences left to right: once a prefix is optimally aligned, nothing about the remainder of the alignment changes the alignment of the prefix.
- We construct a matrix of possible alignment scores (N×M cells; on the order of N×M² calculations in the worst case with general gap penalties)
- Called Needleman-Wunsch (global alignment) or Smith-Waterman (local alignment)
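A minimal sketch of the global (Needleman-Wunsch style) recurrence follows. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, and the gap penalty is a simple per-position cost, so each cell is filled in constant time.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning prefix a[:i] with prefix b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("abcdef", "acdef")
```

With these scores the example recovers the gapped pairing of abcdef with a-cdef: five matches and one gap.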
35Alignment matrix
- Create a matrix with each sequence to be aligned along one edge and the score of the alignment of each pair of elements in a cell.
- Without gaps, the best local alignment is just the highest-scoring diagonal
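Finding the highest-scoring diagonal can be sketched by scanning every diagonal of the matrix and keeping the best-scoring contiguous run. The match/mismatch scores and the name `best_diagonal` are assumptions for this example, and no gaps are considered.

```python
def best_diagonal(a, b, match=1, mismatch=-1):
    """Return (score, start_a, start_b, length) of the best ungapped local alignment."""
    best = (0, 0, 0, 0)
    # Each diagonal is identified by offset = i - j.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        running, start_i, start_j = 0, i, j
        while i < len(a) and j < len(b):
            running += match if a[i] == b[j] else mismatch
            if running <= 0:
                # A run that has dropped to zero cannot help; restart after it.
                running, start_i, start_j = 0, i + 1, j + 1
            elif running > best[0]:
                best = (running, start_i, start_j, i - start_i + 1)
            i += 1
            j += 1
    return best


hit = best_diagonal("xxabcyy", "zabcw")
```

Here `hit` reports the shared run "abc", starting at position 2 in the first string and position 1 in the second.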
36Dynamic programming matrix
- Each cell has the score for the best aligned
sequence prefix up to that position.
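The local (Smith-Waterman style) variant of the same matrix can be sketched as below: each cell holds the best score of an alignment ending at that position, and scores are floored at zero so an alignment may start anywhere. The scoring values and the function name are assumptions, and only the score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the matrix
    for i in range(1, len(a) + 1):
        curr = [0]  # first column is always zero for local alignment
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,                   # start a fresh alignment here
                       prev[j - 1] + sub,   # pair a[i-1] with b[j-1]
                       prev[j] + gap,       # gap in b
                       curr[j - 1] + gap)   # gap in a
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


local_score = smith_waterman_score("xxabcyy", "zzabcww")
```

Keeping only two rows is enough when just the score is needed; recovering the alignment would require the full matrix or traceback pointers.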
45Why predict protein structure?
- Neither crystallography nor NMR can keep pace with genome sequencing efforts
- Computer scientists love this problem
- Understandable with minimal biology
- Seems like a good discrimination task
- Understand the mechanisms of folding (?)
- First computational Nobel prize?
46Protein structure
- Amino acid sequence alone should be enough to determine protein structure
- However, the physics are daunting
- 20,000 protein atoms, plus equal amounts of water
- Folding can take seconds (most chemical reactions take place 10^12, i.e. 1,000,000,000,000x, faster)
- Empirical determinations of protein structure are advancing rapidly
47Protein structure cartoons
48Protein Structure Representations
- Different visualizations show various aspects of structure
49Competition
- Critical Assessment of Structure Prediction (CASP) competition since 1994
- Solved, but unpublished structures are posted in May, predictions due in September
- Various categories
- Relation to existing structures, ab initio, homology, fold, etc.
- Partial vs. fully automated approaches
- Produces lots of information about what aspects of the problems are hard, and ends arguments about test sets.
- Results showing steady improvement, and the value of integrative approaches.
50Reconstructing Phylogenies
- The only universal biological fact is that all species are related by descent.
- The genealogical history of life can be represented by an evolutionary tree (or phylogeny), with contemporary organisms at the leaves.
- Goal of phylogenetic inference is to reconstruct the order of splitting events (and perhaps the distances between them).
51The phylogeny problem
- Input
- A set of contemporary species (S) whose evolutionary relationship is to be reconstructed
- A set of inheritable characteristics (C) that describe each species. Characteristics can be
- quantitative (continuous, e.g. size)
- qualitative (discrete, e.g. gene sequence)
- Output
- Tree (perhaps with branch lengths) which best fits the data.
- Assumptions
- Common ancestor
52Why do it?
- Resolve evolutionary history
- Constructing vaccines
- Want to assure that vaccine is constructed to address diverse strains of the disease (e.g. influenza)
- Epidemiology
- Reconstruct paths of infection, either of an individual or generally. (e.g. Crandall's Evolution of HIV)
53Tree reconstruction as optimization
- Number of possible trees grows exponentially with the number of species S (roughly 2^S). Considering branch lengths makes the problem harder.
- Want the tree that maximizes some quality score. Score based on either
- Characters (e.g. sites in a sequence) directly, or
- Distances between character sets (e.g. alignment scores)
- Another NP-hard optimization problem...
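The size of the search space can be made concrete by counting trees. This sketch assumes one specific convention: rooted, strictly bifurcating trees with labelled leaves, which number (2S-3)!! = 1 x 3 x 5 x ... x (2S-3). Other conventions (unrooted trees, multifurcations) give different counts, and the function name is invented for this example.

```python
def count_rooted_binary_trees(num_species):
    """Return (2S-3)!!, the number of rooted bifurcating trees on S labelled leaves."""
    if num_species < 1:
        raise ValueError("need at least one species")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2S-3 (empty product for S <= 2).
    for k in range(3, 2 * num_species - 2, 2):
        count *= k
    return count


tree_counts = {s: count_rooted_binary_trees(s) for s in (2, 3, 4, 5, 10)}
```

Under this convention ten species already allow more than 34 million trees, so exhaustive scoring of every tree is only feasible for very small inputs.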
54Managing the Biomedical Literature
- Challenges are huge
- 700,000 articles/year, accelerating 10%/year! 16,048 last week
- Large number of relevant articles for most topics
- (Sub-)Disciplinary boundaries are breaking down
- Computational tools are crucial
- Information retrieval (finding the right articles)
- Organization and dissemination of articles and data
- Information extraction
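Information retrieval, in the sense of finding the right articles, can be sketched with a simple term-weighting scheme. Everything here is an assumption made for illustration: the TF-IDF style weighting, the whitespace tokenizer, the function names, and the three invented one-line "abstracts". Real literature search systems are far more elaborate.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w]


def rank_documents(query, documents):
    """Return document ids ordered by a TF-IDF style overlap score with the query."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    # Number of documents each term appears in.
    doc_freq = Counter(term for c in counts.values() for term in c)
    scores = {}
    for doc_id, c in counts.items():
        score = 0.0
        for term in set(tokenize(query)):
            if term in c:
                # Rare terms get a larger weight than common ones.
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += c[term] * idf
        scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))


toy_abstracts = {
    "a": "BLAST finds similar protein sequences",
    "b": "Protein structure prediction with CASP",
    "c": "Phylogeny of influenza strains",
}
ranking = rank_documents("protein structure", toy_abstracts)
```

The abstract matching both query terms ranks first, the one matching only the more common term second, and the unrelated one last.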
55Natural Language Processing
- NLP is many tools and technologies.
- Natural language understanding: capture the meaning and implications of texts. A mostly unobtainable goal, but much useful theory
- Using documents as evidence for gene function or other kinds of inference
- Information extraction: capture the meaning and implications of certain restricted aspects of text
- Document management: reasoning about documents, rather than the texts they contain
- Information retrieval (including Q/A)
- Browsing and organizing large collections of documents
56Summary
- Many significant and difficult computational problems
- Need computer scientists to develop new efficient algorithms and methodologies
- Many different fields
- Graphics, Algorithms, Machine Intelligence, Machine Vision, Optimization, Databases, etc.