Bio Informatics and Machine Intelligence - PowerPoint PPT Presentation


1
Bio Informatics and Machine Intelligence

Nicholas Flann
Computer Science
(based on a bioinformatics class of Larry Hunter)

2
Overview
3
Molecular biology databases

There is a tremendous amount of information about
bio-molecules in publicly available databases.
Computational Tools are needed to efficiently
find and process the data

4
Example of data growth
5
Data about Databases

Nucleic Acids Research publishes an annual
database issue. The 2003 issue lists 348 editorially
selected databases
Small excerpt from the A's:
AARSDB: Aminoacyl-tRNA synthetase sequences
ABCdb: ABC transporters
AceDB: C. elegans, S. pombe, and human sequences
and genomic information
ACTIVITY: Functional DNA/RNA site activity
ALFRED: Allele frequencies and DNA polymorphisms

6
What can be discovered about a gene by a database
search?

Evolutionary information: homologous genes,
taxonomic distributions, etc.
Genomic information: chromosomal location,
regulatory regions, shared domains, etc.
Structural information: associated protein
structures, fold types, structural domains
Expression information: expression specific to
particular tissues, developmental stages,
phenotypes, diseases, etc.
Functional information: enzymatic/molecular
function, pathway/cellular role, localization,
role in diseases

7
Searching for information about genes and their
products

Gene and gene product databases are often
organized by sequence
Genomic sequence encodes all traits of an
organism.
Gene products are uniquely described by their
sequences.
Similar sequences among bio-molecules indicate
both similar function and an evolutionary
relationship
Macromolecular sequences provide biologically
meaningful keys for searching databases

8
Searching sequence databases

Many kinds of input sequences
Could be amino acid or nucleotide sequence
Genomic or mRNA/cDNA or protein sequence
Complete or fragmentary sequences
Goal is to retrieve a set of similar sequences

10
BLAST: Searching with a sequence

The goal is to find other sequences that are more
similar to the query than would be expected by
chance.
Can start with nucleotide or amino acid sequence,
and search for either (or both)

12
Example sequence

atgcacttgagcagggaagaaatccacaaggactcaccagtctcctggtc
tgcagagaagacagaatcaacatgagcacagcaggaaaagtaatcaaatg
caaagcagctgtgctatgggagttaaagaaacccttttccattgaggagg
tggaggttgcacctcctaaggcccatgaagttcgtattaagatggtggct
gtaggaatctgtggcacagatgaccacgtggttagtggtaccatggtgac
cccacttcctgtgattttaggccatgaggcagccggcatcgtggagagtg
ttggagaaggggtgactacagtcaaaccaggtgataaagtcatcccactc
gctattcctcagtgtggaaaatgcagaatttgtaaaaacccggagagcaa
ctactgcttgaaaaacgatgtaagcaatcctcaggggaccctgcaggatg
gcaccagcaggttcacctgcaggaggaagcccatccaccacttccttggc
atcagcaccttctcacagtacacagtggtggatgaaaatgcagtagccaa
aattgatgcagcctcgcctctagagaaagtctgtctcattggctgtggat
tttcaactggttatgggtctgcagtcaatgttgccaaggtcaccccaggc
tctacctgtgctgtgtttggcctgggaggggtcggcctatctgctattat
gggctgtaaagcagctggggcagccagaatcattgcggtggacatcaaca
aggacaaatttgcaaaggccaaagagttgggtgccactgaatgcatcaac
cctcaagactacaagaaacccatccaggaggtgctaaaggaaatgactga
tggaggtgtggatttttcatttgaagtcatcggtcggcttgacaccatga
tggcttccctgttatgttgtcatgaggcatgtggcacaagtgtcatcgta
ggggtacctcctgattcccaaaacctctcaatgaaccctatgctgctact
gactggacgtacctggaagggagctattcttggtggctttaaaagtaaag
aatgtgtcccaaaacttgtggctgattttatggctaagaagttttcattg
gatgcattaataacccatgttttaccttttgaaaaaataaatgaaggatt
tgacctgcttcactctgggaaaagtatccgtaccattctgatgttttgag
acaatacagatgttttcccttgtggcagtcttcagcctcctctaccctac
atgatctggagcaacagctgggaaatatcattaattctgctcatcacaga
ttttatcaataaattacatttgggggctttccaaagaaatggaaattgat
gtaaaattatttttcaagcaaatgtttaaaatccaaatgagaactaaata
aagtgttgaacatcagctggggaattgaagccaataaaccttccttctta
accatt

29
The complexity of database search

Several algorithms for exact methods
exhaustive (linear time, constant space)
indexed (log time, linear space)
hash tables (constant time, linear space)
Inexact search methods
Dynamic programming (Smith-Waterman)
BLAST
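
As a rough illustration of the exact-search trade-offs listed above, the sketch below contrasts an exhaustive scan with a hash-table lookup of fixed-length words. The three short sequences, the word length of 4, and the function names are toy assumptions made for this example, not part of any real database or tool.

```python
from collections import defaultdict


def linear_scan(database, query):
    """Exhaustive search: test every sequence in turn (time grows with database size)."""
    return [name for name, seq in database.items() if query in seq]


def build_kmer_index(database, k):
    """Hash table from every k-letter word to the (name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


# Toy database (hypothetical names and short fragments, for illustration only).
database = {
    "seq1": "atgcacttgagcagggaag",
    "seq2": "ttgcaaaggccaaagagtt",
    "seq3": "ggcacagatgaccacgtgg",
}

index = build_kmer_index(database, k=4)       # built once, uses linear extra space
hits_scan = linear_scan(database, "gcac")     # re-reads the whole database per query
hits_hash = sorted({name for name, _ in index.get("gcac", [])})  # one dictionary lookup
```

Both approaches return the same sequences here; the index pays its cost up front in memory so that each later query is a single lookup.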

30
Pairwise Sequence Alignment

An alignment is a mapping from one sequence to
another, identifying elements that are likely to
have arisen from a common ancestor
Alignments are NOT exact matches.

31
Similarity

Similarity can be defined by counting positions
that are identical between two sequences
Gaps (insertions/deletions) can be important
abcdef   abcdef   abcdef
abceef   acdef    a-cdef
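
A minimal sketch of the identity count described above, applied to the three example pairs. The function name is hypothetical, and padding the ungapped "acdef" with a trailing "-" so both strings have equal length is an assumption made only for this illustration.

```python
def count_identities(a, b, gap="-"):
    """Count columns where two equal-length aligned strings carry the same letter."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x == y and x != gap)


substitution = count_identities("abcdef", "abceef")  # single substitution
ungapped = count_identities("abcdef", "acdef-")      # deletion ignored, so letters shift out of register
gapped = count_identities("abcdef", "a-cdef")        # gap placed where 'b' is missing
```

With these inputs the substitution pair and the gapped pair both score 5 identities, while the ungapped pairing scores only 1, which is why gap placement matters.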

32
How many alignments?

Without gaps, there are N+M-1 possible
alignments between sequences of length N and M
(one for each relative offset with at least one
overlapping position)
Once we start allowing gaps, there are many
possible arrangements to consider
abcbcd   abcbcd   abcbcd
abc--d   a--bcd   ab--cd
Considering every possible pairing between elements
gives roughly N^M possible alignments, far too
many to enumerate.
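
To get a feel for how quickly the number of alignments grows, the sketch below counts them under one common convention: an ungapped alignment is a relative offset with at least one overlapping position, and a gapped alignment is any sequence of columns that are either a pair of letters, a gap in the first sequence, or a gap in the second. Both function names and the counting convention are assumptions for this illustration; other conventions give different totals.

```python
from functools import lru_cache


def ungapped_offsets(n, m):
    """Ways to slide one sequence along the other with at least one overlapping position."""
    return n + m - 1


@lru_cache(maxsize=None)
def gapped_alignments(n, m):
    """Count global alignments of sequences of lengths n and m.

    The last column is a letter pair, a gap in the second sequence,
    or a gap in the first sequence, which gives a three-term recurrence.
    """
    if n == 0 or m == 0:
        return 1  # only one way: the remaining letters all face gaps
    return (gapped_alignments(n - 1, m - 1)
            + gapped_alignments(n - 1, m)
            + gapped_alignments(n, m - 1))


offsets = ungapped_offsets(6, 4)     # e.g. lengths of "abcbcd" and "abcd"
with_gaps = gapped_alignments(6, 4)
```

For lengths 6 and 4 there are 9 ungapped offsets but 1289 gapped alignments under this convention, and the gapped count keeps growing exponentially as the sequences get longer.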

33
Finding the optimal alignment

Dynamic programming identifies optimal alignments
in time proportional to the product of the lengths
of the sequences

34
Dynamic programming

The key idea is to start aligning the sequences
left to right: once a prefix is optimally
aligned, nothing about the remainder of the
alignment changes the alignment of the prefix.
We construct a matrix of possible alignment
scores (N x M cells; up to N x M^2 calculations in
the worst case, when every possible gap length
must be examined for each cell)
Called Needleman-Wunsch (global alignment) or
Smith-Waterman (local alignment)

35
Alignment matrix

Create a matrix with each sequence to be aligned
along one edge and the score of the alignment of
each pair of elements in a cell.
Best ungapped local alignment is just the
highest-scoring diagonal
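
A simplified sketch of the diagonal idea above: it scores each diagonal of the pairwise match matrix by the number of identical letter pairs on it and reports the best one. Scoring a whole diagonal rather than its best-scoring segment, and using plain identity instead of a substitution matrix, are simplifying assumptions; the function name is hypothetical.

```python
def best_diagonal(a, b):
    """Return (score, offset) of the diagonal with the most identical pairs.

    offset = j - i, where letter a[i] is paired with letter b[j].
    """
    best = (0, 0)
    for offset in range(-(len(a) - 1), len(b)):
        score = sum(
            1
            for i in range(len(a))
            if 0 <= i + offset < len(b) and a[i] == b[i + offset]
        )
        if score > best[0]:
            best = (score, offset)
    return best


score, offset = best_diagonal("abcdef", "acdef")
```

For "abcdef" against "acdef" the best diagonal has offset -1 (the second sequence shifted by one position) and pairs up "cdef" for a score of 4.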

36
Dynamic programming matrix

Each cell has the score for the best aligned
sequence prefix up to that position.
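
The matrix described above can be sketched as a small global-alignment routine in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1), the tie-breaking order in the traceback, and the function name are illustrative assumptions, not values taken from the slides.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)

    def pair(i, j):
        # Score for pairing a[i-1] with b[j-1] (1-based matrix coordinates).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning prefix a[:i] with prefix b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # pair the two letters
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right cell, preferring a letter pair on ties.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = global_align("abcdef", "acdef")
```

With these scores, aligning "abcdef" with "acdef" recovers the gapped alignment "a-cdef" with a total of 4 (five matches minus one gap). Filling the matrix visits each of the N x M cells once, which is where the quadratic running time comes from.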

45
Why predict protein structure?

Neither crystallography nor NMR can keep pace
with genome sequencing efforts
Computer scientists love this problem
Understandable with minimal biology
Seems like a good discrimination task
Understand the mechanisms of folding (?)
First computational Nobel prize?

46
Protein structure

Amino acid sequence alone should be enough to
determine protein structure
However, the physics are daunting
20,000 protein atoms, plus equal amounts of
water
Folding can take seconds (most chemical reactions
take place 10^12, or 1,000,000,000,000 times, faster)
Empirical determinations of protein structure are
advancing rapidly

47
Protein structure cartoons
48
Protein Structure Representations

Different visualizations show various aspects
of structure

49
Competition

Critical Assessment of Structure Prediction
competition since 1994
Solved, but unpublished structures are posted in
May, predictions due in September
Various categories
Relation to existing structures, ab initio,
homology, fold, etc.
Partial vs. Fully automated approaches
Produces lots of information about what aspects
of the problems are hard, and ends arguments
about test sets.
Results showing steady improvement, and the value
of integrative approaches.

50
Reconstructing Phylogenies

The only universal biological fact is that all
species are related by descent.
The genealogical history of life can be
represented by an evolutionary tree (or
phylogeny), with contemporary organisms at the
leaves.
Goal of phylogenetic inference is to reconstruct
the order of splitting events (and perhaps the
distances between them).

51
The phylogeny problem

Input
A set of contemporary species (S) whose
evolutionary relationship is to be reconstructed
A set of inheritable characteristics (C) that
describe each species. Characteristics can be
quantitative (continuous, e.g. size)
qualitative (discrete, e.g. gene sequence)
Output
Tree (perhaps branch lengths) which best fits the
data.
Assumptions
Common ancestor

52
Why do it?

Resolve evolutionary history
Constructing vaccines
Want to assure that vaccine is constructed to
address diverse strains of the disease (e.g.
influenza)
Epidemiology
Reconstruct paths of infection, either of an
individual or generally. (e.g. Crandall's
Evolution of HIV)

53
Tree reconstruction as optimization

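Number of possible trees grows exponentially with
the number of species S (roughly 2^S, and faster
still for fully resolved binary trees). Considering
branch lengths makes the problem harder.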
Want the tree that maximizes some quality score.
Score based on either
Characters (e.g. sites in a sequence) directly,
or
Distances between character sets (e.g. alignment
scores)
Another NP-hard optimization problem...
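
To make the size of the search space concrete, the sketch below counts rooted, fully resolved (binary) trees on S labelled species using the standard double-factorial formula (2S-3)!!. Restricting the count to rooted binary topologies without branch lengths is an assumption of this example, and the function name is hypothetical.

```python
def rooted_binary_trees(s):
    """Number of rooted binary tree topologies with s labelled leaves: (2s-3)!!"""
    if s < 2:
        raise ValueError("need at least two species")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2s-3 (an empty product when s == 2).
    for k in range(3, 2 * s - 2, 2):
        count *= k
    return count


tree_counts = {s: rooted_binary_trees(s) for s in (2, 3, 4, 5, 10)}
```

Ten species already allow 34,459,425 distinct topologies, so exhaustive scoring of every tree is only feasible for very small species sets.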

54
Managing the Biomedical Literature

Challenges are huge
700,000 articles/year, accelerating 10% per
year; 16,048 last week
Large number of relevant articles for most topics
(Sub-)Disciplinary boundaries are breaking down
Computational tools are crucial
Information retrieval (finding the right
articles)
Organization and dissemination of articles and data
Information extraction

55
Natural Language Processing

NLP is many tools and technologies.
Natural language understanding: capture the
meaning and implications of texts. A mostly
unobtainable goal, but much useful theory
Using documents as evidence for gene function or
other kinds of inference
Information extraction: capture the meaning and
implications of certain restricted aspects of
text
Document management: reasoning about documents,
rather than the texts they contain
Information retrieval (including Q/A)
Browsing and organizing large collections of
documents

56
Summary

Many Significant and Difficult Computational
Problems
Need Computer Scientists to Develop new efficient
algorithms and methodologies
Many different fields
Graphics, Algorithms, Machine Intelligence,
Machine Vision, Optimization, Databases, etc.
