Bio Informatics and Machine Intelligence

1
Bio Informatics and Machine Intelligence
  • Nicholas Flann
  • Computer Science
  • (based on a bioinformatics class of Larry Hunter)

2
Overview
3
Molecular biology databases
  • There is a tremendous amount of information about
    bio-molecules in publicly available databases.
  • Computational Tools are needed to efficiently
    find and process the data

4
Example of data growth
5
Data about Databases
  • Nucleic Acids Research publishes an annual
    database issue. The 2003 issue lists 348
    editorially selected databases
  • Small excerpt from the A's:
  • AARSDB: Aminoacyl-tRNA synthetase sequences
  • ABCdb: ABC transporters
  • AceDB: C. elegans, S. pombe, and human sequences
    and genomic information
  • ACTIVITY: Functional DNA/RNA site activity
  • ALFRED: Allele frequencies and DNA polymorphisms

6
What can be discovered about a gene by a database
search?
  • Evolutionary information: homologous genes,
    taxonomic distributions, etc.
  • Genomic information: chromosomal location,
    regulatory regions, shared domains, etc.
  • Structural information: associated protein
    structures, fold types, structural domains
  • Expression information: expression specific to
    particular tissues, developmental stages,
    phenotypes, diseases, etc.
  • Functional information: enzymatic/molecular
    function, pathway/cellular role, localization,
    role in diseases

7
Searching for information about genes and their
products
  • Gene and gene product databases are often
    organized by sequence
  • Genomic sequence encodes all traits of an
    organism.
  • Gene products are uniquely described by their
    sequences.
  • Similar sequences among bio-molecules often indicate
    both similar function and an evolutionary
    relationship
  • Macromolecular sequences provide biologically
    meaningful keys for searching databases

8
Searching sequence databases
  • Many kinds of input sequences
  • Could be amino acid or nucleotide sequence
  • Genomic or mRNA/cDNA or protein sequence
  • Complete or fragmentary sequences
  • Goal is to retrieve a set of similar sequences

10
BLAST: Searching with a sequence
  • Goal is to find other sequences that are more
    similar to the query than would be expected by
    chance.
  • Can start with nucleotide or amino acid sequence,
    and search for either (or both)

12
Example sequence
  • atgcacttgagcagggaagaaatccacaaggactcaccagtctcctggtc
    tgcagagaagacagaatcaacatgagcacagcaggaaaagtaatcaaatg
    caaagcagctgtgctatgggagttaaagaaacccttttccattgaggagg
    tggaggttgcacctcctaaggcccatgaagttcgtattaagatggtggct
    gtaggaatctgtggcacagatgaccacgtggttagtggtaccatggtgac
    cccacttcctgtgattttaggccatgaggcagccggcatcgtggagagtg
    ttggagaaggggtgactacagtcaaaccaggtgataaagtcatcccactc
    gctattcctcagtgtggaaaatgcagaatttgtaaaaacccggagagcaa
    ctactgcttgaaaaacgatgtaagcaatcctcaggggaccctgcaggatg
    gcaccagcaggttcacctgcaggaggaagcccatccaccacttccttggc
    atcagcaccttctcacagtacacagtggtggatgaaaatgcagtagccaa
    aattgatgcagcctcgcctctagagaaagtctgtctcattggctgtggat
    tttcaactggttatgggtctgcagtcaatgttgccaaggtcaccccaggc
    tctacctgtgctgtgtttggcctgggaggggtcggcctatctgctattat
    gggctgtaaagcagctggggcagccagaatcattgcggtggacatcaaca
    aggacaaatttgcaaaggccaaagagttgggtgccactgaatgcatcaac
    cctcaagactacaagaaacccatccaggaggtgctaaaggaaatgactga
    tggaggtgtggatttttcatttgaagtcatcggtcggcttgacaccatga
    tggcttccctgttatgttgtcatgaggcatgtggcacaagtgtcatcgta
    ggggtacctcctgattcccaaaacctctcaatgaaccctatgctgctact
    gactggacgtacctggaagggagctattcttggtggctttaaaagtaaag
    aatgtgtcccaaaacttgtggctgattttatggctaagaagttttcattg
    gatgcattaataacccatgttttaccttttgaaaaaataaatgaaggatt
    tgacctgcttcactctgggaaaagtatccgtaccattctgatgttttgag
    acaatacagatgttttcccttgtggcagtcttcagcctcctctaccctac
    atgatctggagcaacagctgggaaatatcattaattctgctcatcacaga
    ttttatcaataaattacatttgggggctttccaaagaaatggaaattgat
    gtaaaattatttttcaagcaaatgtttaaaatccaaatgagaactaaata
    aagtgttgaacatcagctggggaattgaagccaataaaccttccttctta
    accatt

29
The complexity of database search
  • Several algorithms for exact methods
    • exhaustive (linear time, constant space)
    • indexed (log time, linear space)
    • hash tables (constant time, linear space)
  • Inexact search methods
    • Dynamic programming (Smith-Waterman)
    • BLAST
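The three exact-match strategies listed above can be compared with a minimal Python sketch. The function names and the four-entry toy database are made up for illustration; real sequence databases use far more elaborate indexes.

```python
import bisect
from collections import defaultdict

def exhaustive_search(database, query):
    """Linear scan: compare the query with every entry, no extra storage."""
    return [i for i, seq in enumerate(database) if seq == query]

def build_sorted_index(database):
    """Sorted (sequence, position) pairs: linear extra space."""
    return sorted((seq, i) for i, seq in enumerate(database))

def indexed_search(sorted_index, query):
    """Binary search (log time) to the first hit, then collect equal entries."""
    lo = bisect.bisect_left(sorted_index, (query,))
    hits = []
    while lo < len(sorted_index) and sorted_index[lo][0] == query:
        hits.append(sorted_index[lo][1])
        lo += 1
    return hits

def build_hash_index(database):
    """Hash table from sequence to positions: linear extra space."""
    index = defaultdict(list)
    for i, seq in enumerate(database):
        index[seq].append(i)
    return index

def hash_search(hash_index, query):
    """Expected constant-time lookup."""
    return hash_index.get(query, [])

# Hypothetical toy database, for illustration only.
database = ["acgt", "ttga", "acgt", "ggcc"]
sorted_index = build_sorted_index(database)
hash_index = build_hash_index(database)

print(exhaustive_search(database, "acgt"))
print(indexed_search(sorted_index, "acgt"))
print(hash_search(hash_index, "acgt"))
```

All three return the same positions; they differ only in how much work each query costs and how much memory the index needs. None of them finds near matches, which is why the inexact methods are needed.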

30
Pairwise Sequence Alignment
  • An alignment is a mapping from one sequence to
    another, identifying elements that are likely to
    have arisen from a common ancestor
  • Alignments are NOT exact matches.

31
Similarity
  • Similarity can be defined by counting positions
    that are identical between two sequences
  • Gaps (insertions/deletions) can be important
    abcdef   abcdef   abcdef
    abceef   acdef    a-cdef
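A short Python sketch of identity counting on the three example pairs. The function name is an assumption, and treating '-' as a gap that never counts as a match is a convention chosen here.

```python
def count_identities(a, b):
    """Count aligned positions that hold the same character.

    Positions are compared column by column; a gap character '-'
    never counts as an identity.
    """
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")

print(count_identities("abcdef", "abceef"))  # one substitution
print(count_identities("abcdef", "acdef"))   # deletion, no gap inserted
print(count_identities("abcdef", "a-cdef"))  # same deletion, shown as a gap
```

Without the gap, the deletion shifts every later character out of register and only the first position matches; inserting the gap restores five identities.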

32
How many alignments?
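  • Without gaps, an alignment is fixed by the relative
    offset of the two sequences, so there are only
    N+M-1 possible alignments (with at least one
    overlapping position) between sequences of
    length N and M
  • Once we start allowing gaps, there are many
    possible arrangements to consider
    abcbcd   abcbcd   abcbcd
    abc--d   a--bcd   ab--cd
  • Considering every possible pairing between
    elements, the number of possible alignments
    grows exponentially with N and M.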
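To get a feel for how fast the number of gapped alignments grows, the sketch below counts global alignments under one particular convention (an assumption, not taken from the slides): every column is either a pair of characters, a gap in the first sequence, or a gap in the second, and different orderings of adjacent gaps count as different alignments. Under that convention the counts are the Delannoy numbers.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of gapped global alignments of sequences of length n and m.

    The last column is either a pair (n-1, m-1), a gap in the second
    sequence (n-1, m), or a gap in the first sequence (n, m-1).
    """
    if n == 0 or m == 0:
        return 1  # only gaps remain, one way to place them
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))

for length in (1, 2, 3, 6, 10):
    print(length, count_alignments(length, length))
```

Even for two sequences of length 6 there are already thousands of candidates, so enumerating them all is not an option for real sequences.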

33
Finding the optimal alignment
  • Dynamic programming identifies optimal alignments
    in time proportional to the product of the
    lengths of the sequences

34
Dynamic programming
  • The key idea is to align the sequences left to
    right: once a prefix is optimally aligned,
    nothing about the remainder of the alignment
    changes the alignment of the prefix.
  • We construct a matrix of possible alignment
    scores (N x M cells; up to N x M^2 calculations
    in the worst case, with general gap penalties)
  • Called Needleman-Wunsch (global alignment) or
    Smith-Waterman (local alignment)
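A minimal Python sketch of the global (Needleman-Wunsch style) recurrence. The scoring values (match +1, mismatch -1, fixed gap penalty -1) are assumptions chosen for illustration; with a fixed per-gap penalty each of the N x M cells is filled in constant time.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b.

    score[i][j] holds the best score for aligning the prefixes
    a[:i] and b[:j]; it depends only on three neighbouring cells.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    return score[n][m]

print(needleman_wunsch_score("abcdef", "acdef"))
print(needleman_wunsch_score("abcdef", "abceef"))
```

Recovering the alignment itself, rather than only its score, needs a traceback through the matrix, which is omitted here to keep the sketch short.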

35
Alignment matrix
  • Create a matrix with each sequence to be aligned
    along one edge and the score of the alignment of
    each pair of elements in a cell.
  • Best local alignment is just the highest
    scoring diagonal
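As a rough illustration of the "highest scoring diagonal" idea for ungapped local alignment, the sketch below walks every diagonal of the pairwise score matrix and keeps the best-scoring contiguous run. The function name, the scores (+1 match, -1 mismatch) and the example strings are assumptions.

```python
def best_diagonal_run(a, b, match=1, mismatch=-1):
    """Best ungapped local alignment as (score, start_a, start_b, length)."""
    best = (0, 0, 0, 0)
    # Each diagonal is identified by the offset i - j.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        run, start_i, start_j = 0, i, j
        while i < len(a) and j < len(b):
            if run <= 0:
                # A non-positive running score never helps: restart here.
                run, start_i, start_j = 0, i, j
            run += match if a[i] == b[j] else mismatch
            if run > best[0]:
                best = (run, start_i, start_j, i - start_i + 1)
            i += 1
            j += 1
    return best

print(best_diagonal_run("xxabcdyy", "zabcdz"))
```

Because a diagonal never changes the offset between the two sequences, this finds only gap-free matches; gapped local alignment needs the dynamic programming matrix.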

36
Dynamic programming matrix
  • Each cell has the score for the best aligned
    sequence prefix up to that position.
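For local alignment, a common formulation (Smith-Waterman style) fills the same kind of matrix but never lets a cell drop below zero, so an alignment can start anywhere; the best local score is the largest cell. The sketch below assumes match +2, mismatch -1 and a fixed gap penalty of -2, all chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score of the best local alignment between a and b."""
    n, m = len(a), len(b)
    cells = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cells[i][j] = max(
                0,                           # start a new local alignment
                cells[i - 1][j - 1] + pair,  # extend along the diagonal
                cells[i - 1][j] + gap,       # gap in b
                cells[i][j - 1] + gap,       # gap in a
            )
            best = max(best, cells[i][j])
    return best

print(smith_waterman_score("xxabcdyy", "zabcdz"))
print(smith_waterman_score("aaa", "ttt"))
```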

45
Why predict protein structure?
  • Neither crystallography nor NMR can keep pace
    with genome sequencing efforts
  • Computer scientists love this problem
  • Understandable with minimal biology
  • Seems like a good discrimination task
  • Understand the mechanisms of folding (?)
  • First computational Nobel prize?

46
Protein structure
  • Amino acid sequence alone should be enough to
    determine protein structure
  • However, the physics are daunting
  • 20,000 protein atoms, plus equal amounts of
    water
  • Folding can take seconds (most chemical reactions
    take place 10^12, i.e. 1,000,000,000,000x, faster)
  • Empirical determinations of protein structure are
    advancing rapidly

47
Protein structure cartoons
48
Protein Structure Representations
  • Different visualizations show various aspects
    of structure

49
Competition
  • Critical Assessment of Structure Prediction (CASP)
    competition, held since 1994
  • Solved, but unpublished structures are posted in
    May, predictions due in September
  • Various categories
  • Relation to existing structures, ab initio,
    homology, fold, etc.
  • Partial vs. Fully automated approaches
  • Produces lots of information about what aspects
    of the problems are hard, and ends arguments
    about test sets.
  • Results showing steady improvement, and the value
    of integrative approaches.

50
Reconstructing Phylogenies
  • The only universal biological fact is that all
    species are related by descent.
  • The genealogical history of life can be
    represented by an evolutionary tree (or
    phylogeny), with contemporary organisms at the
    leaves.
  • Goal of phylogenetic inference is to reconstruct
    the order of splitting events (and perhaps the
    distances between them).

51
The phylogeny problem
  • Input
  • A set of contemporary species (S) whose
    evolutionary relationship is to be reconstructed
  • A set of inheritable characteristics (C) that
    describe each species. Characteristics can be
  • quantitative (continuous, e.g. size)
  • qualitative (discrete, e.g. gene sequence)
  • Output
  • Tree (perhaps branch lengths) which best fits the
    data.
  • Assumptions
  • Common ancestor

52
Why do it?
  • Resolve evolutionary history
  • Constructing vaccines
  • Want to assure that vaccine is constructed to
    address diverse strains of the disease (e.g.
    influenza)
  • Epidemiology
  • Reconstruct paths of infection, either of an
    individual or generally. (e.g. Crandall's
    Evolution of HIV)

53
Tree reconstruction as optimization
  • Number of possible trees grows at least
    exponentially with the number of species S
    (on the order of 2^S or more). Considering
    branch lengths makes the problem harder.
  • Want the tree that maximizes some quality score.
    Score based on either
  • Characters (e.g. sites in a sequence) directly,
    or
  • Distances between character sets (e.g. alignment
    scores)
  • Another NP-hard optimization problem...
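To see why exhaustive search over trees is hopeless, the sketch below counts rooted, binary, leaf-labelled tree topologies using the standard formula (2S-3)!! = 1 x 3 x 5 x ... x (2S-3). Restricting the count to rooted binary trees is an assumption made here; the slides do not say which kind of tree is being counted.

```python
def count_rooted_binary_trees(num_species):
    """Number of rooted binary tree topologies on labelled leaves.

    Uses the double factorial (2S-3)!!; branch lengths are ignored.
    """
    if num_species < 2:
        return 1
    count = 1
    for k in range(3, 2 * num_species - 2, 2):
        count *= k
    return count

for s in (3, 4, 5, 10, 20):
    print(s, count_rooted_binary_trees(s))
```

With only ten species there are already tens of millions of candidate trees, so practical methods rely on heuristics rather than enumerating every tree.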

54
Managing the Biomedical Literature
  • Challenges are huge
  • 700,000 articles/year, accelerating 10
    year! 16,048 last week
  • Large number of relevant articles for most topics
  • (Sub-)Disciplinary boundaries are breaking down
  • Computational tools are crucial
  • Information retrieval (finding the right
    articles)
  • Organization and dissemination of articles and data
  • Information extraction

55
Natural Language Processing
  • NLP is many tools and technologies.
  • Natural language understanding: capture the
    meaning and implications of texts. A mostly
    unobtainable goal, but much useful theory
  • Using documents as evidence for gene function or
    other kinds of inference
  • Information extraction: capture the meaning and
    implications of certain restricted aspects of
    text
  • Document management: reasoning about documents,
    rather than the texts they contain
  • Information retrieval (including Q/A)
  • Browsing and organizing large collections of
    documents

56
Summary
  • Many Significant and Difficult Computational
    Problems
  • Need Computer Scientists to Develop new efficient
    algorithms and methodologies
  • Many different fields
  • Graphics, Algorithms, Machine Intelligence,
    Machine Vision, Optimization, Databases, etc.