Tentative definition of bioinformatics

Provided by: Winf2

1
Tentative definition of bioinformatics
  • Bioinformatics, often also called genomics,
    computational genomics, or computational biology,
    is a new interdisciplinary field at the
    intersection of biology, computer science,
    statistics, and mathematics. Its subject matter
    is the extraction of biologically useful
    information from large sets of molecular data,
    such as DNA or protein sequence data or gene
    expression data. The term bioinformatics is
    currently used mainly to refer to the extraction
    of information from sequence data, while the
    creation and analysis of gene expression data is
    called functional genomics.

2
Biology's dilemma: There is too much to know about living things
  • Roughly 1.5 million species of organisms have
    been described and given scientific names to
    date. Some biologists estimate that the total
    number of all living species may be several
    times higher. It is impossible to learn
    everything about all these organisms.
    Biologists solve the dilemma by focusing on
    some species, so-called model organisms, and
    trying to find out as much as they can about
    these model organisms.

3
Some important model organisms
  • Mammals: human, chimpanzee, mouse, rat
  • Fish: zebrafish, pufferfish
  • Insects: fruit fly (Drosophila melanogaster)
  • Roundworms: Caenorhabditis elegans
  • Protista: malaria parasite (Plasmodium
    falciparum)
  • Fungi: baker's yeast (Saccharomyces cerevisiae)
  • Plants: thale cress (Arabidopsis thaliana),
    corn, rice
  • Bacteria: Escherichia coli, Mycoplasma genitalium
  • Archaea: Methanococcus jannaschii

4
Let's find out everything about some species
  • What would it mean to learn everything about a
    given species? All available evidence indicates
    that the complete blueprint for making an
    organism is encoded in the organism's genome.
    Chemically, the genome consists of one or
    several DNA molecules. These are long strings
    composed of pairs of nucleotides. There are
    only four different nucleotides, denoted by A,
    C, G, T. The information about how to make the
    organism is encoded by the order in which the
    nucleotides appear.

5
Some genome sizes
  • HIV-2 virus: 9671 bp
  • Mycoplasma genitalium: 5.8 × 10^5 bp
  • Haemophilus influenzae: 1.83 × 10^6 bp
  • Saccharomyces cerevisiae: 1.21 × 10^7 bp
  • Caenorhabditis elegans: 10^8 bp
  • Drosophila melanogaster: 1.65 × 10^8 bp
  • Homo sapiens: 3.14 × 10^9 bp
  • Some amphibians: 8 × 10^10 bp
  • Amoeba dubia: 6.7 × 10^11 bp

6
Sequencing Genomes
  • Contemporary technology makes it possible to
    completely sequence entire genomes, that is, to
    determine the sequence of A's, C's, G's, and
    T's in the organism's genome. The first viral
    genomes were sequenced in the 1970s, the first
    bacterium (Haemophilus influenzae) in 1995, and
    the first multicellular organism
    (Caenorhabditis elegans) in 1998. A draft of
    the human genome was announced in 2000.

7
Where to store all these data?
  • In databases, of course. Some of the sequence
    data are stored in proprietary databases, but
    most of them are stored in the public database
    GenBank and can be accessed via the World Wide
    Web. In fact, most relevant journals require
    proof of submission to GenBank before an
    article discussing sequence data will be
    published.
  • The URL for GenBank is
    http://www.ncbi.nlm.nih.gov/Genbank/
8
What's in the databases?
  • In 1981, GenBank contained fewer than 500,000 bp
    of sequence data.
  • In 1986, GenBank contained 9,615,371 bp.
  • In 1991, GenBank contained 71,947,426 bp.
  • In 1996, GenBank contained 651,972,984 bp.
  • In 2001, GenBank contained 15,849,921,438 bp.
  • In 2004, GenBank contained 37,893,844,733 bp.
  • In 2009, GenBank contained 106,533,156,756 bp.
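
The counts listed on this slide suggest roughly exponential growth. As a rough illustration (the exponential model is an assumption, not a claim made on the slide), the following sketch estimates how many years the database needed to double between any two of the listed years. The name `doubling_time` is made up for this example.

```python
import math

# Base-pair counts as listed on the slide (year -> bp in GenBank).
GENBANK_BP = {
    1986: 9_615_371,
    1991: 71_947_426,
    1996: 651_972_984,
    2001: 15_849_921_438,
    2004: 37_893_844_733,
    2009: 106_533_156_756,
}

def doubling_time(year_a, year_b, counts=GENBANK_BP):
    """Years needed to double, ASSUMING exponential growth between the two years."""
    doublings = math.log2(counts[year_b] / counts[year_a])
    return (year_b - year_a) / doublings

if __name__ == "__main__":
    years = sorted(GENBANK_BP)
    for a, b in zip(years, years[1:]):
        print(f"{a}-{b}: doubling every {doubling_time(a, b):.2f} years")
```

Comparing the intervals shows whether growth sped up or slowed down over time; the model is only a convenient summary of the listed numbers.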

9
What's in the databases?
  • On March 18, 2005, there were 1791 completely
    sequenced viruses, 204 completely sequenced
    bacteria, 21 completely sequenced archaea, and
    9 complete genomes of eukaryotes, among them
    two yeasts, the roundworm C. elegans, the fruit
    fly Drosophila melanogaster, the mosquito A.
    gambiae, the malaria parasite P. falciparum,
    and the plant Arabidopsis thaliana (thale
    cress). There were also drafts of 11 other
    eukaryotic genomes, most notably of the human
    genome.

10
What's in the databases?
  • On December 17, 2010, there were 3518 completely
    sequenced viruses, 952 completely sequenced
    bacteria, 68 completely sequenced archaea, and
    73 complete genomes of eukaryotes, among them
    cow, wolf, horse, human, a monkey, pig, and
    chimpanzee.

11
First challenge: Sequencing large genomes
  • Currently, much of the sequencing process is
    automated. However, contemporary sequencing
    machines can only sequence stretches of DNA
    that are a few hundred base pairs long at a
    time. The process of assembling these stretches
    of sequence into a whole genome poses some
    interesting mathematical problems.
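
To give a flavor of the assembly problem, here is a minimal sketch of a greedy overlap-merge heuristic on error-free toy reads. It is an illustration only, not the method used by any real sequencing project: it ignores sequencing errors, repeats, reads contained in other reads, and the reverse strand. The function names and the `min_len` parameter are assumptions made for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if no such overlap of at least `min_len` letters exists)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: remaining pieces stay separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

if __name__ == "__main__":
    # Three overlapping toy reads (made up for this example).
    print(greedy_assemble(["CCTTAGGCA", "ATGGCGTA", "GCGTACCTT"]))
```

Greedy merging can go wrong when the genome contains repeated stretches, which is one reason assembly is a genuinely hard mathematical problem.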

12
First challenge: Sequencing large genomes
  • For example, the publicly financed Human Genome
    Project used an approach called genome mapping
    to facilitate sequence assembly. Celera
    Genomics, a private enterprise, announced that
    it would be able to complete the sequencing of
    the entire human genome much faster by using an
    approach called shotgun sequencing. There was
    much debate over the feasibility of the latter
    approach, but it apparently worked. At its
    core, this was a debate over the mathematics of
    sequence assembly.

13
You have sequenced your genome - what do you do with it?
  • Answering this question is known as genome
    analysis or sequence analysis. At present, most
    of bioinformatics is concerned with sequence
    analysis. Here are some of the questions
    studied in sequence analysis:
  • gene finding
  • protein 3D structure prediction
  • gene function prediction
  • prediction of important sites in proteins
  • reconstruction of phylogenies
14
Genes and proteins
  • The genome controls the making and workings of
    an organism by telling the cell which proteins
    to manufacture under which conditions. Proteins
    are the workhorses of biochemistry and play a
    variety of roles.
  • A gene is a stretch of DNA that codes for a
    given protein.
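
How a stretch of DNA "codes for" a protein can be sketched with the standard genetic code, in which each triplet of nucleotides (codon) stands for one amino acid or a stop signal. The sketch below assumes an intron-free coding sequence given on the coding strand; `translate` is a name chosen for this example.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for the 64 codons in TCAG order
# (standard genetic code, '*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding DNA sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

if __name__ == "__main__":
    print(translate("ATGGCCTGGTAA"))  # codons ATG GCC TGG TAA
```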

15
Where are the genes?
  • The objective of gene finding is to identify the
    regions of DNA that are genes. Ideally, we want
    to make statements like "Positions 28,354
    through 29,536 of this genome code for a
    protein."
  • The mathematical challenge here is to identify
    patterns in DNA that reliably indicate where a
    gene starts and ends, especially in eukaryotes.
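
The simplest such pattern is an open reading frame (ORF): a start codon ATG followed, in the same reading frame, by a stop codon. The sketch below scans the three forward frames of one strand and reports positions in the "positions X through Y" style used above. It is a toy illustration under stated assumptions: it ignores the reverse strand and introns, so it is far from adequate for eukaryotes. The name `find_orfs` and the `min_codons` parameter are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs, 1-based and inclusive, of open reading
    frames on the given strand: an ATG followed in frame by a stop codon.
    `min_codons` counts the codons before the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)

if __name__ == "__main__":
    for start, end in find_orfs("CCATGAAACCCGGGTAGTT"):
        print(f"Positions {start} through {end} may code for a protein.")
```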

16
Protein structure prediction
  • When a protein is manufactured in the cell, it
    assumes a characteristic 3D structure or fold.
    It is very costly to determine the 3D structure
    of a protein experimentally (by NMR or X-ray
    crystallography). It would be much cheaper if
    we could predict the 3D structure of a protein
    directly from its primary structure, i.e., from
    the sequence of its amino acids. This is known
    as the protein folding problem.
  • Many approaches have been proposed to develop
    algorithms for solving this problem; so far,
    results are mixed.

17
Prediction of protein function
  • Suppose you have identified a gene. What is its
    role in the biochemistry of its organism?
    Sequence databases can help us in formulating
    reasonable hypotheses.
  • Search the database for proteins with similar
    amino acid sequences in other organisms.
  • If the functions of the most similar proteins are
    known and if they tend to be the same function
    (e.g., enzyme involved in glucose metabolism),
    then it is reasonable to conjecture that your
    gene also codes for an enzyme involved in
    glucose metabolism.

18
Prediction of protein function: homology searches
  • Given a protein or DNA sequence, searching the
    database(s) for similar sequences is known as a
    homology search. The most popular software tool
    for performing these searches is called BLAST;
    therefore biologists often speak of BLAST
    searches. There are two interesting problems
    here:
  • How should the similarity of two sequences be
    measured?
  • How much similarity constitutes evidence of
    biologically meaningful homology as opposed to
    random chance?
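
Both questions can be made concrete with a small sketch. It scores the best global alignment of two sequences by dynamic programming (the Needleman-Wunsch recurrence), then asks how often a randomly shuffled version of the second sequence scores at least as well. This is an illustration only and not how BLAST works internally; BLAST uses a faster heuristic search with its own statistics. The scoring values (+1 match, -1 mismatch, -2 gap) and all function names are assumptions made for this example.

```python
import random

def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (Needleman-Wunsch)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def shuffle_p_value(a, b, trials=200, seed=0):
    """Estimated chance that a shuffled copy of b scores at least as well as b."""
    rng = random.Random(seed)
    observed = alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction avoids reporting 0

if __name__ == "__main__":
    query = "ACGTTGCAAGCTTAGGCATC"    # made-up sequences
    subject = "ACGTTGCTAGCTTAGCATC"
    print(alignment_score(query, subject), shuffle_p_value(query, subject))
```

Shuffling keeps the letter composition of the sequence but destroys its order, so a small p-value suggests the similarity is unlikely to come from composition alone.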

19
Prediction of important sites in proteins
  • Not all parts of a protein are equally
    important: the function of most of its amino
    acids is often just to maintain an appropriate
    3D structure, and mutations of those less
    crucial amino acids often don't have much
    effect.
  • However, most proteins have crucial parts such
    as binding sites. Mutations occurring at
    binding sites tend to be lethal and will be
    weeded out by evolution.

20
How to predict binding sites from sequence data
  • Get a collection of proteins of similar amino
    acid sequences and analogous biochemical function
    from your database.
  • Align these sequences amino acid by amino acid.
  • Check which regions of the protein are highly
    conserved in the course of evolution.
  • The binding site should be in one of the highly
    conserved regions.
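
The last two steps can be sketched as follows, assuming the sequences have already been aligned. Each column is scored by the fraction of sequences that share its most common residue, and runs of fully conserved columns are reported as candidate regions. The alignment `TOY_ALIGNMENT` is invented for this example, and the threshold and minimum run length are arbitrary choices.

```python
from collections import Counter

# Hypothetical aligned protein fragments ('-' marks a gap).
TOY_ALIGNMENT = [
    "MKT-AHCDGSLV",
    "MRTLAHCDGTLI",
    "MKSLAHCDGSMV",
    "LKT-GHCDGSLV",
]

def column_conservation(alignment):
    """For each column, the fraction of sequences sharing the most common
    residue; gap characters never count as the conserved residue."""
    scores = []
    for column in zip(*alignment):
        counts = Counter(ch for ch in column if ch != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores

def conserved_regions(alignment, threshold=1.0, min_len=3):
    """Runs of at least `min_len` consecutive columns scoring >= threshold,
    as 0-based half-open (start, end) column ranges."""
    regions, start = [], None
    scores = column_conservation(alignment) + [-1.0]  # sentinel closes a final run
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions

if __name__ == "__main__":
    print(conserved_regions(TOY_ALIGNMENT))
```

A conserved region found this way is only a candidate; whether it really is a binding site still has to be checked by other means.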

21
The importance of being aligned
  • DNA and protein molecules evolve mostly by three
    processes: point mutations (exchange of a
    single letter for another), insertions, and
    deletions. If a group of homologous proteins
    from different organisms has been identified,
    it is assumed that these proteins have evolved
    from a common ancestor. The process of multiple
    sequence alignment aims at identifying loci in
    the individual molecules that are derived from
    a common ancestral locus. These form the
    columns of the alignment.

22
Example of a multiple alignment
  • A T G - - T T C G G A C T
  • A C G A A T C C A G - C T
  • - C G A A T C C T A A C C
  • - T G A G C A C T A A C C

23
Reconstruction of phylogenetic trees
  • A phylogenetic tree depicts the evolutionary
    history of a group of species. By observing
    similarities and differences between species,
    we may be able to reconstruct their phylogeny.
    Classically, the degree of similarity between
    two species has been assessed from
    morphological characters. By comparing genomic
    sequence data, we can actually quantify the
    degree of similarity between any two species,
    and use these degrees of similarity as a basis
    for reconstructing phylogenetic trees.
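
As a sketch of this idea, the code below takes the four sequences of the example multiple alignment shown earlier, computes the fraction of differing positions for every pair, and joins the closest groups step by step using average-linkage clustering (UPGMA). UPGMA is just one simple distance-based method, chosen here for illustration; the slides do not name a specific method. The labels `seq1` to `seq4` are placeholders.

```python
from itertools import combinations

# The four rows of the example multiple alignment; labels are placeholders.
ALIGNMENT = {
    "seq1": "ATG--TTCGGACT",
    "seq2": "ACGAATCCAG-CT",
    "seq3": "-CGAATCCTAACC",
    "seq4": "-TGAGCACTAACC",
}

def p_distance(a, b):
    """Fraction of differing positions among columns where neither sequence
    has a gap (assumes at least one such column exists)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(alignment):
    """Average-linkage clustering; returns the tree as nested tuples."""
    # Each cluster is a pair (tree, list of member names).
    clusters = [(name, [name]) for name in alignment]
    dist = {
        frozenset(pair): p_distance(alignment[pair[0]], alignment[pair[1]])
        for pair in combinations(alignment, 2)
    }

    def cluster_distance(c1, c2):
        total = sum(dist[frozenset((m, n))] for m in c1[1] for n in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

if __name__ == "__main__":
    print(upgma(ALIGNMENT))
```

In the nested tuples, sequences grouped in the innermost pair are the most similar ones; the uncorrected distance used here ignores multiple mutations at the same site.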

24
Reconstruction of phylogenetic trees
  • The most common approach to using genomic data
    for reconstruction of phylogenetic trees is to
    look at genes with analogous function and thus
    supposedly common ancestry and see how far the
    genes taken from the extant organisms have
    diverged.
  • The observed differences in the amino acid
    composition are then used to reconstruct the
    phylogeny. The current partition of organisms
    into eubacteria, archaea, and eukarya was
    discovered in this way by analyzing rRNA.

25
The new frontier: Functional genomics
  • It is fashionable nowadays to talk about
    functional genomics. Many people use this term
    as if it were a new discipline separate from
    bioinformatics, but I think it is more
    appropriate to consider it a new subfield of
    bioinformatics.
  • The ultimate aim of functional genomics is to
    understand what genes do, when they do it, and
    how they do it. Ideally, we would like to
    understand the cell, or organism, as a giant
    network of chemical pathways that regulate each
    other.

26
Microarrays (gene chips)
  • Microarrays, or gene chips, make it possible to
    monitor the level of activity of all the genes
    represented on the chip simultaneously under a
    variety of environmental conditions, in various
    organs, and at various stages of development.
  • There are two types of challenges here: to
    determine when a change in activity level
    detected by the chip is statistically
    significant, and to use the data so obtained to
    make inferences about gene regulation.

27
What do we do with all these data?
  • The bread-and-butter method of microarray data
    analysis is clustering. For a sequence of
    experiments on the same set of genes under
    various conditions, clustering makes it
    possible to identify groups of genes that are
    up- or down-regulated simultaneously. It is
    believed that genes acting in the same chemical
    pathway would normally belong to the same
    cluster. Some algorithms for clustering will be
    discussed in this course.
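
As a minimal sketch of the idea (not necessarily one of the algorithms covered in the course), the code below groups genes whose expression profiles rise and fall together. Two genes are linked when the Pearson correlation of their profiles is at least `min_corr`, and clusters are the resulting connected groups (single-linkage clustering). The expression values, gene names, and threshold are all invented for this example.

```python
from itertools import combinations
from statistics import mean

# Hypothetical expression levels of five genes in four experiments.
PROFILES = {
    "geneA": [0.1, 1.2, 2.3, 0.2],
    "geneB": [0.0, 1.0, 2.1, 0.1],
    "geneC": [2.0, 0.9, 0.1, 1.9],
    "geneD": [2.2, 1.1, 0.0, 2.1],
    "geneE": [1.0, 1.1, 0.9, 1.0],
}

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def cluster_genes(profiles, min_corr=0.9):
    """Single-linkage clustering: genes share a cluster when they are connected
    by a chain of pairs with correlation >= min_corr."""
    parent = {g: g for g in profiles}

    def find(g):
        # Follow parent links up to the representative of g's cluster.
        while parent[g] != g:
            g = parent[g]
        return g

    for g, h in combinations(profiles, 2):
        if pearson(profiles[g], profiles[h]) >= min_corr:
            parent[find(g)] = find(h)

    clusters = {}
    for g in profiles:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values())

if __name__ == "__main__":
    print(cluster_genes(PROFILES))
```

Correlation is used instead of raw differences so that genes with the same up-and-down pattern are grouped together even if their absolute activity levels differ.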