Data-intensive Computing: Case Study Area 1: Bioinformatics

Provided by: bina
Learn more at: https://cse.buffalo.edu

1
Data-intensive Computing: Case Study Area 1
Bioinformatics
  • B. Ramamurthy

2
Human Genetics
  • Genomics
  • Human Genome project
  • Proteomics
  • Diseasome
  • Tree of life project
  • Phylogenetics

3
Human cell
  • Base pairs of DNA: C-G and A-T
  • C = cytosine, G = guanine, A = adenine,
    T = thymine
  • Each human cell contains approximately 3 billion
    base pairs.
  • The DNA of a single cell contains so much
    information that if it were represented in
    printed words, simply listing the first letter of
    each base would require over 1.5 million pages of
    text!
  • If laid end-to-end, the DNA strand measures about
    2 to 3 meters.
  • DNA is a large molecule in the nucleus of the
    cell
  • It is coiled as a double helix
  • Each strand of the DNA molecule is made of A, C,
    G and T. Example: AAAGTTCTTAATTA on one strand
    is matched on the other strand by the
    complementary bases TTTCAAGAATTAAT
  • These strings of letters contain all the codes
    needed for human functions
  • Reference text: Bioinformatics: Databases, Tools
    and Algorithms, by O. Bosu and S.K. Thukral
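
The base-pairing rule on this slide (A with T, C with G) can be sketched in a few lines of Python. This is only an illustration: the names COMPLEMENT and complement are made up for this example, and the function pairs bases position by position, as the slide does, without reversing the strand.

```python
# Pairing rule from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the position-by-position complementary strand."""
    # Hypothetical helper for illustration; raises KeyError on any
    # character other than A, C, G or T.
    return "".join(COMPLEMENT[base] for base in strand.upper())


# The example strand from the slide.
print(complement("AAAGTTCTTAATTA"))  # TTTCAAGAATTAAT
```

Applying complement twice returns the original strand, which is one quick way to check the table.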

4
More details
  • Sequences of base pairs are grouped into
    meaningful units called genes
  • When a gene needs to be activated, the DNA
    molecule in the cell nucleus uncoils and unfurls
    just enough to expose that gene
  • An RNA is formed from the exposed section of the
    DNA.
  • This messenger RNA (mRNA) carries with it the
    print of the open DNA section
  • RNA and DNA differ in one respect: RNA does not
    contain thymine (T); it has uracil (U) instead.
    RNA is also short-lived
  • Once the mRNA is formed, the open sections of the
    DNA close off.
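
The T-to-U difference between DNA and RNA can be shown with a deliberately simplified sketch. It assumes the input is the DNA strand whose letters the mRNA copies, so transcription reduces to a letter substitution; the function name transcribe is hypothetical, and the sketch ignores everything else the cell does during transcription.

```python
def transcribe(dna):
    """Simplified transcription: copy the DNA letters, writing U for T."""
    # Assumption for this sketch: `dna` is the strand whose sequence the
    # mRNA mirrors, so only the thymine-to-uracil substitution is needed.
    return dna.upper().replace("T", "U")


# The example strand used earlier in the deck.
print(transcribe("AAAGTTCTTAATTA"))  # AAAGUUCUUAAUUA
```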

5
Protein formation
  • mRNA travels to the cytoplasm, where it meets the
    ribosome (rRNA)
  • The ribosome reads the code in the mRNA three
    bases at a time (a codon) and selects the
    matching amino acid.
  • Twenty amino acids are prevalent in human cells.
    Example: the codons GCU, GCC and GCA all
    correspond to alanine
  • In effect, the ribosome is a process control
    computer that takes codons as input and produces
    amino acids as output.
  • Amino acids polymerize and form polypeptide
    chains called proteins
  • Proteins fold and form basic structures such
    as skin and hair.
  • Even though the brain controls major human
    functions, at the cell level it is the DNA that
    has command and control.
  • DNA is a fixed code for a given human (write once,
    read many, or WORM, characteristics).
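
The "codons in, amino acids out" view of the ribosome can be sketched as a table lookup. The table below is intentionally partial: it holds only the alanine codons, the AUG start codon (methionine) and the three stop codons from the standard genetic code. The names CODON_TABLE and translate are illustrative, not taken from the slides.

```python
# Partial codon table (standard genetic code); a real table has 64 entries.
CODON_TABLE = {
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",  # alanine
    "AUG": "Met",                                            # start codon
    "UAA": None, "UAG": None, "UGA": None,                   # stop codons
}


def translate(mrna):
    """Read an mRNA string three bases at a time and list the amino acids."""
    amino_acids = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in this partial table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # a stop codon ends the chain
            break
        amino_acids.append(amino_acid)
    return amino_acids


print(translate("AUGGCUGCCGCAUAA"))  # ['Met', 'Ala', 'Ala', 'Ala']
```

A dictionary keeps the sketch close to the slide's input/output description; extending it to all 64 codons only means adding entries.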

6
Life's processes
  • DNA is a program that controls the functions,
    operations and structure of a cell and, in turn,
    those of our life processes.
  • Life processes in fact depend on the program in
    the DNA and on the hundreds of millions of
    ribosomes.
  • Life in this context appears as an immense
    distributed system.

7
Bioinformatics
  • Can we study, understand and analyze the
    complexity of this immensely complex system? Its
    structure and programs?
  • University of Arizona's Tree of Life project
    (ToL): http://tolweb.org
  • Human Genome Project (NIH and DOE): identifying
    the approximately 30,000 genes in human DNA and
    determining the sequence of the three billion
    bases that make up human DNA.
  • Of those 30,000 genes, we do not know the
    functions of more than 50% of them.
  • 99.9% of the nucleotide sequence is the same for
    all of us
  • 0.1% is attributed to individual differences such
    as race, skin color and predisposition to
    diseases
  • High-throughput sequencing is generating
    ultra-scale biological data. How do we analyze
    this data?
  • That is a data-intensive problem.
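
To put the 0.1 percent figure in scale, a back-of-the-envelope calculation helps. It uses only numbers from the slides (three billion bases, 0.1 percent variation); the function name differing_bases is made up for this sketch.

```python
GENOME_BASES = 3_000_000_000  # three billion bases, as stated on the slide


def differing_bases(total_bases):
    """Return 0.1 percent of total_bases using integer arithmetic."""
    return total_bases // 1000


# About three million positions vary between any two people.
print(differing_bases(GENOME_BASES))  # 3000000
```

Even the small variable fraction is millions of positions per person, which is part of why comparing many genomes becomes a data-intensive problem.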

8
Existing solutions?
  • Traditional databases to store, retrieve, analyze
    and/or make predictions from huge volumes of
    biological data
  • Software tools for implementing algorithms and
    developing applications for in-silico experiments
  • Visualization tools, user interfaces and web
    accessibility for searching through data
  • Machine learning and data mining methodologies.

9
Databases
  • Taxonomy DB
  • Genomics
  • Sequence db
  • Structure db
  • Proteomic db (e.g. the Protein Data Bank, PDB)
  • Micro-array db
  • Expression db
  • Enzyme db
  • Disease db
  • Molecular biology db

10
Tools
  • Data analysis tools
  • MySQL
  • Perl
  • Prediction tools
  • Clustering
  • Modeling tools
  • Surface prediction, predicting area of interest,
    protein-protein interaction
  • Alignment tools
  • Many more: http://galaxyproject.org/

11
How can we help?
  • How can we leverage our knowledge of large-scale
    data management to address bioinformatics
    problems? DC methods.
  • There is a large number of tools and data sets:
    how do we standardize the efforts so that they
    are complementary rather than repetitive? Cloud
    computing.

12
Text Mining vs Genetic Sequence Mining (Dot plot)
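  [Figure: two dot plots; the grid cells are empty in this transcript.]
  • Text mining: CORRELATIONS (horizontal axis)
    against RELATIONSHIP (vertical axis)
  • Genetic sequence mining: ACTCTAGGAGTC (horizontal
    axis) against GATAATTCGATC (vertical axis)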
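
A dot plot places one sequence along each axis and marks every cell where the two characters match, so shared substrings show up as diagonal runs. The sketch below rebuilds both grids from the axis labels on this slide. The function names dot_matrix and render are illustrative, and the "*" and "." symbols are a choice made here, not the slide's.

```python
def dot_matrix(horizontal, vertical):
    """Return rows of booleans: True where the two characters match."""
    return [[h == v for h in horizontal] for v in vertical]


def render(horizontal, vertical):
    """Format the dot plot as text, with '*' for a match and '.' otherwise."""
    lines = ["  " + " ".join(horizontal)]
    for v, row in zip(vertical, dot_matrix(horizontal, vertical)):
        lines.append(v + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Text mining example from the slide.
print(render("CORRELATIONS", "RELATIONSHIP"))
print()
# Genetic sequence example from the slide.
print(render("ACTCTAGGAGTC", "GATAATTCGATC"))
```

In the text example the shared substring RELATIONS appears as a diagonal of nine matches, starting at the fourth column of the first row. The same idea applies to the DNA pair, although a four-letter alphabet produces many more chance matches.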