Title: Data-intensive Computing: Case Study Area 1: Bioinformatics
1Data-intensive Computing Case Study Area 1
Bioinformatics
2Human Genetics
- Genomics
- Human Genome project
- Proteomics
- Diseasome
- Tree of life project
- Phylogenetics
3Human cell
- Base pair of DNA CG, AT
- C cytosine, G guanine, A adenine , T -
thymine - Each human cell contains approximately 3 billion
base pairs. - The DNA of a single cell contains so much
information that if it were represented in
printed words, simply listing the first letter of
each base would require over 1.5 million pages of
text! - If laid end-to-end, the DNA strand measures about
2 3 meters. - DNA is a single large molecule at the nucleus of
cell - It is coiled a double helix
- Each strand of the DNA molecule is made of A, C,
G and T example AAAGTTCTTAATTA that will be
matched on the other strand by the matching base
TTTCAAGAATTAAT - These string of alphabets contain all the codes
needed for the human functions - Ref text Bioinformatics Databases, tools and
algorithms, by. O. Bosu and S.K. Thukral
4More details
- Sequence of base pairs are grouped to make sense
genes - When a gene inside needs to be activated, the DNA
molecule at the cell nucleus uncoils and unfurls
to the right extent to expose that gene - From the exposed ends of the DNA a RNA is formed.
- mRNA or messenger RNA is formed that carries with
it the print of the open DNA section - RNA and DNA differ in one respect RNA does not
contain T or thymine but it has uracil (U). RNA
is short-lived - Once mRNA is formed open sections of the DNA
close off.
5Protein formation
- mRNA travels to the cytoplasm where it meets the
ribosome (rRNA) - Ribosome reads the code in the mRNA (codon) and
forms the amino acids. - Twenty amino acids are prevalent in human cells.
Ex codon GCU GCC GCA correspond to alanine - In effect ribosome is a process control computer
that takes in as input codons and produces amino
acids as output. - Amino acids polymerize and form polypeptide
chains called proteins - Proteins fold and form the basic structures such
as skin and hair. - Even though brain controls major human functions
at the cell level it is the DNA that has the
command and control. - DNA is fixed code for a given human. (WORM
characteristics)
6Lifes processes
- DNA is program that controls functions,
operations and structure of a cell and in turn
that of our life processes. - Life processes are in fact dependent of the
program in a DNA and the hundreds of millions of
ribosomes. - Life in this context appears as an immense
distributed system.
7Bioinformatics
- Can we study, understand and analyze the
complexity of the immensely complex system? Its
structure and programs? - University of Arizonas tree of life project
(ToL) http//tolweb.org - Human Genome project (NIH and DOE) collecting
approximately 30,000 genes in human DNA and
determining the sequences three billion bases
that make up the human DNA. - Out of the 30000 genes we do not know the
functions of more than 50 of them. - 99.9 of the nucleotide sequence is same for all
of us - 0.1 is attributed to individual differences such
as race, color of skin, disposition to diseases - High throughput sequencing is generating ultra
scale biological data how to analyze this data? - That is a data-intensive problem.
8Existing solutions?
- Traditional databases store, retrieve, analyze
and/or predict huge biological data - Software tools for implementing algorithms, and
developing applications for in-silico experiments - Visualization tools, user interfaces, web
accessibility for search through data - Machine learning and data mining methodologies.
9Databases
- Taxonomy DB
- Genomics
- Sequence db
- Structure db
- Proteomic database (PDB)
- Micro-array db
- Expression db
- Enzyme db
- Disease db
- Molecular biology db
10Tools
- Data analysis tools
- MySQL
- Perl
- Prediction tools
- Clustering
- Modeling tools
- Surface prediction, predicting area of interest,
protein-protein interaction - Alignment tools
- Many more http//galaxyproject.org/
11How can we help?
- How can we leverage our knowledge of large scale
data management to address bioinformatics
problems? DC methods. - Large number of tools and data how we
standardize the efforts so that they are
complementary or repetitive? Cloud computing.
12Text Mining vs Genetic Sequence Mining (Dot plot)
 C O R R E L A T I O N S
R Â Â Â Â Â Â Â Â Â Â Â Â
E Â Â Â Â Â Â Â Â Â Â Â Â
L Â Â Â Â Â Â Â Â Â Â Â Â
A Â Â Â Â Â Â Â Â Â Â Â Â
T Â Â Â Â Â Â Â Â Â Â Â Â
I Â Â Â Â Â Â Â Â Â Â Â Â
O Â Â Â Â Â Â Â Â Â Â Â Â
N Â Â Â Â Â Â Â Â Â Â Â Â
S Â Â Â Â Â Â Â Â Â Â Â Â
H Â Â Â Â Â Â Â Â Â Â Â Â
I Â Â Â Â Â Â Â Â Â Â Â Â
P Â Â Â Â Â Â Â Â Â Â Â Â
 A C T C T A G G A G T C
G Â Â Â Â Â Â Â Â Â Â Â Â
A Â Â Â Â Â Â Â Â Â Â Â Â
T Â Â Â Â Â Â Â Â Â Â Â Â
A Â Â Â Â Â Â Â Â Â Â Â Â
A Â Â Â Â Â Â Â Â Â Â Â Â
T Â Â Â Â Â Â Â Â Â Â Â Â
T Â Â Â Â Â Â Â Â Â Â Â Â
C Â Â Â Â Â Â Â Â Â Â Â Â
G Â Â Â Â Â Â Â Â Â Â Â Â
A Â Â Â Â Â Â Â Â Â Â Â Â
T Â Â Â Â Â Â Â Â Â Â Â Â
C Â Â Â Â Â Â Â Â Â Â Â Â