Title: Bioinformatics Methods and Applications
1Bioinformatics Methods and Applications
- Dr. Hongyu Zhang
- Ceres Inc.
2Goals of the talk
- The major battlefields in bioinformatics research
- The most popular weapons used in the battle
3History
- Human genome project
- Overlapping with other branches
- Computational Biology
- Biocomputing
- Biostatistics
- Cheminformatics
4The Central Dogma of Molecular Biology
DNA -> (transcription) -> RNA -> (translation) -> Protein
5Major battle fields in bioinformatics
- DNA
- Genome sequencing
- Gene discovery
- mRNA
- Micro-array analysis
- Sequencing
- Protein
- Structure modeling and prediction
- Proteomics
6Major weapons
- Computational algorithm
- Hash method
- Dynamic programming algorithm
- String and Tree (binary, suffix)
- Clustering
- Probability and Statistical theory and methods
- Bayes' theorem, Markov chains (HMM), principal component analysis
- Monte Carlo simulation
- Neural Network
- Physical chemistry
- Functions to describe the physical chemistry interactions in bio-molecules
- Molecular mechanics, molecular dynamics algorithms
- Data storage and access
- Databases: Oracle, MySQL, etc.
- Web interface
7Genome sequencing: Celera shotgun assembly (Venter et al. 2001)
8Gene discovery based on sequence comparison
- Finding new genes based on their sequence similarity and evolutionary relationship with known genes
- Methods
- Hash-based database search methods, like BLAST (PSI-BLAST), FASTA, BLAT, etc.
- Sequence alignment using the dynamic programming algorithm
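The hash-based search idea above can be sketched roughly as follows. This is a minimal illustration of k-mer indexing and exact seed lookup only, not how BLAST, FASTA, or BLAT are actually implemented; the function names, the word size k=4, and the toy sequences are all assumptions made for this example.

```python
from collections import defaultdict

def build_kmer_index(database, k):
    """Map every k-mer to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def find_seeds(query, index, k):
    """Return (query offset, sequence name, database offset) for each exact k-mer hit."""
    seeds = []
    for q in range(len(query) - k + 1):
        for name, d in index.get(query[q:q + k], []):
            seeds.append((q, name, d))
    return seeds

# Toy data, invented for illustration only.
database = {"seqA": "ACGTACGTGA", "seqB": "TTGACCA"}
index = build_kmer_index(database, k=4)
seeds = find_seeds("GTACG", index, k=4)
print(seeds)
```

A real search tool would go on to extend each seed into a longer alignment and score it statistically; the sketch stops at the seed lookup, which is the part the hash table speeds up.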
9BLAST database search (http://www.ncbi.nih.gov/BLAST/)
[Figure: a query sequence compared against database sequences]
10Sequence alignment
BLAST
BLA-T
- Programs
- CLUSTALW
- DIALIGN
11Dynamic programming algorithm
Sequence A (A1, A2, ..., Ai, ..., Am)
Sequence B (B1, B2, ..., Bj, ..., Bn)
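As a rough illustration of dynamic programming on two sequences A and B, here is a minimal global alignment sketch in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for the example; they are not taken from the slides.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned a, aligned b)."""
    m, n = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))

result = global_align("BLAST", "BLAT")
print(result)
```

With these assumed scores, aligning "BLAST" against "BLAT" gives four matches and one gap, in the form BLAST over BLA-T.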
12Ab initio gene prediction methods
- Statistics-based gene prediction
- Nucleotide distribution frequencies in the coding regions
- Exon/intron boundary signals
- Examples
- GenScan, Burge and Karlin 1997
- Fgenesh, Solovyev and Salamov 1994
13Hybrid gene prediction method
- Example: Celera Otto program
- BLAST against the RefSeq database
- BLAST against the EST database, other genomic sequences, etc.
- GenScan, Fgenesh
14Problems in Gene discovery
- Example: given a cDNA sequence, find its true location in the genome map among many alternatives
Query transcript/protein
Genomic component
15Two-step solution
- BLAST search of the cDNA sequence against the whole genome map
- Use an LIS (longest increasing subsequence) algorithm to find the correct genomic component hit
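The second step can be sketched as below, assuming each BLAST hit is reduced to a (query start, genome start) pair and that the correct location is the largest set of hits whose genome positions increase along with the query positions. That representation, the function name, and the toy coordinates are assumptions for illustration; the slides do not specify how the hits are encoded.

```python
from bisect import bisect_left

def longest_increasing_chain(hits):
    """Return the largest subset of (query_start, genome_start) hits that is
    strictly increasing in both coordinates."""
    # Sort by query position; ties are ordered by descending genome position
    # so that two hits with the same query start can never be chained together.
    hits = sorted(hits, key=lambda h: (h[0], -h[1]))
    tails = []      # tails[k]: index of the hit ending the best chain of length k+1
    tail_vals = []  # genome_start of each entry in tails (kept sorted)
    prev = [None] * len(hits)
    for idx, (_, g) in enumerate(hits):
        k = bisect_left(tail_vals, g)
        if k == len(tails):
            tails.append(idx)
            tail_vals.append(g)
        else:
            tails[k] = idx
            tail_vals[k] = g
        prev[idx] = tails[k - 1] if k > 0 else None
    # Walk the back-pointers from the end of the longest chain.
    chain = []
    idx = tails[-1] if tails else None
    while idx is not None:
        chain.append(hits[idx])
        idx = prev[idx]
    return chain[::-1]

# Toy hits, invented for illustration: two of them land far from the others.
hits = [(0, 5000), (100, 120), (200, 260), (300, 9000), (400, 410), (500, 560)]
chain = longest_increasing_chain(hits)
print(chain)
```

The hits left out of the chain are the out-of-order ones, which in this setting would be treated as spurious matches elsewhere in the genome.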
16Phylogenetic analysis
- Goal: study the function and evolutionary relationships among a group of genes
- Divide homologous genes into functional families
- Find the evolutionary relationships between the orthologous genes belonging to different species (e.g., the Out of Africa theory)
- Methods
- Hierarchical clustering
- Neighbor-joining, etc.
- PHYLIP program, Univ. of Washington
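To make the hierarchical clustering method concrete, here is a minimal average-linkage (UPGMA-style) sketch that builds a tree from a distance matrix. It is not PHYLIP's implementation and it is not neighbor-joining; the function name, the species labels, and the distances are toy assumptions for illustration.

```python
def upgma(names, dist):
    """Average-linkage clustering on a symmetric distance matrix.
    Returns the tree as nested tuples of names."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (tree, size)
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del d[(a, b)]
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            # Size-weighted average of the distances to the two merged clusters.
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]

# Toy distances, invented for illustration only.
names = ["human", "chimp", "mouse", "chicken"]
dist = [[0, 2, 10, 20],
        [2, 0, 10, 20],
        [10, 10, 0, 22],
        [20, 20, 22, 0]]
tree = upgma(names, dist)
print(tree)
```

The closest pair is merged first, so the nesting depth of the tuples reflects how early two sequences were joined.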
18Micro-array analysis
- Expression-genomics
- Primary goals
- Look for genes with different expression levels between experiments, which are candidate functional genes
- Look for groups of genes that have correlated expression levels, which could suggest that they are in the same biological pathway
19
- Methods
- General probability and statistics methods
- Dimension reduction
- Principal components
- Lowess
- Clustering
- Tools
- S-plus, R
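The two primary goals of micro-array analysis can be illustrated with a small sketch: a log2 fold change to flag genes whose expression differs between two conditions, and a Pearson correlation to find genes whose expression profiles move together. It is written in plain Python rather than S-plus or R; the gene names, expression values, and the fold-change threshold are all hypothetical.

```python
import math

def log2_fold_change(treated, control):
    """log2 ratio of two expression levels (both must be positive)."""
    return math.log2(treated / control)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy expression levels over four time points, invented for illustration.
expr = {
    "geneA": [1.0, 2.0, 4.0, 8.0],
    "geneB": [2.0, 4.0, 8.0, 16.0],
    "geneC": [5.0, 5.5, 4.5, 5.0],
}

# Goal 1: genes whose level changed at least 2-fold between first and last point.
changed = [g for g, v in expr.items()
           if abs(log2_fold_change(v[-1], v[0])) >= 1.0]

# Goal 2: correlation between two profiles.
r_ab = pearson(expr["geneA"], expr["geneB"])
print(changed, r_ab)
```

In practice the raw intensities would be normalized first (for example with Lowess) and the thresholds chosen by a statistical test rather than fixed by hand.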
20Example
- Herbicide
- Plants were treated with herbicide to observe the gene expression profiles over a series of time steps.
- The genes that appear right before the plant dies (12 hours) are the possible death genes
- If we knock down the death genes in normal plants, they could survive longer than the herbs.
21Protein structure prediction
- Why is protein structure important?
- The functions of a gene depend on its translated protein structure
- Protein binding with its ligands
- Protein-protein interactions
- A protein molecule usually keeps one stable structure under normal physiological conditions (Anfinsen, 1960s)
- Drug design
- Docking and high-throughput drug screening
22Sequence
Bioinformatics
Protein structure
Function
23Protein structure prediction methods
24Homology modeling procedure
Protein sequence
Database search
Select template structure
Sequence alignment
Build conserved regions first
Loop modeling
Build side-chains
Optimizing
25Homology modeling programs
- Academic software
- MODELLER, Sali A.
- COMPOSER, Blundell T.
- SWISS-MODEL
- RasMol (graphics)
- Commercial software
- QUANTA, MSI inc.
- SYBYL, TRIPOS inc.
26Threading
- Find the best fold candidates among a limited number of choices
- Add 3D information to the score function of dynamic programming
27Ab initio protein structure principle
28
- Threading programs
- Topits, Eisenberg D.
- Threader, Jones D.
- ProSup, Sippl M.
- 123D, Alexandrov N.
- Ab initio programs
- Rosetta, David Baker
29Current status in the protein structure prediction field
- Moult J., CASP (Critical Assessment of Techniques for Protein Structure Prediction).
- Homology modeling is already very mature
- Threading and ab initio methods have been used in industry
- Structural genomics
30Large scale computing platform
- Hardware
- Super-computers
- Cray/SGI
- DEC/Compaq
- Intel
- Linux clusters
- Blade
- Software
- Parallel computing (MPP, PVM etc.)
- Linux
- Grid computing: the Globus Project
31Linux clusters
32Data storage and access
- Bioinformatics is producing a huge amount of data each day
- How to organize and store data
- How to access data
- Database software
- Commercial
- Oracle, DB2, Sybase
- Freeware
- MySQL, PostgreSQL
33Data storage and access
- Current popular databases
- DNA and protein sequence, like GenBank, Swiss-Prot, PIR, etc.
- Protein structure, like PDB, SCOP
- DNA, mRNA, protein function, like GO, Pfam
34Database example: Gene Ontology (GO)
Molecular function
Biological process
Cellular component
35Data access
- Web interface
- Protocol
- CGI, JSP, ASP
- Computer languages
- Perl, Java, C/C++, Visual Basic, Visual C++
36Looking forward
- Where are the markets?
- Develop new programs
- Assemble current programs to build more efficient data mining pipelines
- Data storage and access
- Integrate the current databases to use them more effectively
- Computing platform, including hardware, software support, consulting, etc.
- What we can offer
- Multi-talents
- Team work
- Networking
37http://www.hongyu.org/paper/bioinformatics.ppt