Title: K. SEKAR, Ph.D.
1 K. SEKAR, Ph.D.
2 Dr. K. Sekar
Bioinformatics Centre
Supercomputer Education and Research Centre
Indian Institute of Science
Bangalore 560 012, INDIA
E-mail: sekar_at_physics.iisc.ernet.in
Voice: 91-080-3601409 or 91-080-2932469
Fax: 91-080-3600683 or 91-080-3600551
3 APPROACHES TO DEVELOPING DATA MINING TOOLS
4 Abstract
Bioinformatics is one of the fastest growing interdisciplinary areas in the biological sciences, and its data have grown to the point that we need powerful tools to organize and analyze them. This talk gives an overview of the general features of data mining tools, the techniques behind them and their applications.
5 Bioinformatics is the fashionable new name for the field previously called computational biology. Many prefer the name because it puts the emphasis on data storage and analysis rather than on the biology, and the field is indeed data driven.
6 The term Bioinformatics is used to encompass almost all computer applications in the biological sciences, but it was originally coined in the mid 1980s for the analysis of biological sequence data.
The quantity of known sequence data outweighs protein structural data and, by virtue of the genome projects, sequence databases are doubling in size every year.
A key challenge of bioinformatics is to analyze the wealth of sequence data in order to understand the amassed information in terms of protein structure, function and evolution.
Wherever possible, a range of different methods should be used, and the results should be combined with all available biological information.
7 Bioinformatics has provided us with a communication channel to reach and decode all this information in a comprehensive manner.
Both the large information repositories and the specialized tools to query them are held on distributed internet sites; bioinformatics therefore requires sound internet navigation skills.
The primary integrating technology that facilitates access to this copious data is the World Wide Web.
8 Refers to database-like activities involving persistent sets of data that are maintained in a consistent state over essentially indefinite periods of time.
Encompasses the use of algorithmic tools to facilitate biological database analyses.
Comprises the entire collection of information management systems, analysis tools and communication networks supporting biology.
9 DATA MINING
Data mining is defined as the exploration and analysis, by automatic and semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.
10 The central challenge is to derive maximum results from the wealth of data. This can be achieved by establishing and maintaining databases and by providing search and analysis tools to interpret the data.
11 DATABASE
A database is a collection of quantitative data resulting from experimental measurements or observations in various fields of science. Interest in databases has recently been kindled by international efforts to organize and analyze the data and to keep the knowledge up to date.
12 A database is essentially just a store of information. It is often kept in the form of simple files (a flat file, say). You can put information into this store or retrieve it from the store.
13 Derived Database
One of the greatest challenges in database research is to analyze the database in depth and create derived databases that meet users' needs without compromising the sustainability and quality of the existing database. Creating a derived database is expected to dramatically reduce the workload of the user community and to serve as a highly focused database.
14
DBREF  1UNE 1 123 SWS P00593 PA2_BOVIN 23 145
SEQADV 1UNE ASN 122 SWS P00593 LYS 144 CONFLICT
SEQRES  1 123 ALA LEU TRP GLN PHE ASN GLY MET ILE LYS CYS LYS ILE
SEQRES  2 123 PRO SER SER GLU PRO LEU LEU ASP PHE ASN ASN TYR GLY
SEQRES  3 123 CYS TYR CYS GLY LEU GLY GLY SER GLY THR PRO VAL ASP
SEQRES  4 123 ASP LEU ASP ARG CYS CYS GLN THR HIS ASP ASN CYS TYR
SEQRES  5 123 LYS GLN ALA LYS LYS LEU ASP SER CYS LYS VAL LEU VAL
SEQRES  6 123 ASP ASN PRO TYR THR ASN ASN TYR SER TYR SER CYS SER
SEQRES  7 123 ASN ASN GLU ILE THR CYS SER SER GLU ASN ASN ALA CYS
SEQRES  8 123 GLU ALA PHE ILE CYS ASN CYS ASP ARG ASN ALA ALA ILE
SEQRES  9 123 CYS PHE SER LYS VAL PRO TYR ASN LYS GLU HIS LYS ASN
SEQRES 10 123 LEU ASP LYS LYS ASN CYS
HET    CA 124 1
HETNAM CA CALCIUM ION
FORMUL 2 CA CA1 2
FORMUL 3 HOH 134(H2 O1)
HELIX  1 1 LEU   2 LYS  12 1 11
HELIX  2 2 PRO  18 ASP  21 1  4
HELIX  3 3 ASP  40 LYS  57 1 18
HELIX  4 4 ASP  59 VAL  63 1  5
HELIX  5 5 ALA  90 LYS 108 1 19
HELIX  6 6 LYS 113 HIS 115 5  3
SHEET  1 A 2 TYR 75 SER 78 0
SHEET  2 A 2 GLU 81 CYS 84 -1 N THR 83 O SER 76
SSBOND 1 CYS 11 CYS  77
SSBOND 2 CYS 27 CYS 123
SSBOND 3 CYS 29 CYS  45
SSBOND 4 CYS 44 CYS 105
SSBOND 5 CYS 51 CYS  98
SSBOND 6 CYS 61 CYS  91
SSBOND 7 CYS 84 CYS  96
LINK   CA CA 124 O TYR 28
LINK   CA CA 124 O GLY 32
CRYST1 47.120 64.590 38.140 90.00 90.00 90.00 P 21 21 21 4
15 SUB-DERIVED DATABASE
EXAMPLE-1 XXXXXSEKAR
RADHASEKAR
SHAMIASEKAR
SARADASEKAR
EXAMPLE-2 XAXAXA
KAMALA
SARADA
YAMAHA
KANAGA
MANASA
VANASA
PANAMA
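The second example can be sketched in Python. This is only an illustration, assuming that each X in the template stands for exactly one arbitrary letter (the slide does not say so explicitly); the helper names `template_to_regex` and `select_matching` are hypothetical.

```python
import re

def template_to_regex(template, wildcard="X"):
    """Turn a template such as 'XAXAXA' into a compiled regular expression.

    Assumption: the wildcard letter matches exactly one uppercase letter.
    """
    parts = ["[A-Z]" if ch == wildcard else re.escape(ch) for ch in template]
    return re.compile("".join(parts))

def select_matching(words, template):
    """Return the words that fit the template over their full length."""
    pattern = template_to_regex(template)
    return [w for w in words if pattern.fullmatch(w)]

# A small word list; the last two entries are there to show rejections.
words = ["KAMALA", "SARADA", "YAMAHA", "PANAMA", "SEKAR", "BANGALORE"]
matches = select_matching(words, "XAXAXA")
print(matches)
```

The selected words form the sub-derived database; everything else in the parent list is ignored.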
16 Adding information to the database
Software to collate the required information from the database
Analyze the collated information
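The collate-then-analyze steps can be sketched as follows. This is a minimal illustration, not the software used at the Centre: it assumes whitespace-separated SSBOND records without chain identifiers, as in the 1UNE excerpt shown earlier (real PDB files use fixed columns), and `collate_ssbonds` is a hypothetical name.

```python
def collate_ssbonds(pdb_text):
    """Collect disulphide pairs from SSBOND records.

    Assumption: fields are whitespace-separated and carry no chain
    identifier, i.e. SSBOND, serial, resname, resnum, resname, resnum.
    """
    bonds = []
    for line in pdb_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "SSBOND":
            bonds.append((int(fields[3]), int(fields[5])))
    return bonds

excerpt = """\
SSBOND 1 CYS 11 CYS 77
SSBOND 2 CYS 27 CYS 123
SSBOND 3 CYS 29 CYS 45
HET CA 124 1
"""

# Collate: the derived "database" is just the list of bonded residue pairs.
derived = collate_ssbonds(excerpt)

# Analyze: sequence separation between the two bonded cysteines.
for first, second in derived:
    print(first, second, second - first)
```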
17Web Browser
WWW Service
CGI-Script
Disk Storage
18 WHY A TOOL?
The amount of information in the world is growing exponentially, and it is becoming impossible to manage the data effectively by hand. Machine assistance is clearly necessary, but the difficulty lies in designing systems and software that are capable of discovering useful information with minimal human intervention.
19 PROTEIN DATA BANK (PDB)
GENOME DATABASE (GDB)
STRUCTURAL CLASSIFICATION OF PROTEINS (SCOP)
CAMBRIDGE STRUCTURAL DATABASE (CSD)
20 Given PDB-Id 1une
HEADER HYDROLASE 05-NOV-97 1UNE
TITLE   CARBOXYLIC ESTER HYDROLASE, 1.5 ANGSTROM ORTHORHOMBIC FORM
TITLE 2 OF THE BOVINE RECOMBINANT PLA2
COMPND   MOL_ID 1
COMPND 2 MOLECULE PHOSPHOLIPASE A2
COMPND 3 CHAIN NULL
COMPND 4 EC 3.1.1.4
COMPND 5 ENGINEERED YES
SOURCE   MOL_ID 1
SOURCE 2 ORGANISM_SCIENTIFIC BOS TAURUS
SOURCE 3 ORGANISM_COMMON BOVINE
SOURCE 4 EXPRESSION_SYSTEM ESCHERICHIA COLI
SOURCE 5 EXPRESSION_SYSTEM_STRAIN BL21 (DE3) PLYSS
SOURCE 6 EXPRESSION_SYSTEM_PLASMID PTO-A2MBL21
SOURCE 7 EXPRESSION_SYSTEM_GENE MATURE PLA2
KEYWDS HYDROLASE, ENZYME, CARBOXYLIC ESTER HYDROLASE
EXPDTA X-RAY DIFFRACTION
AUTHOR M.SUNDARALINGAM
REVDAT 1 06-MAY-98 1UNE 0
21
REMARK 1 REFERENCE 1
REMARK 1 AUTH   K.SEKAR,A.KUMAR,X.LIU,M.-D.TSAI,M.H.GELB,
REMARK 1 AUTH 2 M.SUNDARALINGAM
REMARK 1 TITL   CRYSTAL STRUCTURE OF THE COMPLEX OF BOVINE
REMARK 1 TITL 2 PANCREATIC PHOSPHOLIPASE A2 WITH A TRANSITION STATE
REMARK 1 TITL 3 ANALOGUE
REMARK 1 REF    TO BE PUBLISHED
REMARK 1 REFN   0353
REMARK 1 REFERENCE 2
REMARK 1 AUTH   K.SEKAR,C.SEKARUDU,M.-D.TSAI,M.SUNDARALINGAM
REMARK 1 TITL   1.72A RESOLUTION REFINEMENT OF THE TRIGONAL FORM OF
REMARK 1 TITL 2 BOVINE PANCREATIC PHOSPHOLIPASE A2
REMARK 1 REF    TO BE PUBLISHED
REMARK 1 REFN   0353
REMARK 1 REFERENCE 3
REMARK 1 AUTH   K.SEKAR,S.ESWARAMOORTHY,M.K.JAIN,M.SUNDARALINGAM
REMARK 1 TITL   CRYSTAL STRUCTURE OF THE COMPLEX OF BOVINE
REMARK 1 TITL 2 PANCREATIC PHOSPHOLIPASE A2 WITH THE INHIBITOR
REMARK 1 TITL 3 1-HEXADECYL-3-(TRIFLUOROETHYL)-SN-GLYCERO-2-
REMARK 1 TITL 4 PHOSPHOMETHANOL
REMARK 1 REF    BIOCHEMISTRY V. 36 14186 1997
22
REMARK 2 RESOLUTION. 1.5 ANGSTROMS.
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM X-PLOR 3.1
REMARK 3 AUTHORS BRUNGER
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) 1.5
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) 10.0
REMARK 3 DATA CUTOFF (SIGMA(F)) 1.0
REMARK 3 DATA CUTOFF HIGH (ABS(F)) 0.1
REMARK 3 DATA CUTOFF LOW (ABS(F)) 1000000.0
REMARK 3 COMPLETENESS (WORKING+TEST) (%) 92.
REMARK 3 NUMBER OF REFLECTIONS 17572
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD NULL
REMARK 3 FREE R VALUE TEST SET SELECTION X-PLOR
REMARK 3 R VALUE (WORKING SET) 0.184
REMARK 3 FREE R VALUE 0.228
REMARK 3 FREE R VALUE TEST SET SIZE (%) 7.
REMARK 3 FREE R VALUE TEST SET COUNT 1198
REMARK 3 ESTIMATED ERROR OF FREE R VALUE 0.24
23
REMARK 3 PARAMETER FILE 1 PARHCSDX.PRO
REMARK 3 PARAMETER FILE 2 NULL
REMARK 3 TOPOLOGY FILE 1 TOPHCSDX.PRO
REMARK 3 TOPOLOGY FILE 2 NULL
REMARK 3 OTHER REFINEMENT REMARKS NULL
REMARK 4 1UNE COMPLIES WITH FORMAT V. 2.2, 16-DEC-1996
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200 EXPERIMENT TYPE X-RAY DIFFRACTION
REMARK 200 DATE OF DATA COLLECTION 26-JAN-1996
REMARK 200 TEMPERATURE (KELVIN) 291
REMARK 200 PH 7.2
REMARK 200 NUMBER OF CRYSTALS USED 1
REMARK 200
REMARK 200 SYNCHROTRON (Y/N) N
REMARK 200 RADIATION SOURCE NULL
REMARK 200 BEAMLINE NULL
REMARK 200 X-RAY GENERATOR MODEL R-AXIS IIC
REMARK 200 MONOCHROMATIC OR LAUE (M/L) M
REMARK 200 WAVELENGTH OR RANGE (A) 1.5418
REMARK 200 MONOCHROMATOR GRAPHITE
REMARK 200 OPTICS NULL
REMARK 200
24
REMARK 200 IN THE HIGHEST RESOLUTION SHELL.
REMARK 200 HIGHEST RESOLUTION SHELL, RANGE HIGH (A) 1.5
REMARK 200 HIGHEST RESOLUTION SHELL, RANGE LOW (A) 1.55
REMARK 200 COMPLETENESS FOR SHELL (%) 63.
REMARK 200 DATA REDUNDANCY IN SHELL 3.7
REMARK 200 R MERGE FOR SHELL (I) 0.172
REMARK 200 R SYM FOR SHELL (I) NULL
REMARK 200 FOR SHELL NULL
REMARK 200
REMARK 200 METHOD USED TO DETERMINE THE STRUCTURE THE HIGH RESOLUTION
REMARK 200 ATOMIC COORDINATES OF THE WILD TYPE (PDB ENTRY 1BP2)
REMARK 200 WERE USED AS THE STARTING MODEL FOR REFINEMENT.
REMARK 200 SOFTWARE USED X-PLOR
REMARK 200 STARTING MODEL WILD TYPE (PDB ENTRY 1BP2)
REMARK 200
REMARK 200 REMARK NULL
REMARK 280
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY
REMARK 290 SYMMETRY OPERATORS FOR SPACE GROUP P 21 21 21
REMARK 290
REMARK 290 SYMOP  SYMMETRY
REMARK 290 NNNMMM OPERATOR
REMARK 290 1555   X,Y,Z
REMARK 290 2555   1/2-X,-Y,1/2+Z
REMARK 290 3555   -X,1/2+Y,1/2-Z
REMARK 290 4555   1/2+X,1/2-Y,-Z
25
DBREF  1UNE 1 123 SWS P00593 PA2_BOVIN 23 145
SEQADV 1UNE ASN 122 SWS P00593 LYS 144 CONFLICT
SEQRES  1 123 ALA LEU TRP GLN PHE ASN GLY MET ILE LYS CYS LYS ILE
SEQRES  2 123 PRO SER SER GLU PRO LEU LEU ASP PHE ASN ASN TYR GLY
SEQRES  3 123 CYS TYR CYS GLY LEU GLY GLY SER GLY THR PRO VAL ASP
SEQRES  4 123 ASP LEU ASP ARG CYS CYS GLN THR HIS ASP ASN CYS TYR
SEQRES  5 123 LYS GLN ALA LYS LYS LEU ASP SER CYS LYS VAL LEU VAL
SEQRES  6 123 ASP ASN PRO TYR THR ASN ASN TYR SER TYR SER CYS SER
SEQRES  7 123 ASN ASN GLU ILE THR CYS SER SER GLU ASN ASN ALA CYS
SEQRES  8 123 GLU ALA PHE ILE CYS ASN CYS ASP ARG ASN ALA ALA ILE
SEQRES  9 123 CYS PHE SER LYS VAL PRO TYR ASN LYS GLU HIS LYS ASN
SEQRES 10 123 LEU ASP LYS LYS ASN CYS
HET    CA 124 1
HETNAM CA CALCIUM ION
FORMUL 2 CA CA1 2
FORMUL 3 HOH 134(H2 O1)
HELIX  1 1 LEU   2 LYS  12 1 11
HELIX  2 2 PRO  18 ASP  21 1  4
HELIX  3 3 ASP  40 LYS  57 1 18
HELIX  4 4 ASP  59 VAL  63 1  5
HELIX  5 5 ALA  90 LYS 108 1 19
HELIX  6 6 LYS 113 HIS 115 5  3
SHEET  1 A 2 TYR 75 SER 78 0
SHEET  2 A 2 GLU 81 CYS 84 -1 N THR 83 O SER 76
SSBOND 1 CYS 11 CYS 77
26
REMARK 3 FIT IN THE HIGHEST RESOLUTION BIN.
REMARK 3 TOTAL NUMBER OF BINS USED 8
REMARK 3 BIN RESOLUTION RANGE HIGH (A) 1.5
REMARK 3 BIN RESOLUTION RANGE LOW (A) 1.55
REMARK 3 BIN COMPLETENESS (WORKING+TEST) (%) 63.
REMARK 3 REFLECTIONS IN BIN (WORKING SET) 1176
REMARK 3 BIN R VALUE (WORKING SET) 0.340
REMARK 3 BIN FREE R VALUE 0.352
REMARK 3 BIN FREE R VALUE TEST SET SIZE (%) 7.
REMARK 3 BIN FREE R VALUE TEST SET COUNT 81
REMARK 3 ESTIMATED ERROR OF BIN FREE R VALUE NULL
REMARK 3
REMARK 3 NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
REMARK 3 PROTEIN ATOMS 957
REMARK 3 NUCLEIC ACID ATOMS 0
REMARK 3 HETEROGEN ATOMS 1
REMARK 3 SOLVENT ATOMS 134
REMARK 3
REMARK 3 B VALUES.
REMARK 3 FROM WILSON PLOT (A2) NULL
REMARK 3 MEAN B VALUE (OVERALL, A2) NULL
REMARK 3 LOW RESOLUTION CUTOFF (A) NULL
REMARK 3
REMARK 3 CROSS-VALIDATED ESTIMATED COORDINATE ERROR.
27
ATOM  1 N   ALA 1 13.830 17.835 32.697 1.00 11.41
ATOM  2 CA  ALA 1 12.869 16.725 32.889 1.00 11.31
ATOM  3 C   ALA 1 12.106 16.547 31.592 1.00 12.00
ATOM  4 O   ALA 1 12.366 17.226 30.614 1.00 11.37
ATOM  5 CB  ALA 1 11.891 17.029 34.056 1.00 11.89
ATOM  6 N   LEU 2 11.150 15.638 31.585 1.00 13.43
ATOM  7 CA  LEU 2 10.392 15.362 30.376 1.00 14.98
ATOM  8 C   LEU 2  9.556 16.543 29.879 1.00 14.65
ATOM  9 O   LEU 2  9.465 16.764 28.657 1.00 13.62
ATOM 10 CB  LEU 2  9.522 14.116 30.561 1.00 15.03
ATOM 11 CG  LEU 2  8.919 13.539 29.291 1.00 17.13
ATOM 12 CD1 LEU 2 10.038 13.103 28.360 1.00 17.29
ATOM 13 CD2 LEU 2  8.027 12.361 29.656 1.00 17.65
ATOM 14 N   TRP 3  8.960 17.305 30.796 1.00 14.18
ATOM 15 CA  TRP 3  8.157 18.443 30.347 1.00 16.10
ATOM 16 C   TRP 3  8.998 19.448 29.543 1.00 14.26
ATOM 17 O   TRP 3  8.580 19.864 28.472 1.00 14.34
ATOM 18 CB  TRP 3  7.359 19.103 31.491 1.00 19.02
ATOM 19 CG  TRP 3  8.163 19.810 32.534 1.00 24.63
ATOM 20 CD1 TRP 3  8.699 19.262 33.683 1.00 25.51
ATOM 21 CD2 TRP 3  8.505 21.199 32.555 1.00 27.29
ATOM 22 NE1 TRP 3  9.348 20.230 34.403 1.00 27.56
ATOM 23 CE2 TRP 3  9.253 21.428 33.743 1.00 28.36
ATOM 24 CE3 TRP 3  8.258 22.278 31.686 1.00 27.60
ATOM 25 CZ2 TRP 3  9.754 22.695 34.083 1.00 28.94
ATOM 26 CZ3 TRP 3  8.761 23.542 32.026 1.00 28.78
ATOM 27 CH2 TRP 3  9.503 23.735 33.216 1.00 29.43
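As an illustration of how a tool might read these coordinate records, here is a minimal Python sketch. It assumes the ten whitespace-separated fields seen in the excerpt above (no chain identifier); a production parser would use the fixed column positions of the PDB format instead. The function names `parse_atoms` and `distance` are hypothetical.

```python
import math

def parse_atoms(pdb_text):
    """Parse ATOM lines into dictionaries.

    Assumption: ten whitespace-separated fields per record
    (ATOM, serial, atom name, residue name, residue number,
    x, y, z, occupancy, B-factor), as in the excerpt above.
    """
    atoms = []
    for line in pdb_text.splitlines():
        f = line.split()
        if len(f) == 10 and f[0] == "ATOM":
            atoms.append({
                "serial": int(f[1]),
                "name": f[2],
                "resname": f[3],
                "resnum": int(f[4]),
                "xyz": (float(f[5]), float(f[6]), float(f[7])),
                "occupancy": float(f[8]),
                "bfactor": float(f[9]),
            })
    return atoms

def distance(a, b):
    """Euclidean distance between two parsed atoms, in Angstroms."""
    return math.dist(a["xyz"], b["xyz"])

excerpt = """\
ATOM 2 CA ALA 1 12.869 16.725 32.889 1.00 11.31
ATOM 7 CA LEU 2 10.392 15.362 30.376 1.00 14.98
"""
ca_atoms = [a for a in parse_atoms(excerpt) if a["name"] == "CA"]
d = distance(ca_atoms[0], ca_atoms[1])
print(round(d, 2))  # about 3.78, the CA-CA distance of residues 1 and 2
```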
28 CAMBRIDGE STRUCTURAL DATABASE
- The Cambridge Structural Database (CSD)
- Software for search, retrieval, display and analysis of CSD contents
- The CSD records bibliographic, 2D chemical and 3D structural results from crystallographic analyses of organics, organometallics and metal complexes. (Both X-ray and neutron diffraction studies are included, for small and medium-sized compounds containing up to 500 atoms including hydrogens.)
29 THREE DBA COMPONENTS
Database Integrity
Database Security
Database Recovery
30 DATABASE INTEGRITY
The major issue for database management is to ensure that the data in the database is accurate, correct, valid and consistent. Any inconsistency between two or more entries that represent the same entity demonstrates a lack of integrity.
Database technology cannot do very much to protect users against data errors made in the outside world before the data has been entered into the system.
However, certain safety measures can be built into a database to ensure that errors within the system are minimized.
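One such built-in safety measure is a declared constraint. The sketch below uses Python's standard `sqlite3` module purely as an illustration; the table layout is an assumption, and the identifier "XXXX" is a made-up placeholder rather than a real entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entry (
        pdb_id     TEXT PRIMARY KEY,              -- one row per entity
        resolution REAL CHECK (resolution > 0)    -- physically meaningful
    )
    """
)
conn.execute("INSERT INTO entry VALUES (?, ?)", ("1UNE", 1.5))

# Both rows below violate a constraint: a duplicate key and a
# negative resolution. The database refuses them.
rejected = []
for row in [("1UNE", 1.5), ("XXXX", -2.0)]:
    try:
        conn.execute("INSERT INTO entry VALUES (?, ?)", row)
    except sqlite3.IntegrityError as exc:
        rejected.append((row, str(exc)))

count = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
print(count, len(rejected))
```

Constraints of this kind catch inconsistencies inside the system; they cannot detect a value that was wrong before it was typed in.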
31 DATA RECOVERY
The process of recovery involves restoring the database to a state which is known to be correct following some kind of failure.
The technique of redundancy is used, in the sense that it has to be possible to recover the database to its correct state from information available somewhere else in the system.
The most common way to achieve this is to dump the contents of the database at a defined frequency onto another medium, such as magnetic tape or optical disk, which is then stored in a safe place.
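The dump-and-restore idea can be sketched with the backup facility of Python's standard `sqlite3` module. This is only an illustration under simplifying assumptions: in-memory databases stand in for the live system and for the tape or optical disk, and `dump_database` is a hypothetical helper.

```python
import sqlite3

def dump_database(source, target):
    """Copy the full contents of the source database into the target."""
    source.backup(target)

# The live database.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE entry (pdb_id TEXT)")
live.execute("INSERT INTO entry VALUES ('1UNE')")
live.commit()

# The periodic dump; a second connection stands in for another medium.
archive = sqlite3.connect(":memory:")
dump_database(live, archive)

# Simulate a failure that destroys the live data.
live.execute("DELETE FROM entry")
live.commit()

# Recovery: rebuild a database in a known-correct state from the dump.
recovered = sqlite3.connect(":memory:")
dump_database(archive, recovered)
restored = recovered.execute("SELECT pdb_id FROM entry").fetchall()
print(restored)
```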
32 DATABASE SECURITY
The DBA has to ensure that adequate measures are taken to prevent unauthorized disclosure, alteration or destruction of both the data within the database and the database software itself.
A password and a list of privileges attached to it are most commonly used to control user access rights to database information.
33THREE COMPONENTS OF DATABASE
Development of a database structure that allows
the storage and maintenance of the required data
Data entry, maintenance and management
Retrieval of the data by end users equipped
with suitable analysis and display tools
34 DATABASE ADMINISTRATION
The database administrator (DBA) is a person or a group of persons responsible for overall control of the database systems.
The DBA is usually answerable not only for the design of the database, but also for the choice of DBMS, its implementation and the training of everyone involved in running and using the database.
Once the data is entered, it has to be maintained and kept up to date.
35 PROBLEMS WITH THE DATA
Incomplete data
Noisy data
Temporal data
An extremely large amount of data
Non-textual data
36 INCOMPLETE DATA
Some data may be missing (e.g., some fields may be left blank).
Sometimes the fact that data is missing is itself a valuable piece of information.
37 NOISY DATA
A field may contain incorrectly entered information.
We do not know how this affects the certainty factor (or confidence level) of the results.
38 TEMPORAL DATA
Since databases grow rapidly, how can new data be incrementally added to our results?
What effect should this have on the knowledge discovery process?
39 AN EXTREMELY LARGE AMOUNT OF DATA
Some datasets can grow significantly over time.
How should such datasets be processed?
One option is parallel processing, in which each of n processors handles approximately 1/n of the data, so that the whole job takes approximately 1/n of the time.
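The 1/n split can be sketched as follows. This is a minimal illustration: it uses a thread pool from the standard library to keep the example short, whereas genuinely CPU-bound work would normally use separate processes or machines. The task (counting cysteines) and the function names are assumptions chosen for the example; the four sequences are SEQRES rows 1, 3, 4 and 5 of 1UNE in one-letter code.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n):
    """Split items into n contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def count_cysteines(sequences):
    """The per-worker task: count 'C' residues in a chunk of sequences."""
    return sum(seq.count("C") for seq in sequences)

def parallel_count(sequences, n_workers):
    """Give each worker about 1/n of the data and add up the results."""
    chunks = split_into_chunks(sequences, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(count_cysteines, chunks))

sequences = ["ALWQFNGMIKCKI", "CYCGLGGSGTPVD", "DLDRCCQTHDNCY", "KQAKKLDSCKVLV"]
total = parallel_count(sequences, n_workers=2)
print(total)
```

The speed-up is only approximately n-fold, because splitting the data and combining the partial results also take time.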
40NON-TEXTUAL DATA
There are many types of data that need to be
manipulated, including image data, multimedia
data (Video and Sound), spatial data in GIS and
user defined data types
41 Data -> Selection -> Target data -> Preprocessing -> Cleaned data -> Transformation -> Data Mining -> Patterns -> Interpretation, evaluation, validation -> Knowledge
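The stages of this process can be sketched as a chain of small functions. The records below are invented toy data, and the stage functions are hypothetical; the point is only to show how the output of each stage feeds the next.

```python
from collections import Counter

# Invented toy records; None marks a missing value.
records = [
    {"method": "X-RAY", "resolution": 1.5},
    {"method": "X-RAY", "resolution": None},
    {"method": "NMR", "resolution": None},
    {"method": "X-RAY", "resolution": 2.1},
]

def select(data):
    """Selection: keep only the records relevant to the question."""
    return [r for r in data if r["method"] == "X-RAY"]

def clean(data):
    """Preprocessing: drop records with missing values."""
    return [r for r in data if r["resolution"] is not None]

def transform(data):
    """Transformation: bin the resolution into coarse classes."""
    return ["high" if r["resolution"] < 2.0 else "medium" for r in data]

def mine(labels):
    """Data mining: summarise the pattern as class frequencies."""
    return Counter(labels)

patterns = mine(transform(clean(select(records))))
print(dict(patterns))
```

Interpretation and evaluation of the resulting pattern remain a task for the scientist.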
42 Standalone machine application
Web application
43 PERL
- Very powerful for string manipulation
- Uses CGI as the interface
JAVA
- Application programming (standalone machine)
- Applet programming (Web oriented)
- Useful for graphics applications over the WWW
44 WHAT IS PERL?
PERL is an interpreted language optimized for scanning arbitrary text files and extracting information from those text files.
The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant and minimal).
PERL uses sophisticated pattern matching techniques to scan large amounts of data very quickly. Although optimized for scanning text, PERL can also deal with binary data and can make dbm files look like associative arrays.
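The "dbm file as associative array" idea is not specific to PERL. The sketch below shows the same idea with Python's standard `dbm.dumb` module; it is an analogy, not PERL code, and the file name and stored record are chosen only for illustration.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "entries")

    # Writing: the dbm file is used exactly like a dictionary.
    with dbm.dumb.open(path, "c") as db:
        db["1UNE"] = "PHOSPHOLIPASE A2"

    # Reading it back in a separate session shows that the
    # "array" is persistent. Values come back as bytes.
    with dbm.dumb.open(path, "r") as db:
        molecule = db["1UNE"].decode("ascii")
        known = "1UNE" in db

print(molecule, known)
```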
45 CGI (Common Gateway Interface)
The Common Gateway Interface (CGI), as its name implies, provides a gateway between a user (client) and a command/logic oriented server.
CGI performs the task of translation: it translates the needs of clients into server requests and then translates the server's replies back for the clients.
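The two translation steps can be sketched in Python. This is a simplified illustration rather than a complete CGI program: the function takes the query string directly instead of reading it from the web server's environment, and the flat-file layout, the `id` parameter and the name `handle_request` are assumptions.

```python
from html import escape
from urllib.parse import parse_qs

# A stand-in for the flat file on disk: one record per line.
FLAT_FILE = [
    "1UNE|PHOSPHOLIPASE A2|BOS TAURUS",
]

def handle_request(query_string, records=FLAT_FILE):
    """Translate a client query into a lookup, and the result into HTML."""
    # Step 1: client needs -> server request.
    params = parse_qs(query_string)
    wanted = params.get("id", [""])[0].upper()
    hits = [r.split("|") for r in records if r.split("|")[0] == wanted]

    # Step 2: server reply -> HTML for the client.
    if not hits:
        return "<p>No entry found for {}</p>".format(escape(wanted))
    rows = "".join(
        "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(*map(escape, h))
        for h in hits
    )
    return "<table>{}</table>".format(rows)

page = handle_request("id=1une")
print(page)
```

Escaping every value before it is placed in the page keeps user input from being interpreted as markup.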
46Client Java Servlet Server
Client CGI Server
47 The RMI (Remote Method Invocation) concept is very useful for multitier architectures.
EXAMPLE: www.hotmail.com, www.google.com
48Software (Search Engine)
Server
Remote machine
Client
RMI
49 WEB-Page
Java Server Pages (Sun Microsystems)
Active Server Pages (Microsoft Corporation)
Both are useful for dynamic web page creation
50 GRAPHICAL USER INTERFACE (GUI)
The programmer can quickly design the user interface by drawing and arranging the screen elements rather than writing raw code.
A GUI is easy for users to visualize.
It is user friendly.
Example: MS-WINDOWS operating systems
51 GUI (Graphical User Interface)
ActiveX (Microsoft Corporation)
Java Swing (Sun Microsystems)
Buttons, boxes and pull-down menus (window based)
52 VB (Visual Basic)
An application development language.
Supports graphics.
Good for standalone applications.
Web programming is not possible directly, but scripting languages (VBScript or JavaScript) can be used to make it Web oriented.
53 VC
System application programming
Almost the same as VB
Additional advantage: system-side programming
54 WORLD WIDE WEB (WWW)
The World Wide Web is the best known and fastest growing Internet service. It is a way of accessing information already on the Internet, using the concept of hypertext to link information. As with FTP, any type of digital document (text, images, artwork, movies and sounds) on a remote computer can be made the target of a hyperlink. The protocol used for accessing such information is HTTP (Hyper Text Transfer Protocol).
The hyperlinked documents are known as HTML documents. They are written in a special language called HTML, which stands for Hyper Text Markup Language. HTML is simply ASCII text with tags embedded in it.
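Because HTML is plain text with embedded tags, a short program can pull the hyperlinks out of a page. The sketch below uses Python's standard `html.parser` module; the sample document and the class name `LinkCollector` are made up for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href target of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A made-up HTML document: ASCII text with embedded tags.
document = (
    "<html><body><h1>Tools</h1>"
    '<a href="/psst">PSST</a> and <a href="/cap">CAP</a>'
    "</body></html>"
)
collector = LinkCollector()
collector.feed(document)
print(collector.links)
```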
55 DBMS and RDBMS
DBMS: dBase, MS-Access, MySQL server, FoxPro (partially RDBMS)
RDBMS: Sybase, Oracle, SQL Server
56 DATABASE: a bunch of tables
TABLES: store numerous rows of information
FIELDS: the little boxes inside a table
57 SQL Server: an expensive, heavyweight database system used by corporations that need to store huge amounts of information.
ORACLE: another database system.
Microsoft Access: the best way to create your own Access database. The tool ships with the professional edition of Office 97 and lets you design your own tables and individual fields graphically.
MySQL: yet another option.
58Typical Web Search
Keywords
Search Engine
Output
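The keywords-to-output flow can be sketched with a tiny inverted index. This is a toy illustration, not how any particular search engine is built; the three "documents" are the expanded names of tools described later in this talk, and the function names are hypothetical.

```python
from collections import defaultdict

def build_index(documents):
    """Map every word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, keywords):
    """Return the ids of documents containing every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*sets)) if sets else []

documents = {
    "PSST": "protein sequence search tool",
    "MSGS": "motif search in genome sequences",
    "CAP": "conformation angles package",
}
index = build_index(documents)
print(search(index, ["search"]))
print(search(index, ["search", "protein"]))
```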
59Web Browser
Form O/p (in HTML)
HTML
W W W
Form O/p (in HTML)
HTML
Flat file
CGI-Program
60Mirror sites
61PDB GDB SCOP
62PROTEIN DATABANK
PDB
144.16.71.2 144.16.49.185 203.90.127.146 (VPN
users)
63PDB-MIRROR MACHINE
3.40 GHz PIV machine 2 GB RD RAM 1 Tera-byte Hard
Disk 32 MB Graphics Card Powered by Intel
SOLARIS
64 PDB
The PDB server is up to date and as of now contains 24,080 coordinate entries (21,788 proteins, 992 protein and nucleic acid complexes, 1282 nucleic acids).
66GENOME DATABASE
GDB
144.16.71.10 144.16.49.185 203.90.127.147 (VPN
users)
67GDB-MIRROR site machine
3.40 GHz PIV machine 2 GB RD RAM 1 Tera-byte
Hard Disk 32 MB Graphics Card Powered by Intel
SOLARIS
69 Structural Classification of Proteins
SCOP
144.16.71.2/scop 144.16.49.78/scop 203.90.127.146/scop (for VPN users)
70SCOP
The SCOP mirror site at the institute has been
created and maintained with the latest copy. Now
the mirror site (version 1.63, May 2003 release)
contains 49,497 domains from 18,946 PDB entries.
72 Packages developed at the Bioinformatics Centre
Raman Building, Indian Institute of Science, Bangalore 560 012
Dr. K. SEKAR
E-mail: sekar_at_physics.iisc.ernet.in
74 GENOME SEQUENCES
75 - MSGS
- Motif Search in Genome Sequences
- A web based interactive display tool
- P. Selvarani, B.N. Vijay, V. Shanthi, S. Saravanan and K. Sekar
- (To be submitted)
- http://144.16.71.10/msgs (Internet users)
- http://203.90.127.147/msgs (VPN users)
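The core operation of a motif search can be sketched in a few lines of Python. This is not the MSGS implementation, only an illustration of the idea; it assumes that a lowercase `x` in the motif matches any single residue, and `find_motif` is a hypothetical name. The test sequence is the first 39 residues of 1UNE in one-letter code.

```python
import re

def find_motif(sequence, motif):
    """Return 1-based start positions of a motif, overlaps included.

    Assumption: a lowercase 'x' in the motif matches any single residue.
    """
    pattern = "".join("." if ch == "x" else re.escape(ch) for ch in motif)
    # A lookahead matches without consuming characters, so
    # overlapping occurrences are all reported.
    lookahead = re.compile("(?=" + pattern + ")")
    return [m.start() + 1 for m in lookahead.finditer(sequence)]

sequence = "ALWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVD"
print(find_motif(sequence, "FN"))
print(find_motif(sequence, "CxC"))
```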
77 THGS
A Web based database of Transmembrane Helices in Genome Sequences
S.A. Fernando, P. Selvarani, Soma Das, Ch. Kiran Kumar, S. Mondal, S. Ramakumar and K. Sekar
NUCL. ACIDS RES. (2004), 32, D125-D128
http://144.16.71.10/thgs (Internet users)
http://203.90.127.147/thgs (VPN users)
80 PROTEIN SEQUENCES
81 - PSST
- Protein Sequence Search Tool
- A web based interactive search engine
- S. Saravanan, A. Ajmal Khan and K. Sekar
- CURR. SCI. (2000), 550-552
- http://144.16.71.10/psst (Internet users)
- http://203.90.127.147/psst (VPN users)
83PROTEIN STRUCTURES
84 - BSDD
- Biomolecules Segment Display Device
- A web based interactive display tool
- P. Selvarani, V. Shanthi, C.K. Rajesh, S. Saravanan and K. Sekar
- J. MOL. GRAPH. MODEL. (2004) (In the press)
- http://144.16.71.2/bsdd (Internet users)
- http://203.90.127.146/bsdd (VPN users)
89 - PDB Goodies
- A web-based GUI to manipulate the Protein Data Bank file
- A.S.Z. Hussain, V. Shanthi, S.S. Sheik, J. Jeyakanthan, P. Selvarani and K. Sekar
- ACTA CRYST. (2002), D58, 1385-1386
- http://144.16.71.11/pdbgoodies (Internet users)
- http://203.90.127.149/pdbgoodies (VPN users)
91 - CAP
- Conformation Angles Package
- Displaying the conformation angles of side chains in proteins
- S.S. Sheik, P. Sundararajan, V. Shanthi and K. Sekar
- BIOINFORMATICS (2003), 19, 1043-1044
- http://144.16.71.146/cap (Internet users)
- http://203.90.127.148/cap (VPN users)
93 WAP
A Web-based package to calculate geometrical parameters between water oxygen and protein atoms
V. Shanthi, C.K. Rajesh, J. Jayalakshmi, V.G. Vijay and K. Sekar
J. APPL. CRYST. (2003), 36, 167-168
http://144.16.71.11/wap (Internet users)
http://203.90.127.149/wap (VPN users)
95 RP
Ramachandran Plot on the web
S.S. Sheik, P. Sundararajan, A.S.Z. Hussain and K. Sekar
BIOINFORMATICS (2002), 18, 1548-1549
http://144.16.71.146/rp (Internet users)
http://203.90.127.148/rp (VPN users)
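A Ramachandran plot is built from backbone torsion angles, each defined by four consecutive atoms. The sketch below shows one common way to compute such an angle in plain Python; it is not the code behind RP or CAP, and the helper names are hypothetical. The coordinates are those of C (ALA 1), N, CA and C (LEU 2) from the 1UNE ATOM records shown earlier, which define the phi angle of LEU 2.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond (-180 to 180)."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))

# phi of LEU 2: C(ALA 1) - N(LEU 2) - CA(LEU 2) - C(LEU 2)
phi_leu2 = dihedral(
    (12.106, 16.547, 31.592),
    (11.150, 15.638, 31.585),
    (10.392, 15.362, 30.376),
    (9.556, 16.543, 29.879),
)
print(round(phi_leu2, 1))
```

Computing phi and psi in this way for every residue and plotting one against the other gives the Ramachandran plot.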
97 SSEP
Secondary Structural Elements of Proteins
V. Shanthi, P. Selvarani, Ch. Kiran Kumar, C.S. Mohire and K. Sekar
NUCL. ACIDS RES. (2003), 31, 3404-3405
http://144.16.71.148/ssep (Internet users)
http://203.90.127.150/ssep (VPN users)
102 SEM
Symmetry Equivalent Molecules
A.S.Z. Hussain, Ch. Kiran Kumar, C.K. Rajesh, S.S. Sheik and K. Sekar
NUCL. ACIDS RES. (2003), 31, 3356-3358
http://144.16.71.11/sem (Internet users)
http://203.90.127.149/sem (VPN users)
108 CADB
Conformational Angles DataBase of proteins
S.S. Sheik, P. Ananthalakshmi, G. Ramya Bhargavi and K. Sekar
NUCL. ACIDS RES. (2003), 31(1), 448-451
http://144.16.71.148/cadb (Internet users)
http://203.90.127.150/cadb (VPN users)
110 Non-homologous (25% identity) protein chains
Hobohm & Sander, Protein Sci. 3, 522-524

Method               No. of chains
X-Ray Diffraction    1,276 (25)
NMR                    460 (2)
Fibre Diffraction        3 (0)
Others                   0 (5)
Total no. of chains  1,739 (32)

Total no. of residues
X-Ray Diffraction    2,53,623
NMR                    37,281

Numbers within parentheses denote files having Cα coordinates.
111 Non-homologous (90% identity) protein chains
Hobohm & Sander, Protein Sci. 3, 522-524

Method               No. of chains
X-Ray Diffraction    5,147 (26)
NMR                    993 (5)
Fibre Diffraction        6 (0)
Others                   0 (5)
Total no. of chains  6,146 (36)

Total no. of residues
X-Ray Diffraction    11,29,466
NMR                     72,145

Numbers within parentheses denote files having Cα coordinates.
112 LySDB
Lysozyme Structural DataBase
K. S. Mohan, Soma Das, C. Chockalingham, V. Shanthi and K. Sekar
ACTA CRYST. (2004), D60, 597-600
http://144.16.71.2/lysdb (Internet users)
http://203.90.127.146/lysdb (VPN users)
114 TAKE HOME MESSAGE
Data mining is nothing but exploiting the hidden trends in your data.
Create your own derived database.
No one tool or set of tools is universally applicable.
Present the data in a useful format, such as a graph or a table.
115 Department of Biotechnology, Ministry of Science & Technology, Govt. of India, India
Jai Vigyan National Science Foundation, Govt. of India, India