Title: EScience
1 E-Science GRIDs: Opportunities for Information Management (including Databases)
- Carole Goble
- University of Manchester
Acknowledgements: Keith Jeffery, Graham Kemp
2 Take homes
- The GRID's data deluge implies data management is crucial,
  - but the GRID is as important to database research
- Scale means automation at every point in the information lifecycle
  - which means moving information from machine-readable to machine-understandable
  - which means metadata matters
- Information-based research means fusing and interoperating information and processes
  - which means middleware matters
- Databases, digital libraries, documents, text, images, algorithms
  - which means holistic information management matters
- Delivering the right information to the right person or process at the right time
3 E-Science Data
- Complex, multimedia and multidisciplinary
  - experiments, observations, interpretations, images, literature, models etc.
- Archival
  - incomplete, inconsistent, speculative, accumulative
  - attribution and quality control
- Curation
  - annotation → expert added-value
  - searching & linking → making relationships
  - complex queries
- Update pattern
  - monotonic growth
  - evolution
  - shared
- Access pattern
  - searching & linking → making relationships
  - complex queries
- Distributed
4 What's the data like?
- Multiple views
- Highly Interrelated
- (In)Stability & Evolution
- Quality & Provenance
- Fidelity & Variability
- Longevity & Cumulative
- Unknown work in progress
- Bursty acquisition
- Digital library repository
A duty of guardianship
5 What do we do with information?
- Acquire & Extract
- Describe & Interpret
- Structure & Organise
- Publish & Share
- Access & Retrieve
- Search & Browse
- Discover
- Filter & Forage
- Curate and add value
- Migrate and stage
- Interoperate & Fuse
- Mine & Predict
- Compare & Contrast
- Preserve
- Reuse & Decommission
6 What do we do with information?
Personal creativity. Science is not linear.
7 Database R&D Landscape
- Academic
  - Led by CS theory
  - Delta publications
  - Core DB technology
  - Non-scaleable technology
  - E.g. Deductive Object
- Research Labs
  - Dominate take up
  - Transaction processing
  - Query optimisation
  - Indexing
  - Repositories
  - E.g. Active databases
  - E.g. Object-relational
- Commercial
  - Mainly basic and near market
  - Performance & Security
  - Functionality & Ease of use
  - E.g. Relational
8 Information R&D Landscape
Databases
Information Retrieval
Knowledge representation
Hypermedia
Document Management
Distributed systems
Digital Libraries
9 Information R&D Landscape
Databases
XML-QL
Dynamic pages
Information Retrieval
Knowledge representation
RDF Ontologies
Search engines
Semantic Web
Open hypermedia
XML
Hypermedia
Document Management
XML as message, http
Dublin core metadata
Distributed systems
Digital Libraries
10 Emerging trends
- Whole information delivery
- workflow, user interfaces
- Experience applications valued
- Apply theory to real world problems/applications
- Engineer for world-sized applications
11 Layered GRIDs: e-science in silico
Data to knowledge
Control
12 GRID IM
- Software component reuse, design reuse
- Data mining, data visualisation
- Metadata, middleware, homogeneous access to
heterogeneous resources, intelligent retrieval,
information modelling, data warehousing,
workflow, information/content distribution,
active content management (distribution,
security), consistency management (versions,
quality), curation management, semi-automatic
annotation
- Warehousing, distributed databases, streaming,
near-line storage, large objects, efficient
access mechanisms, data staging, query
optimisation
13 Key pillars: M & M
- Metadata
  - To describe the information and computational resources
  - Essential for navigation, integration, analysis, use
  - Schemas, thesauri, domain ontologies
  - Catalogues with keyword terms
- Middleware
  - Making Software Work Together
  - An interface to data from remote programs
  - Data independence: coping with change
  - Information heterogeneity mediation
14 Metadata
- Data: extension
  - experimental, interpretations, images, literature, models etc.
- Metadata: intension
  - data about the data
  - interpretation or meaning of the data
  - model of the objects, properties, constraints, relationships that hold on the data
- Meta Data Coalition
- Dublin Core
- Resource Description Framework etc.
- http://www.llnl.gov/liv_comp/metadata
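The extension/intension split can be illustrated with a small sketch. The element names below are the fifteen Dublin Core elements; the record values are illustrative, borrowed from the SWISS-PROT entry shown on the next slide, and the helper `unknown_elements` is a hypothetical name, not part of any standard.

```python
# Data (extension): the raw observation itself (first residues only).
sequence = "MKIINIGVLAH"

# Metadata (intension): a description of the data using Dublin Core element names.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Illustrative values only; a real catalogue would define its own conventions.
record = {
    "identifier": "P21598",
    "title": "Tetracycline resistance protein TetM",
    "subject": ["antibiotic resistance", "GTP-binding"],
    "date": "1991-05-01",
    "format": "text/plain",
}


def unknown_elements(metadata):
    """Return the keys that are not Dublin Core elements, sorted by name."""
    return sorted(set(metadata) - DUBLIN_CORE_ELEMENTS)


print(unknown_elements(record))
print(unknown_elements({"title": "x", "colour": "red"}))
```

Agreeing on a fixed element set is what lets a program, and not just a person, decide whether a description is usable.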
15 Controlled Vocabularies Considered Essential
SWISSPROT:TET9_ENTFA
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DT   01-MAY-1991 (REL. 18, CREATED)
DT   01-MAY-1991 (REL. 18, LAST SEQUENCE UPDATE)
DT   01-OCT-1993 (REL. 27, LAST ANNOTATION UPDATE)
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
GN   TETM(916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
RA   BURDETT V.;
RL   NUCLEIC ACIDS RES. 18:6137-6137(1990).
CC   -!- FUNCTION: ABOLISH THE INHIBITORY EFFECT OF TETRACYCLIN ON PROTEIN
CC       SYNTHESIS BY A NON-COVALENT MODIFICATION OF THE RIBOSOMES.
CC   -!- SIMILARITY: VERY HIGH TO OTHER TETM/TETO PROTEINS.
CC   -!- SIMILARITY: TO GTP-BINDING ELONGATION FACTORS.
DR   EMBL; X56353; G47062; -.
DR   PIR; S13142; S13142.
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
FT   NP_BIND      10     17       GTP (BY SIMILARITY).
FT   NP_BIND      74     78       GTP (BY SIMILARITY).
SQ   SEQUENCE   639 AA;  72464 MW;  523F1359 CRC32;
>TET9_ENTFA
MKIINIGVLAH
VDAGKTTLTESLLYNSGAITELGSVDKGTTRTDNTLLERQRGITIQTGI
TSFQWENTKVNIIDTPGHMDFLAEVYRSLSVLDGAILLISAKDGVQAQTR
ILFHALRKMG IPTIFFINKIDQNGIDLSTVYQDIKEKLSAEIVIKQKVE
LYPNVCVTNFTESEQWDTVIE GNDDLLEKYMSGKSLEALELEQEESIRF
QNCSLFPLYHGSAKSNIGIDNLIEVITNKFYS STHRGPSELCGNVFKIE
YTKKRQRLAYIRLYSGVLHLRDSVRVSEKEKIKVTEMYTSING ELCKID
RAYSGEIVILQNEFLKLNSVLGDTKLLPQRKKIENPHPLLQTTVEPSKPE
QREM LLDALLEISDSDPLLRYYVDSTTHEIILSFLGKVQMEVISALLQE
KYHVEIEITEPTVIY MERPLKNAEYTIHIEVPPNPFWASIGLSVSPLPL
GSGMQYESSVSLGYLNQSFQNAVMEG IRYGCEQGLYGWNVTDCKICFKY
GLYYSPVSTPADFRMLAPIVLEQVLKKAGTELLEPYL SFKIYAPQEYLS
RAYNDAPKYCANIVDTQLKNNEVILSGEIPARCIQEYRSDLTFFTNGR S
VCLTELKGYHVTTGEPVCQPRRPNSRIDKVRYMFNKIT
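As a rough sketch of why controlled vocabularies matter to programs, the snippet below parses a few lines of such an entry and checks its keywords against an agreed term list. It assumes the usual SWISS-PROT flat-file layout (a two-letter line code, then content, with keywords separated by semicolons); the function names and the `CONTROLLED_TERMS` set are hypothetical and stand in for a real curated vocabulary.

```python
from collections import defaultdict

# A few lines of the entry, assuming the usual SWISS-PROT flat-file layout.
SAMPLE = """\
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
"""

# Hypothetical controlled vocabulary; a real one would be maintained by curators.
CONTROLLED_TERMS = {
    "PROTEIN BIOSYNTHESIS", "ANTIBIOTIC RESISTANCE",
    "GTP-BINDING", "TRANSPOSABLE ELEMENT",
}


def parse_flat_file(text):
    """Group line contents by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2 or not line[:2].isalpha():
            continue  # skip blank or malformed lines
        fields[line[:2]].append(line[2:].strip())
    return dict(fields)


def keywords(fields):
    """Join the KW lines and split them into individual keyword terms."""
    joined = " ".join(fields.get("KW", []))
    return [k.strip() for k in joined.rstrip(".").split(";") if k.strip()]


def outside_vocabulary(terms):
    """Return the terms that are not in the controlled vocabulary."""
    return [t for t in terms if t not in CONTROLLED_TERMS]


fields = parse_flat_file(SAMPLE)
print(keywords(fields))
print(outside_vocabulary(["GTP-BINDING", "GTP BINDING"]))
```

A free-text variant such as "GTP BINDING" is caught only because the vocabulary is fixed; without it, the two spellings are simply different strings.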
16 Gene Ontology
17 Structured data: Databases
- Metadata: the database schema represents a typed semantic data model, with extensible types and incorporated operations
- Indexing and secondary storage management
- Efficient, complex and precise query answering
- Programmatic interface: connectivity to WWW and other databases
- Architecture based on notion of data independence
- Good for
  - Data and metadata that are known a priori and are regular
  - Data that is checked and consistent
  - Well-organised, systematically described data
- Poor for
  - Data and metadata that are not known a priori and are variable
  - Incomplete, inconsistent, speculative data
  - Deviation
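The trade-off on this slide can be sketched with Python's built-in `sqlite3` module: a typed schema with constraints and an index gives precise, checked queries, but rejects a record that is incomplete. The table layout and the accession `X00001` are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is the metadata: typed columns plus constraints known a priori.
conn.execute(
    """CREATE TABLE protein (
           accession TEXT PRIMARY KEY,
           name      TEXT NOT NULL,
           length    INTEGER NOT NULL CHECK (length > 0)
       )"""
)
# Secondary index to support queries on length.
conn.execute("CREATE INDEX protein_by_length ON protein(length)")


def try_insert(row):
    """Insert a row; return False if it violates the schema's constraints."""
    try:
        conn.execute("INSERT INTO protein VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


ok = try_insert(("P21598", "TET9_ENTFA", 639))   # regular, complete record
rejected = try_insert(("X00001", None, 10))      # hypothetical record with no name

rows = conn.execute(
    "SELECT accession FROM protein WHERE length > 500"
).fetchall()
print(ok, rejected, rows)
```

The second insert fails by design: speculative or partial data has nowhere to go unless the schema is relaxed, which is the weakness the "Poor for" list points at.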
18 Semi-structured Data: Documents
- Literature, annotations, free text descriptions, flat files of data using mark-up, WWW data
- Metadata
  - External in database catalogue; internal in mark-up
  - RDF: an object model to describe the contents of the web; DOM: API for documents
  - Cataloguing Standards
- Metadata recovery
  - text mining, NLP, information extraction, image processing
SWISSPROT:TET9_ENTFA
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DT   01-MAY-1991 (REL. 18, CREATED)
DT   01-MAY-1991 (REL. 18, LAST SEQUENCE UPDATE)
DT   01-OCT-1993 (REL. 27, LAST ANNOTATION UPDATE)
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
GN   TETM(916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
RA   BURDETT V.;
RL   NUCLEIC ACIDS RES. 18:6137-6137(1990).
CC   -!- FUNCTION: ABOLISH THE INHIBITORY EFFECT OF TETRACYCLIN ON PROTEIN
CC       SYNTHESIS BY A NON-COVALENT MODIFICATION OF THE RIBOSOMES.
CC   -!- SIMILARITY: VERY HIGH TO OTHER TETM/TETO PROTEINS.
CC   -!- SIMILARITY: TO GTP-BINDING ELONGATION FACTORS.
DR   EMBL; X56353; G47062; -.
DR   PIR; S13142; S13142.
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
FT   NP_BIND      10     17       GTP (BY SIMILARITY).
FT   NP_BIND      74     78       GTP (BY SIMILARITY).
SQ   SEQUENCE   639 AA;  72464 MW;  523F1359 CRC32;
>TET9_ENTFA
MKIINIGVLAH
VDAGKTTLTESLLYNSGAITELGSVDKGTTRTDNTLLERQRGITIQTG
19 Semi-structured Data
- Origins
  - integration of heterogeneous sources; data sources with non-rigid structure; biological data; Web data
- Characteristics
  - missing or additional attributes
  - multiple attributes
  - different types in different objects
  - heterogeneous collections
  - self-describing, irregular data, no a priori structure
  - data and schema are once again decoupled
- Research into query languages based on graphs, query optimisation, indexing, schema extraction from data
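Schema extraction from data can be sketched in a few lines. The records below are invented to show the listed characteristics (a missing attribute, an extra attribute, multiple values, and different types for the same attribute); `extract_schema` is a hypothetical helper, far simpler than the graph-based approaches the research refers to.

```python
from collections import defaultdict

# Invented, self-describing records with irregular structure.
records = [
    {"id": "a1", "name": "TETM", "length": 639},
    {"id": "a2", "name": ["TETM", "TETO"]},              # multiple values, no length
    {"id": "a3", "length": "unknown", "note": "draft"},  # different type, extra attribute
]


def extract_schema(objs):
    """Summarise which attributes occur, with which types, and whether optional."""
    seen = defaultdict(lambda: {"types": set(), "count": 0})
    for obj in objs:
        for key, value in obj.items():
            seen[key]["types"].add(type(value).__name__)
            seen[key]["count"] += 1
    return {
        key: {
            "types": sorted(info["types"]),
            "optional": info["count"] < len(objs),
        }
        for key, info in seen.items()
    }


schema = extract_schema(records)
for attribute, summary in schema.items():
    print(attribute, summary)
```

Here the schema is derived after the fact from whatever the data happens to contain, the reverse of the database case where the schema is fixed first.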
20 Middleware: CORBA, DCOM, EJB, Globus, XML
- Independence from Platform, Networking, Language, Operating System, Network etc.
- Authentication, security, privacy
- Load partitioning, distribution, balancing, optimisation
- Resource location, access rights
21 Interoperation: Multi-database federations
- Schema reconciliation
- Instance reconciliation
- Query rewriting
- Data fusion
- Evolution and change
22 Interoperation - Multi-database federations & Warehousing
- Schema reconciliation
- Instance reconciliation
- Query rewriting
- Data fusion
- Evolution and change
- incremental updates
- warehouse models
- data cleansing
[Figure: three wrappers]
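A minimal sketch of the wrapper idea, under stated assumptions: two invented sources describe the same protein with different attribute names and different identifier casing. Each wrapper maps its source onto a shared global schema (schema reconciliation), normalises identifiers (instance reconciliation), and a mediator merges the results (data fusion). All names here are hypothetical, and query rewriting is not covered.

```python
# Two hypothetical sources with different local schemas.
SOURCE_A = [{"acc": "P21598", "organism": "ENTEROCOCCUS FAECALIS"}]
SOURCE_B = [{"accession": "p21598", "len": 639}]

# Schema reconciliation: local attribute name -> global attribute name.
MAPPINGS = {
    "A": {"acc": "accession", "organism": "organism"},
    "B": {"accession": "accession", "len": "length"},
}


def wrap(source_name, rows):
    """Wrapper: present a source's rows in the global schema."""
    mapping = MAPPINGS[source_name]
    for row in rows:
        out = {mapping[key]: value for key, value in row.items()}
        # Instance reconciliation: the same entity may be keyed differently.
        out["accession"] = out["accession"].upper()
        yield out


def fuse(*wrapped_sources):
    """Data fusion: merge rows that refer to the same accession."""
    merged = {}
    for rows in wrapped_sources:
        for row in rows:
            merged.setdefault(row["accession"], {}).update(row)
    return merged


def query(accession):
    """Mediator: answer a query over the fused view of both sources."""
    fused = fuse(wrap("A", SOURCE_A), wrap("B", SOURCE_B))
    return fused.get(accession)


print(query("P21598"))
```

Because the mappings live in one table, a change in a source's local schema touches only its wrapper entry, which is one way of reading "evolution and change" in this setting.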
23 Interoperation - Workflow
- Co-ordinating interoperation
- Flow of requests and processes through databases and tools
- Complex inter-source queries
[Figure: three wrappers]
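The flow of a request through databases and tools can be sketched as an ordered list of steps, with a trace recording what was run. The in-memory `PROTEIN_DB`, the step functions, and the accession `Q00000` are all invented for illustration; a real workflow system would call remote sources through their wrappers.

```python
# Hypothetical stand-in for a remote database.
PROTEIN_DB = {"P21598": {"name": "TET9_ENTFA", "length": 639}}


def fetch(accessions):
    """Database step: look up each accession, skipping unknown ones."""
    return [dict(PROTEIN_DB[a], accession=a) for a in accessions if a in PROTEIN_DB]


def keep_long(entries, minimum=500):
    """Tool step: keep only entries at least `minimum` residues long."""
    return [e for e in entries if e["length"] >= minimum]


def report(entries):
    """Tool step: format the surviving entries for the user."""
    return [f'{e["accession"]} {e["name"]} ({e["length"]} aa)' for e in entries]


def run_workflow(steps, data):
    """Pass data through each step in order and record the steps taken."""
    trace = []
    for step in steps:
        data = step(data)
        trace.append(step.__name__)
    return data, trace


result, trace = run_workflow([fetch, keep_long, report], ["P21598", "Q00000"])
print(result)
print(trace)
```

The trace is a very small form of provenance: it records which processes produced a result, which matters once results are curated and reused.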
24 Metadata, Middleware, Curation
25 Database research future
- Service quality
  - Performance, scalability, availability, authentication, access, privacy, security, optimisations, data/process movement
- Representational Information Fidelity
  - Types match complexity, data quality, context, change
- Intelligent retrieval
  - Query refinement, clarification, optimisation
26 OODB Challenges
- Performance
- Transaction and authorisation models
- Representation
- Multiple conditional inheritance
- Polymorphic objects
- Execute-time type bindings
- Constraint handling
- Active / deductive capability e.g. DOQL
- Distribution and parallelism e.g. POLAR
27 Research Issues
- Describing the content, qualities and interactions of distributed data, information, knowledge, and related processes.
- Self-description, emergent properties, indexes, metadata, ontologies, controlled vocabularies and strategies for supporting searching and finding.
- The management of inconsistency
- Data fusion from diverse distributed sources, automated integration, intelligent agents, resource discovery
- Management of evolution and change, data transformation and tracking
- Architectural issues for trust including naming, access control, and permitted creation and destruction of information, processes, profiles, and relationships
- Verification and validation, fault tolerance and the management of fault propagation.
- Performance, availability and resilience
EPSRC Distributed Information Management Programme
28 Lessons learnt from biologists
- Accessibility through the web is good but computers need access too!
- Provide program interfaces to sources
- Sharing data means publishing, agreeing and sharing metadata, and standards
- Expect change
- Generic, extensible and evolvable architectures using layered independence and
- (Re)Use shelf components, recognised technologies and good SE practice
- Don't use proprietary or home grown solutions
- (Semi)-automate annotation and data association
- A Distributed Digital Library
29 Conclusion
- Information management has a lot to offer the GRID today, and the GRID offers real research opportunities
- Support creative serendipitous discovery
- Support collection curation
- It's easier to accumulate data than use it effectively