1
E-Science GRIDs: Opportunities for Information Management (including Databases)

Carole Goble
University of Manchester

Acknowledgements: Keith Jeffery, Graham Kemp
2
Take homes

GRID's data deluge implies data management is crucial,
but the GRID is as important to database research
Scale means automation at every point in the information lifecycle
which means moving information from machine-readable to machine-understandable
which means metadata matters
Information-based research means fusing and interoperating information and processes
which means middleware matters
Databases, digital libraries, documents, text, images, algorithms
which means holistic information management matters
Delivering the right information to the right person or process at the right time

3
E-Science Data

Complex, multimedia and multidisciplinary
experiments, observations, interpretations, images, literature, models etc.
Archival
incomplete, inconsistent, speculative, accumulative
attribution and quality control

Curation
annotation → expert added-value
searching & linking → making relationships
complex queries
Update pattern
monotonic growth
evolution
shared
Access pattern
searching & linking → making relationships
complex queries
Distributed

4
What's the data like?

Multiple views
Highly Interrelated
(In)Stability & Evolution
Quality & Provenance
Fidelity & Variability
Longevity & Cumulative
Unknown: work in progress
Bursty acquisition
Digital library repository

A duty of guardianship
5
What do we do with information?

Acquire & Extract
Describe & Interpret
Structure & Organise
Publish & Share
Access & Retrieve
Search & Browse
Discover
Filter & Forage
Curate and add value
Migrate and stage

Interoperate & Fuse
Mine & Predict
Compare & Contrast
Preserve
Reuse & Decommission

6
What do we do with information?

Personal creativity: science is not linear
7
Database R&D Landscape

Academic
Led by CS theory
Delta publications
Core DB technology
Non-scaleable technology
E.g. Deductive Object

Research Labs
Dominate take up
Transaction processing
Query optimisation
Indexing
Repositories
E.g. Active databases
E.g. Object-relational

Commercial
Mainly basic and near market
Performance & Security
Functionality & Ease of use
E.g. Relational

8
Information R&D Landscape
Databases
Information Retrieval
Knowledge representation
Hypermedia
Document Management
Distributed systems
Digital Libraries
9
Information R&D Landscape
Databases
XML-QL
Dynamic pages
Information Retrieval
Knowledge representation
RDF Ontologies
Search engines
Semantic Web
Open hypermedia
XML
Hypermedia
Document Management
XML as message, http
Dublin core metadata
Distributed systems
Digital Libraries
10
Emerging trends

Whole information delivery
workflow, user interfaces
Experience applications valued
Apply theory to real world problems/applications
Engineer for world-sized applications

11
Layered GRIDs e-science in silico
Data to knowledge
Control
12
GRID IM

Software component reuse, design reuse

Data mining, data visualisation

Metadata, middleware, homogeneous access to
heterogeneous resources, intelligent retrieval,
information modelling, data warehousing,
workflow, information/content distribution,
active content management (distribution,
security), consistency management (versions,
quality), curation management, semi-automatic
annotation

Warehousing, distributed databases, streaming,
near-line storage, large objects, efficient
access mechanisms, data staging, query
optimisation

13
Key pillars: M & M

Metadata
To describe the information and computational resources
Essential for navigation, integration, analysis, use
Schemas, thesauri, domain ontologies
Catalogues with keyword terms

Middleware
Making Software Work Together
An interface to data from remote programs
Data independence: coping with change
Information heterogeneity: mediation

14
Metadata

Data - extension
experimental, interpretations, images, literature, models etc.
Metadata - intension
data about the data
interpretation or meaning of the data
model of the objects, properties, constraints, relationships that hold on the data
Meta Data Coalition
Dublin Core
Resource Description Framework etc
http://www.llnl.gov/liv_comp/metadata
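To make "data about the data" concrete, here is a minimal sketch that describes this presentation with a few Dublin Core elements and checks the element names against the fifteen-element Dublin Core set. The record layout and the `unknown_elements` helper are assumptions made for this illustration, not part of any standard API.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# Hypothetical record describing this presentation; values come from the slides.
record = {
    "title": "E-Science GRIDs: Opportunities for Information Management",
    "creator": "Carole Goble",
    "subject": ["e-science", "metadata", "middleware"],
}


def unknown_elements(rec):
    """Return the keys of rec that are not Dublin Core element names, sorted."""
    return sorted(set(rec) - DUBLIN_CORE_ELEMENTS)


print(unknown_elements(record))                        # every key is a known element
print(unknown_elements({**record, "colour": "blue"}))  # 'colour' is flagged
```

Agreeing on a small, fixed element set like this is what lets independent catalogues be searched with the same terms.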

15
Controlled Vocabularies Considered Essential
SWISSPROT:TET9_ENTFA
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DT   01-MAY-1991 (REL. 18, CREATED)
DT   01-MAY-1991 (REL. 18, LAST SEQUENCE UPDATE)
DT   01-OCT-1993 (REL. 27, LAST ANNOTATION UPDATE)
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
GN   TETM(916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
RA   BURDETT V.;
RL   NUCLEIC ACIDS RES. 18:6137-6137(1990).
CC   -!- FUNCTION: ABOLISH THE INHIBITORY EFFECT OF TETRACYCLIN ON PROTEIN
CC       SYNTHESIS BY A NON-COVALENT MODIFICATION OF THE RIBOSOMES.
CC   -!- SIMILARITY: VERY HIGH TO OTHER TETM/TETO PROTEINS.
CC   -!- SIMILARITY: TO GTP-BINDING ELONGATION FACTORS.
DR   EMBL; X56353; G47062; -.
DR   PIR; S13142; S13142.
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
FT   NP_BIND      10     17       GTP (BY SIMILARITY).
FT   NP_BIND      74     78       GTP (BY SIMILARITY).
SQ   SEQUENCE   639 AA;  72464 MW;  523F1359 CRC32;
>TET9_ENTFA
MKIINIGVLAHVDAGKTTLTESLLYNSGAITELGSVDKGTTRTDNTLLERQRGITIQTGI
TSFQWENTKVNIIDTPGHMDFLAEVYRSLSVLDGAILLISAKDGVQAQTRILFHALRKMG
IPTIFFINKIDQNGIDLSTVYQDIKEKLSAEIVIKQKVELYPNVCVTNFTESEQWDTVIE
GNDDLLEKYMSGKSLEALELEQEESIRFQNCSLFPLYHGSAKSNIGIDNLIEVITNKFYS
STHRGPSELCGNVFKIEYTKKRQRLAYIRLYSGVLHLRDSVRVSEKEKIKVTEMYTSING
ELCKIDRAYSGEIVILQNEFLKLNSVLGDTKLLPQRKKIENPHPLLQTTVEPSKPEQREM
LLDALLEISDSDPLLRYYVDSTTHEIILSFLGKVQMEVISALLQEKYHVEIEITEPTVIY
MERPLKNAEYTIHIEVPPNPFWASIGLSVSPLPLGSGMQYESSVSLGYLNQSFQNAVMEG
IRYGCEQGLYGWNVTDCKICFKYGLYYSPVSTPADFRMLAPIVLEQVLKKAGTELLEPYL
SFKIYAPQEYLSRAYNDAPKYCANIVDTQLKNNEVILSGEIPARCIQEYRSDLTFFTNGR
SVCLTELKGYHVTTGEPVCQPRRPNSRIDKVRYMFNKIT
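The entry above is readable by a program only because each line begins with a two-letter code and the KW lines draw on a controlled keyword list. The following is a rough sketch of how such an entry might be split into fields. It assumes the usual SWISS-PROT layout (line code in columns 1-2, content from column 6, semicolon-separated keywords); `parse_entry` and `keywords` are names invented for this example.

```python
from collections import defaultdict

# Abbreviated copy of the entry above; the semicolon separators follow the
# usual SWISS-PROT convention and are an assumption of this sketch.
ENTRY = """\
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
"""


def parse_entry(text):
    """Group line contents by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) > 5:
            code, value = line[:2], line[5:].strip()
            fields[code].append(value)
    return dict(fields)


def keywords(fields):
    """Split the KW lines into individual controlled-vocabulary terms."""
    joined = " ".join(fields.get("KW", []))
    return [kw.strip() for kw in joined.rstrip(".").split(";") if kw.strip()]


fields = parse_entry(ENTRY)
print(fields["AC"])
print(keywords(fields))
```

Free-text lines such as DE and CC remain opaque to this kind of parser, which is why the controlled KW terms matter for automated searching.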
16
Gene Ontology
17
Structured data: Databases

Metadata: the database schema represents a typed semantic data model, with extensible types and incorporated operations
Indexing and secondary storage management
Efficient, complex and precise query answering
Programmatic interface: connectivity to WWW and other databases
Architecture based on notion of data independence
Good for
Data and metadata that are known a priori and are regular
Data that is checked and consistent
Well-organised, systematically described data
Poor for
Data and metadata that are not known a priori and are variable
Incomplete, inconsistent, speculative data
Deviation
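The "good for / poor for" contrast can be seen in a few lines with Python's built-in `sqlite3` module. This is only a sketch: the `protein` table, its columns and the `try_insert` helper are hypothetical, chosen to show a schema fixed a priori accepting a regular record and rejecting an incomplete one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: every protein must have a name and a positive length.
conn.execute(
    "CREATE TABLE protein ("
    " accession TEXT PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " length INTEGER NOT NULL CHECK (length > 0))"
)


def try_insert(row):
    """Insert a row; return True if the schema accepted it, False otherwise."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO protein VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert(("P21598", "TET9_ENTFA", 639)))  # regular, complete record
print(try_insert(("X00000", "UNKNOWN", None)))    # incomplete: length is missing
```

The rejection is exactly what makes the database reliable for checked data, and exactly what makes it awkward for speculative or partial results.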

18
Semi-structured Data: Documents

Literature, annotations, free text descriptions, flat files of data using mark-up, WWW data

Metadata
External: in database catalogue
Internal: in mark-up
RDF: an object model to describe the contents of the web
DOM: API for documents
Cataloguing Standards
Metadata recovery
text mining, NLP, information extraction, image processing

Example flat file: the SWISS-PROT entry TET9_ENTFA shown on slide 15 (truncated on this slide)
19
Semi-structured Data

Origins
integration of heterogeneous sources
data sources with non-rigid structure
biological data
Web data
Characteristics
missing or additional attributes
multiple attributes
different types in different objects
heterogeneous collections
self-describing, irregular data, no a priori structure
data and schema are once again decoupled
Research into query languages based on graphs, query optimisation, indexing, schema extraction from data
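The characteristics listed above, and the idea of extracting a schema from the data rather than declaring it first, can be illustrated with a small sketch. The three records and the `extract_schema` function are invented for this example.

```python
from collections import defaultdict

# Hypothetical records with missing, additional and differently typed attributes.
records = [
    {"id": "rec1", "length": 639, "keywords": ["GTP-BINDING"]},
    {"id": "rec2", "length": "639 AA"},
    {"id": "rec3", "organism": "ENTEROCOCCUS FAECALIS"},
]


def extract_schema(recs):
    """Map each attribute to (sorted type names seen, number of occurrences)."""
    types = defaultdict(set)
    counts = defaultdict(int)
    for rec in recs:
        for key, value in rec.items():
            types[key].add(type(value).__name__)
            counts[key] += 1
    return {key: (sorted(types[key]), counts[key]) for key in types}


schema = extract_schema(records)
for attribute, (type_names, count) in sorted(schema.items()):
    presence = "always present" if count == len(records) else "optional"
    print(attribute, type_names, presence)
```

Here `length` turns out to be both optional and of mixed type, which a schema declared up front would have had to forbid or anticipate.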

20
Middleware: CORBA, DCOM, EJB, Globus, XML

Independence from Platform, Network, Language, Operating System etc.
Authentication, security, privacy
Load partitioning, distribution, balancing, optimisation
Resource location, access rights

21
Interoperation - Multi-database federations

Schema reconciliation
Instance reconciliation
Query rewriting
Data fusion
Evolution and change
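A toy sketch of the first four items follows: two sources describe the same protein under different schemas, a wrapper per source maps rows to a shared schema (schema reconciliation), accessions are normalised so that rows can be matched (instance reconciliation), and matched rows are merged (data fusion). The source layouts, wrapper functions and conflict rule are all assumptions made for illustration.

```python
# Two hypothetical sources with different schemas for the same kind of object.
SOURCE_A = [{"acc": "P21598", "desc": "Tetracycline resistance protein TetM"}]
SOURCE_B = [{"accession": "p21598", "aa_length": 639}]


def wrap_a(row):
    """Wrapper for source A: rename fields to the shared schema."""
    return {"accession": row["acc"].upper(), "description": row["desc"]}


def wrap_b(row):
    """Wrapper for source B: normalise the key and rename fields."""
    return {"accession": row["accession"].upper(), "length": row["aa_length"]}


def fuse(*wrapped_sources):
    """Merge records that share an accession; earlier sources win on conflicts."""
    fused = {}
    for source in wrapped_sources:
        for rec in source:
            target = fused.setdefault(rec["accession"], {})
            for key, value in rec.items():
                target.setdefault(key, value)
    return fused


federated = fuse(map(wrap_a, SOURCE_A), map(wrap_b, SOURCE_B))
print(federated["P21598"])
```

"Earlier sources win" is the simplest possible conflict rule; a real federation would need a policy informed by source quality and provenance.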

22
Interoperation - Multi-database federations & Warehousing

Schema reconciliation
Instance reconciliation
Query rewriting
Data fusion
Evolution and change

incremental updates
warehouse models
data cleansing

(Diagram: data sources, each accessed through a wrapper)
23
Interoperation - Workflow

Co-ordinating interoperation
Flow of requests and processes through databases
and tools
Complex inter-source queries

(Diagram: data sources, each accessed through a wrapper)
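One way to picture a workflow co-ordinating a flow of requests through databases and tools is as an ordered list of steps, with a record kept of which steps ran. The two tool functions below are hypothetical stand-ins, and the sequence is only the first eleven residues of the TET9_ENTFA entry.

```python
# Hypothetical tools; each takes and returns a plain dict.
def fetch_sequence(request):
    """Stand-in for a database lookup: attach a (truncated) sequence."""
    return {**request, "sequence": "MKIINIGVLAH"}


def compute_length(data):
    """Stand-in for an analysis tool: derive the sequence length."""
    return {**data, "length": len(data["sequence"])}


def run_workflow(steps, request):
    """Pass the request through each step in order, recording provenance."""
    provenance = []
    data = request
    for step in steps:
        data = step(data)
        provenance.append(step.__name__)
    return data, provenance


result, provenance = run_workflow(
    [fetch_sequence, compute_length], {"id": "TET9_ENTFA"}
)
print(result["length"], provenance)
```

Keeping the list of executed steps alongside the result is a small gesture towards the provenance concerns raised earlier in the talk.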
24
Metadata, Middleware, Curation
25
Database research future

Service quality
Performance, scalability, availability, authentication, access, privacy, security, optimisations, data/process movement
Representational Information Fidelity
Types match complexity, data quality, context, change
Intelligent retrieval
Query refinement, clarification, optimisation

26
OODB Challenges

Performance
Transaction and authorisation models
Representation
Multiple conditional inheritance
Polymorphic objects
Execute-time type bindings
Constraint handling
Active / deductive capability e.g. DOQL
Distribution and parallelism e.g. POLAR

27
Research Issues

Describing the content, qualities and interactions of distributed data, information, knowledge, and related processes.
Self-description, emergent properties, indexes, metadata, ontologies, controlled vocabularies and strategies for supporting searching and finding.
The management of inconsistency
Data fusion from diverse distributed sources, automated integration, intelligent agents, resource discovery
Management of evolution and change, data transformation and tracking
Architectural issues for trust including naming, access control, and permitted creation and destruction of information, processes, profiles, and relationships
Verification and validation, fault tolerance and the management of fault propagation.
Performance, availability and resilience

EPSRC Distributed Information Management Programme
28
Lessons learnt from biologists

Accessibility through the web is good, but computers need access too!
Provide program interfaces to sources
Sharing data means publishing, agreeing and sharing metadata and standards
Expect change
Generic, extensible and evolvable architectures using layered independence
(Re)use off-the-shelf components, recognised technologies and good SE practice
Don't use proprietary or home-grown solutions
(Semi-)automate annotation and data association
A Distributed Digital Library

29
Conclusion

Information management has a lot to offer the GRID today, and the GRID offers real research opportunities
Support creative serendipitous discovery
Support collection curation
It's easier to accumulate data than use it effectively
