EScience - PowerPoint PPT Presentation

About This Presentation
Title:

EScience

Provided by: Carole153
Tags: escience


Transcript and Presenter's Notes

Title: EScience


1
E-Science GRIDs: Opportunities for Information
Management (including Databases)
  • Carole Goble
  • University of Manchester

Acknowledgements: Keith Jeffery, Graham Kemp
2
Take homes
  • The GRID's data deluge implies data management is
    crucial,
  • but the GRID is as important to database
    research
  • Scale means automation at every point in the
    information lifecycle
  • which means moving information from
    machine-readable to machine-understandable
  • which means metadata matters
  • Information-based research means fusing and
    interoperating information and processes
  • which means middleware matters
  • Databases, digital libraries, documents, text,
    images, algorithms
  • which means holistic information management
    matters
  • Delivering the right information to the right
    person or process at the right time

3
E-Science Data
  • Complex, multimedia and multidisciplinary
      • experiments, observations, interpretations,
        images, literature, models etc
  • Archival
      • incomplete, inconsistent, speculative,
        accumulative
      • attribution and quality control
  • Curation
      • annotation: expert added-value
      • searching & linking: making relationships
      • complex queries
  • Update pattern
      • monotonic growth
      • evolution
      • shared
  • Access pattern
      • searching & linking: making relationships
      • complex queries
  • Distributed

4
What's the data like?
  • Multiple views
  • Highly Interrelated
  • (In)Stability & Evolution
  • Quality & Provenance
  • Fidelity & Variability
  • Longevity & Cumulative
  • Unknown work in progress
  • Bursty acquisition
  • Digital library repository

A duty of guardianship
5
What do we do with information?
  • Acquire & Extract
  • Describe & Interpret
  • Structure & Organise
  • Publish & Share
  • Access & Retrieve
  • Search & Browse
  • Discover
  • Filter & Forage
  • Curate and add value
  • Migrate and stage
  • Interoperate & Fuse
  • Mine & Predict
  • Compare & Contrast
  • Preserve
  • Reuse & Decommission

6
What do we do with information?

Personal creativity. Science is not linear.
7
Database R&D Landscape
  • Academic
      • Led by CS theory
      • Delta publications
      • Core DB technology
      • Non-scaleable technology
      • E.g. Deductive Object
  • Research Labs
      • Dominate take up
      • Transaction processing
      • Query optimisation
      • Indexing
      • Repositories
      • E.g. Active databases
      • E.g. Object-relational
  • Commercial
      • Mainly basic and near market
      • Performance & Security
      • Functionality & Ease of use
      • E.g. Relational

8
Information R&D Landscape
Databases
Information Retrieval
Knowledge representation
Hypermedia
Document Management
Distributed systems
Digital Libraries
9
Information R&D Landscape
Databases
XML-QL
Dynamic pages
Information Retrieval
Knowledge representation
RDF Ontologies
Search engines
Semantic Web
Open hypermedia
XML
Hypermedia
Document Management
XML as message, http
Dublin core metadata
Distributed systems
Digital Libraries
10
Emerging trends
  • Whole information delivery
  • workflow, user interfaces
  • Experience applications valued
  • Apply theory to real world problems/applications
  • Engineer for world-sized applications

11
Layered GRIDs e-science in silico
Data to knowledge
Control
12
GRID IM
  • Software component reuse, design reuse
  • Data mining, data visualisation
  • Metadata, middleware, homogeneous access to
    heterogeneous resources, intelligent retrieval,
    information modelling, data warehousing,
    workflow, information/content distribution,
    active content management (distribution,
    security), consistency management (versions,
    quality), curation management, semi-automatic
    annotation
  • Warehousing, distributed databases, streaming,
    near-line storage, large objects, efficient
    access mechanisms, data staging, query
    optimisation

13
Key pillars: M & M
  • Metadata
      • To describe the information and computational
        resources
      • Essential for navigation, integration, analysis,
        use
      • Schemas, thesauri, domain ontologies
      • Catalogues with keyword terms
  • Middleware
      • Making Software Work Together
      • An interface to data from remote programs
      • Data independence: coping with change
      • Information heterogeneity: mediation

14
Metadata
  • Data - extension
      • experimental, interpretations, images,
        literature, models etc
  • Metadata - intension
      • data about the data
      • interpretation or meaning of the data
      • model of the objects, properties, constraints,
        relationships that hold on the data
  • Meta Data Coalition
  • Dublin Core
  • Resource Description Framework etc
  • http://www.llnl.gov/liv_comp/metadata
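
As an illustration of "data about the data", here is a minimal sketch of a Dublin Core style description held as RDF-like (subject, predicate, object) triples. The resource identifier and the helper names `describe` and `lookup` are hypothetical and chosen only for this example; the element names (title, creator, subject) and the namespace are those of the Dublin Core element set.

```python
# Dublin Core element namespace; each element name is appended to it.
DC = "http://purl.org/dc/elements/1.1/"


def describe(resource, **elements):
    """Return (subject, predicate, object) triples describing one resource.

    A list value yields one triple per item, so repeated elements are allowed.
    """
    triples = []
    for name, values in elements.items():
        if not isinstance(values, list):
            values = [values]
        for value in values:
            triples.append((resource, DC + name, value))
    return triples


def lookup(triples, resource, name):
    """Return every value recorded for one element of one resource."""
    return [o for s, p, o in triples if s == resource and p == DC + name]


# Hypothetical identifier for this presentation; the values come from its title slide.
TALK = "urn:example:escience-talk"
triples = describe(
    TALK,
    title="EScience",
    creator="Carole Goble",
    subject=["metadata", "middleware"],
)

print(lookup(triples, TALK, "creator"))
```

Keeping the description as plain triples means the same lookup works whatever elements a given resource happens to carry, which is the property that makes such metadata usable for navigation and integration.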

15
Controlled Vocabularies Considered Essential
SWISSPROT:TET9_ENTFA
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DT   01-MAY-1991 (REL. 18, CREATED)
DT   01-MAY-1991 (REL. 18, LAST SEQUENCE UPDATE)
DT   01-OCT-1993 (REL. 27, LAST ANNOTATION UPDATE)
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
GN   TETM(916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
RA   BURDETT V.
RL   NUCLEIC ACIDS RES. 18:6137-6137(1990).
CC   -!- FUNCTION: ABOLISH THE INHIBITORY EFFECT OF TETRACYCLIN ON PROTEIN
CC       SYNTHESIS BY A NON-COVALENT MODIFICATION OF THE RIBOSOMES.
CC   -!- SIMILARITY: VERY HIGH TO OTHER TETM/TETO PROTEINS.
CC   -!- SIMILARITY: TO GTP-BINDING ELONGATION FACTORS.
DR   EMBL; X56353; G47062; -.
DR   PIR; S13142; S13142.
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
FT   NP_BIND      10     17       GTP (BY SIMILARITY).
FT   NP_BIND      74     78       GTP (BY SIMILARITY).
SQ   SEQUENCE   639 AA;  72464 MW;  523F1359 CRC32;
>TET9_ENTFA
MKIINIGVLAH
VDAGKTTLTESLLYNSGAITELGSVDKGTTRTDNTLLERQRGITIQTGI
TSFQWENTKVNIIDTPGHMDFLAEVYRSLSVLDGAILLISAKDGVQAQTR
ILFHALRKMG IPTIFFINKIDQNGIDLSTVYQDIKEKLSAEIVIKQKVE
LYPNVCVTNFTESEQWDTVIE GNDDLLEKYMSGKSLEALELEQEESIRF
QNCSLFPLYHGSAKSNIGIDNLIEVITNKFYS STHRGPSELCGNVFKIE
YTKKRQRLAYIRLYSGVLHLRDSVRVSEKEKIKVTEMYTSING ELCKID
RAYSGEIVILQNEFLKLNSVLGDTKLLPQRKKIENPHPLLQTTVEPSKPE
QREM LLDALLEISDSDPLLRYYVDSTTHEIILSFLGKVQMEVISALLQE
KYHVEIEITEPTVIY MERPLKNAEYTIHIEVPPNPFWASIGLSVSPLPL
GSGMQYESSVSLGYLNQSFQNAVMEG IRYGCEQGLYGWNVTDCKICFKY
GLYYSPVSTPADFRMLAPIVLEQVLKKAGTELLEPYL SFKIYAPQEYLS
RAYNDAPKYCANIVDTQLKNNEVILSGEIPARCIQEYRSDLTFFTNGR S
VCLTELKGYHVTTGEPVCQPRRPNSRIDKVRYMFNKIT
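
A record like the one on this slide is only useful to a program once its fields can be pulled apart. The following is a minimal sketch, not a full parser: it assumes each line starts with a two-letter line code followed by its content, and that keywords are separated by semicolons and end with a full stop, as in the flat-file convention. The function names `parse_flat_file` and `keywords` are hypothetical, and the embedded excerpt is a shortened copy of the record above.

```python
from collections import defaultdict

# Shortened excerpt of the TET9_ENTFA record; layout assumed as described above.
RECORD = """\
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
"""


def parse_flat_file(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 3:
            continue  # skip blank or malformed lines
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return dict(fields)


def keywords(fields):
    """Join the KW lines and split them into individual keyword terms."""
    joined = " ".join(fields.get("KW", []))
    return [kw.strip() for kw in joined.rstrip(".").split(";") if kw.strip()]


fields = parse_flat_file(RECORD)
print(fields["AC"][0].rstrip(";"))
print(keywords(fields))
```

The keyword terms recovered this way are exactly the controlled-vocabulary entries the slide title refers to: a program can match on them reliably, which it cannot do with the free-text CC comment lines.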
16
Gene Ontology
17
Structured data: Databases
  • Metadata: the database schema represents a typed
    semantic data model, with extensible types and
    incorporated operations
  • Indexing and secondary storage management
  • Efficient, complex and precise query answering
  • Programmatic interface -- connectivity to WWW and
    other databases
  • Architecture based on notion of data independence
  • Good for
      • Data and metadata that are known a priori and
        are regular
      • Data that is checked and consistent
      • Well-organised, systematically described data
  • Poor for
      • Data and metadata that are not known a priori
        and are variable
      • Incomplete, inconsistent, speculative data
      • Deviation

18
Semi-structured Data: Documents
  • Literature, annotations, free text descriptions,
    flat files of data using mark-up, WWW data
  • Metadata
      • External: in database catalogue. Internal: in
        mark-up
      • RDF: an object model to describe the contents of
        the web. DOM: an API for documents
      • Cataloguing Standards
  • Metadata recovery
      • text mining, NLP, information extraction, image
        processing

SWISSPROT:TET9_ENTFA
ID   TET9_ENTFA     STANDARD;      PRT;   639 AA.
AC   P21598;
DT   01-MAY-1991 (REL. 18, CREATED)
DT   01-MAY-1991 (REL. 18, LAST SEQUENCE UPDATE)
DT   01-OCT-1993 (REL. 27, LAST ANNOTATION UPDATE)
DE   TETRACYCLINE RESISTANCE PROTEIN TETM (TRANSPOSON TN916).
GN   TETM(916).
OS   ENTEROCOCCUS FAECALIS (STREPTOCOCCUS FAECALIS).
RA   BURDETT V.
RL   NUCLEIC ACIDS RES. 18:6137-6137(1990).
CC   -!- FUNCTION: ABOLISH THE INHIBITORY EFFECT OF TETRACYCLIN ON PROTEIN
CC       SYNTHESIS BY A NON-COVALENT MODIFICATION OF THE RIBOSOMES.
CC   -!- SIMILARITY: VERY HIGH TO OTHER TETM/TETO PROTEINS.
CC   -!- SIMILARITY: TO GTP-BINDING ELONGATION FACTORS.
DR   EMBL; X56353; G47062; -.
DR   PIR; S13142; S13142.
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   PROTEIN BIOSYNTHESIS; ANTIBIOTIC RESISTANCE; GTP-BINDING;
KW   TRANSPOSABLE ELEMENT.
FT   NP_BIND      10     17       GTP (BY SIMILARITY).
FT   NP_BIND      74     78       GTP (BY SIMILARITY).
SQ   SEQUENCE   639 AA;  72464 MW;  523F1359 CRC32;
>TET9_ENTFA
MKIINIGVLAH
VDAGKTTLTESLLYNSGAITELGSVDKGTTRTDNTLLERQRGITIQTG
19
Semi-structured Data
  • Origins
      • integration of heterogeneous sources
      • data sources with non-rigid structure:
        biological data, Web data
  • Characteristics
      • missing or additional attributes
      • multiple attributes
      • different types in different objects
      • heterogeneous collections
      • self-describing, irregular data, no a priori
        structure
      • data and schema are once again decoupled
  • Research into query languages based on graphs,
    query optimisation, indexing, schema extraction
    from data
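
The characteristics listed on this slide, and the idea of extracting a schema from the data rather than declaring it up front, can be sketched in a few lines. This is an illustrative toy under stated assumptions, not one of the research systems the slide alludes to: the objects are plain dictionaries, the function name `extract_schema` is hypothetical, and the second and third records are invented to show missing attributes and differing types.

```python
def extract_schema(objects):
    """Summarise which attributes occur, with which types, and how often.

    An attribute is reported as optional when at least one object lacks it.
    """
    seen = {}
    for obj in objects:
        for attr, value in obj.items():
            entry = seen.setdefault(attr, {"types": set(), "count": 0})
            entry["types"].add(type(value).__name__)
            entry["count"] += 1
    total = len(objects)
    return {
        attr: {"types": sorted(e["types"]), "optional": e["count"] < total}
        for attr, e in seen.items()
    }


# A heterogeneous collection: only the first record echoes the slide's example;
# the other two are made up for illustration.
proteins = [
    {"id": "TET9_ENTFA", "length": 639, "keywords": ["GTP-BINDING"]},
    {"id": "EXAMPLE_2", "length": "unknown"},
    {"id": "EXAMPLE_3", "organism": "ENTEROCOCCUS FAECALIS"},
]

schema = extract_schema(proteins)
for attr, info in schema.items():
    print(attr, info)
```

The result is a description of what the collection actually contains, so it changes as the data changes. That is the opposite of a database schema, which is fixed first and then enforced on the data.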

20
Middleware: CORBA, DCOM, EJB, Globus, XML
  • Independence from platform, networking,
    language, operating system etc
  • Authentication, security, privacy
  • Load partitioning, distribution, balancing,
    optimisation
  • Resource location, access rights

21
Interoperation: Multi-database federations
  • Schema reconciliation
  • Instance reconciliation
  • Query rewriting
  • Data fusion
  • Evolution and change
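
The first four items on this slide can be sketched with a toy mediator over two in-memory sources. Everything here is an assumption made for illustration: the class `Wrapper`, the function `federated_query`, the attribute names and the two sources are hypothetical, and fusion uses a simple "first source wins" rule where a real federation would need a conflict-resolution policy.

```python
class Wrapper:
    """Presents one source's rows under a shared (global) schema."""

    def __init__(self, rows, mapping):
        self.rows = rows
        self.mapping = mapping  # global attribute name -> local attribute name

    def query(self, **criteria):
        # Query rewriting: translate global attribute names into local ones.
        local = {self.mapping[g]: v for g, v in criteria.items() if g in self.mapping}
        if len(local) < len(criteria):
            return []  # this source has no counterpart for some attribute
        hits = [r for r in self.rows if all(r.get(k) == v for k, v in local.items())]
        # Schema reconciliation: rename local attributes back to global ones.
        return [{g: r[l] for g, l in self.mapping.items() if l in r} for r in hits]


def federated_query(wrappers, key, **criteria):
    """Ask every source, then merge rows that describe the same instance."""
    fused = {}
    for wrapper in wrappers:
        for row in wrapper.query(**criteria):
            # Instance reconciliation: normalise the identifying value.
            ident = row[key].strip().upper()
            merged = fused.setdefault(ident, {})
            for attr, value in row.items():
                merged.setdefault(attr, value)  # data fusion: first source wins
    return list(fused.values())


# Two hypothetical sources holding parts of the same protein entry.
source_a = Wrapper([{"acc": "P21598", "name": "TETM"}],
                   {"accession": "acc", "protein": "name"})
source_b = Wrapper([{"accession_no": "p21598", "len": 639}],
                   {"accession": "accession_no", "length": "len"})

print(federated_query([source_a, source_b], key="accession"))
```

The last item on the slide, evolution and change, is what the mapping tables are for: when a source renames an attribute, only its wrapper's mapping has to change, not the queries written against the global schema.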

22
Interoperation - Multi-database federations
Warehousing
  • Schema reconciliation
  • Instance reconciliation
  • Query rewriting
  • Data fusion
  • Evolution and change
  • incremental updates
  • warehouse models
  • data cleansing

(Diagram: three wrappers)
23
Interoperation - Workflow
  • Co-ordinating interoperation
  • Flow of requests and processes through databases
    and tools
  • Complex inter-source queries

(Diagram: three wrappers)
24
Metadata Middleware Curation
25
Database research future
  • Service quality
      • Performance, scalability, availability,
        authentication, access, privacy, security,
        optimisations, data/process movement
  • Representational Information Fidelity
      • Types match complexity, data quality, context,
        change
  • Intelligent retrieval
      • Query refinement, clarification, optimisation

26
OODB Challenges
  • Performance
  • Transaction and authorisation models
  • Representation
  • Multiple conditional inheritance
  • Polymorphic objects
  • Execute-time type bindings
  • Constraint handling
  • Active / deductive capability e.g. DOQL
  • Distribution and parallelism e.g. POLAR

27
Research Issues
  • Describing the content, qualities and
    interactions of distributed data, information,
    knowledge, and related processes.
  • Self-description, emergent properties, indexes,
    metadata, ontologies, controlled vocabularies and
    strategies for supporting searching and finding.
  • The management of inconsistency
  • Data fusion from diverse distributed sources,
    automated integration, intelligent agents,
    resource discovery
  • Management of evolution and change, data
    transformation and tracking
  • Architectural issues for trust including naming,
    access control, and permitted creation and
    destruction of information, processes, profiles,
    and relationships
  • Verification and validation, fault tolerance and
    the management of fault propagation.
  • Performance, availability and resilience

EPSRC Distributed Information Management
Programme
28
Lessons learnt from biologists
  • Accessibility through the web is good but
    computers need access too!
  • Provide program interfaces to sources
  • Sharing data means publishing, agreeing and
    sharing metadata, and standards
  • Expect change
  • Generic, extensible and evolvable architectures
    using layered independence and
  • (Re)Use off-the-shelf components, recognised
    technologies and good SE practice
  • Don't use proprietary or home grown solutions
  • (Semi)-automate annotation and data association
  • A Distributed Digital Library

29
Conclusion
  • Information management has a lot to offer the
    GRID today, and the GRID offers real research
    opportunities
  • Support creative serendipitous discovery
  • Support collection curation
  • It's easier to accumulate data than use it
    effectively