Data Curation and the Scientific Record

Slides: 104
Provided by: grahamc6

1
Data Curation and the Scientific Record
  • Graham Cameron

2
  • Enormous value of carefully curated data
    collections in the biomolecular arena
  • What we want from eScience to make our job easier
  • 10 myths about data curation
  • 10 commandments of data curation

3
DNA
4
Genes
5
Protein Sequences
6
Structures
7
Expression
8
PDB code 1DIF HIV-1 Protease/Inhibitor Complex
A79285 (Difluoroketone)
molecules interact
9
Pathways
From KEGG
10
Cells
19
EMBL-Bank DNA sequences
20
EMBL-Bank DNA sequences
UniProt Protein Sequences
21
EMBL-Bank DNA sequences
UniProt Protein Sequences
EMSD Macromolecular Structure Data
22
EMBL-Bank DNA sequences
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
23
EMBL-Bank DNA sequences
Ensembl Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
24
Getting into processes
  • Properties of molecules
  • DNA sequences
  • Protein sequences
  • Protein Structures
  • Behaviour of molecules
  • Function
  • Interaction
  • Reactions
  • Pathways

25
EMBL-Bank DNA sequences
Ensembl Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
26
EMBL-Bank DNA sequences
Ensembl Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
27
Reactome
EMBL-Bank DNA sequences
Ensembl Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
29
Global Picture
  • DNA: tripartite international collaboration
    (including patent data acquisition)
  • Protein sequences: UniProt collaboration
  • Macromolecular structures: tripartite
    international collaboration
  • IntAct: international agreements
  • Reactome: USA-Europe collaboration
  • Etc.

30
Usage
  • Basic research
  • Industry
  • Pharma
  • Diagnostics
  • Medical device research
  • Personal care
  • Nutrition
  • Agriculture
  • Forestry
  • Fisheries
  • Patent searching and provenance

31
Industry Partners
  • AstraZeneca
  • Bayer Healthcare
  • Bristol-Myers Squibb
  • F. Hoffmann-La Roche
  • Merck KGaA
  • Nestlé Research Centre
  • Pfizer
  • Schering
  • Syngenta
  • Unilever R&D
  • Philips Medical Devices
  • Aventis
  • Boehringer Ingelheim
  • GlaxoSmithKline
  • Johnson & Johnson Pharma R&D
  • Merck Sharp & Dohme
  • Organon
  • Procter & Gamble
  • Serono S.A.
  • Sanofi-Synthélabo

32
European Patent Office (EPO)
  • We want the biomolecular data from patent
    applications in our databases
  • Patent examiners need to search databases in
    ascertaining originality
  • We gather patent data
  • We run special (secure) services for the EPO
  • (Note: if, for example, a biomolecular sequence
    appears in a patent, we want it in the database.
    This does not mean that the molecule is the
    subject of the patent.)

33
Using the information
You've found a gene whose variation seems relevant
34
EMBL-Bank DNA sequences
Search EMBL-Bank for sequences similar to your
gene of interest
35
EMBL-Bank DNA sequences
Look for the protein translation in SWISS-PROT
SWISS-PROT TrEMBL Protein Sequences
36
EMBL-Bank DNA sequences
See if its structure (or a similar one) is in EMSD
UniProt Protein Sequences
EMSD Macromolecular Structure Data
37
EMBL-Bank DNA sequences
Do we know anything about what proteins
it interacts with, or when it's expressed?
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
38
Pathways
EMBL-Bank DNA sequences
Does this give us a hint about what pathways
it's involved in?
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
39
Pathways
What do we know about other molecules involved in
that pathway?
EMBL-Bank DNA sequences
Ensembl Human Genome Gene Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
40
Using the information
Can we influence the pathway?
41
Using the information
Can we influence the pathway?
42
This all keeps us fairly busy
43
Nucleotide Sequence Database Growth
(Chart: megabases against date)
44
(Charts: EMBL-Bank 1982-2005, megabases; UniProt etc. 1986-2005, entries; PDB 1972-2005, entries; InterPro 2000-2005, entries)
45
Average Web Hits per Day, Including Ensembl
(Chart: average hits per day by quarter)
Note: Ensembl is a joint project with The
Wellcome Trust Sanger Institute. Equivalent
usage data have only been available since 2004.
46
Web Hits
47
EBI Staff 2005
Total staff: 265
48
EBI Total Running Budget 2005: 25 million
49
Specialist biomolecular data
resource examples
Medical data resources
Core biomolecular resources
Biodiversity data resources
SGD
Flybase
Chemical data resources
MGD
Eumorphia/ Phenotypes
Mutants
Mouse Atlas
50
Specialist biomolecular data
resource examples
Medical data resources
Core biomolecular resources
Biodiversity data resources
SGD
Flybase
Chemical data resources
MGD
Eumorphia/ Phenotypes
Mutants
Mouse Atlas
51
Medical data resources
Core biomolecular resources
52
Specialist biomolecular data
resource examples
Medical data resources
Core biomolecular resources
Biodiversity data resources
SGD
Flybase
Chemical data resources
MGD
Eumorphia/ Phenotypes
Mutants
Mouse Atlas
53
How eScience can help us
55
The user doesn't care
56
Harvest and delivery
57
Layers
Application
User interface
58
Databases
Application
User interface
59
Interconnectivity
Application
User interface
Application interface
60
Communicate objects and their identities
Application
User interface
Application interface
61
Using standard protocols
Application
User interface
Application interface
62
Using standard protocols
Primary Biomolecular Databases
Application
User interface
Application interface
63
Authorities - GO
Primary Biomolecular Databases
User interface
Application
Application interface
GO
64
Authorities - GO
Primary Biomolecular Databases
User interface
Application
Application interface
GO
65
Authorities are databases
Primary Biomolecular Databases
GO
66
Creating connections
Primary Biomolecular Databases
GO
67
Integr8
Primary Biomolecular Databases
GO
68
Integr8
Primary Biomolecular Databases
User interface
Application interface
Application
GO
69
Staging database (BioMart)
Integr8
Primary Biomolecular Databases
User interface
Application interface
Application
GO
70
Staging database (BioMart)
Integr8
User interface
Application interface
Application
Primary Biomolecular Databases
GO
71
Lots of it
  • 10-12 core databases
  • 100 others in the first shell
  • 1000 more in the next shell

72
10 myths about data curation
  • Well, 9 actually, unless I think of a tenth by
    the end of the meeting!

73
1. Databases are designed then built
  • Oh yeah! Usually there's:
  • Data spilling all over the floor
  • A rabidly angry user community
  • A skinflint funding agency
  • Then you realise you need a database

74
2. Integration - the more the merrier
  • Every link is a potential dead link
  • Every dependency can find its way on to your
    critical path
  • Monolithic solutions always fail
  • Find the natural lines of cleavage which minimise
    the number of connections
  • Standardise the connections
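
The "every link is a potential dead link" point can be made concrete with a small audit of cross-references. The following is only a minimal sketch under stated assumptions: the accession strings and the `find_dead_links` helper are hypothetical and do not correspond to any real database or service.

```python
def find_dead_links(source_records, target_ids):
    """Return (source_id, target_id) pairs whose target no longer exists."""
    known = set(target_ids)
    return [
        (source_id, target_id)
        for source_id, xrefs in sorted(source_records.items())
        for target_id in xrefs
        if target_id not in known
    ]


# Hypothetical cross-references from protein entries to structure entries.
protein_xrefs = {
    "PROT0001": ["STRUCT0001"],
    "PROT0002": ["STRUCT0002", "STRUCT0009"],
}
# Identifiers the (hypothetical) structure database currently holds.
structure_ids = ["STRUCT0001", "STRUCT0002"]

dead = find_dead_links(protein_xrefs, structure_ids)
for source_id, target_id in dead:
    print(f"{source_id} points at missing entry {target_id}")
```

The audit cost grows with the number of links, which is one reason to keep the number of connections small and their form standard.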

75
3. Standards are a good thing
  • Design to minimise what you need to standardise

76
4. Domain experts will help you with your
standardisation
  • Consider taxonomy of species
  • There are huge differences of opinion among
    experts
  • They want to represent all the uncertainty and
    alternatives
  • They have no concept of good enough
  • In short, they care far too much to be of any use

77
5. Students of the formal structure of knowledge
will help you structure your knowledge
  • They will give you loads of advice about a world
    that doesn't exist
  • They will give you extensive (usually insulting)
    comments on what you are doing wrong, even if
    you have a million users who think it's just fine
  • But
  • Read their formalisms
  • Try to understand the rules you are breaking (and
    why)
  • Don't actually let them do anything

78
6. Warehouses work
  • Piffle
  • They never manage to maintain synchrony with the
    source data
  • Mostly they fall down under their own weight!

79
7. Databases are use-agnostic
  • The only cost-effective design knows what needs
    it is trying to serve

80
8. Databases are expensive
  • A damned sight less so than no databases
  • Cost of the science they support
  • Cost of the traditional scientific record

81
9. Scientists care what's in the databases
  • Not if it's their data

82
Ten Commandments of Data Curation
  • Well, 20 actually (virtual memory?)

83
1. Capture expensive and useful data for posterity
  • PDB (1971) was a recognition that structure data
    are immensely useful
  • But they are expensive (35,000 @ 50,000?)
  • 1,750,000,000!
  • Structure and function inexorably intertwined

84
2. Make it easy to capture the data when they are
available
  • Scientists don't do research for the sake of the
    databases
  • Saving data is a chore - make it easy
  • The structure problem was incompletely solved -
    the structures are derived data, and the original
    data were never captured
  • Harvesting data at source

85
3. Be proactive about acquiring data
  • Go out and get the data which have potential
    downstream use
  • Run a data liberation army
  • Educate the generators of information as to the
    value of the sum of the parts

86
4. Make it easy to get at the most frequently
used data
  • Vast storage capacity can go to your head
  • Archive vs. library
  • E.g. Public DNA database vs. lab files of traces

87
5. Don't complicate everyone's life for the sake
of a few esoteric cases
  • Computer systems are too complicated - fight it
  • Information resources are worse
  • He who pays the piper establishes a committee to
    call the tune
  • Gather requirements expansively, prune ruthlessly
  • Example: the EMBL/GenBank/DDBJ Feature Table

88
6. Archive the enduring, not the episodic
  • Human Genome
  • Neuroinformatics databases
  • Confocal images of cells

89
7. Worry about today's data, worry more about
tomorrow's - yesterday's will soon be history
  • DNA database doubling every year (10 months
    maybe)
  • Capture the data when they are available
  • History does matter
  • Tomorrow today will be yesterday
  • Balance your effort between correcting
    yesterday's mistakes and making today's mistakes!
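
To see what the difference between a 12-month and a 10-month doubling time means in practice, here is a small arithmetic sketch. It assumes steady exponential growth, which is a simplification; the function name `growth_factor` is introduced here purely for illustration.

```python
def growth_factor(months, doubling_months):
    """Factor by which a collection grows over `months`, assuming it
    doubles every `doubling_months` (steady exponential growth)."""
    return 2 ** (months / doubling_months)


# One year of growth under each assumed doubling time.
yearly_if_12 = growth_factor(12, 12)
yearly_if_10 = growth_factor(12, 10)

# Five years of growth if the doubling time is 10 months (six doublings).
five_year_if_10 = growth_factor(60, 10)

print(f"12-month doubling: x{yearly_if_12:.2f} per year")
print(f"10-month doubling: x{yearly_if_10:.2f} per year")
print(f"10-month doubling: x{five_year_if_10:.0f} over five years")
```

Under these assumptions a 10-month doubling time gives six doublings in five years rather than five, so the collection ends up twice as large as a yearly doubling would predict.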

90
8. Databases are not publications
  • The scientific record is citable, permanent and
    high-quality
  • Databases are dynamic, up-to-date and
    high-quality

91
9. Can you afford it?
  • Careful data curation is expensive
  • Updates
  • New technology
  • Youre never finished paying

92
10. Who owns it?
  • Protect what you must protect carefully
  • Recognise the benefits of sharing
  • Don't assert rights you're not willing to defend

93
11. Know your domain
  • Natural lines of cleavage
  • Use external authorities
  • Link to other databases
  • Cleave along scientific not technological lines

94
12. Avoid dependencies
  • Every link is a potential dead link
  • Every processing step could end up on your
    critical path
  • External authorities??

95
13. Design at the ontological level
  • Think biology
  • Choose tools which allow you to think
    biology
  • Take expansive input
  • Apply architecture

96
14. Control your metadata
  • ASN.1
  • CIF
  • mmCIF
  • CORBA
  • Web Services

97
15. Just because a database is ill-defined, it
doesn't mean its contents aren't worth rescuing
  • MIM
  • P53
  • Transfac

98
16. Databases mutate - know where your definitive
version is
  • Propagate from the database to the copies

99
17. You'll get it wrong - design for change
  • At every stage of the design realise that you'll
    soon be doing it again
  • Store all your definitions in one place and
    derive everything else from it
  • Tools to do this are available
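
One way to read "store all your definitions in one place and derive everything else from it" is sketched below. This is an illustrative assumption, not the tooling the slide refers to: the `FIELDS` list, its field names, and both helper functions are hypothetical.

```python
# Single source of truth: (field name, Python type, SQL type, required?).
FIELDS = [
    ("accession", str, "TEXT", True),
    ("organism", str, "TEXT", True),
    ("length", int, "INTEGER", False),
]


def create_table_sql(table, fields=FIELDS):
    """Derive a CREATE TABLE statement from the shared definitions."""
    cols = ", ".join(
        f"{name} {sql_type}{' NOT NULL' if required else ''}"
        for name, _py_type, sql_type, required in fields
    )
    return f"CREATE TABLE {table} ({cols})"


def validate(record, fields=FIELDS):
    """Derive record validation from the same definitions."""
    errors = []
    for name, py_type, _sql_type, required in fields:
        if name not in record:
            if required:
                errors.append(f"missing {name}")
        elif not isinstance(record[name], py_type):
            errors.append(f"{name} should be {py_type.__name__}")
    return errors


print(create_table_sql("entry"))
print(validate({"accession": "X1", "length": "12"}))
```

When a field changes, only `FIELDS` is edited; the schema and the validator follow, which is what makes "doing it again" affordable.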

100
18. Standardise where you need standards, don't
where you don't
  • Standardise messages rather than structures
  • GenBank/EMBL data exchange
  • Demand pull rather than authority push
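
"Standardise messages rather than structures" can be illustrated with two stores that keep different internal layouts but agree on one exchange message. This is a toy sketch: the JSON message, its two fields, and the `DatabaseA`/`DatabaseB` classes are assumptions made for illustration and are not the actual GenBank/EMBL exchange format.

```python
import json

# The only thing both sides agree on: the fields of the exchange message.
MESSAGE_FIELDS = ("accession", "sequence")


def parse_message(message):
    """Decode a message and check that the agreed fields are present."""
    data = json.loads(message)
    missing = [field for field in MESSAGE_FIELDS if field not in data]
    if missing:
        raise ValueError(f"message lacks fields: {missing}")
    return data


class DatabaseA:
    """Stores entries internally as a dict keyed by accession."""

    def __init__(self):
        self.entries = {}

    def export_message(self, accession):
        payload = {"accession": accession, "sequence": self.entries[accession]}
        return json.dumps(payload, sort_keys=True)


class DatabaseB:
    """Stores entries internally as a list of (accession, sequence) rows."""

    def __init__(self):
        self.rows = []

    def import_message(self, message):
        data = parse_message(message)
        self.rows.append((data["accession"], data["sequence"]))


a = DatabaseA()
a.entries["X0001"] = "ACGT"
b = DatabaseB()
b.import_message(a.export_message("X0001"))
print(b.rows)
```

Neither store has to change its internal structure to take part; only the message needs agreement.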

101
19. Identifiers - know their scope and persistence
  • Give identifiers to objects which must be
    referenced
  • Know their scope
  • Know their persistence
  • Make them meaningless!
  • Handle all (e.g., split and merge)
    transformations
  • Share them
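
The identifier rules above (meaningless identifiers, known persistence, handling split and merge) can be sketched as a small registry. Everything here is a hypothetical illustration: the `IdentifierRegistry` class, its method names, and the `ID000001` format are assumptions, not any database's actual accession scheme.

```python
import itertools


class IdentifierRegistry:
    """Issues opaque identifiers and remembers what replaced retired ones."""

    def __init__(self, prefix="ID"):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._live = set()
        self._successors = {}  # retired id -> ids that replaced it

    def issue(self):
        # The identifier carries no biological meaning, only a serial number.
        new_id = f"{self._prefix}{next(self._counter):06d}"
        self._live.add(new_id)
        return new_id

    def _require_live(self, identifier):
        if identifier not in self._live:
            raise KeyError(f"{identifier} is not a live identifier")

    def _retire(self, old_id, successors):
        self._live.remove(old_id)
        self._successors[old_id] = list(successors)

    def merge(self, old_ids):
        """Retire several identifiers in favour of one new identifier."""
        old_ids = list(dict.fromkeys(old_ids))  # drop duplicates, keep order
        for old_id in old_ids:
            self._require_live(old_id)
        survivor = self.issue()
        for old_id in old_ids:
            self._retire(old_id, [survivor])
        return survivor

    def split(self, old_id, parts=2):
        """Retire one identifier in favour of several new identifiers."""
        self._require_live(old_id)
        new_ids = [self.issue() for _ in range(parts)]
        self._retire(old_id, new_ids)
        return new_ids

    def resolve(self, identifier):
        """Follow merges and splits to the current live identifier(s)."""
        if identifier in self._live:
            return [identifier]
        if identifier not in self._successors:
            raise KeyError(f"{identifier} was never issued")
        resolved = []
        for successor in self._successors[identifier]:
            resolved.extend(self.resolve(successor))
        return resolved


registry = IdentifierRegistry()
first = registry.issue()
second = registry.issue()
merged = registry.merge([first, second])
left, right = registry.split(merged)
print(registry.resolve(first))
```

Retired identifiers are never reused or deleted, so an old citation can still be resolved to whatever the object has become.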

102
20. Watch your language!
  • Use consistent terminology
  • Use accepted terminology
  • Use external authorities
  • Support variants - synonyms etc.
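
Supporting variants while keeping terminology consistent might look like the following sketch, in which every synonym resolves to one preferred term. The `Vocabulary` class and the example terms are hypothetical; a real resource would take its preferred terms from an external authority.

```python
class Vocabulary:
    """Maps accepted terms and their variants to one preferred term."""

    def __init__(self):
        self._preferred = {}  # normalised variant -> preferred term

    @staticmethod
    def _normalise(term):
        # Ignore case and collapse runs of whitespace.
        return " ".join(term.lower().split())

    def add(self, preferred, synonyms=()):
        keys = [self._normalise(v) for v in (preferred, *synonyms)]
        # Refuse a variant that already points at a different term.
        for key in keys:
            existing = self._preferred.get(key, preferred)
            if existing != preferred:
                raise ValueError(f"{key!r} already maps to {existing!r}")
        for key in keys:
            self._preferred[key] = preferred

    def resolve(self, term):
        """Return the preferred term, or None if the term is unknown."""
        return self._preferred.get(self._normalise(term))


vocab = Vocabulary()
vocab.add("Homo sapiens", synonyms=["human", "H. sapiens"])
print(vocab.resolve("  HUMAN "))
```

Rejecting a synonym that already belongs to another term keeps the mapping unambiguous, at the price of forcing a curator to settle the conflict.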

103
  • All in all, as curators we are custodians of a
    vast and valuable corpus of human knowledge
  • It's important that we get it right
  • There is little understanding of best practice
  • It's time we took the job seriously
  • This meeting is a good contribution to doing so