1
Data Curation and the Scientific Record

Graham Cameron

Enormous value of carefully curated data
collections in the biomolecular arena
What we want from eScience to make our job easier
10 myths about data curation
10 commandments of data curation

3
DNA
4
Genes
5
Protein Sequences
6
Structures
7
Expression
8
PDB code 1DIF HIV-1 Protease/Inhibitor Complex
A79285 (Difluoroketone)
molecules interact
9
Pathways
From KEGG
10
Cells
19
EMBL-Bank DNA sequences
20
EMBL-Bank DNA sequences
UniProt Protein Sequences
21
EMBL-Bank DNA sequences
UniProt Protein Sequences
EMSD Macromolecular Structure Data
22
EMBL-Bank DNA sequences
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
23
EMBL-Bank DNA sequences
Ensembl Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
24
Getting into processes

Properties of molecules
DNA sequences
Protein sequences
Protein Structures
Behaviour of molecules
Function
Interaction
Reactions
Pathways

25
EMBL-Bank DNA sequences
Ensembl Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
26
EMBL-Bank DNA sequences
Ensembl Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
27
Reactome
EMBL-Bank DNA sequences
Ensembl Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
29
Global Picture

DNA: tripartite international collaboration
(including patent data acquisition)
Protein sequences: UniProt collaboration
Macromolecular structures: tripartite
international collaboration
IntAct: international agreements
Reactome: USA-Europe collaboration
Etc.

30
Usage

Basic research
Industry
Pharma
Diagnostics
Medical device research
Personal care
Nutrition
Agriculture
Forestry
Fisheries
Patent searching and provenance

31
Industry Partners

AstraZeneca
Bayer Healthcare
Bristol-Myers Squibb
F. Hoffmann-La Roche
Merck KGaA
Nestlé Research Centre
Pfizer
Schering
Syngenta
Unilever R&D
Philips Medical Devices

Aventis
Boehringer Ingelheim
GlaxoSmithKline
Johnson & Johnson Pharma R&D
Merck Sharp & Dohme
Organon
Procter & Gamble
Serono S.A.
Sanofi-Synthelabo

32
European Patent Office (EPO)

We want the biomolecular data from patent
applications in our databases
Patent examiners need to search databases in
ascertaining originality
We gather patent data
We run special (secure) services for the EPO
(Note: if, for example, a biomolecular sequence
appears in a patent we want it in the database.
This does not mean that the molecule is the
subject of the patent.)

33
Using the information
You've found a gene whose variation seems relevant
34
EMBL-Bank DNA sequences
Search EMBL-Bank for sequences similar to your
gene of interest
35
EMBL-Bank DNA sequences
Look for the protein translation in SWISS-PROT
SWISS-PROT TrEMBL Protein Sequences
36
EMBL-Bank DNA sequences
See if its structure (or a similar one) is in EMSD
UniProt Protein Sequences
EMSD Macromolecular Structure Data
37
EMBL-Bank DNA sequences
Do we know anything about what proteins
it interacts with, or when it's expressed?
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
38
Pathways
EMBL-Bank DNA sequences
Does this give us a hint about what pathways
it's involved in?
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
39
Pathways
What do we know about other molecules involved in
that pathway?
EMBL-Bank DNA sequences
Ensembl Human Genome Gene Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
40
Using the information
Can we influence the pathway?
42
This all keeps us fairly busy
43
Nucleotide Sequence Database Growth (chart: megabases against date)
44
EMBL-Bank 1982-2005 (chart: megabases)
UniProt etc. 1986-2005 (chart: entries)
PDB 1972-2005 (chart: entries)
InterPro 2000-2005 (chart: entries)
45
Average Web Hits per Day, Including Ensembl (chart: average hits per day by quarter)
Note: Ensembl is a joint project with The
Wellcome Trust Sanger Institute. Equivalent
usage data have only been available since 2004.
46
Web Hits
47
EBI Staff 2005
Total Staff: 265
48
EBI Total Running Budget 2005: 25 million
49
Specialist biomolecular data
resource examples
Medical data resources
Core biomolecular resources
Biodiversity data resources
SGD
Flybase
Chemical data resources
MGD
Eumorphia/ Phenotypes
Mutants
Mouse Atlas
51
Medical data resources
Core biomolecular resources
53
How eScience can help us
55
The user doesn't care
56
Harvest and delivery
57
Layers
Application
User interface
58
Databases
Application
User interface
59
Interconnectivity
Application
User interface
Application interface
60
Communicate objects and their identities
Application
User interface
Application interface
61
Using standard protocols
Application
User interface
Application interface
62
Using standard protocols
Primary Biomolecular Databases
Application
User interface
Application interface
63
Authorities - GO
Primary Biomolecular Databases
User interface
Application
Application interface
Go
65
Authorities are databases
Primary Biomolecular Databases
Go
66
Creating connections
Primary Biomolecular Databases
Go
67
Integr8
Primary Biomolecular Databases
Go
68
Integr8
Primary Biomolecular Databases
User interface
Application interface
Application
Go
69
Staging database (BioMart)
Integr8
Primary Biomolecular Databases
User interface
Application interface
Application
Go
70
Staging database (BioMart)
Integr8
User interface
Application interface
Application
Primary Biomolecular Databases
Go
71
Lots of it

10-12 core databases
100 others in the first shell
1000 more in the next shell

72
10 Myths about data curation

Well, 9 actually unless I think of a tenth by
the end of the meeting!

73
1. Databases are designed then built

Oh yeah! - usually there's
Data spilling all over the floor
A rabidly angry user community
A skinflint funding agency
Then you realise you need a database

74
2. Integration - the more the merrier

Every link is a potential dead link
Every dependency can find its way on to your
critical path
Monolithic solutions always fail
Find the natural lines of cleavage which minimise
the number of connections
Standardise the connections

75
3. Standards are a good thing

Design to minimise what you need to standardise

76
4. Domain experts will help you with your
standardisation

Consider taxonomy of species
There are huge differences of opinion among
experts
They want to represent all the uncertainty and
alternatives
They have no concept of "good enough"
In short, they care far too much to be of any use

77
5. Students of the formal structure of knowledge
will help you structure your knowledge

They will give you loads of advice about a world
that doesn't exist
They will give you extensive (usually insulting)
comments on what you are doing wrong even if
you have a million users who think it's just fine
But
Read their formalisms
Try to understand the rules you are breaking (and
why)
Don't actually let them do anything

78
6. Warehouses work

Piffle
They never manage to maintain synchrony with the
source data
Mostly they fall down of their own weight!

79
7. Databases are use-agnostic

The only cost-effective design knows what needs
it is trying to serve

80
8. Databases are expensive

A damned sight less so than no databases
Cost of the science they support
Cost of the traditional scientific record

81
9. Scientists care what's in the databases

Not if it's their data

82
Ten Commandments of Data Curation

Well 20 actually (virtual memory?)

83
1. Capture expensive and useful data for posterity

PDB (1971) was a recognition that structure data
are immensely useful
But they are expensive (35,000 @ 50,000?)
= 1,750,000,000!
Structure and function inexorably intertwined

84
2. Make it easy to capture the data when they are
available

Scientists don't do research for the sake of the
databases
Saving data is a chore - make it easy
The structure problem was incompletely solved -
the structures are derived data, and the original
data was never captured
Harvesting data at source

85
3. Be proactive about acquiring data

Go out and get the data which have potential
downstream use
Run a data liberation army
Educate the generators of information as to the
value of the sum of the parts

86
4. Make it easy to get at the most frequently
used data

Vast storage capacity can go to your head
Archive vs. library
E.g. Public DNA database vs. lab files of traces

87
5. Don't complicate everyone's life for the sake
of a few esoteric cases

Computer systems are too complicated - fight it
Information resources are worse
He who pays the piper establishes a committee to
call the tune
Gather requirements expansively, prune ruthlessly
Example: the EMBL/GenBank/DDBJ Feature Table

88
6. Archive the enduring, not the episodic

Human Genome
Neuroinformatics databases
Confocal images of cells

89
7. Worry about today's data, worry more about
tomorrow's - yesterday's will soon be history

DNA database doubling every year (10 months
maybe)
Capture the data when they are available
History does matter
Tomorrow today will be yesterday
Balance your effort between correcting
yesterday's mistakes and making today's mistakes!
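
The doubling figures on this slide can be turned into a rough capacity estimate. The sketch below assumes steady exponential growth, which the slide does not claim, and the function names and starting size are illustrative only.

```python
import math


def projected_size(current_size: float, months_ahead: float,
                   doubling_months: float) -> float:
    """Size after months_ahead months, assuming a constant doubling time."""
    return current_size * 2 ** (months_ahead / doubling_months)


def months_until(current_size: float, target_size: float,
                 doubling_months: float) -> float:
    """Months needed to grow from current_size to target_size."""
    return doubling_months * math.log2(target_size / current_size)


if __name__ == "__main__":
    start = 100.0  # hypothetical starting size, arbitrary units
    for doubling in (12, 10):  # "every year (10 months maybe)"
        factor = projected_size(start, 60, doubling) / start
        print(f"doubling every {doubling} months: x{factor:.0f} in five years")
```

Under these assumptions, the difference between a 12-month and a 10-month doubling time is a factor of two in storage after five years (32-fold against 64-fold growth).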

90
8. Databases are not publications

The scientific record is
citable
permanent
high-quality
Databases are
dynamic
up-to-date
high-quality

91
9. Can you afford it?

Careful data curation is expensive
Updates
New technology
You're never finished paying

92
10. Who owns it?

Protect what you must protect carefully
Recognise the benefits of sharing
Don't assert rights you're not willing to defend

93
11. Know your domain

Natural lines of cleavage
Use external authorities
Link to other databases
Cleave along scientific not technological lines

94
12. Avoid dependencies

Every link is a potential dead link
Every processing step could end up on your
critical path
External authorities??
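
"Every link is a potential dead link" can at least be monitored. The following is a minimal sketch of a dangling cross-reference check over toy in-memory data; the database names, identifiers and the `find_dead_links` function are hypothetical and do not describe any EBI service.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: each database maps its own identifiers to the
# cross-references it holds, written as (target_database, target_identifier).
Database = Dict[str, List[Tuple[str, str]]]


def find_dead_links(
    databases: Dict[str, Database],
) -> List[Tuple[str, str, str, str]]:
    """Return (source_db, source_id, target_db, target_id) for each
    cross-reference whose target database or identifier does not exist."""
    known: Dict[str, Set[str]] = {name: set(db) for name, db in databases.items()}
    dead = []
    for name, db in databases.items():
        for entry_id, xrefs in db.items():
            for target_db, target_id in xrefs:
                if target_id not in known.get(target_db, set()):
                    dead.append((name, entry_id, target_db, target_id))
    return dead


# Toy data, invented for illustration.
toy = {
    "proteins": {
        "P1": [("structures", "S1")],
        "P2": [("structures", "S9")],  # S9 was never loaded
    },
    "structures": {"S1": []},
}

if __name__ == "__main__":
    for link in find_dead_links(toy):
        print("dead link:", link)
```

A check like this only reports breakage after the fact; the slide's advice is to keep the number of such dependencies small in the first place.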

95
13. Design at the ontological level

Think biology
Choose tools which allow you to think
biology
Take expansive input
Apply architecture

96
14. Control your metadata

ASN.1
CIF
mmCIF
CORBA
Web Services

97
15. Just because a database is ill defined, it
doesn't mean its contents aren't worth rescuing

MIM
P53
Transfac

98
16. Databases mutate - know where your definitive
version is

Propagate from the database to the copies

99
17. You'll get it wrong - design for change

At every stage of the design realise that you'll
soon be doing it again
Store all your definitions in one place and
derive everything else from it
Tools to do this are available
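
One way to read "store all your definitions in one place and derive everything else from it" is sketched below: a single record definition from which both a table declaration and a validator are generated. The field names and the `derive_ddl` and `validate` helpers are assumptions made for illustration, not the tools the slide refers to.

```python
from typing import Any, Dict, List, Tuple

# The one place where fields are declared (hypothetical example record).
# Each field maps to (Python type, SQL type, required?).
Definition = Dict[str, Tuple[type, str, bool]]

SEQUENCE_ENTRY: Definition = {
    "accession": (str, "TEXT", True),
    "organism": (str, "TEXT", True),
    "length": (int, "INTEGER", True),
    "description": (str, "TEXT", False),
}


def derive_ddl(table: str, definition: Definition) -> str:
    """Derive a CREATE TABLE statement from the definition."""
    columns = []
    for name, (_py_type, sql_type, required) in definition.items():
        columns.append(f"{name} {sql_type}" + (" NOT NULL" if required else ""))
    return f"CREATE TABLE {table} (" + ", ".join(columns) + ");"


def validate(record: Dict[str, Any], definition: Definition) -> List[str]:
    """Derive validation from the same definition; return a list of problems."""
    problems = []
    for name, (py_type, _sql_type, required) in definition.items():
        if name not in record:
            if required:
                problems.append(f"missing required field: {name}")
        elif not isinstance(record[name], py_type):
            problems.append(f"{name}: expected {py_type.__name__}")
    for name in record:
        if name not in definition:
            problems.append(f"unknown field: {name}")
    return problems


if __name__ == "__main__":
    print(derive_ddl("sequence_entry", SEQUENCE_ENTRY))
    print(validate({"accession": "X1", "organism": "Homo sapiens",
                    "length": "12"}, SEQUENCE_ENTRY))
```

When the definition changes, the table declaration and the validator change with it, which is the point of designing for change.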

100
18. Standardise where you need standards, don't
where you don't

Standardise messages rather than structures
GenBank/EMBL data exchange
Demand pull rather than authority push

101
19. Identifiers - know their scope and persistence

Give identifiers to objects which must be
referenced
Know their scope
Know their persistence
Make them meaningless!
Handle all (e.g., split and merge)
transformations
Share them
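
The points on this slide can be sketched as a small registry that issues opaque identifiers and records what happens to them on merge and split. This is a hedged illustration only: the class name, the prefix and the numbering scheme are assumptions, not a description of how any of the databases above assign accessions.

```python
import itertools
from typing import Dict, List


class IdentifierRegistry:
    """Issues meaningless identifiers and remembers what retired ones became."""

    def __init__(self, prefix: str = "ID") -> None:
        self._prefix = prefix
        self._counter = itertools.count(1)
        # identifier -> [] while live, else the identifiers that replaced it
        self._successors: Dict[str, List[str]] = {}

    def issue(self) -> str:
        """Create a new live identifier that encodes nothing about the object."""
        new_id = f"{self._prefix}{next(self._counter):06d}"
        self._successors[new_id] = []
        return new_id

    def _require_live(self, identifier: str) -> None:
        if identifier not in self._successors:
            raise KeyError(f"never issued: {identifier}")
        if self._successors[identifier]:
            raise ValueError(f"already retired: {identifier}")

    def merge(self, old_ids: List[str]) -> str:
        """Retire several live identifiers in favour of one new identifier."""
        if len(set(old_ids)) < 2:
            raise ValueError("merge needs at least two distinct identifiers")
        for old in old_ids:
            self._require_live(old)
        merged = self.issue()
        for old in old_ids:
            self._successors[old] = [merged]
        return merged

    def split(self, old_id: str, parts: int) -> List[str]:
        """Retire one live identifier in favour of several new ones."""
        if parts < 2:
            raise ValueError("split needs at least two parts")
        self._require_live(old_id)
        new_ids = [self.issue() for _ in range(parts)]
        self._successors[old_id] = new_ids
        return new_ids

    def resolve(self, identifier: str) -> List[str]:
        """Follow retirements to the live identifiers replacing this one."""
        successors = self._successors[identifier]  # KeyError if never issued
        if not successors:
            return [identifier]
        live: List[str] = []
        for successor in successors:
            live.extend(self.resolve(successor))
        return live
```

Retired identifiers are never reused or deleted, so a reference published years ago can still be resolved to whatever the object has since become.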

102
20. Watch your language!

Use consistent terminology
Use accepted terminology
Use external authorities
Support variants - synonyms etc.
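
Supporting variants while keeping terminology consistent usually means mapping every accepted synonym to one preferred term. A minimal sketch follows, with a hypothetical `Vocabulary` class and invented example terms; a real resource would take its preferred terms from an external authority.

```python
from typing import Dict, Iterable, Optional


class Vocabulary:
    """Maps variant spellings and synonyms to a single preferred term."""

    def __init__(self) -> None:
        self._preferred: Dict[str, str] = {}

    @staticmethod
    def _key(term: str) -> str:
        # Compare terms case-insensitively and ignore extra whitespace.
        return " ".join(term.lower().split())

    def add(self, preferred: str, synonyms: Iterable[str] = ()) -> None:
        """Register a preferred term together with its accepted variants."""
        for term in (preferred, *synonyms):
            key = self._key(term)
            existing = self._preferred.get(key)
            if existing is not None and existing != preferred:
                raise ValueError(f"{term!r} already maps to {existing!r}")
            self._preferred[key] = preferred

    def normalise(self, term: str) -> Optional[str]:
        """Return the preferred term, or None if the term is unknown."""
        return self._preferred.get(self._key(term))


if __name__ == "__main__":
    vocab = Vocabulary()
    vocab.add("Homo sapiens", ["human", "H. sapiens"])
    print(vocab.normalise("  HUMAN "))
    print(vocab.normalise("yeti"))
```

Rejecting a synonym that already points at a different preferred term keeps the mapping unambiguous, at the cost of forcing a curator to decide which reading is intended.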

103

All in all, as curators we are custodians of a
vast and valuable corpus of human knowledge
It's important that we get it right
There is little understanding of best practice
It's time we took the job seriously
This meeting is a good contribution to doing so
