Title: Data Curation and the Scientific Record
1Data Curation and the Scientific Record
2- Enormous value of carefully curated data collections in the biomolecular arena
- What we want from eScience to make our job easier
- 10 myths about data curation
- 10 commandments of data curation
3DNA
4Genes
5Protein Sequences
6Structures
7Expression
8PDB code 1DIF HIV-1 Protease/Inhibitor Complex
A79285 (Difluoroketone)
molecules interact
9Pathways
From KEGG
10Cells
19EMBL-Bank DNA sequences
20EMBL-Bank DNA sequences
UniProt Protein Sequences
21EMBL-Bank DNA sequences
UniProt Protein Sequences
EMSD Macromolecular Structure Data
22EMBL-Bank DNA sequences
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
23EMBL-Bank DNA sequences
EnsEMBL Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
24Getting into processes
- Properties of molecules
- DNA sequences
- Protein sequences
- Protein Structures
- Behaviour of molecules
- Function
- Interaction
- Reactions
- Pathways
25EMBL-Bank DNA sequences
EnsEMBL Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
26EMBL-Bank DNA sequences
EnsEMBL Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
27Reactome
EMBL-Bank DNA sequences
EnsEMBL Genome Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
29Global Picture
- DNA: tripartite international collaboration
- (including patent data acquisition)
- Protein sequences: UniProt collaboration
- Macromolecular structures: tripartite international collaboration
- IntAct: international agreements
- Reactome: USA-Europe collaboration
- Etc.
30Usage
- Basic research
- Industry
- Pharma
- Diagnostics
- Medical device research
- Personal care
- Nutrition
- Agriculture
- Forestries
- Fishery
- Patent searching and provenance
31Industry Partners
- AstraZeneca
- Bayer Healthcare
- Bristol-Myers Squibb
- F. Hoffmann-La Roche
- Merck KGaA
- Nestlé Research Centre
- Pfizer
- Schering
- Syngenta
- Unilever R&D
- Philips Medical Devices
- Aventis
- Boehringer Ingelheim
- GlaxoSmithKline
- Johnson & Johnson Pharma R&D
- Merck Sharp & Dohme
- Organon
- Procter & Gamble
- Serono S.A.
- Sanofi-Synthelabo
32European Patent Office (EPO)
- We want the biomolecular data from patent applications in our databases
- Patent examiners need to search databases in ascertaining originality
- We gather patent data
- We run special (secure) services for the EPO
- (Note: if, for example, a biomolecular sequence appears in a patent, we want it in the database. This does not mean that the molecule is the subject of the patent.)
33Using the information
You've found a gene whose variation seems relevant
34EMBL-Bank DNA sequences
Search EMBL-Bank for sequences similar to your
gene of interest
35EMBL-Bank DNA sequences
Look for the protein translation in SWISS-PROT
SWISS-PROT TrEMBL Protein Sequences
36EMBL-Bank DNA sequences
See if its structure (or a similar one) is in EMSD
UniProt Protein Sequences
EMSD Macromolecular Structure Data
37EMBL-Bank DNA sequences
Do we know anything about what proteins
it interacts with, or when it's expressed?
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
38Pathways
EMBL-Bank DNA sequences
Does this give us a hint about what pathways
it's involved in?
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
39Pathways
What do we know about other molecules involved in
that pathway?
EMBL-Bank DNA sequences
EnsEMBL Human Genome Gene Annotation
UniProt Protein Sequences
ArrayExpress Microarray Expression Data
EMSD Macromolecular Structure Data
IntAct Protein Interactions
40Using the information
Can we influence the pathway?
42This all keeps us fairly busy
43Nucleotide Sequence Database Growth
Megabases
Date
44EMBL-Bank 1982-2005
UniProt etc. 1986-2005
Megabases
Entries
PDB 1972-2005
InterPro 2000-2005
Entries
Entries
45Average Web Hits per Day
Including Ensembl
Average Hits per Day
Note Ensembl is a joint project with The
Wellcome Trust Sanger Institute. Equivalent
usage data have only been available since 2004.
Quarter Year
46Web Hits
47EBI Staff 2005
Total Staff 265
48EBI Total Running Budget 2005: 25 million
49 Specialist biomolecular data
resource examples
Medical data resources
Core biomolecular resources
Biodiversity data resources
SGD
Flybase
Chemical data resources
MGD
Eumorphia/ Phenotypes
Mutants
Mouse Atlas
51Medical data resources
Core biomolecular resources
52 Specialist biomolecular data
resource examples
Medical data resources
Core biomolecular resources
Biodiversity data resources
SGD
Flybase
Chemical data resources
MGD
Eumorphia/ Phenotypes
Mutants
Mouse Atlas
53How eScience can help us
55The User doesn't care
56Harvest and delivery
57Layers
Application
User interface
58Databases
Application
User interface
59Interconnectivity
Application
User interface
Application interface
60Communicate objects and their identities
Application
User interface
Application interface
61Using standard protocols
Application
User interface
Application interface
62Using standard protocols
Primary Biomolecular Databases
Application
User interface
Application interface
63Authorities - GO
Primary Biomolecular Databases
User interface
Application
Application interface
GO
64Authorities - GO
Primary Biomolecular Databases
User interface
Application
Application interface
GO
65Authorities are databases
Primary Biomolecular Databases
GO
66Creating connections
Primary Biomolecular Databases
GO
67Integr8
Primary Biomolecular Databases
GO
68Integr8
Primary Biomolecular Databases
User interface
Application interface
Application
GO
69Staging database (BioMart)
Integr8
Primary Biomolecular Databases
User interface
Application interface
Application
GO
70Staging database (BioMart)
Integr8
User interface
Application interface
Application
Primary Biomolecular Databases
GO
71Lots of it
- 10-12 core databases
- 100 others in the first shell
- 1000 more in the next shell
7210. Myths about data curation
- Well, 9 actually unless I think of a tenth by
the end of the meeting!
731. Databases are designed then built
- Oh yeah! Usually there's
- Data spilling all over the floor
- A rabidly angry user community
- A skinflint funding agency
- Then you realise you need a database
742. Integration - the more the merrier
- Every link is a potential dead link
- Every dependency can find its way on to your critical path
- Monolithic solutions always fail
- Find the natural lines of cleavage which minimise the number of connections
- Standardise the connections
753. Standards are a good thing
- Design to minimise what you need to standardise
764. Domain experts will help you with your
standardisation
- Consider taxonomy of species
- There are huge differences of opinion among experts
- They want to represent all the uncertainty and alternatives
- They have no concept of "good enough"
- In short, they care far too much to be of any use
775. Students of the formal structure of knowledge
will help you structure your knowledge
- They will give you loads of advice about a world that doesn't exist
- They will give you extensive (usually insulting) comments on what you are doing wrong even if you have a million users who think it's just fine
- But
- Read their formalisms
- Try to understand the rules you are breaking (and why)
- Don't actually let them do anything
786. Warehouses work
- Piffle
- They never manage to maintain synchrony with the source data
- Mostly they fall down of their own weight!
797. Databases are use agnostic
- The only cost-effective design knows what needs
it is trying to serve
808. Databases are expensive
- A damned sight less so than no databases
- Cost of the science they support
- Cost of the traditional scientific record
819. Scientists care what's in the databases
82Ten Commandments of Data Curation
- Well 20 actually (virtual memory?)
831. Capture expensive and useful data for posterity
- PDB (1971) was a recognition that structure data are immensely useful
- But they are expensive (35,000 @ 50,000 each?)
- That comes to 1,750,000,000!
- Structure and function inexorably intertwined
842. Make it easy to capture the data when they are
available
- Scientists don't do research for the sake of the databases
- Saving data is a chore - make it easy
- The structure problem was incompletely solved - the structures are derived data, and the original data were never captured
- Harvesting data at source
853. Be proactive about acquiring data
- Go out and get the data which have potential downstream use
- Run a data liberation army
- Educate the generators of information as to the value of the sum of the parts
864. Make it easy to get at the most frequently
used data
- Vast storage capacity can go to your head
- Archive vs. library
- E.g. Public DNA database vs. lab files of traces
875. Don't complicate everyone's life for the sake
of a few esoteric cases
- Computer systems are too complicated - fight it
- Information resources are worse
- He who pays the piper establishes a committee to call the tune
- Gather requirements expansively, prune ruthlessly
- Example: the EMBL/GenBank/DDBJ Feature Table
886. Archive the enduring, not the episodic
- Human Genome
- Neuroinformatics databases
- Confocal images of cells
897. Worry about today's data, worry more about
tomorrow's - yesterday's will soon be history
- DNA database doubling every year (10 months maybe)
- Capture the data when they are available
- History does matter
- Tomorrow, today will be yesterday
- Balance your effort between correcting yesterday's mistakes and making today's mistakes!
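The doubling claim on this slide can be made concrete with a little arithmetic. The following is a minimal sketch, assuming simple exponential growth; the function name `growth_factor` and the five-year horizon are illustrative assumptions, not figures from the talk.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Factor by which a collection grows over `months`, assuming it
    doubles every `doubling_months` (simple exponential model)."""
    return 2.0 ** (months / doubling_months)


# Compare the two doubling times mentioned on the slide over a
# hypothetical five-year (60-month) planning horizon.
yearly = growth_factor(60, 12)       # doubling every 12 months
ten_monthly = growth_factor(60, 10)  # doubling every 10 months

print(f"12-month doubling: x{yearly:.0f} in five years")
print(f"10-month doubling: x{ten_monthly:.0f} in five years")
```

Under this model, shortening the doubling time from 12 to 10 months doubles the volume reached after five years, which is one way to read "worry more about tomorrow's".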
908. Databases are not publications
- The scientific record is
- citable
- permanent
- high-quality
- Databases are
- dynamic
- up-to-date
- high-quality
919. Can you afford it?
- Careful data curation is expensive
- Updates
- New technology
- You're never finished paying
9210. Who owns it?
- Protect what you must protect carefully
- Recognise the benefits of sharing
- Don't assert rights you're not willing to defend
9311. Know your domain
- Natural lines of cleavage
- Use external authorities
- Link to other databases
- Cleave along scientific not technological lines
9412. Avoid dependencies
- Every link is a potential dead link
- Every processing step could end up on your critical path
- External authorities??
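One way to keep "every link is a potential dead link" honest is to audit cross-references routinely. The sketch below is only an illustration under stated assumptions: the entry names, database names and identifiers are made up, and the function `find_dead_links` is hypothetical rather than part of any real service.

```python
from typing import Dict, List, Set, Tuple


def find_dead_links(
    cross_refs: Dict[str, List[Tuple[str, str]]],
    known_ids: Dict[str, Set[str]],
) -> List[Tuple[str, str, str]]:
    """Return (entry, target_db, target_id) for every cross-reference
    whose target identifier is not present in the target database."""
    dead = []
    for entry, links in sorted(cross_refs.items()):
        for target_db, target_id in links:
            # An unknown target database counts as dead too.
            if target_id not in known_ids.get(target_db, set()):
                dead.append((entry, target_db, target_id))
    return dead


# Made-up entries and identifiers, for illustration only.
cross_refs = {
    "ENTRY_A": [("structures", "S0001"), ("interactions", "I0042")],
    "ENTRY_B": [("structures", "S9999")],
}
known_ids = {
    "structures": {"S0001", "S0002"},
    "interactions": {"I0042"},
}

print(find_dead_links(cross_refs, known_ids))
```

Run against each release of a linked resource, a check like this reports broken dependencies before users find them.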
9513. Design at the ontological level
- Think biology
- Choose tools which allow you to think biology
- Take expansive input
- Apply architecture
9614. Control your metadata
- ASN.1
- CIF
- mmCIF
- CORBA
- Web Services
9715. Just because a database is ill defined, it
doesn't mean its contents aren't worth rescuing
9816. Databases mutate - know where your definitive
version is
- Propagate from the database to the copies
9917. You'll get it wrong - design for change
- At every stage of the design realise that you'll soon be doing it again
- Store all your definitions in one place and derive everything else from it
- Tools to do this are available
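"Store all your definitions in one place and derive everything else from it" can be sketched as follows. This is a minimal illustration, not the tooling the talk refers to: the field list, the table name `entry` and the helper names `derive_ddl` and `validate` are all assumptions made for the example.

```python
from typing import Any, Dict, List

# Single source of truth: a hypothetical definition of one record type.
FIELDS: List[Dict[str, Any]] = [
    {"name": "accession", "type": "text", "required": True},
    {"name": "organism", "type": "text", "required": True},
    {"name": "length", "type": "integer", "required": False},
]

# Mappings from the abstract types to two concrete targets.
SQL_TYPES = {"text": "TEXT", "integer": "INTEGER"}
PY_TYPES = {"text": str, "integer": int}


def derive_ddl(table: str, fields: List[Dict[str, Any]]) -> str:
    """Derive a CREATE TABLE statement from the field definitions."""
    cols = []
    for f in fields:
        null = " NOT NULL" if f["required"] else ""
        cols.append(f"{f['name']} {SQL_TYPES[f['type']]}{null}")
    return f"CREATE TABLE {table} ({', '.join(cols)});"


def validate(record: Dict[str, Any], fields: List[Dict[str, Any]]) -> List[str]:
    """Derive a record check from the same definitions."""
    problems = []
    for f in fields:
        value = record.get(f["name"])
        if value is None:
            if f["required"]:
                problems.append(f"missing {f['name']}")
        elif not isinstance(value, PY_TYPES[f["type"]]):
            problems.append(f"{f['name']} should be {f['type']}")
    return problems


print(derive_ddl("entry", FIELDS))
print(validate({"accession": "X00001", "length": "12"}, FIELDS))
```

When the definition changes, both the schema and the validator change with it, so the next redesign touches one place rather than several.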
10018. Standardise where you need standards, don't
where you don't
- Standardise messages rather than structures
- GenBank/EMBL data exchange
- Demand pull rather than authority push
10119. Identifiers - know their scope and persistence
- Give identifiers to objects which must be referenced
- Know their scope
- Know their persistence
- Make them meaningless!
- Handle all transformations (e.g., split and merge)
- Share them
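The identifier rules above (meaningless, persistent, surviving splits and merges) can be illustrated with a toy registry. This is a sketch under stated assumptions: the class `IdentifierRegistry`, its methods and the `ID000001` format are hypothetical and do not describe how any particular database issues accessions.

```python
import itertools
from typing import Dict, List, Set


class IdentifierRegistry:
    """Toy registry of opaque, persistent identifiers."""

    def __init__(self, prefix: str = "ID") -> None:
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._live: Set[str] = set()
        # Retired identifier -> identifiers that replaced it.
        self._successors: Dict[str, List[str]] = {}

    def mint(self) -> str:
        """Issue a new identifier that encodes nothing about the object."""
        new_id = f"{self._prefix}{next(self._counter):06d}"
        self._live.add(new_id)
        return new_id

    def _retire(self, old_id: str, replacements: List[str]) -> None:
        self._live.remove(old_id)
        self._successors[old_id] = list(replacements)

    def merge(self, ids: List[str]) -> str:
        """Merge several live objects into one new identifier."""
        unique = list(dict.fromkeys(ids))
        for i in unique:
            if i not in self._live:
                raise KeyError(f"{i} is not a live identifier")
        merged = self.mint()
        for old in unique:
            self._retire(old, [merged])
        return merged

    def split(self, old_id: str, parts: int = 2) -> List[str]:
        """Split one live object into several new identifiers."""
        if old_id not in self._live:
            raise KeyError(f"{old_id} is not a live identifier")
        new_ids = [self.mint() for _ in range(parts)]
        self._retire(old_id, new_ids)
        return new_ids

    def resolve(self, any_id: str) -> List[str]:
        """Follow the history of any issued identifier to its live ones."""
        if any_id in self._live:
            return [any_id]
        if any_id not in self._successors:
            raise KeyError(f"{any_id} was never issued")
        result: List[str] = []
        for nxt in self._successors[any_id]:
            for live in self.resolve(nxt):
                if live not in result:
                    result.append(live)
        return result


reg = IdentifierRegistry()
a, b = reg.mint(), reg.mint()
merged = reg.merge([a, b])
parts = reg.split(merged)
print(reg.resolve(a))
```

Retired identifiers are never reused or deleted, so a reference published years ago can still be resolved to whatever the object has become.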
10220. Watch your language!
- Use consistent terminology
- Use accepted terminology
- Use external authorities
- Support variants - synonyms etc.
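Consistent, accepted terminology with support for variants can be sketched as a small synonym table. The class `Vocabulary` and its methods are hypothetical names chosen for this example; a real curation pipeline would more likely defer to an external authority, as the slide recommends.

```python
from typing import Dict, Iterable, Optional


class Vocabulary:
    """Maps accepted terms and their synonyms to one accepted form."""

    def __init__(self) -> None:
        self._lookup: Dict[str, str] = {}

    @staticmethod
    def _key(term: str) -> str:
        # Case-insensitive, whitespace-insensitive matching.
        return " ".join(term.lower().split())

    def add(self, accepted: str, synonyms: Iterable[str] = ()) -> None:
        """Register an accepted term together with its variants."""
        for term in (accepted, *synonyms):
            key = self._key(term)
            existing = self._lookup.get(key)
            if existing is not None and existing != accepted:
                raise ValueError(f"{term!r} already maps to {existing!r}")
            self._lookup[key] = accepted

    def normalise(self, term: str) -> Optional[str]:
        """Return the accepted term, or None if the term is unknown."""
        return self._lookup.get(self._key(term))


vocab = Vocabulary()
vocab.add("Homo sapiens", synonyms=["human", "H. sapiens"])
vocab.add("Mus musculus", synonyms=["mouse", "house mouse"])

print(vocab.normalise("  HUMAN "))
print(vocab.normalise("house  mouse"))
print(vocab.normalise("yeti"))
```

Rejecting a synonym that already points elsewhere keeps the vocabulary unambiguous, at the cost of forcing a curator to decide which sense wins.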
103- All in all, as curators we are custodians of a vast and valuable corpus of human knowledge
- It's important that we get it right
- There is little understanding of best practice
- It's time we took the job seriously
- This meeting is a good contribution to doing so