Title: INSDC Sequencing Project Registry: NCBI web service protocol
1INSDC Sequencing Project Registry NCBI web
service protocol
Use and step-by-step description
National Center for Biotechnology Information,
NIH, Bethesda, MD, USA
2Project definition
- A project is defined as a collection of INSDC records originating from a single organization, or from a consortium of coordinating organizations.
- The collective database records from a project make up a complete genome from a single organism, or a metagenome comprising communities of organisms.
- A project may contain genomic sequences, EST libraries, and any other sequences that contribute to the assembly and annotation of the genome or metagenome.
3Field definitions
- Assigned by INSDC
  - project ID, locus-tag prefix
- Mandatory fields
  - submitter contact info
  - submitting organization
  - project type (single organism or metagenomic)
  - project name (for metagenomic)
  - organism name (for single organism)
  - strain/isolate/breed (for single organism)
  - physical source of material (for single organism)
- Optional fields
  - project description
  - project URL
  - replicon names, estimated sizes
  - sequencing method
  - sequencing depth
  - estimated/calculated genome size
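The field list above can be pictured as a record type. The following is a minimal sketch only: the class name, attribute names, and the `missing_mandatory` helper are illustrative assumptions, not the registry's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProjectRecord:
    # Mandatory fields (attribute names are illustrative, not the real schema)
    submitter_contact: str
    submitting_organization: str
    project_type: str                      # "single organism" or "metagenomic"
    project_name: Optional[str] = None     # mandatory for metagenomic projects
    organism_name: Optional[str] = None    # mandatory for single-organism projects
    strain: Optional[str] = None           # strain/isolate/breed, single organism
    material_source: Optional[str] = None  # physical source, single organism
    # Optional fields
    description: Optional[str] = None
    url: Optional[str] = None
    replicon_names: List[str] = field(default_factory=list)
    sequencing_method: Optional[str] = None
    sequencing_depth: Optional[float] = None
    genome_size: Optional[int] = None
    # Assigned by INSDC, so empty until the registry responds
    project_id: Optional[int] = None
    locus_tag_prefix: Optional[str] = None

    def missing_mandatory(self) -> List[str]:
        """Return the names of mandatory fields that are still empty."""
        required = ["submitter_contact", "submitting_organization"]
        if self.project_type == "metagenomic":
            required.append("project_name")
        else:
            required += ["organism_name", "strain", "material_source"]
        return [name for name in required if not getattr(self, name)]
```

A client could call `missing_mandatory()` before submitting and refuse to send a record while the returned list is non-empty.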
4Schematic diagram of a generic eukaryotic genome
project
[Figure: a project overview linking NCBI data and external data. Sub-projects shown: 1 Genomic sequencing (WGS), assembly, and annotation (complete), Center B; 2 Genomic sequencing (WGS) (complete), Center A; 4 BAC-ends sequencing (incomplete), Center F; 6 Large-scale cDNA sequencing (incomplete), Center B. Linked resources: Nucleotide data at NCBI (GenBank), Genomic data at NCBI (RefSeq), organism-specific overview, links to third-party sites.]
5Main tables in Genome Project database
7International Nucleotide Sequence Database
Collaboration
Locus-tag prefix for annotated genes
8INSDC project
[Diagram boxes: NCBI genome project submission CGI, EMBL genome project submission CGI, DDBJ genome project submission CGI; NCBI Server; NCBI Project Database]
http://www.ncbi.nlm.nih.gov/projects/gpws
9- Web services are web-based enterprise applications that use open, XML-based standards and transport protocols to exchange data with calling clients.
- WSDL is an XML-based service description of how to communicate using the web service; namely, the protocol bindings and message formats required to interact with the web services listed in its directory.
- WSDL is often used in combination with SOAP and XML Schema to provide web services over the internet. A client program connecting to a web service can read the WSDL to determine what functions are available on the server. Any special datatypes used are embedded in the WSDL file in the form of XML Schema. The client can then use SOAP to actually call one of the functions listed in the WSDL.
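To make the SOAP step concrete, here is a minimal sketch of building a SOAP 1.1 request envelope with the Python standard library. The envelope namespace is the standard SOAP 1.1 one. The service namespace, the operation name `CheckStatus`, and the parameter name `ProjectId` are hypothetical placeholders; the real names would come from the service's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "urn:example:gpws"  # hypothetical service namespace, not from the slides


def build_soap_request(operation, params):
    """Return a SOAP 1.1 envelope (as a string) that calls `operation` with `params`."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SVC_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, f"{{{SVC_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameter names, for illustration only.
request_xml = build_soap_request("CheckStatus", {"ProjectId": 12345})
```

In practice a client would usually let a SOAP toolkit generate these messages from the WSDL rather than build them by hand.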
10NCBI Web service implementation
- Web service methods
  - Submit Project
  - Update Project
  - Delete Project
  - Check Status
  - Get Document ID
  - Get Document
- Others
  - Bulk dump
  - Conflict resolution
11Submitting a new project: eSubmit (example successful submission)
[Diagram: the Collab CGI sends eSubmit (names, Locus_tag_prefix) to the NCBI Server; the NCBI Server returns eOK, eNone.]
Normal case with requested locus_tag prefix.
12New project submission - eSubmit (inconsistent
request)
[Diagram: the Collab CGI sends eSubmit (names, flag to auto-assign locus_tag, Locus_tag_prefix) to the NCBI Server; the NCBI Server returns eError, eProvidedLocusTagPrefixWillBeIgnored.]
Data error: raised if the CSubmission.AutoAssignment flag is set and pLocusTagPrefix is also provided by the submitter.
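The two submission outcomes shown so far can be sketched as a single check. This is an illustration under assumptions: the `Submission` class and `check_submission` function are hypothetical, the attribute names only stand in for CSubmission.AutoAssignment and pLocusTagPrefix, and treating every other combination as successful is a simplification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Submission:
    names: str
    auto_assignment: bool = False           # stands in for CSubmission.AutoAssignment
    locus_tag_prefix: Optional[str] = None  # stands in for pLocusTagPrefix


def check_submission(sub: Submission) -> Tuple[str, str]:
    """Return (status, detail) for an eSubmit request."""
    if sub.auto_assignment and sub.locus_tag_prefix is not None:
        # Inconsistent request: auto-assignment was asked for, yet a prefix was supplied.
        return ("eError", "eProvidedLocusTagPrefixWillBeIgnored")
    # Assumption: any other combination is accepted in this sketch.
    return ("eOK", "eNone")
```

For example, `check_submission(Submission("My project", locus_tag_prefix="ABC"))` yields the normal case, while also setting `auto_assignment=True` yields the data error.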
13Providing Reliability
NCBI is providing dual middleware and SQL servers.
[Diagram: two NCBI API servers and two SQL servers (Sql Server1, Sql Server2).]
Data are stored redundantly on two SQL servers.
Having both, or either one, of the API servers available provides full functionality.
The choice of which API server is used is made by load balancing, even when both middleware servers are available.
14Normal handling of conflicting request
[Diagram: a new request is stored on both Sql Server1 and Sql Server2; a conflicting request is rejected.]
When both servers are up, there is no problem: both get the new request.
So, in this state, if a conflicting request incompatible with the new request is made, it can be rejected, as it should be.
15There are multiple RARE reasons why a valid
request could have problems.
- The connection to NCBI could be down, anywhere between the Collab CGI and NCBI. Among rare events, we expect this to be the most common problem.
- The entire NCBI site could be down. Historically, this has been extremely rare.
- One or both of the database servers hosting the service could be down. (See later slides for partial service provided with one server up.)
- (If any NCBI middleware API server is up, the request is handled.)
16Benefits of Redundant SQL Servers
- If any server is up, requests for information can be handled.
- If any server is up, submissions for project IDs and locus_tags can be accepted.
- Normally, a server going down and coming back requires only the minimal action of checking back to confirm that the state is now ok.
17Expected transient case that can be handled
automatically
[Diagram: a new request is received but not confirmed while one SQL server is down; an NCBI maintenance task propagates it to the other SQL server; a conflicting request is rejected.]
When a server is down but then comes back up, the request would normally be propagated by NCBI maintenance tasks.
The Collab API would receive the status eReceived until the maintenance task completed; after that, it would receive the eConfirmed status for the new request.
So, in this corrected state, if a conflicting request incompatible with the new request is made, it can still be rejected, as it should be.
18Collab handling of eReceived
- The following slides provide more information about why eReceived is necessary as a possible return.
- To handle it, the collaborators can check back to confirm that the status has matured to eConfirmed, or to see if a problem was detected.
- The possible EXTREMELY RARE and UNLIKELY problems will be presented in the following slides.
19Two Phase Commit
- Computer scientists might recognize the problem as a natural consequence of a two phase commit.
- Normally, the two phases are hidden from submitters.
- If the second phase is blocked by a server being down, then this complexity is revealed by the receipt of the eReceived status.
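The behavior described on this slide can be sketched with two in-memory stand-ins for the SQL servers. This is a simplified illustration, not NCBI's implementation: the `SqlServer` class, the `submit` function, and the `"rejected"` and `"unavailable"` return strings are assumptions made for the example; only the eReceived and eConfirmed status names come from the slides.

```python
class SqlServer:
    """In-memory stand-in for one of the two redundant SQL servers."""

    def __init__(self):
        self.up = True
        self.rows = {}  # locus_tag_prefix -> (owner, status)


def submit(servers, prefix, owner):
    """Two phase submission of a locus_tag prefix on behalf of `owner`."""
    live = [s for s in servers if s.up]
    if not live:
        return "unavailable"  # illustrative value, not a status from the slides
    # Reject if any reachable server already holds the prefix for another owner.
    for s in live:
        held = s.rows.get(prefix)
        if held and held[0] != owner:
            return "rejected"  # illustrative value, not a status from the slides
    # Phase 1: record the request on every reachable server.
    for s in live:
        s.rows[prefix] = (owner, "eReceived")
    # Phase 2: confirm only if every server took part in phase 1.
    if len(live) < len(servers):
        return "eReceived"
    for s in servers:
        s.rows[prefix] = (owner, "eConfirmed")
    return "eConfirmed"
```

With both servers up, the two phases complete in one call and the submitter only ever sees eConfirmed. With one server down, phase 2 cannot run and the eReceived status becomes visible.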
20Unavoidable Complexity Caused by Redundant SQL
Servers
- Redundant SQL servers both prevent data loss and maximize uptime for queries. That is why we choose to accept the complexity of the two phase commit.
- Even in this case the request can be accepted, but confirmation has to come after a delay.
21Why bother with the two phase commit at all?
- Although expected to be EXTREMELY RARE and UNLIKELY, the following slide shows a sequence of events prevented by the current system.
- This slide shows what would NOT HAPPEN in the proposed system because of the two phase commit.
- The following slide shows what would happen WITHOUT the two phase commit.
22Illustration of what we will not allow and must
protect against
[Diagram: a new request reaches one SQL server; a conflicting request reaches the other and is marked "Accepted!". Caption: unacceptable state prevented by two phase commit.]
But when one server is down, then comes back up while the first goes down, watch what can happen!
So, in this state, if a conflicting request incompatible with the new request is made, it could be accepted, leading to an unacceptable data state.
23Why this event is expected to be so rare
- This event requires the following sequence:
  - Server 2 going down.
  - Server 1 accepting a request, then going down, while
  - Server 2 comes back up to accept the conflicting request.
24How this event would be handled.
- Should this unlikely event happen, instead of the status maturing from eReceived to eConfirmed, it would degrade to eConflict for both.
- The desired correction would be decided among the collaborators, by dialog.
- The database would be patched to reflect the desired outcome.
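The detection step can be sketched as a reconciliation pass that runs once both SQL servers are reachable again. This is a hedged illustration: the `reconcile` function and the dictionary layout are assumptions for the example, not the actual NCBI maintenance task.

```python
def reconcile(rows1, rows2):
    """Merge the rows of two SQL servers once both are reachable again.

    Each argument maps locus_tag_prefix -> (owner, status) and is updated in place.
    """
    for prefix in set(rows1) | set(rows2):
        a, b = rows1.get(prefix), rows2.get(prefix)
        if a and b and a[0] != b[0]:
            # Two different owners were received for the same prefix:
            # both requests degrade to eConflict for manual resolution.
            rows1[prefix] = (a[0], "eConflict")
            rows2[prefix] = (b[0], "eConflict")
        else:
            # Present on one server only, or identical on both: propagate and confirm.
            owner = (a or b)[0]
            rows1[prefix] = rows2[prefix] = (owner, "eConfirmed")


# Rare sequence: server 2 is down while server 1 receives a request, then
# server 1 goes down while server 2 comes back and receives a conflicting one.
server1 = {"ABC": ("center_a", "eReceived")}
server2 = {"ABC": ("center_b", "eReceived"), "XYZ": ("center_b", "eReceived")}
reconcile(server1, server2)
```

After the pass, the contested prefix `ABC` is marked eConflict on both servers, while the uncontested prefix `XYZ` is copied to server 1 and confirmed on both.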
25Illustration of rare event
- The following slide illustrates what would happen should this rare sequence of events occur.
- It may never happen, but the two phase commit makes it possible, so we want to be clear from the beginning about what would happen and how it would be handled.
26Proposed responses should this happen
[Diagram: the new request and the conflicting request each reach a different SQL server; both are marked "Received, not confirmed".]
But when one server is down, then comes back up while the first goes down, this is what we should do.
So, both requests are received, but not confirmed. Processes running on our servers will detect this for manual attention.
27The previous should be a rare event
- Then why bother handling it? Because
  - The cost of automatically making this mistake would be high, and
  - The more normal, typical, frequent, and expected recoveries, as on previous slides, are handled automatically.
- It will be noticed by an eReceived state degrading to eConflict.
28Handling of eReceived
- All of the rare cases are noticed by the receipt of eReceived.
- Collaborators need to check back for changes to the status of eReceived projects.
- If the status matures to eConfirmed, no further action is needed.
- If the status degrades to eConflict, then discussion will be needed. This will be rare!
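The check-back step could look like the polling loop below. It is a sketch under assumptions: `check_status` is a hypothetical callable wrapping the Check Status web service method, and the attempt count and delay are arbitrary defaults.

```python
import time


def wait_for_final_status(check_status, project_id, attempts=5, delay_s=60.0):
    """Poll until the status leaves eReceived or the attempts run out.

    `check_status` is a caller-supplied function (hypothetical here) that
    takes a project ID and returns the current status string.
    """
    status = check_status(project_id)
    for _ in range(attempts - 1):
        if status != "eReceived":
            break
        time.sleep(delay_s)  # give the maintenance task time to run
        status = check_status(project_id)
    return status
```

A return value of eConfirmed needs no further action, eConflict calls for discussion among the collaborators, and eReceived means the project should be checked again later.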
29Non-collab users of the NCBI Genome Project data
- Public data is available in Entrez
- Use eUtils (also implemented as NCBI web service)
- Discussion on the data elements in Entrez Genome Project Docsum
- ftp dumps
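As a rough illustration of the eUtils route, the sketch below only builds an ESearch request URL and does not contact the server. The database name `genomeprj` is an assumption about how the Genome Project database is named in Entrez and may differ; the helper name `esearch_url` is likewise illustrative.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="genomeprj", retmax=20):
    """Build an ESearch URL; the default db name is an assumption, see above."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


url = esearch_url("Escherichia coli")
```

The IDs returned by such a search could then be passed to ESummary to retrieve the Docsum records.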