Title: EMBOSS as a DAS Client
1. EMBOSS as a DAS Client
- Peter Rice pmr_at_ebi.ac.uk
- Mahmut Uludag uludag_at_ebi.ac.uk
- 3rd March 2011.
2. EMBOSS: A quick introduction
- European Molecular Biology Open Software Suite
- Open source package for sequence analysis
- ANSI C source code
- GPL licensed applications, LGPL libraries
- 200 applications
- 100 third party applications in 15 associated packages
- Project started 1996 at Sanger Centre and HGMP
- Now based at EBI
- Release 6.3.0 15th July 2010
- Funded by UK-BBSRC and EMBL-EBI
3. EMBOSS history
- Project started at Sanger Centre and SEQNET, August 1996
- Alan moved from SEQNET 1997 (Wellcome funding)
- Peter moved to Lion Bioscience 2000 (CCP11-BBSRC/MRC)
- Peter moved to EBI 2003
- HGMP closed 2005; Alan and Jon moved to EBI
- BBSRC funding (limited) 2006-2009
- BBSRC BBR funding 2009-2011
- Major new developments
- New data types
- New data sources
- Built-in ontologies
4. EMBOSS command line interface
- EMBOSS applications run from the command line
- This is not the only interface
- There are over 100 interfaces and packaged systems available
- Web interfaces
- Graphical user interfaces (GUIs)
- Web services
- All applications have a command definition file (.acd)
- Defines all inputs, outputs, and other options
- Read at startup
- Contains all command line options with descriptions
- Template for any other interface
5. EMBOSS command line example
- antigenic
- Input protein sequence(s): uniprot:actb1_fugru
- Minimum length of antigenic region [6]:
- Output report [actb1_fugru.antigenic]:
- antigenic uniprot:actb1_fugru -auto
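The same non-interactive run can be driven from a script. The sketch below assumes EMBOSS is installed and that `antigenic` accepts the qualifiers `-sequence`, `-minlen`, `-outfile` and `-auto`, which match the names in its ACD file; the helper names `build_antigenic_command` and `run_antigenic` are made up for this illustration.

```python
import shutil
import subprocess


def build_antigenic_command(usa, minlen=6, outfile=None):
    """Return the argument list for a non-interactive antigenic run.

    usa     -- input sequence as a USA, e.g. "uniprot:actb1_fugru"
    minlen  -- minimum length of antigenic region (ACD default is 6)
    outfile -- report file name; if None, EMBOSS chooses its default name
    """
    argv = ["antigenic", "-sequence", usa, "-minlen", str(minlen)]
    if outfile is not None:
        argv += ["-outfile", outfile]
    # -auto suppresses all prompts and accepts the default values
    argv.append("-auto")
    return argv


def run_antigenic(usa, minlen=6, outfile=None):
    """Run antigenic if it is on PATH; raise a clear error otherwise."""
    if shutil.which("antigenic") is None:
        raise FileNotFoundError("antigenic not found on PATH; is EMBOSS installed?")
    subprocess.run(build_antigenic_command(usa, minlen, outfile), check=True)
```

Calling `run_antigenic("uniprot:actb1_fugru")` would then behave like the command shown on the slide, provided the `uniprot` database is defined in the local EMBOSS configuration.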
6. EMBOSS ACD File
  application: antigenic [
    documentation: "Finds antigenic sites in proteins"
    groups: "Protein:Motifs"
  ]
  section: input [
    information: "Input section"
    type: "page"
  ]
    seqall: sequence [
      parameter: "Y"
      type: "proteinstandard"
    ]
  endsection: input
  section: required [
    information: "Required section"
    type: "page"
  ]
    integer: minlen [
      standard: "Y"
      minimum: "1"
      maximum: "50"
      default: "6"
      information: "Minimum length of antigenic region"
    ]
  endsection: required
  section: output [
    information: "Output section"
    type: "page"
  ]
    report: outfile [
      parameter: "Y"
      rformat: "motif"
      multiple: "Y"
      taglist: "int:pos=Max_score_pos"
    ]
  endsection: output
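Because the ACD file is the template for every other interface, a wrapper only needs to read it to learn an application's options. The following is a minimal sketch of such a reader, not the EMBOSS ACD parser: it assumes the `datatype: name [ attribute: "value" ]` layout, ignores `section`/`endsection` nesting, and would fail on values that contain `]`. The function name `parse_acd` is hypothetical.

```python
import re

# A shortened ACD fragment, written out in the assumed bracketed layout.
ACD_TEXT = '''
application: antigenic [
  documentation: "Finds antigenic sites in proteins"
  groups: "Protein:Motifs"
]
integer: minlen [
  standard: "Y"
  minimum: "1"
  maximum: "50"
  default: "6"
  information: "Minimum length of antigenic region"
]
'''

# datatype: name [ ...body... ]
BLOCK_RE = re.compile(r'(\w+):\s*(\w+)\s*\[(.*?)\]', re.DOTALL)
# attribute: "value"
ATTR_RE = re.compile(r'(\w+):\s*"([^"]*)"')


def parse_acd(text):
    """Return one dict per ACD block: its datatype, name and attributes."""
    blocks = []
    for kind, name, body in BLOCK_RE.findall(text):
        blocks.append({"kind": kind, "name": name,
                       "attrs": dict(ATTR_RE.findall(body))})
    return blocks
```

A web form generator, for example, could use the `minimum`, `maximum` and `default` attributes of `minlen` to build a bounded numeric input.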
7. EMBOSS ACD File with EDAM Annotation
  application: antigenic [
    documentation: "Finds antigenic sites in proteins"
    groups: "Protein:Motifs"
    relations: "EDAM:0000201 topic Immunological analysis"
    relations: "EDAM:0000416 operation Epitope mapping"
  ]
  section: input [
    information: "Input section"
    type: "page"
  ]
    seqall: sequence [
      parameter: "Y"
      type: "proteinstandard"
      relations: "EDAM:0001219 data Pure protein sequence"
      relations: "EDAM:0000849 data Sequence record"
      relations: "EDAM:0002178 data 1 or more"
    ]
  endsection: input
  section: required [
    information: "Required section"
    type: "page"
  ]
    integer: minlen [
      standard: "Y"
      minimum: "1"
      maximum: "50"
      default: "6"
      information: "Minimum length of antigenic region"
      relations: "EDAM:0001249 data Sequence length"
    ]
  endsection: required
  section: output [
    information: "Output section"
    type: "page"
  ]
    report: outfile [
      parameter: "Y"
      rformat: "motif"
      multiple: "Y"
      taglist: "int:pos=Max_score_pos"
      relations: "EDAM:0001534 data Peptide immunogenicity report"
    ]
  endsection: output
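Each `relations` value on this slide appears to carry three parts: an EDAM identifier, a namespace (topic, operation or data) and a human-readable label. A small sketch of splitting such a value is given below. It assumes the form `EDAM:<digits> <namespace> <label>`; the colon is a reconstruction of the slide text, and the names `EdamRelation` and `parse_relation` are invented for the example.

```python
from typing import NamedTuple


class EdamRelation(NamedTuple):
    term_id: str    # numeric part of the identifier, e.g. "0000201"
    namespace: str  # "topic", "operation" or "data"
    label: str      # free-text term name, may contain spaces


def parse_relation(value):
    """Split an ACD relations value into identifier, namespace and label."""
    prefixed_id, namespace, label = value.split(None, 2)
    prefix, _, term_id = prefixed_id.partition(":")
    if prefix != "EDAM" or not term_id.isdigit():
        raise ValueError(f"not an EDAM relation: {value!r}")
    return EdamRelation(term_id, namespace, label)
```

Grouping the parsed relations by namespace gives, for `antigenic`, one topic, one operation and several data terms, which is the information a search for applications by data type would need.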
8. Documentation books
- Three books at typesetting stage.
- Administrator's Manual
- User's Manual
- Developer's Manual
- Concomitant major revision of EMBOSS website.
- Automation of website content addition.
- Books to form basis of new website content.
9. EMBOSS Sequences
- Uniform Sequence Address (USA): URL-style naming
- Derived from the familiar "VMS logical name" syntax used by SRS and GCG.
- database:entryname
- embl:ecompa   ID or accession can be used in this way
- uniprot-id:opsd_bovin   SRS syntax for query by ID
- embl-acc:x13776   SRS syntax for query by accession
- format::filename
- fasta::/users/pmr/paamir.fa   Filename with specific format
- ecoompa.genbank   With no format, can try all formats
- format::filename:entryname
- fasta::unfinished:AH6.1   Most formats allow multiple sequences
- Also @listfile
- and asis::gctgactgactgatg
- Queries: database-field:query   SRS syntax for id, acc, sv, des, key, org
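The USA forms listed above can be told apart by their punctuation alone. The sketch below classifies a USA string on that basis. It is a simplification and not how EMBOSS resolves a USA: EMBOSS also checks database definitions and the file system, and this version would misread file paths that contain a colon. The function name `parse_usa` and the returned dictionary keys are hypothetical.

```python
def parse_usa(usa):
    """Classify a Uniform Sequence Address by its syntax only."""
    if usa.startswith("@"):
        # @listfile: a file containing one USA per line
        return {"kind": "listfile", "path": usa[1:]}

    fmt, rest = None, usa
    if "::" in usa:
        # format::filename or asis::sequence
        fmt, rest = usa.split("::", 1)

    if fmt == "asis":
        return {"kind": "asis", "sequence": rest}

    if fmt is not None:
        # format::filename[:entryname]
        path, _, entry = rest.partition(":")
        return {"kind": "file", "format": fmt, "path": path,
                "entry": entry or None}

    if ":" in rest:
        # database:entryname or database-field:query
        db, query = rest.split(":", 1)
        db, _, field = db.partition("-")
        return {"kind": "database", "db": db, "field": field or None,
                "query": query}

    # bare filename: format is left for auto-detection
    return {"kind": "file", "format": None, "path": rest, "entry": None}
```

For instance, `parse_usa("embl-acc:x13776")` yields a database query on the `acc` field, while `parse_usa("ecoompa.genbank")` is treated as a file whose format has to be detected.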
10. New data resources
- Aim to read all public data resources
- Follow cross-references (explicit and implied)
- UniProt
- EMBL/GenBank/DDBJ
- Other
- Servers
- Multiple data resources through a single server definition
- DAS, Ensembl, BioMart, WsEbeye, DbFetch, SRS
- Cache files of resource definitions for server
- Data resource catalogue (drcat)
- 600 data resources
- Query terms and URLs
- EDAM annotation of resources, formats, identifiers, terms
11. Data resource catalogue (drcat)
  ID      ArachnoServer
  Acc     DB-0145
  Name    ArachnoServer
  Desc    Spider toxin database
  URL     http://www.arachnoserver.org
  Cat     Organism-specific databases
  Taxon   6845 | Arachnida
  EDAMres 0000621 | Organism-specific
  EDAMdat 0002400 | Toxin annotation
  EDAMid  0002578 | ArachnoServer ID
  Xref    SP_explicit | ArachnoServer ID;Toxin name
  Query   Toxin annotation | HTML | ArachnoServer ID | www.arachnoserver.org/toxincard.html?id=%s
  Example ArachnoServer ID | AS000014
  CCmisc  BMC Genomics 10:375-375(2009) | Pubmed 19674480
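A catalogue entry of this kind is easy to load into a lookup structure. The sketch below reads one entry into a dictionary, assuming each line is a tag followed by a value and that multi-part values are separated by `|`; the separator is an assumption about the file layout, since it does not survive in the slide text. The name `parse_drcat_entry` is made up for the example.

```python
ENTRY = """\
ID      ArachnoServer
Acc     DB-0145
Desc    Spider toxin database
Taxon   6845 | Arachnida
EDAMdat 0002400 | Toxin annotation
EDAMid  0002578 | ArachnoServer ID
Example ArachnoServer ID | AS000014
"""


def parse_drcat_entry(text):
    """Map each tag to a list of values; each value is a list of fields."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The tag is the first word; the rest of the line is the value.
        tag, _, value = line.partition(" ")
        fields = [part.strip() for part in value.split("|")]
        # Tags may repeat, so every occurrence is kept.
        record.setdefault(tag, []).append(fields)
    return record
```

With the entry loaded, `record["Example"][0]` gives the identifier type and a sample identifier, which is enough to test a query against the resource.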
12. EMBOSS Data Types
- Sequences
- Nucleotide (DNA and RNA)
- Protein
- Features
- Attached to sequences
- Independent data objects
- Bio-Ontologies (OBO)
- Taxonomy (NCBI)
- Data Resources
- Assembled reads
- Text
- Text, HTML, XML
13. New data types
- Reuse USA syntax
- Server:Dbname:identifier   Database has an access method
- Server:Dbname-field:query   General field names
- Data types: features, bio-ontologies, taxonomy, etc.
- Access methods: HTTP, DAS, BioMart, Ensembl, ...
- Multiple types and formats for a server/resource
- type: sequence features
- format: embl fasta
14. EMBOSS Query Language
- Query fields are now made general
- Any field queryable by the access method (DAS, SRS, ...)
- Any index created by indexing applications
- Any query term in the data resource catalogue
- Multiple queries combined
- For one data resource
- AND, OR, to combine queries
15. DAS Server Definitions
  SERVER das [
    method: "dassource"
    type: "sequence, features"
    url: "http://www.dasregistry.org/das/"
    comment: "access sequence/feature sources listed on das registry (http://www.dasregistry.org/das/)"
    cachefile: "server.dassource"
  ]
16. DAS Server Definitions
  SERVER ensembldas [
    method: "dassource"
    type: "sequence, features"
    url: "http://www.ensembl.org/das/"
    comment: "access sequence/feature sources on ensembl das server (http://www.ensembl.org/das/)"
    cachefile: "server.ensembldas"
  ]
17. DAS Example
  DB Ensembl_Human_Genes [
    method: das
    type: "Sequence, Features"
    taxon: "9606"
    format: "das, dasgff"
    url: "http://www.ebi.ac.uk/das-srv/genedas/das/Homo_sapiens.Gene_ID.reference"
    example: "ENSG00000139618"
    comment: "The Ensembl human Gene_ID reference source, serving sequences and non-location features."
    hasaccession: "N"
    identifier: "segment"
    fields: "segment, type, category, categorize, feature_id"
  ]
18. Ensembl DAS Example
  DB Felis_catus_CAT_prediction_transcript [
    method: das
    type: "Nucfeatures"
    taxon: "9685"
    format: "dasgff"
    url: "http://www.ensembl.org/das/Felis_catus.CAT.prediction_transcript"
    example: "scaffold_209987:1,550"
    comment: "Annotation source for Felis_catus prediction_transcript"
    hasaccession: "N"
    identifier: "segment"
    fields: "segment, type, category, categorize, feature_id"
  ]
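A database definition like this one gives a client what it needs to form a DAS request: the source URL and a segment identifier. The sketch below builds a DAS `features` request URL in the `segment=ID:start,stop` form and reads a DASGFF response with the standard library. It is an illustration, not the EMBOSS implementation; the sample XML is invented for the example, and the function names `das_features_url` and `parse_dasgff` are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def das_features_url(source_url, segment, start=None, stop=None):
    """Build a DAS features request for a segment, optionally with a range."""
    seg = segment if start is None else f"{segment}:{start},{stop}"
    # Keep ":" and "," literal so the range stays readable in the URL.
    query = urlencode({"segment": seg}, safe=":,")
    return f"{source_url.rstrip('/')}/features?{query}"


# Invented DASGFF response, used only so the parser can be run offline.
SAMPLE_DASGFF = """
<DASGFF>
  <GFF version="1.0" href="http://example.org/das/demo/features">
    <SEGMENT id="scaffold_209987" start="1" stop="550">
      <FEATURE id="feat1" label="feat1">
        <TYPE id="transcript">transcript</TYPE>
        <START>10</START>
        <END>200</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""


def parse_dasgff(xml_text):
    """Return one dict per FEATURE with its segment, type and coordinates."""
    root = ET.fromstring(xml_text)
    rows = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            rows.append({
                "segment": segment.get("id"),
                "feature_id": feature.get("id"),
                "type": feature.find("TYPE").get("id"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return rows
```

The dictionary keys deliberately mirror the `fields` entry of the definition (`segment`, `type`, `feature_id`), since those are the names a query would filter on.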
19. EMBOSS Query Language
- das:ensembl_human_genes:ENSG00000139618
- ensembldas:Felis_catus_CAT_prediction_transcript:scaffold_209987:1,550
- das:Homo_sapiens_GRCh37_transcript:10:32889611,32973347
- das:uniprot:P00280
- das:cath:5pti
- das:uniparc:UPI000000000A
- das:Homo_sapiens_GRCh37_reference-segment:11 type:supercontig
20. EMBOSS Query Language: Future
- Ontology-based searches of data resources
- Taxonomy
- EDAM terms
- Resources
- Data types
- Identifiers
- Descriptions
- Search for applications matching data types
- Sequences and features
- Nucleotide and protein
- Support for DAS advanced query ...
21. Acknowledgements
- EBI: Peter Rice, Alan Bleasby, Jon Ison, Mahmut Uludag, Martin Senger, Tom Oinn, Jaina Mistry, Rodrigo Lopez, Sharmilla Pillai, Hamish McWilliam
- RFCGR/HGMP: Alan Bleasby, Jon Ison, Tim Carver, Hugh Morgan, Claude Beazley, Lisa Mullan, Damian Counsell, Gary Williams, Val Curwen, Mark Faller, Sinead O'Leary, Thon deBoer, Martin Bishop
- Sanger Institute: Ian Longden, Richard Bruskiewich, Simon Kelley
- LION: Mahmut Uludag, Thomas Laurent, Bijay Jassal, Bren Vaughan, Thure Etzold
- National bioinformatics service providers in Norway, Spain, Italy, Netherlands, Germany, Belgium, Russia, China, Canada, Australia, Argentina
- Others: Catherine Letondal, Don Gilbert, Rodger Staden, Bill Pearson, Webb Miller, Marie-Laetitia Denayer, Amandine Schurmann, Gabriele Weiler, Luke McCarthy, David Mathog, David Bauer, Henrikki Almusa, Thomas Siegmund, Scott Markel, Darryl Leon, Bastien Chevreux, Ivo Hofacker, ...
- IBM, Hewlett-Packard, (Compaq), Apple, SGI, Sun, LION bioscience, SciTegic, Cambridge University Press
- Open-Bio Foundation, Sourceforge, Debian, Fedora, CEH
- ... And the British Antarctic Survey
- http://emboss.sourceforge.net
- http://emboss.open-bio.org/wiki/Latest_developments