Title: ArthropodEST: K-State Bioinformatics EST analysis pipeline
Sanjay Chellapilla (1), Yoonseong Park (2), Doina Caragea (3) and Susan J. Brown (1)
(1) Bioinformatics Center, Division of Biology
(2) Department of Entomology
(3) Department of Computing and Information Sciences
Kansas State University, Manhattan KS 66506
ABSTRACT
Expressed Sequence Tags (ESTs), produced
by single-pass end-sequencing of cDNA clones,
generate large datasets that are instrumental in
gene discovery and gene sequence determination.
Although several EST data analysis pipelines are
available on the WWW (e.g. ESTpass, EGassembler
and ESTexplorer), the WWW-accessible K-State
Bioinformatics EST analysis pipeline,
ArthropodEST, goes further than these existing
pipelines by providing more options and analyses,
along with a user-friendly interface. The
pipeline was developed utilizing freely available
bioinformatics and system software (academic or
F/OSS licenses). Available options in the
pipeline include input sequence cleaning and
screening for vectors and contaminants, masking
repetitive sequences using repeat databases,
clustering and assembly into contigs, computing
ORFs (Open Reading Frames) and/or signal-peptide
predictions, and assigning functional annotations
to the contigs and singletons. The pipeline
automatically sends result notification email(s)
to the user's email address, each containing a
unique URL from which the results can be
downloaded. An automatically generated summary
report of the analyses is included in the results
available for download. The pipeline is
accessible at
http://bioinformatics.ksu.edu/ArthropodEST/
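The result-notification step described above can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the pipeline itself uses PERL, CGI and shell scripts. The base URL, sender address, subject line and function names are all hypothetical placeholders, not the pipeline's own, and actual delivery through a mail transport agent is left out.

```python
import secrets
from email.message import EmailMessage

# Hypothetical base URL for result downloads (assumption, not from the poster).
RESULTS_BASE_URL = "http://example.org/ArthropodEST/results"


def make_result_url(base_url: str = RESULTS_BASE_URL) -> str:
    """Return a hard-to-guess download URL for one project."""
    token = secrets.token_urlsafe(16)  # random, URL-safe text token
    return f"{base_url}/{token}/"


def build_notification(project: str, user_email: str, result_url: str) -> EmailMessage:
    """Build (but do not send) the result notification email."""
    msg = EmailMessage()
    msg["From"] = "pipeline@example.org"  # placeholder sender address
    msg["To"] = user_email
    msg["Subject"] = f"ArthropodEST results ready: {project}"
    msg.set_content(
        f"Results for project '{project}' can be downloaded from:\n{result_url}\n"
    )
    return msg


if __name__ == "__main__":
    url = make_result_url()
    print(build_notification("demo_project", "user@example.org", url))
```

Using a random token per project means the download location cannot be guessed from the project name or the email address.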
WORKFLOW
- Client side (user), ArthropodEST homepage: the user enters a project name, e-mail address, input files and options/parameters for the analyses.
- Server side, CGI script: processes the user inputs, displays a project-receipt confirmation and summary, sends an automatic confirmation email and invokes the pipeline shell script.
- Server side, pipeline shell script: input sequence cleaning and vector/contaminant screening; repeat-masking with standard RepBase libraries; assembly, with optional prior clustering, into contigs and singletons; further analyses (functional annotations and/or signal-peptide predictions).
- Client side (user): the user downloads the results and report from the unique URL automatically sent by email.
COMPONENTS OF THE PIPELINE
(a) System software: GNU/Linux Ubuntu 2.6.24-23-server, bash 3.2.39, Apache 2.2.8 with mod_perl/2.0.3, PERL 5.8.8 with PERL modules CGI 3.29, Mail::Mailer 1.74 and File::Temp 0.18, MySQL 5.0 and Postfix 2.5.4 Mail Transport Agent (MTA).
(b) Bioinformatics software:
- TGICL software suite, http://compbio.dfci.harvard.edu/tgi/software/
- Vector databases: NCBI UniVec, http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html; EMBL EmVec, ftp://ftp.ebi.ac.uk/pub/databases/emvec/
- RepeatMasker, http://www.RepeatMasker.org/, and associated RepBase libraries, http://www.girinst.org/ (requires either cross_match, http://www.phrap.org/phredphrapconsed.html, or wu-blastall, http://blast.wustl.edu/)
- CAP3 sequence-assembly program, http://seq.cs.iastate.edu/
- NCBI BLAST suite, http://www.ncbi.nlm.nih.gov/BLAST/download.shtml, and/or wu-blastall, http://blast.wustl.edu/
- blast2GO pipeline version B2G4PIPE, http://blast2go.bioinfo.cipf.es/
- signalp, http://www.cbs.dtu.dk/services/SignalP/, and EMBOSS, http://emboss.sourceforge.net/
(c) In-house developed software: WWW interface (HTML/CSS); server-side CGI, PERL, bash shell and awk scripts.
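The order of the server-side stages in the workflow (cleaning and screening, repeat-masking, optional clustering, assembly, optional further analyses) can be sketched as a small planning function. This is an illustrative Python sketch, not the pipeline's shell script. The option and stage names are hypothetical, and it assumes that cleaning/screening and repeat-masking always run, which the poster does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Options:
    """User-selected options; field names are illustrative, not the pipeline's own."""
    cluster_before_assembly: bool = False
    functional_annotation: bool = False
    signal_peptide: bool = False


def plan_stages(opts: Options) -> List[str]:
    """Return the stage names in the order the workflow lists them."""
    # Assumption: cleaning/screening and repeat-masking are always run.
    stages = ["clean_and_screen", "repeat_mask"]
    if opts.cluster_before_assembly:
        stages.append("cluster")  # optional clustering prior to assembly
    stages.append("assemble")     # produces contigs and singletons
    if opts.functional_annotation:
        stages.append("functional_annotation")
    if opts.signal_peptide:
        stages.append("signal_peptide_prediction")
    return stages


if __name__ == "__main__":
    print(plan_stages(Options(cluster_before_assembly=True, signal_peptide=True)))
```

Keeping the plan as a plain list makes it easy to log which analyses were requested and to include them in a summary report.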
Acknowledgements: Supported by KSU-TE-AGC (SC),
KSU Bioinformatics Center (DC, SC) and K-INBRE
(DC, SC).
Kansas State University; KSU Bioinformatics Center; KSU Arthropod Genomics Center; K-INBRE