Title: Bioinformatics at Promega Corporation
1. Bioinformatics at Promega Corporation
Intro to Bioinformatics, Biotec, November 28, 2006
Ethan Strauss, Sr. Scientist, R&D Bioinformatics, Promega
Ethan.strauss_at_promega.com
http://q7.com/ethan/molbio
2. My Background
- Bachelor's degree in biology
- PhD and work experience in Molecular Biology
- Eight years in Promega Technical Services
- Almost two years in Bioinformatics (officially)
- No formal computer training
- No formal bioinformatics training
3. Bioinformatics at Promega Corporation
- Bioinformatics did not exist as a separate function until 2001
- One person 2001-2005
- Two people 2005-?
- Bioinformatics primarily supports R&D (100 scientists)
- Mentor and train R&D scientists
- Provide expertise for projects (120 requests per year)
- Propose and evaluate new acquisitions
- Liaison to IT department
- Manage bioinformatics infrastructure (15 tools)
- Develop new tools and adapt existing tools in-house
4. Bioinformatics Projects
- Programming
- Tools for internal and external Promega customers
- Plexor Primer Design System (https://www.promega.com/techserv/tools/plexor/logon.aspx)
- Biomath (http://www.promega.com/biomath/)
- siRNA Designer (http://www.promega.com/siRNADesigner/)
- Sequence analysis for Excel and Microsoft Word (http://www.promega.com/enotes/features/fe0025.htm)
- Analysis of BLAST results
- Automated data retrieval (Web services)
- Database for tracking vector construction
- Database for keeping track of plasmid features
5. Bioinformatics Projects
- Biocomputing (use of computers in biological research)
- Database searches
- Data mining
- Discovery research
- Primer design
- BLAST analysis and interpretation
- Etc.
6. NCBI
- I recently took the PowerScripting course from NCBI
- NCBI has a lot of very powerful tools and databases.
- They are not as well documented as they might be.
- Check them out periodically.
- Databases at NCBI that I was not aware of, but am now:
- PubMed Central: articles with free full text
- 3D Domains, Structure: 3D structural information
- GEO (Gene Expression Omnibus): microarray expression data
- There are many more that I see on the drop-down list but don't really know anything about
7. NCBI FTP site
- Most NCBI data is available by FTP from http://www.ncbi.nlm.nih.gov/Ftp/
- I have used it for a number of projects, including an analysis of amino acid residue distribution for the first 11 positions of human and E. coli proteins
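A positional residue analysis like the one mentioned above can be sketched in a few lines of Python. This is only an illustration: the slide does not say how the analysis was actually done, the FASTA records below are toy sequences rather than real proteins, and the function names are hypothetical.

```python
from collections import Counter

def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def positional_distribution(sequences, n_positions=11):
    """Count which residues occur at each of the first n_positions."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in sequences:
        # Sequences shorter than n_positions simply contribute fewer positions.
        for i, residue in enumerate(seq[:n_positions].upper()):
            counts[i][residue] += 1
    return counts

# Toy records for illustration only; a real run would read a downloaded FASTA file.
EXAMPLE_FASTA = """>toy1
MKTAYIAKQRQISFVK
>toy2
MHSLLPQWERTYIPAS
>toy3
MKV
"""

sequences = [seq for _, seq in parse_fasta(EXAMPLE_FASTA)]
distribution = positional_distribution(sequences)
for position, counter in enumerate(distribution, start=1):
    print(position, counter.most_common(3))
```

Running the same counting separately on a human and an E. coli protein file would give two tables that can be compared position by position.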
8. NCBI - Entrez Programming Utilities
Programmatic access to Entrez: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
- Allows incorporation of Entrez functionality into third-party tools (http://www.promega.com/techserv/tools/plexor/NewQpcrProject.aspx)
- Allows automation of Entrez searches
- Analysis of large datasets
- Automation of searches and queries
- Accessible using HTTP or SOAP
9. NCBI - Entrez Programming Utilities
- Programs available
- ESearch: Searches and retrieves primary IDs and term translations, and optionally retains results for future use in the user's environment.
- ESummary: Retrieves document summaries from a list of primary IDs or from the user's environment.
- EFetch: Retrieves records in the requested format from a list of one or more primary IDs or from the user's environment.
- ELink: Checks for links from the query ID numbers to other Entrez databases.
- EInfo: Provides field index term counts, last update, and available links for each database.
- EPost: Posts a file containing a list of primary IDs for future use in the user's environment, to use with subsequent search strategies.
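Each of these programs is reached as a URL with query parameters, so a request can be assembled with ordinary string handling. The sketch below builds (but does not send) two such URLs. The base URL is the one used in the Perl demonstration later in this deck and may have changed since; the helper name `eutils_url` and the ID values are made up for illustration.

```python
from urllib.parse import urlencode

# Base URL as it appears in this deck's Perl demo; treat it as an assumption.
EUTILS_BASE = "http://www.ncbi.nlm.nih.gov/entrez/eutils"

def eutils_url(program, **params):
    """Build the request URL for one E-utility, e.g. 'esearch' or 'efetch'."""
    return f"{EUTILS_BASE}/{program}.fcgi?{urlencode(params)}"

# ESearch: find records and keep the result set on the server (usehistory=y).
search_url = eutils_url("esearch", db="pubmed", term="zanzibar",
                        retmax=1, usehistory="y")

# ESummary: document summaries for a list of primary IDs (placeholder IDs).
summary_url = eutils_url("esummary", db="pubmed", id="12345,67890")

print(search_url)
print(summary_url)
```

Using `urlencode` rather than pasting strings together means search terms containing spaces or punctuation are escaped correctly.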
10. NCBI - Entrez Programming Utilities
Let's try it! Go to http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/eu.html and play.
Now try http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/epipe.html
11. NCBI - Entrez Programming Utilities
These sorts of utilities can be accessed programmatically using Perl. See Demonstration Programs at http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
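Whatever the language, a script that uses ESearch with history has to pull the hit count, query key, and WebEnv token out of the XML reply, then page through the stored results in batches. The Python sketch below shows that step using an XML parser instead of a regular expression. The sample reply is hand-written and its values are made up, so the exact shape of a real reply is an assumption here; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an ESearch reply; all values are invented.
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>7</Count>
  <RetMax>1</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_TOKEN</WebEnv>
  <IdList><Id>111</Id></IdList>
</eSearchResult>
"""

def parse_esearch(xml_text):
    """Extract the fields needed to fetch the stored result set later."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count")),
        "query_key": root.findtext("QueryKey"),
        "web_env": root.findtext("WebEnv"),
    }

def batch_starts(count, retmax):
    """Return the retstart value for each EFetch request of size retmax."""
    return list(range(0, count, retmax))

info = parse_esearch(SAMPLE_ESEARCH_XML)
print(info)
print(batch_starts(info["count"], retmax=3))
```

Each value from `batch_starts` would be passed as `retstart`, together with the query key and WebEnv, in one EFetch request.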
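12. NCBI - Entrez Programming Utilities
use LWP::Simple;   # provides get()

my $utils = "http://www.ncbi.nlm.nih.gov/entrez/eutils";

my $db     = ask_user("Database", "Pubmed");
my $query  = ask_user("Query",    "zanzibar");
my $report = ask_user("Report",   "abstract");

# ESearch: run the query and keep the results on the history server
my $esearch = "$utils/esearch.fcgi?" .
              "db=$db&retmax=1&usehistory=y&term=";
my $esearch_result = get($esearch . $query);
print "\nESEARCH RESULT: $esearch_result\n";

# Pull the hit count and the history identifiers out of the XML reply
$esearch_result =~
  m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s;
my $Count    = $1;
my $QueryKey = $2;
my $WebEnv   = $3;
print "Count = $Count; QueryKey = $QueryKey; WebEnv = $WebEnv\n";

# EFetch: retrieve the stored results $retmax records at a time
my $retstart;
my $retmax = 3;
for ($retstart = 0; $retstart < $Count; $retstart += $retmax) {
  my $efetch = "$utils/efetch.fcgi?" .
               "rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&" .
               "db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
  print "\nEF_QUERY=$efetch\n";
  my $efetch_result = get($efetch);
  print "---------\nEFETCH RESULT(" .
        ($retstart + 1) . ".." . ($retstart + $retmax) . "): " .
        "[$efetch_result]\n-----PRESS ENTER!!!-------\n";
  <STDIN>;   # wait for the user before fetching the next batch
}

# ask_user is called above but not shown on the slide. This minimal version
# is an assumption: prompt with a default and return the user's answer.
sub ask_user {
  my ($prompt, $default) = @_;
  print "$prompt [$default]: ";
  chomp(my $answer = <STDIN>);
  return $answer eq "" ? $default : $answer;
}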
13. Bioinformatics Advice
- Be aware of bias in databases!
- Search GenBank (nucleotide) for Human[Organism] apoptosis. How many hits?
- Now try Orcinus[Organism] apoptosis. How many hits?
- Can you conclude that Orcinus does not have apoptosis?
14. Bioinformatics Advice
- Bioinformatics is changing and advancing very rapidly.
- Don't forget to notice what is new.
- NCBI now has 20 different databases. They had only two 3-5 years ago.
- If you want to do something that you know can't be done, check again in two weeks!
- My standard computer can process the entire human genome for restriction sites, ORFs, etc. in a few hours. Not long ago, the best computers couldn't even hold that much data!
- If old tools work, don't feel you need to use the newest tools.
- I still do much of my analysis with Microsoft Word.
15. LIMS: Laboratory Information Management System
Goal: Manage in-house DNA sequences and associated data
Eval: UW-Madison Center for Eukaryotic Structural Genomics Sesame, http://www.sesame.wisc.edu/
Sesame is designed to organize and record data relevant to complex scientific projects, to launch computer-controlled processes, and to help decide about subsequent steps on the basis of information available. The Sesame system is based on the multi-tier paradigm, and it consists of a framework and application modules that carry out specific tasks. Users interact with Sesame through a series of web-based Java applet-applications designed to organize data. It allows collaborators on a given project to enter, process, view, and extract relevant data, regardless of location, so long as web access is available. Data reside in an Oracle relational database. Sesame serves as a digital laboratory notebook and allows users to attach numerous files and images.
16. Programming
- Tools for Promega customers
- Biomath (http://www.promega.com/biomath/)
- Basic calculations (most can be done easily by hand)
- Simple code (JavaScript)
- Established theory
- Universal (not Promega specific)
- siRNA Designer (http://www.promega.com/siRNADesigner/)
- Complex calculations
- More complex code (VBScript)
- Rapidly evolving theory
- Partially Promega specific
17. Programming
- Tools for Promega customers
- Plexor Primer Design (https://www.promega.com/techserv/tools/plexor)
- Complex calculations
- Complex code (C#.NET)
- Separate user interface and main calculations
- Multiple interacting modules
- Database integration
- Integration with GenBank (through a web service)
- Proprietary improvements on established theory
- Very Promega specific
18. Programming
- Tools for internal use
- BLAST analysis of Plexor Primers
- Primer specificity is important
- BLAST can determine specificity, but output is very complex.
- Simplify
- Combine all hits from the same gene
- Only show hits that could mis-prime
- Group hits by species
- Allow sorting by species
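The simplification steps listed above can be sketched as a filter followed by a two-level grouping. This is not the in-house tool: the `Hit` fields, the mis-priming rule (the alignment must reach the primer's 3' end with at most two mismatches), and the example hits are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import NamedTuple

class Hit(NamedTuple):
    gene: str
    species: str
    query_end: int    # last primer base covered by the alignment (1-based)
    mismatches: int

def could_misprime(hit, primer_length, max_mismatches=2):
    """Assumed rule: the hit covers the primer's 3' end with few mismatches."""
    return hit.query_end == primer_length and hit.mismatches <= max_mismatches

def summarize(hits, primer_length):
    """Return {species: {gene: hit_count}} for hits that could mis-prime."""
    summary = defaultdict(lambda: defaultdict(int))
    for hit in hits:
        if could_misprime(hit, primer_length):
            # All hits from the same gene collapse into one counter.
            summary[hit.species][hit.gene] += 1
    # Sort by species name and convert to plain dicts for display.
    return {species: dict(genes) for species, genes in sorted(summary.items())}

# Invented hits for a hypothetical 20-base primer.
hits = [
    Hit("GAPDH", "Homo sapiens", 20, 0),
    Hit("GAPDH", "Homo sapiens", 20, 1),
    Hit("GAPDH", "Mus musculus", 20, 2),
    Hit("ACTB", "Homo sapiens", 15, 0),   # does not reach the 3' end
    Hit("TUBB", "Mus musculus", 20, 4),   # too many mismatches
]
result = summarize(hits, primer_length=20)
print(result)
```

In practice the `Hit` records would be filled in from parsed BLAST output, and the mis-priming rule would be tuned to the primer chemistry in use.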
19. Programming
- Tools for internal use
- BLAST analysis of Plexor Primers
Initial BLAST results (1 page out of 30)
Analyzed BLAST results (complete!)
20. Programming
- Tools for internal use
- Vector/Insert Database
- Promega's Flexi vector system has a very structured cloning procedure.
- R&D has been making many different Flexi vector backbones with many inserts.
- Keeping track has been a problem.
- A database is in development
21. Programming
22. Programming
- Internal Projects
- Which restriction enzyme cuts least frequently in human ORFs?
- Method
- Download human RefSeq database (ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/)
- Load into local database
- Scan each sequence for each RE site
- The scan took 2-3 hours to complete
- http://www.promega.com/pnotes/89/12416_11/12416_11.pdf
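The scanning step of this method comes down to counting recognition sites in every sequence and ranking the enzymes by total. Below is a minimal Python sketch under stated assumptions: only three enzymes are included, the ORFs are toy strings rather than RefSeq records, and the function names are hypothetical. Because all three sites are palindromic, scanning one strand is enough.

```python
# Recognition sequences for three well-known enzymes (all palindromic).
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "NotI": "GCGGCCGC"}

def count_sites(sequence, site):
    """Count occurrences of site in sequence, including overlapping ones."""
    sequence = sequence.upper()
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count

def rank_enzymes(sequences, sites=SITES):
    """Return (enzyme, total_sites) pairs, least frequent cutter first."""
    totals = {
        name: sum(count_sites(seq, site) for seq in sequences)
        for name, site in sites.items()
    }
    return sorted(totals.items(), key=lambda item: item[1])

# Toy ORFs for illustration; a real run would iterate over the local database.
toy_orfs = ["ATGGAATTCGGATCCGAATTCTAA", "ATGGGATCCAAATAG"]
ranking = rank_enzymes(toy_orfs)
print(ranking)
```

Enzymes with non-palindromic or degenerate sites would need the reverse complement and pattern matching as well, which this sketch leaves out.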
23. Programming
- Internal Projects
- Which human genes in GenBank are the most popular?
- Method
- Download Gene database (ftp://ftp.ncbi.nlm.nih.gov/gene/)
- Download Gene Ontology information (http://www.geneontology.org/)
- Use web services to get pathway information from KEGG (http://www.genome.jp/kegg/)
- Use web services to get citation information from PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)
- Load all into local database
- Rank genes by desired criteria
- Size
- Function
- Localization
- Pathways
- Publications
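The last two steps of this method, loading everything into a local database and ranking by a chosen criterion, can be sketched with an in-memory SQLite database. The table layout, column names, and gene rows are invented for illustration; the slides do not say which database engine or schema was actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE gene (
           symbol            TEXT PRIMARY KEY,
           length_bp         INTEGER,
           pathway_count     INTEGER,
           publication_count INTEGER
       )"""
)

# Placeholder rows; real values would come from the downloaded data sets.
rows = [
    ("GENE_A", 1200, 3, 450),
    ("GENE_B", 800, 1, 12),
    ("GENE_C", 3000, 5, 980),
]
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", rows)

# Column names cannot be bound as SQL parameters, so check them explicitly.
RANKABLE = {"length_bp", "pathway_count", "publication_count"}

def rank_genes(connection, criterion, limit=10):
    """Return (symbol, value) pairs ordered by criterion, highest first."""
    if criterion not in RANKABLE:
        raise ValueError(f"unknown ranking criterion: {criterion}")
    query = (
        f"SELECT symbol, {criterion} FROM gene "
        f"ORDER BY {criterion} DESC LIMIT ?"
    )
    return connection.execute(query, (limit,)).fetchall()

print(rank_genes(conn, "publication_count"))
print(rank_genes(conn, "length_bp", limit=2))
```

Keeping each criterion as its own column makes it cheap to re-rank the same gene list by size, pathways, or publications without reloading any data.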
24. Database searches and data mining
Question: Can you reformat this sequence for me?
Tool: ReadSeq (http://bimas.dcrt.nih.gov/molbio/readseq), Macros
Question: How many viral proteins start with MetHis?
Tool: Hits database motif searches (http://hits.isb-sib.ch/)
Question: How many different bacterial two-domain proteins are known?
Tool: SCOP database (http://scop.berkeley.edu/)
Question: How do I design PCR primers selective for bacterial species X?
Tool: Ribosomal database 16S rRNA alignment (http://rdp.cme.msu.edu)