Title: Lecture 3'0 Biopackages I sequence manipulation with BioJava
1Lecture 3.0 Biopackages I - sequence
manipulation with BioJava
- Sohrab Shah
- UBC Bioinformatics Centre
- sohrab_at_bioinformatics.ubc.ca
- http//bioinformatics.ubc.ca/people/sohrab
2My Work
- Pegasys workflow management for bioinformatics
- Uses Biojava heavily
- Comparative vertebrate genomics
- Annotation of Cryptococcus
http//bioinformatics.ubc.ca
3Outline
- When to use biopackages
- A brief tour of biojava
- Sequence IO
- Sequence manip
- Distributions
- Features
- Proteomics
- Learn by examples
4http//www.jb.man.ac.uk/drl/
5Using APIs/Libraries
- Libraries are devoted to avoid re-inventing the
wheel in software development - Include API contracts that specify the rules of
engagement for programmers - What are the inputs to a method?
- What is the output of a method?
- Designed to speed up development time for
application programmers - Standard APIs promote consistent code and a
commonality among diverse/unrelated applications
6What to look for in a software library
- Does the software have the functionality you
need? - Is the software extensible?
- Is the software the accepted standard for the
domain? - Does the package have a user community?
- Is the software well-supported by the developers?
- Documentation
- Mailing lists
- Bug fixing process
- Can I use and distribute this legally?
7Pitfalls
- For open source
- Public domain ! correctness
- Public domain ! robustness
- Spend some time evaluating the code
- BioPerl anecdote
- Does the documentation (publication or other)
accurately reflect the algorithm? - Is there documentation??
- NCBI C Toolkit anecdote
8Rolling your own
- Sometimes this is the best way
- Understanding code at many or all levels is
critical - Understanding an algorithm
- Implement it yourself?
- Debugging/modifying a library may take more time
than writing it yourself
9BIOJAVA
10BioJava
- http//biojava.org
- Dedicated to providing Java tools for processing
biological data - Released under the Open Source LGPL
- Less restrictive than GPL
- Last stable release v1.3 June 15th 2003
- Designed for programmers with experience in Java
- Is a work in progress
- Robust architecture that models biology
11biojava.org
12API docs
13biojava.org
14BioJava in Anger
- Setup
- Where do I get a Java installation?
- How do I get and install BioJava?
- Alphabets and Symbols
- How do I get a DNA, RNA or Protein Alphabet?
- How do I make a custom Alphabet from custom
Symbols? - How do I make a CrossProductAlphabet such as a
codon Alphabet? - How do I break Symbols from CrossProduct
Alphabets into their component Symbols? - How can I tell if two Alphabets or Symbols are
equal? - How can I make an ambiguous Symbol like Y or R?
- Basic Sequence Manipulation
- How do I make a Sequence from a String or make a
Sequence Object back into a String? - How do I get a subsection of a Sequence?
- How do I transcribe a DNA Sequence to a RNA
Sequence? - How do I reverse complement a DNA or RNA
Sequence? - Sequences are immutable so how can I change it's
name? - How can I edit a Sequence or SymbolList?
15BioJava in Anger
- Translation
- How do I translate a DNA or RNA Sequence or
SymbolList to Protein? - How do I translate a single codon to a single
amino acid? - How do I use a non standard translation table?
- How do I translate a nucleotide sequence in all
six frames - Proteomics
- How do I calculate the mass and pI of a peptide?
- Sequence I/O
- How do I write Sequences in Fasta format?
- How do I read in a Fasta file?
- How do I read a GenBank/EMBL/SwissProt file?
- How do I extract GenBank/EMBL/Swissprot sequences
and write them as Fasta? - How do I turn an ABI sequence trace into a
BioJava Sequence?
16BioJava Functionality
- Manipulating sequences
- Sequence File I/O
- Probability distributions over sets of sequences
- Parsing sequence similarity search results
- Taxonomy data structure
- Bibliographic data structures
- Annotation file formats
- GFF, GAME XML, DAS
- Dynamic programming
- Support Vector machine classification and
regression
17BioJava
- Rag-bag collection of useful stuff!
- Focus
- Sequence IO
- Sequence manip
- Probability Distributions
- Features
- Proteomics
18BioJava Design Ethos
- Models biopolymers as a list of symbols, and not
as strings - Strings have 3 disadvantages
- Validation
- It is possible to pass any string to a routine
which is expecting a biological sequence. Any
validation has to be performed on an ad hoc
basis. - Ambiguity
- The meaning of each symbol is not necessarily
clear. The T' which means thymidine in DNA is
the same T' which is a threonine residue in a
protein sequence - Limited alphabet
- While there are obvious encodings for nucleic
acid and sequence data as strings, the same
approach does not always work well for other
kinds of data generated in biological sequence
analysis software - Closer to the biology of the sequence can
attach biochemical annotation to the symbol
objects cant do this with strings
Thomas Down biojava.org
19Sequence manipulation
- The Sequence Object
- abstract data type for biological sequences
- residues of the sequence are stored in a
SymbolList
20Introduction what is BioJava?
- OpenSource initiative to create Java libraries
and tools for processing biological data - Contains objects for manipulating sequences, file
parsers, Dynamic Programming routines for HMM
analysis, basic statistics, sequence database
interoperability - Work in progress now at version 1.21
- www.biojava.org
- Generally designed for programmers with Java
experience - Many ideas from this lecture are taken from
- http//www.biojava.org/
21BioJava Overview
- Sequences and features
- IO
- Processing, storing, manipulating
- Visualising
- Dynamic programming
- Single-sequence and pair-wise HMMs
- Viterbi-path, Forward and Backward algorithms
- Training models
- Sampling sequences from models
- External file formats and programs
- GFF
- Blast
- Meme
- Sequence Databases
- BioCorba interoperability
- ACeDB client
- BioSQL
22Symbol Interface
- BioJava takes a rather different approach to
modeling sequence data. Instead of using a string
of ASCII characters, a sequence is modelled as a
list of Java objects implementing the Symbol
interface. This class, and the others described
here, are part of the Java package
org.biojava.bio.symbol.
23Symbol Interface
- public interface Symbol
- public String getName()
- public Annotation getAnnotation()
- public Alphabet getMatches()
-
- All Symbol instances have
- a name property (ie, Thymine).
- An optional Annotation property (ie, information
about the chemical properties of a DNA base) - getMatches, is only important for ambiguous
symbols
24Alphabet interface
- A set of symbol objects that can be found in a
particular type of sequence is defined by the
Alphabet Interface - Allows you to define your own alphabet
- A set of standard alphabets is available from the
AlphabetManager
25SequenceAlphabet
import org.biojava.bio.symbol. import
org.biojava.bio.seq. import java.util. public
class SequenceAlphabet public static void
main(String args) throws Exception // print
the names of the alphabets available by
default // get an iterator on the
Alphabets Iterator iam AlphabetManager.alphabe
ts() while (iam.hasNext()) Alphabet a
((Alphabet) iam.next()) System.out.println("Al
phabet " a.getName()) // get all the
DNA symbols FiniteAlphabet dna
DNATools.getDNA() Iterator dnaI
dna.iterator() while (dnaI.hasNext())
Symbol dnaSymbol (Symbol)
dnaI.next() System.out.println(dnaSymbol.getNa
me()) // get all the Protein
symbols FiniteAlphabet prot
ProteinTools.getAlphabet() Iterator protI
prot.iterator() while (protI.hasNext())
Symbol protSymbol (Symbol)
protI.next() System.out.println(protSymbol.get
Name())
26SequenceAlphabet Output
Alphabet Alphabet of all integers. Alphabet
RNA Alphabet DNA Alphabet STRUCTURE Alphabet
PROTEIN Alphabet NUCLEOTIDE Alphabet
PROTEIN-TERM Alphabet Alphabet of all
doubles. DNA symbols cytosine thymine adenine gu
anine Protein symbols LYS ALA CYS TYR TRP ASP HI
S ILE ARG VAL GLN GLY PHE MET THR PRO LEU SEC GLU
ASN SER
27Modeling Sequences
- Some rules
- BioJava models sequences with a SymbolList
interface - Every SymbolList has an associated Alphabet, and
may only contain Symbols from that alphabet. - SymbolLists can be seen as strings which are made
up of Symbol objects rather than characters. The
interface specifies methods for querying the
alphabet and length, and accessing the Symbols - Effectively modeling biopolymers
28SymbolList Interface
- getAlphabet() The alphabet that this SymbolList
is over. - iterator() An Iterator over all Symbols in this
SymbolList. - length() The number of symbols in this
SymbolList. - seqString() Return the symbol list as a String
Object. - subList(int start, int end) Return a new
SymbolList for the symbols start to end
inclusive. - subStr(int start, int end) Return a region of
this symbol list as a String. - symbolAt(int index) Return the symbol at index,
counting from 1. - toList() Returns a List of symbols.
29Sequence Interface
- Extends SymbolList
- getName( )
- Extracted from some file format
- getURN( )
- Could be a URL or a A Uniform Resource Identifier
(URI) which identifies the sequence represented
by this object.
30Loading Sequences Local Files
- SeqIOTools
- Read formats
- Genbank, EMBL, FASTA, GFF, SwissProt, etc
- Write formats
- FASTA,
- SequenceBuilderFactories
31SeqIOTools Interface
- readEmbl (java.io.BufferedReader br)
- Iterate over the sequences in an EMBL-format
stream. - readFasta (java.io.InputStream seqFile,
Alphabet alpha) - Create a sequence database from a fasta file
provided as an input stream. - readFastaDNA (java.io.BufferedReader br)
- Iterate over the sequences in an FASTA-format
stream of DNA sequences. - readFastaProtein (java.io.BufferedReader br)
- Iterate over the sequences in an FASTA-format
stream of Protein sequences. - readGenbank (java.io.BufferedReader br)
- Iterate over the sequences in an GenBank-format
stream. - readSwissprot (java.io.BufferedReader br)
- Iterate over the sequences in an Swissprot-format
stream. - writeFasta (java.io.OutputStream os,
SequenceDB db) - Write a sequenceDB to an output stream in fasta
format - writeFasta (java.io.OutputStream os,
SequenceIterator in) - Writes sequences from a SequenceIterator to an
OutputStream in Fasta Format.
32Example Genbank2Fasta
- Motivation
- Read in a sequence record in one format and
export it in another - Tools
- SeqIOTools static methods
- readGenbank
- writeFasta
33Example SeqIOTools Genbank2Fasta
import java.io. import org.biojava.bio.seq. im
port org.biojava.bio.seq.io. public class
Genbank2Fasta public static void
main(String args) throws Exception String
genbank args0 String fasta args1
FileOutputStream faout new
FileOutputStream(fasta) BufferedReader
gbreader new BufferedReader(new
FileReader(genbank)) SequenceIterator
si SeqIOTools.readGenbank(gbreader)
SeqIOTools.writeFasta(faout, si)
34Check-in
- Collections ?
- Iterators ?
35Example ResidueCount
- Motivation
- Print out the counts/percentages of each symbol
in an alphabet for every sequence in a given
FASTA file - Tools
- SeqIOTools
- SequenceDB
- Sequence
- Symbol
36ResidueCount
HashMap hash new HashMap() FiniteAlphabet a
null //get the alphabet a (FiniteAlphabet)
AlphabetManager.alphabetForName(alphname) //open
the file and read the sequences FileInputStream
fis new FileInputStream(filename) SequenceDB
seqdb SeqIOTools.readFasta(fis,
a) SequenceIterator stream seqdb.sequenceIterat
or() while (stream.hasNext())
//initialise the hash - all counts to 0
Iterator ai a.iterator() while
(ai.hasNext()) Symbol s (Symbol)
ai.next() hash.put(s.getName(), new
Integer(0)) //get the next sequence
Sequence seq stream.nextSequence() for
(int i 1 i lt seq.length() i)
//read the sequence and count the residues
Symbol sym seq.symbolAt(i) // check
if symbol in alphabet if
(a.contains(sym)) int count
((Integer) hash.get(sym.getName())).intValue()
hash.put(sym.getName(), new
Integer(count)) else
// print warning //output the
results for this sequence System.out.println(s
eq.getName()) Collection keys
hash.keySet() Iterator keyI
keys.iterator() while (keyI.hasNext())
String c (String) keyI.next()
Integer count (Integer)hash.get(c)
float percent count.floatValue() /
(float) seq.length() System.out.println(c
" " count " " percent)
37ResidueCount
HashMap hash new HashMap() FiniteAlphabet a
null //get the alphabet a (FiniteAlphabet)
AlphabetManager.alphabetForName(alphname) //open
the file and read the sequences FileInputStream
fis new FileInputStream(filename) SequenceDB
seqdb SeqIOTools.readFasta(fis,
a) SequenceIterator stream seqdb.sequenceIterat
or() while (stream.hasNext())
//initialise the hash - all counts to 0
Iterator ai a.iterator() while
(ai.hasNext()) Symbol s (Symbol)
ai.next() hash.put(s.getName(), new
Integer(0)) //get the next sequence
Sequence seq stream.nextSequence() for
(int i 1 i lt seq.length() i)
//read the sequence and count the residues
Symbol sym seq.symbolAt(i) // check
if symbol in alphabet if
(a.contains(sym)) int count
((Integer) hash.get(sym.getName())).intValue()
hash.put(sym.getName(), new
Integer(count)) else
// print warning //output the
results for this sequence System.out.println(s
eq.getName()) Collection keys
hash.keySet() Iterator keyI
keys.iterator() while (keyI.hasNext())
String c (String) keyI.next()
Integer count (Integer)hash.get(c)
float percent count.floatValue() /
(float) seq.length() System.out.println(c
" " count " " percent)
38Input of ResidueCount
- Accession AF000580.1
- Plasmid from Dictyostelium discoideum
- Known to have approx 80 AT
- Ameboid protozoan
39Output of ResidueCount
40Probability Distributions
- Package org.biojava.bio.dist
- Distribution interface
- Alphabet getAlphabet()
- The alphabet from which this spectrum emits
symbols. - Distribution getNullModel()
- Retrieve the null model Distribution that this
Distribution recognizes. - Double getWeight(Symbol s)
- Return the probability that Symbol s is emitted
by this spectrum. - void registerWithTrainer(DistributionTrainerContex
t dtc) - Register this distribution with a training
context. - Symbol sampleSymbol()
- Sample a symbol from this state's probability
distribution. - void setNullModel(Distribution nullDist)
- Set the null model Distribution that this
Distribution recognizes. - void setWeight(Symbol s, double w)
- Set the probability or odds that Symbol s is
emitted by this state.
41Example GenerateRandomSequences
- Motivation
- Often we need a probability distribution of the
residue composition of a set of sequences - We may want to emit random sequences from this
distribution to represent the background - Test some algorithm over the real sequences and
the random sequences and compare results - Tools
- Distribution
42GenerateRandomSequences
// set up the distribution DistributionTrainerCont
ext dtc new SimpleDistributionTrainerContext()
Distribution dist DistributionFactory.DEFAUL
T.createDistribution( AlphabetManager.alphabetForN
ame(alph)) dtc.registerDistribution(dist) //ope
n the file and read the sequences FileInputStream
fis new FileInputStream(inputFile) SequenceDB
seqdb SeqIOTools.readFasta(fis,
AlphabetManager.alphabetForName(alph)) SequenceIt
erator stream seqdb.sequenceIterator() while
(stream.hasNext()) Sequence seq
(Sequence) stream.nextSequence() for (int j
1 j lt seq.length() j)
dtc.addCount(dist, seq.symbolAt(j), 1.0)
// create the Distribution dtc.train() FileO
utputStream fos new FileOutputStream(outputFile)
for (int i 0 i lt numOutputSeqs i)
String name "Random Sequence " i
Sequence seq DistributionTools.generateSequence(
name,dist,lengthOutputSeqs)
SeqIOTools.writeFasta(fos, seq)
43Biological Simulation
- From DNA-gtRNA-gtProtein
- BioJava has rigourously defined methods to do
this
44Sequence Tools
- Package org.biojava.bio.seq
- DNATools
- RNATools
- ProteinTools
45DNATools
- a() returns static AtomicSymbol for a
- c() returns static AtomicSymbol for c
- complement(Symbol sym) Complement the symbol.
- complement(SymbolList list) Retrieve a
complement view of list. - complementTable() Get a translation table for
complementing DNA symbols. - createDNA(java.lang.String dna) Return a new DNA
SymbolList for dna. - dnaToken(Symbol sym) Get a single-character
token for a DNA symbol - forIndex(int index) Return the symbol for an
index - compatible with index. - forSymbol(char token) Retrieve the symbol for a
char. - g() returns static AtomicSymbol for g
- getDNA() Return the DNA alphabet.
- index(Symbol sym) Return an integer index for a
symbol - compatible with forIndex. - n() returns static AtomicSymbol for n
- reverseComplement(SymbolList list) Retrieve a
reverse-complement view of list. - t() returns static AtomicSymbol for t
46RNATools
- complementTable()
- Get a translation table for complementing DNA
symbols - createRNA(java.lang.String rna)
- Return a new RNA SymbolList for rna.
- getGeneticCode(java.lang.String name)
Retrieve a TranslationTable by name. - getGeneticCodeNames() Retrieve a Set
containing the name of each genetic code. - getRNA() Return the RNA alphabet.
- reverseComplement(SymbolList list)
Retrieve a reverse-complement view of
list. - transcribe(SymbolList list) Transcribe
DNA into RNA. - transcriptionTable() Get a translation
table for converting DNA to RNA. - translate(SymbolList syms) Translate
RNA into protein (with termination symbols).
47ProteinTools
- createProtein(java.lang.String theProtein)
Return a new Protein SymbolList for
protein. - getAlphabet() Gets the protein
alphabet - getSymbolPropertyTable(java.lang.String name)
- getTAlphabet() Gets the protein
alphabet including the translation termination
symbols
48Biological example Selenoproteins
- Selenoproteins carry out the function of selenium
in our cells - Selenium levels implicated in neurological
disease - Alzheimers
- Parkinsons
- Most selenoprotein mRNAs contain a single UGA
codon encoding a single selenocysteine residue
per polypeptide chain, and a single specific RNA
secondary structure, termed a selenocysteine
insertion sequence (SECIS) element, directing
incorporation of this amino acid.
Castellano S, Morozova N, Morey M, Berry MJ,
Serras F, Corominas M, Guigo R. In silico
identification of novel selenoproteins in the
Drosophila melanogaster genome. EMBO Rep. 2001
Aug2(8)697-702.
49Example Translate
- Motivation
- From a cDNA output the protein sequence
- Tools
- DNATools
- RNATools
- ProteinTools
- Question what do we see when we translate a
selenoprotein mRNA?
50Translating an mRNA
Summary This gene encodes a selenoprotein
containing selenocysteine at the active site. The
selenocysteine is encoded by the ususal nonsense
codon UGA. This gene is expressed in a variety of
tissues, and the protein is localized to the
perinuclear structures.
51NM_080430
DEFINITION Homo sapiens selenoprotein M (SELM),
mRNA.
CDS 61..498 /gene"SELM" /note"selenoprotein
SelM" /selenocysteine /codon_start1
/transl_except(pos202..204,aaSec)
/product"selenoprotein M precursor"
/protein_id"NP_536355.1"
52Example Translate
import java.io. import org.biojava.bio.symbol.
import org.biojava.bio.seq. import
org.biojava.bio.seq.io. import
org.biojava.bio.seq.db. public class Translate
public static void main(String args) throws
Exception String filename args0 //
open the file FileInputStreamfis new
FileInputStream(filename) SequenceDB seqdb
SeqIOTools.readFasta(fis,DNATools.getDNA())
SequenceIterator si seqdb.sequenceIterator()
while (si.hasNext()) // get the dna
sequence Sequence dna
si.nextSequence() // transcribe the
dna SymbolList rna RNATools.transcribe(d
na) System.out.println(dna.getName()
"") SymbolList protein
RNATools.translate(rna)
System.out.println(protein.seqString())
53Output of Translate with NM_080430
- gi4637009261-498
- MSLLLPPLALLLLLAALVAPATAATAYRPDWNRLSGLTRARVETCGGQ
LNRLKEVKAFVTQDIPFYHNLVMKHLPGADPELVLLGRRYEELERIPLSE
MTREEINALVQELGFYRKAAPDAQVPPEYVWAPAKPPEETSDHADL
54Sequences and Features
- In most data models sequences have information
attached to them global information
(identifiers, publications) and biological
features corresponding to locations on the
sequences - Global annotations key-value pairs
Sequence seq getSequence() Annotation seqAn
seq.getAnnotation() for (Iterator i
seqAn.keys().iterator() i.hasNext() )
Object key i.next() Object value
seqAn.getProperty(key) System.out.println(key.t
oString() " " value.toString())
//source biojava.org Thomas Down
55Feature Class
- features() Iterate over any child
features which are held by this feature. - getLocation() The location of this
feature. - getParent() Return the FeatureHolder
to which this feature has been attached. - getSequence() Return the Sequence
object to which this feature is (ultimately)
attached. - getSource() The source of the feature
- getSymbols() Return a list of symbols
that are contained in this feature. - getType() The type of the feature.
- makeTemplate() Create a new Template
that could be used to generate a feature
identical to this one.
56Using the Feature interface FeatureExtract
//import statements public class FeatureExtract
public static void main(String args)
throws Exception if (args.length ! 2)
throw new Exception("usage java FeatureExtract
GENBANKFILE FEATURETYPE") String filename
args0 String feature args1 BufferedRead
er gbreader new BufferedReader(new FileReader
(filename)) SequenceIterator si
SeqIOTools.readGenbank(gbreader) while
(si.hasNext()) Sequence seq
si.nextSequence() Iterator feati
seq.features() while (feati.hasNext())
Feature feat (Feature) feati.next() if
(feat.getType().equals(feature))
System.out.println(feat.getType() " "
feat.getLocation().toString())
57Input to FeatureExtract
- AF000580.gbff gene
- AF000580.gbff repeat_region
58Output of FeatureExtract
59FeatureExtract Output
Genbank FlatFile of 80kb of human sequence
containing 2 genes
LOCUS HSDJ155G6 80600 bp DNA
PRI 13-APR-2000 DEFINITION Human DNA
sequence from clone RP1-155G6 on chromosome 20
Contains part of the gene for
brefeldin A-inhibited guanine
nucleotide-exchange protein 2, part of the gene
for CSE1L (chromosome segregation 1
(yeast homolog)-like), ESTs, STSs, GSSs
and a CpG Island, complete sequence. ACCESSION
AL121903 VERSION AL121903.13
GI7330682 KEYWORDS HTG brefeldin CpG
Island CSE1L. SOURCE human. ORGANISM
Homo sapiens Eukaryota Metazoa
Chordata Craniata Vertebrata Euteleostomi
Mammalia Eutheria Primates Catarrhini
Hominidae Homo. REFERENCE 1 (bases 1 to
80600)
gt java FeatureExtract AL121903.genbank gene
60FeatureExtract Output
gt java FeatureExtract AL121903.genbank mRNA
61FeatureExtract Output
Cgtjava FeatureExtract AL121903.genbank
repeat_region
62Proteomics Tools
- Package org.biojava.bio.proteomics
- Digest
- This class contains methods for calculating the
results of proteolytic digestion of a protein
sequence this class is not designed to be thread
safe - IsoelectricPointCalc
- class that computes isoelectric point for
proteins - MassCalc
- MassCalc calculates the mass of peptides which
for our purposes are SymbolLists which contain
Symbolsfrom the protein Alphabet. - Protease
- The protease class stores parameters needed by
Digest to digest a protein sequence. - ProteaseManager
- Registry and utility methods for Proteases.
- StructureTools
63Example TrypticDigest
- Motivation
- Widely used method in proteomics to identify
which proteins are present in a cell state
using mass spectrometry - Tools
- Digest
- MassCalc
- Protease
64TrypticDigest
public static void trypticDigest(Sequence seq)
throws BioException, ChangeVetoException
Digest digest new Digest()
digest.setMaxMissedCleavages(MAXMISSCLEAVAGE)
digest.setProtease(ProteaseManager.getProteaseByN
ame(Protease.TRYPSIN)) digest.setSequence(seq
) digest.addDigestFeatures() Iterator
iterator digest.getSequence().features()
while (iterator.hasNext()) // iterate
through the features SimpleFeature f
(SimpleFeature) iterator.next()
SymbolList symList f.getSymbols() //
calculate the mass (in Daltons) MassCalc
mc new MassCalc(SymbolPropertyTable.AVG_MASS,
true) double mass mc.getMass(symList)
// output System.out.println(f.
getLocation().toString() "\t" mass
"\t" symList.seqString())
65TrypticDigest
- Input
- gtgi16761190refNP_456807.1 ecotin precursor
Salmonella enterica subsp. - MKMFVPAVVFAALASASAWANNGDTAQPLEKIAPYPQAEKGMKRQVITLT
PQQDESTLKVELLIGQTLNV - DCNQHRLGGTLETKTLEGWGYDYYVFDNVTSPVSTMMACPEGKKEQKFVT
AWLGEDGMLRYNSKLPIVVY - TPANVDVKYRIWKADANVQNAIAR
66TrypticDigest
67Annotations
- Tell us something about the whole record
- Structured as keyvalue pairs
- Examples
- Author list
- PubMed ID
- Organism
68Annotation Interface
- asMap()
- Return a map that contains the same key/values as
this Annotation. - containsProperty(java.lang.Object key)
- Returns whether there is any property under that
key in this Annotation. - getProperty(java.lang.Object key)
- Retrieve the value of a property by key.
- keys()
- Get a set of key objects.
- setProperty(java.lang.Object key,
java.lang.Object value) - Set the value of a property.
69Example AnnotationExtract
public class AnnotationExtract public static
void main(String args) throws Exception if
(args.length ! 1) throw new
Exception("usage java AnnotationExtract
GENBANKFILE") String filename
args0 BufferedReader gbreader new
BufferedReader(new FileReader (filename)) Seque
nceIterator si SeqIOTools.readGenbank(gbreader)
while (si.hasNext()) Sequence seq
si.nextSequence() Annotation annot
seq.getAnnotation() java.util.Set keys
annot.keys() java.util.Iterator i
keys.iterator() while (i.hasNext())
Object key i.next() if
(annot.containsProperty(key))
System.out.println(key " "
annot.getProperty(key))
70AnnotationExtract Output
AnnotationExtract AF000580.gbff
TYPE DNA ACCESSION AF000580 genbank_accessions
AF000580 KEYWORDS . ORGANISM Dictyostelium
discoideum Eukaryota Mycetozoa Dictyosteliida
Dictyostelium. LOCUS AF000580 JOURNAL Genetics
148 (3), 1117-1125 (1998), Submitted
(21-APR-1997) Biology, Utah State University,
Logan, UT 84322-5305, USA DIVISION
INV REFERENCE 1 (bases 1 to 14955), 2 (bases
1 to 14955) MDAT 21-APR-1998 DEFINITION
Dictyostelium discoideum plasmid Ddp5, complete
genome. VERSION AF000580.1 TITLE Dictyostelium
discoideum nuclear plasmid Ddp5 is a chimera
related to the Ddp1 and Ddp2 plasmid families,
Direct Submission CIRCULAR circular SIZE
14955 ORIGIN SOURCE Dictyostelium
discoideum MEDLINE 98198836 GI 3068582 AUTHORS
Rieben,W.K. Jr., Gonzales,C.M., Gonzales,S.T.,
Pilkington,K.J., Kiyosawa,H., Hughes,J.E. and
Welker,D.L., Rieben,W.K., Gonzales,C.,
Gonzales,S.T., Pilkington,K., Kiyosawa,H.,
Hughes,J.E. and Welker,D.L. PUBMED 9539429
71Sequence Tools
- DNATools
- RNATools
- ProteinTools
72DNATools
- a() returns static AtomicSymbol for a
- c() returns static AtomicSymbol for c
- complement(Symbol sym) Complement the symbol.
- complement(SymbolList list) Retrieve a
complement view of list. - complementTable() Get a translation table for
complementing DNA symbols. - createDNA(java.lang.String dna) Return a new DNA
SymbolList for dna. - dnaToken(Symbol sym) Get a single-character
token for a DNA symbol - forIndex(int index) Return the symbol for an
index - compatible with index. - forSymbol(char token) Retrieve the symbol for a
char. - g() returns static AtomicSymbol for g
- getDNA() Return the DNA alphabet.
- index(Symbol sym) Return an integer index for a
symbol - compatible with forIndex. - n() returns static AtomicSymbol for n
- reverseComplement(SymbolList list) Retrieve a
reverse-complement view of list. - t() returns static AtomicSymbol for t
73RNATools
- complementTable()
- Get a translation table for complementing DNA
symbols - createRNA(java.lang.String rna)
- Return a new RNA SymbolList for rna.
- getGeneticCode(java.lang.String name)
Retrieve a TranslationTable by name. - getGeneticCodeNames() Retrieve a Set
containing the name of each genetic code. - getRNA() Return the RNA alphabet.
- reverseComplement(SymbolList list)
Retrieve a reverse-complement view of
list. - transcribe(SymbolList list) Transcribe
DNA into RNA. - transcriptionTable() Get a translation
table for converting DNA to RNA. - translate(SymbolList syms) Translate
RNA into protein (with termination symbols).
74ProteinTools
- createProtein(java.lang.String theProtein)
Return a new Protein SymbolList for
protein. - getAlphabet() Gets the protein
alphabet - getSymbolPropertyTable(java.lang.String name)
- getTAlphabet() Gets the protein
alphabet including the translation termination
symbols
75sequence tools Translate
public class Translate public static void
main(String args) throws Exception if
(args.length ! 1) throw new
Exception("usage java Translate Genbankfile")
String filename args0 // open the
file FileInputStream fis new
FileInputStream(filename) SequenceDB seqdb
SeqIOTools.readFasta(fis, DNATools.getDNA()) Se
quenceIterator si seqdb.sequenceIterator() wh
ile (si.hasNext()) // get the dna
sequence Sequence dna si.nextSequence()
// reverse complement the dna
sequence SymbolList dnaList
DNATools.reverseComplement(dna) // transcribe
the dna SymbolList rna RNATools.transcribe(dn
aList) System.out.println(dna.getName()
"") for (int i 1 i lt 3 i) //
translate the rna in 3 reading frames int
length (rna.length())-((rna.length()-i1)3)
SymbolList protein RNATools.translate(rna.sub
List(i, length)) System.out.println("Frame "
i " " protein.seqString())
76Translate Input
- FASTA File of ESTs from Fugu rubripes
gtFEEFRa010apsH5 AGACAACACACGGGGAACCAGGAACGTCAGGCC
CCTTACACTGGGATCCAGAGCCACAGC GCGACAGAAAGTTAAAGGATAA
ACTCTAACGGAACACAATGGCAGACAAAATTAAAGATG CCAAGATCATC
TTTGTCGTGGGTGGGCCTGGTTCTGGAAAGGGCACTCAGTGTGAGAAAA
TAGTTGCAAAGTATGGCTACACCCATCTGTCATCCGGGGATCTGCTCCGT
GCTGAAGTGG CCTCTGGCTCCGAGAGGGGCAAGCAGCTCCAGGCCATCA
TGCAGAAGGGAGAGCTTGTTC CCCTGGACACCGTCTTAGACATGATTAA
GGATGCCATGATCGCCAAGGCCGACGTGTCCA AGGGCTTCCTCATTGAC
GGCTACCCCCGTGAGGTGGAGCAGGGCGAGGAGTTTGAGAAGA AGATCG
GCAGACCCTGCCTGGTGCTGGACGTTGACGCCAGGGGGGAGACCATGGGC
AAGA gtFEEFRa010apsC2 CGAGGTGCGGACTGTGCTGTCGCTCGTC
AGGGCCCAGGACGCCTACGCTCGCCTGCCCGA GAGCTACAGGAACGGCG
TGAATCTGACTCTGGAGCAGCTGAACTCTCACACTGGGGTCCA TCACCA
TTTCCGCTTCTTGAAAAGTCTGGAAAAGTCAGAAATTGAGTCTGGTTTTG
GTGT GAGTTACCTCTACCACCATTTTTACCTGAAGCCAACGTGGTGCGC
CAAAGGAACAGAAGA GTCGAACCCTGAGGCGTGTGCCTTCAGGAACGAC
AGGCCGCTGATGGACTGCGCGATCTG CTACAAAACAGCGAACAATGTGA
TGGAGGCCAACCCAAAGCCATATGTGCACTGCATCCA AAAGCCAAGGAC
TCACACCGGGCATGAGGAGCAGTAGGACAGAACATTGCAGAAAAATGA G
CTACAACAGCGGAGCTCCAACACTTTTGGCTGTGAAAGTCGGCTGAGGCA
ACACGTGGA GCAACT
77Translate Output
78Loading Sequences
- Three main ways to do this
- Local Files
- Local Databases
- Remote Databases
79Loading Sequences Remote URL
- How do we retrieve a FASTA file remotely from
Genbank? - Code from Ratiba Terbaoui
//String idSeq SOMEACCESSIONNUMBER URL url
new URL("http//www.ncbi.nlm.nih.gov/htbin-post/En
trez/query?dbnform6doptfhtmlnotitlenouid
" idSeq) BufferedReader in new
BufferedReader(new InputStreamReader(url.openStre
am())) SequenceIterator stream
SeqIOTools.readFastaDNA(in) String
InputLine while ((InputLine in.readLine()) !
null) Sequence seq1 getSequence(stream) Sys
tem.out.println(seq1.seqString())
80Adding Annotation to a Sequence
- Our main goal in bioinformatics
- GFF format
- Widely used for sequence analysis and annotation
NAME lttabgt SOURCE lttabgt FEATURE lttabgt START lttabgt
END lttabgt SCORE lttabgt STRAND lttabgt PHASE lttabgt
ANYTHING COMMENT
81GFF Parsing
- BioJava has a gff package containing libraries
for parsing and manipulating GFF files - org.biojava.bio.program.gff.GFFParser
82Example Fasta2Features
- Takes in a FASTA file, a GFF file and a source
string - Filters the gff file by source
- Annotates the FASTA file
- Prints out the features and subsequences
associated with those features
83Fasta2Features
public class Fasta2Features public static
void main(String args) throws Exception
if(args.length ! 3) throw new
Exception("Use Fasta2Features fastafile GFFfile
source") String fastafile args0
String gfffile args1 String source
args2 //load the sequence
FileInputStream fis new FileInputStream(fastafil
e) SequenceDB seqdb SeqIOTools.readFasta(fi
s, DNATools.getDNA()) // load in the GFF
System.out.println("Loading gff with inputted
source") GFFEntrySet gffEntries new
GFFEntrySet() GFFRecordFilter.SourceFilter
sFilter new GFFRecordFilter.SourceFilter()
sFilter.setSource(source) GFFFilterer
filterer new GFFFilterer(gffEntries.getAddHandle
r(), sFilter) GFFParser parser new
GFFParser() parser.parse(new
BufferedReader( new InputStreamReader( new
FileInputStream(new File(gfffile)))), filterer)
// add the features to the sequences
System.out.println("Adding features from gff to
sequences") SequenceDB aSeqDB new
AnnotatedSequenceDB(seqdb, gffEntries.getAnnotator
()) // use member method to do the
printing PrintFeatures(aSeqDB)
84Fasta2Features methods
public static void PrintFeatures(SequenceDB
aSeqDB) // print the features we just added
to our sequence SequenceIterator si
aSeqDB.sequenceIterator() while (si.hasNext())
Sequence seq si.nextSequence() Iterator
feati seq.features() while
(feati.hasNext()) Feature feat (Feature)
feati.next() System.out.println(feat.getType(
) " " feat.getLocation().toString())
System.out.println(seq.subStr(feat.getLocat
ion().getMin(), feat.getLocation().getMax())
)
85Fasta2Features example input
Example Input
FASTA file containing 1 sequence
gtID1 aaaaaactgggaacaaagcaatcaaaagaaatgtatgtggccggg
cgtcatggctcacgc ctgtaatcccagcactttgggaggctgaggcagg
tggatcacaggtcaggagatcgagac cgtcctggctaacatggtgaaac
cccatctctactaaaaatacaaaagtttagcccggcg tggtggcgagcg
cctgtagtcccagctactcaggaggctgaggcagaagaatggcgtgaa c
ccgggaggcagagcttgcagtgagctgagatcgtgccactgccctccagc
ctgggcgac agagcgagactccgtcttaaaaaaaaaaaaaaaaaaaaaa
gaaatgtatgtatgtatgta tctgtgatgtcatttgcctcttaagggaa
aggtaataataacagcttcctaccaaaacac atttcatgatctcaagaa
aattttctcaaacctggtggtgaagagttataataccaatgc Ttaaaaa
tagtgttagtaaaaaaacaatccagaaattctgtacattgaatagaggaa
aat
GFF file containing 1 gene and 1 exon annotation
gff-version 1 date 2000-02-18 I'm just
making this up ID1 cbw gene 10 500 0 . cbw_gene
ID1 cbw exon 15 400 0 . cbw_exon
86Fasta2Features output
gt java Fasta2Features ID1.fa.txt gfftest.gff.txt
cbw
87BLAST parsing
- Similar parser for BLAST, WuBLAST, FASTA, HMMer
- See Biojava in Anger for detailed example
- http//www.biojava.org/docs/bj_in_anger/BlastParse
r.htm
88MEME parsing
- Sequence motif finder
- Use Meme class
89Other BioJava facilities
- BioSQL interface
- HMM utilities
- Dynamic Programming
- Model trainers
- Profile makers
- Weight Matrices
- DAS distributed annotation server
- Allows you to serve up your annotations on
Ensembl
90BioJava concluding thoughts
- Still in rapid development
- Will always be
- Missing some critical data types
- gene expression, molecular interactions
- Subscribe to mailing lists
- Always more informative than static documentation
- Contribute code
- Forces you to write good code
- Everyone wins
- OpenSource philosophy is very powerful
91Summary
- Existing packages can be extremely useful and can
dramatically increase development time - Look before you code
- Biojava provides a useful codebase for working
with some bioinformatics data - Very well-developed tools for sequence modeling
and manipulation - Go Open Source
92End
- Resources
- http//www.biojava.org
- Biojava wwwsite
- http//biojava.org/docs/api/index.html
- Javadoc API docs
- http//biojava.org/docs/bj_in_anger/index.htm
- Biojava in Anger
- sohrab_at_bioinformatics.ubc.ca
- me
93Sequence I/O
- Motivation
- Want to input sequence data from different
databases in different formats - Want to be able to write sequence records in
different formats - Want to convert between formats
- Want to easily load data into memory
- Single records
- Multiple records
94Sequence I/O
95Sequence manipulation
- Motivation
- Often, we want to compute on sub-sequences of
an inputted sequence - Could correspond to a feature in a sequence
- Could need to do a sliding window analysis
- Perform biological operations
- Transcription on DNA
- Translation on RNA
- Tryptic digest
96Sequence manipulation
97Sequence distribution
- Motivation
- We often need to know the background noise
level relative to a signal - Gives statistical robustness to a detected
pattern - Can generate random sequences from the
distribution and compare observations in real and
random data
98Sequence distribution
99Interpreting Similarity search data
- Motivation
- Volume of computational results to great to sift
through manually - We need to be able to extract parts of reports
and effectively summarize the results - Top BLAST hits
100Similarity searching results parsing
101Translating an mRNA
Summary This gene encodes a selenoprotein
containing selenocysteine at the active site. The
selenocysteine is encoded by the ususal nonsense
codon UGA. This gene is expressed in a variety of
tissues, and the protein is localized to the
perinuclear structures.
102NM_080430
DEFINITION Homo sapiens selenoprotein M (SELM),
mRNA.
CDS 61..498 /gene"SELM" /note"selenoprotein
SelM" /selenocysteine /codon_start1
/transl_except(pos202..204,aaSec)
/product"selenoprotein M precursor"
/protein_id"NP_536355.1"
103Other Packages
- PAL
- MADAM TIGR
- CABIO
- MAGE SDK
- Ensj/MartJ
- LSID IBM Life Science
- Many, many more.
104TODO
- FeatureExtract output for Dicty seq