Lecture 3'0 Biopackages I sequence manipulation with BioJava

About This Presentation

Title:

Lecture 3'0 Biopackages I sequence manipulation with BioJava

Description:

How do I make a Sequence from a String or make a Sequence Object back into a String? ... How do I reverse complement a DNA or RNA Sequence? ... – PowerPoint PPT presentation

Number of Views:321

Avg rating:3.0/5.0

Slides: 72

Provided by: stephe78

Category:

more less

Transcript and Presenter's Notes

Title: Lecture 3'0 Biopackages I sequence manipulation with BioJava

1
Lecture 3.0 Biopackages I - sequence
manipulation with BioJava

Sohrab Shah
UBC Bioinformatics Centre
sohrab_at_bioinformatics.ubc.ca
http//bioinformatics.ubc.ca/people/sohrab

2
My Work

Pegasys workflow management for bioinformatics
Uses Biojava heavily
Comparative vertebrate genomics
Annotation of Cryptococcus

http//bioinformatics.ubc.ca
3
Outline

When to use biopackages
A brief tour of biojava
Sequence IO
Sequence manip
Distributions
Features
Proteomics
Learn by examples

4
http//www.jb.man.ac.uk/drl/
5
Using APIs/Libraries

Libraries are devoted to avoid re-inventing the
wheel in software development
Include API contracts that specify the rules of
engagement for programmers
What are the inputs to a method?
What is the output of a method?
Designed to speed up development time for
application programmers
Standard APIs promote consistent code and a
commonality among diverse/unrelated applications

6
What to look for in a software library

Does the software have the functionality you
need?
Is the software extensible?
Is the software the accepted standard for the
domain?
Does the package have a user community?
Is the software well-supported by the developers?
Documentation
Mailing lists
Bug fixing process
Can I use and distribute this legally?

7
Pitfalls

For open source
Public domain ! correctness
Public domain ! robustness
Spend some time evaluating the code
BioPerl anecdote
Does the documentation (publication or other)
accurately reflect the algorithm?
Is there documentation??
NCBI C Toolkit anecdote

8
Rolling your own

Sometimes this is the best way
Understanding code at many or all levels is
critical
Understanding an algorithm
Implement it yourself?
Debugging/modifying a library may take more time
than writing it yourself

9
BIOJAVA
10
BioJava

http//biojava.org
Dedicated to providing Java tools for processing
biological data
Released under the Open Source LGPL
Less restrictive than GPL
Last stable release v1.3 June 15th 2003
Designed for programmers with experience in Java
Is a work in progress
Robust architecture that models biology

11
biojava.org
12
API docs
13
biojava.org
14
BioJava in Anger

Setup
Where do I get a Java installation?
How do I get and install BioJava?
Alphabets and Symbols
How do I get a DNA, RNA or Protein Alphabet?
How do I make a custom Alphabet from custom
Symbols?
How do I make a CrossProductAlphabet such as a
codon Alphabet?
How do I break Symbols from CrossProduct
Alphabets into their component Symbols?
How can I tell if two Alphabets or Symbols are
equal?
How can I make an ambiguous Symbol like Y or R?
Basic Sequence Manipulation
How do I make a Sequence from a String or make a
Sequence Object back into a String?
How do I get a subsection of a Sequence?
How do I transcribe a DNA Sequence to a RNA
Sequence?
How do I reverse complement a DNA or RNA
Sequence?
Sequences are immutable so how can I change it's
name?
How can I edit a Sequence or SymbolList?

15
BioJava in Anger

Translation
How do I translate a DNA or RNA Sequence or
SymbolList to Protein?
How do I translate a single codon to a single
amino acid?
How do I use a non standard translation table?
How do I translate a nucleotide sequence in all
six frames
Proteomics
How do I calculate the mass and pI of a peptide?
Sequence I/O
How do I write Sequences in Fasta format?
How do I read in a Fasta file?
How do I read a GenBank/EMBL/SwissProt file?
How do I extract GenBank/EMBL/Swissprot sequences
and write them as Fasta?
How do I turn an ABI sequence trace into a
BioJava Sequence?

16
BioJava Functionality

Manipulating sequences
Sequence File I/O
Probability distributions over sets of sequences
Parsing sequence similarity search results
Taxonomy data structure
Bibliographic data structures
Annotation file formats
GFF, GAME XML, DAS
Dynamic programming
Support Vector machine classification and
regression

17
BioJava

Rag-bag collection of useful stuff!
Focus
Sequence IO
Sequence manip
Probability Distributions
Features
Proteomics

18
BioJava Design Ethos

Models biopolymers as a list of symbols, and not
as strings
Strings have 3 disadvantages
Validation
It is possible to pass any string to a routine
which is expecting a biological sequence. Any
validation has to be performed on an ad hoc
basis.
Ambiguity
The meaning of each symbol is not necessarily
clear. The T' which means thymidine in DNA is
the same T' which is a threonine residue in a
protein sequence
Limited alphabet
While there are obvious encodings for nucleic
acid and sequence data as strings, the same
approach does not always work well for other
kinds of data generated in biological sequence
analysis software
Closer to the biology of the sequence can
attach biochemical annotation to the symbol
objects cant do this with strings

Thomas Down biojava.org
19
Sequence manipulation

The Sequence Object
abstract data type for biological sequences
residues of the sequence are stored in a
SymbolList

20
Introduction what is BioJava?

OpenSource initiative to create Java libraries
and tools for processing biological data
Contains objects for manipulating sequences, file
parsers, Dynamic Programming routines for HMM
analysis, basic statistics, sequence database
interoperability
Work in progress now at version 1.21
www.biojava.org
Generally designed for programmers with Java
experience
Many ideas from this lecture are taken from
http//www.biojava.org/

21
BioJava Overview

Sequences and features
IO
Processing, storing, manipulating
Visualising
Dynamic programming
Single-sequence and pair-wise HMMs
Viterbi-path, Forward and Backward algorithms
Training models
Sampling sequences from models
External file formats and programs
GFF
Blast
Meme
Sequence Databases
BioCorba interoperability
ACeDB client
BioSQL

22
Symbol Interface

BioJava takes a rather different approach to
modeling sequence data. Instead of using a string
of ASCII characters, a sequence is modelled as a
list of Java objects implementing the Symbol
interface. This class, and the others described
here, are part of the Java package
org.biojava.bio.symbol.

23
Symbol Interface

public interface Symbol
public String getName()
public Annotation getAnnotation()
public Alphabet getMatches()
All Symbol instances have
a name property (ie, Thymine).
An optional Annotation property (ie, information
about the chemical properties of a DNA base)
getMatches, is only important for ambiguous
symbols

24
Alphabet interface

A set of symbol objects that can be found in a
particular type of sequence is defined by the
Alphabet Interface
Allows you to define your own alphabet
A set of standard alphabets is available from the
AlphabetManager

25
SequenceAlphabet
import org.biojava.bio.symbol. import
org.biojava.bio.seq. import java.util. public
class SequenceAlphabet public static void
main(String args) throws Exception // print
the names of the alphabets available by
default // get an iterator on the
Alphabets Iterator iam AlphabetManager.alphabe
ts() while (iam.hasNext()) Alphabet a
((Alphabet) iam.next()) System.out.println("Al
phabet " a.getName()) // get all the
DNA symbols FiniteAlphabet dna
DNATools.getDNA() Iterator dnaI
dna.iterator() while (dnaI.hasNext())
Symbol dnaSymbol (Symbol)
dnaI.next() System.out.println(dnaSymbol.getNa
me()) // get all the Protein
symbols FiniteAlphabet prot
ProteinTools.getAlphabet() Iterator protI
prot.iterator() while (protI.hasNext())
Symbol protSymbol (Symbol)
protI.next() System.out.println(protSymbol.get
Name())
26
SequenceAlphabet Output
Alphabet Alphabet of all integers. Alphabet
RNA Alphabet DNA Alphabet STRUCTURE Alphabet
PROTEIN Alphabet NUCLEOTIDE Alphabet
PROTEIN-TERM Alphabet Alphabet of all
doubles. DNA symbols cytosine thymine adenine gu
anine Protein symbols LYS ALA CYS TYR TRP ASP HI
S ILE ARG VAL GLN GLY PHE MET THR PRO LEU SEC GLU
ASN SER
27
Modeling Sequences

Some rules
BioJava models sequences with a SymbolList
interface
Every SymbolList has an associated Alphabet, and
may only contain Symbols from that alphabet.
SymbolLists can be seen as strings which are made
up of Symbol objects rather than characters. The
interface specifies methods for querying the
alphabet and length, and accessing the Symbols
Effectively modeling biopolymers

28
SymbolList Interface

getAlphabet() The alphabet that this SymbolList
is over.
iterator() An Iterator over all Symbols in this
SymbolList.
length() The number of symbols in this
SymbolList.
seqString() Return the symbol list as a String
Object.
subList(int start, int end) Return a new
SymbolList for the symbols start to end
inclusive.
subStr(int start, int end) Return a region of
this symbol list as a String.
symbolAt(int index) Return the symbol at index,
counting from 1.
toList() Returns a List of symbols.

29
Sequence Interface

Extends SymbolList
getName( )
Extracted from some file format
getURN( )
Could be a URL or a A Uniform Resource Identifier
(URI) which identifies the sequence represented
by this object.

30
Loading Sequences Local Files

SeqIOTools
Read formats
Genbank, EMBL, FASTA, GFF, SwissProt, etc
Write formats
FASTA,
SequenceBuilderFactories

31
SeqIOTools Interface

readEmbl (java.io.BufferedReader br)
Iterate over the sequences in an EMBL-format
stream.
readFasta (java.io.InputStream seqFile,
Alphabet alpha)
Create a sequence database from a fasta file
provided as an input stream.
readFastaDNA (java.io.BufferedReader br)
Iterate over the sequences in an FASTA-format
stream of DNA sequences.
readFastaProtein (java.io.BufferedReader br)
Iterate over the sequences in an FASTA-format
stream of Protein sequences.
readGenbank (java.io.BufferedReader br)
Iterate over the sequences in an GenBank-format
stream.
readSwissprot (java.io.BufferedReader br)
Iterate over the sequences in an Swissprot-format
stream.
writeFasta (java.io.OutputStream os,
SequenceDB db)
Write a sequenceDB to an output stream in fasta
format
writeFasta (java.io.OutputStream os,
SequenceIterator in)
Writes sequences from a SequenceIterator to an
OutputStream in Fasta Format.

32
Example Genbank2Fasta

Motivation
Read in a sequence record in one format and
export it in another
Tools
SeqIOTools static methods
readGenbank
writeFasta

33
Example SeqIOTools Genbank2Fasta
import java.io. import org.biojava.bio.seq. im
port org.biojava.bio.seq.io. public class
Genbank2Fasta public static void
main(String args) throws Exception String
genbank args0 String fasta args1
FileOutputStream faout new
FileOutputStream(fasta) BufferedReader
gbreader new BufferedReader(new
FileReader(genbank)) SequenceIterator
si SeqIOTools.readGenbank(gbreader)
SeqIOTools.writeFasta(faout, si)
34
Check-in

Collections ?
Iterators ?

35
Example ResidueCount

Motivation
Print out the counts/percentages of each symbol
in an alphabet for every sequence in a given
FASTA file
Tools
SeqIOTools
SequenceDB
Sequence
Symbol

36
ResidueCount
HashMap hash new HashMap() FiniteAlphabet a
null //get the alphabet a (FiniteAlphabet)
AlphabetManager.alphabetForName(alphname) //open
the file and read the sequences FileInputStream
fis new FileInputStream(filename) SequenceDB
seqdb SeqIOTools.readFasta(fis,
a) SequenceIterator stream seqdb.sequenceIterat
or() while (stream.hasNext())
//initialise the hash - all counts to 0
Iterator ai a.iterator() while
(ai.hasNext()) Symbol s (Symbol)
ai.next() hash.put(s.getName(), new
Integer(0)) //get the next sequence
Sequence seq stream.nextSequence() for
(int i 1 i lt seq.length() i)
//read the sequence and count the residues
Symbol sym seq.symbolAt(i) // check
if symbol in alphabet if
(a.contains(sym)) int count
((Integer) hash.get(sym.getName())).intValue()
hash.put(sym.getName(), new
Integer(count)) else
// print warning //output the
results for this sequence System.out.println(s
eq.getName()) Collection keys
hash.keySet() Iterator keyI
keys.iterator() while (keyI.hasNext())
String c (String) keyI.next()
Integer count (Integer)hash.get(c)
float percent count.floatValue() /
(float) seq.length() System.out.println(c
" " count " " percent)
37
ResidueCount
HashMap hash new HashMap() FiniteAlphabet a
null //get the alphabet a (FiniteAlphabet)
AlphabetManager.alphabetForName(alphname) //open
the file and read the sequences FileInputStream
fis new FileInputStream(filename) SequenceDB
seqdb SeqIOTools.readFasta(fis,
a) SequenceIterator stream seqdb.sequenceIterat
or() while (stream.hasNext())
//initialise the hash - all counts to 0
Iterator ai a.iterator() while
(ai.hasNext()) Symbol s (Symbol)
ai.next() hash.put(s.getName(), new
Integer(0)) //get the next sequence
Sequence seq stream.nextSequence() for
(int i 1 i lt seq.length() i)
//read the sequence and count the residues
Symbol sym seq.symbolAt(i) // check
if symbol in alphabet if
(a.contains(sym)) int count
((Integer) hash.get(sym.getName())).intValue()
hash.put(sym.getName(), new
Integer(count)) else
// print warning //output the
results for this sequence System.out.println(s
eq.getName()) Collection keys
hash.keySet() Iterator keyI
keys.iterator() while (keyI.hasNext())
String c (String) keyI.next()
Integer count (Integer)hash.get(c)
float percent count.floatValue() /
(float) seq.length() System.out.println(c
" " count " " percent)
38
Input of ResidueCount

Accession AF000580.1
Plasmid from Dictyostelium discoideum
Known to have approx 80 AT
Ameboid protozoan

39
Output of ResidueCount
40
Probability Distributions

Package org.biojava.bio.dist
Distribution interface
Alphabet getAlphabet()
The alphabet from which this spectrum emits
symbols.
Distribution getNullModel()
Retrieve the null model Distribution that this
Distribution recognizes.
Double getWeight(Symbol s)
Return the probability that Symbol s is emitted
by this spectrum.
void registerWithTrainer(DistributionTrainerContex
t dtc)
Register this distribution with a training
context.
Symbol sampleSymbol()
Sample a symbol from this state's probability
distribution.
void setNullModel(Distribution nullDist)
Set the null model Distribution that this
Distribution recognizes.
void setWeight(Symbol s, double w)
Set the probability or odds that Symbol s is
emitted by this state.

41
Example GenerateRandomSequences

Motivation
Often we need a probability distribution of the
residue composition of a set of sequences
We may want to emit random sequences from this
distribution to represent the background
Test some algorithm over the real sequences and
the random sequences and compare results
Tools
Distribution

42
GenerateRandomSequences
// set up the distribution DistributionTrainerCont
ext dtc new SimpleDistributionTrainerContext()
Distribution dist DistributionFactory.DEFAUL
T.createDistribution( AlphabetManager.alphabetForN
ame(alph)) dtc.registerDistribution(dist) //ope
n the file and read the sequences FileInputStream
fis new FileInputStream(inputFile) SequenceDB
seqdb SeqIOTools.readFasta(fis,
AlphabetManager.alphabetForName(alph)) SequenceIt
erator stream seqdb.sequenceIterator() while
(stream.hasNext()) Sequence seq
(Sequence) stream.nextSequence() for (int j
1 j lt seq.length() j)
dtc.addCount(dist, seq.symbolAt(j), 1.0)
// create the Distribution dtc.train() FileO
utputStream fos new FileOutputStream(outputFile)
for (int i 0 i lt numOutputSeqs i)
String name "Random Sequence " i
Sequence seq DistributionTools.generateSequence(
name,dist,lengthOutputSeqs)
SeqIOTools.writeFasta(fos, seq)
43
Biological Simulation

From DNA-gtRNA-gtProtein
BioJava has rigourously defined methods to do
this

44
Sequence Tools

Package org.biojava.bio.seq
DNATools
RNATools
ProteinTools

45
DNATools

a() returns static AtomicSymbol for a
c() returns static AtomicSymbol for c
complement(Symbol sym) Complement the symbol.
complement(SymbolList list) Retrieve a
complement view of list.
complementTable() Get a translation table for
complementing DNA symbols.
createDNA(java.lang.String dna) Return a new DNA
SymbolList for dna.
dnaToken(Symbol sym) Get a single-character
token for a DNA symbol
forIndex(int index) Return the symbol for an
index - compatible with index.
forSymbol(char token) Retrieve the symbol for a
char.
g() returns static AtomicSymbol for g
getDNA() Return the DNA alphabet.
index(Symbol sym) Return an integer index for a
symbol - compatible with forIndex.
n() returns static AtomicSymbol for n
reverseComplement(SymbolList list) Retrieve a
reverse-complement view of list.
t() returns static AtomicSymbol for t

46
RNATools

complementTable()
Get a translation table for complementing DNA
symbols
createRNA(java.lang.String rna)
Return a new RNA SymbolList for rna.
getGeneticCode(java.lang.String name)
Retrieve a TranslationTable by name.
getGeneticCodeNames() Retrieve a Set
containing the name of each genetic code.
getRNA() Return the RNA alphabet.
reverseComplement(SymbolList list)
Retrieve a reverse-complement view of
list.
transcribe(SymbolList list) Transcribe
DNA into RNA.
transcriptionTable() Get a translation
table for converting DNA to RNA.
translate(SymbolList syms) Translate
RNA into protein (with termination symbols).

47
ProteinTools

createProtein(java.lang.String theProtein)
Return a new Protein SymbolList for
protein.
getAlphabet() Gets the protein
alphabet
getSymbolPropertyTable(java.lang.String name)
getTAlphabet() Gets the protein
alphabet including the translation termination
symbols

48
Biological example Selenoproteins

Selenoproteins carry out the function of selenium
in our cells
Selenium levels implicated in neurological
disease
Alzheimers
Parkinsons
Most selenoprotein mRNAs contain a single UGA
codon encoding a single selenocysteine residue
per polypeptide chain, and a single specific RNA
secondary structure, termed a selenocysteine
insertion sequence (SECIS) element, directing
incorporation of this amino acid.

Castellano S, Morozova N, Morey M, Berry MJ,
Serras F, Corominas M, Guigo R. In silico
identification of novel selenoproteins in the
Drosophila melanogaster genome. EMBO Rep. 2001
Aug2(8)697-702.
49
Example Translate

Motivation
From a cDNA output the protein sequence
Tools
DNATools
RNATools
ProteinTools
Question what do we see when we translate a
selenoprotein mRNA?

50
Translating an mRNA

NM_080430

Summary This gene encodes a selenoprotein
containing selenocysteine at the active site. The
selenocysteine is encoded by the ususal nonsense
codon UGA. This gene is expressed in a variety of
tissues, and the protein is localized to the
perinuclear structures.
51
NM_080430
DEFINITION Homo sapiens selenoprotein M (SELM),
mRNA.
CDS 61..498 /gene"SELM" /note"selenoprotein
SelM" /selenocysteine /codon_start1
/transl_except(pos202..204,aaSec)
/product"selenoprotein M precursor"
/protein_id"NP_536355.1"
52
Example Translate
import java.io. import org.biojava.bio.symbol.
import org.biojava.bio.seq. import
org.biojava.bio.seq.io. import
org.biojava.bio.seq.db. public class Translate
public static void main(String args) throws
Exception String filename args0 //
open the file FileInputStreamfis new
FileInputStream(filename) SequenceDB seqdb
SeqIOTools.readFasta(fis,DNATools.getDNA())
SequenceIterator si seqdb.sequenceIterator()
while (si.hasNext()) // get the dna
sequence Sequence dna
si.nextSequence() // transcribe the
dna SymbolList rna RNATools.transcribe(d
na) System.out.println(dna.getName()
"") SymbolList protein
RNATools.translate(rna)
System.out.println(protein.seqString())

53
Output of Translate with NM_080430

gi4637009261-498
MSLLLPPLALLLLLAALVAPATAATAYRPDWNRLSGLTRARVETCGGQ
LNRLKEVKAFVTQDIPFYHNLVMKHLPGADPELVLLGRRYEELERIPLSE
MTREEINALVQELGFYRKAAPDAQVPPEYVWAPAKPPEETSDHADL

54
Sequences and Features

In most data models sequences have information
attached to them global information
(identifiers, publications) and biological
features corresponding to locations on the
sequences
Global annotations key-value pairs

Sequence seq getSequence() Annotation seqAn
seq.getAnnotation() for (Iterator i
seqAn.keys().iterator() i.hasNext() )
Object key i.next() Object value
seqAn.getProperty(key) System.out.println(key.t
oString() " " value.toString())
//source biojava.org Thomas Down
55
Feature Class

features() Iterate over any child
features which are held by this feature.
getLocation() The location of this
feature.
getParent() Return the FeatureHolder
to which this feature has been attached.
getSequence() Return the Sequence
object to which this feature is (ultimately)
attached.
getSource() The source of the feature
getSymbols() Return a list of symbols
that are contained in this feature.
getType() The type of the feature.
makeTemplate() Create a new Template
that could be used to generate a feature
identical to this one.

56
Using the Feature interface FeatureExtract
//import statements public class FeatureExtract
public static void main(String args)
throws Exception if (args.length ! 2)
throw new Exception("usage java FeatureExtract
GENBANKFILE FEATURETYPE") String filename
args0 String feature args1 BufferedRead
er gbreader new BufferedReader(new FileReader
(filename)) SequenceIterator si
SeqIOTools.readGenbank(gbreader) while
(si.hasNext()) Sequence seq
si.nextSequence() Iterator feati
seq.features() while (feati.hasNext())
Feature feat (Feature) feati.next() if
(feat.getType().equals(feature))
System.out.println(feat.getType() " "
feat.getLocation().toString())

57
Input to FeatureExtract

AF000580.gbff gene
AF000580.gbff repeat_region

58
Output of FeatureExtract
59
FeatureExtract Output
Genbank FlatFile of 80kb of human sequence
containing 2 genes
LOCUS HSDJ155G6 80600 bp DNA
PRI 13-APR-2000 DEFINITION Human DNA
sequence from clone RP1-155G6 on chromosome 20
Contains part of the gene for
brefeldin A-inhibited guanine
nucleotide-exchange protein 2, part of the gene
for CSE1L (chromosome segregation 1
(yeast homolog)-like), ESTs, STSs, GSSs
and a CpG Island, complete sequence. ACCESSION
AL121903 VERSION AL121903.13
GI7330682 KEYWORDS HTG brefeldin CpG
Island CSE1L. SOURCE human. ORGANISM
Homo sapiens Eukaryota Metazoa
Chordata Craniata Vertebrata Euteleostomi
Mammalia Eutheria Primates Catarrhini
Hominidae Homo. REFERENCE 1 (bases 1 to
80600)
gt java FeatureExtract AL121903.genbank gene
60
FeatureExtract Output
gt java FeatureExtract AL121903.genbank mRNA
61
FeatureExtract Output
Cgtjava FeatureExtract AL121903.genbank
repeat_region
62
Proteomics Tools

Package org.biojava.bio.proteomics
Digest
This class contains methods for calculating the
results of proteolytic digestion of a protein
sequence this class is not designed to be thread
safe
IsoelectricPointCalc
class that computes isoelectric point for
proteins
MassCalc
MassCalc calculates the mass of peptides which
for our purposes are SymbolLists which contain
Symbolsfrom the protein Alphabet.
Protease
The protease class stores parameters needed by
Digest to digest a protein sequence.
ProteaseManager
Registry and utility methods for Proteases.
StructureTools

63
Example TrypticDigest

Motivation
Widely used method in proteomics to identify
which proteins are present in a cell state
using mass spectrometry
Tools
Digest
MassCalc
Protease

64
TrypticDigest
public static void trypticDigest(Sequence seq)
throws BioException, ChangeVetoException
Digest digest new Digest()
digest.setMaxMissedCleavages(MAXMISSCLEAVAGE)
digest.setProtease(ProteaseManager.getProteaseByN
ame(Protease.TRYPSIN)) digest.setSequence(seq
) digest.addDigestFeatures() Iterator
iterator digest.getSequence().features()
while (iterator.hasNext()) // iterate
through the features SimpleFeature f
(SimpleFeature) iterator.next()
SymbolList symList f.getSymbols() //
calculate the mass (in Daltons) MassCalc
mc new MassCalc(SymbolPropertyTable.AVG_MASS,
true) double mass mc.getMass(symList)
// output System.out.println(f.
getLocation().toString() "\t" mass
"\t" symList.seqString())
65
TrypticDigest

Input
gtgi16761190refNP_456807.1 ecotin precursor
Salmonella enterica subsp.
MKMFVPAVVFAALASASAWANNGDTAQPLEKIAPYPQAEKGMKRQVITLT
PQQDESTLKVELLIGQTLNV
DCNQHRLGGTLETKTLEGWGYDYYVFDNVTSPVSTMMACPEGKKEQKFVT
AWLGEDGMLRYNSKLPIVVY
TPANVDVKYRIWKADANVQNAIAR

66
TrypticDigest
67
Annotations

Tell us something about the whole record
Structured as keyvalue pairs
Examples
Author list
PubMed ID
Organism

68
Annotation Interface

asMap()
Return a map that contains the same key/values as
this Annotation.
containsProperty(java.lang.Object key)
Returns whether there is any property under that
key in this Annotation.
getProperty(java.lang.Object key)
Retrieve the value of a property by key.
keys()
Get a set of key objects.
setProperty(java.lang.Object key,
java.lang.Object value)
Set the value of a property.

69
Example AnnotationExtract
public class AnnotationExtract public static
void main(String args) throws Exception if
(args.length ! 1) throw new
Exception("usage java AnnotationExtract
GENBANKFILE") String filename
args0 BufferedReader gbreader new
BufferedReader(new FileReader (filename)) Seque
nceIterator si SeqIOTools.readGenbank(gbreader)
while (si.hasNext()) Sequence seq
si.nextSequence() Annotation annot
seq.getAnnotation() java.util.Set keys
annot.keys() java.util.Iterator i
keys.iterator() while (i.hasNext())
Object key i.next() if
(annot.containsProperty(key))
System.out.println(key " "
annot.getProperty(key))
70
AnnotationExtract Output
AnnotationExtract AF000580.gbff
TYPE DNA ACCESSION AF000580 genbank_accessions
AF000580 KEYWORDS . ORGANISM Dictyostelium
discoideum Eukaryota Mycetozoa Dictyosteliida
Dictyostelium. LOCUS AF000580 JOURNAL Genetics
148 (3), 1117-1125 (1998), Submitted
(21-APR-1997) Biology, Utah State University,
Logan, UT 84322-5305, USA DIVISION
INV REFERENCE 1 (bases 1 to 14955), 2 (bases
1 to 14955) MDAT 21-APR-1998 DEFINITION
Dictyostelium discoideum plasmid Ddp5, complete
genome. VERSION AF000580.1 TITLE Dictyostelium
discoideum nuclear plasmid Ddp5 is a chimera
related to the Ddp1 and Ddp2 plasmid families,
Direct Submission CIRCULAR circular SIZE
14955 ORIGIN SOURCE Dictyostelium
discoideum MEDLINE 98198836 GI 3068582 AUTHORS
Rieben,W.K. Jr., Gonzales,C.M., Gonzales,S.T.,
Pilkington,K.J., Kiyosawa,H., Hughes,J.E. and
Welker,D.L., Rieben,W.K., Gonzales,C.,
Gonzales,S.T., Pilkington,K., Kiyosawa,H.,
Hughes,J.E. and Welker,D.L. PUBMED 9539429
71
Sequence Tools

DNATools
RNATools
ProteinTools

72
DNATools

a() returns static AtomicSymbol for a
c() returns static AtomicSymbol for c
complement(Symbol sym) Complement the symbol.
complement(SymbolList list) Retrieve a
complement view of list.
complementTable() Get a translation table for
complementing DNA symbols.
createDNA(java.lang.String dna) Return a new DNA
SymbolList for dna.
dnaToken(Symbol sym) Get a single-character
token for a DNA symbol
forIndex(int index) Return the symbol for an
index - compatible with index.
forSymbol(char token) Retrieve the symbol for a
char.
g() returns static AtomicSymbol for g
getDNA() Return the DNA alphabet.
index(Symbol sym) Return an integer index for a
symbol - compatible with forIndex.
n() returns static AtomicSymbol for n
reverseComplement(SymbolList list) Retrieve a
reverse-complement view of list.
t() returns static AtomicSymbol for t

73
RNATools

complementTable()
Get a translation table for complementing DNA
symbols
createRNA(java.lang.String rna)
Return a new RNA SymbolList for rna.
getGeneticCode(java.lang.String name)
Retrieve a TranslationTable by name.
getGeneticCodeNames() Retrieve a Set
containing the name of each genetic code.
getRNA() Return the RNA alphabet.
reverseComplement(SymbolList list)
Retrieve a reverse-complement view of
list.
transcribe(SymbolList list) Transcribe
DNA into RNA.
transcriptionTable() Get a translation
table for converting DNA to RNA.
translate(SymbolList syms) Translate
RNA into protein (with termination symbols).

74
ProteinTools

createProtein(java.lang.String theProtein)
Return a new Protein SymbolList for
protein.
getAlphabet() Gets the protein
alphabet
getSymbolPropertyTable(java.lang.String name)
getTAlphabet() Gets the protein
alphabet including the translation termination
symbols

75
sequence tools Translate
public class Translate public static void
main(String args) throws Exception if
(args.length ! 1) throw new
Exception("usage java Translate Genbankfile")
String filename args0 // open the
file FileInputStream fis new
FileInputStream(filename) SequenceDB seqdb
SeqIOTools.readFasta(fis, DNATools.getDNA()) Se
quenceIterator si seqdb.sequenceIterator() wh
ile (si.hasNext()) // get the dna
sequence Sequence dna si.nextSequence()
// reverse complement the dna
sequence SymbolList dnaList
DNATools.reverseComplement(dna) // transcribe
the dna SymbolList rna RNATools.transcribe(dn
aList) System.out.println(dna.getName()
"") for (int i 1 i lt 3 i) //
translate the rna in 3 reading frames int
length (rna.length())-((rna.length()-i1)3)
SymbolList protein RNATools.translate(rna.sub
List(i, length)) System.out.println("Frame "
i " " protein.seqString())
76
Translate Input

FASTA File of ESTs from Fugu rubripes

gtFEEFRa010apsH5 AGACAACACACGGGGAACCAGGAACGTCAGGCC
CCTTACACTGGGATCCAGAGCCACAGC GCGACAGAAAGTTAAAGGATAA
ACTCTAACGGAACACAATGGCAGACAAAATTAAAGATG CCAAGATCATC
TTTGTCGTGGGTGGGCCTGGTTCTGGAAAGGGCACTCAGTGTGAGAAAA
TAGTTGCAAAGTATGGCTACACCCATCTGTCATCCGGGGATCTGCTCCGT
GCTGAAGTGG CCTCTGGCTCCGAGAGGGGCAAGCAGCTCCAGGCCATCA
TGCAGAAGGGAGAGCTTGTTC CCCTGGACACCGTCTTAGACATGATTAA
GGATGCCATGATCGCCAAGGCCGACGTGTCCA AGGGCTTCCTCATTGAC
GGCTACCCCCGTGAGGTGGAGCAGGGCGAGGAGTTTGAGAAGA AGATCG
GCAGACCCTGCCTGGTGCTGGACGTTGACGCCAGGGGGGAGACCATGGGC
AAGA gtFEEFRa010apsC2 CGAGGTGCGGACTGTGCTGTCGCTCGTC
AGGGCCCAGGACGCCTACGCTCGCCTGCCCGA GAGCTACAGGAACGGCG
TGAATCTGACTCTGGAGCAGCTGAACTCTCACACTGGGGTCCA TCACCA
TTTCCGCTTCTTGAAAAGTCTGGAAAAGTCAGAAATTGAGTCTGGTTTTG
GTGT GAGTTACCTCTACCACCATTTTTACCTGAAGCCAACGTGGTGCGC
CAAAGGAACAGAAGA GTCGAACCCTGAGGCGTGTGCCTTCAGGAACGAC
AGGCCGCTGATGGACTGCGCGATCTG CTACAAAACAGCGAACAATGTGA
TGGAGGCCAACCCAAAGCCATATGTGCACTGCATCCA AAAGCCAAGGAC
TCACACCGGGCATGAGGAGCAGTAGGACAGAACATTGCAGAAAAATGA G
CTACAACAGCGGAGCTCCAACACTTTTGGCTGTGAAAGTCGGCTGAGGCA
ACACGTGGA GCAACT
77
Translate Output
78
Loading Sequences

Three main ways to do this
Local Files
Local Databases
Remote Databases

79
Loading Sequences Remote URL

How do we retrieve a FASTA file remotely from
Genbank?
Code from Ratiba Terbaoui

//String idSeq SOMEACCESSIONNUMBER URL url
new URL("http//www.ncbi.nlm.nih.gov/htbin-post/En
trez/query?dbnform6doptfhtmlnotitlenouid
" idSeq) BufferedReader in new
BufferedReader(new InputStreamReader(url.openStre
am())) SequenceIterator stream
SeqIOTools.readFastaDNA(in) String
InputLine while ((InputLine in.readLine()) !
null) Sequence seq1 getSequence(stream) Sys
tem.out.println(seq1.seqString())
80
Adding Annotation to a Sequence

Our main goal in bioinformatics
GFF format
Widely used for sequence analysis and annotation

NAME lttabgt SOURCE lttabgt FEATURE lttabgt START lttabgt
END lttabgt SCORE lttabgt STRAND lttabgt PHASE lttabgt
ANYTHING COMMENT
81
GFF Parsing

BioJava has a gff package containing libraries
for parsing and manipulating GFF files
org.biojava.bio.program.gff.GFFParser

82
Example Fasta2Features

Takes in a FASTA file, a GFF file and a source
string
Filters the gff file by source
Annotates the FASTA file
Prints out the features and subsequences
associated with those features

83
Fasta2Features
public class Fasta2Features public static
void main(String args) throws Exception
if(args.length ! 3) throw new
Exception("Use Fasta2Features fastafile GFFfile
source") String fastafile args0
String gfffile args1 String source
args2 //load the sequence
FileInputStream fis new FileInputStream(fastafil
e) SequenceDB seqdb SeqIOTools.readFasta(fi
s, DNATools.getDNA()) // load in the GFF
System.out.println("Loading gff with inputted
source") GFFEntrySet gffEntries new
GFFEntrySet() GFFRecordFilter.SourceFilter
sFilter new GFFRecordFilter.SourceFilter()
sFilter.setSource(source) GFFFilterer
filterer new GFFFilterer(gffEntries.getAddHandle
r(), sFilter) GFFParser parser new
GFFParser() parser.parse(new
BufferedReader( new InputStreamReader( new
FileInputStream(new File(gfffile)))), filterer)
// add the features to the sequences
System.out.println("Adding features from gff to
sequences") SequenceDB aSeqDB new
AnnotatedSequenceDB(seqdb, gffEntries.getAnnotator
()) // use member method to do the
printing PrintFeatures(aSeqDB)
84
Fasta2Features methods
public static void PrintFeatures(SequenceDB
aSeqDB) // print the features we just added
to our sequence SequenceIterator si
aSeqDB.sequenceIterator() while (si.hasNext())
Sequence seq si.nextSequence() Iterator
feati seq.features() while
(feati.hasNext()) Feature feat (Feature)
feati.next() System.out.println(feat.getType(
) " " feat.getLocation().toString())
System.out.println(seq.subStr(feat.getLocat
ion().getMin(), feat.getLocation().getMax())
)
85
Fasta2Features example input
Example Input
FASTA file containing 1 sequence
gtID1 aaaaaactgggaacaaagcaatcaaaagaaatgtatgtggccggg
cgtcatggctcacgc ctgtaatcccagcactttgggaggctgaggcagg
tggatcacaggtcaggagatcgagac cgtcctggctaacatggtgaaac
cccatctctactaaaaatacaaaagtttagcccggcg tggtggcgagcg
cctgtagtcccagctactcaggaggctgaggcagaagaatggcgtgaa c
ccgggaggcagagcttgcagtgagctgagatcgtgccactgccctccagc
ctgggcgac agagcgagactccgtcttaaaaaaaaaaaaaaaaaaaaaa
gaaatgtatgtatgtatgta tctgtgatgtcatttgcctcttaagggaa
aggtaataataacagcttcctaccaaaacac atttcatgatctcaagaa
aattttctcaaacctggtggtgaagagttataataccaatgc Ttaaaaa
tagtgttagtaaaaaaacaatccagaaattctgtacattgaatagaggaa
aat
GFF file containing 1 gene and 1 exon annotation
gff-version 1 date 2000-02-18 I'm just
making this up ID1 cbw gene 10 500 0 . cbw_gene
ID1 cbw exon 15 400 0 . cbw_exon
86
Fasta2Features output
gt java Fasta2Features ID1.fa.txt gfftest.gff.txt
cbw
87
BLAST parsing

Similar parser for BLAST, WuBLAST, FASTA, HMMer
See Biojava in Anger for detailed example
http//www.biojava.org/docs/bj_in_anger/BlastParse
r.htm

88
MEME parsing

Sequence motif finder
Use Meme class

89
Other BioJava facilities

BioSQL interface
HMM utilities
Dynamic Programming
Model trainers
Profile makers
Weight Matrices
DAS distributed annotation server
Allows you to serve up your annotations on
Ensembl

90
BioJava concluding thoughts

Still in rapid development
Will always be
Missing some critical data types
gene expression, molecular interactions
Subscribe to mailing lists
Always more informative than static documentation
Contribute code
Forces you to write good code
Everyone wins
OpenSource philosophy is very powerful

91
Summary

Existing packages can be extremely useful and can
dramatically increase development time
Look before you code
Biojava provides a useful codebase for working
with some bioinformatics data
Very well-developed tools for sequence modeling
and manipulation
Go Open Source

92
End

Resources
http//www.biojava.org
Biojava wwwsite
http//biojava.org/docs/api/index.html
Javadoc API docs
http//biojava.org/docs/bj_in_anger/index.htm
Biojava in Anger
sohrab_at_bioinformatics.ubc.ca
me

93
Sequence I/O

Motivation
Want to input sequence data from different
databases in different formats
Want to be able to write sequence records in
different formats
Want to convert between formats
Want to easily load data into memory
Single records
Multiple records

94
Sequence I/O

Code

95
Sequence manipulation

Motivation
Often, we want to compute on sub-sequences of
an inputted sequence
Could correspond to a feature in a sequence
Could need to do a sliding window analysis
Perform biological operations
Transcription on DNA
Translation on RNA
Tryptic digest

96
Sequence manipulation

Code

97
Sequence distribution

Motivation
We often need to know the background noise
level relative to a signal
Gives statistical robustness to a detected
pattern
Can generate random sequences from the
distribution and compare observations in real and
random data

98
Sequence distribution

Code

99
Interpreting Similarity search data

Motivation
Volume of computational results to great to sift
through manually
We need to be able to extract parts of reports
and effectively summarize the results
Top BLAST hits

100
Similarity searching results parsing