Lecture 3'0 Biopackages I sequence manipulation with BioJava - PowerPoint PPT Presentation

1 / 71
About This Presentation
Title:

Lecture 3'0 Biopackages I sequence manipulation with BioJava

Description:

How do I make a Sequence from a String or make a Sequence Object back into a String? ... How do I reverse complement a DNA or RNA Sequence? ... – PowerPoint PPT presentation

Number of Views:321
Avg rating:3.0/5.0
Slides: 72
Provided by: stephe78
Category:

less

Transcript and Presenter's Notes

Title: Lecture 3'0 Biopackages I sequence manipulation with BioJava


1
Lecture 3.0 Biopackages I - sequence
manipulation with BioJava
  • Sohrab Shah
  • UBC Bioinformatics Centre
  • sohrab_at_bioinformatics.ubc.ca
  • http//bioinformatics.ubc.ca/people/sohrab

2
My Work
  • Pegasys workflow management for bioinformatics
  • Uses Biojava heavily
  • Comparative vertebrate genomics
  • Annotation of Cryptococcus

http//bioinformatics.ubc.ca
3
Outline
  • When to use biopackages
  • A brief tour of biojava
  • Sequence IO
  • Sequence manip
  • Distributions
  • Features
  • Proteomics
  • Learn by examples

4
http//www.jb.man.ac.uk/drl/
5
Using APIs/Libraries
  • Libraries are devoted to avoid re-inventing the
    wheel in software development
  • Include API contracts that specify the rules of
    engagement for programmers
  • What are the inputs to a method?
  • What is the output of a method?
  • Designed to speed up development time for
    application programmers
  • Standard APIs promote consistent code and a
    commonality among diverse/unrelated applications

6
What to look for in a software library
  • Does the software have the functionality you
    need?
  • Is the software extensible?
  • Is the software the accepted standard for the
    domain?
  • Does the package have a user community?
  • Is the software well-supported by the developers?
  • Documentation
  • Mailing lists
  • Bug fixing process
  • Can I use and distribute this legally?

7
Pitfalls
  • For open source
  • Public domain ! correctness
  • Public domain ! robustness
  • Spend some time evaluating the code
  • BioPerl anecdote
  • Does the documentation (publication or other)
    accurately reflect the algorithm?
  • Is there documentation??
  • NCBI C Toolkit anecdote

8
Rolling your own
  • Sometimes this is the best way
  • Understanding code at many or all levels is
    critical
  • Understanding an algorithm
  • Implement it yourself?
  • Debugging/modifying a library may take more time
    than writing it yourself

9
BIOJAVA
10
BioJava
  • http//biojava.org
  • Dedicated to providing Java tools for processing
    biological data
  • Released under the Open Source LGPL
  • Less restrictive than GPL
  • Last stable release v1.3 June 15th 2003
  • Designed for programmers with experience in Java
  • Is a work in progress
  • Robust architecture that models biology

11
biojava.org
12
API docs
13
biojava.org
14
BioJava in Anger
  • Setup
  • Where do I get a Java installation?
  • How do I get and install BioJava?
  • Alphabets and Symbols
  • How do I get a DNA, RNA or Protein Alphabet?
  • How do I make a custom Alphabet from custom
    Symbols?
  • How do I make a CrossProductAlphabet such as a
    codon Alphabet?
  • How do I break Symbols from CrossProduct
    Alphabets into their component Symbols?
  • How can I tell if two Alphabets or Symbols are
    equal?
  • How can I make an ambiguous Symbol like Y or R?
  • Basic Sequence Manipulation
  • How do I make a Sequence from a String or make a
    Sequence Object back into a String?
  • How do I get a subsection of a Sequence?
  • How do I transcribe a DNA Sequence to a RNA
    Sequence?
  • How do I reverse complement a DNA or RNA
    Sequence?
  • Sequences are immutable so how can I change it's
    name?
  • How can I edit a Sequence or SymbolList?

15
BioJava in Anger
  • Translation
  • How do I translate a DNA or RNA Sequence or
    SymbolList to Protein?
  • How do I translate a single codon to a single
    amino acid?
  • How do I use a non standard translation table?
  • How do I translate a nucleotide sequence in all
    six frames
  • Proteomics
  • How do I calculate the mass and pI of a peptide?
  • Sequence I/O
  • How do I write Sequences in Fasta format?
  • How do I read in a Fasta file?
  • How do I read a GenBank/EMBL/SwissProt file?
  • How do I extract GenBank/EMBL/Swissprot sequences
    and write them as Fasta?
  • How do I turn an ABI sequence trace into a
    BioJava Sequence?

16
BioJava Functionality
  • Manipulating sequences
  • Sequence File I/O
  • Probability distributions over sets of sequences
  • Parsing sequence similarity search results
  • Taxonomy data structure
  • Bibliographic data structures
  • Annotation file formats
  • GFF, GAME XML, DAS
  • Dynamic programming
  • Support Vector machine classification and
    regression

17
BioJava
  • Rag-bag collection of useful stuff!
  • Focus
  • Sequence IO
  • Sequence manip
  • Probability Distributions
  • Features
  • Proteomics

18
BioJava Design Ethos
  • Models biopolymers as a list of symbols, and not
    as strings
  • Strings have 3 disadvantages
  • Validation
  • It is possible to pass any string to a routine
    which is expecting a biological sequence. Any
    validation has to be performed on an ad hoc
    basis.
  • Ambiguity
  • The meaning of each symbol is not necessarily
    clear. The T' which means thymidine in DNA is
    the same T' which is a threonine residue in a
    protein sequence
  • Limited alphabet
  • While there are obvious encodings for nucleic
    acid and sequence data as strings, the same
    approach does not always work well for other
    kinds of data generated in biological sequence
    analysis software
  • Closer to the biology of the sequence can
    attach biochemical annotation to the symbol
    objects cant do this with strings

Thomas Down biojava.org
19
Sequence manipulation
  • The Sequence Object
  • abstract data type for biological sequences
  • residues of the sequence are stored in a
    SymbolList

20
Introduction what is BioJava?
  • OpenSource initiative to create Java libraries
    and tools for processing biological data
  • Contains objects for manipulating sequences, file
    parsers, Dynamic Programming routines for HMM
    analysis, basic statistics, sequence database
    interoperability
  • Work in progress now at version 1.21
  • www.biojava.org
  • Generally designed for programmers with Java
    experience
  • Many ideas from this lecture are taken from
  • http//www.biojava.org/

21
BioJava Overview
  • Sequences and features
  • IO
  • Processing, storing, manipulating
  • Visualising
  • Dynamic programming
  • Single-sequence and pair-wise HMMs
  • Viterbi-path, Forward and Backward algorithms
  • Training models
  • Sampling sequences from models
  • External file formats and programs
  • GFF
  • Blast
  • Meme
  • Sequence Databases
  • BioCorba interoperability
  • ACeDB client
  • BioSQL

22
Symbol Interface
  • BioJava takes a rather different approach to
    modeling sequence data. Instead of using a string
    of ASCII characters, a sequence is modelled as a
    list of Java objects implementing the Symbol
    interface. This class, and the others described
    here, are part of the Java package
    org.biojava.bio.symbol.

23
Symbol Interface
  • public interface Symbol
  • public String getName()
  • public Annotation getAnnotation()
  • public Alphabet getMatches()
  • All Symbol instances have
  • a name property (ie, Thymine).
  • An optional Annotation property (ie, information
    about the chemical properties of a DNA base)
  • getMatches, is only important for ambiguous
    symbols

24
Alphabet interface
  • A set of symbol objects that can be found in a
    particular type of sequence is defined by the
    Alphabet Interface
  • Allows you to define your own alphabet
  • A set of standard alphabets is available from the
    AlphabetManager

25
SequenceAlphabet
import org.biojava.bio.symbol. import
org.biojava.bio.seq. import java.util. public
class SequenceAlphabet public static void
main(String args) throws Exception // print
the names of the alphabets available by
default // get an iterator on the
Alphabets Iterator iam AlphabetManager.alphabe
ts() while (iam.hasNext()) Alphabet a
((Alphabet) iam.next()) System.out.println("Al
phabet " a.getName()) // get all the
DNA symbols FiniteAlphabet dna
DNATools.getDNA() Iterator dnaI
dna.iterator() while (dnaI.hasNext())
Symbol dnaSymbol (Symbol)
dnaI.next() System.out.println(dnaSymbol.getNa
me()) // get all the Protein
symbols FiniteAlphabet prot
ProteinTools.getAlphabet() Iterator protI
prot.iterator() while (protI.hasNext())
Symbol protSymbol (Symbol)
protI.next() System.out.println(protSymbol.get
Name())
26
SequenceAlphabet Output
Alphabet Alphabet of all integers. Alphabet
RNA Alphabet DNA Alphabet STRUCTURE Alphabet
PROTEIN Alphabet NUCLEOTIDE Alphabet
PROTEIN-TERM Alphabet Alphabet of all
doubles. DNA symbols cytosine thymine adenine gu
anine Protein symbols LYS ALA CYS TYR TRP ASP HI
S ILE ARG VAL GLN GLY PHE MET THR PRO LEU SEC GLU
ASN SER
27
Modeling Sequences
  • Some rules
  • BioJava models sequences with a SymbolList
    interface
  • Every SymbolList has an associated Alphabet, and
    may only contain Symbols from that alphabet.
  • SymbolLists can be seen as strings which are made
    up of Symbol objects rather than characters. The
    interface specifies methods for querying the
    alphabet and length, and accessing the Symbols
  • Effectively modeling biopolymers

28
SymbolList Interface
  • getAlphabet() The alphabet that this SymbolList
    is over.  
  • iterator() An Iterator over all Symbols in this
    SymbolList.  
  • length() The number of symbols in this
    SymbolList.  
  • seqString() Return the symbol list as a String
    Object.  
  • subList(int start, int end) Return a new
    SymbolList for the symbols start to end
    inclusive.  
  • subStr(int start, int end) Return a region of
    this symbol list as a String.  
  • symbolAt(int index) Return the symbol at index,
    counting from 1.  
  • toList() Returns a List of symbols.

29
Sequence Interface
  • Extends SymbolList
  • getName( )
  • Extracted from some file format
  • getURN( )
  • Could be a URL or a A Uniform Resource Identifier
    (URI) which identifies the sequence represented
    by this object.

30
Loading Sequences Local Files
  • SeqIOTools
  • Read formats
  • Genbank, EMBL, FASTA, GFF, SwissProt, etc
  • Write formats
  • FASTA,
  • SequenceBuilderFactories

31
SeqIOTools Interface
  • readEmbl (java.io.BufferedReader br)
  • Iterate over the sequences in an EMBL-format
    stream.
  • readFasta (java.io.InputStream seqFile,
    Alphabet alpha)
  • Create a sequence database from a fasta file
    provided as an input stream.
  • readFastaDNA (java.io.BufferedReader br)
  • Iterate over the sequences in an FASTA-format
    stream of DNA sequences.
  • readFastaProtein (java.io.BufferedReader br)
  • Iterate over the sequences in an FASTA-format
    stream of Protein sequences.
  • readGenbank (java.io.BufferedReader br)
  • Iterate over the sequences in an GenBank-format
    stream.
  • readSwissprot (java.io.BufferedReader br)
  • Iterate over the sequences in an Swissprot-format
    stream.
  • writeFasta (java.io.OutputStream os,
    SequenceDB db)
  • Write a sequenceDB to an output stream in fasta
    format
  • writeFasta (java.io.OutputStream os,
    SequenceIterator in)
  • Writes sequences from a SequenceIterator to an
    OutputStream in Fasta Format.

32
Example Genbank2Fasta
  • Motivation
  • Read in a sequence record in one format and
    export it in another
  • Tools
  • SeqIOTools static methods
  • readGenbank
  • writeFasta

33
Example SeqIOTools Genbank2Fasta
import java.io. import org.biojava.bio.seq. im
port org.biojava.bio.seq.io. public class
Genbank2Fasta public static void
main(String args) throws Exception String
genbank args0 String fasta args1
FileOutputStream faout new
FileOutputStream(fasta) BufferedReader
gbreader new BufferedReader(new
FileReader(genbank)) SequenceIterator
si SeqIOTools.readGenbank(gbreader)
SeqIOTools.writeFasta(faout, si)
34
Check-in
  • Collections ?
  • Iterators ?

35
Example ResidueCount
  • Motivation
  • Print out the counts/percentages of each symbol
    in an alphabet for every sequence in a given
    FASTA file
  • Tools
  • SeqIOTools
  • SequenceDB
  • Sequence
  • Symbol

36
ResidueCount
HashMap hash new HashMap() FiniteAlphabet a
null //get the alphabet a (FiniteAlphabet)
AlphabetManager.alphabetForName(alphname) //open
the file and read the sequences FileInputStream
fis new FileInputStream(filename) SequenceDB
seqdb SeqIOTools.readFasta(fis,
a) SequenceIterator stream seqdb.sequenceIterat
or() while (stream.hasNext())
//initialise the hash - all counts to 0
Iterator ai a.iterator() while
(ai.hasNext()) Symbol s (Symbol)
ai.next() hash.put(s.getName(), new
Integer(0)) //get the next sequence
Sequence seq stream.nextSequence() for
(int i 1 i lt seq.length() i)
//read the sequence and count the residues
Symbol sym seq.symbolAt(i) // check
if symbol in alphabet if
(a.contains(sym)) int count
((Integer) hash.get(sym.getName())).intValue()
hash.put(sym.getName(), new
Integer(count)) else
// print warning //output the
results for this sequence System.out.println(s
eq.getName()) Collection keys
hash.keySet() Iterator keyI
keys.iterator() while (keyI.hasNext())
String c (String) keyI.next()
Integer count (Integer)hash.get(c)
float percent count.floatValue() /
(float) seq.length() System.out.println(c
" " count " " percent)
37
ResidueCount
HashMap hash new HashMap() FiniteAlphabet a
null //get the alphabet a (FiniteAlphabet)
AlphabetManager.alphabetForName(alphname) //open
the file and read the sequences FileInputStream
fis new FileInputStream(filename) SequenceDB
seqdb SeqIOTools.readFasta(fis,
a) SequenceIterator stream seqdb.sequenceIterat
or() while (stream.hasNext())
//initialise the hash - all counts to 0
Iterator ai a.iterator() while
(ai.hasNext()) Symbol s (Symbol)
ai.next() hash.put(s.getName(), new
Integer(0)) //get the next sequence
Sequence seq stream.nextSequence() for
(int i 1 i lt seq.length() i)
//read the sequence and count the residues
Symbol sym seq.symbolAt(i) // check
if symbol in alphabet if
(a.contains(sym)) int count
((Integer) hash.get(sym.getName())).intValue()
hash.put(sym.getName(), new
Integer(count)) else
// print warning //output the
results for this sequence System.out.println(s
eq.getName()) Collection keys
hash.keySet() Iterator keyI
keys.iterator() while (keyI.hasNext())
String c (String) keyI.next()
Integer count (Integer)hash.get(c)
float percent count.floatValue() /
(float) seq.length() System.out.println(c
" " count " " percent)
38
Input of ResidueCount
  • Accession AF000580.1
  • Plasmid from Dictyostelium discoideum
  • Known to have approx 80 AT
  • Ameboid protozoan

39
Output of ResidueCount
40
Probability Distributions
  • Package org.biojava.bio.dist
  • Distribution interface
  • Alphabet getAlphabet()
  • The alphabet from which this spectrum emits
    symbols. 
  • Distribution getNullModel()
  • Retrieve the null model Distribution that this
    Distribution recognizes. 
  • Double getWeight(Symbol s)
  • Return the probability that Symbol s is emitted
    by this spectrum. 
  • void registerWithTrainer(DistributionTrainerContex
    t dtc)
  • Register this distribution with a training
    context. 
  • Symbol sampleSymbol()
  • Sample a symbol from this state's probability
    distribution. 
  • void setNullModel(Distribution nullDist)
  • Set the null model Distribution that this
    Distribution recognizes. 
  • void setWeight(Symbol s, double w)
  • Set the probability or odds that Symbol s is
    emitted by this state.

41
Example GenerateRandomSequences
  • Motivation
  • Often we need a probability distribution of the
    residue composition of a set of sequences
  • We may want to emit random sequences from this
    distribution to represent the background
  • Test some algorithm over the real sequences and
    the random sequences and compare results
  • Tools
  • Distribution

42
GenerateRandomSequences
// set up the distribution DistributionTrainerCont
ext dtc new SimpleDistributionTrainerContext()
Distribution dist DistributionFactory.DEFAUL
T.createDistribution( AlphabetManager.alphabetForN
ame(alph)) dtc.registerDistribution(dist) //ope
n the file and read the sequences FileInputStream
fis new FileInputStream(inputFile) SequenceDB
seqdb SeqIOTools.readFasta(fis,
AlphabetManager.alphabetForName(alph)) SequenceIt
erator stream seqdb.sequenceIterator() while
(stream.hasNext()) Sequence seq
(Sequence) stream.nextSequence() for (int j
1 j lt seq.length() j)
dtc.addCount(dist, seq.symbolAt(j), 1.0)
// create the Distribution dtc.train() FileO
utputStream fos new FileOutputStream(outputFile)
for (int i 0 i lt numOutputSeqs i)
String name "Random Sequence " i
Sequence seq DistributionTools.generateSequence(
name,dist,lengthOutputSeqs)
SeqIOTools.writeFasta(fos, seq)
43
Biological Simulation
  • From DNA-gtRNA-gtProtein
  • BioJava has rigourously defined methods to do
    this

44
Sequence Tools
  • Package org.biojava.bio.seq
  • DNATools
  • RNATools
  • ProteinTools

45
DNATools
  • a() returns static AtomicSymbol for a
  • c() returns static AtomicSymbol for c
               
  • complement(Symbol sym) Complement the symbol.
  • complement(SymbolList list) Retrieve a
    complement view of list.
  • complementTable() Get a translation table for
    complementing DNA symbols.
  • createDNA(java.lang.String dna) Return a new DNA
    SymbolList for dna.
  • dnaToken(Symbol sym) Get a single-character
    token for a DNA symbol
  • forIndex(int index) Return the symbol for an
    index - compatible with index.
  • forSymbol(char token) Retrieve the symbol for a
    char.
  • g() returns static AtomicSymbol for g         
  • getDNA() Return the DNA alphabet.
  • index(Symbol sym) Return an integer index for a
    symbol - compatible with forIndex.
  • n() returns static AtomicSymbol for n
  • reverseComplement(SymbolList list) Retrieve a
    reverse-complement view of list.
  • t() returns static AtomicSymbol for t

46
RNATools
  • complementTable()
  • Get a translation table for complementing DNA
    symbols
  • createRNA(java.lang.String rna)
  • Return a new RNA SymbolList for rna.
  • getGeneticCode(java.lang.String name)
              Retrieve a TranslationTable by name.
  • getGeneticCodeNames()           Retrieve a Set
    containing the name of each genetic code.
  • getRNA()           Return the RNA alphabet.
  • reverseComplement(SymbolList list)
              Retrieve a reverse-complement view of
    list.
  • transcribe(SymbolList list)           Transcribe
    DNA into RNA.
  • transcriptionTable()           Get a translation
    table for converting DNA to RNA.
  • translate(SymbolList syms)           Translate
    RNA into protein (with termination symbols).

47
ProteinTools
  • createProtein(java.lang.String theProtein)
              Return a new Protein SymbolList for
    protein.
  • getAlphabet()           Gets the protein
    alphabet
  • getSymbolPropertyTable(java.lang.String name)
               
  • getTAlphabet()           Gets the protein
    alphabet including the translation termination
    symbols

48
Biological example Selenoproteins
  • Selenoproteins carry out the function of selenium
    in our cells
  • Selenium levels implicated in neurological
    disease
  • Alzheimers
  • Parkinsons
  • Most selenoprotein mRNAs contain a single UGA
    codon encoding a single selenocysteine residue
    per polypeptide chain, and a single specific RNA
    secondary structure, termed a selenocysteine
    insertion sequence (SECIS) element, directing
    incorporation of this amino acid.

Castellano S, Morozova N, Morey M, Berry MJ,
Serras F, Corominas M, Guigo R. In silico
identification of novel selenoproteins in the
Drosophila melanogaster genome. EMBO Rep. 2001
Aug2(8)697-702.
49
Example Translate
  • Motivation
  • From a cDNA output the protein sequence
  • Tools
  • DNATools
  • RNATools
  • ProteinTools
  • Question what do we see when we translate a
    selenoprotein mRNA?

50
Translating an mRNA
  • NM_080430

Summary This gene encodes a selenoprotein
containing selenocysteine at the active site. The
selenocysteine is encoded by the ususal nonsense
codon UGA. This gene is expressed in a variety of
tissues, and the protein is localized to the
perinuclear structures.
51
NM_080430
DEFINITION Homo sapiens selenoprotein M (SELM),
mRNA.
CDS 61..498 /gene"SELM" /note"selenoprotein
SelM" /selenocysteine /codon_start1
/transl_except(pos202..204,aaSec)
/product"selenoprotein M precursor"
/protein_id"NP_536355.1"
52
Example Translate
import java.io. import org.biojava.bio.symbol.
import org.biojava.bio.seq. import
org.biojava.bio.seq.io. import
org.biojava.bio.seq.db. public class Translate
public static void main(String args) throws
Exception String filename args0 //
open the file FileInputStreamfis new
FileInputStream(filename) SequenceDB seqdb
SeqIOTools.readFasta(fis,DNATools.getDNA())
SequenceIterator si seqdb.sequenceIterator()
while (si.hasNext()) // get the dna
sequence Sequence dna
si.nextSequence() // transcribe the
dna SymbolList rna RNATools.transcribe(d
na) System.out.println(dna.getName()
"") SymbolList protein
RNATools.translate(rna)
System.out.println(protein.seqString())

53
Output of Translate with NM_080430
  • gi4637009261-498
  • MSLLLPPLALLLLLAALVAPATAATAYRPDWNRLSGLTRARVETCGGQ
    LNRLKEVKAFVTQDIPFYHNLVMKHLPGADPELVLLGRRYEELERIPLSE
    MTREEINALVQELGFYRKAAPDAQVPPEYVWAPAKPPEETSDHADL

54
Sequences and Features
  • In most data models sequences have information
    attached to them global information
    (identifiers, publications) and biological
    features corresponding to locations on the
    sequences
  • Global annotations key-value pairs

Sequence seq getSequence() Annotation seqAn
seq.getAnnotation() for (Iterator i
seqAn.keys().iterator() i.hasNext() )
Object key i.next() Object value
seqAn.getProperty(key) System.out.println(key.t
oString() " " value.toString())
//source biojava.org Thomas Down
55
Feature Class
  • features()           Iterate over any child
    features which are held by this feature.
  • getLocation()           The location of this
    feature.
  • getParent()           Return the FeatureHolder
    to which this feature has been attached.
  • getSequence()           Return the Sequence
    object to which this feature is (ultimately)
    attached.
  • getSource()           The source of the feature
  • getSymbols()           Return a list of symbols
    that are contained in this feature.
  • getType()           The type of the feature.
  • makeTemplate()           Create a new Template
    that could be used to generate a feature
    identical to this one.

56
Using the Feature interface FeatureExtract
//import statements public class FeatureExtract
public static void main(String args)
throws Exception if (args.length ! 2)
throw new Exception("usage java FeatureExtract
GENBANKFILE FEATURETYPE") String filename
args0 String feature args1 BufferedRead
er gbreader new BufferedReader(new FileReader
(filename)) SequenceIterator si
SeqIOTools.readGenbank(gbreader) while
(si.hasNext()) Sequence seq
si.nextSequence() Iterator feati
seq.features() while (feati.hasNext())
Feature feat (Feature) feati.next() if
(feat.getType().equals(feature))
System.out.println(feat.getType() " "
feat.getLocation().toString())

57
Input to FeatureExtract
  • AF000580.gbff gene
  • AF000580.gbff repeat_region

58
Output of FeatureExtract
59
FeatureExtract Output
Genbank FlatFile of 80kb of human sequence
containing 2 genes
LOCUS HSDJ155G6 80600 bp DNA
PRI 13-APR-2000 DEFINITION Human DNA
sequence from clone RP1-155G6 on chromosome 20
Contains part of the gene for
brefeldin A-inhibited guanine
nucleotide-exchange protein 2, part of the gene
for CSE1L (chromosome segregation 1
(yeast homolog)-like), ESTs, STSs, GSSs
and a CpG Island, complete sequence. ACCESSION
AL121903 VERSION AL121903.13
GI7330682 KEYWORDS HTG brefeldin CpG
Island CSE1L. SOURCE human. ORGANISM
Homo sapiens Eukaryota Metazoa
Chordata Craniata Vertebrata Euteleostomi
Mammalia Eutheria Primates Catarrhini
Hominidae Homo. REFERENCE 1 (bases 1 to
80600)
gt java FeatureExtract AL121903.genbank gene
60
FeatureExtract Output
gt java FeatureExtract AL121903.genbank mRNA
61
FeatureExtract Output
Cgtjava FeatureExtract AL121903.genbank
repeat_region
62
Proteomics Tools
  • Package org.biojava.bio.proteomics
  • Digest
  • This class contains methods for calculating the
    results of proteolytic digestion of a protein
    sequence this class is not designed to be thread
    safe
  • IsoelectricPointCalc
  • class that computes isoelectric point for
    proteins
  • MassCalc
  • MassCalc calculates the mass of peptides which
    for our purposes are SymbolLists which contain
    Symbolsfrom the protein Alphabet.
  • Protease
  • The protease class stores parameters needed by
    Digest to digest a protein sequence.
  • ProteaseManager
  • Registry and utility methods for Proteases.
  • StructureTools

63
Example TrypticDigest
  • Motivation
  • Widely used method in proteomics to identify
    which proteins are present in a cell state
    using mass spectrometry
  • Tools
  • Digest
  • MassCalc
  • Protease

64
TrypticDigest
public static void trypticDigest(Sequence seq)
throws BioException, ChangeVetoException
Digest digest new Digest()
digest.setMaxMissedCleavages(MAXMISSCLEAVAGE)
digest.setProtease(ProteaseManager.getProteaseByN
ame(Protease.TRYPSIN)) digest.setSequence(seq
) digest.addDigestFeatures() Iterator
iterator digest.getSequence().features()
while (iterator.hasNext()) // iterate
through the features SimpleFeature f
(SimpleFeature) iterator.next()
SymbolList symList f.getSymbols() //
calculate the mass (in Daltons) MassCalc
mc new MassCalc(SymbolPropertyTable.AVG_MASS,
true) double mass mc.getMass(symList)
// output System.out.println(f.
getLocation().toString() "\t" mass
"\t" symList.seqString())
65
TrypticDigest
  • Input
  • gtgi16761190refNP_456807.1 ecotin precursor
    Salmonella enterica subsp.
  • MKMFVPAVVFAALASASAWANNGDTAQPLEKIAPYPQAEKGMKRQVITLT
    PQQDESTLKVELLIGQTLNV
  • DCNQHRLGGTLETKTLEGWGYDYYVFDNVTSPVSTMMACPEGKKEQKFVT
    AWLGEDGMLRYNSKLPIVVY
  • TPANVDVKYRIWKADANVQNAIAR

66
TrypticDigest
67
Annotations
  • Tell us something about the whole record
  • Structured as keyvalue pairs
  • Examples
  • Author list
  • PubMed ID
  • Organism

68
Annotation Interface
  • asMap()
  • Return a map that contains the same key/values as
    this Annotation.
  • containsProperty(java.lang.Object key)
  • Returns whether there is any property under that
    key in this Annotation.
  • getProperty(java.lang.Object key)
  • Retrieve the value of a property by key.
  • keys()
  • Get a set of key objects.
  • setProperty(java.lang.Object key,
    java.lang.Object value)
  • Set the value of a property.

69
Example AnnotationExtract
public class AnnotationExtract public static
void main(String args) throws Exception if
(args.length ! 1) throw new
Exception("usage java AnnotationExtract
GENBANKFILE") String filename
args0 BufferedReader gbreader new
BufferedReader(new FileReader (filename)) Seque
nceIterator si SeqIOTools.readGenbank(gbreader)
while (si.hasNext()) Sequence seq
si.nextSequence() Annotation annot
seq.getAnnotation() java.util.Set keys
annot.keys() java.util.Iterator i
keys.iterator() while (i.hasNext())
Object key i.next() if
(annot.containsProperty(key))
System.out.println(key " "
annot.getProperty(key))
70
AnnotationExtract Output
AnnotationExtract AF000580.gbff
TYPE DNA ACCESSION AF000580 genbank_accessions
AF000580 KEYWORDS . ORGANISM Dictyostelium
discoideum Eukaryota Mycetozoa Dictyosteliida
Dictyostelium. LOCUS AF000580 JOURNAL Genetics
148 (3), 1117-1125 (1998), Submitted
(21-APR-1997) Biology, Utah State University,
Logan, UT 84322-5305, USA DIVISION
INV REFERENCE 1 (bases 1 to 14955), 2 (bases
1 to 14955) MDAT 21-APR-1998 DEFINITION
Dictyostelium discoideum plasmid Ddp5, complete
genome. VERSION AF000580.1 TITLE Dictyostelium
discoideum nuclear plasmid Ddp5 is a chimera
related to the Ddp1 and Ddp2 plasmid families,
Direct Submission CIRCULAR circular SIZE
14955 ORIGIN SOURCE Dictyostelium
discoideum MEDLINE 98198836 GI 3068582 AUTHORS
Rieben,W.K. Jr., Gonzales,C.M., Gonzales,S.T.,
Pilkington,K.J., Kiyosawa,H., Hughes,J.E. and
Welker,D.L., Rieben,W.K., Gonzales,C.,
Gonzales,S.T., Pilkington,K., Kiyosawa,H.,
Hughes,J.E. and Welker,D.L. PUBMED 9539429
71
Sequence Tools
  • DNATools
  • RNATools
  • ProteinTools

72
DNATools
  • a() returns static AtomicSymbol for a
  • c() returns static AtomicSymbol for c
               
  • complement(Symbol sym) Complement the symbol.
  • complement(SymbolList list) Retrieve a
    complement view of list.
  • complementTable() Get a translation table for
    complementing DNA symbols.
  • createDNA(java.lang.String dna) Return a new DNA
    SymbolList for dna.
  • dnaToken(Symbol sym) Get a single-character
    token for a DNA symbol
  • forIndex(int index) Return the symbol for an
    index - compatible with index.
  • forSymbol(char token) Retrieve the symbol for a
    char.
  • g() returns static AtomicSymbol for g         
  • getDNA() Return the DNA alphabet.
  • index(Symbol sym) Return an integer index for a
    symbol - compatible with forIndex.
  • n() returns static AtomicSymbol for n
  • reverseComplement(SymbolList list) Retrieve a
    reverse-complement view of list.
  • t() returns static AtomicSymbol for t

73
RNATools
  • complementTable()
  • Get a translation table for complementing DNA
    symbols
  • createRNA(java.lang.String rna)
  • Return a new RNA SymbolList for rna.
  • getGeneticCode(java.lang.String name)
              Retrieve a TranslationTable by name.
  • getGeneticCodeNames()           Retrieve a Set
    containing the name of each genetic code.
  • getRNA()           Return the RNA alphabet.
  • reverseComplement(SymbolList list)
              Retrieve a reverse-complement view of
    list.
  • transcribe(SymbolList list)           Transcribe
    DNA into RNA.
  • transcriptionTable()           Get a translation
    table for converting DNA to RNA.
  • translate(SymbolList syms)           Translate
    RNA into protein (with termination symbols).

74
ProteinTools
  • createProtein(java.lang.String theProtein)
              Return a new Protein SymbolList for
    protein.
  • getAlphabet()           Gets the protein
    alphabet
  • getSymbolPropertyTable(java.lang.String name)
               
  • getTAlphabet()           Gets the protein
    alphabet including the translation termination
    symbols

75
sequence tools Translate
public class Translate public static void
main(String args) throws Exception if
(args.length ! 1) throw new
Exception("usage java Translate Genbankfile")
String filename args0 // open the
file FileInputStream fis new
FileInputStream(filename) SequenceDB seqdb
SeqIOTools.readFasta(fis, DNATools.getDNA()) Se
quenceIterator si seqdb.sequenceIterator() wh
ile (si.hasNext()) // get the dna
sequence Sequence dna si.nextSequence()
// reverse complement the dna
sequence SymbolList dnaList
DNATools.reverseComplement(dna) // transcribe
the dna SymbolList rna RNATools.transcribe(dn
aList) System.out.println(dna.getName()
"") for (int i 1 i lt 3 i) //
translate the rna in 3 reading frames int
length (rna.length())-((rna.length()-i1)3)
SymbolList protein RNATools.translate(rna.sub
List(i, length)) System.out.println("Frame "
i " " protein.seqString())
76
Translate Input
  • FASTA File of ESTs from Fugu rubripes

gtFEEFRa010apsH5 AGACAACACACGGGGAACCAGGAACGTCAGGCC
CCTTACACTGGGATCCAGAGCCACAGC GCGACAGAAAGTTAAAGGATAA
ACTCTAACGGAACACAATGGCAGACAAAATTAAAGATG CCAAGATCATC
TTTGTCGTGGGTGGGCCTGGTTCTGGAAAGGGCACTCAGTGTGAGAAAA
TAGTTGCAAAGTATGGCTACACCCATCTGTCATCCGGGGATCTGCTCCGT
GCTGAAGTGG CCTCTGGCTCCGAGAGGGGCAAGCAGCTCCAGGCCATCA
TGCAGAAGGGAGAGCTTGTTC CCCTGGACACCGTCTTAGACATGATTAA
GGATGCCATGATCGCCAAGGCCGACGTGTCCA AGGGCTTCCTCATTGAC
GGCTACCCCCGTGAGGTGGAGCAGGGCGAGGAGTTTGAGAAGA AGATCG
GCAGACCCTGCCTGGTGCTGGACGTTGACGCCAGGGGGGAGACCATGGGC
AAGA gtFEEFRa010apsC2 CGAGGTGCGGACTGTGCTGTCGCTCGTC
AGGGCCCAGGACGCCTACGCTCGCCTGCCCGA GAGCTACAGGAACGGCG
TGAATCTGACTCTGGAGCAGCTGAACTCTCACACTGGGGTCCA TCACCA
TTTCCGCTTCTTGAAAAGTCTGGAAAAGTCAGAAATTGAGTCTGGTTTTG
GTGT GAGTTACCTCTACCACCATTTTTACCTGAAGCCAACGTGGTGCGC
CAAAGGAACAGAAGA GTCGAACCCTGAGGCGTGTGCCTTCAGGAACGAC
AGGCCGCTGATGGACTGCGCGATCTG CTACAAAACAGCGAACAATGTGA
TGGAGGCCAACCCAAAGCCATATGTGCACTGCATCCA AAAGCCAAGGAC
TCACACCGGGCATGAGGAGCAGTAGGACAGAACATTGCAGAAAAATGA G
CTACAACAGCGGAGCTCCAACACTTTTGGCTGTGAAAGTCGGCTGAGGCA
ACACGTGGA GCAACT
77
Translate Output
78
Loading Sequences
  • Three main ways to do this
  • Local Files
  • Local Databases
  • Remote Databases

79
Loading Sequences Remote URL
  • How do we retrieve a FASTA file remotely from
    Genbank?
  • Code from Ratiba Terbaoui

//String idSeq SOMEACCESSIONNUMBER URL url
new URL("http//www.ncbi.nlm.nih.gov/htbin-post/En
trez/query?dbnform6doptfhtmlnotitlenouid
" idSeq) BufferedReader in new
BufferedReader(new InputStreamReader(url.openStre
am())) SequenceIterator stream
SeqIOTools.readFastaDNA(in) String
InputLine while ((InputLine in.readLine()) !
null) Sequence seq1 getSequence(stream) Sys
tem.out.println(seq1.seqString())
80
Adding Annotation to a Sequence
  • Our main goal in bioinformatics
  • GFF format
  • Widely used for sequence analysis and annotation

NAME lttabgt SOURCE lttabgt FEATURE lttabgt START lttabgt
END lttabgt SCORE lttabgt STRAND lttabgt PHASE lttabgt
ANYTHING COMMENT
81
GFF Parsing
  • BioJava has a gff package containing libraries
    for parsing and manipulating GFF files
  • org.biojava.bio.program.gff.GFFParser

82
Example Fasta2Features
  • Takes in a FASTA file, a GFF file and a source
    string
  • Filters the gff file by source
  • Annotates the FASTA file
  • Prints out the features and subsequences
    associated with those features

83
Fasta2Features
public class Fasta2Features public static
void main(String args) throws Exception
if(args.length ! 3) throw new
Exception("Use Fasta2Features fastafile GFFfile
source") String fastafile args0
String gfffile args1 String source
args2 //load the sequence
FileInputStream fis new FileInputStream(fastafil
e) SequenceDB seqdb SeqIOTools.readFasta(fi
s, DNATools.getDNA()) // load in the GFF
System.out.println("Loading gff with inputted
source") GFFEntrySet gffEntries new
GFFEntrySet() GFFRecordFilter.SourceFilter
sFilter new GFFRecordFilter.SourceFilter()
sFilter.setSource(source) GFFFilterer
filterer new GFFFilterer(gffEntries.getAddHandle
r(), sFilter) GFFParser parser new
GFFParser() parser.parse(new
BufferedReader( new InputStreamReader( new
FileInputStream(new File(gfffile)))), filterer)
// add the features to the sequences
System.out.println("Adding features from gff to
sequences") SequenceDB aSeqDB new
AnnotatedSequenceDB(seqdb, gffEntries.getAnnotator
()) // use member method to do the
printing PrintFeatures(aSeqDB)
84
Fasta2Features methods
public static void PrintFeatures(SequenceDB
aSeqDB) // print the features we just added
to our sequence SequenceIterator si
aSeqDB.sequenceIterator() while (si.hasNext())
Sequence seq si.nextSequence() Iterator
feati seq.features() while
(feati.hasNext()) Feature feat (Feature)
feati.next() System.out.println(feat.getType(
) " " feat.getLocation().toString())
System.out.println(seq.subStr(feat.getLocat
ion().getMin(), feat.getLocation().getMax())
)
85
Fasta2Features example input
Example Input
FASTA file containing 1 sequence
gtID1 aaaaaactgggaacaaagcaatcaaaagaaatgtatgtggccggg
cgtcatggctcacgc ctgtaatcccagcactttgggaggctgaggcagg
tggatcacaggtcaggagatcgagac cgtcctggctaacatggtgaaac
cccatctctactaaaaatacaaaagtttagcccggcg tggtggcgagcg
cctgtagtcccagctactcaggaggctgaggcagaagaatggcgtgaa c
ccgggaggcagagcttgcagtgagctgagatcgtgccactgccctccagc
ctgggcgac agagcgagactccgtcttaaaaaaaaaaaaaaaaaaaaaa
gaaatgtatgtatgtatgta tctgtgatgtcatttgcctcttaagggaa
aggtaataataacagcttcctaccaaaacac atttcatgatctcaagaa
aattttctcaaacctggtggtgaagagttataataccaatgc Ttaaaaa
tagtgttagtaaaaaaacaatccagaaattctgtacattgaatagaggaa
aat
GFF file containing 1 gene and 1 exon annotation
gff-version 1 date 2000-02-18 I'm just
making this up ID1 cbw gene 10 500 0 . cbw_gene
ID1 cbw exon 15 400 0 . cbw_exon
86
Fasta2Features output
gt java Fasta2Features ID1.fa.txt gfftest.gff.txt
cbw
87
BLAST parsing
  • Similar parser for BLAST, WuBLAST, FASTA, HMMer
  • See Biojava in Anger for detailed example
  • http//www.biojava.org/docs/bj_in_anger/BlastParse
    r.htm

88
MEME parsing
  • Sequence motif finder
  • Use Meme class

89
Other BioJava facilities
  • BioSQL interface
  • HMM utilities
  • Dynamic Programming
  • Model trainers
  • Profile makers
  • Weight Matrices
  • DAS distributed annotation server
  • Allows you to serve up your annotations on
    Ensembl

90
BioJava concluding thoughts
  • Still in rapid development
  • Will always be
  • Missing some critical data types
  • gene expression, molecular interactions
  • Subscribe to mailing lists
  • Always more informative than static documentation
  • Contribute code
  • Forces you to write good code
  • Everyone wins
  • OpenSource philosophy is very powerful

91
Summary
  • Existing packages can be extremely useful and can
    dramatically increase development time
  • Look before you code
  • Biojava provides a useful codebase for working
    with some bioinformatics data
  • Very well-developed tools for sequence modeling
    and manipulation
  • Go Open Source

92
End
  • Resources
  • http//www.biojava.org
  • Biojava wwwsite
  • http//biojava.org/docs/api/index.html
  • Javadoc API docs
  • http//biojava.org/docs/bj_in_anger/index.htm
  • Biojava in Anger
  • sohrab_at_bioinformatics.ubc.ca
  • me

93
Sequence I/O
  • Motivation
  • Want to input sequence data from different
    databases in different formats
  • Want to be able to write sequence records in
    different formats
  • Want to convert between formats
  • Want to easily load data into memory
  • Single records
  • Multiple records

94
Sequence I/O
  • Code

95
Sequence manipulation
  • Motivation
  • Often, we want to compute on sub-sequences of
    an inputted sequence
  • Could correspond to a feature in a sequence
  • Could need to do a sliding window analysis
  • Perform biological operations
  • Transcription on DNA
  • Translation on RNA
  • Tryptic digest

96
Sequence manipulation
  • Code

97
Sequence distribution
  • Motivation
  • We often need to know the background noise
    level relative to a signal
  • Gives statistical robustness to a detected
    pattern
  • Can generate random sequences from the
    distribution and compare observations in real and
    random data

98
Sequence distribution
  • Code

99
Interpreting Similarity search data
  • Motivation
  • Volume of computational results to great to sift
    through manually
  • We need to be able to extract parts of reports
    and effectively summarize the results
  • Top BLAST hits

100
Similarity searching results parsing
  • Code

101
Translating an mRNA
  • NM_080430

Summary This gene encodes a selenoprotein
containing selenocysteine at the active site. The
selenocysteine is encoded by the ususal nonsense
codon UGA. This gene is expressed in a variety of
tissues, and the protein is localized to the
perinuclear structures.
102
NM_080430
DEFINITION Homo sapiens selenoprotein M (SELM),
mRNA.
CDS 61..498 /gene"SELM" /note"selenoprotein
SelM" /selenocysteine /codon_start1
/transl_except(pos202..204,aaSec)
/product"selenoprotein M precursor"
/protein_id"NP_536355.1"
103
Other Packages
  • PAL
  • MADAM TIGR
  • CABIO
  • MAGE SDK
  • Ensj/MartJ
  • LSID IBM Life Science
  • Many, many more.

104
TODO
  • FeatureExtract output for Dicty seq
Write a Comment
User Comments (0)
About PowerShow.com