1
Introduction to EMBOSS
  • Christine Ho
  • chrisho@cc.hku.hk

2
Web page of EMBOSS
  • The EMBOSS programs are available at
    http://bioinfo.hku.hk/EMBOSS/
  • The files required for this lecture are available
    at http://bioinfo.hku.hk/tutorial/
  • Users must apply for a BIOINFO account to
    use the tools on the web and off-line, and to
    download the databases.
  • Anyone may register for a BIOINFO account free
    of charge, but usage of BIOINFO is restricted
    to academic and research purposes only.
  • How to apply for a BIOINFO account:
  • HKU members: submit the HKUESD application
    form (Cfe-139)
  • Non-HKU members: submit the application form at
    http://www.hku.hk/ccoffice/forms/cf139.pdf
  • Questions and comments: biosupport@bioinfo.hku.hk

3
What is EMBOSS?
  • EMBOSS (the European Molecular Biology Open
    Software Suite) is a free, open-source package
    that provides a comprehensive set of sequence
    analysis programs developed specifically for
    the needs of the molecular biology user
    community.
  • Within EMBOSS you will find around 100 programs
    (applications).
  • More information about EMBOSS can be found at
    http://www.uk.embnet.org/Software/EMBOSS/

4
Main Programs in EMBOSS
  • Retrieve sequences from database
  • Sequence alignment
  • Nucleic acid gene finding and translation
  • Protein secondary structure prediction
  • Rapid database searching with sequence patterns
  • Protein motif identification, including domain
    analysis
  • Nucleotide sequence pattern analysis, for example
    to identify CpG islands or repeats.
  • Codon usage analysis for small genomes
  • Rapid identification of sequence patterns in
    large scale sequence sets
  • Presentation tools for publication

5
Starting EMBOSS
  • There are three ways to start EMBOSS
  • Command line, after logging in to bioinfo.hku.hk
  • Web interface (EMBOSS-GUI)

6
Command line of EMBOSS
  • Inside HKU campus
  • telnet bioinfo.hku.hk
  • Outside HKU campus
  • Windows machine
  • Use PuTTY; see http://bioinfo.hku.hk FAQ Q13
  • Linux or UNIX machine
  • ssh <username>@bioinfo.hku.hk

7
Web interface of EMBOSS
  • Directly access the web page at
  • http://bioinfo.hku.hk/EMBOSS/
  • Or browse the BIOSUPPORT homepage at
    http://bioinfo.hku.hk/ and select the Tools option

8
Web interface of EMBOSS
  • Click on the link EMBOSS - GUI

9
Programs in EMBOSS
  • Parameters in EMBOSS
  • Input can be
  • A Uniform Sequence Address (USA) in one of these
    formats:
  • database
  • database:entry_name or database:accession_number
  • (e.g. embl:xlrhodop or embl:L07770)
  • database:wildcard (e.g. sw:opsd_a*)
  • filename
  • filename:entry
  • format::filename
  • @list
  • Sequence data pasted into the text area.
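
As a rough illustration of how the USA forms listed above break down, the following Python sketch splits a USA string into its parts. The names USA and parse_usa are hypothetical and invented for this example; this is not EMBOSS's own parser, and it ignores edge cases such as file paths that contain colons.

```python
from typing import NamedTuple, Optional


class USA(NamedTuple):
    fmt: Optional[str]    # explicit format, e.g. "fasta" in "fasta::myfile"
    source: str           # database name, file name or list file name
    entry: Optional[str]  # entry name, accession number or wildcard
    is_list: bool         # True for the "@list" form


def parse_usa(usa: str) -> USA:
    """Split a USA string into its parts (illustrative only)."""
    if usa.startswith("@"):
        # "@list": a file that itself contains a list of USAs
        return USA(None, usa[1:], None, True)
    fmt = None
    if "::" in usa:
        # "format::filename": an explicit format prefix
        fmt, usa = usa.split("::", 1)
    # "database:entry" or "filename:entry"; the entry part is optional
    source, sep, entry = usa.partition(":")
    return USA(fmt, source, entry if sep else None, False)


for example in ["embl:xlrhodop", "sw:opsd_a*", "fasta::myfile.seq", "@mylist"]:
    print(example, "->", parse_usa(example))
```

The sketch cannot tell a database name from a file name; EMBOSS decides that from its own configuration.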

10
Programs in EMBOSS
  • Output will be
  • Textual and/or graphical representation of data.
  • The output can be saved as text file or in some
    cases image file in PNG or PS format.

11
EMBOSS online help
  • The documentation for EMBOSS is available at
    http://bioinfo.hku.hk/emboss/

12
Difference between GCG and EMBOSS
13
Replacement of GCG programs
  • Exchanging sequences between packages

14
Replacement of GCG programs
  • Sequence editing, manipulation and display

15
Replacement of GCG programs
  • Translation
  • Sequence comparison and alignment

16
Replacement of GCG programs
  • Patterns and gene finding

17
Replacement of GCG programs
  • Phylogeny
  • Mapping

18
Replacement of GCG programs
  • Protein analysis
  • Primer selection

19
Replacement of GCG programs
  • Keyword-based databank searching

20
Running EMBOSS program
  • EMBOSS programs are run by typing them at the
    Unix prompt, or by using an interface.
  • The EMBOSS command syntax follows normal Unix
    command conventions.
  • Programname -help
  • to get some help on the options.
  • Programname -opt
  • to make the program prompt you for common
    options.
  • tfm programname
  • to get the full help on a program.

21
Login bioinfo
  • Log in to bioinfo with telnet bioinfo.hku.hk
  • If you are using the temp account, please create
    a directory named after your username at hkusua
  • bioinfo mkdir <username>
  • E.g. bioinfo mkdir chantaiman
  • Change to the directory you created
  • bioinfo cd <username>
  • E.g. bioinfo cd chantaiman

22
wossname
  • It is easy to forget the name of a program.
  • To find EMBOSS programs, use wossname
  • wossname finds programs by looking for keywords
    in the description or the name of the program.

23
wossname
  • Type wossname at the Unix prompt
  • bioinfo wossname
  • Displays one-line description.
  • Prompts you for information
  • Finds programs by keywords in their one-line
    documentation
  • Keyword to search for: restrict
  • SEARCH FOR 'RESTRICT'
  • recode   Remove restriction sites but
    maintain the same translation
  • remap    Display a sequence with
    restriction cut sites, translation
  • etc.

24
Optional parameters
  • To get prompted for all the optional parameters,
    type the following
  • bioinfo wossname -opt
  • Finds programs by keywords in their one-line
    documentation
  • Keyword to search for: protein
  • Output program details to a file [stdout]: myfile
  • Format the output for HTML [N]:
  • String to form the first half of an HTML link:
  • String to form the second half of an HTML link:
  • Output only the group names [N]:
  • Output an alphabetic list of programs [N]:
  • Use the expanded group name [N]:

25
help
  • bioinfo wossname -help
  • Mandatory qualifiers:
  • -search string   Enter a word or words
    here.
  • Optional qualifiers (* if not always prompted):
  • -outfile outfile   This program will write the
    program names
  • Advanced qualifiers:
  • -noemboss bool   EMBOSS program
    documentation will be searched.
  • Mandatory: required; these are often parameters
    (in [])
  • Optional: use -opt to be prompted for these.
  • Advanced: things that are not often used!
26
Writing to the screen
  • Note that the default output file for wossname
    was
  • stdout (Standard output)
  • Use this whenever prompted for an output file.
  • This is a magic file name.
  • It displays the output on the screen, not a file.

27
Working with sequences
  • EMBOSS reads sequences from files or databases.
  • It automatically recognizes the input sequence
    format.
  • You can easily specify many output formats.

28
Getting sequences from the databases
  • Database single entry (ID)
  • database:entry
  • For example: embl:hsfau
  • Wildcarded entries (Query)
  • database:hs*
  • For example: sw:fos_*
  • All entries
  • database
  • Most databases support all 3 methods; some
    may not.

29
showdb
  • bioinfo showdb
  • Displays information on the currently available
    databases
    Name    Type  ID  Qry  All  Comment
    domo    P     OK  OK   OK   DOMO sequences
    enspep  P     OK  OK   OK   ENSEMBL PEP sequences
    gp      P     OK  OK   OK   GENPEPT sequences
    gpnew   P     OK  OK   OK   New GENPEPT sequences
    kabatp  P     OK  OK   OK   KABAT Protein sequences
    nrl     P     OK  OK   OK   NRL_3d
    pdb     P     OK  OK   OK   PDB sequences
    pir     P     OK  OK   OK   PIR using NBRF access for 4 files
    rem     P     OK  OK   OK   REMTREMBL sequences

30
seqret
  • Reads in a sequence, and writes it out.
  • bioinfo seqret
  • Reads and writes (returns) a sequence
  • Input sequence: embl:xlrhodop
  • Output sequence [xlrhodop.fasta]:
  • bioinfo more xlrhodop.fasta
  • >XLRHODOP L07770 Xenopus laevis rhodopsin
  • ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac
  • acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac
  • .
  • .
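
To show what the FASTA file written by seqret contains, here is a minimal Python sketch that reads FASTA-formatted lines into a dictionary. The function name read_fasta is hypothetical; it assumes the simple layout shown above (a ">" header line whose first word is the identifier, followed by sequence lines) and is not part of EMBOSS.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The first word after '>' is taken as the identifier
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            # Sequence lines are concatenated until the next header
            records[current] += line
    return records


example = [
    ">XLRHODOP L07770 Xenopus laevis rhodopsin",
    "ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac",
    "acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac",
]
for name, seq in read_fasta(example).items():
    print(name, len(seq), "bases read")
```

With a real file, pass an open file object instead of the example list, e.g. read_fasta(open("xlrhodop.fasta")).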

31
seqret from the command line
  • Give seqret all of its data on the command-line.
  • It doesn't need to prompt for anything else.
  • bioinfo seqret embl:xlrhodop -outseq
    xlrhodop.fasta
  • The -outseq can be abbreviated to -out.
  • Any abbreviation must be unique.
  • Even shorter: leave out the qualifier
  • bioinfo seqret embl:xlrhodop xlrhodop.fasta

32
Changing output formats (reformatting)
  • seqret can reformat sequences by specifying the
    output format
  • bioinfo seqret embl:xlrhodop xlrhodop.gcg
    -osformat gcg
  • bioinfo more xlrhodop.gcg
  • !!NA_SEQUENCE 1.0
  • Xenopus laevis rhodopsin mRNA, complete cds.
  • XLRHODOP Length: 1684 Type: N Check: 9453 ..
  • 1 ggtagaacag cttcagttgg gatcacaggc ttctagggat
    cctttgggca
  • 51 aaaaagaaac acagaaggca ttctttctat acaagaaagg
    actttataga
  • .
  • .

33
Multiple sequences, single files
  • You can use seqret to retrieve multiple sequences
    into a file
  • bioinfo seqret sw:opsd_a* opsd_a.seqs
  • This retrieves all the sequences whose
    identifiers start with opsd_a into a file
    called opsd_a.seqs.

34
Multiple sequences, many files
  • If you wish to write one sequence per file, use
  • bioinfo seqret sw:opsd_a* -ossingle
  • The output filenames will be based on the
    sequence entry names.
  • The program seqretsplit will split an existing
    multiple sequence file into many files.

35
Asterisk on the command line
  • You can't use a bare * on the UNIX command line.
  • UNIX tries to match it to filenames.
  • Quote it, either with quotes or a backslash:
  • "embl:*"
  • embl:\*
  • For example
  • bioinfo seqret "embl:hsf*" hsf.seq
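
When a command line is built from a script rather than typed by hand, the same quoting problem applies. The sketch below uses Python's standard shlex.quote to protect a wildcarded USA from the shell. The helper name seqret_command is hypothetical, and the sketch only builds the command string; it does not run EMBOSS.

```python
import shlex


def seqret_command(usa: str, outfile: str) -> str:
    """Build a shell-safe seqret command line; arguments are quoted if needed."""
    return " ".join(shlex.quote(arg) for arg in ["seqret", usa, outfile])


# The '*' forces quoting, so the shell passes the wildcard through to EMBOSS
print(seqret_command("embl:hsf*", "hsf.seq"))
# A USA without shell metacharacters is left unquoted
print(seqret_command("embl:xlrhodop", "xlrhodop.fasta"))
```

Single quotes, as produced here, have the same effect as the double quotes or backslash shown on the slide.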

36
EMBOSS web interface
  • On the left, you can choose the program to run.
    You can also see all the programs sorted
    alphabetically instead of by group by
    clicking on the link.

37
Getting help in EMBOSS
  • Help on the program is available by clicking on
    the question mark.

38
Input to EMBOSS
  • If you know the entry name or accession number,
    enter the sequence as a Uniform Sequence
    Address (USA)
  • E.g. embl:xlrhodop

39
Input to EMBOSS
  • If you have your own sequence file, upload the
    sequence by clicking the browse button.

40
Input to EMBOSS
  • You can also copy and paste your own sequence
    into the text area.

41
seqret web interface
  • E.g. seqret - retrieving single sequence
  • Input
  • USA path: embl:xlrhodop
  • Output file format: GCG 9.x/10.x
  • Output
  • The sequence retrieved in GCG format

42
seqret
43
seqret
44
seqret
  • seqret: retrieving multiple sequences
  • Input: sw:ops2_*. Output file format: Pearson
    FASTA
  • Output: multiple sequences with identifiers
    starting with ops2_.
  • Save the file as ops2.fasta by right-clicking on
    the link

45
coderet
  • Extract CDS, mRNA and translations from feature
    tables. If any sequences are in other entries of
    that database, they are automatically fetched and
    incorporated correctly into the final sequence.
  • Input: embl:X03487

46
coderet
  • Output

47
dottup
  • dottup: comparison between 2 sequences using
    dot-plots.
  • Input
  • 1st sequence: embl:xl23808 (Xenopus laevis
    rhodopsin gene)
  • 2nd sequence: embl:xlrhodop (Xenopus laevis
    rhodopsin cDNA from complement of mRNA)
  • Output
  • A dotplot in PNG format, with diagonal lines
    marking areas where the two sequences align
    well.
  • The image can be saved to your computer.

48
dottup
49
dottup
  • The 5 diagonal lines represent areas where the
    two sequences align well.
  • Since this is aligning genomic and cDNA, the five
    diagonals represent the five exons of the gene.
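
The idea behind a word-based dot-plot can be sketched in a few lines of Python: every position where a word of fixed length occurs identically in both sequences becomes a dot, and runs of dots along a diagonal form the lines seen in the plot. The function name word_matches and the word size used here are illustrative assumptions, not dottup's implementation or defaults.

```python
from collections import defaultdict
from typing import List, Tuple


def word_matches(seq_a: str, seq_b: str, wordsize: int = 4) -> List[Tuple[int, int]]:
    """Return (i, j) start positions where a word of seq_a equals a word of seq_b."""
    # Index every word of seq_b by its start position
    index = defaultdict(list)
    for j in range(len(seq_b) - wordsize + 1):
        index[seq_b[j:j + wordsize]].append(j)
    # Look up every word of seq_a in that index
    dots = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in index.get(seq_a[i:i + wordsize], []):
            dots.append((i, j))
    return dots


# Consecutive dots such as (0, 2), (1, 3) lie on one diagonal (j - i is constant)
print(word_matches("ACGTACGT", "TTACGTAA"))
```

In a genomic versus cDNA comparison, each exon produces one such diagonal, offset from the next by the length of the intervening intron.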

50
Pairwise Sequence Alignment
  • An alignment is an arrangement of two sequences
    which shows where the two sequences are similar,
    and where they differ.
  • There is no unique, precise, or universally
    applicable notion of similarity.

51
Global Alignment
  • A global alignment is one that compares the two
    sequences over their entire lengths, and is
    appropriate for comparing sequences that are
    expected to share similarity over the whole
    length.
  • The alignment maximizes regions of similarity and
    minimizes gaps using the scoring matrices and gap
    parameters provided to the program.

52
needle
  • Function
  • Needleman-Wunsch global alignment
  • Description
  • This program uses the Needleman-Wunsch global
    alignment algorithm to find the optimum alignment
    (including gaps) of two sequences when
    considering their entire length.
  • The computation is rigorous.
  • It can be time consuming to run if the sequences
    are long.
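
The dynamic-programming recurrence behind Needleman-Wunsch can be sketched as follows. This is a simplified illustration, not the needle program: it assumes a flat match/mismatch score and a single linear gap penalty, whereas needle uses a substitution matrix with separate gap opening and extension penalties. The function name and default scores are hypothetical.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score with a linear gap penalty."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    # The bottom-right cell covers both sequences over their entire length
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

Every cell of the len(a) by len(b) table is filled, which is why the computation is rigorous but slow for long sequences.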

53
Input sequence for needle
54
needle
  • needle - Needleman-Wunsch global alignment
  • Input: 1st sequence embl:xlrhodop, 2nd sequence
    embl:xl23808
  • Output: global alignment showing the 5 aligned
    regions.

55
Local alignment
  • Local alignment searches for regions of local
    similarity and need not include the entire length
    of the sequences.
  • Local alignment methods are very useful for
    scanning databases or other circumstances when
    you wish to find matches between small regions of
    sequences, for example, between protein domains.

56
water
  • Function
  • Smith-Waterman local alignment.
  • Description
  • Water uses the Smith-Waterman algorithm (modified
    for speed enhancements) to calculate the local
    alignment.
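
For comparison with the global method, here is a minimal Python sketch of the Smith-Waterman recurrence. It differs from the global version in two ways: no cell may drop below zero, and the result is the best cell anywhere in the table. As an assumption for brevity it uses a flat match/mismatch score and a linear gap penalty rather than the matrix and gap parameters that water accepts; the names and default scores are hypothetical.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Only the shared ACGT core contributes; the unrelated flanks are ignored
print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```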

57
water
  • water - Smith-Waterman local alignment.
  • Input: 1st sequence embl:xlrhodop, 2nd sequence
    embl:xl23808
  • Output: local alignment showing the 5 aligned
    regions.

58
Multiple Sequence Analysis
  • Multiple sequence alignments are used
  • To find patterns to characterize protein
    families.
  • To detect or demonstrate homology between new
    sequence and existing families of sequences.
  • To help predict the secondary and tertiary
    structures of the new sequences.
  • As an essential prelude to molecular
    evolutionary analysis.

59
emma
  • Function
  • Multiple alignment program - interface to
    ClustalW program
  • Description
  • EMMA calculates the multiple alignment of nucleic
    acid or protein sequences according to the method
    of Thompson, J.D., Higgins, D.G. and Gibson, T.J.
    (1994). This is an interface to the ClustalW
    distribution.

60
Upload file to emma
  • Input: output from seqret (ops2.fasta), which
    retrieved all SwissProt sequences whose
    identifiers begin with ops2_
  • Click on the browse button to upload the file
    ops2.fasta

61
Input sequence to emma
  • ops2.fasta

62
emma
  • emma: interface to the ClustalW program
  • Output: multiple alignment saved as file
    ops2.aln.

63
prettyplot
  • prettyplot displays aligned sequences, with
    colouring and boxing
  • Input: output from program emma (ops2.aln)
  • Output: graphic display of aligned sequences.
    Identical residues in red, similar residues in
    green.

64
prophecy
  • Function
  • Creates matrices/profiles from multiple
    alignments
  • Description
  • This creates a profile matrix file from a nucleic
    acid or a protein sequence alignment.
  • The profile matrix file can then be used by
    program profit or prophet.

65
prophecy
  • Input
  • Sequence: output from program emma (ops2.aln)
  • Select type: Gribskov

66
prophecy
  • Output: A profile to be saved as ops2.prophecy.
    This profile allows a new sequence to be aligned
    optimally to a family of similar sequences in the
    program prophet.

67
prophet
  • prophet: gapped alignment for profiles
  • Input
  • Input sequence: the file xlrhodop.pep, output
    from transeq of the sequence embl:xlrhodop,
    region 110-1171.
  • Profile or matrix file: ops2.prophecy
  • Output file: ops2.prophet
  • Output: the gapped alignment to the profile. The
    vertical bars (|) represent residues that are
    identical between the ops2 consensus and our
    rhodopsin, while the colons (:) represent
    conservative substitutions. Aligning members of a
    family can reveal conserved regions that may be
    important for structure and/or function.

68
prophet
  • Output

69
plotorf
  • plotorf plots potential open reading frames
  • Input sequence: embl:xlrhodop
  • Output: graphical output showing the potential
    open reading frames in all six frames.
  • The longest ORF is in the second frame.
  • The correct open reading frame is the second
    frame.

70
getorf
  • getorf - Finds and extracts open reading frames
    (ORFs)
  • Input
  • Sequence: embl:xlrhodop
  • Type of sequence to output: Nucleic sequence
    between START and STOP codons
  • Output: textual information on each region and
    the sequence of that region.
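
The "between START and STOP codons" option can be pictured with a small Python sketch that scans the three forward reading frames for an ATG followed in frame by a stop codon. This is an assumption-laden simplification, not getorf itself: it ignores the reverse strand and alternative genetic codes, and the names find_orfs and min_codons are hypothetical.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) for ATG..stop ORFs on the forward strand.

    start and end are 1-based; end is the last base before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first START codon opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start + 1, pos))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("ATGAAATTTTAG"))
```

The reported coordinates are the kind of region information that is later passed to transeq as the region to translate.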

71
transeq
  • transeq - Translate nucleic acid sequences
  • Input
  • Sequence: embl:xlrhodop
  • Regions to translate: 110-1171 (from the output
    of getorf)
  • Output: translated sequence of the given region.
  • Save the file as xlrhodop.pep
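
Translating a region amounts to cutting out the bases from begin to end (1-based, inclusive) and reading them three at a time. The Python sketch below does this with the standard genetic code; translate_region is a hypothetical helper, not transeq, and it assumes frame 1 of the chosen region. Note that region 110-1171 is 1062 bases long, i.e. exactly 354 codons.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_region(seq: str, begin: int, end: int) -> str:
    """Translate seq[begin..end] (1-based, inclusive) in frame 1 of the region."""
    region = seq[begin - 1:end].upper().replace("U", "T")
    # Unknown or ambiguous codons are written as 'X'; '*' marks a stop codon
    return "".join(CODON_TABLE.get(region[i:i + 3], "X")
                   for i in range(0, len(region) - 2, 3))


print(translate_region("CCATGGCCTAAGG", 3, 11))
print((1171 - 110 + 1) // 3, "codons in region 110-1171")
```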

72
Exercise 1 Q1
  • Align HER2 _ERB2_HUMAN and UNKNOWN_AAL39899.1
    with needle and water. What is the main
    difference between the two types of alignment in
    these two cases? (The files HER2-fasta.prt and
    ALL39899_1.prt are at
    http://bioinfo.hku.hk/tutorial/)
  • Repeat the Smith-Waterman alignment of
    HER2-fasta.prt and ALL39899_1.prt with different
    parameters. What happens if the gap penalties are
    changed to 30 and 2 instead of the defaults, 10
    and 0.5?
  • BLOSUM62 is the default matrix. What happens to
    the local alignment (using program water) when
    using other matrices, e.g. EPAM10?

73
Exercise 1 Q2
  • Type gb:A7120FTSZ in the text box and run seqret.
    Run entret with the same sequence USA and examine
    the entry. What is the difference between the two
    entries?

74
Exercise 1 Q3
  • With the program infoseq, display information on
    all sequences whose names start with 10 in the
    SwissProt database. (Hint: the sequence is
    sw:10*; choose the information you want to
    display by changing the option to yes.)

75
Exercise 1 answer (A1)
  • Needle output

76
Exercise 1 answer (A1)
  • Water output

77
Exercise 1 answer (A1)
  • Water output with gap opening penalty of 30 and
    gap extension penalty of 2.

78
Exercise 1 answer (A1)
  • Water output with the EPAM10 matrix

79
Exercise 1 answer (A1)
  • The global alignment (needle) requires the whole
    sequences to be aligned, so the identity and
    similarity are much lower than in the local
    alignment (water).
  • If the gap penalties are changed to 30 and 2, no
    gaps appear in the alignment.
  • If EPAM10 is used, the score and alignment length
    drop. Since PAM is derived from global
    alignments, it gives a worse result for the local
    alignment program water. EPAM10 is more suitable
    for very similar proteins with no more than 10%
    evolutionary divergence.

80
Exercise 1 answer (A1)
  • Amino Acid substitution matrices
  • PAM (percent accepted mutation) matrices list the
    likelihood of change from one amino acid to
    another in homologous sequences during evolution.
  • One PAM is a unit of evolutionary divergence in
    which 1% of the amino acids have been changed.
  • Some amino acid substitutions occur more
    readily than others, probably because they do
    not have a great effect on the structure and
    function of a protein.

81
Exercise 1 answer (A1)
  • Amino Acid substitution matrices (cont)
  • BLOSUM matrix values are based on a large set
    of 2000 conserved amino acid patterns called
    blocks. Blocks come from a database of protein
    sequences representing more than 500 families of
    related proteins.
  • PAM is derived from global alignments of
    proteins, while BLOSUM comes from alignments of
    shorter sequences.
  • The matrix built from blocks with no more than x%
    similarity is called BLOSUMx.

82
Exercise 1 answer (A1)
  • PAM100 ==> Blosum90
  • PAM120 ==> Blosum80
  • PAM160 ==> Blosum62
  • PAM200 ==> Blosum52
  • PAM250 ==> Blosum45
  • The Blosum matrices are best for detecting local
    alignments.
  • The Blosum62 matrix is the best for detecting the
    majority of weak protein similarities.
  • The Blosum45 matrix is the best for detecting
    long and weak alignments.

83
Exercise 1 answer (A1)
  • If the BLOSUM62 matrix is compared to PAM160 then
    it is found that the BLOSUM matrix is less
    tolerant of substitutions to or from hydrophilic
    amino acids, while more tolerant of hydrophobic
    changes and of cysteine and tryptophan mismatches.

84
Exercise 1 answer (A2)
  • seqret output

85
Exercise 1 answer (A2)
  • entret output

86
Exercise 1 answer (A2)
  • You will see the sequence for the Anabaena 7120
    ftsZ and gsh-III genes.
  • EMBOSS is also capable of extracting more
    information than just the sequence from a
    database entry. The program entret will return
    the entire entry as a text file.

87
Exercise 1 answer (A3)
  • Output

88
garnier
  • garnier - Predicts protein secondary structure
    using the Garnier-Osguthorpe-Robson (GOR) method
  • Secondary structure prediction is notoriously
    difficult to do accurately. The GOR I algorithm
    is one of the first semi-successful methods.
  • The Garnier method is not regarded as the most
    accurate prediction method, but it is simple to
    calculate on most workstations.
  • Input: translated sequence (xlrhodop.pep), made
    from embl:xlrhodop region 110-1171 with program
    transeq.
  • Output: predicted protein secondary structure

89
garnier
  • Output

90
pepinfo
  • pepinfo - Plots simple amino acid properties in
    parallel.
  • Input sequence: translated sequence
    (xlrhodop.pep), made from embl:xlrhodop region
    110-1171 with program transeq.
  • Output: a textual and graphical representation of
    amino acid properties (size, polarity,
    aromaticity, charge, etc). Hydrophobicity
    profiles are useful for locating turns, potential
    antigenic peptides and transmembrane helices.

91
pepinfo
  • Showing the residues distribution

92
pepinfo
  • Hydrophobicity profiles are useful for locating
    turns, potential antigenic peptides and
    transmembrane helices.
  • Positive score -> hydrophobic region.
  • Negative score -> hydrophilic region.
  • The plot shows seven highly hydrophobic regions.
  • Use the program tmap to investigate further.

93
patmatmotifs
  • patmatmotifs searches the PROSITE motif database
    with a protein sequence. It can identify to which
    known family of proteins (if any) the new sequence
    belongs.
  • PROSITE currently contains patterns and profiles
    specific for more than a thousand protein
    families or domains.
  • PROSITE patterns: biologically significant amino
    acid patterns summarized in the form of
    regular expressions.
  • PROSITE profiles: techniques based on weight
    matrices that allow the detection of highly
    divergent protein families and
    functional/structural domains.

94
patmatmotifs
  • Input sequence: the file xlrhodop.pep, which is
    output from transeq of the sequence embl:xlrhodop,
    region 110-1171.
  • Output: a textual representation showing where
    the sequence matches a motif.

95
pscan
  • pscan: scans proteins using PRINTS
  • PRINTS is a database of diagnostic protein
    signatures, or fingerprints.
  • Fingerprints are groups of conserved motifs or
    elements that together form a diagnostic
    signature for particular protein families.
  • An uncharacterised sequence matching all motifs
    or elements can then be readily diagnosed as a
    true match to a particular family fingerprint.
  • Input sequence: the file xlrhodop.pep, which is
    output from transeq of the sequence embl:xlrhodop,
    region 110-1171.

96
pscan
  • Output: a textual representation showing where
    short stretches of the sequence match the PRINTS
    database, which defines functional protein families.

97
fuzznuc
  • fuzznuc uses PROSITE-style patterns to search
    nucleotide sequences.
  • Letter code for patterns
  • [ACG] stands for A or C or G.
  • {AG} stands for any nucleotide except A and G.
  • N(3) corresponds to N-N-N; N(2,4) corresponds to
    N-N or N-N-N or N-N-N-N.
  • Example: [CG](5)TG{A}N(1,5)C
  • Input
  • Sequence: embl:hhtetra
  • Pattern: AAGCTT
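
The pattern notation above maps closely onto ordinary regular expressions, which the following Python sketch makes explicit. The converter pattern_to_regex is a hypothetical helper covering only the constructs listed on this slide (square brackets, curly braces, repeat counts and N); it is not fuzznuc's matching engine and does not handle mismatches.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style nucleotide pattern into a Python regex string."""
    pattern = pattern.replace("-", "")  # separators such as N-N-N are optional
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":    # [ACG]: any one of the listed bases
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1].upper())
            i = j + 1
        elif ch == "{":  # {AG}: any base except those listed
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j].upper() + "]")
            i = j + 1
        elif ch == "(":  # (3) or (2,4): repeat count for the previous element
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        elif ch.upper() == "N":  # N: any nucleotide
            out.append("[ACGT]")
            i += 1
        else:
            out.append(ch.upper())
            i += 1
    return "".join(out)


regex = pattern_to_regex("[CG](5)TG{A}N(1,5)C")
print(regex)
print(re.search(pattern_to_regex("AAGCTT"), "CCAAGCTTGG").start())
```

AAGCTT, the pattern used as input here, is a plain literal, so the converted regular expression is the pattern itself.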

98
fuzznuc
  • Output

99
Exercise 2 Q1
  • Use tmap to display membrane-spanning regions
    with the input sequence xlrhodop.pep
    (translated with program transeq from
    embl:xlrhodop, region 110-1171). Does the
    result agree with pepinfo?

100
Exercise 2 Q2
  • Use fuzzpro to search the sequence CREAp_m.txt for
    the pattern CXXXXC (the file CREAp_m.txt is from
    http://bioinfo.hku.hk/tutorial/)

101
Exercise 2 Q3
  • Use patmatmotifs to find patterns in the SwissProt
    sequences fos_human or fos_rat, and use these
    patterns with fuzzpro to search the fos genes of
    other organisms. (Hint: use sw:fos_human for
    the input. Other organisms: bovin, chick, mouse,
    sheep.)

102
Exercise 2 Q4
  • Sometimes it is better to run the program fuzznuc
    on the command line because more parameters can be
    given.
  • In the BIOINFO terminal, type the following (you
    must put the command on one line at the UNIX
    prompt)
  • bioinfo fuzznuc -sequence=embl:hhtetra
    -pattern=AAGCTT -mismatch=1 -complement
    -outf=outf.out
  • How is the result different from the previous run
    in the web interface?

103
Exercise 2 answer (A1)
  • Bars are displayed in the plot above the regions
    predicted as most likely to be
    transmembrane regions.
  • There may be seven transmembrane helices in this
    protein.
  • The result agrees with pepinfo.

104
Exercise 2 answer (A2)
  • The symbol x is used for a position where any
    amino acid is accepted.
  • Therefore, the pattern CXXXXC matches the
    sequences CQFPGC and CMFPGC.

105
Exercise 2 answer (A2)
  • patmatmotifs output using sw:FOS_HUMAN

106
Exercise 2 answer (A3)
  • When run with patmatmotifs, the sequences
    sw:FOS_HUMAN and sw:FOS_RAT return the same
    motifs: AMIDATION, LEUCINE_ZIPPER, and
    BZIP_BASIC.
  • When fuzzpro is run with one of the patterns,
    the start and end positions agree with
    patmatmotifs.

107
Exercise 2 answer (A3)
  • fuzzpro output with pattern GRAQSIGRRGKVEQ and
    sequence sw:fos_human

108
Exercise 2 answer (A4)
  • On the command line you can give the number of
    mismatches as an input parameter. The result with
    1 mismatch can now be shown.

109
cpgplot
  • cpgplot: plots the CpG-rich areas
  • CpG refers to a C nucleotide immediately followed
    by a G. The 'p' in 'CpG' refers to the phosphate
    group linking the two bases.
  • By default, this program defines a CpG island as
    a region where
  • over an average of 10 windows, the calculated
    % composition is over 50%
  • and the calculated Obs/Exp (i.e.
    Observed/Expected) ratio is over 0.6
  • and the conditions hold for a minimum of 200
    bases.
  • These conditions can be modified by setting the
    values of the appropriate parameters.

110
cpgplot
  • The Observed number of CpG patterns in a window
    is simply the count of the number of times a 'C'
    is found followed immediately by a 'G'.
  • The Expected frequency of CpG's in a window is
    calculated as the number of 'C's in the window
    multiplied by the number of 'G's in the window,
    divided by the window length.
  • Expected = (number of C's * number of G's) /
    window length
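
The Observed and Expected definitions above translate directly into code. The Python sketch below computes the Obs/Exp ratio for a single window; the function name cpg_obs_exp is hypothetical, and the sketch leaves out the window sliding, averaging and thresholds that cpgplot applies around this calculation.

```python
def cpg_obs_exp(window: str) -> float:
    """Observed/Expected CpG ratio for one window of sequence."""
    w = window.upper()
    if not w:
        return 0.0
    # Observed: number of times a C is immediately followed by a G
    observed = w.count("CG")
    # Expected: (number of C's * number of G's) / window length
    expected = w.count("C") * w.count("G") / len(w)
    return observed / expected if expected else 0.0


print(cpg_obs_exp("CGCGCGCG"))
print(cpg_obs_exp("AAAATTTT"))
```

A window with no C or no G has an expected count of zero; the sketch reports 0.0 there instead of dividing by zero.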

111
cpgplot
  • Input: embl:rnu68037
  • Output

112
cpgplot
  • Output

113
cusp
  • CUSP reads one or more coding sequences (CDS
    sequence only) and calculates a codon frequency
    table.
  • It is important to use a codon frequency table
    that is appropriate for the species that your
    protein comes from.
  • Input
  • Seq: embl:paamir
  • Codon usage table: Default (Ehum.cut)

114
cusp
  • Output
  • Fract: the fraction of all codons for the same
    amino acid that are this codon triplet.
  • /1000: the number of codons per 1000 bases
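
A codon frequency table of this kind can be sketched in Python as shown below. This is an illustration under stated assumptions rather than cusp's actual code: the function name codon_usage is hypothetical, the standard genetic code is assumed, and the per-thousand figure is computed here per 1000 codons counted, which may not match cusp's exact definition.

```python
from collections import Counter
from itertools import product
from typing import Dict

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds: str) -> Dict[str, dict]:
    """Count codons in a coding sequence and derive Fract and per-1000 values."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    total = sum(counts.values())
    # Total codons seen for each amino acid, needed for the Fract column
    per_aa = Counter()
    for codon, n in counts.items():
        per_aa[CODON_TABLE[codon]] += n
    return {
        codon: {
            "aa": CODON_TABLE[codon],
            "fract": n / per_aa[CODON_TABLE[codon]],
            "per_1000": 1000 * n / total,  # assumption: per 1000 codons counted
            "number": n,
        }
        for codon, n in counts.items()
    }


for codon, row in sorted(codon_usage("GCTGCTGCCAAA").items()):
    print(codon, row)
```

The Fract values of all codons for one amino acid add up to 1, which is what makes the column useful for comparing synonymous codon preference between species.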

115
cusp
  • Running the program on the command line allows you
    to specify the sequence begin and sequence end
  • bioinfo cusp -sbeg 135 -send 1292
  • Create a codon usage table
  • Input sequence(s): embl:paamir
  • Output file [paamir.cusp]:

116
cusp
  • bioinfo more paamir.cusp

117
hmoment
  • hmoment plots or writes out the hydrophobic
    moment. The hydrophobic moment is the
    hydrophobicity of a peptide measured for a
    specified angle of rotation per residue.
  • Assumption: the angle of rotation (bonds of the
    backbone and amino acid side-chains) per residue
    in alpha helices is 100 degrees. The angle of
    rotation per residue in beta sheets is 160
    degrees.
  • Input
  • Sequence: sw:hbb_human
  • Produce graph: yes
  • Plot two graphs: yes
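
A common way to write the hydrophobic moment (following Eisenberg's definition) treats each residue's hydrophobicity as a vector that is rotated by a fixed angle per residue, and takes the length of the vector sum. The Python sketch below implements that formula for one window. It is an assumption that hmoment follows this formula; the program may also normalise the value or use a particular hydrophobicity scale. The function name is hypothetical and the hydrophobicity values are supplied by the caller.

```python
import math
from typing import Sequence


def hydrophobic_moment(hydrophobicity: Sequence[float],
                       angle_deg: float = 100.0) -> float:
    """Length of the summed hydrophobicity vectors, rotated angle_deg per residue."""
    sin_sum = 0.0
    cos_sum = 0.0
    for n, h in enumerate(hydrophobicity, start=1):
        theta = math.radians(angle_deg * n)
        sin_sum += h * math.sin(theta)
        cos_sum += h * math.cos(theta)
    return math.hypot(sin_sum, cos_sum)


# Illustrative values only: four equally hydrophobic residues, 90 degrees apart,
# point in four opposite directions and cancel out
print(round(hydrophobic_moment([1.0, 1.0, 1.0, 1.0], angle_deg=90.0), 6))
# Use 100 degrees for an alpha helix and 160 degrees for a beta sheet
print(hydrophobic_moment([1.0], angle_deg=100.0))
```

A high moment at 100 degrees means the hydrophobic residues cluster on one face of a helix, which is the signature of an amphipathic helix.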

118
hmoment
  • Output
  • one for the alpha helix moment and one for the
    beta sheet moment.

119
  • End of lecture
  • Thank you!