Tools and Softwares - PowerPoint PPT Presentation

About This Presentation
Title:

Tools and Softwares

Description:

MACAW uses scores for aligning amino acids called PAM-250. ... MACAW can use some different, more biologically realistic set of scores. ... – PowerPoint PPT presentation

Number of Views:32
Avg rating:3.0/5.0
Slides: 31
Provided by: usha7
Learn more at: https://www.cs.uic.edu
Category:
Tags: macaw | softwares | tools

less

Transcript and Presenter's Notes

Title: Tools and Softwares


1
Tools and Softwares
  • Group 13
  • Shashi Ranjan
  • Murali Sivaramakrishnan
  • Ognjen Perisic

2
Introduction
  • Importance of aligning two sequences
  • Identify the structure and function
  • Evolutionary similarities
  • Why use computers to do this alignment?
  • Sequences are very, very, very long
  • Need to evaluate many alignments

3
Tools and Softwares
  • BLAST
  • PIP MAKER
  • FASTA
  • Motif Explorer
  • MACAW
  • Five Methods for finding
    Conserved Segments.

4
BLAST Overview Basic Local Alignment Search Tool
  • Dynamic programming O(mn) !
  • BLAST heuristically finds high scoring segment
    pairs (HSP)
  • Identical length segments from 2 sequences with
    high scores
  • I.e. ungapped local alignments

5
BLAST Nuts Bolts
  • Given query sequence q, word length w, word
    score threshold T, segment score threshold S
  • Generate a word set (neighborhood words) that
    score at least T when compared to words form q
  • Search the dB for matches to words in list, a
    match indicates a possible alignment
  • Extend all matches to seek high-scoring segment
    pairs and return those pairs

6
(No Transcript)
7
BLAST Extensions
  • Gapped BLAST
  • 3 Changes to the original BLAST
  • Word extension criteria modified
  • Allow gapped alignments, need to find just one
    ungapped alignment
  • Use dynamic programming of Smith-Waterman to
    produce final alignment

8
BLAST Extensions
  • Position Specific Iterated BLAST
  • Basic Idea
  • Use results from BLAST query to build a profile
  • Search dB with profile instead of initial query
  • Iterate

9
BLAST Programs
  • NCBI BLAST ---- http//ncbi.nlm.nih.gov/BLAST/

10
PipMaker
  • WWW site for comparing DNA sequences
  • Identifies conserved sequences
  • Provides graphical output in PDF or PS format
  • Uses a variant of Gapped BLAST - BLASTz
  • http//bio.cse.psu.edu

11
Tools And Software
  • FASTA
  • Motif Explorer
  • Accessibility

12
FASTA Overview
  1. Given two sequences x,y and ksize of exactly
    matching strings.
  2. Tabulate the offset between x,y for all matching
    strings of size k (k-tuples)
  3. For offset with high frequency, try to combine
    their k-tuples into regions
  4. Extend the best regions without gaps as long as
    score improves. (using the substitution matrix
    such as PAM250) (init1)
  5. Check if un-gapped regions can be combined with a
    small gapped region. (initn used to rank the
    library sequences.)
  6. Construct an optimal alignment of the highest
    ranking seq. (opt)

13
FASTA- offset tabulation
  • X HARFYAAQIVL
  • Y VDMAAQIA
  • K-tuple offsets of x (k1)
  • A(2,6,7), I(9), , Q(8), V(10),
  • Offset difference for each k-tuple of Y

V D M A A Q I A
9 -2 -3 2 2 -6
2 1 -2
3 2 -1
Offsets -6 -5 -4 -3 -2 -1 0 1 2 3 4 8 9
Frequency 1 1 2 1 1 4 1 1
14
FASTA
15
FASTA Package
FASTA scan a protein or DNA sequence library for similar sequences.
TFASTA compare a protein sequence to a DNA sequence library.
FASTX, FASTY compare a DNA sequence to a protein sequence database.
FASTF, TFASTF compare an ordered peptide mixture against a protein or translated DNA database.
FASTS, TFASTS compare a set of short peptide fragments against a protein or translated DNA database.
LFASTA, PLFASTA concentrates more on the local regions and reports more than on one sequence.
RDF2 for evaluating the statistical significance of a similarity score.
16
Motif Explorer
17
Motif Explorer
18
Motif Explorer - Architecture
19
Accessibility
  • FASTA
  • www.ebi.ac.uk/fasta33/
  • http//bioweb.pasteur.fr/seqanal/interfaces/fasta.
    html
  • http//fasta.bioch.virginia.edu/
  • Motif Explorer
  • www.cbc.med.umn.edu/gst/MotifExplorer.html
  • www.arabidopsis.org/links/motif_search.html

20
MACAWA Workbench for Multiple Alignment
Construction and Analysis
  • Allows user to construct multiple protein
    alignments by locating, analyzing, editing and
    combining blocks of aligned sequence segments.
  • MACAW incorporate several features
  • Regions of local similarity are located by a
    search algorithm that avoids many of the
    limitations of individual algorithms
  • The statistical significance of blocks of
    similarity is evaluated using a mathematical
    theory developed during paste 15 years
  • User can edit each block by moving its boundaries
    or by eliminating particular segments, and blocks
    may be linked to form a composite multiple
    alignment.

21
Theoretical boundaries
  • Problem For a set of n protein sequences each of
    length l, a region of similarity common to all
    may begin anywhere in each sequence and it should
    be compared to all other sequences, that is ln
    alignments should be checked. This search space
    is too large.
  • MACAW imposes a single condition on the alignment
    it seeks that all segments show a minimal
    amount of mutual similarity. This can examine all
    of the search space in O(n2l2) time.

22
Terminology
  • For a set of n sequences, one subset of segments
    of some specific length from each of m (mltn)
    sequences forms a m-block, or simply a block.
  • Any set of m sequences locked into a specific
    alignment, with no gaps allowed, is a m-diagonal
    or simply a diagonal.
  • An aligned set of n amino acids is n-column or
    a column.

23
Algorithm
  • Set of scores is used for comparison of two
    proteins, with or without gaps. MACAW uses scores
    for aligning amino acids called PAM-250.
  • The score of the block is the sum of the scores
    assigned to each of its columns.
  • Score of the column is the sum of all pairwise
    similarity scores of the amino acids it
    comprises. Those SP scores are called Sum of
    the Pairs. MACAW can use some different, more
    biologically realistic set of scores.
  • Search routine seeks only blocks in which all
    pairs of segments are contained in pairwise
    subalignments with score greater than or equal to
    some threshold T. T should be chosen so that
    about 10 of the diagonals are marked. For n
    sequences of total (aggregate) length L, each
    sequence must be compared with others, and this
    takes O(L2) time. This is the most time-consuming
    step.
  • For any multiple diagonal, all implied
    2-diagonals must have been marked during first
    phase.

24
Algorithm
  • The basic problem is that homologies that exist
    in a n-diagonal may be represented by n-blocks,
    or by 4, 3 or 2 blocks or some other combinations
    of disconnected blocks. To solve this problem
    MACAW uses heuristic approach in two steps
  • Program searches for the highest 2-block in every
    2-diagonal using threshold T and marks the amino
    acid pairs it contains.
  • Columns are represented as graphs. Every vertex
    is amino acid and edge between them exist if they
    are marked in step 1. MACAW searches for the most
    connected (with as many as possible edges) graphs
    and connects them into one block.

25
Algorithm
  • This procedure doesnt mark all the relations
    between sequences, so MACAW allows user to edit
    blocks.
  • Every amino acid is colored according to the
    number of edges it is connected.

26
Example
27
Five experimental methods for finding conserved
sequences in multiple alignments of gene
regulatory regions
  • A conserved character in DNA is one that was
    probably present in the common ancestral species
    and has been preserved in the contemporary
    species being examined.
  • Two of the methods are already in common use
    they are based on good column agreement and high
    information content.
  • Three additional methods find blocks with minimal
    evolutionary, blocks that differ in at most k
    positions pre row from a center sequence that is
    unknown a priori. The center sequence in the
    latter two methods is a way to model potential
    binding sites for known or unknown proteins in
    DNA sequences or it is a common ancestor of the
    species represented in the alignment.
  • Parameters common to all of the
    tools/programs/algorithm
  • The minimum length of the regions to be reported
  • The minimum number of sequences, which must be
    active, are selectable by the user

28
Five tools
  • agree This utility locates regions in a given
    alignment that have good column agreement.
  •  
  • The length of the region is often a reliable
    indicator that some functionality was preserved
    across the species, but conservation doesnt need
    to be perfect and such regions might be
    fragmented into conserved pieces too small to be
    detected, so a systematic way to link the smaller
    regions is needed.
  • The two utilities infocon and phylogen are trying
    to solve this problem. The idea is to assign a
    numerical score to each column and then look for
    runs of columns meeting the following two
    conditions
  • their cumulative score (obtained by adding
    together the individual column scores) is no
    smaller than the score of any of their sub-runs
  • they are maximal with this property, i.e. they
    are not contained in any longer run having the
    property 1.
  •  
  • The infocon tool finds full runs of columns with
    high information content in the given alignment.
    To do this, each column is assigned an
    intermediate score that measures its information
    content, based on the frequencies of the letter
    both within the column and within the alignment
    as whole.
  •  

29
Five tools
  • phylogen This program scores the columns by
    their evolutionary relationships among the
    sequences of the given alignment implied by a
    supplied phylogenetic tree. The phylogenetic tree
    has a leaf node for each species and each
    internal node represents a putative common
    ancestor for the species in its sub-tree.

30
Five tools
  • kkno This program scans the alignment to
    determine, starting at each position, the longest
    region in which no row differs from a specified,
    known center sequence in more than k positions.
  • center A C C G T G C A G
  • 1 2 3 4 5 6 7 8 9
  • human A G C G T G C A C
  • rabbit A C C G T A C A T
  • mouse T C C G T A C A C
  • kunk This program is similar to kkno except
    that the center sequence is not known a priori
    instead, the program computes the best center
    sequence for each conserved region it finds. For
    each column in the alignment, the algorithm
    recursively examines all possible center
    sequences starting at that position to see how
    far the region can be extended and back-tracks
    when the extension becomes impossible.
Write a Comment
User Comments (0)
About PowerShow.com