Title:

Tools and Softwares

Slides: 31

Provided by: usha7

Learn more at: https://www.cs.uic.edu

Transcript and Presenter's Notes

Title: Tools and Softwares

1
Tools and Softwares

Group 13
Shashi Ranjan
Murali Sivaramakrishnan
Ognjen Perisic

2
Introduction

Importance of aligning two sequences
Identify the structure and function
Evolutionary similarities
Why use computers to do this alignment?
Sequences are very, very, very long
Need to evaluate many alignments

3
Tools and Softwares

BLAST
PIP MAKER
FASTA
Motif Explorer
MACAW
Five Methods for finding
Conserved Segments.

4
BLAST Overview: Basic Local Alignment Search Tool

Dynamic programming takes O(mn) time, which is too
slow for searching large databases
BLAST heuristically finds high-scoring segment
pairs (HSPs)
An HSP is a pair of identical-length segments from
two sequences with a high score,
i.e. an ungapped local alignment

5
BLAST Nuts & Bolts

Given a query sequence q, word length w, word
score threshold T and segment score threshold S:
Generate a word set (neighborhood words) of words
that score at least T when compared to words from q
Search the database for matches to words in the
list; a match indicates a possible alignment
Extend all matches to seek high-scoring segment
pairs (score at least S) and return those pairs
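
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the NCBI implementation: the DNA alphabet, the +5/-4 scoring, the `drop` cutoff for stopping an extension, and the function names `neighborhood`, `find_hits` and `extend` are all assumptions made for the example.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: DNA, to keep the neighborhood small


def score(a: str, b: str) -> int:
    """Toy substitution score (assumption): +5 per match, -4 per mismatch."""
    return sum(5 if x == y else -4 for x, y in zip(a, b))


def neighborhood(query: str, w: int, T: int) -> dict[str, list[int]]:
    """Map every word scoring >= T against a query word to the query offsets."""
    words: dict[str, list[int]] = {}
    for i in range(len(query) - w + 1):
        q_word = query[i:i + w]
        for cand in map("".join, product(ALPHABET, repeat=w)):
            if score(q_word, cand) >= T:
                words.setdefault(cand, []).append(i)
    return words


def find_hits(query: str, subject: str, w: int, T: int) -> list[tuple[int, int]]:
    """Scan the subject for neighborhood words; return (query, subject) offsets."""
    words = neighborhood(query, w, T)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in words.get(subject[j:j + w], [])]


def extend(query: str, subject: str, i: int, j: int, w: int, drop: int = 8):
    """Ungapped extension of a hit; stop once the score falls `drop` below the best."""
    best = cur = score(query[i:i + w], subject[j:j + w])
    right = k = 0
    while i + w + k < len(query) and j + w + k < len(subject) and best - cur <= drop:
        cur += score(query[i + w + k], subject[j + w + k])
        k += 1
        if cur > best:
            best, right = cur, k
    cur, left, k = best, 0, 0
    while i - k - 1 >= 0 and j - k - 1 >= 0 and best - cur <= drop:
        cur += score(query[i - k - 1], subject[j - k - 1])
        k += 1
        if cur > best:
            best, left = cur, k
    return best, i - left, i + w + right  # score, query start, query end (exclusive)
```

A caller would keep only the extended pairs whose score reaches the segment threshold S.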

7
BLAST Extensions

Gapped BLAST
Three changes to the original BLAST:
Word extension criteria are modified
Gapped alignments are allowed, so only one
ungapped alignment needs to be found as a seed
Smith-Waterman dynamic programming is used to
produce the final alignment
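
For reference, the Smith-Waterman recurrence mentioned above can be sketched as follows. This is the textbook local-alignment score with a linear gap penalty; the scoring values are assumptions for illustration, and Gapped BLAST itself applies the dynamic programming only around a seed rather than over the full matrix.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> int:
    """Return the best local alignment score of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```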

8
BLAST Extensions

Position Specific Iterated BLAST
Basic Idea
Use the results from a BLAST query to build a profile
Search the database with the profile instead of
the initial query
Iterate

9
BLAST Programs

NCBI BLAST: http://ncbi.nlm.nih.gov/BLAST/

10
PipMaker

WWW site for comparing DNA sequences
Identifies conserved sequences
Provides graphical output in PDF or PS format
Uses a variant of Gapped BLAST - BLASTz
http://bio.cse.psu.edu

11
Tools And Software

FASTA
Motif Explorer
Accessibility

12
FASTA Overview

Given two sequences x, y and k = size of exactly
matching strings:
Tabulate the offsets between x and y for all
matching strings of size k (k-tuples)
For offsets with high frequency, try to combine
their k-tuples into regions
Extend the best regions without gaps as long as
the score improves, using a substitution matrix
such as PAM250 (init1)
Check whether ungapped regions can be combined
using a small gapped region (initn, used to rank
the library sequences)
Construct an optimal alignment of the highest
ranking sequences (opt)

13
FASTA - offset tabulation

X = HARFYAAQIVL
Y = VDMAAQIA
K-tuple positions in X (k = 1):
A(2,6,7), I(9), ..., Q(8), V(10), ...
Offset (position in X minus position in Y) for each k-tuple of Y:

Y        V   D   M   A   A   Q   I   A
offsets  9           -2  -3  2   2   -6
                     2   1           -2
                     3   2           -1

Offset     -6  -5  -4  -3  -2  -1  0   1   2   3   4   8   9
Frequency  1           1   2   1       1   4   1           1
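
The tabulation above can be reproduced with a short Python sketch. The function name `offset_histogram` and the 1-based positions are choices made for this example so that the numbers match the slide; it is not code from the FASTA package.

```python
from collections import Counter, defaultdict


def offset_histogram(x: str, y: str, k: int = 1) -> Counter:
    """Count offsets (position in x minus position in y) of matching k-tuples."""
    positions = defaultdict(list)
    for i in range(len(x) - k + 1):
        positions[x[i:i + k]].append(i + 1)  # 1-based positions, as on the slide
    counts = Counter()
    for j in range(len(y) - k + 1):
        for i in positions.get(y[j:j + k], []):
            counts[i - (j + 1)] += 1
    return counts


hist = offset_histogram("HARFYAAQIVL", "VDMAAQIA", k=1)
for offset in sorted(hist):
    print(offset, hist[offset])
```

The most frequent offset (here 2, with four matching 1-tuples) marks the diagonal on which FASTA would try to build a region.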
14
FASTA
15
FASTA Package
FASTA: scans a protein or DNA sequence library for similar sequences.
TFASTA: compares a protein sequence to a DNA sequence library.
FASTX, FASTY: compare a DNA sequence to a protein sequence database.
FASTF, TFASTF: compare an ordered peptide mixture against a protein or translated DNA database.
FASTS, TFASTS: compare a set of short peptide fragments against a protein or translated DNA database.
LFASTA, PLFASTA: concentrate on local regions of similarity and can report more than one local alignment.
RDF2: evaluates the statistical significance of a similarity score.
16
Motif Explorer
17
Motif Explorer
18
Motif Explorer - Architecture
19
Accessibility

FASTA
www.ebi.ac.uk/fasta33/
http://bioweb.pasteur.fr/seqanal/interfaces/fasta.html
http://fasta.bioch.virginia.edu/
Motif Explorer
www.cbc.med.umn.edu/gst/MotifExplorer.html
www.arabidopsis.org/links/motif_search.html

20
MACAW: A Workbench for Multiple Alignment
Construction and Analysis

Allows the user to construct multiple protein
alignments by locating, analyzing, editing and
combining blocks of aligned sequence segments.
MACAW incorporates several features:
Regions of local similarity are located by a
search algorithm that avoids many of the
limitations of individual algorithms.
The statistical significance of blocks of
similarity is evaluated using a mathematical
theory developed during the past 15 years.
The user can edit each block by moving its
boundaries or by eliminating particular segments,
and blocks may be linked to form a composite
multiple alignment.

21
Theoretical boundaries

Problem: For a set of n protein sequences, each of
length l, a region of similarity common to all
may begin anywhere in each sequence, so every
combination of starting positions must be
compared; that is, l^n alignments should be
checked. This search space is too large.
MACAW imposes a single condition on the alignment
it seeks: that all segments show a minimal
amount of mutual similarity. With this condition,
the whole search space can be examined in
O(n^2 l^2) time.

22
Terminology

For a set of n sequences, a set of segments of
some specific length, one from each of m (m ≤ n)
sequences, forms an m-block, or simply a block.
Any set of m sequences locked into a specific
alignment, with no gaps allowed, is an m-diagonal,
or simply a diagonal.
An aligned set of n amino acids is an n-column, or
simply a column.

23
Algorithm

A set of scores is used to compare two proteins,
with or without gaps. MACAW uses a set of scores
for aligning amino acids called PAM-250.
The score of a block is the sum of the scores
assigned to each of its columns.
The score of a column is the sum of all pairwise
similarity scores of the amino acids it
comprises. These are called sum-of-pairs (SP)
scores. MACAW can also use a different, more
biologically realistic set of scores.
The search routine seeks only blocks in which all
pairs of segments are contained in pairwise
subalignments with a score greater than or equal
to some threshold T. T should be chosen so that
about 10% of the diagonals are marked. For n
sequences of total (aggregate) length L, each
sequence must be compared with the others, and
this takes O(L^2) time. This is the most
time-consuming step.
For any multiple diagonal, all implied
2-diagonals must have been marked during the
first phase.
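
The sum-of-pairs scoring described above can be illustrated with a small Python sketch. The `pair_score` function is a toy stand-in for PAM-250 (an assumption, chosen so the example stays self-contained), and the function names are hypothetical.

```python
from itertools import combinations


def pair_score(a: str, b: str) -> int:
    """Toy stand-in for a PAM-250 lookup (assumption): +2 identical, -1 otherwise."""
    return 2 if a == b else -1


def column_score(column: str) -> int:
    """Sum of pairwise scores over all pairs of residues in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def block_score(segments: list[str]) -> int:
    """Score of an ungapped block: the sum of its column scores."""
    return sum(column_score("".join(col)) for col in zip(*segments))


# Three aligned segments of length 2: column "AAA" scores 6, column "CCG" scores 0.
print(block_score(["AC", "AC", "AG"]))
```

Swapping `pair_score` for a real substitution matrix lookup would leave the rest unchanged.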

24
Algorithm

The basic problem is that homologies that exist
in an n-diagonal may be represented by n-blocks,
by 4-, 3- or 2-blocks, or by some other
combination of disconnected blocks. To solve this
problem MACAW uses a heuristic approach in two
steps:
Step 1: The program searches for the
highest-scoring 2-block in every 2-diagonal using
threshold T and marks the amino acid pairs it
contains.
Step 2: Columns are represented as graphs. Every
vertex is an amino acid, and an edge exists
between two vertices if the pair was marked in
step 1. MACAW searches for the most connected
graphs (those with as many edges as possible) and
connects them into one block.

25
Algorithm

This procedure doesn't mark all the relations
between sequences, so MACAW allows the user to
edit blocks.
Every amino acid is colored according to the
number of edges connected to it.

26
Example
27
Five experimental methods for finding conserved
sequences in multiple alignments of gene
regulatory regions

A conserved character in DNA is one that was
probably present in the common ancestral species
and has been preserved in the contemporary
species being examined.
Two of the methods are already in common use;
they are based on good column agreement and high
information content.
Three additional methods find blocks with minimal
evolutionary change, blocks that differ in at
most k positions per row from a known center
sequence, and blocks that differ in at most k
positions per row from a center sequence that is
unknown a priori. The center sequence in the
latter two methods is a way to model potential
binding sites for known or unknown proteins in
DNA sequences, or it is a common ancestor of the
species represented in the alignment.
Parameters common to all of the tools:
The minimum length of the regions to be reported
The minimum number of sequences that must be
active
Both are selectable by the user.

28
Five tools

agree: This utility locates regions in a given
alignment that have good column agreement.
The length of such a region is often a reliable
indicator that some functionality was preserved
across the species, but conservation doesn't need
to be perfect, and such regions might be
fragmented into conserved pieces too small to be
detected, so a systematic way to link the smaller
regions is needed.
The two utilities infocon and phylogen try to
solve this problem. The idea is to assign a
numerical score to each column and then look for
runs of columns meeting the following two
conditions:
1. Their cumulative score (obtained by adding
together the individual column scores) is no
smaller than the score of any of their sub-runs.
2. They are maximal with this property, i.e. they
are not contained in any longer run having
property 1.
infocon: This tool finds full runs of columns
with high information content in the given
alignment. To do this, each column is assigned an
intermediate score that measures its information
content, based on the frequencies of the letters
both within the column and within the alignment
as a whole.
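
The two run conditions used by infocon and phylogen can be checked directly with a brute-force Python sketch. It is written for clarity rather than speed and is not the tools' actual algorithm. Two things are assumptions of this example: runs with a non-positive total are discarded (otherwise a single negative column would trivially qualify), and the `min_len` parameter stands in for the user-selectable minimum region length. Runs are half-open index pairs (start, end).

```python
def satisfies_condition_1(scores: list[float], a: int, b: int) -> bool:
    """True if scores[a:b] has a positive total that no sub-run exceeds."""
    total = sum(scores[a:b])
    if total <= 0:  # assumption: only positive-scoring runs are of interest
        return False
    return all(sum(scores[i:j]) <= total
               for i in range(a, b) for j in range(i + 1, b + 1))


def maximal_runs(scores: list[float], min_len: int = 1) -> list[tuple[int, int]]:
    """Runs meeting condition 1 that are not contained in a longer such run."""
    n = len(scores)
    good = [(a, b) for a in range(n) for b in range(a + min_len, n + 1)
            if satisfies_condition_1(scores, a, b)]
    return [(a, b) for (a, b) in good
            if not any(c <= a and b <= d and (c, d) != (a, b) for (c, d) in good)]


column_scores = [1, -3, 2, 1, -1, 2, -4, 1]
print(maximal_runs(column_scores))             # [(0, 1), (2, 6), (7, 8)]
print(maximal_runs(column_scores, min_len=2))  # [(2, 6)]
```

In the example, columns 2 to 5 form one run even though column 4 scores -1, which is the linking of smaller conserved pieces that the text asks for.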

29
Five tools

phylogen: This program scores the columns by the
evolutionary relationships among the sequences of
the given alignment, as implied by a supplied
phylogenetic tree. The phylogenetic tree has a
leaf node for each species, and each internal
node represents a putative common ancestor of the
species in its sub-tree.

30
Five tools

kkno: This program scans the alignment to
determine, starting at each position, the longest
region in which no row differs from a specified,
known center sequence in more than k positions.

center  A C C G T G C A G
        1 2 3 4 5 6 7 8 9
human   A G C G T G C A C
rabbit  A C C G T A C A T
mouse   T C C G T A C A C
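
The kkno scan can be sketched in Python using the example alignment above. This sketch assumes the center sequence is already aligned column by column with the rows, as the example shows; the function name and the 0-based start positions are choices made for the illustration, not the tool's interface.

```python
def kkno(rows: list[str], center: str, k: int) -> list[int]:
    """For each start column, length of the longest region in which no row
    differs from the center sequence in more than k positions."""
    n = len(center)
    lengths = []
    for start in range(n):
        mismatches = [0] * len(rows)
        end = start
        while end < n:
            updated = [m + (row[end] != center[end])
                       for m, row in zip(mismatches, rows)]
            if max(updated) > k:  # some row would exceed k differences
                break
            mismatches = updated
            end += 1
        lengths.append(end - start)
    return lengths


center = "ACCGTGCAG"
rows = ["AGCGTGCAC",  # human
        "ACCGTACAT",  # rabbit
        "TCCGTACAC"]  # mouse
print(kkno(rows, center, k=2)[0])  # 8: mouse reaches a third difference at column 9
```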
kunk: This program is similar to kkno except
that the center sequence is not known a priori;
instead, the program computes the best center
sequence for each conserved region it finds. For
each column in the alignment, the algorithm
recursively examines all possible center
sequences starting at that position to see how
far the region can be extended, and backtracks
when the extension becomes impossible.