Title: Tools and Softwares
1Tools and Softwares
- Group 13
- Shashi Ranjan
- Murali Sivaramakrishnan
- Ognjen Perisic
2Introduction
- Importance of aligning two sequences
- Identify the structure and function
- Evolutionary similarities
- Why use computers to do this alignment?
- Sequences are very, very, very long
- Need to evaluate many alignments
3Tools and Softwares
- BLAST
- PIP MAKER
- FASTA
- Motif Explorer
- MACAW
- Five Methods for finding
Conserved Segments.
4BLAST Overview Basic Local Alignment Search Tool
- Dynamic programming O(mn) !
- BLAST heuristically finds high scoring segment
pairs (HSP) - Identical length segments from 2 sequences with
high scores - I.e. ungapped local alignments
5BLAST Nuts Bolts
- Given query sequence q, word length w, word
score threshold T, segment score threshold S - Generate a word set (neighborhood words) that
score at least T when compared to words form q - Search the dB for matches to words in list, a
match indicates a possible alignment - Extend all matches to seek high-scoring segment
pairs and return those pairs
6(No Transcript)
7BLAST Extensions
- Gapped BLAST
- 3 Changes to the original BLAST
- Word extension criteria modified
- Allow gapped alignments, need to find just one
ungapped alignment - Use dynamic programming of Smith-Waterman to
produce final alignment
8BLAST Extensions
- Position Specific Iterated BLAST
- Basic Idea
- Use results from BLAST query to build a profile
- Search dB with profile instead of initial query
- Iterate
9BLAST Programs
- NCBI BLAST ---- http//ncbi.nlm.nih.gov/BLAST/
10PipMaker
- WWW site for comparing DNA sequences
- Identifies conserved sequences
- Provides graphical output in PDF or PS format
- Uses a variant of Gapped BLAST - BLASTz
- http//bio.cse.psu.edu
11Tools And Software
- FASTA
- Motif Explorer
- Accessibility
12FASTA Overview
- Given two sequences x,y and ksize of exactly
matching strings. - Tabulate the offset between x,y for all matching
strings of size k (k-tuples) - For offset with high frequency, try to combine
their k-tuples into regions - Extend the best regions without gaps as long as
score improves. (using the substitution matrix
such as PAM250) (init1) - Check if un-gapped regions can be combined with a
small gapped region. (initn used to rank the
library sequences.) - Construct an optimal alignment of the highest
ranking seq. (opt)
13FASTA- offset tabulation
- X HARFYAAQIVL
- Y VDMAAQIA
- K-tuple offsets of x (k1)
- A(2,6,7), I(9), , Q(8), V(10),
- Offset difference for each k-tuple of Y
V D M A A Q I A
9 -2 -3 2 2 -6
2 1 -2
3 2 -1
Offsets -6 -5 -4 -3 -2 -1 0 1 2 3 4 8 9
Frequency 1 1 2 1 1 4 1 1
14FASTA
15FASTA Package
FASTA scan a protein or DNA sequence library for similar sequences.
TFASTA compare a protein sequence to a DNA sequence library.
FASTX, FASTY compare a DNA sequence to a protein sequence database.
FASTF, TFASTF compare an ordered peptide mixture against a protein or translated DNA database.
FASTS, TFASTS compare a set of short peptide fragments against a protein or translated DNA database.
LFASTA, PLFASTA concentrates more on the local regions and reports more than on one sequence.
RDF2 for evaluating the statistical significance of a similarity score.
16Motif Explorer
17Motif Explorer
18Motif Explorer - Architecture
19Accessibility
- FASTA
- www.ebi.ac.uk/fasta33/
- http//bioweb.pasteur.fr/seqanal/interfaces/fasta.
html - http//fasta.bioch.virginia.edu/
- Motif Explorer
- www.cbc.med.umn.edu/gst/MotifExplorer.html
- www.arabidopsis.org/links/motif_search.html
20MACAWA Workbench for Multiple Alignment
Construction and Analysis
- Allows user to construct multiple protein
alignments by locating, analyzing, editing and
combining blocks of aligned sequence segments. - MACAW incorporate several features
- Regions of local similarity are located by a
search algorithm that avoids many of the
limitations of individual algorithms - The statistical significance of blocks of
similarity is evaluated using a mathematical
theory developed during paste 15 years - User can edit each block by moving its boundaries
or by eliminating particular segments, and blocks
may be linked to form a composite multiple
alignment.
21Theoretical boundaries
- Problem For a set of n protein sequences each of
length l, a region of similarity common to all
may begin anywhere in each sequence and it should
be compared to all other sequences, that is ln
alignments should be checked. This search space
is too large. - MACAW imposes a single condition on the alignment
it seeks that all segments show a minimal
amount of mutual similarity. This can examine all
of the search space in O(n2l2) time.
22Terminology
- For a set of n sequences, one subset of segments
of some specific length from each of m (mltn)
sequences forms a m-block, or simply a block. - Any set of m sequences locked into a specific
alignment, with no gaps allowed, is a m-diagonal
or simply a diagonal. - An aligned set of n amino acids is n-column or
a column.
23Algorithm
- Set of scores is used for comparison of two
proteins, with or without gaps. MACAW uses scores
for aligning amino acids called PAM-250. - The score of the block is the sum of the scores
assigned to each of its columns. - Score of the column is the sum of all pairwise
similarity scores of the amino acids it
comprises. Those SP scores are called Sum of
the Pairs. MACAW can use some different, more
biologically realistic set of scores. - Search routine seeks only blocks in which all
pairs of segments are contained in pairwise
subalignments with score greater than or equal to
some threshold T. T should be chosen so that
about 10 of the diagonals are marked. For n
sequences of total (aggregate) length L, each
sequence must be compared with others, and this
takes O(L2) time. This is the most time-consuming
step. - For any multiple diagonal, all implied
2-diagonals must have been marked during first
phase.
24Algorithm
- The basic problem is that homologies that exist
in a n-diagonal may be represented by n-blocks,
or by 4, 3 or 2 blocks or some other combinations
of disconnected blocks. To solve this problem
MACAW uses heuristic approach in two steps - Program searches for the highest 2-block in every
2-diagonal using threshold T and marks the amino
acid pairs it contains. - Columns are represented as graphs. Every vertex
is amino acid and edge between them exist if they
are marked in step 1. MACAW searches for the most
connected (with as many as possible edges) graphs
and connects them into one block. -
25Algorithm
- This procedure doesnt mark all the relations
between sequences, so MACAW allows user to edit
blocks. - Every amino acid is colored according to the
number of edges it is connected.
26Example
27Five experimental methods for finding conserved
sequences in multiple alignments of gene
regulatory regions
- A conserved character in DNA is one that was
probably present in the common ancestral species
and has been preserved in the contemporary
species being examined. - Two of the methods are already in common use
they are based on good column agreement and high
information content. - Three additional methods find blocks with minimal
evolutionary, blocks that differ in at most k
positions pre row from a center sequence that is
unknown a priori. The center sequence in the
latter two methods is a way to model potential
binding sites for known or unknown proteins in
DNA sequences or it is a common ancestor of the
species represented in the alignment. - Parameters common to all of the
tools/programs/algorithm - The minimum length of the regions to be reported
- The minimum number of sequences, which must be
active, are selectable by the user
28Five tools
- agree This utility locates regions in a given
alignment that have good column agreement. -
- The length of the region is often a reliable
indicator that some functionality was preserved
across the species, but conservation doesnt need
to be perfect and such regions might be
fragmented into conserved pieces too small to be
detected, so a systematic way to link the smaller
regions is needed. - The two utilities infocon and phylogen are trying
to solve this problem. The idea is to assign a
numerical score to each column and then look for
runs of columns meeting the following two
conditions - their cumulative score (obtained by adding
together the individual column scores) is no
smaller than the score of any of their sub-runs - they are maximal with this property, i.e. they
are not contained in any longer run having the
property 1. -
- The infocon tool finds full runs of columns with
high information content in the given alignment.
To do this, each column is assigned an
intermediate score that measures its information
content, based on the frequencies of the letter
both within the column and within the alignment
as whole. -
29Five tools
- phylogen This program scores the columns by
their evolutionary relationships among the
sequences of the given alignment implied by a
supplied phylogenetic tree. The phylogenetic tree
has a leaf node for each species and each
internal node represents a putative common
ancestor for the species in its sub-tree.
30Five tools
- kkno This program scans the alignment to
determine, starting at each position, the longest
region in which no row differs from a specified,
known center sequence in more than k positions. - center A C C G T G C A G
- 1 2 3 4 5 6 7 8 9
- human A G C G T G C A C
- rabbit A C C G T A C A T
- mouse T C C G T A C A C
- kunk This program is similar to kkno except
that the center sequence is not known a priori
instead, the program computes the best center
sequence for each conserved region it finds. For
each column in the alignment, the algorithm
recursively examines all possible center
sequences starting at that position to see how
far the region can be extended and back-tracks
when the extension becomes impossible.