Title: Data Structures and Algorithms
1. Data Structures and Algorithms
2. XKCD
3. Outline
- Analyzing algorithms
- Designing Algorithms
- Profiling
- Heuristics
- Example: hash-based sequence alignment
4. Insertion Sort Pseudocode (review conventions: arrays, indentation, loops, logic, etc.)
- Input: array A[0..n-1]
- Insertion-Sort(A)
- 1 for j ← 1 to length[A]-1
- 2     key ← A[j]
- 3     // insert A[j] into the sorted sequence A[0..j-1]
- 4     i ← j-1
- 5     while i > -1 and A[i] > key
- 6         A[i+1] ← A[i]
- 7         i ← i-1
- 8     A[i+1] ← key
for loop convention: iterative or counting
while loop convention: do while expression is true
Cormen, Intro to Algs.
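The pseudocode above translates almost line for line into a runnable language. The following is a minimal Python sketch, not the course's own code; the function name `insertion_sort` is an assumption made for illustration, and the comments point back to the numbered pseudocode lines.

```python
def insertion_sort(a):
    """Sort list `a` in place, mirroring the 0-indexed pseudocode."""
    for j in range(1, len(a)):          # line 1: j = 1 .. length[A]-1
        key = a[j]                      # line 2
        i = j - 1                       # line 4
        # line 5: shift larger elements of the sorted prefix a[0..j-1] right
        while i > -1 and a[i] > key:
            a[i + 1] = a[i]             # line 6
            i -= 1                      # line 7
        a[i + 1] = key                  # line 8: drop key into the gap
    return a


print(insertion_sort([3, 2, 1]))
```

Everything left of index j is sorted before each pass of the for loop, which is the property the while loop preserves.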
5. Insertion Sort
- 1) Design algorithm (as opposed to Bubble Sort)
- 2) Implement algorithm
Left of key is sorted; right of key is unsorted
- 1 for j ← 1 to length[A]-1
- 2     key ← A[j]
- 3     // insert A[j] into the sorted sequence A[0..j-1]
- 4     i ← j-1
- 5     while i > -1 and A[i] > key
- 6         A[i+1] ← A[i]
- 7         i ← i-1
- 8     A[i+1] ← key
6. Analyzing Algorithms
- predicting resources that an algorithm requires
- memory
- communication bandwidth
- logic gates
- computational time (most often measured)
- In other words, how many steps does Insertion Sort take to complete?
7. Analyzing Insertion Sort
- time taken by Insertion Sort depends on input: sorting 1000 numbers takes longer than sorting 3 numbers
- can take different amounts of time to sort 2 input sequences of the same size. Why?
- in general, the time taken by an algorithm grows with the size of the input
- describe the running time of a program as a function of the size of input
8. Analyzing Insertion Sort
- running time
- function of number of steps executed
- assume a constant amount of time is required to
execute each line of pseudocode
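9. Analyzing Insertion Sort
- Insertion-Sort(A)                           cost   times
- 1 for j ← 1 to length[A]-1                  c1     n
- 2     key ← A[j]                            c2     n-1
- 3     // insert A[j]                        0
- 4     i ← j-1                               c4     n-1
- 5     while i > -1 and A[i] > key           c5     $\sum_{j=1}^{n-1} t_j$
- 6         A[i+1] ← A[i]                     c6     $\sum_{j=1}^{n-1} (t_j - 1)$
- 7         i ← i-1                           c7     $\sum_{j=1}^{n-1} (t_j - 1)$
- 8     A[i+1] ← key                          c8     n-1
- t_j is the number of times the while test on line 5 runs for a given j
- worst case (t_j = j+1): $\sum_{j=1}^{n-1} t_j = \frac{n(n+1)}{2} - 1$
Algorithms, Cormen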
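10. Analyzing Insertion Sort
- $T(n) = c_1 n + c_2(n-1) + c_4(n-1) + c_5\left(\frac{n(n+1)}{2} - 1\right) + c_6\frac{n(n-1)}{2} + c_7\frac{n(n-1)}{2} + c_8(n-1)$
- $= \left(\frac{c_5}{2} + \frac{c_6}{2} + \frac{c_7}{2}\right)n^2 + \left(c_1 + c_2 + c_4 + \frac{c_5}{2} - \frac{c_6}{2} - \frac{c_7}{2} + c_8\right)n - (c_2 + c_4 + c_5 + c_8)$
- $= k_1 n^2 + k_2 n - k_3$
- Θ(n^2): asymptotic upper and lower bound
- O(n^2): asymptotic upper bound
- Typical complexities
- O(1) < O(log n) < O(n) < O(n log n) < O(n^2) < O(n^3) < O(2^n)
Linear time: O(n)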
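The worst-case count behind the c5 term can be checked empirically. The sketch below is an illustration only, with the helper name `count_while_tests` assumed. It counts how often the while test is evaluated on a descending input and prints the result next to the closed form n(n+1)/2 - 1.

```python
def count_while_tests(values):
    """Insertion-sort a copy of `values`; count evaluations of the while test."""
    a = list(values)
    tests = 0
    for j in range(1, len(a)):
        key = a[j]
        i = j - 1
        while True:
            tests += 1                          # one evaluation of the while test
            if not (i > -1 and a[i] > key):
                break
            a[i + 1] = a[i]
            i -= 1
        a[i + 1] = key
    return tests


for n in (1, 2, 3, 10, 100):
    worst = list(range(n, 0, -1))               # descending input: worst case
    print(n, count_while_tests(worst), n * (n + 1) // 2 - 1)
```

On already sorted input the same counter returns n-1, one failed test per value of j, which is why the running time depends on the input and not only on its size.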
11. Insertion Sort Observations
- numbers are sorted in place
- numbers are rearranged within the array, with at most a constant number of them stored outside of the array at any time
- Run time depends on input
- the number of operations for the following 3 sets will vary greatly due to level of pre-sortedness
- a descending sorted order will actually take more operations than a random ordering
- algorithm complexity analysis allows us to place upper and lower asymptotic bounds for comparison
3 5 6 7 9 8 10 15 20 30 69
3 5 6 7 9 8 10 15 1 12 20
20 12 15 10 9 8 7 6 5 3 1
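To make the effect of pre-sortedness concrete, the sketch below counts element shifts (executions of A[i+1] ← A[i]) for the three sets listed above. The helper `count_shifts` and the variable names are assumptions for illustration.

```python
def count_shifts(values):
    """Insertion-sort a copy of `values`; return the number of element shifts."""
    a = list(values)
    shifts = 0
    for j in range(1, len(a)):
        key = a[j]
        i = j - 1
        while i > -1 and a[i] > key:
            a[i + 1] = a[i]
            shifts += 1
            i -= 1
        a[i + 1] = key
    return shifts


nearly_sorted = [3, 5, 6, 7, 9, 8, 10, 15, 20, 30, 69]
partly_sorted = [3, 5, 6, 7, 9, 8, 10, 15, 1, 12, 20]
mostly_descending = [20, 12, 15, 10, 9, 8, 7, 6, 5, 3, 1]

for name, data in [("nearly sorted", nearly_sorted),
                   ("partly sorted", partly_sorted),
                   ("mostly descending", mostly_descending)]:
    print(name, count_shifts(data))
```

The three inputs have the same length, yet they need 1, 10, and 54 shifts respectively; the maximum for 11 elements is 55.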
12. Insertion Sort Profile
- A = 3 2 1
- Step  j  key  i   A[i]   A[i+1]
- 1     1
- 2     1  2
- 3     1  2    0   3      2
- 4     0 > -1 and 3 > 2 ? true
- 5     1  2    0   3      3
-       A = 3 3 1
- 6     1  2    -1  Undef  3
- 4     -1 > -1 and Undef > 2 ? false
- 7     1  2    -1  Undef  2
-       A = 2 3 1
1 for j ← 1 to length[A]-1
2     key ← A[j]
3     i ← j-1
4     while i > -1 and A[i] > key
5         A[i+1] ← A[i]
6         i ← i-1
7     A[i+1] ← key
13. What about naïve.pl?
snt = array of subject nucleotides                      c1
qnt = array of query nucleotides                        c2
for i = 0 to length(subject) - length(query)            c3 · n
    j = 0                                               c4 · n
    while (snt[i+j] == qnt[j])                          c5 · n · n
        j = j + 1                                       c6 · n · n
        if (j == length(query))                         c7 · n · n
            found sequence at position i                c8 · n · n
end
=> O(n^2)
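For comparison, here is a brute-force matcher written as a small Python sketch. It is not the course's naïve.pl: the function name `naive_search` is assumed, and the inner loop carries an explicit bound on j, which the pseudocode above does not show.

```python
def naive_search(subject, query):
    """Return every index i where `query` occurs in `subject` (brute force)."""
    hits = []
    n, m = len(subject), len(query)
    for i in range(n - m + 1):              # every possible start position
        j = 0
        # compare position by position; stop at the first mismatch
        # or once the whole query has matched (j == m)
        while j < m and subject[i + j] == query[j]:
            j += 1
        if j == m:
            hits.append(i)
    return hits


print(naive_search("AATAAT", "AAT"))
```

With a subject of length n and a query of length m the two nested loops perform at most (n - m + 1) · m comparisons, which is the quadratic behaviour summarised as O(n^2) on the slide.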
14. Definitions
- procedural programming languages tend to be action oriented (as opposed to Object Oriented Programming, OOP)
- subroutine: a collection of high-level programming language operations
- procedure (Pascal: did not return a value)
- function (Pascal: did return a value)
15.
Machine Instructions: At the lowest level, every program consists of primitive machine instructions. Example: move.L D0, 20004
Language Statements: High-level languages consist of statements that perform one or more machine instructions. Example: i = k + 9
Subroutines: Subroutines consist of groups of language statements. Example: $sequence = print_formated_sequence(@qnts, $i)
Programs: Programs consist of groups of subroutines.
C: A Software Engineering Approach, Darnell
16. Subroutines
- programs are developed with layers of functions
- lower-level functions perform simple operations
- higher-level functions are created from lower-level functions
- analogous to abbreviations for long and complicated sets of commands
- defined once, but invoked many times
- ease of change
- modular and re-usable
- enhanced reliability (complicated tasks broken into simpler ones)
- improved readability
- with low-level details of algorithm compartmentalized, an algorithm may be easier to read, understand, and modify
- good rule of thumb: if your subroutine spans more than 1 printed page, I would expect at least 1 bug
17. Bioinformatics example
- Optimal sequence alignment (allowing for gaps and
substitutions in either query or subject sequence)
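18. Heuristics
- What do you do when faced with an NP-complete problem, or problem size where algorithm takes too long?
- Example: want to compare 2 genomes (brute force)
- naïve.pl: O(n^2)
- $3\times10^{9} \times 3\times10^{9} \times 1\times10^{-9}\ \mathrm{s} = 9\times10^{9}\ \mathrm{s} \approx 285$ years (two genomes of 3 billion nucleotides, at $10^{-9}$ s per comparison)
- Alternative
- hash k-tuples of nucleotides to a number, and compare numbers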
19. Hash-Based Alignment
- base-10 numbers
- $5805 = 5\cdot10^3 + 8\cdot10^2 + 0\cdot10^1 + 5\cdot10^0$
- k = 8
- ATGCCTGGGCT
- A=0, C=1, G=2, and T=3 (base 4 number)
- ATGCCTGG = $0\cdot4^7 + 3\cdot4^6 + 2\cdot4^5 + 1\cdot4^4 + 1\cdot4^3 + 3\cdot4^2 + 2\cdot4^1 + 2\cdot4^0 = 14714$
- Now we can compare chunks of sequence much faster
- speed increase by factor of 8
- Can pre-compute hashes for entire genome, and only compare hashes
- Premise for popular alignment tools BLAST, BLAT, and UIcluster
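The base-4 encoding can be sketched in a few lines of Python. The names `CODE`, `kmer_hash`, and `index_kmers` are assumptions for illustration, and the pre-computed index shown is a plain dictionary, not the data structure used by BLAST, BLAT, or UIcluster.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}     # encoding from the slide


def kmer_hash(kmer):
    """Read a nucleotide string as a base-4 number."""
    value = 0
    for nt in kmer:
        value = value * 4 + CODE[nt]
    return value


def index_kmers(sequence, k=8):
    """Map each k-tuple hash to the list of positions where it starts."""
    index = {}
    for i in range(len(sequence) - k + 1):
        index.setdefault(kmer_hash(sequence[i:i + k]), []).append(i)
    return index


print(kmer_hash("ATGCCTGG"))                # 14714, matching the slide
print(index_kmers("ATGCCTGGGCT"))
```

Once the subject is indexed this way, looking up a query k-tuple is a single dictionary access on one integer instead of up to k character comparisons.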
20. Heuristics
- Usually a trade off
- In sequence hashing example
- accuracy is traded for speed
- you cannot match/find sequences shorter than 8 nucleotides
- How do you find optimal k-tuple?
- depends on question
- empirically
21. End
22. Overly simple example of compartmentalizing
- Count the number of nucleotides in a file.
- open file
- while there is more sequence
- read a nucleotide
- increment count
- print nt count
- close file
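The outline above maps onto one small function. This Python sketch makes two assumptions that the slide does not state: the file is plain text, and only the characters A, C, G, T (either case) count as nucleotides. The name `count_nucleotides` and the temporary file content are hypothetical.

```python
import os
import tempfile


def count_nucleotides(path):
    """Count A/C/G/T characters in a plain-text sequence file."""
    count = 0
    with open(path) as handle:              # open file (closed automatically)
        for line in handle:                 # while there is more sequence
            for ch in line:                 # read a nucleotide
                if ch.upper() in "ACGT":
                    count += 1              # increment count
    return count


# Usage with a throwaway file (hypothetical content)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ATGC\nCTGG\n")
print(count_nucleotides(tmp.name))          # print nt count
os.remove(tmp.name)
```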
23. Another Example (divide and conquer)
- Find the average intron size for all human genes
- Get human genome
- Get genes
- Find indices of exons/introns
- Size = index2 - index1
- Tabulate and average
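The last three steps can be sketched once the exon coordinates are in hand. The example below assumes each gene is given as a list of (start, end) exon pairs with half-open coordinates, so an intron's size is the next exon's start minus the previous exon's end; the function names and the coordinates are made up for illustration.

```python
def intron_sizes(exons):
    """Given one gene's exons as (start, end) pairs, return intron lengths.

    Assumes half-open coordinates (end is one past the last exon base),
    so each intron size is next_start - previous_end.
    """
    ordered = sorted(exons)
    return [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]


def average_intron_size(genes):
    """Average intron size over a dict {gene_name: [(start, end), ...]}."""
    sizes = [s for exons in genes.values() for s in intron_sizes(exons)]
    return sum(sizes) / len(sizes) if sizes else 0.0


genes = {  # made-up coordinates, for illustration only
    "geneA": [(100, 200), (300, 400), (1000, 1100)],
    "geneB": [(50, 80), (180, 260)],
}
print(average_intron_size(genes))
```

Each function handles one step of the outline, so a change in the coordinate convention only touches `intron_sizes`.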
24. Recursion
- recursion: an object partially consists of, or is defined in terms of, itself
- examples
- mirrors
- video camera pointed at a television
- factorial function for non-negative integers
- n!
- a) 0! = 1
- b) if n > 0, then n! = n(n-1)!
- 3! = 3(2)! = 3(2(1!)) = 3(2(1(0!))) = 3(2(1(1))) = 6
25. Recursion
- power is in ability to define an infinite set of objects by a finite statement
- tool for expressing a program recursively is the subroutine (procedure/function)
- directly recursive: subroutine P contains reference to itself
- indirectly recursive: P contains reference to another subroutine Q, which contains a (direct or indirect) reference to P
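The two kinds of recursion can be shown side by side. In this Python sketch `factorial` plays the role of a directly recursive P, while `is_even` and `is_odd` play P and Q in the indirect case; the even/odd pair is an assumed example, not one from the slides, and all three functions expect a non-negative integer.

```python
def factorial(n):
    """Directly recursive: factorial refers to itself."""
    if n == 0:                      # base case a) 0! = 1
        return 1
    return n * factorial(n - 1)     # case b) n! = n(n-1)!


def is_even(n):
    """Indirectly recursive: is_even refers to is_odd, which refers back."""
    return True if n == 0 else is_odd(n - 1)


def is_odd(n):
    """Counterpart of is_even; each call hands n-1 to the other function."""
    return False if n == 0 else is_even(n - 1)


print(factorial(3), is_even(10), is_odd(10))
```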
26.
#!/usr/bin/perl
# simple example perl program to calculate the factorial of a number
# using (gasp) "Recursion"
# BUG found
print "Enter integer number to determine factorial:";
$i = <STDIN>;       # get number
chomp($i);          # remove "newline"
$i = int $i;        # removes any decimals
if ($i < 0) {
    die("Error: input must be non-negative integer");
}
$j = Fact($i);
print "($i)!=";
print "$j\n";
# end of program

sub Fact {
    my $num = shift;    # How can $num be N, and then N-1, then N-2, etc.????
    print "num = $num\n";
    my $new_num = $num - 1;
    if ($new_num <= 0) {    # base case; <= 0 so that 0! also returns 1
        return(1);
    } else {
        my $fact = $num * Fact($new_num);
        return($fact);
    }
}
27. Program Iterations or Profile
n = 5
tabraun@texas fact ./fact-test.pl
Enter integer number to determine factorial:5
num = 5
num = 4
num = 3
num = 2
num = 1
(5)!=120
28. Recursion
- Cut and paste () here
- Cut and paste (Cut and paste () here) here
- Cut and paste (Cut and paste (Cut and paste () here) here) here
- Etc.
29. Variable Scope
- global variables: variables that are accessible/visible from any part of a program
- local variables: accessible to a limited portion of the program
- ensures that variables are not unintentionally manipulated
- perl
- variables are always global unless you specify otherwise
- my $variable_name specifies a local variable
- scope usually refers to blocks of code
- for loop
- while loop from insertion sort
- example: scope.pl
30.
#!/usr/bin/perl
$i = 5;
print "i=$i\n";
{
    print "i=$i\n";
    $i = 3;
    print "i=$i\n";
}
print "i=$i\n";
31. Can we do better than Insertion Sort O(n^2)?
- Merge-Sort(A,p,r)
- 1 if p < r
- 2     q ← ⌊(p+r)/2⌋
- 3     Merge-Sort(A,p,q)
- 4     Merge-Sort(A,q+1,r)
- 5     Merge(A,p,q,r)
- 6 return
- Divide and conquer example: an array of length 1 is already sorted.
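The slide leaves Merge(A,p,q,r) undefined. The following Python sketch fills it in with one common approach (copy the two sorted runs, then merge them back into place); the slicing-based `merge` is an assumption made for brevity, not necessarily the version used in the course.

```python
def merge(a, p, q, r):
    """Merge the sorted runs a[p..q] and a[q+1..r] (inclusive) in place."""
    left, right = a[p:q + 1], a[q + 1:r + 1]
    i = j = 0
    k = p
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            a[k] = left[i]
            i += 1
        else:
            a[k] = right[j]
            j += 1
        k += 1
    # copy whichever run still has elements left
    a[k:r + 1] = left[i:] + right[j:]


def merge_sort(a, p=0, r=None):
    """Sort a[p..r] in place, following the Merge-Sort(A,p,r) pseudocode."""
    if r is None:
        r = len(a) - 1
    if p < r:
        q = (p + r) // 2                    # floor of the midpoint
        merge_sort(a, p, q)
        merge_sort(a, q + 1, r)
        merge(a, p, q, r)
    return a


print(merge_sort([5, 2, 4, 6, 1, 3, 2, 6]))
```

The recursion stops when p equals r, that is, at runs of length 1, which need no sorting.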
32. Merge-Sort Split Steps
5 2 4 6 1 3 2 6
5 2 4 6 | 1 3 2 6
5 2 | 4 6 | 1 3 | 2 6
5 | 2 | 4 | 6 | 1 | 3 | 2 | 6
Merge-sort changes the problem from one of sorting numbers to one of simply combining stacks of numbers that are already sorted. At the leaf level of this graph, individual numbers are already sorted (because a single number by itself is sorted).
33. Merge-Sort Analysis
O(n log2 n): can we do better?
34. Algorithmic Concepts
- Greedy algorithms
- always make the choice that looks best at the moment
- make a locally optimal choice in the hope that this choice will lead to a globally optimal solution
- do not always yield optimal solutions
35. Knapsack Problem
- A thief finds n items
- item i is worth v_i dollars and weighs w_i pounds, where v_i and w_i are integers
- thief wants to maximize value, but is limited to W pounds
- What items should thief take?
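A small instance shows how a greedy choice can miss the best answer. The sketch below assumes the 0-1 version of the problem (each item is taken whole or not at all) and one particular greedy rule, highest value per pound first; the item values and weights are made up, and `best_knapsack` simply tries every subset, which is only practical for a handful of items.

```python
from itertools import combinations


def greedy_knapsack(items, capacity):
    """Take items in order of value/weight ratio while they still fit."""
    total_value = total_weight = 0
    by_ratio = sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True)
    for value, weight in by_ratio:
        if total_weight + weight <= capacity:
            total_value += value
            total_weight += weight
    return total_value


def best_knapsack(items, capacity):
    """Exhaustive search over all subsets (only feasible for small n)."""
    best = 0
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            if sum(w for _, w in subset) <= capacity:
                best = max(best, sum(v for v, _ in subset))
    return best


items = [(60, 10), (100, 20), (120, 30)]    # (v_i dollars, w_i pounds), made up
print(greedy_knapsack(items, 50), best_knapsack(items, 50))
```

With W = 50 the greedy rule takes the first two items for 160 dollars, while the best load is the last two items for 220 dollars.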
36. Knapsack
37. NP-Completeness and NP-Hard
- polynomial-time algorithms
- naïve, insertion-sort, merge-sort, fact
- O(n^k)
- Can all problems be solved in polynomial time?
- no
- for the class of problems called NP-Complete, no polynomial-time algorithm is known
- these problems are regarded as intractable
- valuable to know when a problem is NP-Complete so that you do not waste time attempting to develop an efficient exact solution
- approach is to look for approximation of solution
38. NP-Complete example
- Traveling-salesman problem
- a salesman must visit N cities
- wants to visit every city exactly once
- wants to minimize travel distance
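The gap between an exact answer and an approximation can be seen on a tiny instance. This Python sketch assumes a closed tour (the salesman returns to the starting city, which the slide does not state) and uses nearest-neighbour as one possible heuristic; the four city positions on a line are made up so that the distances are easy to check by hand.

```python
from itertools import permutations


def tour_length(tour, dist):
    """Length of the closed tour that returns to its starting city."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def brute_force_tsp(dist):
    """Try every ordering of cities 1..N-1 after city 0: (N-1)! tours."""
    cities = list(range(1, len(dist)))
    return min(tour_length([0] + list(p), dist) for p in permutations(cities))


def nearest_neighbour_tsp(dist):
    """Heuristic: always move to the closest unvisited city."""
    tour, unvisited = [0], set(range(1, len(dist)))
    while unvisited:
        nxt = min(unvisited, key=lambda c: dist[tour[-1]][c])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour_length(tour, dist)


positions = [0, 1, -2, 5]                   # made-up cities on a line
dist = [[abs(a - b) for b in positions] for a in positions]
print(brute_force_tsp(dist), nearest_neighbour_tsp(dist))
```

Here the exhaustive search finds a tour of length 14 and the heuristic returns 16. The heuristic runs in polynomial time, while the exhaustive search examines (N-1)! tours.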
39. Dynamic Programming
- like divide-and-conquer, DP solves problems by combining the solutions of subproblems (Programming refers to a tabular method, NOT writing computer code.)
- D and C generally have independent subproblems
- DP is most applicable when subproblems are not independent
- i.e., D and C does more work than necessary, repeatedly solving the common subsubproblems
- DP solves each subsubproblem once, and saves the results in a table.
- work is avoided since the answer does not have to be recomputed every time the subsubproblem is encountered
- Example
- local sequence alignment -- Smith-Waterman (the global counterpart is Needleman-Wunsch)
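The table-filling idea can be illustrated with a score-only version of Smith-Waterman, which is usually described as a local alignment method. The scoring values below (match +1, mismatch -1, gap -1) are assumptions chosen for readability, and the sketch returns only the best score, not the alignment itself.

```python
def smith_waterman_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of s and t, filled in row by row."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]   # each subproblem solved once
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            table[i][j] = max(
                0,                              # start a new local alignment
                table[i - 1][j - 1] + pair,     # align s[i-1] with t[j-1]
                table[i - 1][j] + gap,          # gap in t
                table[i][j - 1] + gap,          # gap in s
            )
            best = max(best, table[i][j])
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell depends only on three neighbours that are already stored, so the whole table costs O(len(s) · len(t)) work instead of re-solving the same sub-alignments repeatedly.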
40. Assignment: Debugging naïve.pl
- Due
- bug exists in algorithm
- find input scenario where algorithm breaks
- Assignment 1
- Obtain naïve.pl (web)
- Execute it (with your perl)
- Alter input to determine bug
- Submit a version of the program with the 2 input lines that illustrate the bug
- @snt = (A, A, ...)
- @qnt = (A, T, ...)
- Describe a solution by inserting comments into the program
- Submit your altered and commented program to icon