1
Chapter 9: Text Processing
Pattern Matching; Data Compression
2
Outline and Reading
  • Strings (9.1.1)
  • Pattern matching algorithms
  • Brute-force algorithm (9.1.2)
  • Knuth-Morris-Pratt algorithm (9.1.4)
  • Regular Expressions and Finite Automata
  • Data Compression
  • Huffman Coding
  • Lempel-Ziv Compression

3
Motivation: Bioinformatics
  • The application of computer science techniques to
    genetic data
  • See Gene-Finding notes
  • Many interesting algorithm problems
  • Many interesting ethical issues!

4
Strings
  • A string is a sequence of characters
  • Examples of strings
  • Java program
  • HTML document
  • DNA sequence
  • Digitized image
  • An alphabet Σ is the set of possible characters
    for a family of strings
  • Examples of alphabets
  • ASCII
  • Unicode
  • {0, 1}
  • {A, C, G, T}
  • Let P be a string of size m
  • A substring P[i .. j] of P is the subsequence of
    P consisting of the characters with ranks between
    i and j
  • A prefix of P is a substring of the type P[0 .. i]
  • A suffix of P is a substring of the type P[i .. m - 1]
  • Given strings T (text) and P (pattern), the
    pattern matching problem consists of finding a
    substring of T equal to P (see the brute-force
    sketch below)
  • Applications
  • Regular expressions
  • Programming languages
  • Search engines
  • Biological research
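  • As a point of reference for the brute-force algorithm (9.1.2), a
    minimal Java sketch (the class and method names are my own):

    class BruteForce {
        // Brute-force pattern matching: try every starting rank i in T and
        // compare P character by character. O(nm) comparisons in the worst case.
        static int match(String T, String P) {
            int n = T.length(), m = P.length();
            for (int i = 0; i + m <= n; i++) {
                int j = 0;
                while (j < m && T.charAt(i + j) == P.charAt(j)) j++;
                if (j == m) return i;     // a substring of T equal to P starts at rank i
            }
            return -1;                    // no match
        }

        public static void main(String[] args) {
            System.out.println(match("GAGATATATATCATATG", "ATATC"));   // prints 7
        }
    }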

5
Pattern matching
  • Suppose you want to find repeated ATs followed by
    a G in GAGATATATATCATATG.
  • How do you express that pattern to find?
  • How can you find it efficiently?
  • What if the strings were billions of characters
    long?

6
Finite Automata and Regular Expressions
  • How do I match Perl-like regular expressions to
    text?
  • Important topic: regular expressions and finite
    automata.
  • To the theoretician: regular expressions are grammars
    that define regular languages
  • To the programmer: compact patterns for matching and
    replacing

7
Regular Expressions
  • A regular expression is one of
  • a literal character
  • a (regular expression) in parentheses
  • a concatenation of two REs
  • the alternation (or) of two REs, denoted | in
    formal notation
  • the closure of an RE, denoted * (i.e. 0 or more
    occurrences)
  • Possibly additional syntactic sugar
  • Examples
  • abracadabra
  • abra(cadabra)* matches abra, abracadabra,
    abracadabracadabra, ...
  • (ab|ac)d
  • (a(ab)b)
  • t(wo)?o (? means 0 or 1 occurrence in Perl)
  • aardvark+ (+ means 1 or more occurrences in Perl)

8
Finite Automata
  • Regular language: any language defined by an RE
  • Finite automata: machines that recognize regular
    languages.
  • Deterministic Finite Automaton (DFA)
  • a set of states including a start state and one
    or more accepting states
  • a transition function: given the current state and
    input letter, what's the new state?
  • Non-deterministic Finite Automaton (NFA)
  • like a DFA, but there may be
  • more than one transition out of a state on the
    same letter (Pick the right one
    non-deterministically, i.e. via lucky guess!)
  • epsilon-transitions, i.e. optional transitions on
    no input letter

9
REs in common use
  • Syntactic sugar
  • [a-cx-z] match one of a, b, c, x, y, z
  • [^abc] match a character that is not an a, b, or
    c
  • . match any character
  • ? match 0 or 1 instances of what preceded
  • \s match a whitespace character
  • ^, $ match the beginning or end of string
  • (pattern) make pattern available in
    substitutions as $1, $2, etc.
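  • The same syntactic sugar is available in Java's java.util.regex; the
    patterns below are my own illustrations, not from the slides:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class RegexSugarDemo {
        public static void main(String[] args) {
            System.out.println("b".matches("[a-cx-z]"));        // true: one of a, b, c, x, y, z
            System.out.println("d".matches("[^abc]"));          // true: not an a, b, or c
            System.out.println("to".matches("t(wo)?o"));        // true: ? makes (wo) optional
            System.out.println("GATTACA".matches("^G.*A$"));    // true: anchored at both ends
            Matcher m = Pattern.compile("(\\w+)\\s(\\w+)").matcher("hello world");
            if (m.matches())                                     // capture groups, usable like $1, $2
                System.out.println(m.group(2) + " " + m.group(1));
        }
    }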

10
Examples
  • Perl examples (and other languages)
  • $input =~ s/two?o/2/
  • $input =~ s|<link[^>]*>\s*||gs
  • $input =~ s|\s*@font-face\s*{.*?}||gs
  • $input =~ s|\s*mso-[^>"]*"|""|gis
  • $input =~ s/([^ ]*) ([^ ]*)/$2 $1/
  • $input =~ m/[0-9]+\.?[0-9]*\.[0-9]+/
  • ($word1, $word2, $rest) =
    ($foo =~ m/([^ ]*) ([^ ]*) (.*)/)
  • $input =~ s|<span[^>]*>\s*<br\s*clear="?all[^>]*>\s*
    </span>|<br clear="all"/>|gis

11
Multiples of 3?
  • A regular expression for decimal multiples of 3 (digit sum
    divisible by 3), built from the digit classes [0369], [147],
    and [258]; one equivalent form:
  • /^([0369]|[147]([0369]|[147][0369]*[258])*([258]|[147][0369]*[147])
    |[258]([0369]|[258][0369]*[147])*([147]|[258][0369]*[258]))*$/

12
DFA for (AT)*C
  • Note that a DFA can be represented as a 2D array:
    DFA[state][inputLetter] → newState
  • DFA
  • state  letter   newState
  • 0      A        1
  • 0      T,C,G    0
  • 1      T        2
  • 1      A,C,G    0
  • 2      C        4 (accept)
  • 2      G,T      0
  • 2      A        3
  • 3      T        2
  • 3      A,G,C    0
  • 4      A,G,C,T  0
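  • A minimal Java sketch of driving that table (the array layout and
    letter-to-column mapping are my own reading of the slide):

    class DfaDemo {
        // Rows are states 0-4; columns are the letters A, C, G, T; state 4 accepts.
        static final int[][] DFA = {
            /* 0 */ {1, 0, 0, 0},
            /* 1 */ {0, 0, 0, 2},
            /* 2 */ {3, 4, 0, 0},
            /* 3 */ {0, 0, 0, 2},
            /* 4 */ {0, 0, 0, 0},
        };

        static boolean accepts(String input) {
            int state = 0;
            for (char c : input.toCharArray())
                state = DFA[state]["ACGT".indexOf(c)];   // one table lookup per input letter
            return state == 4;                           // did we end in the accepting state?
        }

        public static void main(String[] args) {
            System.out.println(accepts("ATATC"));   // true
            System.out.println(accepts("ATG"));     // false
        }
    }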

13
RE → NFA
  • Given a Regular Expression, how can I build a
    DFA?
  • Work bottom up.
  • Letter
  • Concatenation
  • Or
  • Closure

14
RE → NFA Example
  • Construct an NFA for the RE (AB|AC)D
  • A
  • A
  • AB
  • AB|AC
  • (AB|AC)D

15
NFA -gt DFA
  • Keep track of the set of states you are in.
  • On each new input letter, compute the new set of
    states you could be in.
  • The set of states for the DFA is the power set of
    the NFA states.
  • I.e. up to 2^n states, where there were n in the
    NFA (see the subset-construction sketch below).
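  • A Java sketch of the subset construction; the NFA representation
    (integer states, per-state letter and epsilon transition maps) is an
    assumption, since the slide only describes the idea:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class SubsetConstruction {
        List<Map<Character, Set<Integer>>> moves;   // moves.get(s).get(c) = states reachable from s on c
        List<Set<Integer>> eps;                     // eps.get(s) = epsilon-transitions out of s

        // Epsilon-closure: every state reachable from 'states' using only epsilon-moves.
        Set<Integer> closure(Set<Integer> states) {
            Deque<Integer> work = new ArrayDeque<>(states);
            Set<Integer> result = new HashSet<>(states);
            while (!work.isEmpty())
                for (int t : eps.get(work.pop()))
                    if (result.add(t)) work.push(t);
            return result;
        }

        // One step of the simulated DFA: from a set of NFA states, read letter c.
        // Each such set is one DFA state, so there can be up to 2^n of them.
        Set<Integer> step(Set<Integer> states, char c) {
            Set<Integer> next = new HashSet<>();
            for (int s : states)
                next.addAll(moves.get(s).getOrDefault(c, Set.of()));
            return closure(next);
        }
    }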

16
Recognizing Regular Languages
  • Suppose your language is given by a DFA. How to
    recognize?
  • Build a table. One row for every (state, input
    letter) pair. Give resulting state.
  • For each letter of input string, compute new
    state
  • When done, check whether the last state is an
    accepting state.
  • Runtime?
  • O(n), where n is the number of input letters
  • Another approach: use a C program to simulate the
    NFA with backtracking. Less space, more time.

17
Data Compression Intro
  • Suppose you have a text, abracadabra. Want to
    compress it.
  • How many bits required?
  • at 3 bits per letter, 33 bits.
  • Can we do better?
  • How about variable length codes?
  • In order to be able to decode the file again, we
    would need a prefix code: no code is the prefix
    of another.
  • How do we make a prefix code that compresses the
    text?

18
Huffman Coding
  • Note: Put the letters at the leaves of a binary
    tree. Left = 0, Right = 1. Voila! A prefix code.
  • Huffman coding: an optimal prefix code
  • Algorithm: use a priority queue.
  • insert all letters according to frequency
  • if there is only one tree left, done.
  • else, a = deleteMin(); b = deleteMin()
  • make tree t out of a and b with weight
    a.weight() + b.weight()
  • insert(t)
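  • A minimal Java sketch of that priority-queue loop (the Node class
    and names are my own):

    import java.util.Map;
    import java.util.PriorityQueue;

    class Huffman {
        // A tree node: a leaf holds a letter; an internal node only carries a combined weight.
        static class Node {
            final int weight; final char letter; final Node left, right;
            Node(char letter, int weight, Node left, Node right) {
                this.letter = letter; this.weight = weight; this.left = left; this.right = right;
            }
        }

        static Node buildTree(Map<Character, Integer> freq) {
            PriorityQueue<Node> pq = new PriorityQueue<>((a, b) -> a.weight - b.weight);
            freq.forEach((c, w) -> pq.add(new Node(c, w, null, null)));   // insert all letters
            while (pq.size() > 1) {
                Node a = pq.poll(), b = pq.poll();                        // two deleteMin()s
                pq.add(new Node('\0', a.weight + b.weight, a, b));        // merged tree t
            }
            return pq.poll();   // root; label left edges 0 and right edges 1 to read off the code
        }
    }

  • For the abracadabra frequencies on the next slide,
    buildTree(Map.of('a', 5, 'b', 2, 'c', 1, 'd', 1, 'r', 2)) yields a
    tree whose total encoded length is 23 bits (individual code
    assignments can differ with tie-breaking).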

19
Huffman coding example
  • abracadabra frequencies:
  • a: 5, b: 2, c: 1, d: 1, r: 2
  • Huffman code:
  • a: 0, b: 100, c: 1010, d: 1011, r: 11
  • bits: 5·1 + 2·3 + 1·4 + 1·4 + 2·2 = 23
  • Follow the tree to decode: Θ(n)
  • Time to encode?
  • Compute frequencies: O(n)
  • Build heap: O(1), assuming the alphabet has constant
    size
  • Encode: O(n)

20
Huffman coding summary
  • Huffman coding is very frequently used
  • (You use it every time you watch HDTV or listen
    to an mp3, for example)
  • Text files often compress to 60% of original size
    (depending on entropy)
  • In real life, Huffman coding is usually used in
    conjunction with a modeling algorithm

21
Data compression overview
  • Two stages: modeling and entropy coding
  • Modeling: break up input into tokens or chunks
    (the bigger, the better)
  • Entropy coding: use shorter bit strings to
    represent more frequent tokens
  • If P is the probability of a code element, the
    optimal number of bits is -lg(P)
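  • A quick check of that formula (the class and method names are my own):

    class CodeLength {
        // Optimal code length for a token with probability p is -lg(p) bits.
        static double optimalBits(double p) {
            return -Math.log(p) / Math.log(2);
        }
        // optimalBits(0.5)      = 1.0 bit
        // optimalBits(0.25)     = 2.0 bits
        // optimalBits(5.0 / 11) ≈ 1.14 bits  (the 'a' in abracadabra)
    }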

22
Lempel-Ziv Modeling
  • Consider compressing text
  • Certain byte strings are more frequent than
    others: the, and, tion, es, etc. Model these with
    single tokens
  • Build a dictionary of the byte strings you see;
    the second time you see a byte string, use the
    dictionary entry

23
Lempel-Ziv Compression
  • Start with a dictionary of 256 entries for the
    first 256 characters
  • At each step,
  • Output the code of the longest dictionary match
    and delete those characters from input
  • Add the previous token plus the last letter as a new
    dictionary entry, with codes 256, 257, 258, ...
  • Note that code lengths grow by one bit as
    dictionary reaches size 512, 1024, 2048, etc.
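  • A Java sketch of that loop in the LZW style, assuming fixed-width
    integer codes (the variable-width code growth at dictionary sizes
    512, 1024, 2048 is left out):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class LempelZiv {
        static List<Integer> compress(String input) {
            Map<String, Integer> dict = new HashMap<>();
            for (int c = 0; c < 256; c++)                      // codes 0..255: single characters
                dict.put(String.valueOf((char) c), c);
            List<Integer> output = new ArrayList<>();
            int nextCode = 256;
            String match = "";
            for (char c : input.toCharArray()) {
                String extended = match + c;
                if (dict.containsKey(extended)) {
                    match = extended;                          // keep growing the longest match
                } else {
                    output.add(dict.get(match));               // output code of longest dictionary match
                    dict.put(extended, nextCode++);            // previous token plus last letter
                    match = String.valueOf(c);                 // restart the match at this letter
                }
            }
            if (!match.isEmpty()) output.add(dict.get(match)); // flush the final match
            return output;
        }
    }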

24
Lempel-Ziv Example
  • ABRACADABRA
  • Output Add to Dictionary
  • 1 (A)
  • 2 (B) AB
  • 5 (R) BR
  • 1 (A) RA
  • 3 (C) AC
  • 1 (A) CA
  • 4 (D) AD
  • 6 (AB) DA
  • 8 (RA) ABR

Dictionary
  1. A
  2. B
  3. C
  4. D
  5. R
  6. AB
  7. BR
  8. RA
  9. AC
  10. CA
  11. AD
  12. DA
  13. ABR

25
Lempel-Ziv Variations
  • Compression utilities like zip and gzip use
    variations on Lempel-Ziv
  • Possible variations
  • Fixed-length vs. variable-length codes, or
    adaptive Huffman or arithmetic coding
  • Don't add duplicate entries to the dictionary
  • Limit the number of codes or switch to larger
    ones as needed
  • Delete less frequent dictionary entries or give
    frequent entries shorter codes

26
How about this approach?
  • Repeat
  • for each letter pair occurring in the text, try
  • replace the pair with a single new token
  • measure the total entropy (Huffman-compressed
    size) of the file
  • if that letter pair resulted in the greatest
    reduction in entropy so far, remember it
  • permanently substitute the new token for the pair
    that caused the greatest reduction in entropy
  • until no more reductions in entropy are possible
  • Results: compression to about 25% for big books;
    better than gzip and zip, but not as good as bzip!
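  • This is essentially byte-pair encoding. A simplified Java sketch
    that substitutes the most frequent pair each round (the approach
    above instead measures the actual drop in Huffman-compressed size
    for each candidate pair):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class PairSubstitution {
        static List<Integer> compress(String text, int rounds) {
            List<Integer> tokens = new ArrayList<>();
            for (char c : text.toCharArray()) tokens.add((int) c);   // one token per character
            int nextToken = 256;
            for (int r = 0; r < rounds; r++) {
                Map<Long, Integer> counts = new HashMap<>();          // count adjacent token pairs
                for (int i = 0; i + 1 < tokens.size(); i++)
                    counts.merge(((long) tokens.get(i) << 32) | tokens.get(i + 1), 1, Integer::sum);
                long best = counts.entrySet().stream()
                        .max(Map.Entry.comparingByValue()).map(Map.Entry::getKey).orElse(-1L);
                if (best == -1L || counts.get(best) < 2) break;       // no pair worth replacing
                int a = (int) (best >> 32), b = (int) best;
                List<Integer> next = new ArrayList<>();
                for (int i = 0; i < tokens.size(); i++) {
                    if (i + 1 < tokens.size() && tokens.get(i) == a && tokens.get(i + 1) == b) {
                        next.add(nextToken); i++;                     // substitute the new token
                    } else next.add(tokens.get(i));
                }
                tokens = next; nextToken++;
            }
            return tokens;   // the token stream would then be entropy-coded for the final size
        }
    }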

27
Compression: other data
  • Modeling for audio?
  • Modeling for images?

28
Modeling for Images?
  • (image from Wikipedia)

29
JPEG, etc.
  • Modeling: convert to the frequency domain with the
    DCT
  • Throw away some high-frequency components
  • Throw away imperceptible components
  • Quantize coefficients
  • Encode the remaining coefficients with Huffman
    coding
  • Results: up to 20:1 compression with good
    results, 100:1 with recognizable results
  • How the DCT changed the world

30
Data compression results
  • The best algorithms compress text to 25% of original
    size, but humans can compress it to 10%
  • Humans have far better modeling algorithms
    because they have better pattern recognition and
    higher-level patterns to recognize
  • Intelligence = pattern recognition = data
    compression?
  • Going further: Data-Compression.com

31
Ethical issues on algorithms
  • Back to an issue from the start of class: Can
    algorithms be unethical?