Title: Compression
1Compression
- One page of a Word document is about 2 to 4 kB
- A raster image of one page at 600 dpi is about 35 MB
- Compression Ratio, CR = (number of bits before compression) / (number of bits after compression)
- Compression techniques take advantage of:
- Sparse coverage
- Repetitive scan lines
- Large smooth gray areas
- ASCII code, always 8 bits per character
- Long words frequently used
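The page-size figure above can be checked with a little arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches) and 8 bits per pixel; the slides state neither assumption, and the function names are illustrative only.

```python
def raster_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a raster page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def compression_ratio(bits_before, bits_after):
    """Ratio of the bit count before compression to the bit count after."""
    return bits_before / bits_after


# Assumed page: 8.5 x 11 inches, 600 dpi, 8 bits per pixel (grayscale).
gray_page = raster_bytes(8.5, 11, 600, 8)   # 33,660,000 bytes, about 34 MB
# The same page stored as a bilevel image, packed 8 pixels per byte.
bilevel_page = raster_bytes(8.5, 11, 600, 1)
# Ratio between the raster page and a 4 kB text document.
ratio = compression_ratio(gray_page * 8, 4000 * 8)
```

Under these assumptions the raw scan comes out close to the 35 MB quoted above, several thousand times larger than the text document.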
2Entropy
- Entropy is a quantitative measure of the amount of information in a string
[Figure: plot of H(0) and H(1), with both axes running from 0.0 to 1.0]
For N clusters, where l_i is the length of the i-th cluster
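As a rough illustration of entropy as a measure of information, the sketch below uses the standard Shannon definition for a binary source. It is not the cluster-based expression referred to above, and the function names are illustrative.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a source that emits 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def string_entropy(bits):
    """Entropy per symbol, estimated from the 0/1 frequencies in a string."""
    return binary_entropy(bits.count("1") / len(bits))
```

A mostly white scan line (few 1s) has entropy well below 1 bit per pixel, which is the headroom that compression exploits.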
3Binary Image Compression Techniques
- Packing 8 pixels per byte
- Run Length Encoding: assume 100 dpi, 850 bits per line
- Encode only the white bits, as they form long runs
- Top part of a page could be 0(200)111110(3)111110(3)...
- Huffman Coding: use short codes for frequent messages
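The run-length notation in the example can be sketched as follows. This assumes that 0(n) stands for a run of n white (0) bits and that black (1) bits are written out literally. That is a reading of the example string rather than a documented format, and the function names are hypothetical.

```python
import re
from itertools import groupby


def encode_white_runs(bits):
    """Replace each run of 0s with '0(n)'; copy 1s through unchanged."""
    out = []
    for bit, run in groupby(bits):
        n = len(list(run))
        out.append(f"0({n})" if bit == "0" else "1" * n)
    return "".join(out)


def decode_white_runs(code):
    """Inverse of encode_white_runs: expand every '0(n)' into n zeros."""
    return re.sub(r"0\((\d+)\)", lambda m: "0" * int(m.group(1)), code)


line = "0" * 200 + "11111" + "000" + "11111" + "000"
encoded = encode_white_runs(line)   # "0(200)111110(3)111110(3)"
```

Only the white runs shrink. Short black runs are left alone because encoding them would cost more than it saves.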
[Figure: Encode -> Decode]
4Huffman Encoding
Run-length code of the example bit map:
0 (2,7) (13,2)
0 (2,7) (13,2)
0 (2,7) (13,2)
0 (2,2) (7,2) (13,2)
0 (2,2) (7,2) (13,2)
0 (2,7) (13,2)
0 (2,2) (7,2) (13,2)
0 (2,2) (7,2) (13,2)
0
0
- Bit map: 160 bits
- 50 numbers in range 0-15, using 4 bits per number: 200 bits
- 2 bits per symbol: 100 bits
- Huffman coding (HC): 1.84 x 50 = 92 bits
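The 92-bit figure can be reproduced by building a Huffman code over the 50 numbers of the example, which use only the values 0, 2, 7 and 13. The sketch below is a minimal textbook construction. The helper name and the tie-breaking counter are illustrative choices, not taken from the slides.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    freq = Counter(symbols)
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {s: 0}) for s, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


# The 50 numbers of the run-coded example, in order.
short_row = [0, 2, 7, 13, 2]
long_row = [0, 2, 2, 7, 2, 13, 2]
stream = short_row * 3 + long_row * 2 + short_row + long_row * 2 + [0, 0]

lengths = huffman_code_lengths(stream)
total_bits = sum(lengths[s] for s in stream)   # 92
average = total_bits / len(stream)             # 1.84 bits per number
```

The value 2 occurs 24 times and gets a 1-bit code, 0 occurs 10 times and gets 2 bits, and 7 and 13 (8 times each) get 3 bits, giving 24 + 20 + 24 + 24 = 92 bits.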
5Predictive Coding
- Most pixels in adjacent scan lines s1 and s2 are the same
- s2 is the predicted version
2-dimensional prediction
- Probabilities gathered from document collections
- Tradeoff between context size and table size
- A context size of 12 pixels is common, which uses a 4096-entry table
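One simple way to exploit the similarity of adjacent scan lines is to predict each pixel of s2 from the pixel above it in s1 and keep only the mismatches. The sketch below assumes that predictor (the slides do not specify one) and also shows why a 12-pixel context needs 4096 table entries. All names are illustrative.

```python
def prediction_error(s1, s2):
    """Bits where s2 differs from its prediction s1 (1 = mismatch)."""
    return [a ^ b for a, b in zip(s1, s2)]


def reconstruct(s1, error):
    """Recover s2 from the previous line and the error bits."""
    return [a ^ e for a, e in zip(s1, error)]


def table_entries(context_pixels):
    """Number of distinct contexts when each context pixel is 0 or 1."""
    return 2 ** context_pixels


s1 = [0, 0, 1, 1, 1, 0, 0, 0]
s2 = [0, 0, 1, 1, 0, 0, 0, 0]
error = prediction_error(s1, s2)   # [0, 0, 0, 0, 1, 0, 0, 0]
```

The error line is almost all zeros, so it compresses far better with run-length or Huffman coding than s2 itself would.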
6Group III Fax
- White runs and black runs alternate
- All lines begin with a white run (possibly length zero)
- There are 1728 pixels in a scan line
- Makeup codes encode a multiple of 64 pixels
- Terminating codes encode the remainder (0 to 63)
- EOL for each line
- CCITT lookup tables
- Example:
- A white run of 500 pixels is split as 500 = 7 x 64 + 52
- Makeup code for 7 x 64 is 0110 0100
- Terminating code for 52 is 0101 0101
- Complete code is 0110 0100 0101 0101
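The makeup/terminating split in the example can be sketched as below. The two dictionaries hold only the two white-run codes quoted in the example. A real coder would load the full CCITT tables, and the function names are hypothetical.

```python
# Excerpt of the white-run lookup tables: only the codes quoted above.
WHITE_MAKEUP = {448: "01100100"}        # 7 x 64
WHITE_TERMINATING = {52: "01010101"}


def split_run(run_length):
    """Split a run into (multiple of 64, remainder in 0..63)."""
    return (run_length // 64) * 64, run_length % 64


def encode_white_run(run_length):
    """Makeup code (if the run is 64 or longer) followed by the terminating code."""
    makeup, remainder = split_run(run_length)
    code = WHITE_MAKEUP[makeup] if makeup else ""
    return code + WHITE_TERMINATING[remainder]


code = encode_white_run(500)   # "0110010001010101"
```

A run that is an exact multiple of 64, such as a full blank line of 1728 pixels, still needs a terminating code for the remainder 0.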
8Group IV READ
[Figure: reference line with changing pixels b1 and b2; coding line with changing pixels a0, a1 and a2]
- a0 is the reference changing pixel; a1 is the next changing pixel after a0; and a2 is the next changing pixel after a1.
- b1 is the first changing pixel on the reference line after a0 and is of opposite color to a0; b2 is the next changing pixel after b1.
- To start, a0 is located at an imaginary white pixel point immediately to the left of the coding line.
- Follow READ algorithm chart
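The definitions of a1, a2, b1 and b2 can be turned into a small search routine. The sketch below only locates the changing pixels. It does not implement the mode selection of the READ chart, and it assumes lines are lists of 0 (white) and 1 (black), with a missing changing pixel reported as the position just past the end of the line. All names are illustrative.

```python
def color_at(line, i):
    """Pixel color; positions left of the line are imaginary white (0)."""
    return line[i] if i >= 0 else 0


def next_change(line, start):
    """First changing pixel strictly right of start, or len(line) if none."""
    for i in range(start + 1, len(line)):
        if line[i] != color_at(line, i - 1):
            return i
    return len(line)


def changing_pixels(reference, coding, a0=-1):
    """Locate a1, a2 on the coding line and b1, b2 on the reference line."""
    a0_color = color_at(coding, a0)
    a1 = next_change(coding, a0)
    a2 = next_change(coding, a1)
    b1 = next_change(reference, a0)
    # b1 must be of opposite color to a0, so skip changes to a0's own color.
    while b1 < len(reference) and reference[b1] == a0_color:
        b1 = next_change(reference, b1)
    b2 = next_change(reference, b1)
    return {"a1": a1, "a2": a2, "b1": b1, "b2": b2}


reference = [0, 0, 1, 1, 1, 0, 0, 0]
coding = [0, 0, 0, 1, 1, 1, 0, 0]
start = changing_pixels(reference, coding)   # a0 at the imaginary pixel -1
```

Here a1 sits one pixel to the right of b1, which is the kind of small offset that two-dimensional coding can express cheaply.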
10Grayscale Compression - JPEG
12Information Retrieval (Typed text documents)
- The goal of IR is to represent a collection of documents, where a single document is the smallest unit of information
- Typify document content and present information upon request
[Figure: requests and documents matched by a similarity measure]
- OCR translates images of text to computer-readable form, and IR extracts the text upon request
- Inverted Index: transpose the document-term relationship to a term-document relationship
- Remove Stopwords: the, and, to, a, in, that, through, but, etc.
- Word Stemming: remove prefixes and suffixes and normalize
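A minimal sketch of the inverted index with stopword removal might look like this. The three documents are hypothetical, the stopword set is just the short list given above, and stemming is left out to keep the example small.

```python
import re
from collections import defaultdict

# Stopword list from the bullet above; a practical list would be longer.
STOPWORDS = {"the", "and", "to", "a", "in", "that", "through", "but"}


def build_inverted_index(docs):
    """Map each non-stopword term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            if term not in STOPWORDS:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents, for illustration only.
docs = {
    1: "The scanner converts the page to an image",
    2: "Character recognition converts the image to text",
    3: "Retrieval finds text in the collection",
}
index = build_inverted_index(docs)
```

Looking up a term is now a single dictionary access, for example `index["image"]` gives `[1, 2]`, instead of a scan over every document.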
13Query 1: recognition or retrieval
Response: 1 2 3
Query 2: sequentially and readable
Response: 3
Query 3: not translate
Response: 2
Query: character and recognition or retrieval
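Boolean queries like these map directly onto set operations over an inverted index. The index below is hypothetical: the original document table is not available, so the postings were chosen only so that the three answered queries return the responses listed above. The assumption that "and" binds tighter than "or" in the last query is also mine.

```python
# Hypothetical inverted index: term -> set of document ids.
INDEX = {
    "character": {1},
    "recognition": {1, 2},
    "retrieval": {3},
    "sequentially": {3},
    "readable": {2, 3},
    "translate": {1, 3},
}
ALL_DOCS = {1, 2, 3}


def docs_with(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return INDEX.get(term, set())


def q_not(docs):
    """Complement with respect to the whole collection."""
    return ALL_DOCS - docs


query1 = docs_with("recognition") | docs_with("retrieval")     # {1, 2, 3}
query2 = docs_with("sequentially") & docs_with("readable")     # {3}
query3 = q_not(docs_with("translate"))                         # {2}
# Assumed precedence: (character and recognition) or retrieval.
query4 = (docs_with("character") & docs_with("recognition")) | docs_with("retrieval")
```

With this assumed index the last query returns documents 1 and 3. A different table or a different precedence rule would change that result.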
14Vector Space Model
- Each document is denoted by a vector of concepts (index terms)
- If the term is present in the document, 1 is placed in the vector
- Vector of document 1 from table: (1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1)
- Weighting: favor terms with high frequency in a few documents
N = total documents
Df_i = no. of docs containing term i
T_ij = frequency of term i in doc j
Document similarity measure between D_j = (w_1j, w_2j, ..., w_mj) and Q_r = (q_1r, q_2r, ..., q_mr)
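The weighting and similarity ideas can be sketched with the common tf-idf form, w_ij = T_ij * log(N / Df_i), and an inner-product or cosine measure. Both formulas are assumptions here: they are standard choices that fit the symbols defined above, but the slide's own expressions are not reproduced. Function names are illustrative.

```python
import math


def tfidf_weights(term_freqs, doc_freqs, n_docs):
    """Assumed form w_ij = T_ij * log(N / Df_i) for one document j."""
    return [tf * math.log(n_docs / df) if df else 0.0
            for tf, df in zip(term_freqs, doc_freqs)]


def inner_product(d, q):
    """Sum of products of matching document and query weights."""
    return sum(w * x for w, x in zip(d, q))


def cosine_similarity(d, q):
    """Inner product normalized by the lengths of both vectors."""
    norm = math.sqrt(inner_product(d, d)) * math.sqrt(inner_product(q, q))
    return inner_product(d, q) / norm if norm else 0.0


# Three terms in a collection of N = 3 documents (made-up counts).
weights = tfidf_weights(term_freqs=[2, 1, 0], doc_freqs=[1, 3, 2], n_docs=3)
```

A term that appears in every document (Df_i = N) gets weight zero, which matches the stated preference for terms concentrated in a few documents.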
15Relevance Feedback
- N = no. of documents in collection
- R = number of documents relevant to query q
- n = no. of documents containing t
- r = no. of relevant documents containing t
- F = proportion of relevant documents to non-relevant documents in which term occurs
- F without relevance feedback is defined using:
- k = constant, adjusted with collection size
- c = collection size
- f_i = no. of documents in which term i occurs
- t_ij = frequency of term i in document j
- maxtf_j = maximum term frequency in document j
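One widely used weight built from exactly these four counts is the Robertson-Sparck Jones relevance weight, sketched below with the usual 0.5 smoothing. Whether the slide used this exact form is an assumption. Here N is the collection size, R the number of relevant documents, n the number of documents containing the term, and r the number of relevant documents containing it.

```python
import math


def relevance_weight(N, R, n, r):
    """Robertson-Sparck Jones style term weight with 0.5 smoothing (assumed form)."""
    # Odds that a relevant document contains the term.
    relevant_odds = (r + 0.5) / (R - r + 0.5)
    # Odds that a non-relevant document contains the term.
    nonrelevant_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(relevant_odds / nonrelevant_odds)


# Made-up counts: the term appears in 8 of 10 relevant documents.
strong = relevance_weight(N=100, R=10, n=20, r=8)
# The same term frequency, but none of the relevant documents contain it.
weak = relevance_weight(N=100, R=10, n=20, r=0)
```

Terms concentrated in the relevant documents get a positive weight, and terms that occur mainly in non-relevant documents get a negative one.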
16Precision and Recall
- Coverage: extent to which the system includes relevant documents
- Time lag: average time it takes to produce an answer to a search request
- Presentation: quality of the output
- Effort: energies put forth by the user to obtain the information sought
- Recall: proportion of relevant material received from a query
- Precision: proportion of retrieved documents actually relevant
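Recall = (relevant documents retrieved) / (relevant documents in the collection)
Precision = (relevant documents retrieved) / (documents retrieved)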
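Both measures reduce to set arithmetic on document ids. The sketch below assumes the set of relevant documents is known in advance, as in a test collection. The function name and the sample ids are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three relevant in the collection, two in common.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

Here precision is 2/4 and recall is 2/3. Returning more documents tends to raise recall at the expense of precision.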