CSCE 578 - Lecture: Text Compression
ASCII - American Standard Code for Information Interchange: e.g.
a 0110 0001
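As a quick check of the example above, a one-line Python snippet (added here purely for illustration) prints the 8-bit pattern for 'a':

    # ASCII code for 'a' is 97 = 0x61 = 0110 0001
    print(format(ord('a'), '08b'))   # -> 01100001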
Text Compression: the business of reducing the amount of space required to store a file, or the amount of time taken to transmit it
aim of compression: remove redundancy
lossy vs lossless compression
How do you remove redundancy?
Models of Text Compression
Terminology: alphabet, text, compression ratio
Shannon's Source Coding Theorem
idea: represent most frequent symbols with fewest bits
A symbol with probability p should be represented with -log p bits
information content I(s) = -log2 P(s)
entropy of model: H = - Σ p_i log2 p_i
entropy measures the amount of uncertainty (disorder) in the model: the average information content per symbol
Assuming symbols are independent, H yields a lower bound on the compression that can be achieved
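The two formulas above are easy to compute directly; here is a small Python sketch (the four-symbol distribution is made up for illustration):

    import math

    def info_content(p):
        """I(s) = -log2 P(s): bits needed for a symbol of probability p."""
        return -math.log2(p)

    def entropy(probs):
        """H = -sum p_i log2 p_i: average bits per symbol under the model."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # hypothetical symbol probabilities
    model = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
    for sym, p in model.items():
        print(sym, info_content(p), "bits")
    print("entropy:", entropy(model.values()), "bits/symbol")   # 1.75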
Extreme Cases
P(s) = 1 if s is guaranteed to be the next symbol
I(s) = -log 1 = 0 bits (a certain symbol carries no information)
P(s) = 0 means symbol cannot be coded
Models of a text
Huffman[1952] coding
encoding based on encoding most frequent symbols with shortest strings
Huffman devised the code as a student at MIT; the term paper exempted him from the final exam
character based
encoding example
word based
Algorithm generating a Huffman code
build tree from the bottom up
at each step choose the two nodes with smallest probability and give them a parent, whose probability is the sum of the probabilities of the two children
repeat the process until the tree is connected, ignoring nodes that are already children
to generate the code, walk from the root to each leaf: a left branch is 0, a right branch is 1
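A minimal sketch of this bottom-up construction (Python, using a heap to pick the two smallest-probability nodes; the probabilities are made up):

    import heapq

    def huffman_code(probs):
        """Build a Huffman code from a {symbol: probability} model."""
        # heap entries: (probability, tie_breaker, tree); a tree is a symbol or (left, right)
        heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            # take the two nodes with smallest probability and give them a parent
            p1, _, left = heapq.heappop(heap)
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, count, (left, right)))
            count += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):        # internal node
                walk(tree[0], prefix + '0')    # left branch is 0
                walk(tree[1], prefix + '1')    # right branch is 1
            else:
                codes[tree] = prefix or '0'    # leaf: record the code
        walk(heap[0][2], '')
        return codes

    print(huffman_code({'a': 0.5, 'b': 0.25, 'c': 0.15, 'd': 0.10}))

The more frequent a symbol, the later its node gets merged, so it ends up nearer the root with a shorter code.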
Breakthroughs
arithmetic coding [Guazzo 1980]
Ziv-Lempel compression [1977]
Arithmetic coding
enabling technology, enabling adaptive techniques
P(s) = .99 => I(s) ≈ .015 bits, but Huffman coding requires at least one bit per symbol
"bccb" example
To encode s (starting with low = 0, high = 1):
set low_bound = cumulative probability of the symbols before s
set high_bound = low_bound + P(s)
set range = high - low
set high = low + range * high_bound
set low = low + range * low_bound
To decode, given the received value p:
find which symbol's range p falls in -> that determines the symbol
narrow low and high exactly as in encoding
repeat until the whole message is decoded (a sketch of encoding and decoding follows at the end of this section)
infinite precision problem
solution: when high and low are close enough that their leading bits agree, transmit that bit and subtract it off (renormalize)
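A toy version of the encode/decode steps above, written with ordinary floats (so it sidesteps the precision problem rather than solving it); the three-symbol model is invented for illustration:

    # idealized arithmetic coder: floats only, static made-up model
    probs = {'a': 0.2, 'b': 0.4, 'c': 0.4}

    def bounds(symbol):
        """Return (low_bound, high_bound): the symbol's slice of [0, 1)."""
        low = 0.0
        for s, p in probs.items():
            if s == symbol:
                return low, low + p
            low += p
        raise ValueError(symbol)

    def encode(message):
        low, high = 0.0, 1.0
        for s in message:
            low_bound, high_bound = bounds(s)
            rng = high - low
            high = low + rng * high_bound    # uses the old value of low
            low = low + rng * low_bound
        return (low + high) / 2              # any value in [low, high) identifies the message

    def decode(value, length):
        out = []
        low, high = 0.0, 1.0
        for _ in range(length):
            rng = high - low
            target = (value - low) / rng     # where value falls within the current interval
            for s in probs:                  # find the symbol whose range contains it
                lb, hb = bounds(s)
                if lb <= target < hb:
                    out.append(s)
                    high = low + rng * hb    # narrow the interval exactly as in encoding
                    low = low + rng * lb
                    break
        return ''.join(out)

    code = encode("bccb")
    print(code, decode(code, 4))             # recovers "bccb"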
Adaptive Models
static modeling
semiadaptive modeling (sample current text to build model)
adaptive or dynamic: start with default, adapt with each character
Fixed Context Models
Order n models - based on preceding n characters
Prob(next char is `u') = 2.4%
Given preceding character is `q' Prob(next char is 'u') = 99%
order   Description
0       No context, Huffman code
1       Model based on preceding character
4       typically used in practice
-1      every character equally likely
blended models: e.g. an order-1 model for frequent characters, order-0 for the others
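A rough sketch of how an order-1 (preceding-character) model might be estimated by counting character pairs in a sample text (the sample string is invented):

    from collections import Counter, defaultdict

    def order1_model(text):
        """Estimate P(next char | preceding char) from pair counts."""
        pair_counts = defaultdict(Counter)
        for prev, nxt in zip(text, text[1:]):
            pair_counts[prev][nxt] += 1
        model = {}
        for prev, counts in pair_counts.items():
            total = sum(counts.values())
            model[prev] = {c: n / total for c, n in counts.items()}
        return model

    sample = "queen quiz quote"
    model = order1_model(sample)
    print(model['q'])    # after 'q', all the probability mass goes to 'u' in this sample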
Dictionary Compression
replaces groups of consecutive characters with indices into a dictionary (a list of phrases)
Also called `macro' or `codebook' compression
static, short-phrase dictionary compression is not as good as finite-context models
Ziv-Lempel - an adaptive system that allows larger phrases
digram coding
Ziv-Lempel Compression
almost all practical adaptive dictionary algorithms are based on Ziv-Lempel
phrases replaced with a pointer to where they have occurred earlier in the text
decoding simple: just replace pointer with already decoded text
pairs (where, length)
Example
abbaabbbabab
abba(1,3)(3,2)(8,3)
LZ77
output triples - (howFarBack, length, nextCharacter)
Example Encoded:(0,0,a)(0,0,b)(2,1,a)(3,2,b)(5,3,b)(1,10,a)
Decoded: abaababaabbbbbbbbbbbbba (22 characters; see the decoder sketch below)
LZ77 limits size of howFarBack and length
howFarBack <= 8192 requiring 13 bits
length <= 16 symbols requiring 4 bits
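Decoding the triples is mechanical; the following short Python sketch (written for this example) replays the (howFarBack, length, nextCharacter) triples and reproduces the decoded text shown above:

    def lz77_decode(triples):
        """Replay (howFarBack, length, nextCharacter) triples into text."""
        out = []
        for back, length, ch in triples:
            start = len(out) - back
            for i in range(length):           # copy one character at a time so that
                out.append(out[start + i])    # overlapping copies (length > back) work
            out.append(ch)
        return ''.join(out)

    encoded = [(0, 0, 'a'), (0, 0, 'b'), (2, 1, 'a'),
               (3, 2, 'b'), (5, 3, 'b'), (1, 10, 'a')]
    print(lz77_decode(encoded))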
Searching back
linear search - easy but terribly inefficient
trie - a multiway tree in which each branch is labelled by the next character of the string
hash tables
linked lists
Index of pairs - for two characters 'a' and 'b', index(a,b) is the head of a list of the occurrences of a followed by b
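A rough sketch of that pair index (Python, simplified for illustration): each two-character combination maps to the positions where it occurs, so the encoder only has to search those candidate positions for the longest match.

    from collections import defaultdict

    def build_pair_index(text):
        """index[(a, b)] -> positions where character a is immediately followed by b."""
        index = defaultdict(list)
        for i in range(len(text) - 1):
            index[text[i], text[i + 1]].append(i)
        return index

    text = "abbaabbbabab"            # the example string used earlier
    index = build_pair_index(text)
    print(index['a', 'b'])           # candidate starting positions for a match beginning "ab"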
Current Implementations
compact (adaptive Huffman code)
compress (modified Lempel-Ziv)
gzip (Lempel-Ziv)
Readings
Modeling for Text Compression
by Bell, Witten, and Cleary, ACM Computing Surveys, Dec. 1989, vol. 21, pp. 557-591. There is a copy in the Reading Room.
Managing Gigabytes
by Witten, Moffat, and Bell
Manual pages on Decstations for compact, compress, gzip, spell
Assignment Due ???
Assignment2 Link