CSCI 587 - Lecture 4 Text Compression
Eliza revisited
ASCII - American Standard Code for Information Interchange: e.g.
a 0110 0001
Text Compression: the business of reducing the amount of space required to store a file, or the amount of time taken to transmit it
aim of compression: remove redundancy
lossy vs lossless compression
How do you remove redundancy?
Terminology: alphabet, text, compression ratio
Shannon's Source Coding Theorem
idea: represent most frequent symbols with fewest bits
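A symbol with probability p should be represented with $-\log_2 p$ bits
entropy of model: $H = -\sum_i p_i \log_2 p_i$ bits per symbol, where $p_i$ is the probability of symbol i
entropy measures the amount of order: the more ordered (predictable) the text, the lower its entropy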
Models of a text
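The code-length and entropy formulas above can be tried out on a small string. This is a minimal sketch that assumes the simplest possible model (each character's probability is its relative frequency in the text); the function names are made up for illustration.

```python
import math
from collections import Counter

def ideal_code_length(p):
    """Ideal number of bits for a symbol of probability p: -log2(p)."""
    return -math.log2(p)

def entropy_bits_per_symbol(text):
    """Entropy -sum(p_i * log2(p_i)), with p_i taken from character frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    print(ideal_code_length(0.25))          # a symbol seen 1 time in 4
    print(entropy_bits_per_symbol("abab"))  # two equally likely symbols
    print(entropy_bits_per_symbol("aaaa"))  # completely predictable text
```

A text made of one repeated character has entropy 0, while two equally likely characters need 1 bit each.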
Huffman [1952] coding
character based
encoding example
generating a Huffman code
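One common way to generate a character-based Huffman code is to merge the two least frequent subtrees repeatedly until one tree remains. The sketch below assumes character frequencies are counted from the text itself; the function names are illustrative, and ties are broken arbitrarily, so the exact bit patterns may differ from a hand-worked example even though the code lengths are optimal.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return {symbol: bitstring} for the characters of text."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A one-symbol alphabet still needs one bit per symbol.
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # least frequent subtree
        w2, _, right = heapq.heappop(heap)   # second least frequent subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, code):
    return "".join(code[ch] for ch in text)

def decode(bits, code):
    """Works because no codeword is a prefix of another."""
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

if __name__ == "__main__":
    code = huffman_code("abracadabra")
    bits = encode("abracadabra", code)
    print(code, len(bits), decode(bits, code))
```

The most frequent character gets the shortest codeword, which is the idea of representing the most frequent symbols with the fewest bits.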
Breakthroughs
arithmetic coding [Guazzo 1980]
Ziv-Lempel compression [1977]
Adaptive Models
Rissanen and Langdon [1981]: partition compression into an encoder and a modeler
static modeling
semiadaptive modeling (sample current text to build model)
adaptive or dynamic: start with default, adapt with each character
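The adaptive idea (start with a default, adapt with each character) can be sketched with a model that keeps one count per symbol. The class name, the starting count of 1 for every symbol, and the use of ideal code lengths in place of a real encoder are all assumptions made for this illustration.

```python
import math

class AdaptiveOrder0Model:
    """Hypothetical adaptive model: every symbol starts with count 1."""

    def __init__(self, alphabet):
        self.counts = {s: 1 for s in alphabet}
        self.total = len(self.counts)

    def probability(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        # Called after each character, by encoder and decoder alike,
        # so both sides always hold the same model.
        self.counts[symbol] += 1
        self.total += 1

def ideal_bits(text, alphabet):
    """Sum of -log2(p) over the text, with p taken before each update."""
    model = AdaptiveOrder0Model(alphabet)
    bits = 0.0
    for ch in text:
        bits += -math.log2(model.probability(ch))
        model.update(ch)
    return bits

if __name__ == "__main__":
    print(ideal_bits("aaaa", "ab"))
    print(ideal_bits("abab", "ab"))
```

Because the model is updated only after a character has been coded, the decoder can rebuild the same model without it ever being transmitted.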
Fixed Context Models
Order n models - based on preceding n characters
Prob(next char is `u') = 2.4%
Given that the preceding character is `q': Prob(next char is `u') = 99%
Order   Description
  0     no context, Huffman code
  1     model based on preceding character
  4     typically used in practice
 -1     every character equally likely
blended models: level 1 for frequent characters, level 0 others
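The q/u effect above can be reproduced by counting which character follows each context of length n. This is a rough sketch; the sample sentence and the function names are invented for illustration, and the probabilities it gives are for that sample only, not the English-language figures quoted above.

```python
from collections import Counter, defaultdict

def build_context_counts(text, order):
    """Map each context of `order` preceding characters to counts of the next character."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        context = text[i - order:i]
        counts[context][text[i]] += 1
    return counts

def conditional_probability(counts, context, symbol):
    """Estimate Prob(next char is symbol | context); 0.0 if the context was never seen."""
    seen = counts.get(context)
    if not seen:
        return 0.0
    return seen[symbol] / sum(seen.values())

if __name__ == "__main__":
    sample = "the quick queen quietly quit"
    order0 = build_context_counts(sample, 0)   # the only context is ""
    order1 = build_context_counts(sample, 1)
    print(conditional_probability(order0, "", "u"))
    print(conditional_probability(order1, "q", "u"))
```

With order 0 every character shares the empty context, so the estimate is just the overall frequency of `u'; with order 1 the context `q' makes `u' far more likely.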
Dictionary Compression
replaces groups of consecutive characters with indices into a dictionary (a list of phrases)
Also called `macro' or `codebook' compression
static short-phrase dictionary compression is not as good as finite context models
Ziv-Lempel - an adaptive system that allows larger phrases
digram coding
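Digram coding is the simplest case of a static dictionary: the phrases are pairs of characters. The sketch below assumes a small hand-picked list of digrams and represents the output as tagged tokens instead of packed bytes; both choices, and the function names, are for illustration only.

```python
def digram_encode(text, digrams):
    """Replace each dictionary digram with ("D", index); other characters become ("C", char)."""
    index = {d: i for i, d in enumerate(digrams)}
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in index:
            out.append(("D", index[pair]))
            i += 2
        else:
            out.append(("C", text[i]))
            i += 1
    return out

def digram_decode(tokens, digrams):
    return "".join(digrams[v] if kind == "D" else v for kind, v in tokens)

if __name__ == "__main__":
    digrams = ["th", "e ", "in"]   # assumed dictionary, fixed in advance
    tokens = digram_encode("the thin", digrams)
    print(tokens)
    print(digram_decode(tokens, digrams))
```

Here eight characters become four tokens, since every pair in the text happens to be in the dictionary.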
Ziv-Lempel Compression
almost all practical adaptive dictionary algorithms are based on Ziv-Lempel
phrases replaced with a pointer to where they have occurred earlier in the text
decoding simple: just replace pointer with already decoded text
pointer pair (m,l): l characters starting at position m
Example (positions are counted from 1)
text: abbaabbbabab
coded: abba(1,3)(3,2)(8,3)
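Decoding the example can be sketched in a few lines. The token representation (a string for literal characters, a tuple for a pointer pair) is an assumption made for this illustration; positions are counted from 1, as in the example.

```python
def lz_decode(tokens):
    """Expand literals and (m, l) pointer pairs into the original text."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            m, l = tok
            start = m - 1              # convert 1-based position to 0-based index
            for k in range(l):
                # Copy one character at a time so a pointer may run into
                # text that it is itself producing, as (8,3) does here.
                out.append(out[start + k])
        else:
            out.extend(tok)            # literal characters
    return "".join(out)

if __name__ == "__main__":
    print(lz_decode(["abba", (1, 3), (3, 2), (8, 3)]))
```

The last pointer starts at position 8 but needs position 10, which does not exist until the pointer's own first character has been written; copying character by character handles this.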
Current Implementations
compact (adaptive Huffman code)
compress (modified Lempel-Ziv)
gzip (Lempel-Ziv)
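A quick way to see a Lempel-Ziv implementation at work is Python's standard gzip module, which writes the same format as the gzip program. The sketch below takes compression ratio to mean compressed size divided by original size; that definition and the function name are assumptions, since other definitions are also in use.

```python
import gzip

def compression_ratio(data):
    """Compressed size / original size for a bytes object (assumed definition)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(gzip.compress(data)) / len(data)

if __name__ == "__main__":
    data = b"abbaabbbabab" * 1000
    print(len(data), len(gzip.compress(data)), compression_ratio(data))
    # Lossless: decompressing gives back exactly the original bytes.
    print(gzip.decompress(gzip.compress(data)) == data)
```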
Readings
Modeling for Text Compression
by Bell, Witten, and Cleary, Computing Surveys, Dec 1989, vol 21, p557-591. There is a copy in the Reading Room.
Manual pages on the DECstations for compact, compress, gzip, spell
Assignment 2 Due Jan 30
Assignment2 Link
URL = http://sourgum.cs.sc.edu/~matthews/Courses/587/Lectures/lecture4.html