CSCE 578 - Lecture: Text Compression
ASCII - American Standard Code for Information Interchange: e.g.
a 0110 0001
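As a quick check of the example above, a one-line Python snippet (added here purely for illustration) prints the 8-bit pattern for 'a':

    # ASCII code for 'a' is 97 = 0x61 = 0110 0001
    print(format(ord('a'), '08b'))   # -> 01100001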
Text Compression: the business of reducing the amount of space required to store a file, or the amount of time taken to transmit it
aim of compression: remove redundancy
lossy vs lossless compression
How do you remove redundancy?
Models of Text Compression
Terminology: alphabet, text, compression ratio
Shannon's Source Coding Theorem
idea: represent most frequent symbols with fewest bits
A symbol with probability p should be represented with -log p bits
information content I(s) = -log2 P(s)
entropy of model: H = - Σ p_i log2 p_i
entropy measures the amount of uncertainty (disorder) in the model: the average information content per symbol
Assuming symbols are independent, H yields a lower bound on the compression that can be achieved
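The two formulas above are easy to compute directly; here is a small Python sketch (the four-symbol distribution is made up for illustration):

    import math

    def info_content(p):
        """I(s) = -log2 P(s): bits needed for a symbol of probability p."""
        return -math.log2(p)

    def entropy(probs):
        """H = -sum p_i log2 p_i: average bits per symbol under the model."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # hypothetical symbol probabilities
    model = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
    for sym, p in model.items():
        print(sym, info_content(p), "bits")
    print("entropy:", entropy(model.values()), "bits/symbol")   # 1.75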
Extreme Cases
P(s) = 1 if s is guaranteed to be the next symbol
I(s) = -log 1 = 0 bits (a certain symbol carries no information)
P(s) = 0 means symbol cannot be coded
Models of a text
Huffman[1952] coding
encoding based on encoding most frequent symbols with shortest strings
Huffman devised the code as a student at MIT; the term paper exempted him from the final exam
character based
encoding example
word based
Algorithm generating a Huffman code
build tree from the bottom up
at each step choose the two nodes with smallest probability and give them a parent, whose probability is the sum of the probabilities of the two children
repeat the process until the tree is connected, ignoring nodes that are already children
to generate the code, walk from the root to each leaf: a left branch is 0, a right branch is 1
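A minimal sketch of this bottom-up construction (Python, using a heap to pick the two smallest-probability nodes; the probabilities are made up):

    import heapq

    def huffman_code(probs):
        """Build a Huffman code from a {symbol: probability} model."""
        # heap entries: (probability, tie_breaker, tree); a tree is a symbol or (left, right)
        heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            # take the two nodes with smallest probability and give them a parent
            p1, _, left = heapq.heappop(heap)
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, count, (left, right)))
            count += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):        # internal node
                walk(tree[0], prefix + '0')    # left branch is 0
                walk(tree[1], prefix + '1')    # right branch is 1
            else:
                codes[tree] = prefix or '0'    # leaf: record the code
        walk(heap[0][2], '')
        return codes

    print(huffman_code({'a': 0.5, 'b': 0.25, 'c': 0.15, 'd': 0.10}))

The more frequent a symbol, the later its node gets merged, so it ends up nearer the root with a shorter code.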
Breakthroughs
arithmetic coding [Guazzo 1980]
Ziv-Lempel compression [1977]
Arithmetic coding
enabling technology, enabling adaptive techniques
P(s) = .99 => I(s) ≈ .015 bits, but Huffman coding requires at least one bit per symbol
"bccb" example
To encode s (starting with low = 0, high = 1):
set low_bound = cumulative probability of the symbols before s
set high_bound = low_bound + P(s)
set range = high - low
set high = low + range * high_bound
set low = low + range * low_bound
To decode, given the received value p:
find which symbol's range p falls in -> that determines the symbol
narrow low and high exactly as in encoding
repeat until the whole message is decoded (a sketch of encoding and decoding follows at the end of this section)
infinite precision problem
solution: when high and low are close enough that their leading bits agree, transmit that bit and subtract it off (renormalize)
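A toy version of the encode/decode steps above, written with ordinary floats (so it sidesteps the precision problem rather than solving it); the three-symbol model is invented for illustration:

    # idealized arithmetic coder: floats only, static made-up model
    probs = {'a': 0.2, 'b': 0.4, 'c': 0.4}

    def bounds(symbol):
        """Return (low_bound, high_bound): the symbol's slice of [0, 1)."""
        low = 0.0
        for s, p in probs.items():
            if s == symbol:
                return low, low + p
            low += p
        raise ValueError(symbol)

    def encode(message):
        low, high = 0.0, 1.0
        for s in message:
            low_bound, high_bound = bounds(s)
            rng = high - low
            high = low + rng * high_bound    # uses the old value of low
            low = low + rng * low_bound
        return (low + high) / 2              # any value in [low, high) identifies the message

    def decode(value, length):
        out = []
        low, high = 0.0, 1.0
        for _ in range(length):
            rng = high - low
            target = (value - low) / rng     # where value falls within the current interval
            for s in probs:                  # find the symbol whose range contains it
                lb, hb = bounds(s)
                if lb <= target < hb:
                    out.append(s)
                    high = low + rng * hb    # narrow the interval exactly as in encoding
                    low = low + rng * lb
                    break
        return ''.join(out)

    code = encode("bccb")
    print(code, decode(code, 4))             # recovers "bccb"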
Adaptive Models
static modeling
semiadaptive modeling (sample current text to build model)
adaptive or dynamic: start with default, adapt with each character
Fixed Context Models
Order n models - based on preceding n characters
Prob(next char is `u') = 2.4%
Given preceding character is `q' Prob(next char is 'u') = 99%
order   Description
0       No context, Huffman code
1       Model based on preceding character
4       typically used in practice
-1      every character equally likely
blended models: e.g. an order-1 model for frequent characters, order-0 for the others
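A rough sketch of how an order-1 (preceding-character) model might be estimated by counting character pairs in a sample text (the sample string is invented):

    from collections import Counter, defaultdict

    def order1_model(text):
        """Estimate P(next char | preceding char) from pair counts."""
        pair_counts = defaultdict(Counter)
        for prev, nxt in zip(text, text[1:]):
            pair_counts[prev][nxt] += 1
        model = {}
        for prev, counts in pair_counts.items():
            total = sum(counts.values())
            model[prev] = {c: n / total for c, n in counts.items()}
        return model

    sample = "queen quiz quote"
    model = order1_model(sample)
    print(model['q'])    # after 'q', all the probability mass goes to 'u' in this sample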
Dictionary Compression
replaces groups of consecutive characters with indices into a dictionary (a list of phrases)
Also called `macro' or `codebook' compression
static, short-phrase dictionary compression is not as good as finite-context models
Ziv-Lempel - an adaptive system that allows larger phrases
digram coding
Ziv-Lempel Compression
almost all practical adaptive dictionary algorithms are based on Ziv-Lempel
phrases replaced with a pointer to where they have occurred earlier in the text
decoding simple: just replace pointer with already decoded text
pairs (where, length)
Example
abbaabbbabab
abba(1,3)(3,2)(8,3)
LZ77
output triples - (howFarBack, length, nextCharacter)
Example Encoded:(0,0,a)(0,0,b)(2,1,a)(3,2,b)(5,3,b)(1,10,a)
Decoded: abaababaabbbbbbbbbbbbba (22 characters; see the decoder sketch below)
LZ77 limits size of howFarBack and length
howFarBack <= 8192 requiring 13 bits
length <= 16 symbols requiring 4 bits
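Decoding the triples is mechanical; the following short Python sketch (written for this example) replays the (howFarBack, length, nextCharacter) triples and reproduces the decoded text shown above:

    def lz77_decode(triples):
        """Replay (howFarBack, length, nextCharacter) triples into text."""
        out = []
        for back, length, ch in triples:
            start = len(out) - back
            for i in range(length):           # copy one character at a time so that
                out.append(out[start + i])    # overlapping copies (length > back) work
            out.append(ch)
        return ''.join(out)

    encoded = [(0, 0, 'a'), (0, 0, 'b'), (2, 1, 'a'),
               (3, 2, 'b'), (5, 3, 'b'), (1, 10, 'a')]
    print(lz77_decode(encoded))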
Searching back
linear search - easy but terribly inefficient
trie - a multiway tree in which each branch is labelled by the next character of the string
hash tables
linked lists
Index of pairs - for two characters 'a' and 'b', index(a,b) is the head of a list of the occurrences of a followed by b
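A rough sketch of that pair index (Python, simplified for illustration): each two-character combination maps to the positions where it occurs, so the encoder only has to search those candidate positions for the longest match.

    from collections import defaultdict

    def build_pair_index(text):
        """index[(a, b)] -> positions where character a is immediately followed by b."""
        index = defaultdict(list)
        for i in range(len(text) - 1):
            index[text[i], text[i + 1]].append(i)
        return index

    text = "abbaabbbabab"            # the example string used earlier
    index = build_pair_index(text)
    print(index['a', 'b'])           # candidate starting positions for a match beginning "ab"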
Current Implementations
compact (adaptive Huffman code)
compress (modified Lempel-Ziv)
gzip (Lempel-Ziv)
Readings
Modeling for Text Compression
by Bell, Witten, and Cleary, ACM Computing Surveys, Dec. 1989, vol. 21, pp. 557-591. There is a copy in the Reading Room.
Managing Gigabytes
by Witten, Moffat, and Bell
Manual pages on Decstations for compact, compress, gzip, spell
Assignment Due ???
Assignment2 Link