CSCI 587 - Assignment 2: Text Compression


Date: Jan 23, 1997
Date Due: Jan 30, 1997
  1. The dictionary used by the Unix system command spell is in the file /usr/dict/words. Write a small C program that calculates the frequency of each character in that file.
  2. Using Huffman coding, encode the phrase "USC wins".
  3. Compare the performance of compress, compact, and gzip by calculating the compression ratio each achieves on /usr/dict/words. Also record the time each one requires, using the command "time".
  4. Find a spelling/grammar checker on a PC or Mac.
  5. Extra Credit (3 points): A digram is a two-character sequence.
Character   Frequency %   Huffman code
space          18.21      111
E              10.53      000
T               7.68      1101
A               6.22      1011
I               6.14      1001
O               6.06      1000
R               5.87      0111
S               5.81      0110
N               5.73      0100
H               3.63      11001
C               3.11      10101
L               3.07      10100
D               2.97      01011
M               2.48      00111
U               2.27      00110
P               1.89      00100
F               1.68      110001
G               1.65      110000
B               1.32      010100
W               1.13      001011
Y               1.07      001010
V               0.70      0101010
K               0.31      01010110
X               0.25      010101110
Q               0.10      0101011110
J               0.06      01010111110
Z               0.06      01010111111
Figure from Introduction to Natural Language Processing by Mary Dee Harris.