CSCI 587 - Solutions to Assignment 2: Text Compression


Date: Jan 23, 1997
Date Due: Jan 30, 1997
  1. The dictionary used by the Unix system command spell is in the file /usr/dict/words. Write a small C program that will calculate the frequency of each character.
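    #include <stdio.h>
    
    /* Count how often each byte value occurs on standard input.
       Run as: a.out < /usr/dict/words */
    int main(void) {
       int c;
       int charCount[256];
       int i;
    
       for(i=0; i < 256; ++i) charCount[i] = 0;
    
       /* getchar() returns the byte as an unsigned char value, or EOF. */
       while((c = getchar()) != EOF)
           ++charCount[c];
    
       /* Print only the characters that actually occur. */
       for(i=0; i < 256; ++i)
           if(charCount[i] != 0) printf("%15d\t%c\n", charCount[i], (char)i);
    
       return 0;
    }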
    
  2. Using Huffman coding, encode the phrase "USC wins".
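    00110 0110 10101 111 001011 1001 0100 0110
    (The codes for U, S, C, space, W, I, N, S, read from the table at the end of this page. The table lists uppercase letters only, so case is ignored.)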
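
    The table lookup can also be done mechanically. The following is a minimal sketch, not part of the original solution: the function name huffman_encode and its buffer interface are made up for illustration, it assumes an ASCII character set, and it folds lowercase to uppercase because the table has no lowercase entries. The codes themselves are copied from the table at the end of this page.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Huffman codes copied from the table; index 0..25 corresponds to A..Z. */
static const char *const letter_code[26] = {
    "1011",        /* A */  "010100",      /* B */  "10101",       /* C */
    "01011",       /* D */  "000",         /* E */  "110001",      /* F */
    "110000",      /* G */  "11001",       /* H */  "1001",        /* I */
    "01010111110", /* J */  "01010110",    /* K */  "10100",       /* L */
    "00111",       /* M */  "0100",        /* N */  "1000",        /* O */
    "00100",       /* P */  "0101011110",  /* Q */  "0111",        /* R */
    "0110",        /* S */  "1101",        /* T */  "00110",       /* U */
    "0101010",     /* V */  "001011",      /* W */  "010101110",   /* X */
    "001010",      /* Y */  "01010111111"  /* Z */
};
static const char *const space_code = "111";

/* Hypothetical helper: writes the code words for text into out, separated
   by single spaces. Returns 0 on success, or -1 if a character has no
   entry in the table or out is too small. Assumes ASCII. */
int huffman_encode(const char *text, char *out, size_t out_size)
{
    size_t used = 0;

    assert(text != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    out[0] = '\0';

    for (; *text != '\0'; ++text) {
        int upper = toupper((unsigned char)*text);
        const char *code;
        size_t len;

        if (*text == ' ')
            code = space_code;
        else if (upper >= 'A' && upper <= 'Z')
            code = letter_code[upper - 'A'];
        else
            return -1;                 /* character not in the table */

        len = strlen(code);
        /* Room for an optional separator, the code, and the final '\0'. */
        if (used + (used > 0 ? 1 : 0) + len + 1 > out_size)
            return -1;
        if (used > 0)
            out[used++] = ' ';
        memcpy(out + used, code, len);
        used += len;
        out[used] = '\0';
    }
    return 0;
}
```

    Calling huffman_encode("USC wins", buf, sizeof buf) with a 64-byte buf reproduces the answer above. The eight code words total 35 bits, against 64 bits for eight 8-bit characters.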
    
  3. Compare the performance of compress, compact, and gzip calculating the compression ratio for each on /usr/dict/words. Also record the time required using the command "time."
    What I got was:
    compress    102727 bytes    1.91 seconds
    compact     111646 bytes    5.43 seconds
    gzip         79269 bytes    5.76 seconds
    Note that these are the actual compressed file sizes, not ratios.
    The compression ratio is calculated as
        CR = (OriginalSize - NewSize)/OriginalSize
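
    As a small sketch of the formula (the function name compression_ratio is made up for illustration and is not part of the original solution), the ratio can be computed from the two byte counts like this:

```c
#include <assert.h>

/* CR = (OriginalSize - NewSize) / OriginalSize.
   Hypothetical helper; both sizes are in bytes and the original size
   must be positive. A negative result means the file grew. */
double compression_ratio(long original_size, long new_size)
{
    assert(original_size > 0);
    return (double)(original_size - new_size) / (double)original_size;
}
```

    For example, a 200-byte file that compresses to 50 bytes gives CR = 0.75. The original size of /usr/dict/words is not recorded above, so it has to be measured (for instance with wc -c) before the three ratios can be filled in.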
    
  4. Find a spelling/grammar checker on a PC or Mac.
    I didn't do this one.
    
  5. Extra Credit (3 points): A digram is a two-character sequence.
Character   Frequency %   Huffman code
space          18.21      111
E              10.53      000
T               7.68      1101
A               6.22      1011
I               6.14      1001
O               6.06      1000
R               5.87      0111
S               5.81      0110
N               5.73      0100
H               3.63      11001
C               3.11      10101
L               3.07      10100
D               2.97      01011
M               2.48      00111
U               2.27      00110
P               1.89      00100
F               1.68      110001
G               1.65      110000
B               1.32      010100
W               1.13      001011
Y               1.07      001010
V               0.70      0101010
K               0.31      01010110
X               0.25      010101110
Q               0.10      0101011110
J               0.06      01010111110
Z               0.06      01010111111
Figure from Introduction to Natural Language Processing by Mary Dee Harris.