CSCI 587 - Solutions to Assignment 2: Text Compression

Date: Jan 23, 1997
Date Due: Jan 30, 1997

The dictionary used by the Unix system command spell is in the file /usr/dict/words. Write a small C program that calculates the frequency of each character in it.

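#include <stdio.h>

/* Count how often each byte value occurs on standard input. */
int main(void) {
   int c;
   int charCount[256];
   int i;

   for(i=0; i < 256; ++i) charCount[i] = 0;

   /* getchar() returns an int in 0..255, or EOF at end of input. */
   while((c = getchar())!= EOF)
       ++charCount[c];

   /* Print only the characters that actually occur. */
   for(i=0; i < 256; ++i)
       if(charCount[i] != 0) printf("%15d\t%c\n", charCount[i], (char)i);

   return 0;
}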

What major differences do you see from the frequencies presented in class?
```
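No spaces. The file has one word per line, so the space, the most
frequent character in the class table, does not appear.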
```

Why is this text not a good sample text?

It is just a list of words, not a representative sample of
English text.

Using the Huffman code in the table at the end of this handout, encode the phrase "USC wins".
```
00110 0110 10101 111 001011 1001 0100 0110
```
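The encoding above can be reproduced mechanically by looking up each character in the code table at the end of this handout. The following is a minimal sketch, not part of the original solution: the function name huffman_encode and the array names are hypothetical, letters are folded to upper case, and the codes are written separated by single spaces to match the answer above.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Codes for A..Z, copied from the table at the end of this handout. */
static const char *const LETTER_CODES[26] = {
    "1011",        /* A */
    "010100",      /* B */
    "10101",       /* C */
    "01011",       /* D */
    "000",         /* E */
    "110001",      /* F */
    "110000",      /* G */
    "11001",       /* H */
    "1001",        /* I */
    "01010111110", /* J */
    "01010110",    /* K */
    "10100",       /* L */
    "00111",       /* M */
    "0100",        /* N */
    "1000",        /* O */
    "00100",       /* P */
    "0101011110",  /* Q */
    "0111",        /* R */
    "0110",        /* S */
    "1101",        /* T */
    "00110",       /* U */
    "0101010",     /* V */
    "001011",      /* W */
    "010101110",   /* X */
    "001010",      /* Y */
    "01010111111"  /* Z */
};
static const char *const SPACE_CODE = "111";

/* Hypothetical helper: writes the code for each character of text into
   out, separated by single spaces. Returns 0 on success, or -1 if a
   character has no code in the table or out is too small. */
int huffman_encode(const char *text, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    for (; *text != '\0'; ++text) {
        int up = toupper((unsigned char)*text);
        const char *code;
        size_t len;

        if (*text == ' ')
            code = SPACE_CODE;
        else if (up >= 'A' && up <= 'Z')
            code = LETTER_CODES[up - 'A'];
        else
            return -1;               /* character not in the table */

        len = strlen(code);
        if (used + len + 2 > out_size)   /* separator + code + NUL */
            return -1;
        if (used > 0)
            out[used++] = ' ';
        memcpy(out + used, code, len);
        used += len;
        out[used] = '\0';
    }
    return 0;
}
```

Calling huffman_encode("USC wins", buf, sizeof buf) with a buffer of 128 bytes fills buf with the eight codes listed in the answer. A real encoder would pack the bits rather than emit text; the spaces here are only for readability.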
Compare the performance of compress, compact, and gzip by calculating the compression ratio for each on /usr/dict/words. Also record the time required, using the command "time".
- To do this create a symbolic link to the dictionary from your home directory with the command
  ln -s /usr/dict/words words
- Then run each of the commands on the file under time, and record the elapsed time.
  time compact words
  Use the manual (man 1 time) to interpret the output.
  u = user time, s = system time, followed by the total (elapsed) time.
- Check the size using the command "wc".
- You will need to uncompact, uncompress, and gunzip to get back to the original file.
```
What I got was:
compress    102727 bytes    1.91 seconds
compact     111646 bytes    5.43 seconds
gzip         79269 bytes    5.76 seconds
Note that these are the actual compressed sizes, not ratios.
Compression ratios are calculated by
    CR = (OriginalSize - NewSize)/OriginalSize
```
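The ratio formula above can be applied to each of the three sizes once the original size is known (for example from "wc -c words"). The following is a small sketch; the function name compression_ratio is hypothetical, and the original size is left as a parameter because it is not recorded in this handout.

```c
/* Hypothetical helper implementing
       CR = (OriginalSize - NewSize) / OriginalSize
   Sizes are in bytes. Returns 0.0 if original_size is not positive,
   so that the division is always defined. */
double compression_ratio(long original_size, long new_size)
{
    if (original_size <= 0)
        return 0.0;
    return (double)(original_size - new_size) / (double)original_size;
}
```

For instance, a 200-byte file compressed to 50 bytes gives a ratio of 0.75. With this definition a larger value means better compression, and a negative value means the output grew.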
Find a spelling/grammar checker on a PC or Mac.
- Which word processor are you using?
- What is the grammar checker's evaluation of
  An hoarse is one thee gulf curse.
  Out the window, the bird flew.
- How does the spelling checker respond to
  fastly, greenly, et cetera (and I mean the phrase)
```
I didn't do this one.
```
Extra Credit (3 points): A digram is a two-character sequence.
- Using /usr/dict/words, calculate a static model for digram compression.
- For the top ten digrams, compute a Huffman code.
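
No solution is given here for the extra credit, but the first step, counting digrams, can be sketched along the same lines as the character counter above. This is only an illustration under stated assumptions: the function name count_digrams is hypothetical, and digrams are counted within one word at a time, so no digram spans the newline between two dictionary entries.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: for every pair of adjacent characters in word,
   increment counts[first][second]. The caller owns the table and must
   zero it before the first call; calling this once per dictionary word
   accumulates the totals for the whole file. */
void count_digrams(const char *word, long counts[256][256])
{
    size_t len = strlen(word);
    size_t i;

    for (i = 0; i + 1 < len; ++i)
        ++counts[(unsigned char)word[i]][(unsigned char)word[i + 1]];
}
```

For the word "banana" this records "an" twice, "na" twice, and "ba" once. Sorting the table by count then gives the top ten digrams for which a Huffman code is to be built.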

Character	Frequency %	Huffman code
space	18.21	111
E	10.53	000
T	7.68	1101
A	6.22	1011
I	6.14	1001
O	6.06	1000
R	5.87	0111
S	5.81	0110
N	5.73	0100
H	3.63	11001
C	3.11	10101
L	3.07	10100
D	2.97	01011
M	2.48	00111
U	2.27	00110
P	1.89	00100
F	1.68	110001
G	1.65	110000
B	1.32	010100
W	1.13	001011
Y	1.07	001010
V	0.70	0101010
K	0.31	01010110
X	0.25	010101110
Q	0.10	0101011110
J	0.06	01010111110
Z	0.06	01010111111

Figure from Introduction to Natural Language Processing by Mary Dee Harris.