CSCI 587 - Assignment 2: Text Compression

Date: Jan 23, 1997
Date Due: Jan 30, 1997

The dictionary used by the Unix system command spell is in the file /usr/dict/words. Write a small C program that calculates the frequency of each character in this file.
- What major differences do you see from the frequencies presented in class?
- Why is this text not a good sample text?
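One possible starting point for the counting step is sketched below. The function names count_chars and print_frequencies are made up for this illustration, and the sketch assumes that every byte value (including the newline that ends each word) is tallied separately and that case is not folded.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

#define NSYM (UCHAR_MAX + 1)

/* Tally how often each byte value occurs in the stream `in`.
   Returns the total number of characters read. */
unsigned long count_chars(FILE *in, unsigned long counts[NSYM])
{
    unsigned long total = 0;
    int c;

    assert(in != NULL && counts != NULL);
    for (c = 0; c < NSYM; c++)
        counts[c] = 0;
    while ((c = getc(in)) != EOF) {
        counts[c]++;
        total++;
    }
    return total;
}

/* Print every character that occurs, with its count and relative frequency.
   Non-printable characters (such as newline) are shown as hex codes. */
void print_frequencies(FILE *out, const unsigned long counts[NSYM],
                       unsigned long total)
{
    int c;

    for (c = 0; c < NSYM; c++) {
        if (counts[c] == 0)
            continue;
        if (isprint(c))
            fprintf(out, "'%c' ", c);
        else
            fprintf(out, "0x%02x", c);
        fprintf(out, " %8lu  %.4f\n", counts[c], (double)counts[c] / total);
    }
}
```

A main function would open /usr/dict/words with fopen, pass the stream to count_chars, and then call print_frequencies with stdout.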
Using Huffman coding, encode the phrase "USC wins".
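The sketch below shows one way to check a hand-built Huffman code by computing code lengths mechanically. It assumes the code is built from the character counts of the phrase itself, with upper and lower case treated as different symbols; the assignment may instead intend the frequencies presented in class, in which case the weights would be replaced. The function name huffman_lengths is invented for this example, and the simple repeated scan for the two lightest nodes stands in for the priority queue a production implementation would use.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NSYM (UCHAR_MAX + 1)

/* Build a Huffman tree for the characters of s and store each character's
   code length in bits in len[]; characters that do not occur get 0.
   Returns the total number of bits needed to encode s. */
unsigned long huffman_lengths(const char *s, int len[NSYM])
{
    unsigned long weight[2 * NSYM];  /* leaves first, then internal nodes */
    int parent[2 * NSYM];
    int alive[2 * NSYM];
    int n = NSYM, live = 0, i, p;
    unsigned long total = 0;

    assert(s != NULL && len != NULL);
    for (i = 0; i < 2 * NSYM; i++) {
        weight[i] = 0;
        parent[i] = -1;
        alive[i] = 0;
    }
    for (; *s != '\0'; s++)
        weight[(unsigned char)*s]++;
    for (i = 0; i < NSYM; i++) {
        len[i] = 0;
        if (weight[i] > 0) {
            alive[i] = 1;
            live++;
        }
    }

    /* Repeatedly merge the two lightest live nodes into a new parent. */
    while (live > 1) {
        int a = -1, b = -1;          /* a = lightest, b = second lightest */
        for (i = 0; i < n; i++) {
            if (!alive[i])
                continue;
            if (a < 0 || weight[i] < weight[a]) {
                b = a;
                a = i;
            } else if (b < 0 || weight[i] < weight[b]) {
                b = i;
            }
        }
        weight[n] = weight[a] + weight[b];
        parent[a] = parent[b] = n;
        alive[a] = alive[b] = 0;
        alive[n] = 1;
        n++;
        live--;
    }

    /* A leaf's code length is its depth in the tree. */
    for (i = 0; i < NSYM; i++) {
        if (weight[i] == 0)
            continue;
        for (p = parent[i]; p >= 0; p = parent[p])
            len[i]++;
        if (len[i] == 0)
            len[i] = 1;              /* a lone symbol still needs one bit */
        total += weight[i] * (unsigned long)len[i];
    }
    return total;
}
```

Under the stated assumption, "USC wins" contains eight distinct characters that each occur once, so every character receives a 3-bit code and the phrase takes 24 bits. The actual codewords follow from labeling the two branches at each merge with 0 and 1.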
Compare the performance of compress, compact, and gzip by calculating the compression ratio for each on /usr/dict/words. Also record the time required, using the command "time".
- To do this, create a symbolic link to the dictionary from your home directory with the command
  ln -s /usr/dict/words words
- Then run each of the commands on the file under time and record the elapsed time, for example
  time compact words
  Use the manual ( man 1 time ) to interpret the output.
  In the output, u = user time and s = system time; the value that follows is the total or elapsed time.
- Check the size of each file using the command "wc".
- You will need to uncompact, uncompress, and gunzip to get back to the original file.
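The term "compression ratio" is defined in more than one way, so the small sketch below states its assumptions explicitly: compression_ratio is taken to be compressed size divided by original size, and percent_saved is the corresponding space saving. Both function names are made up for this illustration; use whichever definition was given in class.

```c
#include <assert.h>

/* Compressed size as a fraction of the original size (smaller is better).
   This definition is an assumption; some texts use the inverse. */
double compression_ratio(long original_bytes, long compressed_bytes)
{
    assert(original_bytes > 0 && compressed_bytes >= 0);
    return (double)compressed_bytes / (double)original_bytes;
}

/* Percentage of the original size that compression removed. */
double percent_saved(long original_bytes, long compressed_bytes)
{
    return 100.0 * (1.0 - compression_ratio(original_bytes, compressed_bytes));
}
```

The byte counts passed in would come from running wc on the file before and after each compression command.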
Find a spelling/grammar checker on a PC or Mac.
- Which word processor are you using?
- What is the grammar checker's evaluation of
  An hoarse is one thee gulf curse.
  Out the window, the bird flew.
- How does the spelling checker respond to
  fastly, greenly, et cetera (and I mean the phrase)
Extra Credit (3 points): A digram is a two-character sequence.
- Using /usr/dict/words, calculate a static model for digram compression.
- For the top ten digrams, compute a Huffman code.
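A rough sketch of the digram tally is given below. It assumes that digrams do not span the newline between two dictionary entries and that case is not folded; both choices are assumptions, not part of the assignment text. The names count_digrams and top_digrams are hypothetical, and a digram is identified by the single index a * NSYM + b.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define NSYM (UCHAR_MAX + 1)

/* counts[a][b] becomes the number of times character a is immediately
   followed by character b on the same line. Returns the digram total. */
unsigned long count_digrams(FILE *in, unsigned long counts[NSYM][NSYM])
{
    unsigned long total = 0;
    int prev = EOF, c;

    assert(in != NULL && counts != NULL);
    memset(counts, 0, sizeof(unsigned long) * NSYM * NSYM);
    while ((c = getc(in)) != EOF) {
        if (c == '\n') {             /* assumption: no digrams across words */
            prev = EOF;
            continue;
        }
        if (prev != EOF) {
            counts[prev][c]++;
            total++;
        }
        prev = c;
    }
    return total;
}

/* Store up to k of the most frequent digrams in top[], most frequent first,
   each encoded as a * NSYM + b. Returns how many were stored. */
int top_digrams(unsigned long counts[NSYM][NSYM], int k, int top[])
{
    int found = 0;

    while (found < k) {
        int best = -1, i, j;
        for (i = 0; i < NSYM * NSYM; i++) {
            unsigned long n = counts[i / NSYM][i % NSYM];
            if (n == 0)
                continue;
            for (j = 0; j < found && top[j] != i; j++)
                ;
            if (j < found)
                continue;            /* already selected in an earlier pass */
            if (best < 0 || n > counts[best / NSYM][best % NSYM])
                best = i;
        }
        if (best < 0)
            break;                   /* fewer than k distinct digrams exist */
        top[found++] = best;
    }
    return found;
}
```

The counts table occupies NSYM * NSYM entries, so the caller should declare it static or allocate it on the heap rather than as a local variable. The counts of the ten digrams returned by top_digrams can then serve as the weights for a Huffman code.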

Figure from Introduction to Natural Language Processing by Mary Dee Harris.