CSCI 587 - Assignment 2: Text Compression
Date: Jan 23, 1997
Date Due: Jan 30, 1997
- The dictionary used by the Unix system command spell is in the file
/usr/dict/words. Write a small C program that calculates the
frequency of each character in that file.
- What major differences do you see from the frequencies
presented in class?
- Why is this text not a good sample text?
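One possible starting point for the counting program is sketched below. It is only an illustration under stated assumptions: the function names (count_bytes, count_file, print_frequencies) are made up for this sketch, every byte is counted as-is (newlines included, no case folding), and the percentage is taken over all bytes read.

```c
#include <stdio.h>
#include <stddef.h>

/* Add the occurrences of each byte in buf[0..len) to counts[]. */
void count_bytes(const unsigned char *buf, size_t len,
                 unsigned long counts[256])
{
    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;
}

/* Count every byte of the named file.
 * Returns the number of bytes read, or -1 if the file cannot be opened. */
long count_file(const char *path, unsigned long counts[256])
{
    unsigned char buf[4096];
    size_t n;
    long total = 0;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        count_bytes(buf, n, counts);
        total += (long)n;
    }
    fclose(fp);
    return total;
}

/* Print each byte that occurs, with its count and percentage of total. */
void print_frequencies(const unsigned long counts[256], long total)
{
    for (int c = 0; c < 256; c++) {
        if (counts[c] == 0)
            continue;
        if (c > 32 && c < 127)
            printf("'%c' ", c);     /* printable character */
        else
            printf("0x%02x", c);    /* space, newline, other bytes */
        printf("  %8lu  %6.2f%%\n", counts[c], 100.0 * counts[c] / total);
    }
}
```

A main function would zero a 256-entry array, call count_file("/usr/dict/words", counts), and pass the result to print_frequencies. To compare against a table that lists only upper-case letters, the counts for 'a'..'z' could be folded into 'A'..'Z' before printing.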
- Using Huffman coding, encode the phrase "USC wins".
- Compare the performance of compress, compact, and gzip by calculating
the compression ratio for each on /usr/dict/words.
Also record the time required, using the command "time".
- To do this create a symbolic link to the dictionary
from your home directory with the command
ln -s /usr/dict/words words
- Then run each of the commands on the file under time, and record the elapsed time.
time compact words
Use the manual (man 1 time) to interpret the output:
u = user time, s = system time, followed by the total (elapsed) time.
- Check the size using the command "wc"
- You will need to uncompact, uncompress, and gunzip to get back to the original file.
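The ratio arithmetic can be sketched as follows. This is a minimal illustration that assumes "compression ratio" means compressed size divided by original size; other conventions exist (for example, percentage saved, also shown), so use whichever definition was given in class. The function names are invented for this sketch, and the byte counts are expected to come from wc.

```c
#include <stdio.h>

/* Compressed size as a fraction of the original (smaller is better).
 * Assumed definition; some texts report the inverse or the saving. */
double compression_ratio(long original_bytes, long compressed_bytes)
{
    return (double)compressed_bytes / (double)original_bytes;
}

/* Percentage of the original size removed by compression. */
double percent_saved(long original_bytes, long compressed_bytes)
{
    return 100.0 * (1.0 - compression_ratio(original_bytes, compressed_bytes));
}

/* Print one line of a comparison report for a named tool. */
void print_report(const char *tool, long original_bytes,
                  long compressed_bytes, double elapsed_seconds)
{
    printf("%-10s ratio %.3f  saved %5.1f%%  elapsed %.2fs\n",
           tool,
           compression_ratio(original_bytes, compressed_bytes),
           percent_saved(original_bytes, compressed_bytes),
           elapsed_seconds);
}
```

For example, a hypothetical 1000-byte file that shrinks to 250 bytes has a ratio of 0.25, which is a saving of 75%.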
- Find a spelling/grammar checker on a PC or Mac.
- Which word processor are you using?
- What is the grammar checker's evaluation of
An hoarse is one thee gulf curse.
Out the window, the bird flew.
- How does the spelling checker respond to
fastly, greenly, et cetera (meaning the literal phrase "et cetera")
- Extra Credit (3 points): A digram is a two-character sequence.
- Using /usr/dict/words calculate a static model
for digram compression.
- For the top ten digrams compute a Huffman code
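For the extra-credit item, the digram counts behind a static model might be gathered along the following lines. This is a sketch under assumptions that the assignment does not specify: overlapping pairs are counted (so "banana" contributes ba, an, na, an, na), pairs that include a newline are skipped so that digrams do not span two dictionary words, and count_digrams is an invented name.

```c
#include <stddef.h>

/* Count adjacent byte pairs in buf[0..len).
 * counts[a][b] is incremented for each occurrence of byte a followed by b.
 * Pairs containing '\n' are skipped (assumption: digrams stay within a word).
 * Returns the number of pairs counted. */
unsigned long count_digrams(const unsigned char *buf, size_t len,
                            unsigned long counts[256][256])
{
    unsigned long total = 0;

    for (size_t i = 0; i + 1 < len; i++) {
        unsigned char a = buf[i];
        unsigned char b = buf[i + 1];

        if (a == '\n' || b == '\n')
            continue;
        counts[a][b]++;
        total++;
    }
    return total;
}
```

The 256 x 256 table holds 65,536 counters, so it is better declared static or allocated on the heap than placed on the stack. If the file is read in chunks rather than all at once, the last byte of each chunk has to be carried over so that pairs straddling a chunk boundary are not lost. The ten largest entries then supply the frequencies for building the Huffman code.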
Character | Frequency % | Huffman code |
space | 18.21 | 111 |
E | 10.53 | 000 |
T | 7.68 | 1101 |
A | 6.22 | 1011 |
I | 6.14 | 1001 |
O | 6.06 | 1000 |
R | 5.87 | 0111 |
S | 5.81 | 0110 |
N | 5.73 | 0100 |
H | 3.63 | 11001 |
C | 3.11 | 10101 |
L | 3.07 | 10100 |
D | 2.97 | 01011 |
M | 2.48 | 00111 |
U | 2.27 | 00110 |
P | 1.89 | 00100 |
F | 1.68 | 110001 |
G | 1.65 | 110000 |
B | 1.32 | 010100 |
W | 1.13 | 001011 |
Y | 1.07 | 001010 |
V | 0.70 | 0101010 |
K | 0.31 | 01010110 |
X | 0.25 | 010101110 |
Q | 0.10 | 0101011110 |
J | 0.06 | 01010111110 |
Z | 0.06 | 01010111111 |
Figure from Introduction to Natural Language Processing
by Mary Dee Harris.
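Encoding with a fixed table like the one above is a lookup-and-concatenate operation, sketched below. Two assumptions are made here that the assignment does not state: the table above is the code to be used, and lower-case letters are folded to upper case, since the table lists only capitals and space. The names code_entry, lookup_code, and huffman_encode are invented for this sketch, and the output is a string of '0' and '1' characters rather than packed bits.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct code_entry {
    char symbol;
    const char *bits;
};

/* Codes copied from the table above. */
static const struct code_entry CODE_TABLE[] = {
    {' ', "111"},        {'E', "000"},         {'T', "1101"},
    {'A', "1011"},       {'I', "1001"},        {'O', "1000"},
    {'R', "0111"},       {'S', "0110"},        {'N', "0100"},
    {'H', "11001"},      {'C', "10101"},       {'L', "10100"},
    {'D', "01011"},      {'M', "00111"},       {'U', "00110"},
    {'P', "00100"},      {'F', "110001"},      {'G', "110000"},
    {'B', "010100"},     {'W', "001011"},      {'Y', "001010"},
    {'V', "0101010"},    {'K', "01010110"},    {'X', "010101110"},
    {'Q', "0101011110"}, {'J', "01010111110"}, {'Z', "01010111111"}
};

/* Return the code for c (case-folded), or NULL if c is not in the table. */
static const char *lookup_code(char c)
{
    size_t i;

    c = (char)toupper((unsigned char)c);
    for (i = 0; i < sizeof CODE_TABLE / sizeof CODE_TABLE[0]; i++)
        if (CODE_TABLE[i].symbol == c)
            return CODE_TABLE[i].bits;
    return NULL;
}

/* Write the code for text into out as a NUL-terminated string of '0'/'1'.
 * Returns the number of bits written, or -1 if a character has no code
 * or out is too small. */
int huffman_encode(const char *text, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';
    for (; *text != '\0'; text++) {
        const char *bits = lookup_code(*text);
        size_t n;

        if (bits == NULL)
            return -1;
        n = strlen(bits);
        if (used + n + 1 > out_size)
            return -1;
        memcpy(out + used, bits, n + 1);   /* copies the terminating NUL too */
        used += n;
    }
    return (int)used;
}
```

Under these assumptions, "USC wins" becomes 00110 0110 10101 111 001011 1001 0100 0110, which is 35 bits, compared with 64 bits for eight 8-bit characters.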