Compression using Huffman Codes

We are used to character codes in which every character has the same number of bits, e.g., the 7-bit ASCII code. However, some characters tend to occur more frequently in English (or in any language with an alphabet) than others. If we use a variable number of bits per code, so that frequent characters use fewer bits and infrequent characters use more bits, we can decrease the space needed to store the same information. For example, consider the following sentence:
dead beef cafe deeded dad.  dad faced a faded cab.  dad acceded.  dad be bad.
There are 12 a's, 4 b's, 5 c's, 19 d's, 12 e's, 4 f's, 17 spaces, and 4 periods, for a total of 77 characters. If we use a fixed-length code like this:
000	(space)
001	a
010	b
011	c
100	d
101	e
110	f
111	.
Then the sentence, which is of length 77, consumes 77 * 3 = 231 bits. But if we use a variable-length code like this:
100	(space)
110	a
11110	b
1110	c
0	d
1010	e
11111	f
1011	.
Then we can encode the text in 12 * 3 + 4 * 5 + 5 * 4 + 19 * 1 + 12 * 4 + 4 * 5 + 17 * 3 + 4 * 4 = 230 bits, where each term is a character's frequency times the length of its code, in the order a, b, c, d, e, f, space, period. That's a savings of 1 bit. It doesn't seem like much, but it's a start. (Note that such a code must be a prefix code: no code may be a prefix of another code, so that we can always tell where one code stops and the next starts.)

If the characters have non-uniform frequency distributions, then finding such a code can lead to great savings in storage space. A process with this effect is called data compression. This can be applied to any data where the frequency distribution is known or can be computed, not just sentences in languages. Examples are computer graphics, digitized sound, binary executables, etc.

A prefix code can be represented as a binary tree, where the leaves are the characters and the codes are derived by tracing a path from root to leaf, using 0 when we go left and 1 when we go right. For example, the code above would be represented by this tree:

                                _@_
                             _/     \_
                           _/         \_
                         _/             \_
                       _/                 \_
                     _/                     \_
                   _/                         \_
                  d                            _@_
                                              /   \
                                             /     \ 
                                            /       \
                                           /         \
                                          /           \
                                         _@_          _@_
                                        /   \        /   \
                                       /     \      /     \
                                   (space)    @    a       @
                                             / \          / \
                                            /   \        /   \
                                           e    "."     c     @
                                                             / \
                                                            b   f

In this tree, the code for e is found by going right, left, right, left, i.e., 1010.

How can we find such a code? There are many codes, but we would like to find one that is optimal with respect to the number of bits needed to represent the data. Huffman's algorithm does just this.

We can label each leaf of the tree with the frequency of the letter in the text to be compressed. This quantity will be called the "value" of the leaf. The frequencies may be known beforehand from studies of the language or data, or can be computed by just counting characters the way counting sort does.

We then label each internal node recursively with the sum of the values of its children, starting at the leaves. So the tree in our example looks like this:

                                _77
                             _/     \_
                           _/         \_
                         _/             \_
                       _/                 \_
                     _/                     \_
                   _/                         \_
                  d                            _58
                 19                           /   \
                                             /     \ 
                                            /       \
                                           /         \
                                          /           \
                                         _33          _25
                                        /   \        /   \
                                       /     \      /     \
                                   (space)    16   a      13
                                     17      / \  12      / \
                                            /   \        /   \
                                           e    "."     c     8
                                          12     4      5    / \
                                                            b   f
                                                            4   4
The root node has value 77, which is just the number of characters.

The number of bits needed to encode the data is the sum, for each character, of the number of bits in its code times its frequency. Let T be the tree, C be the set of characters c that comprise the alphabet, and f(c) be the frequency of character c. Since the number of bits in a character's code is the same as the depth of its leaf in the binary tree, we can express the sum in terms of d_T(c), the depth of character c in the tree:

B(T) = \sum_{c \in C} f(c) \, d_T(c)

This is the sum we want to minimize. We call it the cost, B(T), of the tree. Now we just need an algorithm that will build a tree with minimal cost.

In the following algorithm, f is defined as above; it can be stored efficiently in an array indexed by characters. f is extended as needed to accommodate the values of internal tree nodes. C is again the set of characters, represented as leaf tree nodes. We have a priority queue Q of tree nodes from which we can quickly extract the minimum element; this can be done with a min-heap, i.e., a heap whose heap property is reversed so that the smallest element is at the root. We build the tree in a bottom-up manner, starting with the individual characters and ending up with the root of the tree as the only element of the queue:

Huffman (C)
	n = the size of C
	insert all the elements of C into Q,
		using the value of the node as the priority
	for i in 1..n-1 do
		z = a new tree node
		x = Extract-Minimum (Q)
		y = Extract-Minimum (Q)
		left node of z = x
		right node of z = y
		f[z] = f[x] + f[y]
		Insert (Q, z)
	end for
	return Extract-Minimum (Q) as the complete tree
At first, the queue contains all the leaf nodes as a "forest" of singleton binary trees. Then the two nodes with least value are grabbed out of the queue, joined by a new tree node, and put back on the queue. After n-1 iterations, there is only one node left in the queue: the root of the tree.

Let's go through the above example using Huffman's algorithm. Here are the contents of Q after each step through the for loop:

  1. Initially, all nodes are leaf nodes. We stick all 8 in Q:
    (space)  a    b    c    d    e    f    .
      17    12    4    5   19   12    4    4
    
  2. We join two of the nodes with least value; now there are 7 things in Q:
    
                 8
                / \   (space) a    b    c    d    e
               f   .    17   12    4    5   19   12
               4   4
    
  3. Then the next two with least value, Q has 6 elements:
                                       
                 8                    9 
                / \   (space) a      / \     d    e
               f   .    17   12     b   c   19   12
               4   4                4   5
    
  4. Now the two nodes with least values are the two trees we just made, and Q has 5 elements:
                          17
                      __/   \__
                   __/         \__
                  /               \
                 8                 9 
                / \               / \     d    e  (space)  a
               f   .             b   c   19   12    17    12
               4   4             4   5
    
  5. Next we join e and a, the two nodes of value 12; Q has 4 elements:
                          17
                      __/   \__
                   __/         \__
                  /               \              24
                 8                 9            /  \
                / \               / \     d    e    a   (space)
               f   .             b   c   19   12    12    17
               4   4             4   5
    
  6. We join the tree of value 17 with (space), also of value 17; three items left:
                                            34
                                    ______/   \_____
                             ______/                \_____
                            /                            (space)
                          17                               17
                      __/   \__
                   __/         \__
                  /               \                      24
                 8                 9                    /  \
                / \               / \                  e    a      d
               f   .             b   c                12    12     19
               4   4             4   5
    
  7. We join d with the tree of value 24; two big trees left:
                                            34
                                    ______/   \_____
                             ______/                \_____
                            /                          (space)
                          17                             17
                      __/   \__                             43
                   __/         \__                         /  \
                  /               \                      24    d
                 8                 9                    /  \  19
                / \               / \                  e    a      
               f   .             b   c                12    12
               4   4             4   5
    
  8. Finally, we join the whole thing up:
                                                                  77
                                               __________________/  \_______
                                              /                             \
                                            34                               43
                                    ______/   \_____                        /  \
                             ______/                \_____                24    d
                            /                          (space)           /  \  19
                          17                             17             e   a  
                      __/   \__                                        12  12
                   __/         \__ 
                  /               \
                 8                 9
                / \               / \
               f   .             b   c
               4   4             4   5
    
At each point, we chose the joining that would force less frequent characters to be deeper in the tree.

So an optimal prefix code is:

01	(space)
101	a
0010	b
0011	c
11	d
100	e
0000	f
0001	.
And B(T) = 17 * 2 + 12 * 3 + 4 * 4 + 5 * 4 + 19 * 2 + 12 * 3 + 4 * 4 + 4 * 4 = 212 bits, a savings of about 8% over the 231 bits needed by the fixed-length code.

Here is a program that uses Huffman coding to read a file named on the command line, then write it to a Huffman-encoded file that is (hopefully) smaller. It reads the file twice, once to compute the frequency of each character, then again to do the actual compression. Before writing the strings of bits, it writes the frequency table and the length of the file to the output file to allow for decompression.

/*
 * Huffman Coding
 *
 * This program reads a text file named on the command line, then
 * compresses it using Huffman coding.  The file is read twice,
 * once to determine the frequencies of the characters, and again
 * to do the actual compression.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* there are 256 possible characters */

#define NUM_CHARS	256

/* tree node, heap node */

typedef struct _treenode treenode;
struct _treenode {
	int		freq;	/* frequency; is the priority for heap */
	unsigned char	ch;	/* character, if any */
	treenode	*left,	/* left child of Huffman tree (not heap!) */
			*right;	/* right child of Huffman tree */
};

/* this is a priority queue implemented as a binary heap */
typedef struct _pq {
	int		heap_size;
	treenode	*A[NUM_CHARS];
} PQ;

/* create an empty queue */

void create_pq (PQ *p) {
	p->heap_size = 0;
}

/* this heap node's parent */

int parent (int i) {
	return (i-1) / 2;
}

/* this heap node's left kid */

int left (int i) {
	return i * 2 + 1;
}

/* this heap node's right kid */

int right (int i) {
	return i * 2 + 2;
}

/* makes the subheap with root i into a heap, assuming left(i) and
 * right(i) are heaps
 */
void heapify (PQ *p, int i) {
	int		l, r, smallest;
	treenode	*t;

	l = left (i);
	r = right (i);

	/* find the smallest of parent, left, and right */

	if (l < p->heap_size && p->A[l]->freq < p->A[i]->freq) 
		smallest = l;
	else
		smallest = i;
	if (r < p->heap_size && p->A[r]->freq < p->A[smallest]->freq)
		smallest = r;

	/* swap the parent with the smallest, if needed. */

	if (smallest != i) {
		t = p->A[i];
		p->A[i] = p->A[smallest];
		p->A[smallest] = t;
		heapify (p, smallest);
	}
}

/* insert an element into the priority queue.  r->freq is the priority */
void insert_pq (PQ *p, treenode *r) {
	int		i;

	p->heap_size++;
	i = p->heap_size - 1;

	/* we would like to place r at the end of the array,
	 * but this might violate the heap property.  we'll start
	 * at the end and work our way up
	 */
	while ((i > 0) && (p->A[parent(i)]->freq > r->freq)) {
		p->A[i] = p->A[parent(i)];
		i = parent (i);
	}
	p->A[i] = r;
}

/* remove the element at head of the queue (i.e., with minimum frequency) */
treenode *extract_min_pq (PQ *p) {
	treenode	*r;
	
	if (p->heap_size == 0) {
		fprintf (stderr, "heap underflow!\n");
		exit (1);
	}

	/* get return value out of the root */

	r = p->A[0];

	/* take the last and stick it in the root (just like heapsort) */

	p->A[0] = p->A[p->heap_size-1];

	/* one less thing in queue */

	p->heap_size--;

	/* left and right are a heap, make the root a heap */

	heapify (p, 0);
	return r;
}

/* read the file, computing the frequencies for each character
 * and placing them in v[]
 */
unsigned int get_frequencies (FILE *f, unsigned int v[]) {
	int		r;
	unsigned int	n;

	/* n will count characters */

	for (n=0;;n++) {

		/* fgetc() gets an unsigned char, converts to int */

		r = fgetc (f);

		/* no more (end of file or read error)?  get out of loop */

		if (r == EOF) break;

		/* one more of this character */

		v[r]++;
	}
	return n;
}

/* make the huffman tree from frequencies in freqs[] (Huffman's Algorithm) */

treenode *build_huffman (unsigned int freqs[]) {
	int		i, n;
	treenode	*x, *y, *z;
	PQ		p;

	/* make an empty queue */

	create_pq (&p);

	/* for each character, make a heap/tree node with its value
	 * and frequency 
	 */
	for (i=0; i<NUM_CHARS; i++) {
		x = malloc (sizeof (treenode));
		if (!x) {
			perror ("malloc");
			exit (1);
		}

		/* it's a leaf of the Huffman tree */

		x->left = NULL;
		x->right = NULL;
		x->freq = freqs[i];
		x->ch = (unsigned char) i;

		/* put this node into the heap */

		insert_pq (&p, x);
	}

	/* at this point, the heap is a "forest" of singleton trees */

	n = p.heap_size-1; /* heap_size isn't loop invariant! */

	/* if we remove two things and insert one each time,
	 * at the end of heap_size-1 iterations, there will be
	 * one tree left in the heap
	 */
	for (i=0; i<n; i++) {

		/* make a new node z from the two least frequent
		 * nodes x and y
		 */
		z = malloc (sizeof (treenode));
		if (!z) {
			perror ("malloc");
			exit (1);
		}
		x = extract_min_pq (&p);
		y = extract_min_pq (&p);
		z->left = x;
		z->right = y;

		/* z's frequency is the sum of x and y */

		z->freq = x->freq + y->freq;

		/* put this back in the queue */

		insert_pq (&p, z);
	}

	/* return the only thing left in the queue, the whole Huffman tree */

	return extract_min_pq (&p);
}

/* traverse the Huffman tree, building up the codes in codes[] */

void traverse (treenode *r, 	/* root of this (sub)tree */
		int level, 	/* current level in Huffman tree */
		char code_so_far[], /* code string up to this point in tree */
		char *codes[]) {/* array of codes */

	/* if we're at a leaf node, */

	if ((r->left == NULL) && (r->right == NULL)) {

		/* put in a null terminator */

		code_so_far[level] = 0;

		/* make a copy of the code and put it in the array */

		codes[r->ch] = strdup (code_so_far);
	} else {

		/* not at a leaf node.  go left with bit 0 */

		code_so_far[level] = '0';
		traverse (r->left, level+1, code_so_far, codes);

		/* go right with bit 1 */

		code_so_far[level] = '1';
		traverse (r->right, level+1, code_so_far, codes);
	}
}

/* global variables, a necessary evil */

int nbits, current_byte, nbytes;

/* output a single bit to an open file */

void bitout (FILE *f, char b) {

	/* shift current byte left one */

	current_byte <<= 1;

	/* put a one on the end of this byte if b is '1' */

	if (b == '1') current_byte |= 1;

	/* one more bit */

	nbits++;

	/* enough bits?  write out the byte */

	if (nbits == 8) {
		fputc (current_byte, f);
		nbytes++;
		nbits = 0;
		current_byte = 0;
	}
}

/* using the codes in codes[], encode the file in infile, writing
 * the result on outfile
 */
void encode_file (FILE *infile, FILE *outfile, char *codes[]) {
	int	ch;	/* int, not char, so EOF can be told apart from data */
	char	*s;

	/* initialize globals for bitout() */

	current_byte = 0;
	nbits = 0;
	nbytes = 0;

	/* continue until end of file */

	for (;;) {

		/* get a char; stop at end of file or on a read error */

		ch = fgetc (infile);
		if (ch == EOF) break;

		/* put the corresponding bitstring on outfile */

		for (s=codes[ch]; *s; s++) bitout (outfile, *s);
	}

	/* finish off the last byte */

	while (nbits) bitout (outfile, '0');
}

/* main program */
	
int main (int argc, char *argv[]) {
	FILE		*f, *g;
	treenode	*r;		   /* root of Huffman tree */
	unsigned int	n, 		   /* number of bytes in file */
			freqs[NUM_CHARS];  /* frequency of each char */
	char		*codes[NUM_CHARS], /* array of codes, 1 per char */
			code[NUM_CHARS],   /* a place to hold one code */
			fname[100];	   /* what to call output file */

	/* hassle user */

	if (argc != 2) {
		fprintf (stderr, "Usage: %s <filename>\n", argv[0]);
		exit (1);
	}

	/* set all frequencies to zero */

	memset (freqs, 0, sizeof (freqs));

	/* open command line argument file (binary mode, so bytes are read as-is) */

	f = fopen (argv[1], "rb");
	if (!f) {
		perror (argv[1]);
		exit (1);
	}

	/* compute frequencies from this file */

	n = get_frequencies (f, freqs);
	fclose (f);

	/* make the huffman tree */

	r = build_huffman (freqs);

	/* traverse the tree, filling codes[] with the codes */

	traverse (r, 0, code, codes);

	/* name the output file something.huf (truncated if it won't fit) */

	snprintf (fname, sizeof (fname), "%s.huf", argv[1]);
	g = fopen (fname, "wb");
	if (!g) {
		perror (fname);
		exit (1);
	}

	/* write frequencies to file so they can be reproduced */

	fwrite (freqs, sizeof (freqs[0]), NUM_CHARS, g);

	/* write number of characters to file as binary unsigned int */

	fwrite (&n, sizeof (n), 1, g);

	/* open input file again (binary mode, as before) */

	f = fopen (argv[1], "rb");
	if (!f) {
		perror (argv[1]);
		exit (1);
	}

	/* encode f to g with codes[] */

	encode_file (f, g, codes);
	fclose (f);
	fclose (g);
	/* brag: size of the encoded bits (header not counted) as a
	 * percentage of the original size
	 */
	printf ("%s is %0.2f%% of %s\n",
		fname, n ? 100.0 * nbytes / n : 0.0, argv[1]);
	exit (0);
}