Beyond Asymptotic Analysis

Asymptotic analysis, i.e., big-O, big-theta, big-omega, and so on, can tell us a lot about the behavior of an algorithm, and allows us to compare two algorithms with different asymptotic bounds and choose the better one.

But sometimes asymptotic analysis fails us. For instance:

  1. If a rigorous analysis of an algorithm is very difficult or impossible. For example, many number-theoretic algorithms for factoring large integers are conjectured to have certain running times, but these running-time functions are largely derived empirically.
  2. If two algorithms have the same tight bounds, like heapsort and merge sort.
  3. If we know most instances of our problem will be small and thus not subject to asymptotic analysis in the same way large problems are. For example, the standard integer multiplication algorithm (the one you learned in grade school) takes O(n²) time to multiply two n-bit integers. There is another algorithm, based on the Fast Fourier Transform, that multiplies in roughly O(n log n) time. However, it is very rarely used because it is only faster when you have many thousands of bits to multiply, i.e., n₀ is very large. (A small numeric sketch of this kind of crossover appears after this list.)
  4. We often make simplifying assumptions when analyzing algorithms that may turn out not to be quite true in practice. For instance, we may assume that memory (array) access times are constant, but they really are not.
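
To make point 3 concrete, here is a small sketch that compares two made-up cost formulas: n² with a constant factor of 1 against n log₂ n with a deliberately huge constant factor of 5000. The constants are invented purely for illustration (the real constants for FFT-based multiplication are much messier), but the shape of the result is the point: the asymptotically worse n² formula wins until n gets quite large.

#include <stdio.h>
#include <math.h>

int main () {
	double	n, slow, fast;

	/* hypothetical cost models: 1*n^2 for the grade-school method,
	   5000*n*log2(n) for a "fast" method with a huge constant factor */
	for (n = 16; n <= (1 << 20); n *= 4) {
		slow = n * n;
		fast = 5000.0 * n * log2 (n);
		printf ("n = %8.0f  n^2 = %14.0f  5000 n lg n = %14.0f  (%s wins)\n",
			n, slow, fast, slow < fast ? "n^2" : "n lg n");
	}
	return 0;
}

In this toy model the crossover lands somewhere around n = 100,000; the exact value depends entirely on the constants you pick, which is precisely the point.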

Memory Access Time

When doing a rigorous analysis of, say, a sorting algorithm, we often count the number of memory accesses (i.e., array accesses) performed and call it T(n). If we also know how long a single memory access takes, we can multiply that by T(n) to estimate how long the algorithm will take. The following C program estimates memory access time on a Unix system by doing a large number of memory accesses and timing them:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MEMSIZE	(1<<24) /* 2 to the 24 power, about 16 million */

int main () {
	register int 	i, count, n;
	register char	*p;
	clock_t		c1, c2;
	double		t;

	/* allocate a bunch of bytes */
	p = (char *) malloc (MEMSIZE);

	/* how many memory accesses to do */
	n = MEMSIZE * 10;

	/* count the number of memory accesses done */
	count = 0;

	/* start indexing the array at element 0 */
	i = 0;

	/* read number of clock ticks so far */
	c1 = clock ();

	/* do n memory accesses */
	while (count < n) {
		/* write 0 to memory; this is the memory access we're timing */
		p[i] = 0;

		/* go to next index in array of bytes */
		i++;

		/* loop around if we go out of bounds */
		if (i >= MEMSIZE) i = 0;

		/* one more memory access */
		count++;
	}

	/* what time is it now? */
	c2 = clock ();

	/* how many clock ticks have passed? */
	t = c2 - c1;

	/* how many seconds is that? */
	t /= CLOCKS_PER_SEC;
	printf ("%f seconds\n", t);

	/* how many seconds to access one byte? */
	t /= n;

	/* multiply by a billion to get nanoseconds */
	printf ("%f nanoseconds per access\n", t * 1e9);
	exit (0);
}

We can (somewhat) safely ignore the counting code as not contributing significantly to the measured time; those variables are usually stored in CPU registers, which have a much faster access time than RAM.

I ran this program on a Pentium 200MHz computer with 64MB RAM. It reported memory access (write) time as 350 nanoseconds. That's not too bad. I tried it on an old SPARCstation 10 and got 913 nanoseconds. I tried it on a very old 486DLC/33 and got about 4600 nanoseconds, yikes. I used the highest possible level of optimizations for gcc on each platform.
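
To see what a number like that means for an algorithm's running time, here is a rough back-of-the-envelope calculation (the access count is a ballpark figure, not an exact count for any particular sort): suppose a comparison sort performs about 2n log₂ n array accesses. For n = 1,000,000 that is roughly 2 × 1,000,000 × 20 = 40 million accesses; at 350 nanoseconds each, the memory accesses alone would account for about 14 seconds.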

Is this a reliable way of measuring memory speed? No; almost all of the memory accesses are consecutive, and in practice memory accesses don't come this way. Let's look at a more robust program that measures access time for different strides, i.e., distances between consecutive accesses:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MEMSIZE	(1<<24)

int main () {
	register int 	i, count, n;
	register char	*p;
	int		k;
	clock_t		c1, c2;
	double		t;

	p = (char *) malloc (MEMSIZE);
	n = MEMSIZE * 10;
	for (k=1; k<100; k++) {
		count = 0;
		i = 0;
		c1 = clock ();
		while (count < n) {
			p[i] = 0;
			i+=k;
			if (i >= MEMSIZE) i = 0;
			count++;
		}
		c2 = clock ();
		t = c2 - c1;
		t /= CLOCKS_PER_SEC;
		printf ("k = %d\n", k);
		printf ("%f seconds\n", t);
		t /= n;
		printf ("%f nanoseconds per access\n", t * 10e9);
		fflush (stdout);
		fflush (stderr);
	}
	exit (0);
}

This time, instead of timing accesses at a single stride of 1, I timed memory accesses with the stride varying from 1 to 99. Here are the results for all three computers. The x-axis is the stride, and the y-axis is the average time for a memory access in nanoseconds:

[Graph: average memory access time in nanoseconds vs. stride, one curve each for the Pentium, the SPARCstation 10, and the 486]

The Pentium gets as bad as 4000 nanoseconds per access. The 486, with its tiny 1K cache, does about the same for every stride. The SPARCstation starts off pretty good but reaches a plateau at almost 10,000 nanoseconds, beaten by the little 486!

The Memory Hierarchy

On most computer systems, there is a three- or four-layer memory hierarchy. When the CPU needs to read or write memory, it looks for the memory in these layers, then does the access in the one where it finds the memory. A typical system has these layers:
  1. L1 cache, the "level-one cache." This is an area of very fast memory, often part of the CPU chip itself. This is the first layer that will report back if the memory is found there. If an address is in the L1 cache, it is accessed immediately. If not, it is found somewhere else in the memory hierarchy and brought into the L1 cache so that subsequent accesses will be very fast. An entire cache line, i.e., many consecutive bytes of memory, is brought into the L1 cache at once, so that a subsequent access close to a previous one will already be in L1. L1 cache is usually very small, maybe about 16 kilobytes (16384 bytes).
  2. L2 cache, the "level-two cache." This is another area of very fast memory, but a little bit slower than L1. It works the same way, though, serving up memory accesses to the CPU and L1 cache. It is usually larger than L1, from about 128K up to 4MB.
  3. RAM. This is what people mean when they say their machine has 64MB of RAM. This is much slower but much cheaper than L2 or L1 cache memory.
  4. Hard disk. Some systems have virtual memory, where the hard disk can act as memory. In this situation, the RAM acts as a cache for a virtual memory space that can be much larger than the physical memory. Disk is on the order of a million times slower than RAM.
Caches usually keep around the most recently used data, and the data near it. It is easiest to think of L1 and L2 together as "the cache." Memory systems are designed to take advantage of the Principle of Locality.
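
To get a feel for how much the cache layers matter, here is a small sketch that computes an average memory access time from hypothetical hit rates and latencies. Every number in it is made up for illustration; real values vary widely from machine to machine.

#include <stdio.h>

int main () {
	/* hypothetical latencies in nanoseconds (made-up values) */
	double	l1_time  = 10.0;	/* access satisfied by L1 */
	double	l2_time  = 50.0;	/* access satisfied by L2 */
	double	ram_time = 350.0;	/* access that goes all the way to RAM */

	/* hypothetical fraction of accesses satisfied by each layer */
	double	l1_frac = 0.90, l2_frac = 0.08, ram_frac = 0.02;

	/* weighted average over the three layers */
	double	avg = l1_frac * l1_time + l2_frac * l2_time + ram_frac * ram_time;

	printf ("average access time: %.1f nanoseconds\n", avg);
	return 0;
}

With 90% of accesses hitting L1, the average stays close to the L1 speed even though RAM is 35 times slower in this model; shift most of the accesses to RAM and the average balloons accordingly.
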
The Principle of Locality

Programs tend to reuse the data and instructions they have used recently, and to access data and instructions whose addresses are near those they have used recently. (See Hennessy and Patterson, Computer Architecture: A Quantitative Approach.)
Computer architects build computers with this idea in mind. If your programs follow this principle, they will run fast; if they don't, they will run slowly. The principle of locality is also often stated as the 90/10 rule: 90% of a program's execution time is spent in 10% of its code.

Here is an example. This C function adds two N by N matrices, placing the result in a third:

void add (int A[N][N], int B[N][N], int C[N][N]) {
        int     i, j;

        for (i=0; i<N; i++)
                for (j=0; j<N; j++)
                        C[i][j] = A[i][j] + B[i][j];
}
Note that the loop takes Θ(N²) time asymptotically. I ran a program using this C function to add two 256 by 256 matrices together, computing the array access time for the two reads and one write. I got an average of 367 nanoseconds per access to a single integer array element, which is pretty good considering an integer is four bytes. The program ran the loop 500 times, taking 3.61 seconds to complete. Then I switched the two loops:
void add (int A[N][N], int B[N][N], int C[N][N]) {
        int     i, j;

        for (j=0; j<N; j++)
                for (i=0; i<N; i++)
                        C[i][j] = A[i][j] + B[i][j];
}
and ran the program again. I got a 2774-nanosecond access time, and the whole thing took 27.27 seconds. The program was now about 7.5 times slower! Why?

In C, two-dimensional arrays are laid out in row-major format. That is, 2D arrays are basically 1D arrays of rows, so that A[i][j] is right next to A[i][j+1] and A[i][j-1] in memory.
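
A small check of this layout (a minimal sketch; the printed byte counts scale with sizeof(int) on your machine):

#include <stdio.h>

#define N	4

int A[N][N];

int main () {
	int	i = 2, j = 1;

	/* byte offset of A[i][j] from the start of the array */
	printf ("A[%d][%d] is %ld bytes from A[0][0]\n", i, j,
		(long) ((char *) &A[i][j] - (char *) A));

	/* row-major formula: (i*N + j) elements, each sizeof(int) bytes */
	printf ("(i*N + j) * sizeof(int) = %ld bytes\n",
		(long) ((i * N + j) * sizeof (int)));

	/* the next element in the same row is adjacent... */
	printf ("A[i][j+1] is %ld bytes away\n",
		(long) ((char *) &A[i][j+1] - (char *) &A[i][j]));

	/* ...but the same column in the next row is a whole row away */
	printf ("A[i+1][j] is %ld bytes away\n",
		(long) ((char *) &A[i+1][j] - (char *) &A[i][j]));
	return 0;
}

With 256 columns of 4-byte ints, as in the experiment above, stepping down a column means jumping 1024 bytes per access, while stepping along a row moves only 4 bytes.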

When I let the inner loop take j from 0 to N-1, the memory accesses were consecutive, following the principle of locality. Each access to a new cache line brought its neighboring elements into the L1 cache, so most of the accesses were very fast.

When I switched the two loops, memory accesses that occurred one after another touched different rows, whose elements were far apart in memory. Most accesses had to go to slower RAM to find the data, getting little benefit from the caches.

This illustrates an important point: when you are writing an algorithm, try as much as you can to stay "in cache," i.e., if you have a choice, keep memory accesses that happen one after another near each other in memory.
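
If you want to try this yourself, here is a minimal driver sketch (this is my reconstruction, not the exact program used above; N = 256 and the 500 repetitions come from the description, everything else is an assumption):

#include <stdio.h>
#include <time.h>

#define N	256
#define REPS	500

int A[N][N], B[N][N], C[N][N];

/* the row-order version from above; swap the two for loops to try the other */
void add (int A[N][N], int B[N][N], int C[N][N]) {
	int	i, j;

	for (i=0; i<N; i++)
		for (j=0; j<N; j++)
			C[i][j] = A[i][j] + B[i][j];
}

int main () {
	int	r;
	clock_t	c1, c2;
	double	t;

	c1 = clock ();
	for (r = 0; r < REPS; r++)
		add (A, B, C);
	c2 = clock ();

	t = (double) (c2 - c1) / CLOCKS_PER_SEC;
	printf ("%f seconds total\n", t);

	/* three array accesses (two reads, one write) per element per repetition */
	printf ("%f nanoseconds per access\n",
		t / ((double) REPS * N * N * 3) * 1e9);
	return 0;
}

On a modern machine both versions will run far faster than the numbers above, but the gap between the two loop orders should still show up, especially at lower optimization levels.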

There is a lot more to the memory hierarchy than we have seen here, which you will see in a computer architecture class. Different computers have different policies about what goes in the cache, how big the cache is, and so on. However, if you stick to the principle of locality, your programs should run faster no matter what the architecture is.

OF COURSE, this doesn't mean we can now ignore asymptotic notation. Quicksort still beats bubble sort even though bubble sort obeys the principle of locality much more than Quicksort.