B-Trees

B-Trees are a variation on binary search trees that allow quick searching in files on disk. Instead of storing one key and having two children, B-tree nodes have n keys and n+1 children, where n can be large. This shortens the tree (in terms of height) and requires much less disk access than a binary search tree would. The algorithms are a bit more complicated, requiring more computation than a binary search tree, but this extra complication is worth it because computation is much cheaper than disk access.

Disk Access

Secondary storage usually refers to the fixed disks found in modern computers. These devices contain several platters of magnetically sensitive material rotating rapidly. Data is stored as changes in the magnetic properties on different portions of the platters. Data is separated into tracks, concentric circles on the platters. Each track is further divided into sectors which form the unit of a transaction between the disk and the CPU. A typical sector size is 512 bytes. The data is read and written by arms that go over the platters, accessing different sectors as they are requested. The disk is spinning at a constant rate (7200 RPM is typical for 1998 mid-range systems).

The time it takes to access data on secondary storage is a function of three variables:

The time it takes for the arm to move to the track where the requested sector lies. Usually around 10 milliseconds.
The time it takes for the right sector to spin under the arm. For a 7200 RPM drive, this averages about 4.2 milliseconds (half of one 8.3-millisecond rotation).
The time it takes to read or write the data. Depending on the density of the data, this time is negligible compared to the other two.

So an arbitrary 512-byte sector can be accessed (read or written) in roughly 15 milliseconds. Subsequent reads to an adjacent area of the disk will be much faster, since the head is already in exactly the right place. Data can be arranged into "blocks" that are these adjacent multi-sector aggregates.
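The arithmetic behind the 15 millisecond figure can be written out as a small C sketch. The function names are made up for illustration, and the sketch assumes that the average rotational delay is half a rotation and that the transfer time is passed in separately (it may be zero).

```c
#include <assert.h>

/* Average rotational latency in milliseconds, assuming that on average
   the requested sector is half a rotation away from the head. */
double avg_rotational_latency_ms(double rpm) {
    assert(rpm > 0.0);
    double ms_per_rotation = 60.0 * 1000.0 / rpm;
    return ms_per_rotation / 2.0;
}

/* Estimated time for one random sector access:
   seek time + average rotational latency + transfer time. */
double random_access_ms(double seek_ms, double rpm, double transfer_ms) {
    return seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms;
}
```

With a 10 millisecond seek, a 7200 RPM drive and a negligible transfer time, random_access_ms(10.0, 7200.0, 0.0) comes to a little over 14 milliseconds, which the text rounds to roughly 15.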

Contrast this to access times to RAM. From the last lecture, a typical non-sequential RAM access took about 5 microseconds. This is 3000 times faster; we can do at least 3000 memory accesses in the time it takes to do one disk access, and probably more since the algorithm doing the memory accesses is typically following the principle of locality.

So, we had better make each disk access count as much as possible. This is what B-trees do.

For the purposes of discussion, records we might want to search through (bank records, student records, etc.) are stored on disk along with their keys (account number, social security number, etc.), and many are all stored on the same disk "block." The size of a block and the amount of data can be tuned with experimentation or analysis beyond the scope of this lecture. In practice, sometimes only "pointers" to other disk blocks are stored in internal nodes of a B-tree, with leaf nodes containing the real data; this allows storing many more keys and/or having smaller (and thus faster) blocks.

B-Tree Definition

Here is a sample B-tree:

				 _________
				|_30_|_60_|
			       _/    |    \_
			     _/      |      \_
			   _/        |        \_
			 _/          |          \_
		       _/            |            \_
		     _/              |              \_
		   _/                |                \_
	________ _/          ________|              ____\_________
       |_5_|_20_|           |_40_|_50_|            |_70_|_80_|_90_|
      /    |     \          /    |    \           /     |    |     \
     /     |      \        /     |     \         /      |    |      \
    /      |       |      |      |      |       |       |    |       \
   /       |       |      |      |      |       |       |    |        \
|1|3| |6|7|8| |12|16|  |32|39||42|48||51|55|  |61|64| |71|75||83|86| |91|95|99|

B-tree nodes have a variable number of keys and children, subject to some constraints. In many respects, they work just like binary search trees, but are considerably "fatter."

A B-tree is a tree with root T.root with the following properties:

Every node has the following fields:
- x.n, the number of keys currently in node x. For example, |40|50|.n in the above example B-tree is 2. |70|80|90|.n is 3.
- The x.n keys themselves, stored in nondecreasing order: x.key[1] <= x.key[2] <= ... <= x.key[x.n] For example, the keys in |70|80|90| are ordered.
- x.leaf, a boolean value that is True if x is a leaf and False if x is an internal node.
If x is an internal node, it contains x.n+1 pointers x.c[1], x.c[2], ... , x.c[x.n], x.c[x.n+1] to its children. For example, in the above B-tree, the root node has two keys, thus three children. Leaf nodes have no children so their c[i] fields are undefined.
The keys x.key[i] separate the ranges of keys stored in each subtree: if k[i] is any key stored in the subtree with root x.c[i], then
k[1] <= x.key[1] <= k[2] <= x.key[2] <= ... <= x.key[x.n] <= k[x.n+1].
For example, everything in the far left subtree of the root is numbered less than 30. Everything in the middle subtree is between 30 and 60, while everything in the far right subtree is greater than 60. The same property can be seen at each level for all keys in non-leaf nodes.
Every leaf has the same depth, which is the tree's height h. In the above example, h=2.
There are lower and upper bounds on the number of keys a node can contain. These bounds can be expressed in terms of a fixed integer t >= 2 called the minimum degree of the B-tree:
- Every node other than the root must have at least t-1 keys. Every internal node other than the root thus has at least t children. If the tree is nonempty, the root must have at least one key.
- Every node can contain at most 2t-1 keys. Therefore, an internal node can have at most 2t children. We say that a node is full if it contains exactly 2t-1 keys.

So for example a C type definition for B-trees with floating point keys might look something like:

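/* t, the minimum degree, must be a compile-time constant here */
typedef struct _btreenode {
	int	n;		/* number of keys currently stored */
	int	leaf;		/* nonzero if this node is a leaf */
	float	key[2*t-1];	/* keys, in nondecreasing order */
	long	c[2*t];		/* pointers to nodes in disk blocks */
} btreenode;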
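The bounds on the number of keys, together with the ordering rule, can be checked mechanically. The following is a minimal sketch; node_keys_valid is a hypothetical helper (not part of the lecture's pseudocode), it uses C's 0-based arrays, and it assumes the tree is nonempty so that the root needs at least one key.

```c
#include <assert.h>

/* Returns 1 if a node holding the n keys key[0..n-1] satisfies the
   key-count bounds for minimum degree t and its keys are in
   nondecreasing order; returns 0 otherwise.
   is_root is nonzero for the root of a nonempty tree. */
int node_keys_valid(const float *key, int n, int t, int is_root) {
    assert(t >= 2);
    int min_keys = is_root ? 1 : t - 1;   /* lower bound on keys */
    int max_keys = 2 * t - 1;             /* a node this big is full */
    if (n < min_keys || n > max_keys) return 0;
    for (int i = 1; i < n; i++)
        if (key[i - 1] > key[i]) return 0;  /* out of order */
    return 1;
}
```

For instance, the node |70|80|90| from the sample tree passes for t = 2 (where it is full) and for t = 3, but fails for t = 5, which demands at least four keys in every non-root node.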

Some Analysis

Any n-key B-tree with n >= 1 of height h and minimum degree t satisfies the following property:

h <= log_t((n+1)/2)

(Proof of this is left until Analysis of Algorithms :-) That of course gives us that the height of a B-tree is always O(log n), but that log hides an impressive performance gain over regular binary search trees (since performance of algorithms will be proportional to the height of the tree in many cases). Also, B-trees are always balanced; all leaf nodes occur on the same level. In binary search trees, we can easily create a degenerate case where some branches are very far from the root (these can be fixed up with things like AVL-trees, splay trees, etc.).
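To get a feel for the bound, here is a small C sketch that computes the largest height a B-tree can have. It relies on the standard fact behind the bound, namely that a B-tree of height h holds at least 2t^h - 1 keys, and it works in integers to avoid floating-point logarithms. The function name is invented for this example.

```c
#include <assert.h>

/* Largest possible height h of a B-tree with n keys and minimum
   degree t, i.e. the largest h with 2*t^h - 1 <= n.
   Height counts edges, so a tree that is only a root has h = 0. */
int btree_max_height(unsigned long long n, unsigned long long t) {
    assert(n >= 1 && t >= 2);
    unsigned long long limit = (n + 1) / 2;  /* need t^h <= (n+1)/2 */
    unsigned long long power = 1;            /* holds t^h */
    int h = 0;
    while (power <= limit / t) {             /* is t^(h+1) <= limit? */
        power *= t;
        h++;
    }
    return h;
}
```

For one million keys this gives 18 for t = 2, 9 for t = 4 and 3 for t = 32. A search visits at most h+1 nodes, so the payoff of a large t is easy to see.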

Consider a binary search tree arranged on a disk, with pointers being the byte offset in the file where a child occurs. A typical situation will have maybe 50 bytes of information, 4 bytes of key, and 8 bytes (two 32-bit integers) for left and right pointers. That makes 62 bytes that will comfortably fit in a 512-byte sector. In fact, we can put many such nodes in the same sector; however, when our n (= number of nodes) grows large, it is unlikely that two nodes in the same sector will be accessed one after the other, so access to each node will cost roughly one disk access. In the best possible case, a binary tree with n nodes is of height about floor(log₂n). So searching for an arbitrary node will take about log₂n disk accesses. In a file with one million nodes, for instance, the phone book for a medium-sized city, this is about 20 disk accesses. Assuming the 15 millisecond access time, a single search will take 0.3 seconds.

Contrast this with a B-tree with records that fit into one 512-byte sector. Let t=4. Then each node can have up to 8 children, 7 keys. With 50*7 bytes of information, 4*7 bytes of keys, 4*8 bytes of children pointers, and 4 bytes to store x.n, we have 414 bytes of information fitting comfortably into a 512 byte sector. With one million records, we would have to do about log₄1,000,000 ≈ 10 disk accesses, taking 0.15 seconds, which halves the search time. If we choose to keep all the information in the leaves as suggested above and only keep pointer and key information in internal nodes, we can fit up to 63 keys and 64 child pointers in a sector and let t=32. Now the number of disk accesses in our example is less than or equal to log₃₂ 1,000,000 ≈ 4. In practice, up to a few thousand keys can be supported with blocks spanning many sectors; such blocks take only a tiny bit longer to access than a single arbitrary access, so performance is still improved.
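The space accounting above is easy to redo for other record and block sizes. The sketch below uses two hypothetical helpers: node_bytes adds up the space for a full node, and largest_min_degree finds the biggest t whose full node still fits in one block. The layout it assumes (records stored next to their keys, one count field per node) is the one described in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed by a full node of minimum degree t: 2t-1 keys, each
   with its record, plus 2t child pointers and one count field. */
size_t node_bytes(size_t t, size_t record_bytes, size_t key_bytes,
                  size_t ptr_bytes, size_t count_bytes) {
    assert(t >= 2);
    size_t max_keys = 2 * t - 1;
    size_t max_children = 2 * t;
    return max_keys * (record_bytes + key_bytes)
         + max_children * ptr_bytes + count_bytes;
}

/* Largest minimum degree t whose full node fits in block_bytes,
   or 0 if not even t = 2 fits. */
size_t largest_min_degree(size_t block_bytes, size_t record_bytes,
                          size_t key_bytes, size_t ptr_bytes,
                          size_t count_bytes) {
    assert(key_bytes > 0);   /* guarantees the loop terminates */
    size_t t = 1;
    while (node_bytes(t + 1, record_bytes, key_bytes,
                      ptr_bytes, count_bytes) <= block_bytes)
        t++;
    return t >= 2 ? t : 0;
}
```

With 50-byte records, 4-byte keys, 4-byte pointers and a 4-byte count, node_bytes gives 414 for t = 4, and t = 4 is the largest degree that fits in 512 bytes. Setting the record size to zero (keys and pointers only) raises the largest t to 32.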

Of course, asymptotically, the number of accesses is "the same," but for real-world numbers, B-trees are a lot better. The key is the fact that disk access times are much slower than memory and computation time. If we were to implement B-trees using real memory and pointers, there would probably be no performance improvement whatsoever because of the algorithmic overhead; indeed, there might be a performance decrease.

Operations on B-trees

Let's look at the operations on a B-tree. We assume that the root node is always kept in memory; it makes no sense to retrieve it from the disk every time since we will always need it. (In fact, it might be wise to store a "cache" of frequently used and/or low depth nodes in memory to further reduce disk accesses...)

Searching a B-tree

Searching a B-tree is much like searching a binary search tree, only the decision whether to go "left" or "right" is replaced by the decision whether to go to child 1, child 2, ..., child x.n+1. The following procedure, B-Tree-Search, should be called with the root node as its first parameter. It returns the block where the key k was found along with the index of the key in the block, or "null" if the key was not found. Note: B-tree algorithms and some other algorithms for the remainder of the course will be presented in pseudocode, i.e., a mix of C and English with some details glossed over, so that those details don't get in the way of understanding.

B-Tree-Search (x, k) { // search starting at node x for key k
	i = 1

	// search for the correct child

	while (i <= x.n and k > x.key[i]) i++;

	// now i is the least index in the key array such that
	// k <= x.key[i], so k will be found here or
	// in the i'th child

	if (i <= x.n && k == x.key[i])
		// we found k at this node
		return (x, i) // as a pair
	
	if (x.leaf) return NULL;

	// we must read the block before we can work with it

	Disk-Read (x.c[i])
	return B-Tree-Search (x.c[i], k)
}

The time in this algorithm is dominated by the time to do disk reads. Clearly, we trace a path from root down possibly to a leaf, doing one disk read each time, so the number of disk reads for B-Tree-Search is O(h) = O(log n) where h is the height of the B-tree and n is the number of keys.

We do a linear search for the correct key. There are Θ(t) keys in a node (at least t-1 and at most 2t-1), and this search is done for each disk access, so the computation time is O(t log n). Of course, this time is very small compared to the time for disk accesses. If we have some spare time one day, in between reading Netscape and playing DOOM, we might consider using a binary search (remember, the keys are nondecreasing) and get this down to O(log t log n).
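The binary search mentioned above might look like the following C sketch. The name find_slot is made up here, and the array is 0-based as usual in C, whereas the pseudocode counts from 1. It returns the same position the while loop in B-Tree-Search stops at: the least index whose key is not smaller than k.

```c
#include <assert.h>

/* Returns the least 0-based index i such that k <= key[i],
   or n if every key in key[0..n-1] is smaller than k.
   The keys must be in nondecreasing order. */
int find_slot(const float *key, int n, float k) {
    assert(n >= 0);
    int lo = 0, hi = n;              /* answer lies in [lo, hi] */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (key[mid] < k)
            lo = mid + 1;            /* answer is right of mid */
        else
            hi = mid;                /* key[mid] >= k, mid may be it */
    }
    return lo;
}
```

If the returned index i is less than n and key[i] equals k, the key is in this node; otherwise the search continues in the child at position i (child i+1 in the 1-based numbering of the pseudocode).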

Creating an empty B-tree

To initialize a B-tree, we need simply to build an empty root node:

B-Tree-Create (T) {
	x = allocate-node ();
	x.leaf = True
	x.n = 0
	Disk-Write (x)
	T.root = x
}

This assumes there is an allocate-node function that returns a node with key, c, leaf fields, etc., and that each node has a unique "address" on the disk.

Clearly, the running time of B-Tree-Create is O(1), dominated by the time it takes to write the node to disk.
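One way Disk-Read, Disk-Write and B-Tree-Create could be realized with the standard C file functions is sketched below. Everything in it is an assumption made for illustration: the names, the 512-byte block size, T = 4, and the choice to copy the raw struct into a block and to use the block number as a node's "address". Raw struct copies are only readable on the same kind of machine that wrote them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { T = 4, BLOCK_SIZE = 512 };   /* assumed values */

typedef struct {
    int   n;               /* number of keys */
    int   leaf;            /* nonzero if leaf */
    float key[2 * T - 1];
    long  c[2 * T];        /* block numbers of the children */
} disknode;

/* Disk-Write: store node x in block number `block` of file f.
   Returns 0 on success, -1 on failure. */
int disk_write(FILE *f, long block, const disknode *x) {
    unsigned char buf[BLOCK_SIZE] = {0};
    assert(sizeof *x <= BLOCK_SIZE);
    memcpy(buf, x, sizeof *x);
    if (fseek(f, block * BLOCK_SIZE, SEEK_SET) != 0) return -1;
    return fwrite(buf, 1, BLOCK_SIZE, f) == BLOCK_SIZE ? 0 : -1;
}

/* Disk-Read: load block number `block` of file f into node x.
   Returns 0 on success, -1 on failure. */
int disk_read(FILE *f, long block, disknode *x) {
    unsigned char buf[BLOCK_SIZE];
    if (fseek(f, block * BLOCK_SIZE, SEEK_SET) != 0) return -1;
    if (fread(buf, 1, BLOCK_SIZE, f) != BLOCK_SIZE) return -1;
    memcpy(x, buf, sizeof *x);
    return 0;
}

/* B-Tree-Create: make *root an empty leaf and write it as block 0. */
int btree_create(FILE *f, disknode *root) {
    memset(root, 0, sizeof *root);
    root->leaf = 1;
    return disk_write(f, 0, root);
}
```

The file must be opened for both reading and writing in binary mode, for example with tmpfile() or fopen(name, "wb+"). Each call costs one seek plus one block transfer, which is exactly the unit of cost counted in the analysis.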

Inserting a key into a B-tree

Inserting into a B-tree is a bit more complicated than inserting into an ordinary binary search tree. We have to find a place to put the new key. We would prefer to put it in the root, since that is kept in RAM and so we don't have to do any disk accesses. If that node is not full (i.e., x.n for that node is not 2t-1), then we can just stick the new key in, shift around some pointers and keys, write the results back to disk, and we're done. Otherwise, we will have to split the root and do something with the resulting pair of nodes, maintaining the properties of the definition of a B-tree.

Here is the general algorithm for inserting a key k into a B-tree T. It calls two other procedures: B-Tree-Split-Child, which splits a node, and B-Tree-Insert-Nonfull, which handles inserting into a node that isn't full.

B-Tree-Insert (T, k) {
	r = T.root
	if (r.n == 2t - 1) { 
		// uh-oh, the root is full, we have to split it
		s = allocate-node ()
		T.root = s 	// new root node
		s.leaf = False // will have some children
		s.n = 0	// for now
		s.c[1] = r // child is the old root node
		B-Tree-Split-Child (s, 1, r) // r is split
		B-Tree-Insert-Nonfull (s, k) // s is clearly not full
	} else {
		B-Tree-Insert-Nonfull (r, k)
	}
}

Let's look at the non-full case first: this procedure is called by B-Tree-Insert to insert a key into a node that isn't full. In a B-tree with a large minimum degree, this is the common case. Before looking at the pseudocode, let's look at a more English explanation of what's going to happen:

To insert the key k into the node x, there are two cases:

x is a leaf node. Then we find where k belongs in the array of keys, shift the larger keys over to the right, and stick k in there.
x is not a leaf node. We can't just stick k in, because every key in an internal node needs a child next to it and k doesn't come with one; children are really only created when we split a node, so we don't get an unbalanced tree. We find a child of x where we can (recursively) insert k. We read that child in from disk. If that child is full, we split it and figure out which of the two halves k belongs in. Then we recursively insert k into this child (which we know is non-full, because if it were full, we would have split it).

Here's the algorithm:

B-Tree-Insert-Nonfull (x, k) {
	i = x.n

	if (x.leaf) {

		// shift everything over to the "right" up to the
		// point where the new key k should go

		while (i >= 1 and k < x.key[i]) {
			x.key[i+1] = x.key[i];
			i--;
		}

		// stick k in its right place and bump up x.n

		x.key[i+1] = k;
		x.n++;

		// the leaf has changed, so write it back to disk

		Disk-Write (x)
	} else {

		// find child where new key belongs:

		while (i >= 1 and k < x.key[i]) i--;

		// if k is in x.c[i], then k <= x.key[i] (from the definition)
		// we'll go back to the last key (least i) where we found this
		// to be true, then read in that child node

		i++;
		Disk-Read (x.c[i]);
		if (x.c[i].n == 2t - 1) {

			// uh-oh, this child node is full, we'll have to split it

			B-Tree-Split-Child (x, i, x.c[i])

			// now x.c[i] and x.c[i+1] are the new children, 
			// and x.key[i] may have been changed. 
			// we'll see if k belongs in the first or the second

			if (k > x.key[i]) i++
		}

		// call ourself recursively to do the insertion

		B-Tree-Insert-Nonfull (x.c[i], k)
	}
}

Now let's see how to split a node. When we split a node, we always do it with respect to its parent; two new nodes appear and the parent has one more child than it did before. Again, let's see some English before we have to look at the pseudocode:

We will split a node y that is the ith child of its parent x. Node x will end up having one more child we'll call z, and we'll make room for it in the x.c array right next to y.

We know y is full, so it has 2t-1 keys. We'll "cut" y in half, copying y.key[t+1] through y.key[2t-1] into the first t-1 keys of this new node z.

If the node isn't a leaf, we'll also have to copy over the child pointers y.c[t+1] through y.c[2t] (one more child than keys) into the first t children of z.

Then we have to shift the keys and children of x over one starting at index i+1 to accommodate the new node z, and then update the n counts on x, y and z, finally writing them to disk.

Here's the pseudocode:

B-Tree-Split-Child (x, i, y) {
	z = allocate-node ()

	// new node is a leaf if old node was 

	z.leaf = y.leaf

	// since y is full, the new node must have t-1 keys

	z.n = t - 1

	// copy over the "right half" of y into z

	for (j=1; j<t; j++) 
		z.key[j] = y.key[j+t]

	// copy over the child pointers if y isn't a leaf

	if (not y.leaf) {
		for (j=1; j<=t; j++)
			z.c[j] = y.c[j+t]
	}

	// having "chopped off" the right half of y, it now has t-1 keys

	y.n = t - 1

	// shift everything in x over from i+1, then stick the new child in x;
	// y will be half its former self as x.c[i] and z will 
	// be the other half as x.c[i+1]

	for (j=x.n+1; j>=i+1; j--)
		x.c[j+1] = x.c[j]
	x.c[i+1] = z

	// the keys have to be shifted over as well...

	for (j=x.n; j>=i; j--)
		x.key[j+1] = x.key[j]

	// ...to accommodate the new key we're bringing in from the middle 
	// of y (if you're wondering, since (t-1) + (t-1) = 2t-2, where 
	// the other key went, it's coming into x)
	
	x.key[i] = y.key[t]
	x.n++

	// write everything out to disk

	Disk-Write (y)
	Disk-Write (z)
	Disk-Write (x)
}

Note that this is the only time we ever create a child. Doing a split by itself doesn't increase the height of the tree, because it only adds a sibling next to an existing node at the same level. Thus, the only time the height of the tree ever increases is when we split the root, because B-Tree-Insert first puts a new root above the old one. So we satisfy the part of the definition that says "each leaf must occur at the same depth."

Example of Insertion

Let's look at an example of inserting into a B-tree. For preservation of sanity, let t = 2. So a node is full if it has 2(2)-1 = 3 keys in it, and each node can have up to 4 children. We'll insert the sequence 5 9 3 7 1 2 8 6 0 4 into the tree:

Step 1: Insert 5
                                  ___
                                 |_5_|

Step 2: Insert 9
B-Tree-Insert simply calls B-Tree-Insert-Nonfull, putting 9 to the
right of 5:
                                 _______
                                |_5_|_9_|

Step 3: Insert 3
Again, B-Tree-Insert-Nonfull is called
                               ___ _______
                              |_3_|_5_|_9_|

Step 4: Insert 7
The root is full.  We allocate a new (empty) node, make it the root, split
the former root, then pull 5 into the new root:
                                 ___
                                |_5_|
                             __ /   \__
                            |_3_|  |_9_|

Then we insert 7; it goes in with 9
                                 ___
                                |_5_|
                             __ /   \______
                            |_3_|  |_7_|_9_|

Step 5: Insert 1
It goes in with 3
                                 ___
                                |_5_|
                         ___ __ /   \______
                        |_1_|_3_|  |_7_|_9_|

Step 6: Insert 2
It goes in with 3
                                 ___
                                |_5_|
                               /     \
                       ___ __ /___    \______
                      |_1_|_2_|_3_|  |_7_|_9_|

Step 7: Insert 8
It goes in with 9
 
                                 ___
                                |_5_|
                               /     \
                       ___ __ /___    \__________
                      |_1_|_2_|_3_|  |_7_|_8_|_9_|

Step 8: Insert 6
It would go in with |7|8|9|, but that node is full.  So we split it,
bringing its middle key into the root:

                                _______
                               |_5_|_8_|
                              /    |   \
                     ___ ____/__  _|_   \__
                    |_1_|_2_|_3_||_7_| |_9_|

Then insert 6, which goes in with 7:
                                _______
                            ___|_5_|_8_|__
                           /       |      \
                  ___ ____/__    __|____   \__
                 |_1_|_2_|_3_|  |_6_|_7_|  |_9_|

Step 9: Insert 0

0 would go in with |1|2|3|, which is full, so we split it, sending the middle
key up to the root:
                             ___________
                            |_2_|_5_|_8_|
                          _/    |   |    \_
                        _/      |   |      \_
                      _/_     __|   |______  \___
                     |_1_|   |_3_| |_6_|_7_| |_9_| 

Now we can put 0 in with 1
                             ___________
                            |_2_|_5_|_8_|
                          _/    |   |    \_
                        _/      |   |      \_
                  ___ _/_     __|   |______  \___
                 |_0_|_1_|   |_3_| |_6_|_7_| |_9_| 


Step 10: Insert 4
It would be nice to just stick 4 in with 3, but the B-Tree algorithm
requires us to split the full root.  Note that, if we don't do this and
one of the leaves becomes full, there would be nowhere to put the middle
key of that split since the root would be full, thus, this split of the
root is necessary:
                                 ___
                                |_5_|
                            ___/     \___
                           |_2_|     |_8_|
                         _/    |     |    \_
                       _/      |     |      \_
                 ___ _/_     __|     |______  \___
                |_0_|_1_|   |_3_|   |_6_|_7_| |_9_| 

Now we can insert 4, assured that future insertions will work:

                                 ___
                                |_5_|
                            ___/     \___
                           |_2_|     |_8_|
                         _/    |     |    \_
                       _/      |     |      \_
                 ___ _/_    ___|___  |_______ \____
                |_0_|_1_|  |_3_|_4_| |_6_|_7_| |_9_|
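
The example can be replayed with a small in-memory version of the insertion procedures. This is only a sketch under several assumptions: nodes live in RAM and are linked by ordinary pointers, so there are no Disk-Read or Disk-Write calls; keys are ints; arrays are 0-based, so every index is one less than in the pseudocode; and T is fixed at 2 to match the example. The C names are invented for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

enum { T = 2 };   /* minimum degree, matching the example */

typedef struct node {
    int n;                    /* number of keys in use */
    int leaf;                 /* nonzero if leaf */
    int key[2 * T - 1];
    struct node *c[2 * T];
} node;

static node *new_node(int leaf) {
    node *x = calloc(1, sizeof *x);   /* zeroed, so n starts at 0 */
    assert(x != NULL);
    x->leaf = leaf;
    return x;
}

/* Split the full child x->c[i]; x itself must not be full. */
static void split_child(node *x, int i) {
    node *y = x->c[i];
    node *z = new_node(y->leaf);
    z->n = T - 1;
    for (int j = 0; j < T - 1; j++)      /* right half of the keys */
        z->key[j] = y->key[j + T];
    if (!y->leaf)
        for (int j = 0; j < T; j++)      /* right half of the children */
            z->c[j] = y->c[j + T];
    y->n = T - 1;
    for (int j = x->n; j >= i + 1; j--)  /* make room for z in x */
        x->c[j + 1] = x->c[j];
    x->c[i + 1] = z;
    for (int j = x->n - 1; j >= i; j--)
        x->key[j + 1] = x->key[j];
    x->key[i] = y->key[T - 1];           /* middle key moves up */
    x->n++;
}

static void insert_nonfull(node *x, int k) {
    int i = x->n - 1;
    if (x->leaf) {
        while (i >= 0 && k < x->key[i]) {   /* shift larger keys right */
            x->key[i + 1] = x->key[i];
            i--;
        }
        x->key[i + 1] = k;
        x->n++;
        return;
    }
    while (i >= 0 && k < x->key[i]) i--;
    i++;                                    /* child that should hold k */
    if (x->c[i]->n == 2 * T - 1) {
        split_child(x, i);
        if (k > x->key[i]) i++;
    }
    insert_nonfull(x->c[i], k);
}

/* Inserts k and returns the (possibly new) root; pass NULL to start. */
node *btree_insert(node *root, int k) {
    if (root == NULL) root = new_node(1);
    if (root->n == 2 * T - 1) {             /* full root: grow upward */
        node *s = new_node(0);
        s->c[0] = root;
        split_child(s, 0);
        root = s;
    }
    insert_nonfull(root, k);
    return root;
}

/* Returns 1 if k is stored in the tree rooted at x, else 0. */
int btree_contains(const node *x, int k) {
    while (x != NULL) {
        int i = 0;
        while (i < x->n && k > x->key[i]) i++;
        if (i < x->n && k == x->key[i]) return 1;
        if (x->leaf) return 0;
        x = x->c[i];
    }
    return 0;
}
```

Starting from a NULL root and calling btree_insert with 5 9 3 7 1 2 8 6 0 4 in that order should end with 5 alone in the root, 2 and 8 in its two children, and the leaves as drawn in the last picture. The sketch never frees its nodes, which is acceptable for a short experiment but not for real use.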