Augmenting B+ Trees For Order Statistics
B+ trees are used heavily to implement index structures by many of the leading RDBMS vendors, traditionally for externally stored data. As computer architectures have evolved, B+ trees are increasingly finding use as in-memory data structures for a myriad of novel purposes. Order statistic trees are generally implemented on top of self-balancing binary search trees such as AVL or Red/Black trees in order to support rank-based queries.
[ a i ]
[ a e ] [ i n r ]
[ a c ] [ e g h ] [ i l m ] [ n p ] [ r s x ]
A sample B+ Tree of order 4
Since B+ trees are themselves a generalization of binary search trees to m-ary trees, we can augment them in a similar manner. Combined with the exceptional scalability which B+ trees possess, order statistic operations serve as a powerful extension to the search tree's capabilities.
In today's post I'm going to cover augmenting B+ trees for use as order statistic trees in order to enhance the types of queries which can be efficiently implemented over them. Seeing as we are augmenting B+ trees with new functionality, the code presented builds off of the bottom-up, memory-resident B+ tree discussed in this previous post, which you should be familiar with before continuing.
Augmenting the Node structure
In order to perform order statistics on a search tree we must be able to efficiently query the size of a given subtree. With binary trees, this is done by adding a size field to each node. We will do likewise with a B+ tree, the difference being that we need a size field for every child.
[ a (5) i (8) ]
[ a (2) e (3) ] [ i (3) n (2) r (3) ]
[ a (0) c (0) ] [ e (0) g (0) h (0) ] [ i (0) l (0) m (0) ] [ n (0) p (0) ] [ r (0) s (0) x (0) ]
B+ Tree of order 4 with subtree sizes shown
To accomplish this we add a size field to the Entry structure, which also holds the child pointer and key/value pair for a given index.
template <class K, typename V, int M>
struct Entry {
    KVPair<K,V> info;
    BPNode<K,V,M>* child;
    int size;
    Entry(K key, V value) : info(key, value), child(nullptr), size(0) { }
    Entry(K key, BPNode<K,V,M>* n) : info(key, V()), child(n), size(0) { }
    Entry() : child(nullptr), size(0) { }
};

template <class K, class V, int M>
struct BPNode {
    Entry<K,V,M> page[M];
    int n;
    BPNode* next;
    BPNode* prev;
    BPNode(int m) : n(m), next(nullptr), prev(nullptr) { }
};
The size field is initialized to 0, though the only way for an entry to keep a subtree size of 0 is for it to belong to a leaf node. All other nodes (the internal nodes) must track their subtree sizes explicitly during insertions and deletions, updating them accordingly. Crucially, this can be done without affecting the logarithmic guarantees B+ trees provide.
Determining Subtree Sizes
Starting from a given node, we proceed down the tree to the leaf level, calling the size() method recursively for each of a node's children. During this traversal we accumulate the size of each child node encountered in a variable to be returned as our result. When we reach a leaf node, we return its size, adding it to the tally of its parent's subtree size.
int size(link h) {
    if (h->isLeaf())
        return h->n;
    int sz = 0;
    for (int i = 0; i < h->n; i++) {
        sz += size(h->page[i].child);
    }
    return sz;
}

void updateSubsize(link& curr, int idx) {
    curr->page[idx].size = size(curr->page[idx].child);
}
During insertions and deletions we must update the size fields along the path to the newly created node or to the node being deleted. In trees without parent pointers, this should be done recursively and in a fashion where it does not affect the logarithmic complexity of B+ tree operations.
link insert(link curr, K key, V value) {
    int j = curr->find(key);
    if (curr->isLeaf()) {
        /*
            Logic to handle inserting to a leaf node
        */
    } else {
        j = curr->floor(key);
        link result = insert(curr->page[j++].child, key, value);
        updateSubsize(curr, j-1); // update subtree sizes along the insertion path
        if (result == nullptr) return nullptr;
        Entry<K,V,M> nent = entry(result->page[0].info.key(), result);
        curr->insertAt(nent, j);
        updateSubsize(curr, j); // update subtree sizes for the newly created split node
    }
    /*
        Logic to split on overflow
    */
}
With our subtree sizes properly calculated we can move on to the interesting operations which actually make use of the field.
Selecting Entries By Rank
Selecting an entry by its rank is analogous to accessing an array entry by index in a sorted collection, where retrieving `array[1]` returns the 2nd smallest value of the sequence. The rank of a key is determined by the number of keys which lie to the left of it. Selection is done by working our way down the tree, updating the "current rank" until we hit a leaf, at which point we return the entry pointed to by the current value of the rank parameter.
BPIterator<K,V,M> select(link h, int rank) {
    if (h->isLeaf()) {
        return BPIterator<K,V,M>(h, rank);
    }
    int i = 0, curr_rank = rank;
    while (i < h->n) {
        if (curr_rank < h->page[i].size)
            break;
        curr_rank = curr_rank - h->page[i++].size;
    }
    return select(h->page[i].child, curr_rank);
}

BPIterator<K,V,M> select(int rank) {
    return select(root, rank);
}
As can be seen, the first step is handling the base case: checking if the current node is a leaf. Handling leaf nodes entails simply returning the entry pointed to by page[rank]. The case for internal nodes is a bit more involved. Starting from the root node, we first move as far right as we can in each node by comparing the "current rank" with the size of the subtree located at the current position. Whenever we move one position right, we subtract the size of the subtree rooted at that position from the current rank. Once we have gone as far right in the page as we can, we drop down to the next level of the tree and repeat the process until arriving at a leaf node.
Determining the rank of a key
The natural choice for the next operation to implement is very much the inverse of the selection process we just covered: obtaining the rank of an entry from its key. To calculate the rank of a key, we perform a standard top-down search for the key in question, tallying the number of entries in each subtree to the left of the path being traversed.
int rank(link h, K key) {
    if (h->isLeaf()) {
        return h->find(key);
    }
    int j = h->floor(key), rc = 0, i = 0;
    while (i < j)
        rc += h->page[i++].size;
    return rank(h->page[j].child, key) + rc;
}

int rank(K key) {
    return rank(root, key);
}
Where selection started with the rank and subtracted subtree sizes as it went, rank() begins at 0 and accumulates the subtree size counts as it goes. Aside from this, the two procedures are very similar.
Using selection and rank for range queries
An immediate use for the rank() procedure we just implemented is determining the number of keys which lie within a range, which can now be done in O(log N) time:
int rangeCount(K lo, K hi) {
    return rank(hi) - rank(lo); // counts keys in the half-open range [lo, hi)
}
Gathering the actual keys in the range can be done in O(log N + k) time, where k is the number of keys in the range and N is the number of keys in the tree.
queue<K> keysInRange(K lo, K hi) {
    queue<K> keys;
    auto itr = select(rank(lo));
    auto endIt = select(rank(hi));
    while (itr != endIt) {
        keys.push((*itr).key());
        itr++;
    }
    return keys;
}
And that's just a small taste of the types of operations which can now be simply and efficiently implemented over our augmented B+ tree. Anyway, that's what I've got for today. Until next time, Happy Hacking!
Further Reading
[1] Code for today's post is available on GitHub: https://github.com/maxgoren/BPlusTree