Implementing m-Ary search trees from the bottom up.

Searching is a fundamental operation in computers, with a great deal of research put towards developing both efficient data structures and their corresponding algorithms. Search trees are a corner stone of that work, and to highlight the ever changing landscape of algorithmic complexity and what is considered optimal, we will be taking look at a recent shift in some commonly held beliefs related to them.

Conventional wisdom holds that for implementing in memory ordered collections, that some form of balanced binary search tree is best. Frequently, it is an AVL tree, or as more often the case a Red Black tree has been the data structure of choice. With this view point, B-Tree’s and their many variants were relegated to the world of external storage, file systems, and data base indexing despite also being balanced search trees themselves..

But with changes in the layout of modern computer architectures, the old wisdom that B-Trees make for lousy in memory containers is falling flat. Modern computers have multiple levels of cache and B-Tree’s happen to be a very cache friendly data structure. In terms of overall memory usage, B Trees have a lower overall memory utilization per key when compared to self balancing binary search trees.

Despite their popularity, concrete implementations of B-Tree’s are few and far between for those who want to learn them, and even less so when one desires to use it as an in-memory data structure. Thus, to better understand how B-Tree’s work, I will be implementing one from scratch. In this post I will cover implementing their unbalanced counterpart, the m-Ary search tree as a jumping off point from which to further develop a B tree.

In Memory m-Ary Search Trees

Multiway search tree’s take a few disparate concepts and marry them together into one data structure. They take the cache friendly locality of an array, the ordered property of a binary search tree, and the space efficiency of an unrolled linked list, taking only the best parts of each, and leaving the rest.

The node structure of these tree’s help to clarify what is mean by the previous statement:

 class BNode {
    public:
       T keys[M-1];
T2 values[M-1];
       int n;
BNode* child[M];   
  public:
       BNode() {
            n = 0;
      }
      void add(T info, T2 value) {
              int j = n;
              while (j > 0 && keys[j - 1] > info) {
                 keys[j] = keys[j - 1];
values[j] = values[j - 1];
                 j--;
             }
             keys[j] = info;
values[j] = value;
             n++;
      }
      bool isFull() {
            return n == M-1;
      }
void traverse() {
            int i = 0;
            for (i = 0; i < n; i++) {
                  if (child[i] != nullptr)
                       child[i]->traverse();
                 cout<<keys[i]<<" ";
            }
            if (child[i] != nullptr)
                child[i]->traverse();
      }
};

A few things are immediately obvious from this code snippet: The structure is complicated enough that it makes sense for the nodes to have internal helper methods instead of abstracting the code out into the main tree code. Each node of the tree can hold one less key/value pair than the number of child nodes it can hold. This allows each key/value pair to have a link pointing to nodes having keys less than and a link pointing to nodes with keys having values greater than the current key being examined.
This is what maintains the search tree property that is also present in binary search trees.

To aid in maintaining the search tree property, the add() method is comprised of the inner loop of an an insertion sort, so key/value pairs are maintained in the array in sorted order.

The traverse() method performs an in order traversal of the current node, calling the child nodes traverse() method recursively.

And now for the actual tree:

 
template <class T, class T2, int M = 4>
class Tree {
    private:
        class BNode {
/* code shown above */
};
BNode* root;
        int n;
        T2 nullItem;
        BNode* addR(BNode* h, T info, T2 value) {
            if (h == nullptr) {
                h = new BNode();
                h->add(info, value);
                return h;
            }
            if (h->isFull()) {
                int j = 0;
                while (j < h->n && h->keys[j] < info)
                    j++;
                h->child[j] = addR(h->child[j], info, value);
            } else {
                h->add(info, value);
            }
            return h;
        }
T2& search(BNode* h, T key) {
            if (h == nullptr)
                return nullItem;
            int
            i = 0;
            while (i < h->n && h->keys[i] < key)
                i++;
            if (key == h->keys[i])
                return h->values[i];
            return search(h->child[i], key);
        }
    public:
        Tree(T2 _nullItem) {
            n = 0;
            root = nullptr;
            nullItem = _nullItem;
        }
        void insert(T info, T2 value) {
            root = addR(root, info, value);
        }
        T2& search(T info) {
            return search(root, info);
        }
        void sort() {
            if (root != nullptr)
                root->traverse();
            cout<<endl;
        }
};

Despite the name of the Node class the code shown here is not a B-Tree. I am using this code as the foundation for building a B-Tree, what’s shown is an unbalanced multiway search tree.

The search method is a recursively searches each node, performing a linear search of the keys to find where the key should be, or the child node to explore next. Some implementations use binary search to find the insertion point. They keys array would have to be fairly large to warrant a binary search.

the addR() method first checks if it has encountered a null node: this means the node isn’t in the tree, and so creates one at that position and returns: just like a binary search tree. Next, It checks if the node is full or if it has space for insertion, If full, it find the proper child node and recurs to the next node, otherwise it calls the nodes internal add method and adds the new key value pair to the node.

When the data being added to the tree is random, this data structure performs all operations in O(logN) on average. Using the following code for level order traversal, we can get an idea of the tree’s structure.

void levelorder() {
    queue<BNode*> fq;
    fq.push(root);
    while (!fq.empty()) {
         int nc = fq.size();
         while (nc > 0) {
             BNode* curr = fq.front();
             fq.pop(); nc--;
             if (curr != nullptr) {
                  cout<<"[ ";
                  int i;
                  for (i = 0; i < curr->n; i++) {
                      cout<<curr->keys[i]<<" ";
                      fq.push(curr->child[i]);
                  }
                  fq.push(curr->child[i]);
                  cout<<"] ";
            }
        }
        cout<<endl;
    }
}


int main(int argc, char* argv[]) {
    BTree<char, char, 5> bt('#');
    string sed = "ASEARCHINGEXAMPLE";
    for (char c : sed) bt.insert(c, c);
    bt.sort();
    bt.levelorder();
    for (char c : sed) cout<<bt.search(c)<<" ";
    cout<<endl;
return 0;
}
max@MaxGorenLaptop:~/DS/Algorithms$ ./a.out
A A A C E E E G H I L M N P R S X
[ A A E S ]
[ A ] [ C E E ] [ H I N R ] [ X ]
[ G ] [ L M ] [ P ]
A S E A R C H I N G E X A M P L E
max@MaxGorenLaptop:~/DS/Algorithms$

Unfortunately, it still has the same O(n) worst case behavior that all unbalanced search trees posses, though it would still offer the same type of benefits that an unrolled linked list would: better locality for values in the current node, and less pointers to traverse overall.

Thankfully we don’t have to accept that sub optimal performance. As we did with binary search trees, we can use certain techniques to force m-way search trees to maintain balance. Though it would be an interesting exercise to try and adapt rotation based red-black tree algorithms to this data structure, that we be using an abstraction of an abstraction on the structure being abstracted: see where this is going?

No, we’re not going to do that. The gold standard for balanced multiway search trees are the family of data structures known as B-Trees introduced by Beyer and McCreight in 1972. And with a little bit of tweaking the structure we have developed thus far is a very good starting point for implementing them, which will be covered in a future post.

Until then, Happy hacking!