Implementing Efficient Algorithms for Ordered Sets

The set ADT is an important and unique (see what I did there?) data structure with many uses, and many ways to implement them. Often implemented over a linked structure, sets are not quite a list and not quite a dictionary, but often have similar functionality of both. They can be ordered or unordered, though ordered sets while more costly to construct have distinct performance advantages once they are constructed.

The single most important property of sets is that the keys must be unique. This property should be enforced by whatever underlying data structure is being used to implement the set. The three most common data structures used are linked lists, hash tables, and balanced search trees. Which data structure is most appropriate to use is a case-by-case consideration that shouldn’t be taken lightly, as it can have a significant impact on performance.

In todays post I’m going to discuss algorithms for performing the following operations or ordered sets:

  • union, which produces a set from combining two sets.
  • difference – If comparing set A and set B, the set difference is the set containing the elements present in Set B which are not in Set A.
  • symmetric difference – similar to difference, but will produce a set containing the elements which are present in B but not in A, as well as the elements that are present in A that are not present in B.
  • intersection – produces a set of elements that are present in both sets.

In the above descriptions I say “both” and will be showing examples using 2 sets, but these algorithms can be (and frequently are) scaled up to work with multiple sets at a time.

The Set ADT

For the sake of simplicity, I will be using linked list, kept in sorted order as the underlying data structure. I will also be using dummy head and tail nodes, as it allows for a cleaner implementation allowing us to dereference pointers without having to worry about checking for null pointers.

The Set ADT must support the following operations at the minimum:

  • insert(T item) – add an item to the set, only if the item is not already present
  • remove(T item) – delete an item from the set
  • find(T item) – check if an item is in the set, and retrieve it if so.
  • size() – return the number of elements in the set

The above list is the bare minimum required for a proper set ADT. Many implementations offer other operations, such as a check for emptiness, range searches, and swapping, just to name a few.

template <typename T>
struct listnode {
    T info;
    listnode* next;
    listnode(T i = T(), listnode* n = nullptr) : info(i), next(n) { }
    listnode(T i) : info(i), next(nullptr) { }
};

template <class T>
class Set {
    private:
        using node = listnode<T>;
        using link = node*;
        using iterator = ForwardIterator<T>;
        link head, z;
        int count;
        void init() {
            head = new node(std::numeric_limits<T>::min());
            z = new node(std::numeric_limits<T>::max());
            z->next = z;
            head->next = z;
            count = 0;
        }
    public:
        Set() {
            init();
        }
        Set(const Set& set) {
            init();
            link c = head;
            for (link it = set.head->next; it != set.z; it = it->next) {
                c->next = new node(it->info, z);
                c = c->next;
                count++;
            }
        }
        ~Set() {
            while (head != z) {
                link x = head;
                head = head->next;
                delete x;
            }
        }
        bool empty() {
            return head == z;
        }
        int size() {
            return count;
        }
        void insert(T key) {
            link t = new node(key);
            link x = head, p = x;
            while (x != z && key > x->info) {
                p = x;
                x = x->next;
            }
            if (key != x->info) {
                t->next = p->next;
                p->next = t;
                count++;
            }
        }
        iterator find(T key) {
            for (link it = head->next; it != z; it = it->next) {
                if (key == it->info) {
                    return iterator(it);
                }
            }
            return end();
        }
        iterator begin() {
            return iterator(head->next);
        }
        iterator end() {
            return iterator(z);
        }
        Set operator=(const Set& set) {
            if (this == &set)
                return *this;
            init();
            link c = head;
            for (link it = set.head->next; it != set.z; it = it->next) {
                c->next = new node(it->info, z);
                c = c->next;
                count++;
            }
            return *this;
        }
};

Anyone who reads my posts knows I’m a fan of the iterator pattern. Set algorithm’s are a place where the iterator pattern proves its merit once again, if the underlying data structure is changed to say, a binary search tree, so long as the BST has iterators implemented the algorithms shown will still work. I use iterators which conform to the STL interface so they can be used with C++’s enhanced for loop. As an added bonus, these algorithms also work with std::set.

 
template <typename T>
class ForwardIterator {
    private:
        listnode<T>* curr;
    public:
        ForwardIterator(listnode<T>* pos) {
            curr = pos;
        }
        T& operator*() {
            return curr->info;
        }
        ForwardIterator operator++() {
            curr = curr->next;
            return *this;
        }
        ForwardIterator operator++(int) {
            ForwardIterator t = *this;
            ++*this;
            return t;
        }
        bool operator==(const ForwardIterator& it) const {
            return curr == it.curr;
        }
        bool operator!=(const ForwardIterator& it) const {
            return !(*this == it);
        }
};

Set Operations & Algorithmic Complexity

Set operations can be found at the heart of many data intensive applications and queries. SQL statements for example, often reduce to set operations – think table joins. This means that these algorithms are often processing not insignificant quantities of data. Clearly, efficient algorithms are required, it is for this reason we implement ordered sets, as they allows us to exploit certain properties of ordered data that are unavailable to us if the collection is unsorted.

In an unordered collection, these algorithms reduce to a search problem, and a rather inefficient one at that, regardless of how the data is structured. The reason for this is simple to understand, and is best illustrated through code. Lets begin with the set intersection operation. Set intersection is the act of creating one set from two or more sets, comprised only of the items which are present in all of them. A naiive approach, and what is essentially the only approach open to us if our data is unsorted would look something like this:

template <class T>
void set_intersection(Set<T>& lhs, Set<T>& rhs, Set<T>& result) {
    for (auto it : lhs)
        if (rhs.find(it) != rhs.end())
            result.insert(it);
    for (auto it : rhs)
        if (lhs.find(it) != lhs.end())
            result.insert(it);
}

All of the algorithms I will discuss – difference, symmetric difference, intersection, and union – would look similar to the above. because the only recourse we have for checking if the particular invariant we want holds, is to first check that it is valid in list a, and then also check whether it is valid in list b. This requires us to completely traverse both sets O(N) times, for every element in both sets. Ouch. Thankfully, we’re not totally stuck: by choosing to keep our data sorted we can do better.

Efficient Algorithms on Ordered Sets

The gist of what it is we’re trying to do is this: take two ordered sets, and using the elements from both sets, create one set based on a given predicate. If that doesn’t remind you of something, ill try being a bit more obvious: We are taking two sorted sets and – ahem – merging – elements from either set into one.

By taking the merge routine from merge sort and tweaking the way items to be included in the result set are selected we can perform all four set operations in linear time. I’m pretty sure it goes without saying that that’s a whole heck of a lot a better than the quadratic complexity of the naive implementation discusses above. Let’s take the same set intersection problem and see how it is accomplished via merging.

template <class T>
void set_intersection(Set<T>& lhs, Set<T>& rhs, Set<T>& result) {
auto lit = lhs.begin(), rit = rhs.begin();
while (lit != lhs.end() && rit != rhs.end()) {
if (*lit < *rit) {
lit++;
} else if (*rit < *lit) {
rit++;
} else {
result.insert(*lit);
lit++;
rit++;
}
}
}

Let’s take a moment to dissect what’s going on here and why. Our goal, as stated above is to find the items that only occur in both sets – thus we initialize our iterators and walk the sets simultaneously. Because the lists are in sorted order, we know that if the item from the left set is less than the item in the right set, they cant be equal, and thus cant be in both sets. We check the same but in reverse for the right side. On each iteration, only one of the three cases can be true, for the two negative cases, we advance that sides iterator, leaving the other as it was for the next comparison. Eventually, if an item is in both sets, they will have to be compared to each other, otherwise we will consume both sets without adding any items to the result set, as none of them are present in both sets.

Set difference is even closer in structure to merge sort, as once we exit the main loop, we must add the remaining unconsumed items (if they exist) to the result set. This time we went to favor the result of the first comparison for adding to the list, for the same reason described above: as the lists are in sorted order, we know that if the item from the left set is less than the item in the right set, they cant be equal, and thus cannot be present in both sets,

template <typename T>
void set_difference(Set<T>& lhs, Set<T>& rhs, Set<T>& result) {
    auto a = lhs.begin(), b = rhs.begin();
    while (a != lhs.end() && b != rhs.end()) {
        if (*a < *b) {
            result.insert(*a);
            a++;
        } else if (*b < *a) {
            b++;
        } else {
            a++;
            b++;
        }
    }
    while (a != lhs.end()) result.insert(*a++);
    while (b != rhs.end()) result.insert(*b++);
}

Symmetric difference is implemented the same as above, but we insert the result of either inequality into the result set:

template <typename T>
void symmetric_difference(Set<T>& lhs, Set<T>& rhs, Set<T>& result) {
    auto a = lhs.begin(), b = rhs.begin();
    while (a != lhs.end() && b != rhs.end()) {
        if (*a < *b) {
            result.insert(*a);
            a++;
        } else if (*b < *a) {
            result.insert(*b);
            b++;
        } else {
            a++;
            b++;
        }
    }
    while (a != lhs.end()) result.insert(*a++);
    while (b != rhs.end()) result.insert(*b++);
}

And last but not least: set union, which happens also be an unaltered merge() algorithm borrowed directly from merge sort:

template <typename T>
void  set_union(Set<T>& lhs, Set<T>& rhs, Set<T>& result) {
    auto a = lhs.begin(), b = rhs.begin();
    while (a != lhs.end() && b != rhs.end()) {
        if (*b > *a) {
            result.insert(*a);
            a++;
        } else {
            result.insert(*b);
            b++;
        }
    }
    while (a != lhs.end()) result.insert(*a++);
    while (b != rhs.end()) result.insert(*b++);
}

That’s all I’ve got for you today, but stay tuned for an upcoming post on how we can exploit the shared structure of the above algorithms, combined with C++’s template system to create a framework for set algorithms, without the obvious repetition of implementing the above algorithms. Until then, Happy Hacking!