Roll your own String type

Anyone who has ever had the pleasure of working with ‘C style’ strings aka NULL terminated character arrays. Has at some point stopped and remarked about the terrible choice of implementation when it came to representing strings in C. Indeed, almost every serious security vulnerability for a long time could be traced back to a mishandling of a C style sting. Buffer overflows exist for the most part because of of unchecked string operations.

But have you ever sat down to consider how you could have done it differently? NULL terminated character arrays are not the only way to represent a string in memory. Pascal had a string type which carried its length in the first array slot, eliminating the need for a null terminator. Java and C++ use ‘record’ type strings(string objects), but still have character arrays as the underlying representation. Functional languages like Haskell have done away with the character array entirely, opting instead for a linked list of characters.

In this post I take a stab at implementing my own string type. My objective was not to develop the most high performance, or even “best” implementation of a string ADT, but instead to illustrate what a string type truly is, what methods it needs to to support, and one which could be used as a mostly drop in replacement for std::string.

What is a String?

If you ask most developers what a string is, they will instinctually tell you “a character array”. But if you think more abstractly about the problem, then allow me to propose the following definition of a String Abstract Data Type.

A special case of a multi set of characters, Which are added to the Set in a first in first out ordering, but which are not constrained to removal in that order, and sport arbitrary (out of ordering) insertions. This is somewhat of a gross generalization. However, by framing a string in the guise of a set then we automatically answer some fundamental questions, like: Is an empty string still a string? (Yes.)

Requirements: What is expected from a string?

Now that we have a definition for our string, what operations does our string ADT need to support. But before we get to that, when it comes to strings there is another question that needs to be answered: the matter of mutability. C string’s, except for having a (mostly) fixed length were mutable objects. std::string is also mutable. That is, you can change their contents. Java on the other hand, has immutable strings by default. Practically every functional language treat strings as immutable.

I have implemented a mutable string type below, but by using a linked list, it is actually very easy to implement “functional” versions, which are essentially immutable structures that make a new copy of the object.

Taking a cue from std::string, we should be able to iterate over the individual characters of the string with C++’s enhanced for loop, and when printed our string should be output as a whole. String types almost always support the ability to access a character by index – a byproduct of their traditional array based representation.

Another side effect of using a character array for strings is that inserting and removing characters from anywhere but the last position is inefficient. Because we’re already reinventing the wheel so to be speak, I’ve decided to take a page from Haskell’s book and use a linked list as the underlying representation.

A Linked List String

To use a linked list for representing a string, we utilized both a head and a tail node. The use of a tail node allows for efficient operations at the end of the string, such as concatenation. We also store the character count in a cached variable to keep the length() operation constant time instead of linear.

class String {
    private:
        const static char end_of_string = '\0';
        struct CNode {
            char data;
            CNode* next;
            CNode(char info) {
                data = info;
                next = nullptr;
            }
        };
        CNode* head;
        CNode* tail;
        int length_;

We also are going to want to implement an iterator, so that was can do things like print the characters in the string using a for loop. It’s just basic linked list iterator:

       class Iterator {
            private:
                CNode* curr;
            public:
                Iterator(CNode* c) {
                    curr = c;
                }
                char& operator*() {
                    return curr->data;
                }
                Iterator& operator++() {
                    curr = curr->next;
                    return *this;
                }
                Iterator operator++(int) {
                    Iterator t = *this;
                    ++*this;
                    return t;
                }
                bool operator==(const Iterator& o) const {
                    return curr == o.curr;
                }
                bool operator!=(const Iterator& o) const {
                    return !(*this==o);
                }
        };

We also want to pay close attention to implementing the constructors, and especially copy constructors and operator=, since we want seamless interoperability with C style strings.

         String() {
            length_ = 0;
            head = nullptr;
            tail = nullptr;
        }
        String(const char* s) {
            length_ = 0;
            head = nullptr;
            tail = nullptr;
            for (int i = 0; s[i] != '\0'; i++)
                push_back(s[i]);
        }
        String(const String& o) {
            length_ = 0;
            head = nullptr;
            tail = nullptr;
            for (auto it = o.head; it != nullptr; it = it->next)
                push_back(it->data);
        }
~String() {
            while (head != nullptr) {
                CNode* x = head;
                head = head->next;
                delete x;
            }
        }
String operator=(const char* s) {
            length_ = 0;
            head = nullptr;
            tail = nullptr;
            for (int i = 0; s[i] != '\0'; i++)
                push_back(s[i]);
            return *this;
        }
        String operator=(const String& o) {
            length_ = 0;
            head = nullptr;
            tail = nullptr;
            for (auto it = o.head; it != nullptr; it = it->next)
                push_back(it->data);
            return *this;
        }

std::string supports a push_back operation for appending single characters to the end of the string, so our string type will as well. Also similar to std::string, is the use of both size() and length() to perform the same operation.

       void push_back(char c) {
            CNode* nt = new CNode(c);
            if (length_ == 0) {
                head = nt;
                tail = nt;
            } else {
                tail->next = nt;
                tail = nt;
            }
            length_++;
        }
        int length() {
            return length_;
        }
        int size() {
            return length_;
        }

An important ability that any string type should support is the ability to extract substrings, a linked list representation makes this both easy and efficient:

       String substring(int startpos, int len) {
if (startpos > length_)
return String(" ");
if (startpos + len > length_)
len -= ((startpos+len) - length_);
            int i = 0;
            CNode* x = head;
            while (i < startpos) {
                x = x->next;
                i++;
            }
            String ns;
            for (int j = 0; j < len; j++) {
                ns.push_back(x->data);
                x = x->next;
            }
            return ns;
        }

Iteration is provided through begin() and end() to conform to the STL standard. Obviously using a linked list means constant time indexed access is out of the question, but we still need the ability to access individual characters. A lot of the functionality for our string type comes via operator overloading. To provide access to individual characters override operator[].

        Iterator begin() {
            return Iterator(head);
        }
        Iterator end() {
            return Iterator(nullptr);
        }
char& operator[](int x) {
            if (x < length_) {
                CNode* tmp = head;
                for (int i = 0; i < x; i++)
                    tmp = tmp->next;
                return tmp->data;
            }
            return const_cast<char&>(end_of_string);
        }

Another useful operator to overload is operator+= to support concatenation, which we overload to work with several different string representations as we did for operator=. operator+ shares the exact same implementation.

        String& operator+=(char c) {
            push_back(c);
            return *this;
        }
        String& operator+=(String& o) {
            for (char c : o)
                push_back(c);
            return *this;
        }
        String& operator+=(const char* o) {
            for (int i = 0; o[i] != '\0'; i++)
                push_back(o[i]);
            return *this;
        }
String& operator+(String& o) {
            for (char c : o)
                push_back(c);
            return *this;
        }
        String& operator+(const char* o) {
            for (int i = 0; o[i] != '\0'; i++)
                push_back(o[i]);
            return *this;
        }

We also want to be able to compare our strings, with at least the ability to test for equality. If you want to be able to sort the strings, the you will also need (at the very least) operator<.

         bool operator==(const String& o) const {
            if (length_ != o.length_)
                return false;
            CNode* myhead = head; CNode* therehead = o.head;
            while (myhead != nullptr && therehead != nullptr) {
                if (myhead->data != therehead->data)
                    return false;
                myhead = myhead->next;
                therehead = therehead->next;
            }
            return true;
        }
        bool operator!=(const String& o) {
            return !(*this == o);
        }
bool operator<(const String& o) {
            for (auto it = head, oit = o.head; it != nullptr && oit != nullptr; it = it->next, oit = oit->next) {
                if (it->data == oit->data)
                    continue;
                if (it->data > oit->data) return false;
                if (it->data < oit->data) return true;
            }
            return false;
        }

And last but not least for the operator overloads, is to give our string the ability to work with output streams:

         friend ostream& operator<<(ostream& os, String& str) {
            for (char c : str)
                os<<c;            
            return os;
        }
};

And there you have It, an alternative String representation that makes use of a linked list instead of character array, that can be concatenated, extract substrings from, compared, sorted, and pretty much anything else you’d need to do with a string!

 
int main() {
    int n = 7;
    String a[] = {"Hermoine granger", "Abileen Texas", "Stranger than Fiction", "Gomer Pile"
                  "Bread And Jam", "Cashews", "Fritter Nation", "Sampon Simpson"};
    cout<<"Unsorted List: "<<endl;
    printArr(a, n);
    sort(a, 0, n);
    cout<<"Sorted List: "<<endl;
    printArr( a, n);
cout<<"-----------------"<<endl;
  String one = "first half of";
    String two = "the string in use";
    if (one == two) cout<<"Strings match"<<endl;
    else cout<<"Strings dont match\n";
    cout<<one<<" == ";
    cout<<two<<"?"<<endl<<"Concatenated: ";
one += ' ';
    one += two;
    cout<<one<<endl;
    for (char c : two)
        cout<<c<<" ";
cout<<endl;
    String three = two.substring(4, 6);
    cout<<three<<endl;
}

max@MaxGorenLaptop:~/DSA$ ./a.out
Unsorted List:
Hermoine granger
Abileen Texas
Stranger than Fiction
Gomer PileBread And Jam
Cashews
Fritter Nation
Sampon Simpson

Sorted List:
Abileen Texas
Cashews
Fritter Nation
Gomer PileBread And Jam
Hermoine granger
Sampon Simpson
Stranger than Fiction

-----------------
first half of == the string in use?
Strings dont match
Concatenated: first half of the string in use
t h e s t r i n g i n u s e
string
max@MaxGorenLaptop:~/DSA$

Well that’s all I got for you. I feel like I should say that this article is somewhat tongue in cheek. For most use cases a character array backed string is far more approriate than using this class would be. Still, it is a fun academic exercise. Until next time, Happy Hacking!