Rolling Your Own Input Iterators

Features

Rolling Your Own Input Iterators

Kevin S. Van Horn

Sequences pop up all over the place in a program. An input iterator is often all the glue you need to bring the power of STL to your favorite sequence.

Much has been written about how to use the Standard Template Library (STL) container classes and algorithms. But little has been written on how to extend it with your own STL-compliant container classes, iterators, and algorithms. That's a shame, because STL was intended to be much more than just another class library — it was intended as an open-ended framework for generic programming. This article is a step towards filling that gap in the literature, by showing you how to create your own input-iterator classes, and why you might want to do so. It's aimed at people who already use STL, so I'll assume you know the basics of what iterators are and how they're used.

Why Use Input Iterators?

Input iterators are useful whenever you want to generate, return, or read in a sequence of values which are not already stored in memory as part of some container object. They are used by algorithms that pass through a sequence a single time, in the forward direction only, and never try anything sneaky like hanging onto an old iterator value in hopes of returning to a previous position in the sequence. In spite of these limitations, the standard C++ library contains 20 different algorithms (not counting variations with an extra function-object parameter) that can operate on sequences defined by input iterators. Furthermore, such sequences can initialize or be inserted into any object of one of the standard library's container classes.

The standard example of an input iterator is template class istream_iterator. An istream_iterator<T, charT> turns a stream with character type charT, containing textual representations of values of type T, into a sequence of T values. The appropriate extraction operator (operator>>) must be defined for type T. For example, if foo is the name of a file whose contents are
35\n24.7\n-12.76 3.8e+2
then the code fragment
typedef
istream_iterator<float,char> It;
ifstream is("foo");
vector<float> v(It(is), It());
will initialize vector v with the sequence of values 35f, 24.7f, -12.76f, 380f.

Note that the above example used It() as the past-the-end value (an iterator value just past the end of the desired sequence). Successively incrementing an istream_iterator will eventually cause it to attain the default-constructed value, when the underlying istream reaches end-of-file. This is typical of input iterators — their one-pass nature means that no intermediate iterator values are available for specifying subsequences, so the only available past-the-end value is one indicating some form of end of input.

The standard library also includes template class istreambuf_iterator for extracting from an istream or a streambuf a sequence of characters. Thus we see a common use for input iterators: as adapters that provide an STL-compliant interface to some other class so that it can be used with STL algorithms (and user-written generic algorithms).

Here are some additional examples of input iterators:
Reading a file system directory. Dietmar Kuhl (www.informatik.uni-konstanz.de/~kuehl) has implemented an input iterator class that serves as a wrapper around the Unix opendir, readdir, and closedir functions. Thus you can write
list<string> l(dir_it("foo"), dir_it());
to create a list of the names of all entries of the directory foo, and
distance(dir_it("foo"), dir_it())
to compute the number of entries in directory foo.
An analog of istream_iterator for binary files. If you deal with files that contain a series of values in some binary format, instead of a textually-encoded format, you may find it useful to define a class analogous to istream_iterator for reading in these values.

Lexical analysis. A lexer converts a sequence of characters into a sequence of tokens. For maximum flexibility and ease of interfacing with STL components, a lexer should then be a template class or function that takes an arbitrary iterator range for reading characters, and creates an input iterator for reading tokens. For example, if it0 and it1 are of an input iterator type X, with the range [it0, it1) defining a sequence of characters, then [lexer<X>(it0, it1), lexer<X>()) is the corresponding sequence of tokens. (Recall that [it, itend) means the sequence of values obtained by incrementing it0 up to, but not including, it1, and dereferencing every iterator along the way.)

Database queries. A query can return multiple results. How should you return these results? One common method is to dictate a particular kind of container into which the results will be placed. This has the drawback of being inflexible and requiring an extra copying stage if the application needs to put the results into a different kind of container. Furthermore, sometimes the results can be processed one by one without any need by the application to ever store the entire set of results.
Another common method that overcomes these problems is to have a notion of a database cursor. A query sets the cursor, and then you can read in the results one by one. The only problem with this is that it may not conform to any standard interface. But if you make your database cursors actually be input iterators — or write an adapter for this purpose — then you obtain a very flexible interface. A query returns an input iterator i of some type X, with [i, X()) being the sequence of return values. The iterator pair (i, X()) can then be passed to any number of STL algorithms, used to initialize the contents of a standard container, transferred to an existing container via STL's copy algorithm, or used directly by your own code.

A Database Example

Let's look at that last example above in more detail; in particular, consider putting an STL-style interface on the Unix dbopen facilities. The function dbopen can actually be used to create and access three different kinds of database files:

flat files of records

hash-table files storing (key, value) pairs

B-tree files storing (key, value) pairs

With the B-tree option you can access (key, value) pairs in ascending key order, according to a user-defined comparison function. We'll ignore the flat-file and hash-table options, and concentrate on the B-tree option.

I've written the classes const_db_btree and db_btree to provide the STL-style interface. These assume that your keys are of class Date (Listing 1) , and the associated values have type int. In practice you'd want to make const_db_btree and db_btree be template classes to allow arbitrary key and value types — and that is what I originally did — but doing so complicates the exposition considerably. Since this is an article about input iterators, and not about generic container classes, I opted for simplicity of description over generality here.

Class const_db_btree is for read-only access to an existing B-tree file. Its header files are given in Listings 1 and 2. Class db_btree publicly inherits from const_db_btree, and simply adds functionality to allow you to insert new (key, value) pairs, delete them, or construct a new B-tree file. Since db_btree adds nothing that is relevant to a discussion of input iterators, I have omitted it here.

We'll talk about the implementation later; for now just skip down to the line that says "INTERFACE STARTS HERE" in Listing 2. The constructor argument is the name of the B-tree file you want to access. The function begin returns an input iterator pointing at the first entry in the B-tree, according to the key ordering. The function end simply returns the default-constructed input iterator; it is the past-the-end value for all sequences produced by this class.

The call x.find_from(k) returns an input iterator pointing at the first entry whose key is not less than k, or x.end() if there is no such entry. Thus [x.find_from(k), x.end()) is the sequence of all entries in the B-tree file of x with keys not less than k.

The call x.find_to(k) returns an iterator pointing at the first entry in the B-tree. This iterator value differs from x.begin() in its behavior when repeatedly incremented. After reaching the last element with key not greater than k, this iterator becomes equal to x.end() if you increment it once more. Thus [x.find_to(k), x.end()) is the sequence of all entries in the B-tree file of x with keys not greater than k.

The call x.find_rng(k1,k2) returns an iterator pointing at the first entry whose key is not less than k1, or x.end() if there is no such entry. This iterator value differs from x.find_from(k1) in its behavior when repeatedly incremented. After reaching the last element with key not greater than k2, this iterator becomes equal to x.end() if you increment it once more. Thus [x.find_rng(k1, k2), x.end()) is the sequence of all entries in the B-tree file of x with keys between k1 and k2 inclusive.

The dbopen interface has the notion of a cursor into the database, which is used in the implementation of const_db_btree iterators. However, the dbopen interface only allows for a single database cursor. This prevents us from having more than one valid, dereferenceable iterator into the database at a time. So a side effect of the calls x.begin(), x.find_from(f), x.find_to(k), and x.find_rng(k1, k2) is that any existing dereferenceable (not x.end()) iterators pointing into x are invalida ted.

The call x.get(k, d) returns true if there is an entry in x's B-tree with key k, and assigns to d the value associated with key k. Otherwise it returns false and leaves d unchanged. It does not invalidate any existing iterators into x.

Now for an example of using const_db_btree. Suppose you have a B-tree database where the keys are dates and the associated values are the number of widgets your store sold on that date. The name of the database file is "sales." Listing 3 shows some code that accesses the database to print out the five best sales days within a given date range. The template functions partial_sort_copy and copy, and the template class ostream_iterator, are from the draft Standard C++ library. The call
partial_sort_copy(iit0, iit1, rit0, rit1, cmp);
stores in [rit0, ri1) the N "best"values in the sequence [iit0, iit1), where N is the distance from rit0 to rit1, and "best" is defined by the comparison object cmp. rit0 and rit1 must be random-access iterators. The call to copy just writes the values in array A to cout, using operator<<, separated by the string "\n".

Input Iterator Requirements

The draft C++ Standard (subclause 24.1) discusses three kinds of iterator values: dereferenceable values, past-the-end values, and singular values. Dereferenceable iterator values "point at" some value in a container or input sequence. Past-the-end values can't be assumed to point at anything, but mark the end of a sequence. Singular values are basically unusable iterator values. The only thing you can do with an iterator variable holding a singular value is to assign it a new value. Uninitialized pointers are one example of singular values.

Table 1 is taken (with slight modifications) from the draft C++ Standard, and gives the requirements for a type X to be considered an input iterator for value type T.

Item 1 says that X must have a copy constructor and destructor.

Item 2 says that X must allow copy assignment. The draft C++ Standard says that a copy assignment should return a value of type X, which I suspect is a typo — copy assignments, including compiler-generated copy assignments, usually return a value of type X&, and the descriptions given for istream_iterator and istreambuf_iterator don't support the idea of a return type of X.

Items 3 and 4 discuss operator== and operator!=. These may return values of any type that can be converted to bool, which includes bool itself and the other integer types. In the working paper, the phrase "(a, b) is in the domain of ==" simply means that the expression a == b is (required to be) defined. The phrase "a is in the domain of ==" means that there exists some iterator value b for which the expression a == b is (required to be) defined.

Singular values are not in the domain of ==. Thus, if a is a singular value, then neither a == b nor b == a is (required to be) defined. Presumably a != b is defined whenever a == b is defined, and vice versa. If an expression (such as a == b) is not required to be defined, then the standard doesn't care what it returns or what it does. It could sit in an infinite loop, it could cause the program to crash, it could stomp all over memory, or it could delete all of your files.

Curiously enough, although the draft C++ Standard gives some cases where the expression a == b is not required to be defined, it never explicitly says just when a == b must be defined. However, it seems to be assumed that if a and b are non-singular iterator values marking positions in (or past-the-end of) the same sequence, then a == b must be defined; otherwise loops such as
for (it = start; it != end; ++it)
        ...;
wouldn't work.

The requirement that == be an equivalence relation simply means that it must behave as you expect equality to behave:

1. a == a is always true whenever the expression is defined (any non-singular iterator is equal to itself).

2. If a == b is defined and true then b == a is defined and true (the order of the arguments for == doesn't matter).

3. If a == b and b == c are both defined and true, then a == c is defined and true (you can chain together equalities to deduce new equalities).

Item 5 discusses the dereferencing operator. There's a bit of circularity here. The draft C++ Standard says that a is a dereferenceable iterator if *a is defined; then it turns around and, in item 5, says that *a is defined if a is dereferenceable. In any event, the intention is that *a returns the value designated (or "pointed at") by a, or the first value in the (sub)sequence whose beginning is marked by a. There's some ambiguity about the return type of *a. Section 24.1.1 says that *a must return a value of type T, but the description of istream_iterator<T, charT> indicates a return type of T const &. Perhaps the committee intended the return type to merely be convertible to T.

Item 5 also imposes a consistency requirement on operator== and operator*: If a == b is defined and true, then *a and *b are either both undefined or both defined and return the same value. Combined with items 3 and 4, this tells us that two dereferenceable iterator values a and b must be considered unequal if they point to different values, and that a and b must be considered unequal if a is dereferenceable and b is a non-dereferenceable, past-the-end value.

Item 6 simply requires that a->m can be used as an abbreviation for (*a).m, where m is a member of class T and a is an iterator value.

Item 7 implies that a dereferenceable input iterator can be repeatedly incremented without obtaining a singular value, until it reaches a past-the-end value. (If a past-the-end value is never reached, then the iterator can be incremented indefinitely, and it defines an infinite sequence.) But item 7 is most interesting for what it does not require. If it1 and it2 are copies of the same iterator value, and the operation ++it1 occurs, then it2 is no longer required to be usable. It may be considered a singular value.

The reason for this is that input iterators are used for one-pass algorithms — you aren't supposed to return to previous values of an iterator. The implications for operator== are that when implementing operator== you need only consider comparing an input iterator with itself, copies of itself, and the past-the-end value; it doesn't matter what operator== returns in any other case.

This lack of a requirement is important, as it allows efficient implementations of input iterators. The original input iterator requirements, given in Stepanov and Lee's seminal paper defining STL, were stronger. As a result, istream_iterator had to be implemented with three data members:

1. A pointer to an istream.
2. A bool indicating whether this was the past-the-end value.
3. A variable of type T, holding the value (if any) designated by the iterator.

This made an object of class istream_iterator rather large, which could be expensive to copy, especially if values of type T were expensive to copy. But iterators get copied all the time, because they are typically passed by value. They should therefore be inexpensive to copy.

The weakened requirements allow an input iterator to contain a single data member: a pointer to a structure that holds the designated value and whatever information is necessary to generate the remaining values in the sequence, with a null pointer indicating the past-the-end value. Equality of iterators may be implemented as equality of the stored pointers. When an input iterator is incremented, it may need to update the values in the structure pointed at, but it can continue to hold the same pointer. Since it is no longer valid to dereference or compare for equality any copies of the old value of the iterator, it doesn't matter that these in fact hold the same value as the incremented iterator.

Item 8 says that, if you ignore return values, post-increment and pre-increment of an input iterator are equivalent. Again, the most interesting thing is what is not required. There is no restriction on the return type of post-increment. In particular, it need not return an iterator value.

Item 9, though, adds the additional restriction that *r++ must be defined, and equivalent to incrementing r and then returning the old value pointed at. At first this would seem to require that r++ return the old value of r, and that this old value remain dereferenceable, thereby unraveling the implementation scheme I described above. But there is another way of satisfying item 9. You can define a proxy class whose only purpose is to hold a value of type T, and for which operator* is defined and returns the enclosed value. If the operation r++ returns a proxy containing the value of *r before incrementing, item 9 is then satisfied. I'll show an example of this in the next section.

Finally, there is a time-complexity requirement: all of the operations in Table 1 must take constant time (amortized). What this means is that, for any particular machine and compiler, you can put a fixed upper limit T on how long any one of the operations takes, that is independent of the length of the sequence. An example of violating this requirement would be if the first increment operation took one unit of time, the second took two units, the third took three units, and so on. Actually, the requirement is a little more lenient than this, in two ways:

1. Constructing, assigning, or copying the value pointed at by an iterator only counts for one unit of time, regardless of how long it actually takes. Otherwise it would be impossible to satisfy the time complexity requirements if you had a value type like vector<int>, since the time required to copy a vector is proportional to its length.

2. Individual increment operations may take longer than the upper limit T, as long as iterating over the entire sequence takes no longer than a time of NT, where N is the length of the sequence. One example of where this becomes important is when doing a depth-first traversal of a tree structure. Going from the rightmost leaf of a large subtree to the leftmost leaf of the next subtree may take time proportional to the depth of the tree, since you have to follow many parent links up to get to the next subtree. But if you traverse the entire tree, many steps will only require following one child link or one parent link. The average will be just under two links followed for each node of the tree.

In addition there are certain required typedefs that are used by the generic algorithms in the Standard Library. You can satisfy these requirements simply by making your input iterator class X be publicly derived from
std::iterator< std::input_iterator_tag, T, Distance >
where Distance is an integer type that can hold the maximum possible length of a sequence defined by iterators of type X. Typically, it is size_t. [The current draft C++ Standard adds the template parameters T* and T&. — pjp]

If you are still working with a nonstandard version of STL, typically derived from the original H-P version, you have to publicly derive X from
std::input_iterator<T, Distance>
In either case, if your compiler doesn't support namespaces then leave off the std:: namespace qualifier.

Creating Input Iterator Classes

Rather than deal with all of these requirements anew every time I needed to define an input iterator class, I decided to write an adapter to take care of the details once and for all. The code is given in Listing 5. The symbol NO_PARTIAL_SPECIALIZATION should be defined if your compiler can't handle partial specialization of templates.

The class input_iter_adapter<P, Distance> is a wrapper around a pointer or smart-pointer class P that makes it behave as an input iterator. The type Distance should be chosen as described at the end of the previous section. The type P must satisfy the following requirements:

1. pointer_traits<P>::element_type is defined as some type G. The pointer_traits class template is given in Listing 4. This requirement is automatically satisfied if P is a pointer type, or if the type P::element_type is defined. element_type is the type of structure pointed at by objects of type P. If the symbol NO_PARTIAL_SPECIALIZATION is defined, you can ignore this requirement, but you'll have to explicitly provide the value_type (Val parameter) and deref_type (Der parameter) mentioned below.

2. G::value_type is defined as some type T with a public copy constructor. T is the value type for the input iterator.

3. G::deref_type is defined as either T or T const &. It is the return type of the dereference operator (*it). This is to accomodate the ambiguity in the draft C++ Standard as to whether the return type of *it must be T or can be T const &.

The following are required for any value g of type G:

4. g.value() has type G::deref_type, and returns the most recent value generated in the sequence. g.value() is undefined if g.next() has never been called or the last call of g.next() returned false.

5. g.next() generates the next value in the sequence and returns true if there are more values in the sequence, otherwise it returns false.

Now on to the implementation. The input_iter_adapter class has a single data member p of type P. The private function inc() gets the next value in the sequence, setting p to the null pointer if there are no more values in the sequence.

The default constructor creates a past-the-end value by initializing p with the null pointer. The other constructor allows an iterator to be constructed from a P value, and fetches the first value of the sequence (if any).

The class PIR (post-increment return) is the proxy class used to implement expressions of the form *r++. The post-increment operator initializes a PIR object with the current value pointed at, increments the iterator, then returns the PIR object.

The remaining operations should require no further explanation.

The class P can simply be a pointer to G when the object generating the sequence of values has a static scope that encloses uses of the input iterator — i.e., when you have something like:
{ G g(...);
input_iter_adapter<G *> i(&g), iend();
// code that doesn't pass i out of this scope
}
The implementation of const_db_btree in the next section is an example of this.

In some cases you'll want P to be a smart-pointer class for G. For example, suppose you want to create an input iterator for the Fibonacci sequence. This sequence is defined by
F(1) = 1
F(2) = 1
F(n+2) = F(n) + F(n+1)
Listing 7 shows the function fibonacci(n) that returns an input iterator for the first n elements of the Fibonacci sequence. Class fib is the generator class G mentioned in the requirements given above. The input iterator uses a reference-counted smart pointer template class defined in Listing 6. This smart pointer class keeps track of how many pointers are referencing an object, and deletes the object when there are no references to it. This is needed because the input iterator, along with its enclosed pointer, is passed out of the scope in which the fib object is created, and so some means of reclaiming storage for the fib is needed.

Revisiting the Example

Now we're ready to look at the implementation of the const_db_btree class described earlier. It is an STL-style interface to the facilities provided by the dbopen function. The implementation is given in Listings 2 and 8.

The Unix function dbopen returns a pointer to a DB struct. Most of the DB struct members are pointers to the various functions needed to access the database — and all of these functions take the DB pointer as an argument! This somewhat convoluted arrangement is used because there are several different kinds of databases that can be opened with dbopen, and each has its own version of each access function. Keys and their associated values are represented via DBT values:
struct DBT {
        void * data;
        size_t size;
        };
data is a pointer to an array of size bytes. The various database routines take and return DBT values. The functions pod2dbt and dbt2pod (Listing 8) convert "plain old data" to and from the DBT representation.

Class const_db_btree has one member object, body, of type datagen (see Listing 2) . This is the generator class for its iterator class — an iterator is just a wrapper around a pointer to body (see the typedef for iterator and the functions that return an iterator). The data members of class datagen are:

value_ — the (key, data) pair for the current database cursor position.

dbp — pointer to the DB struct returned by dbopen.

key, data — DBT representations of value_.

ek — greatest key value to retrieve (for find_to and find_rng).

endkey — DBT representation of ek.

flag — operation code used by the database access routines returned by dbopen.

The function const_db_btree::ctor (Listing 8) is called by the constructor. It simply opens the database as a B-tree file, and specifies that the function compare be used to compare (the DBT representations of) keys.

The function datagen::set(Key * startk, Key * endk) (Listing 8) is used by begin, find_from, find_to, and find_rng. It sets things up for the first call of next. It is always immediately followed by the statement
return iterator(&body);
Since the constructor for input_iter_adapter calls next, every call of set is followed by a call to next. startk is (a pointer to) the lower bound on the range of keys we want to fetch. If it is null, we want to start with the first entry in the database. endk is (a pointer to) the upper bound on the range of keys we want to fetch. If it is null, we want to continue to the end of the database.

The actual database accesses all occur in datagen::next (Listing 8) , which is called by the iterator increment operation and the iterator constructor (Listing 5) . The next function determines if there are any more (key, data) pairs to fetch, and if so, fetches the next one. On entry to this function, flag is set to either R_FIRST, R_CURSOR, or R_NEXT (R_SETCURSOR is used only in db_btree). The R_FIRST flag tells the dbp->seq function to fetch the first (key, data) pair in the database. The R_CURSOR flag says to fetch the first (key, data) pair with key no less than the key data member, and set the database cursor to this position. The R_NEXT flag says to advance the database cursor to the next (key, data) pair and fetch it. In all cases the new (key, data) pair ends up in the data members key and data. If the return code is greater than zero, no more database entries could be found. If the returned key is greater than ek (whose DBT representation is in endkey), and we are returning values from find_to or find_rng, this also indicates the end of the sequence of values, and next returns false.

Conclusion

Input iterators provide a standard interface for returning sets or sequences of results, one that fits with STL and allows wider use of generic programming techniques. Use of the input_iter_adapter class template lets you easily define new input iterator classes without having to pore through the detailed requirements for input iterators. o

Kevin S. Van Horn, Ph.D. is the proprietor of KSVH Software and Consulting (http://www.xmission.com/~ksvhsoft/). His current interests are generic programming and tools for decision making under uncertainty. Previously he was the technical lead for Excite, Inc.'s NewsTracker service. You may reach him at ksvhsoft@xmission.com. All source code in this article is copyright © 1997 by Kevin S. Van Horn. You may freely use and modify this code in any manner provided this entire paragraph appears prominently in the source.