Standard C/C++

P. J. Plauger

The Facet ctype

Classifying characters is still an important operation in many C++ programs, but it now involves considerably more machinery in the presence of multiple locale objects.

Introduction

One of the early strengths of C was its ability to manipulate characters cleanly and rapidly. The char data type, pointers, character literals, and string literals together provide a powerful notation — far easier to use than the earlier machineries of assembly language or Fortran. C programmers certainly didn't invent text processing, but they quickly achieved major advances in the quality, flexibility, and speed of software tools for manipulating text.

It should come as no surprise that the header <ctype.h> is one of the earliest additions to what is now the Standard C library. Classifying characters and mapping between character cases are among the commonest operations in text processing. It is not uncommon for a text processor to apply several classification tests to each character it reads. The character-classification functions, just like getchar and its buddies, can thus have a profound effect on the overall performance of a program.

Consider the simple function isalnum(int ch). It accepts as an argument either a non-negative character code, which is essentially (int)(unsigned char)ch, or a negative end-of-file code, EOF. In the interest of good performance, the function is not obliged to test for any argument values outside this range. The traditional implementation is to define EOF as -1 and to use ch as an index into an array of characters. Each element of the array stores a classification mask for the character code that indexes the element. ANDing the array element with a particular union of classification bits makes for a rapid test.

A typical implementation goes one step farther. It eliminates the overhead of a function call by providing a masking macro. My implementation of the Standard C library, for example, contains the following macro definition in the header <ctype.h>:
#define isalnum(ch) (_Ctype[ch] & (_DI|_LO|_UP|_XA))
(See "Standard C: Implementing <ctype.h>," CUJ, November 1990. It's on the CUJ CD-ROM, if you don't keep a giant stack of back issues, as I do.)

Here, _Ctype is a static pointer supplied by the library. It points to element number 1 of a 257-element array of masks, for a typical environment with eight-bit bytes. The constant _DI specifies the mask bit that signifies a digit. Similarly, _LO signifies a lower-case character, _UP signifies an upper-case character, and _XA signifies any "extra" alphabetic characters that might be none of the above. The last category matters only in locales other than the standard "C" locale that have a more elaborate alphabet than is required by English.

I've recited all this basic information to emphasize a simple point. In a typical C program, the expression isalnum(ch) is blindingly fast. It effectively adds an integer to a static pointer, picks up the mask word designated by the resulting pointer, and ANDs it with a constant integer. If the result is nonzero, you know that ch is classified as an alphanumeric character.
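The table lookup described above can be sketched in a few lines. This is only an illustration, not the library's actual source: the names my_isalnum and build_table and the mask values are hypothetical, and the loops assume an ASCII-like character set in which digits and letters are contiguous. It also assumes EOF is -1, as in the traditional implementation.

```cpp
#include <cstdio>   // EOF

// Hypothetical mask bits. The names echo the article; the values are made up.
enum { DI = 0x01, LO = 0x02, UP = 0x04 };

static_assert(EOF == -1, "this sketch assumes the traditional EOF value");

// Build a 257-entry table: slot 0 serves EOF (-1), slots 1..256 serve
// character codes 0..255. Assumes contiguous digits and letters (ASCII).
inline const unsigned char* build_table() {
    static unsigned char table[257] = {0};
    for (int c = '0'; c <= '9'; ++c) table[c + 1] |= DI;
    for (int c = 'a'; c <= 'z'; ++c) table[c + 1] |= LO;
    for (int c = 'A'; c <= 'Z'; ++c) table[c + 1] |= UP;
    return table + 1;   // point at element 1, so index -1 is valid
}

// One add, one load, one AND: the whole classification test.
inline bool my_isalnum(int ch) {
    static const unsigned char* const tab = build_table();
    return (tab[ch] & (DI | LO | UP)) != 0;
}
```

As with the real functions, the argument must be EOF or a value in the range of unsigned char; nothing checks for anything else.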

The standardization of C made character classification a bit more complex, but not necessarily any slower. (See my introduction to this topic last month, "Standard C/C++: Introduction to Locales," CUJ, October 1997.) A call to the library function setlocale, declared in <locale.h>, can change the rules for character classification during program execution. An alternate locale can add characters to certain classifications in the "C" locale, as I indicated above, but it can't take any away. In any event, all an implementation has to do is change the value stored in _Ctype to point at a new classification table. The macros in <ctype.h> can still work, and they can remain as fast as ever.

Standardization did add one significant complexity in this area. The C Standard added the concept of a "wide character," of type wchar_t, which can represent a much larger set of characters than the old standby char. Amendment 1, adopted by ISO a few years ago, added wide-character versions of essentially all the traditional character classification and manipulation functions in the Standard C library. The new function iswalnum(wchar_t wch), for example, is declared in the new header <wctype.h>. As you might guess, it does for wide characters what isalnum does for "narrow" (single byte) characters.

Type wchar_t is likely to be a 16-bit or larger integer type. Hence, it can represent tens of thousands, if not billions, of character codes. It is not likely that an implementation of iswalnum will make use of a lookup table akin to good old _Ctype above. Whatever method the function uses, it will doubtless be substantially slower than our old friends from <ctype.h>. That's simply the price you have to pay if you want to traffic in Unicode, Shift JIS, or one of the other large character sets in use today.
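From the caller's side, the wide-character functions look just like their narrow cousins. A minimal sketch, in which count_wide_alnum is a hypothetical helper and the only library call is iswalnum, reached through the C++ header <cwctype>:

```cpp
#include <cstddef>
#include <cwctype>

// Count the wide characters in a null-terminated string that the
// current global locale classifies as alphanumeric.
std::size_t count_wide_alnum(const wchar_t* s) {
    std::size_t n = 0;
    for (; *s != L'\0'; ++s)
        if (std::iswalnum(static_cast<std::wint_t>(*s)))
            ++n;
    return n;
}
```

How iswalnum arrives at its answer is left to the implementation, which is where the extra cost hides.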

Doing It With Facets

The draft C++ Standard adds even more complexity, as I outlined last month. It provides for multiple locales in a single program. Each locale is associated with a locale object. In particular, each stream object — such as cin or cout, declared in <iostream> — is "imbued" with its own private locale object. Each locale object, in turn, designates a couple dozen "facets," which perform the actual locale-specific operations. With a bit of setup, you can presumably read dates in French and write them out in German, using the machinery associated with the template facets time_get and time_put to do much of the work for you.

I will eventually describe the machinery needed to read and write dates this way, but not in this installment. Rather, I begin my detailed look at locale facets with template facet ctype. It is the obvious descendant of the quarter century or more of experience in character classification that I outlined above, all beginning with the C header <ctype.h>.

Template class ctype is a facet by virtue of being publicly derived from class locale::facet. It also defines the public static member object id, of type locale::id. I described this general machinery last month, so I won't dwell on it here. More important is to note that the template class has one class parameter which I tend to call E, for "element type." It describes the type of character elements you want to work with, be they elements of an input stream, an output stream, a string class, or whatever.

The obvious specialization of template class ctype is ctype<char>. The Standard C++ library provides a facet of this class in all locales. As a matter of fact, the library provides an explicit specialization for this class, with a few added properties, and optimizations, not found in the basic template. The library also provides a facet of type ctype<wchar_t>, for use with wide streams and strings of wchar_t. You can, in principle, specialize ctype for other types, such as a character type that you define. To do so requires rather more skill than is at first apparent, however. I may one day describe all the steps involved in building streams and strings of exotic character types, and locale facets to match, but not here and now.

Certain properties are common to all specializations of template class ctype. These are distilled out into the class ctype_base, as shown in Listing 1. This class supplies the necessary heritage from class locale::facet, and the base-class constructor with the behavior expected by other members of class locale. It also supplies an enumeration named mask. You will recognize the names of the enumeration constants from the <ctype.h> heritage. The values I show here make use of the mask-bit macros from my implementation of the Standard C library. There is no requirement that the encodings match up between C and C++ — I use them because they capture an encoding that is known to work properly.

Listing 2 shows one way to implement template class ctype. Listing 3 shows one way to implement the explicit specialization ctype<char>. It differs from the basic template definition primarily in its treatment of the character-classification functions:

You can specify your own _Ctype-style table when you construct such a facet.

The member functions ctype<char>::is, ctype<char>::scan_is, and ctype<char>::scan_not, which perform character classification, do not call underlying virtual member functions.

These differences were introduced with an eye to improving performance over the template version.
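To make the first of these differences concrete, here is a sketch of a facet built on a private classification table. It relies on the ctype<char> constructor that takes a table pointer and on the static members classic_table and table_size, as they appear in the published standard; the draft may differ in detail. The names comma_is_space and split_on_commas are hypothetical.

```cpp
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: a ctype<char> whose table is a copy of the classic
// "C" table, except that ',' is also classified as white space.
struct comma_is_space : std::ctype<char> {
    static const mask* make_table() {
        // The table must outlive every facet that uses it, hence static.
        static std::vector<mask> table(classic_table(),
                                       classic_table() + table_size);
        table[','] |= space;
        return &table[0];
    }
    explicit comma_is_space(std::size_t refs = 0)
        : std::ctype<char>(make_table(), false, refs) {}
};

// Split text into fields, letting the stream extractor do the work.
std::vector<std::string> split_on_commas(const std::string& text) {
    std::istringstream in(text);
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(std::locale::classic(), new comma_is_space));
    std::vector<std::string> fields;
    std::string field;
    while (in >> field)
        fields.push_back(field);
    return fields;
}
```

Because ordinary spaces keep their classification, the string "a,b c,d" yields four fields under this facet.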

I won't discuss this implementation in any detail, because those details are either a) obvious, given the functionality required, or b) obscure, given the magical behavior of the underlying locale machinery. I will describe what the member functions actually do in more detail shortly, but first I provide a few examples of how to use this facet in real life.

One obvious way is as a more-or-less direct substitute for the older <ctype.h> macros and functions. Say you have a locale object loc in hand, and you want to test whether some character code ch, of type char, is classified as alphanumeric in the locale described by loc. Then you can write isalnum(ch, loc), as opposed to the older expression isalnum(ch), which tests ch in terms of the global locale defined by the Standard C library. The two-argument version is supplied by the template function isalnum, defined in <locale>:
template<class E>
bool isalnum(E ch, const locale& loc)
  {return (use_facet< ctype<E> >(loc)
      .is(ctype_base::alnum, ch)); }
Here's what's going on. First, the template function use_facet paws through the locale object loc in search of an instance of the facet ctype<char>. It should find one, since all locales are born with such a facet. (See last month's installment for more details, and a few caveats.) If it succeeds, the function returns a const reference to the facet. Otherwise, it throws an exception.

The member function ctype<char>::is takes a mask argument, in this case ctype_base::alnum, and a character value, ch. It determines whether ch "is" a character of the classification specified by the mask. The ultimate test looks pretty familiar:
return ((_Ctype._Table[(unsigned char)_C] & _M) != 0);
Aside from an added cast, which may or may not produce actual executable code, this comes close to the macro definition of yore. The explicit specialization avoids the call to a virtual member function, as I mentioned above, which also improves the chances that the code will truly be inlined.

But all those savings are certainly swamped by the cost of calling use_facet in the first place. I won't show all the details of use_facet, which are extremely tedious, but I can assure you that the function must perform many operations that are not easily optimized away. The bottom line is that, despite an occasional nod toward good performance, character classification using locale objects is a much more expensive operation than when using the functions in <ctype.h>.
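One common way to contain that cost is to look up the facet once and reuse the reference. The sketch below contrasts the two styles. Both function names are hypothetical, and no timings are claimed; how much the second form saves depends on the implementation.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Looks up the ctype<char> facet again for every character tested.
std::size_t count_alnum_simple(const std::string& text,
                               const std::locale& loc) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < text.size(); ++i)
        if (std::isalnum(text[i], loc))
            ++n;
    return n;
}

// Looks up the facet once, then calls is() directly inside the loop.
// The reference stays valid for as long as loc is alive.
std::size_t count_alnum_hoisted(const std::string& text,
                                const std::locale& loc) {
    const std::ctype<char>& fac = std::use_facet<std::ctype<char> >(loc);
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < text.size(); ++i)
        if (fac.is(std::ctype_base::alnum, text[i]))
            ++n;
    return n;
}
```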

The draft Standard C++ library never actually calls template function isalnum or its ilk. Rather, you will see patterns much like this one, from the header <istream>:
template<class E, class Tr = char_traits<E> >
class basic_istream {
.....
  const ctype<E>& fac = use_facet< ctype<E> > (ios_base::getloc());
  Tr::int_type ch;
  .....
  fac.is(ctype_base::space, Tr::to_char_type(ch));
Class istream from the early days of iostreams is now replaced by template class basic_istream. Its template parameters are the element type E, and the corresponding "traits" (properties) class Tr. The name istream is now a typedef for basic_istream<char>.

To skip white space, a member function needs to perform the modern equivalent of isspace(ch), using the locale imbued in the stream. Thus, ios_base::getloc() delivers up a copy of the imbued locale object, and use_facet determines the relevant ctype<E> facet. The member function is is then called as in the example above.

You will find usages like this throughout a modern Standard C++ library. If you define your own inserters or extractors, and you wish to keep them honest with regard to any locale dependencies, you should copy these patterns in your own code.

Using Facet ctype

Now you have some idea why you might want to use template facet ctype. Here is a brief summary of how to use it. I list the member types and functions in alphabetical order, for want of any more compelling order.
typedef E char_type;
The type is a synonym for the template parameter E.
explicit ctype(size_t refs = 0);
The constructor initializes its locale::facet base object with locale::facet(refs).
virtual bool do_is(mask msk, E ch) const;
virtual const E *do_is(const E *first, const E *last,
    mask *dst) const;
The first protected member template function returns true if MASK(ch) & msk is nonzero, where MASK(ch) designates the mapping between an element value ch and its classification mask, of type mask. The name MASK is purely symbolic here; it is not defined by the template class. For an object of class ctype<char>, the mapping is tab[(unsigned char)(char)ch], where tab is the stored pointer to the ctype mask table.

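The second protected member template function stores in dst[I] the value MASK(first[I]), where I ranges over the interval [0, last - first). Unlike the first form, it takes no msk argument, so it stores the complete mask for each element. It returns last.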
virtual char do_narrow(E ch, char dflt) const;
virtual const E *do_narrow(const E *first, const E *last,
    char dflt, char *dst) const;
The first protected member template function returns (char)ch, or dflt if that expression is undefined.

The second protected member template function stores in dst[I] the value do_narrow(first[I], dflt), for I in the interval [0, last - first).
virtual const E *do_scan_is(mask msk, const E *first,
    const E *last) const;
The protected member function returns the smallest pointer p in the range [first, last) for which do_is(msk, *p) is true. If no such value exists, the function returns last.
virtual const E *do_scan_not(mask msk, const E *first,
    const E *last) const;
The protected member function returns the smallest pointer p in the range [first, last) for which do_is(msk, *p) is false. If no such value exists, the function returns last.
virtual E do_tolower(E ch) const;
virtual const E *do_tolower(E *first, E *last) const;
The first protected member template function returns the lowercase character corresponding to ch, if such a character exists. Otherwise, it returns ch.

The second protected member template function replaces each element first[I], for I in the interval [0, last - first), with do_tolower(first[I]).
virtual E do_toupper(E ch) const;
virtual const E *do_toupper(E *first, E *last) const;
The first protected member template function returns the uppercase character corresponding to ch, if such a character exists. Otherwise, it returns ch.

The second protected member template function replaces each element first[I], for I in the interval [0, last - first), with do_toupper(first[I]).
virtual E do_widen(char ch) const;
virtual const char *do_widen(char *first, char *last, E *dst) const;
The first protected member template function returns E(ch).

The second protected member template function stores in dst[I] the value do_widen(first[I]), for I in the interval [0, last - first).
bool is(mask msk, E ch) const;
const E *is(const E *first, const E *last, mask *dst) const;
The first member function returns do_is(msk, ch). The second member function returns do_is(first, last, dst).
char narrow(E ch, char dflt) const;
const E *narrow(const E *first, const E *last, char dflt, char *dst) const;
The first member function returns do_narrow(ch, dflt). The second member function returns do_narrow(first, last, dflt, dst).
const E *scan_is(mask msk, const E *first, const E *last) const;
The member function returns do_scan_is(msk, first, last).
const E *scan_not(mask msk, const E *first, const E *last) const;
The member function returns do_scan_not(msk, first, last).
E tolower(E ch) const;
const E *tolower(E *first, E *last) const;
The first member function returns do_tolower(ch). The second member function returns do_tolower(first, last).
E toupper(E ch) const;
const E *toupper(E *first, E *last) const;
The first member function returns do_toupper(ch). The second member function returns do_toupper(first, last).
E widen(char ch) const;
const char *widen(char *first, char *last, E *dst) const;
The first member function returns do_widen(ch). The second member function returns do_widen(first, last, dst).
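A short sketch may help tie several of these members together. It uses only the public member functions scan_is, scan_not, toupper, and widen, with the signatures of the published standard. The helpers first_word_upper and widen_all are hypothetical names chosen for illustration.

```cpp
#include <locale>
#include <string>

// Return the first run of alphabetic characters in text, converted to
// upper case, or an empty string if there is no such run.
std::string first_word_upper(const std::string& text,
                             const std::locale& loc) {
    const std::ctype<char>& fac = std::use_facet<std::ctype<char> >(loc);
    const char* first = text.data();
    const char* last = first + text.size();
    // scan_is finds where the word begins, scan_not where it ends.
    const char* b = fac.scan_is(std::ctype_base::alpha, first, last);
    const char* e = fac.scan_not(std::ctype_base::alpha, b, last);
    std::string word(b, e);
    if (!word.empty())
        fac.toupper(&word[0], &word[0] + word.size());  // in place
    return word;
}

// Convert a narrow string to a wide one, element by element, using the
// sequence form of widen from the ctype<wchar_t> facet.
std::wstring widen_all(const std::string& text, const std::locale& loc) {
    std::wstring wide(text.size(), L'\0');
    if (!text.empty()) {
        const std::ctype<wchar_t>& fac =
            std::use_facet<std::ctype<wchar_t> >(loc);
        fac.widen(text.data(), text.data() + text.size(), &wide[0]);
    }
    return wide;
}
```

Note that widen maps one char to one element. It is not a multibyte decoder, so it suits the basic character set far better than, say, Shift JIS text.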

Facets by Name

Many, but not all, facets defined in the draft Standard C++ library can also be constructed "by name." The "name" in question here is the null-terminated string locname, of type const char *, you might use as an argument to the Standard C library function setlocale, declared in <locale.h>. If you can legitimately call:
setlocale(LC_CTYPE, locname);
to alter the LC_CTYPE category of the global locale in the Standard C library, then you can also construct the facet:
ctype<char> *fac =
    new ctype_byname<char>(locname);
You can then, say, alter the behavior of the stream cin to match this locale by executing:
cin.imbue(locale(cin.getloc(), fac));
Listing 4 shows one way to implement template class ctype_byname. There's not much to it, but it sits atop a small mountain of code.
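For completeness, here is a sketch of constructing a facet by name with a fallback. The helper with_named_ctype is hypothetical. The sketch assumes that an unrecognized name makes the ctype_byname constructor throw std::runtime_error, which is what implementations typically do; which names are recognized, other than "C", depends on the host system.

```cpp
#include <locale>
#include <stdexcept>

// Return a copy of base whose ctype<char> facet is replaced by the one
// named locname. If the name is not recognized, return base unchanged.
std::locale with_named_ctype(const std::locale& base, const char* locname) {
    try {
        // The new locale object takes ownership of the facet.
        return std::locale(base, new std::ctype_byname<char>(locname));
    } catch (const std::runtime_error&) {
        return base;   // assumed failure mode for an unknown name
    }
}
```

A call such as cin.imbue(with_named_ctype(cin.getloc(), locname)) then changes only the character classification rules of the stream and leaves its other facets alone.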

P.J. Plauger is Senior Editor of C/C++ Users Journal and President of Dinkumware, Ltd. He is the author of the Standard C++ Library shipped with Microsoft's Visual C++, v5.0. For eight years, he served as convener of the ISO C standards committee, WG14. He remains active on the C++ committee, J16. His latest books are The Draft Standard C++ Library, Programming on Purpose (three volumes), and Standard C (with Jim Brodie), all published by Prentice-Hall. You can reach him at pjp@plauger.com.