Parsing numeric input has always been a messy affair. Standard C++ locales add culture dependence to the mix as well.
Prolog
Well, the C++ Standard is finally on its way. Barring any unpleasant administrative surprises, we should now know what Standard C++ will look like for the next half decade or so. You can read the details elsewhere in this issue. (See Pete Becker's column.) I celebrate the change in this column by a small but significant alteration in style. Instead of calling it the draft C++ Standard, I will henceforth refer to the language definition as the C++ Standard. May it live long and prosper.
Introduction
Last month, I described the thoroughly modern way to convert numeric values within a C++ program to human-readable text. (See "Standard C/C++: The Facets num_put and numpunct," CUJ, January 1998.) This is the magic that gets invoked when you insert, say, a value n of type short into the standard output stream cout, as in:
    cout << n;
The expression statement calls the function:
    ostream::operator<<(ostream&, short)
where the type ostream is now a synonym for the template specialization:
    basic_ostream<char, char_traits<char> >
Text is inserted into the stream buffer associated with cout by an output iterator OutIt of type:
    ostreambuf_iterator<char, char_traits<char> >
The actual conversion to text occurs within the member function put in a locale facet of type:
    num_put<char, OutIt>
All this machinery is a far cry from the early days of C++. The first implementations of iostreams basically just called printf, or perhaps sprintf, like any good C programmer would do. Indeed, the behavior of the put functions in num_put is defined in terms of equivalent printf calls. Inserters supply pretty notation, and better type checking than in C, but they encapsulate much the same simple functionality of yore. (And as I observed last month, num_put<char, Iter>::put typically relies on sprintf to do the actual conversion.)
As you might expect, the reverse process has been blessed with equally modern machinery in the Standard C++ library. To extract a text field from the standard input stream cin and convert it to a numeric value n of type short, you can write:
    cin >> n;
The expression statement calls the function:
    istream::operator>>(istream&, short&)
where the type istream is now a synonym for the template specialization:
    basic_istream<char, char_traits<char> >
Text is extracted from the stream buffer associated with cin by an input iterator InIt of type:
    istreambuf_iterator<char, char_traits<char> >
The actual conversion from text occurs within the member function get in a locale facet of type:
    num_get<char, InIt>
The analogous machinery in C is the function scanf and its brethren. Indeed, the behavior of the get functions in num_get is defined in terms of equivalent scanf calls. Extractors, like inserters, supply pretty notation, and better type checking than in C, but they too encapsulate much the same simple functionality.
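To make the chain concrete, here is a minimal sketch of calling the facet directly, bypassing the extractor. The helper name parse_long is my own invention for illustration, and the string stream serves only as a convenient source of a stream buffer, a locale, and formatting flags. Note that the facet, unlike the extractor, does not skip leading white space.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the library): parse a long from text
// by calling num_get directly, roughly what cin >> n arranges to do.
bool parse_long(const std::string& text, long& value)
{
    typedef std::istreambuf_iterator<char> InIt;
    typedef std::num_get<char, InIt> NumGet;

    std::istringstream in(text);
    const NumGet& fac = std::use_facet<NumGet>(in.getloc());

    std::ios_base::iostate state = std::ios_base::goodbit;
    fac.get(InIt(in.rdbuf()), InIt(), in, state, value);

    // failbit reports a missing or malformed field, or a value out of range
    return (state & std::ios_base::failbit) == 0;
}
```

A call such as parse_long("1234", n) should leave 1234 in n and return true, while a field such as "abc" should return false.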
Here is where the parallelism breaks down. The function printf is reasonably well designed and has been used widely for decades. But the function scanf is much less successful. The C Standards committee haggled over its details for quite some time. Eventually, we concluded that the function is of limited usefulness. Programmers simply have too many conflicting needs for parsing text fields and handling error conditions to be satisfied by one all-purpose scanning function.
The C++ Standard can, should, and does look to scanf for a precise definition of acceptable input fields. But an implementation of the Standard C++ library should not necessarily endeavor to turn each extractor call into an actual call on scanf or one of its variants. Put simply, the work you have to do to set up for a call to scanf isn't worth the benefit. Better to call directly the underlying functions that scanf might call, such as strtol or strtod.
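As a rough illustration of the direct approach, the sketch below converts an already-assembled text field with strtol and does its own error checking. The function name field_to_short and its treatment of trailing characters as an error are assumptions of mine, not anything the C++ Standard prescribes.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper: convert a complete text field to short by calling
// strtol directly. Returns true on success; leaves value alone on failure.
bool field_to_short(const char* field, short& value)
{
    char* end = 0;
    errno = 0;
    const long n = std::strtol(field, &end, 10);

    if (end == field || *end != '\0')
        return false;       // no digits at all, or trailing junk
    if (errno == ERANGE || n < SHRT_MIN || SHRT_MAX < n)
        return false;       // too big for a long, or for a short
    value = static_cast<short>(n);
    return true;
}
```

The caller decides what counts as an error and what to do about it, which is exactly the flexibility that a single all-purpose scanning function has trouble providing.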
Template Class num_get
My goal this month is to describe template class num_get, as part of my ongoing tour of the locale facets defined in the Standard C++ library. You'll have to go back a few months to read an overview of facets in general. (See "Standard C/C++: Introduction to Locales," CUJ, October 1997.) All I'll say here is that num_get encapsulates all the logic for converting a text field to a Boolean, integer, floating-point, or void pointer value. Every locale object contains references to two num_get facets, with types:
    num_get<char, istreambuf_iterator<char, char_traits<char> > >
    num_get<wchar_t, istreambuf_iterator<wchar_t, char_traits<wchar_t> > >
Numeric-conversion logic can in principle depend on culture-specific rules. Indeed, each of the num_get facets relies on its corresponding numpunct facet, just as the num_put facets do, for various interesting bits of information. (See last month's column.) The numpunct member function decimal_point, for example, may return a comma in a European locale. The num_get member function get thus knows to treat a comma as a decimal point when reading a floating-point text field.
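You can see this dependence without installing a real European locale. The sketch below derives a numpunct facet that reports a comma as the decimal point and imbues it into a string stream. The names comma_point and parse_with_comma are hypothetical, made up for this example.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: the classic numpunct, except that the decimal
// point is a comma, as many European locales would report.
class comma_point : public std::numpunct<char>
{
protected:
    virtual char do_decimal_point() const { return ','; }
};

// Parse a floating-point field under the comma-as-decimal-point rule.
bool parse_with_comma(const std::string& text, double& value)
{
    std::istringstream in(text);
    // The new locale takes ownership of the facet and deletes it later.
    in.imbue(std::locale(in.getloc(), new comma_point));
    return !(in >> value).fail();
}
```

With this facet in place, the field "3,5" should convert to the value 3.5, because num_get asks numpunct what a decimal point looks like.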
Listing 1 shows one way to implement the extractor for a short integer. It illustrates in detail how the facet num_get is used in real life. In brief:
- The member function ios_base::getloc obtains a copy of the locale object "imbued" into the object of class basic_istream<_E, _Tr>.
- The template function use_facet<_Nget> obtains from its locale object argument a const reference to the facet _Nget (a typedef for num_get<_E, _Iter> here), which should always be present.
- The facet reference _Fac can be used to call any of several versions of _Fac.get to extract a field (sequence of elements) from the input stream and convert it.
- _Fac.get extracts elements from the stream buffer (obtained by calling _Myios::rdbuf()) using an input iterator of type _Iter (a typedef for istreambuf_iterator<_E, _Tr>) to assemble the input field.
- _Fac.get obtains stored formatting information from the object of class basic_istream<_E, _Tr> (*this).
- _Fac.get stores the converted field value in the object _Y (of type long in this case). It can also throw an exception if anything goes wrong along the way.
- The extractor ensures that the converted value can be properly represented in a short integer, then stores the converted value.
The facet is thus responsible for parsing the input and doing the conversion. It doesn't define an overload of get for type short because the one for long can do the job, with a little help from the extractor.
Listing 2 shows one way to implement (most of) template class num_get. As with earlier facets, I choose to omit most of the implementation-specific magic code. It's a distraction. Most of the interesting action occurs in the overloads for do_get. These protected virtual member functions deal with the quite different parsing requirements of Boolean, integer, floating-point, and void pointer values.
A few hints to minimize distractions:
- The macro _NARROW converts a member of the element type to type char. If the character code is not a member of the basic C character set, the macro yields zero (a null character).
- The macro _WIDEN converts a member of the basic C character set, as type char, to the element type. Typically, this involves little or no work.
- The macros _MAX_INT_DIGIT, _MAX_EXP_DIGIT, and _MAX_SIG_DIGIT are ones I've used for years to specify the number of characters (elements) needed to represent the largest possible integer and floating-point values.
- The private member function _Getifld parses an integer input field. It must also check for "thousands separators," such as the commas widely used to separate groups of three digits. If any separators are present, then the field must obey the grouping rules spelled out in the numpunct facet. The resulting logic can charitably be described as interesting.
- The private member function _Getffld parses a floating-point input field.
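The grouping check performed for integer fields is easy to observe from outside. The sketch below installs a numpunct facet that calls for a comma between groups of three digits, then extracts a long. The names grouped_punct and parse_grouped are hypothetical, and the code exercises whatever num_get your library supplies, not the code in Listing 2.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: comma as thousands separator, groups of three.
class grouped_punct : public std::numpunct<char>
{
protected:
    virtual char do_thousands_sep() const { return ','; }
    virtual std::string do_grouping() const { return "\3"; }
};

// Extract a long from text, subject to the grouping rules above.
bool parse_grouped(const std::string& text, long& value)
{
    std::istringstream in(text);
    // The new locale takes ownership of the facet.
    in.imbue(std::locale(in.getloc(), new grouped_punct));
    return !(in >> value).fail();
}
```

A properly grouped field such as "1,234,567" should convert to 1234567. A field with misplaced separators, such as "12,34", should set failbit.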
Note that the code in Listing 2 doesn't do the actual conversion from an integer or floating-point text field to an internal encoding. Integers are converted by calling the Standard C library functions strtol or strtoul. Void pointers are converted much the same way, except that the code provides a way to represent pointers with more bits than an unsigned long.
Floating-point values are converted by calling the special functions _Stof, _Stod, or _Stold. These behave much like the Standard C library function strtod except that they also accept a power-of-ten scaling factor. A careful implementation of these special functions can more easily and portably handle values too large to represent. (For more detail, see my book, The Draft Standard C++ Library, Prentice-Hall, 1995.)
Conclusion
You now have a nodding acquaintance with the facets num_get, num_put, and numpunct. The question is, why should you care? You might conceivably want to write your own locale-dependent numeric converters. But you can probably do a serviceable job using the iostreams facilities in <sstream> or <strstream>. Both are rather easier to use than the raw facet facilities.
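As one possible shape for such a converter, the sketch below wraps a string stream in a small template function. The name from_text is hypothetical, and the decision to reject trailing characters other than white space is a design choice of mine, not a requirement.

```cpp
#include <istream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical converter: parse all of text as a T under locale loc.
// Returns true on success; leaves value alone on failure.
template <class T>
bool from_text(const std::string& text, const std::locale& loc, T& value)
{
    std::istringstream in(text);
    in.imbue(loc);

    T tmp = T();
    if (!(in >> tmp))
        return false;       // no valid field, or value out of range
    in >> std::ws;          // swallow any trailing white space
    if (!in.eof())
        return false;       // something else follows the field
    value = tmp;
    return true;
}
```

Pass std::locale::classic() for the "C" rules, or any locale object you have constructed, and the stream picks up the matching num_get and numpunct facets for you.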
More likely, you might want to write your own version of one of these facets. In that case, it helps to know what they do. Nevertheless, your best bet is to derive a subclass from one of these template classes. That way, you can simply override the virtual do_get functions you want to muck with. Leave the others alone. Even better, call on the unmodified member functions to do all the hard work.
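Here is a minimal sketch of that approach. The facet nonneg_get and the helper parse_nonneg are hypothetical names. The override handles only type long, and it simply refuses negative values after letting the base class do the parsing and conversion.

```cpp
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: reads long values as usual, but rejects negatives.
// Some compilers warn that this override hides the other do_get
// overloads. Calls made through get() still reach them.
class nonneg_get : public std::num_get<char>
{
protected:
    virtual iter_type do_get(iter_type first, iter_type last,
        std::ios_base& str, std::ios_base::iostate& err, long& val) const
    {
        long tmp = 0;
        // Let the unmodified base class do all the hard work.
        first = std::num_get<char>::do_get(first, last, str, err, tmp);
        if (!(err & std::ios_base::failbit))
        {
            if (tmp < 0)
                err |= std::ios_base::failbit;  // our one added rule
            else
                val = tmp;
        }
        return first;
    }
};

// Extract a long through a stream whose locale holds the new facet.
bool parse_nonneg(const std::string& text, long& value)
{
    std::istringstream in(text);
    // The new locale takes ownership of the facet.
    in.imbue(std::locale(in.getloc(), new nonneg_get));
    return !(in >> value).fail();
}
```

Because the derived class inherits the id of num_get<char>, the new locale uses it wherever the stock facet would have been used.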
Locale facets are pretty powerful, but then so is my VCR. I avoid messing with the programming of either as much as possible.
P.J. Plauger is Senior Editor of C/C++ Users Journal and President of Dinkumware, Ltd. He is the author of the Standard C++ Library shipped with Microsoft's Visual C++, v5.0. For eight years, he served as convener of the ISO C standards committee, WG14. He remains active on the C++ committee, J16. His latest books are The Draft Standard C++ Library, Programming on Purpose (three volumes), and Standard C (with Jim Brodie), all published by Prentice-Hall. You can reach him at pjp@plauger.com.