The Learning C/C++urve

Columns

The Learning C/C++urve

Bobby Schmidt

C->C++ Mutations, Part 3

When bad things happen to good code, it's time to seek professional help. Bobby shows how to avoid major trauma when migrating code from C to C++.
Copyright ©1996 Robert H. Schmidt

Yes, sports fans, it's time for another action-packed matchup, as C++ compilers take on ported C code. As an added attraction, this month we have a new challenger: the folks at Symantec have graciously sent me Release 5 of their Power Macintosh C++ Development System.

Comments in C and C++

Even if you are brand spanking new to C++, unless you've been living in the Q Continuum all these years, you almost certainly know that C++ supports two forms of source comments:
non_comment /* comment */ non_comment
non_comment // comment from here
            // to end-of-line
The first form comes straight from C, and has all the features (and limitations) of C comments. The second form is new, and should be familiar to anyone programming in assembler, where comments are typically terminated by the end of their source line. Unfortunately, these new comments in C++ render certain C constructs invalid.

Imagine that you work in the test or QA department of a group building a C compiler. Further imagine that you create the test cases
int x = 6 / 1;
int y = 6 * 1;
as exercises for the compiler's optimizer. After your test case reviewers tell you to add more comments to your code, you change the above to
int x = 6 //* optimized out */ 1;
int y = 6 */* optimized out */ 1;
This is valid C and should compile. If you later migrate this test code to C++, it most definitely should not compile. On my system, both compilers complain
';' expected
on the first line. The problem is one of tokenization. When C scans the line, it sees the token sequence starting
int  x  =  6  /  /*
C++ scans the same line as starting
int  x  =  6  //  *
Once C++ thinks it sees the // comment, it ignores everything else on the line, as if you had originally written
int x = 6
with no trailing semicolon.

In practice, I'd be genuinely surprised if any of your C code actually contains the sequence //*. If it does, just add a space after the first / to make the ambiguity disappear.

I adore these // comments. In fact, about the only time I use /*...*/ is for "commenting out" code. Now you astute types may think "look Bobby, the more robust way to comment out code is with the preprocessor." And verily you would be right, for
#if 0
 /* price of tea in China */
    int t;
#endif
always works, while
/*
    /* price of tea in China */
    int t;
*/
doesn't (comments can't nest). While I'm wearing my "setting a good example (a.k.a. teacher)" hat, I'll recommend you use the #if 0 method to disable sections of code; while I'm wearing my "do as I say not as I do (a.k.a. parent)" hat, I'll also note there can be advantages to using /*...*/ where possible.

In particular, if your code editor uses syntax coloring, the chances are excellent it can put /*...*/ lines in a unique color, while the chances are about zero it can similarly detect and color #if 0 ... #endif. If you know of editors that color code disabled by #if 0, please let me know, and I'll mention them here.

Note that the proposed C9X revisions to Standard C include //comments, meaning the porting problem shown here eventually will occur even if you stay within the same language.

Keywords

The // problem above comes from C++ adding a new token that didn't exist in C, so that the character sequence //* has different meanings in each language. C++ actually adds a raft of other new tokens. While they don't suffer the scanning ambiguity of comments, many do suffer the more serious problem of colliding with identifiers.

In general, every time C++ adds a new keyword token to the language, it destroys the use of that same word as an identifier. Any code that uses such identifiers suddenly stops compiling, often with obfuscated error messages. To demonstrate, run the valid C definition
struct operator
    {
    char x;
    int class;
    long public;
    };
through a C++ compiler and note the results. On my machine, Symantec yields the diagnostics
Line 2 : not an overloadable
    operator token
Line 3 : size of __unnamed is
    not known
Line 4 : illegal combination
    of types
Line 5 : '=', ';' or ',' expected
Line 6 : identifier or
    '( declarator )' expected
while Metrowerks just generally complains about bad declaration syntax.

In C, the names operator, class, and public are freely available as identifiers; in C++, these same names are keywords. Tweak those three identifiers slightly, as in
struct operator_ // changed
    {
    char x;
    int class_; // changed
    long public_; // changed
    };
and the code compiles cleanly as C++.

This business of adding new keywords is something the Standards committees do not take lightly. Committee members are abundantly aware that every new keyword we introduce breaks existing conformant code. However, we must weigh this penalty against the advances brought by new language extensions and features.

Some of the new C++ keywords like reinterpret_cast are not likely to collide with existing identifier usage; others, like class, new, and try, are much more likely to collide. Unlike earlier columns, where I could prescribe nostrums such as type casts, I honestly have no simple cure here. The only two options I see are to either manually search and replace such usages in your source files, or to let the compiler find them for you as shown here.

This identifier-keyword collision can be severe if you have a published API or library that exposes these names. Microsoft has run into this with the keyword bool. One of their OLE header files contains a union with a member named bool. As I understand it, this usage has helped delay introduction of built-in type bool into their Visual C++. (By the time you read this, Microsoft may have released Visual C++ Version 4.2 which may, or may not, remedy this.)

In Table 1, I list the names that are available as identifiers in Standard C, but are reserved as keywords in Standard C++ (as of the January 1996 Working Paper).

Tentative Definitions

Way back in November last year, I described C's notion of tentative definitions. The gist of tentative definitions is that, at file scope, C can't tell if a statement like
int x;
is a definition or just a declaration. In such contexts, C tentatively assumes the statement is a definition, unless and until it sees another statement that is unambiguously a definition. Tentative definitions make possible C constructs such as
int x; /* declaration and tentative
          definition */
int x; /* redeclaration of same 
          'x' */
int x = 0; /* real definition;
              previous tentative */
           /* definition is treated
              as just a */
           /* declaration */
This code is not valid C++, which disallows tentative definitions. C++, with its One Definition Rule, wants to treat the first declaration that can possibly be a definition as the definition. Any subsequent pretenders to that throne may generate a compiler diagnostic.

Compatible Types

Consider the two C translation units:
/*
 *  translation unit #1
 */
struct s1
    {
    char c;
    int i;
    long l;
    };
void f(struct s1 x)
    {
    /* ... do something with x */
    }
and
/*
 * translation unit #2
 */
struct s2
    {
    char c;
    int i;
    long l;
    };
extern void f(struct s2);
int main(void)
    {
    struct s2 x = {`a', -1, 10};
    f(x);
    return 0;
    }
struct s1 and struct s2 differ only in their tag names; otherwise they are identical, and constitute what C calls compatible types. Broadly speaking, this means that, within a conforming C program, struct s1 and struct s2 are interchangeable. In our example above, we are passing f an argument of type struct s2 even though f's definition is prototyped to accept a struct s1. Because struct s1 and struct s2 are compatible types, this apparent type mismatch is okay.

C++, which does not have the notion of compatible types, considers struct s1 and struct s2 distinct types. In the above example, both argument and parameter must be the same named type, struct s1 or struct s2, but not a mix as shown; otherwise the resulting C++ program is not well-formed. In what I interpret as a corollary of the One Definition Rule, C++ reckons compatibility by name equivalence, not by structure equivalence. In this example, the types must have the same name; just having a matching morphology won't cut it.

If you compile this code as C++, the compiler will accept it; as we've seen in months past, C++ doesn't look across translation units to ensure type compatibility. The linker, however, will detect the mismatched function signatures (recalling July's discussion of type-safe linkage).

Implicit int

Here's a biggie, a new C++ feature that fixes one of my oldest gripes about C: implicit int. Simply put, if you don't explicitly specify a type within a declaration, C assumes the type is int. That is, C considers the statements
extern n;
void f(const x);
main(void)
    {
    return 0;
    }
equivalent to
extern int n;
void f(int const x);
int main(void)
    {
    return 0;
    }
In all cases, the declaration specifier sequence (a.k.a. decl-spec-seq -- the "stuff" to the left of the name being declared) assumes type int [1] . C++ does not make this assumption, and will not allow the first example. This is a fairly recent change to Standard C++ (coming within the last year or so I believe), so your compiler may not enforce it.

Why did the C++ committees make this change? For several reasons:

The next version of Standard C will probably eliminate implicit int.

Most people seem to consider full declarations good form and proper style these days.

C++ compilers can't syntactically tell the difference between a function-style cast or construction, and a declaration with implicit int. Thus, compilers have trouble generating accurate and meaningful diagnostics for such code.

This last point is demonstrated by the C++ statement
static t(x);
What is this? The answer depends on the semantics of t and x -- you, and more importantly, the compiler, can't reach a conclusion based on syntax alone. Possibilities include:

the definition of a static object x of type t.
the definition of a static object t of type (implicit) int, initialized to x.
the declaration of a static function t accepting a parameter of type x and returning an (implicit) int.

Were implicit int not allowed, the only possible interpretation would be the first one.

Scope of Nested struct Names

You may be surprised to know that
struct s
    {
    struct t
        {
        int i;
        } j;
    };
struct t x; /* access to "hidden" struct t? */
compiles as Standard C. I know when I first learned this I was disappointed, figuring that t should not be available outside the scope of the (apparently) enclosing structure s. But in fact, this sequence is tantamount to
struct t
    {
    int i;
    };
struct s
    {
    struct t j;
    };
struct t x;
As you can easily imagine, C++ tightens these scoping rules. t is unavailable outside s without explicit scope resolution, as in
s::t x;
Even this won't work if t is declared as private within s. For C++ to act otherwise would destroy the ability of classes to encapsulate and hide other types.

Type Name Spaces

C++ changes struct name scope and visibility in other ways. Here's a program that is different from previous examples in that it actually runs and links in both languages:
#include <stdio.h>
char s[10];
int main(void)
    {
    struct s
        {
        char i;
        };
    printf("sizeof(s) == %lu\n",
            (unsigned long) sizeof(s));
    return 0;
    }
Since sizeof(char) is 1 and assuming structures are byte-aligned, what does this program print? The answer depends on the language you compile it under. With C, the program writes
sizeof(s) == 10
while with C++ you get
sizeof(s) == 1
The difference comes from how each language interprets s in the expression sizeof(s). C clearly sees s as referring to the array, while C++ sees s as referring to the locally defined struct. To add to this mystery, if you change the struct s definition to
typedef struct s
    {
    char i;
    } s;
rebuild and rerun, you now see
sizeof(s) == 1
in both languages.

The C typedef makes explicit what C++ makes implicit: declared struct names automatically go into both the tagged name space and the "normal" (untagged) name space [2] . Therefore, in C++ the definition
struct s
    {
    char i;
    };
is equivalent to
typedef struct s
    {
    char i;
    } s;
C does not automatically clone tagged names into the untagged name space this way. Before we added the typedef, the only s C knew was the outer array object; after the typedef, a local type name s appeared in both name spaces just as it does with C++. The local name masked the outer name, leading to the new result.

C++ behaves this way as a notational convenience. Remember that user-defined types are supposed to look built-in, a theme I hammered to death in my Boolean series earlier this year. Imagine how less built-in that Boolean would look if we had to write
class Boolean f(class Boolean);
instead of
Boolean f(Boolean);
structs and typedefs

These new C++ name space rules alter how structs and typedefs interact. Because C++ implicitly puts a struct name in both name spaces, you can't turn around and use the (implicitly) typedefed name as another typedef. To see this, let's modify our earlier example to
struct s
    {
    char i;
    };
typedef int s;
and compile in both languages.

As you probably expect, this works in C but not in C++. Remember, the C declaration struct s does not put s in the untagged name space. Thus, when C encounters typedef int s, there's no existing untagged s to collide with. C++, as we established above, does put s in the untagged name space; when it sees typedef int s, it complains that s is already declared.

Similarly, you can't redefine a typedefed name to something else in the midst of a C++ struct definition. The illustration code
typedef long L;
struct s
    {
    L l;
    long L;
    };
is, as usual, valid C but ill-formed C++. C lets you redefine the semantics of L in the middle of the struct definition. (Whether or not this is sane design is another matter.) Name lookup rules in C++ are heinous enough, so thankfully that language bars such redefinitions; otherwise I can easily imagine member functions, especially those declared in situ, getting mighty confused over what L really means.

Coming Attractions

Some wondrous events to mark on your calendar:

Next month I'll present the fourth and final episode in this column series on C-> C++ Mutations. After that I feel another Why Files coming on ...

P.J. Plauger, Dan Saks, and I will once again hold court at the Embedded Systems Conference, in San Jose, just before the autumnal equinox.

You can also catch our act at the Software Development Conference in Washington D.C. around Halloween. I believe other CUJ luminaries will be there, including my predecessor in these pages, the indefatigable Chuck Allison.

Saks has started his "Strings Across America" Summer '96 Tour, teaching intro C++ in a dozen cities over as many weeks. I keep telling him he should print up tour T-shirts and get Spinal Tap to open. Be sure to bring your lighters for his dramatic encore of "Freebird."

All these conferences and shows are brought to you by our benevolent publisher Miller Freeman Inc., the official provider of copy constructors to the 1996 Summer Olympics.

Erratica

Diligent Reader Steve Willer e-mails from Toronto, asking about my use of the term "RTTI casts" in June. As Steve correctly points out, of the four types of casts I call RTTI casts, only one (dynamic_cast) actually needs to use run-time type information.

The C++ Working Paper bundles these casts into the generic category postfix-expression. I call them RTTI casts because

I wanted some nomenclature to distinguish them from older C-style and C++ function-style casts.

Since all the casts follow the same syntactic model, I decided to give them all some common name.

I noticed these casts appearing about the time RTTI did.

As noted above, one of the casts requires RTTI.

If you have a better name for these types of casts, please suggest it to me; if it's a winner, I'll start using it in my writing.

Notes

[1] To learn more about decl-spec-seq than perhaps any socially adept person should want, check out Dan "Earl Anthony" Saks' recent column series.

[2] The term "name space" as used here has no connection to the new C++ keyword namespace. I think that keyword is unfortunately named, since it demarcates a scoped declarative region more than a name space. Maybe it should have been called namescope.

[3] Available as macros in Standard C by including iso646.h.

[4] Available as a typedef in Standard C by including stddef.h. o

Bobby Schmidt is a freelance writer, teacher, consultant, and programmer. He is also a member of the ANSI/ISO C standards committee, an alumnus of Microsoft, and an original "associate" of (Dan) Saks & Associates. In other career incarnations, Bobby has been a pool hall operator, radio DJ, private investigator, and astronomer. You may summon him at 3543 167th Ct NE #BB-301, Redmond WA 98052; by phone at (206) 881-6990, or via Internet e-mail as rschmidt@netcom.com.