Some Notes about the C Programming Language

Abstract

While developing some low-level graphics libraries in the early 1980s, I stumbled upon a number of subtle issues in the C programming language. This document summarizes my findings. Some items are obsolete now (because of better documentation and the appearance of ANSI C and C++); other items are still relevant.

Introductory Remarks
Type Conversions
Functions as Procedures
Macros for In-Line Functions and Procedures
Parentheses in Macro Bodies
Choosing between Macro and Function
Macros versus Functions: Important Differences
Macros versus Functions: Hiding Power
Macros for Other Purposes
Modularization: Include Files
Constants and Fixed Variables
Stand-Alone Applications (Software for the IOP)
Objects and Their Addresses
Alignment
Efficiency
C Implementation on MC68000 (on TNO's Geminix)
Declarators and Type Names
Pointers
Declaration versus Definition (also covering Storage Allocation)
References

Introductory Remarks

The C programming language was used to program the processors involved in the ``GKS-workstation-in-IOP'' project. There were two environments:

C under UNIX on the MC68000-based GEMINIX from TNO, and
C on a bare MC68000 I/O processor (called IOP).

For both environments the MC68000 C compiler by ACE [2] was employed. Especially the second environment required an intimate knowledge of C and its implementation, since I had to write efficient low-level graphics device drivers. This document describes a few conventions I followed when using C. It takes away the need to repeat all kinds of explanations in the documentation parts of the C source files.

While developing the software I hit upon many gaps in my knowledge of C. These notes summarize my discoveries insofar as they cannot easily be found elsewhere. Many things discussed in this document are also treated in [1], although they are often hidden or only mentioned on the fly. Most issues are explicitly touched upon in [5] (alas, I discovered the book only after this document was written). The reader is assumed to have working experience with C.

Type Conversions

Some of the graphics libraries developed in the project heavily rely on type conversions. Be aware of this when reading assignments, function invocations, and return statements, which (may) imply a type conversion. In most cases it is not explicitly mentioned that the automatic type conversion is indeed desirable. A cast can always be used to make a type conversion explicit. Also see [5, Ch.6].

WARNING: The automatic conversions applied to actual function parameters on the one hand, and those applied to right-hand operands of assignments and to return arguments on the other, are not exactly the same! A noticeable difference is seen in the following example:

f(a)
float a; /* N.B. adjusted to read as double [1, p.205] */
{ ... }
...
{
	short i;
	float r;

	r = i; /* i automatically converted to int, then to float */
	f(i);  /* i automatically converted to int, NOT float */
	f((float)i); /* i converted to float, then automatically to double */
}
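
Under ANSI C the picture changes once a prototype is in scope: the argument is then converted as if by assignment to the parameter type. The default promotions still apply where no parameter type is known, for instance in the variable part of an argument list. The sketch below contrasts the two cases; the names half(), first_vararg(), half_of_short() and promoted() are hypothetical and do not come from the project sources.

```c
#include <stdarg.h>

/* ANSI C prototype in scope: the argument is converted to float,
   just as in the assignment r = i. */
static float half(float a)
{
    return a / 2;
}

/* Variable argument list: no parameter type is known, so the default
   promotions apply and a float argument arrives as a double. */
static double first_vararg(int count, ...)
{
    va_list ap;
    double d;

    va_start(ap, count);
    d = va_arg(ap, double);     /* must read double, never float */
    va_end(ap);
    return d;
}

static float half_of_short(short i)
{
    return half(i);             /* i converted to float by the prototype */
}

static double promoted(short i)
{
    return first_vararg(1, (float)i);   /* float promoted to double */
}
```

With a prototype, half_of_short(3) yields 1.5 without any cast. With the old-style definition of f shown above, the cast remains necessary.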

Functions as Procedures

C does not know procedures. They can be simulated by functions that are invoked in an expression statement. The semi-colon (;) in an expression statement discards the returned value. In fact, a function need not explicitly return a value. This is accomplished by `return;' (without expression), or by ``flowing off the end of a function'' [1, p.203]. In both cases the return value is undefined. When the value returned by a function is undefined, it can only be ``used'' sensibly as an expression statement or as the first operand of the comma operator.

CONVENTIONS: True functions always return a well-defined value; the function's definition always contains an explicit type specifier, even if it is `int'; functions intended for use as procedures always return an undefined value and the type specifier in their definition is always omitted (so implicitly it is `int'). {N.B. We could have defined the dummy type `proc' for procedures, or `void' as in [5].} From now on we use the term `procedure' as if it were a C concept. `Function,' however, retains its double meaning. (We use the word `parameter' as an equivalent for `argument,' which is often used in [1].)

Macros for In-Line Functions and Procedures

We tried to hide whether an operation is implemented as function or as macro (#define). This means that a macro invocation should syntactically resemble function invocation. A function invocation is a particular form of expression. It is therefore safe if all macros for in-line functions are expressions. Any expression can be turned into a statement by appending a semi-colon (;); the resulting value is simply discarded. A macro for an in-line function, however, should never generate statements, because these are not expressions and would cause an asymmetry with function syntax (see example below). Consider the following macro definitions of S() (notice that parentheses without arguments also work for macros, see [5, Sec.3.3.2 on p.32]):

#define	S()	a = 0
#define	S()	a = 0;
#define	S()	a = 0; b = 1
#define	S()	{ a = 0; b = 1; }
#define	S()	( a = 0, b = 1 )

See what happens with the following way of using S(), depending on what definition is taken:

if (c) S(); else x = 0;

Notice that only the first and the last definitions generate expressions, giving no problems; the others will cause a syntax error (`else' is not recognized, because of multiple statements between `if (c)' and `else'). It is hard to repair this; one way would be to forbid the semi-colon after a macro invocation serving as in-line procedure (function invocations, simulating a procedure call, always need a semi-colon).

This also has to do with: (1) `;' is not a statement separator (or concatenator); (2) `;' is not a true statement terminator: (2.1) all expression statements end in semi-colon; (2.2) the compound statement does not end in semi-colon, and the if, while, and for statements do not have their ``own'' trailing semi-colon (they may inherit it from the final constituting statement, however, but only if this is not a compound statement); (2.3) all other statements end in their own semicolon; (3) ``structured'' statements, excepting the compound statement, do not take statement lists, but single statements as constituting part(s) (in fact this is the reason for the compound statement's existence).

All would have been well if: (1) `;' were a statement separator, or (2) `;' were a statement terminator, i.e. a compound statement had a compulsory trailing semi-colon, or (3) all structured statements would accept statement lists as constituting parts (this could have been accomplished by the introduction of special terminator keywords, such as fi, od, rof; or by always using braces, say by defining:

#define	IF	if
#define	THEN	{
#define	ELSE	} else {
#define	FI	}

but this was not done, since it introduces new problems).

In that case we could have had two types of macros: (a) expressions, and (b1) statements, or (b2) statements without their trailing semi-colon (expressions being a special case), or (b3) statement lists; corresponding to (a) in-line functions and (b) in-line procedures. (It might have been a good idea to define the semi-colon as an operator, equivalent to the comma, then (3) would also be an expression. The value of some statements could be undefined.)

CONVENTIONS: A macro for an in-line function (procedure) always generates an expression. When the body of a function contains only expression, if, and return statements (that essentially eliminates only loops), and when it does not need local variables and does not alter its parameters, then it can be turned into an in-line function (i.e. a macro). An if statement can be changed into a conditional expression, and the semi-colons can be replaced by comma operators (omitting the last one if no expression follows); this enables you to rewrite the function body as a single expression.
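
A minimal sketch of these conventions, assuming hypothetical globals lo, hi and x. The expression macro RESET() takes a trailing semi-colon inside if ... else just as a function call would, and the function clamp_f() is rewritten as the macro CLAMP() by turning its if statements into conditional expressions.

```c
static int lo, hi, x;

/* in-line procedure: one expression, former statements joined by commas */
#define RESET()     ( lo = 0, hi = 9 )

/* function version: only if and return statements, no local variables */
static int clamp_f(int v)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/* in-line version: each if became a conditional expression */
#define CLAMP(v)    ( (v) < lo ? lo : (v) > hi ? hi : (v) )

/* the macro invocation is followed by a semi-colon, like a function call */
static void setup(int c)
{
    if (c) RESET(); else x = 0;
}
```

Note that CLAMP() mentions its parameter more than once, so it should only be invoked with side-effect-free arguments.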

Parentheses in Macro Bodies

The definition of a macro (e.g. for in-line functions) requires special attention; there are some nasty pitfalls. The macro invocation is textually replaced by the macro body, with each formal parameter textually replaced by the corresponding actual parameter. This replacement is rescanned for more #defined items, until none are left (see [5, Sec.3.3.3 on p.34]).

Some surprises are possible (also see [5, Sec.3.3.6]). If you want macro invocation to resemble function invocation, you have to be careful about priority conflicts between operators in (1) the environment of the macro invocation; (2) the macro body; (3) the actual parameters. It is therefore safest to surround by parentheses: (a) the entire macro body (shielding (1) from (2)), and (b) each formal parameter in the macro body (shielding (2) from (3)). This forces each actual parameter to be evaluated before operators in (2) come into play, and forces the evaluation of the entire macro body before operators in (1) are applied, just as with function invocation. The outer parentheses can be omitted if the in-line function simulates a procedure (no sensible value returned). The inner parentheses can be omitted if the formal parameter is surrounded by [], or is a member (not simply part) of an actual parameter list. Also see [1, p.87].
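
The two kinds of priority conflict can be illustrated with a small sketch. All macro and function names below are hypothetical; the values in the comments follow from ordinary operator precedence after textual substitution.

```c
#define DOUBLE_BAD(x)   x + x           /* no parentheses at all */
#define DOUBLE(x)       ((x) + (x))     /* body and parameters shielded */

#define SQUARE_BAD(x)   (x * x)         /* body shielded, parameter not */
#define SQUARE(x)       ((x) * (x))

/* conflict between the environment and the macro body */
static int outer_bad(void) { return 3 * DOUBLE_BAD(2); }   /* 3 * 2 + 2 = 8 */
static int outer_ok(void)  { return 3 * DOUBLE(2); }       /* 3 * (2 + 2) = 12 */

/* conflict between the macro body and the actual parameter */
static int inner_bad(void) { return SQUARE_BAD(1 + 2); }   /* 1 + 2 * 1 + 2 = 5 */
static int inner_ok(void)  { return SQUARE(1 + 2); }       /* (1 + 2) * (1 + 2) = 9 */
```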

Choosing between Macro and Function

The choice between definition by function or by macro is sometimes difficult. It is not true that a macro is always faster at the expense of more code, because of its ``in-line'' nature (no function invocation overhead, but repeated code for each invocation). When a formal macro parameter appears several times in the macro body, then invocation of the macro with a complicated expression for this parameter generates code with each occurrence of the parameter replaced by the expression. This may cause recomputation of the expression, which can be undesirable (speed) or plain wrong (if you forgot about the side effects of the expression). Also take notice of the differences mentioned below.
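
The recomputation problem can be made visible with a counter. In this sketch, which uses hypothetical names throughout, next_value() has a side effect; the macro evaluates it twice, the function only once.

```c
static int calls;       /* counts invocations of next_value() */

static int next_value(void)
{
    return ++calls;
}

#define SQUARE(x)   ((x) * (x))

static int square_f(int x)
{
    return x * x;
}

/* expands to ((next_value()) * (next_value())): two calls */
static int via_macro(void)    { return SQUARE(next_value()); }

/* the argument is evaluated once, before the call */
static int via_function(void) { return square_f(next_value()); }
```

Starting from calls == 0, via_macro() multiplies the values 1 and 2 and leaves calls at 2, whereas via_function() returns 1 and leaves calls at 1.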

Macros versus Functions: Important Differences

(1) Actual macro parameters need not always be expressions, but can also be type names, operators, etc.
(2) Formal macro parameters do not have a specific type, so the actual macro parameters are not restricted in type (see MAX(a,b) in [1, p.87]).
(3) If the function f is defined as `int f()', then f() has type `int', but f has type `int (*)()', i.e. pointer to function returning integer. The latter, obviously, does not hold for macro identifiers.

(4) Automatic type conversions for actual parameters may differ for a function and its in-line macro version! (See above: Type Conversions.)

float r;			/* int i; invocation: */
f(x)  float x;  { r = x; }	/* f(i): NO (= wrong) conversion,
				   use f((float)i) instead */
#define f(x)	(r = (x))	/* f(i): correct conversion */

(5) Formal function parameters are local variables, initialized with the value of the corresponding actual parameter at invocation (evaluated only once). Not so for macro parameters; multiple evaluation or no evaluation at all are also possible; be careful about side effects.

(6) Output parameters of functions must be passed as a pointer (reference to the actual parameter); macro parameters can be assigned a value, directly affecting the outside world.

				/* int i; invocation: */
f(x)  int x;	{ x = 3; }	/* f(i), does NOT affect i */
f(x)  int *x;	{ *x = 3;}	/* f(&i), affects i, not &i */
#define f(x)	((x) = 3)	/* f(i), affects i */
#define f(x)	(*(x) = 3)	/* f(&i), affects i */

(7) The order in which the actual parameters are evaluated when a function is invoked, is undefined by C (the ACE compiler does it right to left! [2, p.3]); actual macro parameters need not be evaluated at all (keep the conditional expression `? :' and the operators && and || in mind), and the order of evaluation also depends on the structure of the macro body.
(8) There is no conflict when a function and structure member have the same name [1, pp.197, 206]. Macros and structure members should have different names.

CONVENTIONS: We have tried to keep things simple, so that moving from macro to function does not result in a headache. Turning a function into a macro is not always possible when following our conventions (see above: Macros for In-line Functions). When a macro has been used for its special features (1, 2), this will always be explicitly mentioned. The same holds for special features of functions (3). When output parameters are involved (6), we have always used the reference version of the macro definition, although it may seem unduly complicated. This is only a textual complication, and then only after substitution (which the compiler is supposed to do, not the reader); it does not result in inefficient code, because a decent C compiler generates the same code for (*(&i) = 3), as for ((i) = 3). We have never relied on the order of evaluation for actual function/macro parameters (7), so that they can easily be interchanged, if the need arises. Functions and structure members never have been given the same name (8).
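
Three of these differences can be sketched in a few lines: a type name as actual parameter, an untyped formal parameter, and an output parameter in the reference version. The names ELEMENTS, SET3 and set3_f are hypothetical; MAX follows the well-known form from [1].

```c
#include <stddef.h>

/* actual parameter is a type name, which no function could accept */
#define ELEMENTS(type, nbytes)  ((nbytes) / sizeof(type))

/* formal parameters have no type: works for int, long, double, ... */
#define MAX(a, b)   ((a) > (b) ? (a) : (b))

/* output parameter, reference version: invoked exactly like set3_f() */
#define SET3(px)    (*(px) = 3)

static void set3_f(int *px)
{
    *px = 3;
}
```

Because SET3(&i) and set3_f(&i) are invoked in the same way, the macro can later be replaced by the function (or vice versa) without touching the call sites.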

Macros versus Functions: Hiding Power

Another problem with macros is that they do not have the same hiding power as functions, due to the fact that macros are only a textual abbreviation mechanism. All items in the macro body should somehow be available at the place of invocation, possibly by including the right files.

Consider the following example; here, `lowlevelD.h' makes outcmd() and START available (what they are, is not known by the module using lowlevel, each might be #defined or extern).

/* definition by function */	|/* definition by macro */

	`highlevelD.h': Definition part of module highlevel

extern	start();		|#include "lowlevelD.h"
				|#define start() outcmd( START )

	`highlevel.c': Implementation part of module highlevel

#include "lowlevelD.h"		|/* empty */
start()				|
{ outcmd( START ); }		|

In the macro version all the lowlevel stuff is (and has to be) made available to any module using highlevel. (It need not be used, of course, but it occupies symbol table space, and abuse would be hard to detect and locate.) The nested #include is not so attractive, especially not when the module that uses highlevel also uses another (macro version) module that uses lowlevel. This situation would cause multiple inclusion of the same header file. Multiple inclusion may cause problems with typedef's. Multiple inclusion can be avoided by omitting the #include in the definition part (of the macro version), and by stipulating that the inclusion of `highlevelD.h' is to be preceded(!) by

#include "lowlevelD.h"

(an obligation that is easily forgotten by the user of highlevel). Also see below: Modularization: Include Files.

Macros for Other Purposes

We have also used macros for other purposes than in-line functions. For example:

#define forallplanes(p)		for (p=plane; p<=maxplane; p++)
#define BUFFER(size)		struct { int n_used, buf[size]; }

Modularization: Include Files

Most of the software is modularized in Modula-2 fashion. Each module consists (conceptually) of a definition (sometimes also called interface) part and an implementation part. The definition part has to be accessible by those that use the particular module, so that definitions can be imported. The implementation part is compiled separately (or better: independently) from the clients of the module. The implementation part, may, however, be distributed over several source files, each compiled separately.

CONVENTION: The definition part resides on a separate header file (called definition header), which is to be included by clients of the module. In general, the implementation parts also #include this header file. Any information shared by the implementation parts, but not intended for the client (i.e. private, not to be exported), is in a separate header file, called implementation header. As noted above (see Macros: Hiding Power), multiple inclusion of definition parts cannot be avoided. Its disastrous effect, however, can be suppressed by the following trick (now a method).

CONVENTION: Each definition part has the following structure.

# ifndef _FLAG
# define _FLAG
... actual definition part
# endif

In this way, multiple inclusion is harmless. Naming conventions (for a module named `module'):

definition header file name:		moduleD.h
implementation header file name:	moduleI.h
implementation main source file name:	module.c
definition part flag name:		_MODULE
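
The effect of the flag can be tried out in a single source file by writing the guarded definition part twice, which is what the preprocessor sees after a double inclusion. This is only a sketch: the contents of the header (struct cmd, START) are invented for illustration, and the flag is spelled LOWLEVEL_D rather than _LOWLEVEL because ANSI C reserves identifiers that start with an underscore followed by a capital letter.

```c
/* ---- text seen after the first #include "lowlevelD.h" ---- */
#ifndef LOWLEVEL_D
#define LOWLEVEL_D
struct cmd { int code; };
#define START   1
#endif

/* ---- the same header again, pulled in by a nested #include ---- */
#ifndef LOWLEVEL_D
#define LOWLEVEL_D
struct cmd { int code; };       /* skipped: no redefinition error */
#define START   1
#endif

static int start_code(void)
{
    struct cmd c;

    c.code = START;
    return c.code;
}
```

Without the #ifndef lines the second definition of struct cmd would be rejected by the compiler.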

Constants and Fixed Variables

Two kinds of constants can be distinguished, viz. ordinary C constants and variables whose value does not change after (possibly dynamic) initialization (fixed variables for short). We sometimes use the term constant for both kinds. C constants are often hidden in a #define. Why use fixed variables at all? Well, (1) there are no C constants for arrays and structs, except to initialize (external) array and struct variables, and (2) possibly the constant cannot be determined at compile/load time. A variable (fixed or not) has an address; a constant does not have an address, so a pointer to a constant is impossible. Array and function names belong to the category of C constants.

A string is a fixed variable with hidden name, initialized by the compiler ([1, p.181]; except when used as initializer of an array of char: [1, pp.84, 199]); it is even better to think of a string as an expression (resulting in a pointer to char) with (compile time) storage allocation as side effect (also see Declaration versus Definition). The following is a perfectly legal expression (with result of type char):

"0123456789abcdef"[i&0xf]

It is the same as

char L[] = "0123456789abcdef";
     /*  = { '0', ..., 'f', '\0' }  */
...

L[i&0xf]

only there is no need to coin a name for the string in the previous case. In the latter case, the variable L can be used at several places without requiring additional storage. The second double-quoted string is an array initializer, and not a string expression! Another example:

char *s ;

for (s = ",.;:!?" ; *s ; s++) processchar(*s);

Again we have a locally defined string that is completely hidden, but that does get storage allocated. The string is only accessible via s.

N.B. Notice the difference between p and A defined by

char *p = "hello", A[] = "hello";

For p the compiler reserves 4 bytes (the size of a pointer [2]); for the first "hello" it reserves 6 bytes initialized with the six chars: `h', `e', `l', `l', `o', `\0'; p is a variable initialized with the address of `h'; for A[] it reserves 6 bytes, initialized with the same six chars, but residing at a different address; A is a constant, viz. the address of its first element. The first string is an expression [1, p.186], the second an abbreviated initializer [1, p.199].
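
The difference shows up directly in sizeof. The sketch below uses hypothetical helper names and declares p as pointer to const char, a precaution that postdates [1]; the sizes are those of the machine at hand, not necessarily the 4 bytes of the MC68000.

```c
/* a string used as an expression, subscripted on the spot */
static char hex_digit(int i)
{
    return "0123456789abcdef"[i & 0xf];
}

static const char *p = "hello"; /* pointer variable plus a hidden string */
static char A[] = "hello";      /* array of 6 chars, no pointer stored */

static unsigned long size_of_p(void) { return (unsigned long)sizeof p; }
static unsigned long size_of_A(void) { return (unsigned long)sizeof A; }
```

Here size_of_A() is 6 (five letters plus the terminating '\0'), while size_of_p() is the size of a pointer.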

Expressions (or parts thereof) that contain only constants, are evaluated at compile time; at least the ACE compiler does so, and it is suggested in [1, p.45] (somewhere in the middle) that this is true in general; this is usually called `constant folding'; also see [5, Sec.7.10].

Stand-Alone Applications (Software for the IOP)

Software running in the IOP can in no way fall back on UNIX!

We have to take care of the memory layout ourselves (under UNIX this is done by compiler, loader and operating system). This involves the location of text (=code) and data sections of the software, its entry point, the stack, and any additional stuff.
The compiler need not generate test instructions for the stack (see ACE doc on `cc', [2, pp.5, 12]). We use the compiler option -k to suppress them.
There is no good way to find out how much stack space is required, even though the compiler should be able to tell us (well, if any utility could tell us, it would be the compiler). You have to make an estimate yourself (and it had better be a good one).
Be careful with floating point operations, e.g. when these are supported by a special floating point package (either hardware or software) that resides in the UNIX kernel.
A program in the IOP cannot terminate in a similar way as when running under UNIX, in fact it should cycle for ever or stop (there is no one to return to).
When the loader `ld' is invoked by `cc,' it is also handed the file `/lib/crt0.o' which contains the symbol _exit, that causes a number of undesirable (i.e. UNIX dependent) routines to be loaded. So loading should be done separately.
When a function that runs on the bus (UNIX) processor, has to be ported to the IOP, special care is needed to assure that all information required for its execution is available in the IOP. This may involve more than simply transferring the values of the function's arguments to the IOP! The function could have used external data (global variables) or functions, or its arguments could have been pointers to objects in the bus processor's memory. Some argument values don't make sense in the IOP, especially bus processor pointers, and file descriptors.
Dynamic allocation of storage (usually by calloc()), should be done differently. (We probably have to write our own allocator for the IOP.)

Objects and Their Addresses

Usually the C compiler and loader determine the addresses of objects defined in a C program. Sometimes it may be desirable to force an object to have a specific address (often due to hardware restrictions). When the address itself is known in advance, this can be accomplished by a pointer and a cast. For example,

#define	s	(*(char *)0xd00401)

makes s an `lvalue' [1, p.183], which can be used as if it were an ordinary variable of type char (with &s == 0xd00401). Interpret the definition of s as follows: 0xd00401 is a constant of type integer; the cast (char *) makes it a constant of type pointer to char, that is if you put a * in front you get a char; finally we put that * in front: it makes it an lvalue referring to a char (the outer parentheses prevent priority conflicts between operators on s and those within s). When t is declared as char, the following expressions involving s are correct:

s = 'a'; s == t; t = s; ++s; f(&s).

For arrays with a specific (compiler/loader independent) address, the above works, that is

#define a1  (*(char (*)[10])0xd00400)

is similar to `char a1[10];', notice the parentheses around the asterisk, `char *[10]' is an array of 10 pointers to characters! But simpler may be:

#define	a2	((char *)0xd00400)

The expression a2[i] works as usual, etc. (NOTE: sizeof(a1) == 10, but sizeof(a2) == size of pointer (==4 for us)).

The general recipe for an object normally declared as `TS D( id );', but forced at address ad, is:

#define	id	(*(TS D( (*) ))(ad))

That is, the type-specifier TS is copied literally, the identifier id in the declarator D is replaced by (*), a parenthesized asterisk. N.B. When the address ad is an expression, it is best to put parentheses around it (as we have done), because the preceding cast has a high priority! [1, p.49]

This recipe can also be used for functions:

#define	f	(*(int (*)())0xd01000)

makes f a function returning an integer, its address is 0xd01000; invocation is as always f(...). Also see [1, pp. 185, 209, 211].

To get the address of such a forced object, simply place & in front of the identifier; this annihilates the * in its definition, leaving the address (cast to the appropriate type). But for arrays and functions simply use the identifier without subscript [] or parameter list ().
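
The recipe can be tried out on any machine by substituting the address of an ordinary buffer for the hardware address. In this sketch region, BASE, DEV_S, DEV_A1 and DEV_A2 are hypothetical names; on the real hardware BASE would be a constant such as 0xd00400.

```c
static char region[16];         /* stand-in for the memory at the forced address */
#define BASE    region          /* assumption: a constant address on real hardware */

#define DEV_S   (*(char *)(BASE + 1))       /* like: char s;      */
#define DEV_A1  (*(char (*)[10])(BASE))     /* like: char a1[10]; */
#define DEV_A2  ((char *)(BASE))            /* simpler pointer form */

static void store(char c)
{
    DEV_S = c;                  /* the macro is an lvalue */
}

static char *address_of_s(void)
{
    return &DEV_S;              /* & annihilates the * in the definition */
}
```

As noted above, sizeof(DEV_A1) is 10, whereas sizeof(DEV_A2) is the size of a pointer.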

Another way of accomplishing forced addresses, somewhat less efficiently, is the introduction of a pointer variable, initialized once, to point to the object at the specific address. This method has to be used when the address ad cannot be determined at compile/load time. Recipe:

TS D( (*id) ) = (TS D( (*) ))(ad) ;

The variable id itself will probably never be changed (it is a fixed variable, see: Constants), it may have to be copied to a register for faster access (also see: Pointers). The actual object is *id (in contrast to the first method discussed above, where the object is referred to simply by id), so id is a misleading name for this pointer.

There is yet another method, that may sometimes be useful, but it is not portable (it works for TNO's Geminix). Declare the object id (to be forced at the address ad) as extern (i.e., `extern TS D( id );') in the C source file. Make an assembler (.s) file with the following contents:

.data
.global	_id
.set	_id, ad

Assemble it and link it together with the rest. It makes id an Absolute External.

When several objects have to be forced on specific addresses, such that they are contiguous (packed), there arises another problem. The address of the first object is given; the address of any other object follows from the address and size of the previous object, and its own alignment condition. One way to get around calculating all the addresses yourself, is to use a ``super''structure that contains all the objects, and to force this superstructure once on the desired address. Note that a struct should always be given an even address, [2, p.3]. The compiler takes care of alignment and sizes of the members (the objects), so modification does not mean you have to recompute all the addresses yourself.

CONVENTION: We have used a typedef to define the type SuperStruct as the structure containing the objects, and S as the identifier for the structure itself (S is the only instance of SuperStruct).
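
A sketch of this convention, with invented members and with an ordinary variable (backing) standing in for the forced address so that it can be run anywhere:

```c
#include <stddef.h>

typedef struct {
    char  status;       /* first object, at the forced address itself */
    short count;        /* the compiler inserts padding if required */
    long  data[4];
} SuperStruct;

static SuperStruct backing;                 /* stand-in for the real address */
#define S   (*(SuperStruct *)&backing)      /* the only instance */

static void init(void)
{
    S.status  = 1;
    S.count   = 2;
    S.data[3] = 4;
}
```

The offsets of the members can be inspected with offsetof(SuperStruct, member) from <stddef.h>, an ANSI C facility, instead of being computed by hand.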

Alignment

Alignment is a vague concept in C, although most implementations have to deal with it. Alignment is necessary if there are restrictions on the addresses of (some) objects (see e.g. [2, Sec.1.2]); it is mostly hidden in the compiler. Sometimes, however, it needs extra attention (see e.g. [1, Sec.8.7 p.173]). The alignment of arrays as described in [2] is not correct: an array has the same alignment as its element type.

Efficiency

C relies very much (too much?) on particular representations for object values (binary for integers), and (partly) machine dependent side effects of otherwise abstract operators (e.g. ++). But this is not pushed to the limit (e.g. reversing a bit pattern is still awkward, whereas most processors allow shifting into the ``carry flag'' and back). Nevertheless, it enables one to write quite efficient programs.

Let me put forward a suggestion for an additional operator, the reverse comma: (A ` B). The semantics are: evaluate A, temporarily store the result (on the stack e.g.), evaluate B, throw away its result, and return the stored result as the result of the expression. So (A ` B) is something like ( t = A , B , t ), where t is a fresh variable. If A and B have no side effects, it is the same as (B,A). Properties:

((a ` b) ` c) == (a ` (b ` c))  (` is associative)
((a , b) ` c) == (a , (b ` c))  (so you can write (a , b ` c) unambiguously)
in general ((a ` b) , c) != (a ` (b , c))
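
Since the operator does not exist, it can only be imitated with an explicit temporary, following the ( t = A , B , t ) reading given above. REVCOMMA and the two functions are hypothetical names used for illustration.

```c
/* (A ` B) spelled out with an explicit temporary t */
#define REVCOMMA(t, A, B)   ( (t) = (A), (B), (t) )

/* n++ expressed as (n ` ++n) */
static int post_increment(int *n)
{
    int t;

    return REVCOMMA(t, *n, ++*n);
}

/* return the old value of a flag and clear it: (flag ` flag = 0) */
static int test_and_clear(int *flag)
{
    int t;

    return REVCOMMA(t, *flag, *flag = 0);
}
```

The comma operator guarantees that A is evaluated and stored before B is evaluated, so the result is the value A had before the side effects of B.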

Assorted remarks. `++n' is slightly faster than `n++'; `++n' being shorthand for `n += 1', `n++' for `(n ` ++n)'. `do { ... } while (b)' is slightly more efficient than `while (b) { ... }', but they are only equivalent, of course, if `b' holds initially. When highest speed is required, we have, therefore, sometimes preferred to write

/* N.B. n > 0 */
do {
	...
} while (--n);

instead of the more obvious

while (n--) {
	...
}
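
That both forms execute the body the same number of times for n > 0 can be checked with a pair of counting functions (hypothetical names); whether the first form is actually faster depends on the compiler and is not measured here.

```c
/* precondition: n > 0 */
static int count_do_while(int n)
{
    int iterations = 0;

    do {
        iterations++;
    } while (--n);
    return iterations;
}

/* also correct for n == 0 */
static int count_while(int n)
{
    int iterations = 0;

    while (n--) {
        iterations++;
    }
    return iterations;
}
```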

The ACE C compiler does constant folding, that is, expressions that can be evaluated at compile time are replaced by a single constant (instead of generating code to evaluate the expression). Also see Constants.

For low-level applications that directly deal with special hardware, it is important to have an understanding of the optimization techniques applied by the C compiler. Some of the standard techniques are discussed in [5, Sec.7.13]. For example, the compiler may minimize access to external variables by keeping track of the processor's register contents. The expression `x*(x+y)' could be evaluated using only one access of the variable x, when enough registers are available. Some compilers produce no code for the statement `x;', which inspects the variable x, but immediately discards the value. Normally, inspection of a variable has no side effects; therefore, these optimizations are correct. Dedicated hardware, however, might change its state because of a read access of some memory location (e.g., auto-increment or generate an interrupt for another processor; how a memory location can be accessed like a variable is explained in Objects and Their Addresses). In that case, access must somehow be forced; for example, by assignment to a dummy (register) variable.
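
ANSI C later introduced the type qualifier `volatile' for this situation; it was not available in the compiler described here. The following sketch assumes an ANSI compiler and uses an ordinary variable (fake_port, a hypothetical name) in place of a real device register.

```c
static unsigned char fake_port;     /* stand-in for a device register */

/* volatile: every access written in the source has to be performed */
#define PORT    (*(volatile unsigned char *)&fake_port)

static void acknowledge(void)
{
    (void)PORT;                 /* read is kept although the value is discarded */
}

static int read_twice(void)
{
    int first = PORT;           /* two separate reads, not one cached value */
    int second = PORT;

    return first + second;
}
```

With a real register the two reads in read_twice() may well return different values; with the stand-in variable they are of course equal.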

C Implementation on MC68000 (on TNO's Geminix)

We have tried not to make use of specific implementation details. But sometimes this could not be avoided. For example, integers (and longs) are 4 bytes, the same as pointers. See [2] for more info.

Declarators and Type Names

Declarators and type names are awful to read and to write; the order is all mixed up [1, pp.193-194, 199]. This is particularly nasty in definitions that also initialize:

char *s = "hello", *t = s;

Two modes of thought are required to decode this. First, t's declaration is to be read as: *t has type char, so t is a pointer to a char (mentally reversing the order in the declarator). Secondly, t is assigned the value of s, usually written as `t = s;'. Normally `*t = s;' would be an illegal assignment.

Try for yourself to declare an array A of 10 pointers to functions returning a pointer to a structure previously tagged as `date', and next a register pointer p to the elements of such an array (for fast access), initially pointing to the first element of A. Here we go:

struct date *( ( *( A[10] ) )() );
--------------------*------------       A is ...
---------------------****--------       an array of 10 ...
-----------------**-------*------       pointers to ...
---------------*------------***--       functions returning ...
------------**------------------*       a pointer to ...
***********----------------------       a structure tagged `date'

register struct date *( ( *( *p ) )() ) = &A[0];

Eliminating unnecessary parentheses, and replacing &A[0] by A [1, p.94], gives:

struct date *(*A[10])();
register struct date *(**p)() = A;

N.B. It is not the leftmost *, but the rightmost *, in the register definition that makes p have the type `pointer to ...'. A postfixed * and type specifier would have made things a lot easier (like: var A[10]*()* struct date).
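
One way to keep such declarations readable in C itself is to build them up in steps with typedef. The sketch below is an alternative formulation, not the one used in the project; DateFn, DateFnPtr, epoch, get_epoch and first_year are hypothetical names, and the (void) parameter lists assume ANSI C.

```c
struct date { int year, month, day; };

typedef struct date *DateFn(void);  /* function returning a pointer to date */
typedef DateFn *DateFnPtr;          /* pointer to such a function */

static DateFnPtr A[10];             /* compare: struct date *(*A[10])(); */

static struct date epoch = { 1970, 1, 1 };

static struct date *get_epoch(void)
{
    return &epoch;
}

static int first_year(void)
{
    register DateFnPtr *p = A;      /* compare: struct date *(**p)() = A; */

    A[0] = get_epoch;
    return (**p)()->year;           /* call through the first element */
}
```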

Here is another example, it describes how our color lookup table is structured:

struct { struct {byte d,i;} r[N],g[N],b[N];} V;
---------------------------------------------*  V is ...
********-----------------------------------*--  a struct with ...
----------------------------*----*----*-------  fields r, g, b ...
-----------------------------***--***--***----  each an array of N ...
---------********---------*-------------------  structs with ...
----------------------*-*---------------------  fields d and i ...
-----------------****-------------------------  each a byte

Because a selection looks like: V.g[n].i, I would have preferred something like:

var V: struct { r,g,b: array P[N] of struct { d,i: byte } };

There is no mechanism to `decompose' type names. For example, if we have

typedef char *String;

then there is no way to express `the type of the elements String points to' by referring only to the type String. This might be nice for sizeof applied to type names that were imported from some module.
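
For the sizeof case specifically there is a partial workaround, because sizeof also accepts an expression and does not evaluate it. It yields only the size, not a usable type name. The macro names and the second typedef below are hypothetical.

```c
typedef char *String;

/* sizeof does not evaluate its operand, so the null pointer is never
   dereferenced; only the type of the expression matters */
#define ELEMENT_SIZE_OF_STRING  (sizeof *(String)0)

typedef long *LongVector;           /* hypothetical second example */
#define ELEMENT_SIZE_OF_VECTOR  (sizeof *(LongVector)0)
```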

Pointers

Their declaration is awful (see above). Sometimes explicit pointers cannot be avoided, even when the extra level of indirection is not desired. Inconsistent naming of variables may cause confusion between the pointer and the object pointed to, especially when you are not interested in the pointer as such. This is the case with reference parameters:

f(x)  int x;  { ... x = 3; ... }  /* invoke: f(i), f(i-1) */

f(x)  int *x;  { ... *x = 3; ... }  /* invoke: f(&i) */

In the second function, x is not an int. Changing a parameter from reference to value, or back, is awkward; changes are required to: (1) the parameter declaration, (2) each occurrence of the formal parameter in the function body, and (3) each actual parameter expression in the function invocation. Compare this to PASCAL, where only the parameter declaration needs to be changed:

procedure f(x: integer); begin ... x := 3 ... end;
(* invoke: f(i), f(i-1) *)

procedure f(var x: integer); begin ... x := 3 ... end;
(* invoke: f(i), NOT f(i-1) *)
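The difference between the two C variants can be made concrete with a small sketch. It is written with ANSI-style prototypes rather than the older parameter declarations used above, and the function names set_by_value() and set_by_reference() are hypothetical.

```c
/* Value parameter: x is a private copy, so the caller's variable
   is not affected by the assignment. */
void set_by_value(int x)
{
    x = 3;
    (void)x;            /* silence "set but not used" warnings */
}

/* Reference parameter, simulated with a pointer: x is not an int but
   the address of one, so the body must write *x, not x. */
void set_by_reference(int *x)
{
    *x = 3;
}
```

A caller writes set_by_value(i) or set_by_value(i-1), but must write set_by_reference(&i); all three places listed above differ between the two versions.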

The introduction of conceptually superfluous pointers happens more often. For example, when it is desirable to have fast access to an external (global) struct, you introduce a register pointer to the struct (same with traversing arrays).

struct globrec { char c,d; ... } (*g[10])[3];

f(i,j)
int i,j;
{	register struct globrec *r = &(*g[i])[j];
	r->c = 'a'; /* (*g[i])[j].c = 'a' */
	r->d = 'A';
	...
}

Here, r is introduced to avoid the re-evaluation of (*g[i])[j]. In PASCAL this remains completely hidden:

type globrec = record c,d: char ... end;
     arrayofgr = array [0..2] of globrec;
var g: array [0..9] of ^arrayofgr;

procedure f(i,j: integer);
	begin
	with g[i]^[j] do begin
		c := 'a';
		d := 'A'
		...
		end
	end;
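For comparison, here is a compilable sketch of the C version. The elided fields of struct globrec are left out, and the backing storage rows together with the helper init_g() are assumptions added only so that the pointers in g refer to something.

```c
struct globrec { char c, d; };

/* Hypothetical backing storage: 10 arrays of 3 records each. */
static struct globrec rows[10][3];

/* g is an array of 10 pointers, each to an array of 3 records. */
struct globrec (*g[10])[3];

/* Let every g[i] point at its own row. */
void init_g(void)
{
    int i;
    for (i = 0; i < 10; i++)
        g[i] = &rows[i];
}

/* The address of (*g[i])[j] is computed once and kept in r. */
void f(int i, int j)
{
    register struct globrec *r = &(*g[i])[j];
    r->c = 'a';         /* same as (*g[i])[j].c = 'a' */
    r->d = 'A';
}
```

Modern optimizing compilers usually perform this common-subexpression elimination themselves, so the explicit pointer is mainly a readability choice today.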

A third place where explicit pointers arise is with objects for which storage is not allocated by the compiler/loader. These may be (1) so-called dynamic objects (created by calloc(); same as new() in PASCAL), or (2) objects that reside at specific addresses (also see: Objects and Their Addresses).
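A minimal sketch of case (1) follows. The fields of struct date and the function new_date() are hypothetical; the point is only that the object has no name of its own and can be reached solely through the pointer.

```c
#include <stdlib.h>

struct date { int day, month, year; };  /* hypothetical fields */

/* Create a dynamic object. calloc() returns zero-filled storage,
   or NULL when no storage is available. */
struct date *new_date(int day, int month, int year)
{
    struct date *d = calloc(1, sizeof *d);
    if (d != NULL) {
        d->day = day;
        d->month = month;
        d->year = year;
    }
    return d;
}
```

The caller owns the storage and releases it with free() when the object is no longer needed.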

Finally, pointers to functions can be used to implement general library functions that do not cause more modules to be loaded than are needed:

/* first implementation;
   whenever manyf() is used, both f0() and f1() are loaded */
manyf()
{
	...
	switch (...) {
		case 0: f0(); break;
		case 1: f1(); break;
	}
	...
}

/* second implementation */
int (*pf0)() = NULL; /* dynamically initialized = f0, if f0() is used */
int (*pf1)() = NULL; /* dynamically initialized = f1, if f1() is used */
manyf()
{
	...
	switch (...) {
		case 0: (*pf0)(); break;
		case 1: (*pf1)(); break;
	}
	...
}

Admittedly, it is an awkward construction, but it is sometimes useful (e.g. it is used in C-GKS (by CWI), see u_defacts.c, ws_globals.h).
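The mechanism of the second implementation can be sketched in a single file as follows. In a real library f0(), f1() and their registration functions would live in separate modules, so that the loader pulls in only those that are actually referenced; that aspect cannot be shown in one file. The return values, the registration functions use_f0() and use_f1(), and the convention of returning -1 for an unregistered function are all assumptions made for this illustration.

```c
#include <stddef.h>

static int f0(void) { return 10; }
static int f1(void) { return 11; }

int (*pf0)(void) = NULL;    /* set to f0 only if f0() is wanted */
int (*pf1)(void) = NULL;    /* set to f1 only if f1() is wanted */

/* Dispatch through the pointers; an unregistered function yields -1. */
int manyf(int which)
{
    switch (which) {
    case 0:  return pf0 != NULL ? (*pf0)() : -1;
    case 1:  return pf1 != NULL ? (*pf1)() : -1;
    default: return -1;
    }
}

/* Registration: calling one of these is what creates the reference. */
void use_f0(void) { pf0 = f0; }
void use_f1(void) { pf1 = f1; }
```

Checking the pointer before the call guards against dispatching to a function that was never registered.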

Be careful with pointers; the object pointed to must somehow get its storage allocated. This is especially important for strings which are often considered as (char *) objects. There is a huge difference between p and A defined as

char *p, A[10];

p is simply a pointer; A is an array with 10 chars of storage allocated. Confusion may arise when using initializers:

char *p = "hello", A[10] = "hello";

In this case the storage for p is still that of a pointer, but its initializer is a string expression that gets 6 chars of (static) storage allocated. Also see Constants, and Declaration versus Definition.
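The difference shows up directly in sizeof, as the following sketch illustrates. The function names are hypothetical, and the size of a pointer depends on the implementation.

```c
#include <stddef.h>
#include <string.h>

char *p = "hello";      /* a pointer; the 6 chars live in unnamed static storage */
char A[10] = "hello";   /* an array: 10 chars of storage, the first 6 initialized */

size_t size_of_p(void) { return sizeof p; }  /* size of a pointer */
size_t size_of_A(void) { return sizeof A; }  /* size of the whole array: 10 */

/* Writing into A is safe, because the storage belongs to A itself. */
void capitalize_A(void)
{
    A[0] = 'H';
}
```

Writing through p instead would modify the string's own storage, which is best avoided.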

Declaration versus Definition (also covering Storage Allocation)

For an introduction see [1, pp.28-30, 72-77, 204-206]. The following concepts are confusing: scope of identifiers (syntactic or lexical scope), scope of objects (chronological or semantic scope of variables and functions, called `extent' in [5]), visibility, storage classes and (actual) storage allocation, and accessibility.

storage class	scope of identifier	lifetime of object
------- -----	----- -- ----------	-------- -- ------
extern		unrestricted		permanent
static		(more) restricted	permanent
auto		more restricted		temporary
register	more restricted		temporary

The scope of an identifier with storage class static depends on whether it is defined externally or internally w.r.t. a function. In the latter case, it is more restricted.

For lexical scope see [1, Sec.11.1 p.206; Sec.8.5 p.197; Sec.14.1 p.209]. Although the scope of the identifier of an extern object is unrestricted (as far as the loader is concerned; the loader exports the name), it has to be redeclared in each file referring to it, using the same identifier (that is the only way to refer to the same object [1, Sec.11.2]).

The scope of an externally defined static object (its lifetime and accessibility) is unrestricted (permanent), just as for extern objects. You notice this (1) by the fact that repeated inspection of the variable always yields the value assigned last; and (2) by handing the outside world (beyond the lexical scope of its identifier) a pointer to the static object; i.e., the object can be accessed via a pointer at places outside the scope of its identifier, as opposed to objects of storage class auto and register.

This also holds for internally defined static objects (the scope of their identifier is even more restricted), but it can be argued that, as far as accessibility is concerned, this is an implementation side effect, and that internally defined static objects should not be accessed from outside their lexical scope (allowing an implementation that puts them in a different memory segment that is only accessible when the defining block is active).
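Both observations about externally defined static objects can be tried out with a small sketch. The module below is hypothetical; the identifier count is invisible outside the file, yet the object itself remains reachable through a pointer.

```c
/* The identifier `count` has file scope only; the object is permanent. */
static int count = 0;

/* (1) The object keeps the value assigned last between calls. */
int next_count(void)
{
    return ++count;
}

/* (2) Hand the outside world a pointer to the static object. */
int *count_address(void)
{
    return &count;
}
```

A caller in another file cannot name count, but can still change it with an assignment such as *count_address() = 41.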

The storage class of (formal) function parameters is not well defined. I would say they are auto, unless register is specified, because they serve as local variables, initialized with the value of the corresponding actual parameter at invocation. N.B. register works a bit differently for parameters: see [1, top p.205].

Confusing concepts: external/internal occurrence of declaration (w.r.t. a function) and storage class extern.

Do not take `another file' in [1, Sec.11.2 p.206] too literally. There is no conflict between the following declaration and definition occurring in one file (possibly after eliminating #include); also see [1, p.77]:

extern char A[];    /* declaration in header file */
char A[] = "hello"; /* definition in main file that #includes header */

Storage allocation for permanent objects takes place at compile time. The storage for auto and register objects is automatically allocated during execution (i.e. dynamically) the moment they come into existence (on the stack and in a register resp.). Dynamic storage allocation can also be done under control of the program by using such library functions as alloc() (neither on the stack nor in registers, but in a separate area).

Strings (in the sense of double quoted character sequences, not char *str) are strange objects. A string is an expression that has as result a pointer to a character sequence terminated by `\0'; storage is allocated at compile time (not dynamically). A string is, thus, an expression with a side effect. The implicitly allocated storage has storage class static, with no identifier associated at all (fully hidden, as static as possible; also see Constants). Notice that the expression

"" == ""

equals 0, because two distinct strings are created, and, hence, two different pointers are returned; the values pointed to, however, are the same (as reported by strcmp). (ANSI C no longer guarantees this: identical strings may share storage, in which case the comparison yields 1.) Keep in mind that it is quite possible to change the value of a string, although this is usually not intended, and a dangerous practice. Consider for example the procedure say():

say(b)
int b;
{
	char *p;

	p = "abcde";
	if (b) p[3] = 'i';
	printf(p);
}

After the first invocation of say(1), all succeeding invocations will print `abide' regardless of b's value. Some implementations may, however, put the values of strings in read-only memory segments.

References

[1] Brian W. Kernighan and Dennis M. Ritchie, The C Programming Language, Prentice-Hall, Inc., 1978.
[2] Willem Wakker, The ACE Compilers for MC68000, Associated Computer Experts bv, Amsterdam, August 22, 1983.
[3] Dennis M. Ritchie, ``The C Programming Language: Reference Manual,'' in The UNIX Programmer's Manual, Seventh Edition, Vol. 2, January 1979; especially see ``Recent Changes to C.''
[4] Brian W. Kernighan and Dennis M. Ritchie, ``UNIX Programming: Second Edition,'' in The UNIX Programmer's Manual, Seventh Edition, Vol. 2, January 1979.
[5] Samuel P. Harbison and Guy L. Steele Jr. (Tartan Laboratories), C: A Reference Manual, Prentice-Hall, 1984.

Tom Verhoeff
Department of Mathematics and Computing Science
Eindhoven University of Technology
Eindhoven
The Netherlands
