Thursday, 26 April 2012

Keywords, Identifiers and Constants


Keywords

C keeps a small set of keywords for its own use. These keywords cannot be used as identifiers in the program — a common restriction with modern languages. Where users of Old C may be surprised is in the introduction of some new keywords; if those names were used as identifiers in previous programs, then the programs will have to be changed. It will be easy to spot, because it will provoke your compiler into telling you about invalid names for things. Here is the list of keywords used in Standard C; you will notice that none of them use upper-case letters.
auto        double      int         struct
break       else        long        switch
case        enum        register    typedef
char        extern      return      union
const       float       short       unsigned
continue    for         signed      void
default     goto        sizeof      volatile
do          if          static      while
Table: Keywords
The new keywords that are likely to surprise old programmers are: const, signed, void and volatile (although void has been around for a while). Eagle-eyed readers may have noticed that some implementations of C used to use the keywords entry, asm, and fortran. These are not part of the Standard, and few will mourn them.

Identifiers

Identifier is the fancy term used to mean ‘name’. In C, identifiers are used to refer to a number of things: we've already seen them used to name variables and functions. They are also used to give names to some things we haven't seen yet, amongst which are labels and the ‘tags’ of structures, unions, and enums.
The rules for the construction of identifiers are simple: you may use the 52 upper and lower case alphabetic characters, the 10 digits and finally the underscore ‘_’, which is considered to be an alphabetic character for this purpose. The only restriction is the usual one; identifiers must start with an alphabetic character.
Although there is no restriction on the length of identifiers in the Standard, this is a point that needs a bit of explanation. In Old C, as in Standard C, there has never been any restriction on the length of identifiers. The problem is that there was never any guarantee that more than a certain number of characters would be checked when names were compared for equality—in Old C this was eight characters, in Standard C this has changed to 31.
So, practically speaking, the new limit is 31 characters—although identifiers may be longer, they must differ in the first 31 characters if you want to be sure that your programs are portable. The Standard allows for implementations to support longer names if they wish to, so if you do use longer names, make sure that you don't rely on the checking stopping at 31.
One of the most controversial parts of the Standard is the length of external identifiers. External identifiers are the ones that have to be visible outside the current source code file. Typical examples of these would be library routines or functions which have to be called from several different source files.
The Standard chose to stay with the old restrictions on these external names: they are not guaranteed to be different unless they differ from each other in the first six characters. Worse than that, upper and lower case letters may be treated the same!
The reason for this is a pragmatic one: the way that most C compilation systems work is to use operating system specific tools to bind library functions into a C program. These tools are outside the control of the C compiler writer, so the Standard has to impose realistic limits that are likely to be possible to meet. There is nothing to prevent any specific implementation from giving better limits than these, but for maximum portability the six monocase characters must be all that you expect. The Standard warns that it views both the use of only one case and any restriction on the length of external names to less than 31 characters as obsolescent features. A later standard may insist that the restrictions are lifted; let's hope that it is soon.

Constants

1. Integer constants

The normal integral constants are obvious: things like 1, 1034 and so on. You can put l or L at the end of an integer constant to force it to be long. To make the constant unsigned, one of u or U can be used to do the job.
Integer constants can be written in hexadecimal by preceding the constant with 0x or 0X and using the upper or lower case letters a to f in the usual way.
Be careful about octal constants. They are indicated by starting the number with 0 and only using the digits 0 to 7. It is easy to write 015 by accident, or out of habit, and not to realize that it is not in decimal. The mistake is most common with beginners, because experienced C programmers already carry the scars.
The Standard has now invented a new way of working out what type an integer constant is. In the old days, if the constant was too big for an int, it got promoted to a long (without warning). Now, the rule is that a plain decimal constant will be fitted into the first in this list

int   long   unsigned long
that can hold the value.

Plain octal or hexadecimal constants will use this list

int   unsigned int   long   unsigned long

If the constant is suffixed by u or U:

unsigned int   unsigned long

If it is suffixed by l or L:

long   unsigned long

and finally, if it is suffixed by both u or U and l or L, it can only be an unsigned long.
All that was done to try to give you ‘what you meant’; what it does mean is that it is hard to work out exactly what the type of a constant expression is if you don't know something about the hardware. Hopefully, good compilers will warn when a constant is promoted up to another length and the U or L etc. is not specified.
A nasty bug hides here:
printf("value of 32768 is %d\n", 32768);
On a 16-bit two's complement machine, 32768 will be a long by the rules given above. But printf is only expecting an int as an argument (the %d indicates that). The type of the argument is just wrong. For the ultimate in safety-conscious programming, you should cast such cases to the right type:
printf("value of 32768 is %d\n", (int)32768);
It might interest you to note that there are no negative constants; writing -23 is an expression involving a positive constant and an operator.
Character constants actually have type int (for historical reasons) and are written by placing a sequence of characters between single quote marks:
'a'
'b'
'like this'
Wide character constants are written just as above, but prefixed with L:
L'a'
L'b'
L'like this'
Regrettably it is valid to have more than one character in the sequence, giving a machine-dependent result. Single characters are the best from the portability point of view, resulting in an ordinary integer constant whose value is the machine representation of the single character. The introduction of extended characters may cause you to stumble over this by accident; if '<a>' is a multibyte character (encoded with a shift-in shift-out around it) then '<a>' will be a plain character constant, but containing several characters, just like the more obvious 'abcde'. This is bound to lead to trouble in the future; let's hope that compilers will warn about it.
To ease the way of representing some special characters that would otherwise be hard to get into a character constant (or hard to read; does ' ' contain a space or a tab?), there is what is called an escape sequence which can be used instead. Table 2.10 shows the escape sequences defined in the Standard.
Sequence    Represents
\a          audible alarm
\b          backspace
\f          form feed
\n          newline
\r          carriage return
\t          tab
\v          vertical tab
\\          backslash
\'          quote
\"          double quote
\?          question mark
Table 2.10. C escape sequences
It is also possible to use numeric escape sequences to specify a character in terms of the internal value used to represent it. These take the form of either \ooo or \xhhhh, where ooo is up to three octal digits and hhhh is any number of hexadecimal digits. A common example is '\033', which is used by those who know that on an ASCII based machine, octal 33 is the ESC (escape) code. Beware that the hexadecimal version will absorb any number of valid following hexadecimal digits; if you want a string containing the character whose value is hexadecimal ff followed by a letter f, then the safe way to do it is to use the string joining feature:
"\xff" "f"
The string
"\xfff"
only contains one character, with all three of the fs eaten up in the hexadecimal sequence.
Some of the escape sequences aren't too obvious, so a brief explanation is needed. To get a single quote as a character constant you type '\'', to get a question mark you may have to use '\?'; not that it matters in that example, but to get two of them in there you can't use '??', because the sequence ??' is a trigraph! You would have to use '\?\?'. The escape \" is only necessary in strings, which will come later.
There are two distinct purposes behind the escape sequences. It's obviously necessary to be able to represent characters such as single quote and backslash unambiguously: that is one purpose. The second purpose applies to the following sequences which control the motions of a printing device when they are sent to it, as follows:
\a
Ring the bell if there is one. Do not move.
\b
Backspace.
\f
Go to the first position on the ‘next page’, whatever that may mean for the output device.
\n
Go to the start of the next line.
\r
Go back to the start of the current line.
\t
Go to the next horizontal tab position.
\v
Go to the start of the line at the next vertical tab position.
For \b, \t and \v, if there is no such position, the behaviour is unspecified. The Standard carefully avoids mentioning the physical directions of movement of the output device which are not necessarily the top to bottom, left to right movements common in Western cultural environments.
It is guaranteed that each escape sequence has a unique integral value which can be stored in a char.

2. Real constants

These follow the usual format:
1.0
2.
.1
2.634
.125
2.e5
2.e+5
.125e-3
2.5e5
3.1E-6
and so on. For readability, even if part of the number is zero, it is a good idea to show it:
1.0
0.1
The exponent part shows the number of powers of ten that the rest of the number should be raised to, so
3.0e3
is equivalent in value to the integer constant
3000
As you can see, the e can also be E. These constants all have type double unless they are suffixed with f or F to mean float or l or L to mean long double.
For completeness, here is the formal description of a real constant:
A real constant is one of:
  • fractional constant followed by an optional exponent.
  • digit sequence followed by an exponent.
In either case followed by an optional one of f, l, F or L, where:
  • fractional constant is one of:
    • An optional digit sequence followed by a decimal point followed by a digit sequence.
    • digit sequence followed by a decimal point.
  • An exponent is one of
    • e or E followed by an optional + or - followed by a digit sequence.
  • digit sequence is an arbitrary combination of one or more digits.

3. Character constants

A character is a single letter from an alphabet. C defines two basic alphabets; one in which the source code is written and the one which is used when the program is run. They are usually the same, but they do not have to be. If they are different, it is the compiler's job to translate character constants from the source code alphabet to the runtime alphabet.
The basic C alphabet contains the following characters:
  • upper and lower case A-Z,
  • decimal digits 0-9,
  • the space,
  • horizontal tab,
  • vertical tab,
  • newline,
  • backspace,
  • carriage return,
  • alert and form feed characters
and the following symbol characters:
    ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
Each character can be represented by a number. The numbers representing the digits 0-9 are 10 consecutive integers. If for example the digit '0' was represented by the number 100, then '1' would be 101, '2' would be 102 and so on.
Note that the C standard itself does not give the actual mappings from the characters to the numbers representing them. In practice most implementations will use the ASCII standard to map characters to numbers and back; some systems from the IBM mainframe and minicomputer world use the EBCDIC standard.
The characters discussed so far are the basic characters of the alphabets. The basic characters are guaranteed to fit inside a char type. In addition to the basic characters there are extended characters that are locale specific (they are language, culture and nationality specific).
Each implementation will define what extended characters are supported. These extended characters may not fit inside a single char. There is a "wide character" type called wchar_t, defined in stddef.h, that is used to hold these characters, which may potentially be more than one byte.
