The Unicode standard

The Unicode Standard is the specification of an encoding scheme for written characters and text. It is a universal standard that enables consistent encoding of multilingual text and allows text data to be interchanged internationally without conflict. The ISO standards for C and C++ refer to ISO/IEC 10646-1:2000, Information Technology--Universal Multiple-Octet Coded Character Set (UCS). (The term octet is used by ISO to refer to a byte.) The ISO/IEC 10646 standard is more restrictive than the Unicode Standard in the number of encoding forms: a character set that conforms to ISO/IEC 10646 is also conformant to the Unicode Standard.

The Unicode Standard specifies a unique numeric value and name for each character and defines three encoding forms for the bit representation of the numeric value. The name/value pair creates an identity for a character. The hexadecimal value representing a character is called a code point. The specification also describes overall character properties, such as case, directionality, alphabetic properties, and other semantic information for each character. Modeled on ASCII, the Unicode Standard treats alphabetic characters, ideographic characters, and symbols, and allows implementation-defined character codes in reserved code point ranges. The encoding scheme of the Unicode Standard is therefore sufficiently flexible to handle all known character encoding requirements, including coverage of historical scripts from any country in the world.
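The relationship between a code point and the three encoding forms can be made concrete by counting how many code units each form needs for a given value. The following is a minimal sketch, not part of XL C/C++; the function names are hypothetical, and the ranges assume the UTF-8, UTF-16, and UTF-32 encoding forms as defined by the Unicode Standard.

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if cp is a Unicode scalar value, that is,
   at most 10FFFF and outside the surrogate range D800 through DFFF. */
static int is_scalar_value(unsigned long cp)
{
    return cp <= 0x10FFFFUL && !(cp >= 0xD800UL && cp <= 0xDFFFUL);
}

/* Number of 8-bit code units UTF-8 needs for cp (0 if cp is invalid). */
size_t utf8_units(unsigned long cp)
{
    if (!is_scalar_value(cp)) return 0;
    if (cp < 0x80UL)    return 1;
    if (cp < 0x800UL)   return 2;
    if (cp < 0x10000UL) return 3;
    return 4;
}

/* Number of 16-bit code units UTF-16 needs for cp (0 if cp is invalid).
   Values above FFFF are represented by a surrogate pair. */
size_t utf16_units(unsigned long cp)
{
    if (!is_scalar_value(cp)) return 0;
    return cp < 0x10000UL ? 1 : 2;
}

/* UTF-32 always uses one 32-bit code unit (0 if cp is invalid). */
size_t utf32_units(unsigned long cp)
{
    return is_scalar_value(cp) ? 1 : 0;
}
```

For example, code point 1234 takes three code units in UTF-8 but a single code unit in both UTF-16 and UTF-32.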

C99 and C++ allow the universal character name construct defined in ISO/IEC 10646 to represent characters outside the basic source character set. Both languages permit universal character names in identifiers, character constants, and string literals. To enable universal character names, you must compile with the c99 invocation command, the -qlanglvl=extc99 or -qlanglvl=stdc99 options or related pragmas (for C), or the -qlanglvl=ucs option (for C and C++).

The following table shows the generic universal character name construct and how it corresponds to the ISO/IEC 10646 short name.

Universal character name    ISO/IEC 10646 short name
\UNNNNNNNN                  NNNNNNNN
\uNNNN                      0000NNNN

where N is a hexadecimal digit
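The correspondence in the table can be sketched as a pair of formatting routines. This is an illustration only; format_ucn and format_short_name are hypothetical names, not functions provided by the compiler or the C library.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the universal character name for cp into buf.
   The four-digit form (lowercase u) is used when the value fits in four
   hexadecimal digits; otherwise the eight-digit form (uppercase U) is used.
   Returns the number of characters written, or -1 if buf is too small. */
int format_ucn(unsigned long cp, char *buf, size_t size)
{
    int n;
    if (cp <= 0xFFFFUL)
        n = snprintf(buf, size, "\\u%04lX", cp);
    else
        n = snprintf(buf, size, "\\U%08lX", cp);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Hypothetical helper: writes the eight-digit ISO/IEC 10646 short name
   for cp into buf. Returns the number of characters written, or -1 if
   buf is too small. */
int format_short_name(unsigned long cp, char *buf, size_t size)
{
    int n = snprintf(buf, size, "%08lX", cp);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

With these helpers, code point 1234 is written as \u1234 and has the short name 00001234, matching the second row of the table.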

C99 and C++ disallow universal character names whose hexadecimal values represent characters in the basic character set (base source code set) or code points reserved by ISO/IEC 10646 for control characters.

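The following characters are also disallowed: any character whose short identifier is less than 00A0, with the exceptions of 0024 ($), 0040 (@), and 0060 (`), and any character whose short identifier is in the code point range D800 through DFFF inclusive.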
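The restrictions on universal character names can be sketched as a predicate. The ranges below are an assumption taken from the constraint in the C99 standard (section 6.4.3), not from this page, and ucn_allowed is a hypothetical name.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if code point cp may be written as a
   universal character name, assuming the C99 6.4.3 constraint:
   - values below 00A0 are rejected, except 0024 ($), 0040 (@), 0060 (`);
   - values in the range D800 through DFFF are rejected. */
bool ucn_allowed(unsigned long cp)
{
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return false;
    if (cp < 0x00A0UL)
        return cp == 0x0024UL || cp == 0x0040UL || cp == 0x0060UL;
    return true;
}
```

Under this assumption, \u0041 (the letter A, part of the basic character set) is rejected, while \u0024 and \u1234 are accepted.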

IBM extension

UTF literals

The C Standards Committee has approved the implementation of u-literals and U-literals to support Unicode UTF-16 and UTF-32 character literals, respectively. To enable support for UTF literals in your source code, you must compile with the option -qutf enabled. The following table shows the syntax for UTF literals.

Table 11. UTF literals
Syntax                   Explanation
u'character'             Denotes a UTF-16 character.
u"character-sequence"    Denotes an array of UTF-16 characters.
U'character'             Denotes a UTF-32 character.
U"character-sequence"    Denotes an array of UTF-32 characters.
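The four forms in the table can be seen side by side in a short sketch. It assumes a compiler that accepts u-literals and U-literals (with XL C/C++ this means compiling with -qutf, as described above) and stores the values in uint_least16_t and uint_least32_t objects. The variable names and the ELEMENTS macro are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro: number of elements in an array, including the
   terminating zero element of a string literal. */
#define ELEMENTS(a) (sizeof (a) / sizeof (a)[0])

/* A single UTF-16 character and an array of UTF-16 characters. */
static const uint_least16_t ch16 = u'A';
static const uint_least16_t str16[] = u"A\u1234";

/* A single UTF-32 character and an array of UTF-32 characters. */
static const uint_least32_t ch32 = U'A';
static const uint_least32_t str32[] = U"A\u1234";
```

Each array holds two characters plus a terminating zero element, so ELEMENTS(str16) and ELEMENTS(str32) are both 3, and element 1 of each array holds the value 0x1234.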

XL C/C++ uses the types uint_least16_t and uint_least32_t, which are defined in the header file stdint.h, as the data types for UTF-16 and UTF-32 characters. The following example defines an array of uint_least16_t that includes the characters represented by the hexadecimal code points 1234 and 8180:

#include <stdint.h>

uint_least16_t msg[] = u"ucs characters \u1234 and \u8180 ";

String concatenation of u-literals

The u-literals and U-literals follow the same concatenation rule as wide character literals: an ordinary character string literal is widened when it is concatenated with a u-literal or U-literal. The following table shows the allowed combinations. All other combinations are invalid.

Combination    Result
u"a" u"b"      u"ab"
u"a" "b"       u"ab"
"a" u"b"       u"ab"

U"a" U"b"      U"ab"
U"a" "b"       U"ab"
"a" U"b"       U"ab"

Multiple concatenations are allowed, with these rules applied recursively.
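The concatenation rules can be checked with a small sketch that compares each allowed u-literal combination against its expected result. It assumes a compiler that accepts u-literals (with XL C/C++, compile with -qutf); u16_equal and the array names are hypothetical. The U-literal combinations behave analogously with uint_least32_t.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: compares two zero-terminated UTF-16 strings
   code unit by code unit. */
bool u16_equal(const uint_least16_t *a, const uint_least16_t *b)
{
    size_t i = 0;
    while (a[i] != 0 && a[i] == b[i])
        i++;
    return a[i] == b[i];
}

/* The three allowed u-literal combinations from the table. */
static const uint_least16_t both_u[]  = u"a" u"b";
static const uint_least16_t u_first[] = u"a" "b";
static const uint_least16_t u_last[]  = "a" u"b";
static const uint_least16_t expected[] = u"ab";

/* Multiple concatenation: the rule is applied to each adjacent pair. */
static const uint_least16_t chained[] = "a" u"b" "c";
static const uint_least16_t chained_expected[] = u"abc";
```

Calling u16_equal(both_u, expected), u16_equal(u_first, expected), and u16_equal(u_last, expected) each returns true, as does u16_equal(chained, chained_expected).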

End of IBM extension