Unicode Normalization

Equivalence of sequences of Unicode values

Often a character can be represented in Unicode by several code sequences. For example, the GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI can be written as U+1F85 (ᾅ), or U+1F05 U+0345 (ᾅ), or U+1F81 U+0301 (ᾅ), ..., or U+03B1 U+0314 U+0301 U+0345 (ᾅ), where the start of the sequence gives the main symbol, and the following unicode values indicate combining diacritics. All such sequences representing the same character are called canonically equivalent.
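The equivalence of the spellings listed above can be checked mechanically. The following is a minimal sketch using Python's standard unicodedata module; the helper name canonically_equivalent is made up for this illustration, and it relies on the fact that two strings are canonically equivalent exactly when their fully decomposed forms are identical.

```python
import unicodedata

# The four spellings of the same character given in the text.
spellings = [
    "\u1F85",
    "\u1F05\u0345",
    "\u1F81\u0301",
    "\u03B1\u0314\u0301\u0345",
]

def canonically_equivalent(a, b):
    """Hypothetical helper: compare the fully decomposed (NFD) forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Compare every spelling against the first one.
all_equivalent = all(canonically_equivalent(spellings[0], s) for s in spellings)
print(all_equivalent)
```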

To be more precise, the combining classes for the combining diacritics U+0314, U+0301, U+0345 in this example are 230, 230, 240, respectively. These classes tend to indicate the position of the diacritic (above, below, ...) and it is assumed that diacritics in different positions can be ordered arbitrarily, while the order of diacritics in the same position is significant. Thus, U+03B1 U+0314 U+0301 U+0345 and U+03B1 U+0314 U+0345 U+0301 and U+03B1 U+0345 U+0314 U+0301 are equivalent, but U+03B1 U+0314 U+0301 U+0345 and U+03B1 U+0301 U+0314 U+0345 are not. The latter is equivalent to U+1FB4 U+0314 (ᾴ̔).
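The ordering rule can be tried out directly. This sketch (Python, standard unicodedata module; the variable names are chosen only for this example) looks up the combining classes and compares the sequences mentioned above after canonical reordering.

```python
import unicodedata

def nfd(s):
    """Fully decompose and put combining marks in canonical order."""
    return unicodedata.normalize("NFD", s)

# Canonical combining classes of the three diacritics in the example.
classes = [unicodedata.combining(m) for m in ("\u0314", "\u0301", "\u0345")]

a = "\u03B1\u0314\u0301\u0345"
b = "\u03B1\u0314\u0345\u0301"  # U+0345 (class 240) moved past U+0301
c = "\u03B1\u0345\u0314\u0301"  # U+0345 moved to the front
d = "\u03B1\u0301\u0314\u0345"  # the two class-230 marks swapped
e = "\u1FB4\u0314"

print(classes)
print(nfd(a) == nfd(b) == nfd(c))  # marks of different classes may be reordered
print(nfd(a) == nfd(d))            # marks of the same class may not
print(nfd(d) == nfd(e))
```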

Preferred representatives of an equivalence class

In each equivalence class of canonically equivalent Unicode sequences (all representing the same character), the Unicode Consortium singles out two representatives, Normalization Form C (NFC) and Normalization Form D (NFD). Roughly speaking, NFC is the short form, fully composed, like U+1F85, and NFD is the long form, fully decomposed, in a well-defined order, like U+03B1 U+0314 U+0301 U+0345. (These are the two non-lossy normal forms. There are also lossy normalizations, NFKC and NFKD, not suitable for storage, but possibly suitable for search and comparison.) For full details, see Unicode Normalization Forms.
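As an illustration, here is a small Python sketch that computes both representatives for one of the mixed spellings and prints their code points. The helper codepoints is an invention of this example. The last lines use the ligature U+FB01 to show why a compatibility normalization such as NFKC counts as lossy: the ligature is replaced by two plain letters and cannot be recovered.

```python
import unicodedata

def codepoints(s):
    """Hypothetical helper: render a string as 'U+XXXX U+XXXX ...'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

mixed = "\u1F81\u0301"  # neither fully composed nor fully decomposed

nfc = unicodedata.normalize("NFC", mixed)
nfd = unicodedata.normalize("NFD", mixed)
print(codepoints(nfc))
print(codepoints(nfd))

# NFC leaves the ligature alone; NFKC rewrites it as 'f' followed by 'i'.
ligature = "\uFB01"
kept = unicodedata.normalize("NFC", ligature)
lossy = unicodedata.normalize("NFKC", ligature)
print(codepoints(kept), "/", codepoints(lossy))
```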

Text encoding - NFC or NFD?

In any nontrivial project, it is a good idea to standardize the data representation. Unfortunately, the Unicode Consortium provides us with two standards, and we have to choose. Does one have advantages over the other? This is the NFC-vs-NFD question.

"There is no difference"

This SIL page answers the question as follows:

The question is not reasonable, and points to a misunderstanding of Unicode. This misunderstanding has spawned a number of myths and led to debates such as the above.
In fact, Unicode declares that there is an equivalence relationship between decomposed and composed sequences, and conformant software should not treat canonically equivalent sequences, whether composed or decomposed or something inbetween, as different.

Unicode further states that software is free to change the character stream from one representation to another, i.e., to decompose, compose and/or re-order, whenever it wants. The only requirement is that the resultant string is canonically equivalent to the original.

So the original question is not reasonable: A team might think they can choose to use NFD (decomposed) for their data, but software just might change the data — and it doesn't even have to say it is doing this, because (by definition) this does not change the meaning of the encoded data in any way.

It is inappropriate to speak of standardizing on one particular representation such as NFD or NFC except in the context of a specific text process or data interchange format.

In the same way that searching or spellchecking may be simpler if the data is normalized first, it may be that keyboard design, or font design, or other user interface elements may be easier to implement if, for that specific process, a particular normal form can be assumed. But this does not imply that the data must always be maintained in that form; it may be transparently transformed to other equivalent representations for other purposes.

This answer says that there is no semantic difference between the two representations, that they encode the same symbols. That is true, and 666 and DCLXVI encode the same number, but I surely hope that no software will silently change one into the other. Changing data is always a very bad idea. It leads to data loss, even when you think both forms are equivalent.

The Wikipedia page Unicode normalisation says today:

In one specific instance, the combination of OS X errors handling composed characters, and the samba file- and printer-sharing software (which replaces decomposed letters with composed ones when copying file names), has led to confusing and data-destroying interoperability problems. Applications may avoid such errors by preserving input code points, and only normalizing them to the application's preferred normal form for internal use.
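The advice in the last sentence of this quotation, to preserve the input code points and normalize only for internal use, might look as follows in practice. This is only a sketch: the class NormalizedIndex and its methods are made up for this example, and the choice of NFC for the internal key is arbitrary.

```python
import unicodedata

class NormalizedIndex:
    """Hypothetical lookup table treating canonically equivalent keys as equal.

    Keys are stored exactly as received. Only the private lookup key is
    normalized, and it is never handed back to the caller.
    """

    def __init__(self):
        self._items = {}  # internal NFC key -> (original key, value)

    def put(self, key, value):
        self._items[unicodedata.normalize("NFC", key)] = (key, value)

    def get(self, key):
        """Return (key as originally stored, value), or None if absent."""
        return self._items.get(unicodedata.normalize("NFC", key))

index = NormalizedIndex()
index.put("O\u0308l", "oil")             # stored in decomposed form
original, value = index.get("\u00D6l")   # looked up in composed form
print(original == "O\u0308l", value)
```

The lookup succeeds although the two spellings differ, and the string handed back is the one that was put in, code point for code point.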

"Choose NFC if possible"

The prevailing opinion seems to be that choosing NFC is a good idea.

The TEI P5 Guidelines say:

The Unicode Consortium provides four standard normalization forms, of which the Normalization Form C (NFC) seems to be most appropriate for text encoding projects.

The above-cited SIL page recommends:

Output data that may become input to unknown processes in NFC.
If you have an option, archive in XML/NFC.

The Linux Unicode FAQ says:

NFC is the preferred form for Linux and WWW.

The WWW Character Model says:

NFC has the advantage that almost all legacy data (if transcoded trivially, one-to-one, to a Unicode encoding) as well as data created by current software is already in this form; NFC also has a slight compactness advantage and a better match to user expectations with respect to the character vs. grapheme issue. This document therefore chooses NFC as the base for Web-related early normalization.

The WWW argument here seems to be that there is a lot of legacy ISO-8859-1 text, and conversion to Unicode is easiest if these ISO-8859-1 values remain single units and are not decomposed. Note that WWW only hopes for NFC, but does not require it:

In 2004/2005, the Internationalization Working Group decided that early uniform normalization was dead and that requiring normalization of content (such that applications could assume that content was already normalized) was no longer a reasonable position for Charmod. ... HTML5 does not require NFC. (May 2011)

There is another argument in favor of NFC: it is easier to create and use fonts with precomposed characters. Shaping code is needed for many non-Western applications, but precomposed characters suffice in the West. If nobody normalized this page, you will probably see differences in the rendering of the accented Greek alpha above.

"NFD is far superior"

As long as the code is some opaque string of bytes it does not matter what one uses. But as soon as one wants to do something, e.g. compare text with and without length marks, or with and without accents, the job is trivial in NFD and requires largish tables in NFC.
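To make this concrete, here is a sketch of an accent-insensitive comparison in Python. The function names are invented for this example. With decomposed text the job reduces to dropping every code point with a nonzero combining class; the call to normalize is only there so that the sketch also accepts input that is not yet in NFD.

```python
import unicodedata

def strip_marks(text):
    """Hypothetical helper: remove all combining marks from text."""
    decomposed = unicodedata.normalize("NFD", text)  # no-op for NFD input
    # Base characters have combining class 0; diacritics do not.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

def same_ignoring_accents(a, b):
    """Hypothetical helper: compare two strings with diacritics ignored."""
    return strip_marks(a) == strip_marks(b)

print(strip_marks("\u1F85") == "\u03B1")
print(same_ignoring_accents("r\u00E9sum\u00E9", "resume"))
```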

This is somewhat similar to the relation between Chinese/Japanese characters and Latin letters. Whether the code for a word is a single indivisible unit or a sequence that codes information about the elements does not matter as long as one only copies. But as soon as one uses the information in some way (what is the radical? how many strokes? how black is the character? what syllables are there? how can this word be hyphenated?) the monolithic code requires big tables that may not even be available, and the structured code is much easier to use.

For web use where fuzzy search is important, I find that NFD is an order of magnitude faster than NFC, and avoids the need for tables.

Linux

I wondered why the FAQ said ‘NFC is the preferred form for Linux’, and asked some people. Bruno Haible replied:

NFC is preferred, because the W3C recommends the use of NFC normalized text on the Web.

NFC has the highest probability to work in existing programs. For example, still in 2012, KDE's terminal emulator (konsole) drops accents of Latin characters when it receives them in decomposed form:

$ /usr/bin/printf '\u00D6\n'
Ö
$ /usr/bin/printf 'O\u0308\n'
O
For processing of European languages, it allows for simpler software. Without the recommendation for NFC, the adoption of Unicode would have been slower.

xterm

I am an xterm user and noticed three flaws in xterm-271, all related to copy and paste. By far the worst is that this version of xterm recodes, changing symbols to NFC. That means that what one selects from an xterm differs from what was printed, and string search will fail, making this xterm unusable. The second flaw occurs during this recoding to NFC: when a sequence S A B is encountered, where A and B are combining accents, S A has no precomposed form, but S B is equivalent to a single character T, then S A B is replaced by T and A is lost. The third is that when S A B C is encountered, C is lost: at most two combining accents are preserved. For the first two a patch is given here. For the third I changed charproc.c:

-    Ires(XtNcombiningChars, XtCCombiningChars, screen.max_combining, 2),
+    Ires(XtNcombiningChars, XtCCombiningChars, screen.max_combining, 3),

since I have encountered cases with 3 accents, but not yet with 4.
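For comparison, this is what a correct conversion to NFC does with a sequence of the kind described in the second flaw. The choice of characters is mine, not taken from the xterm report: S is the letter o, A is COMBINING MACRON BELOW (U+0331) and B is COMBINING DIAERESIS (U+0308). The sketch uses Python's standard unicodedata module.

```python
import unicodedata

S = "o"
A = "\u0331"  # COMBINING MACRON BELOW, combining class 220
B = "\u0308"  # COMBINING DIAERESIS, combining class 230

# S A has no precomposed form: NFC leaves it as two code points.
sa_unchanged = unicodedata.normalize("NFC", S + A) == S + A

# S B does: it composes to T = U+00F6.
T = unicodedata.normalize("NFC", S + B)

# A correct NFC of S A B composes S with B and keeps A after T.
composed = unicodedata.normalize("NFC", S + A + B)
print(" ".join(f"U+{ord(ch):04X}" for ch in composed))
```

The result is T followed by A. Nothing is lost, and decomposing it again gives back S A B.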

Conversion

The utility uconv will convert from/to NFC/NFD: uconv -f utf8 -t utf8 -x nfc converts stdin to NFC, and uconv -f utf8 -t utf8 -x nfd converts stdin to NFD.
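Where uconv is not installed, a similar filter can be sketched in a few lines of Python. The function name convert is made up for this example, and this is not a reimplementation of uconv. The result also depends on the Unicode version shipped with the Python interpreter.

```python
import io
import unicodedata

def convert(inp, out, form="NFC"):
    """Hypothetical filter: copy text from inp to out in the given normal form."""
    # Read everything at once, so that a combining mark is never
    # separated from its base character by a chunk boundary.
    out.write(unicodedata.normalize(form, inp.read()))

# Demonstration on in-memory streams. To use it as a command-line
# filter, one would pass sys.stdin and sys.stdout instead.
src = io.StringIO("O\u0308l\n")
dst = io.StringIO()
convert(src, dst, "NFC")
result = dst.getvalue()
print(result == "\u00D6l\n")
```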

Filenames

On Linux, by default filenames are not in any particular character set, and all bytes are significant. (Some filesystem types, such as FAT, may behave differently.) Since the encoding of a filename is not known, normalizing would be impossible. On MacOS it seems that some version of NFD (or FCD) is used. The version used is frozen and does not evolve together with the Unicode standard, since disk filenames do not spontaneously change. Old volumes must remain valid. This causes a lot of problems. (Not because something is wrong with NFD, or this version of NFD, but because one should never change data. Filenames must not be normalized.)
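The following Python sketch shows why normalization matters for a filesystem where all bytes are significant. The filename is made up, and UTF-8 is assumed as the encoding. The two spellings of the name look identical on screen but are different byte strings, so on such a filesystem they name two different files.

```python
import unicodedata

name = "\u00D6l.txt"  # hypothetical filename

name_nfc = unicodedata.normalize("NFC", name)
name_nfd = unicodedata.normalize("NFD", name)

# Assumption: names are stored as UTF-8 bytes.
bytes_nfc = name_nfc.encode("utf-8")
bytes_nfd = name_nfd.encode("utf-8")

print(bytes_nfc)
print(bytes_nfd)
print(bytes_nfc == bytes_nfd)
```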

Of course filenames occur in files, and in URLs, in Makefiles and HTML web pages. Changing data, e.g. in text files, to NFC would cause interoperability problems. Always leave data as it is.