Xterm copy and paste

Something is wrong with the selection and paste mechanism of xterm (in xterm-271; fixed in xterm-279). In a UTF-8 locale, when running uxterm, copy and paste of accented characters fails in two ways: (i) what is copied is not the same as what was printed, (ii) combining accents are lost.

A demo.

% cat xtest.c
#include <stdio.h>

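/* UTF-8 bytes for U+0061 U+0305 U+030A: a, combining bar, combining ring */
char s[] = "\x61\xcc\x85\xcc\x8a";

int main(void) {
        /* puts returns a nonnegative value on success; exit with 0 in that case */
        return puts(s) == EOF;
}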
This tiny program xtest prints the single accented character a-bar-ring (with two accents: combining bar and combining ring).

With xterm-271 the output shown is a-ring, and select and paste gives the two bytes 0xc3 0xa5, that is, U+00e5, a precomposed a-ring. That is unfortunate: if the text was grepped from a file, and one searches for the occurrence of this text in the file, no such occurrence is found, because some bytes were changed (the decomposed a plus combining ring turned into the precomposed a-ring) and some bytes were lost (the bar is nowhere to be seen).

This bug is caused by precomposition code in xterm: for each character position a main character and a string of combining diacriticals are stored. When the combination of the main character and one of the combining diacriticals is recognized as a separate Unicode character, the main character is replaced by the precomposed character, and all diacriticals are discarded. Ouch.

Does precomposing have any use? Not for me; it only does harm. Visual information is lost (no more bar), and selection and paste become useless. However, the font designer might have created a nice shape for the combined character, while the decomposed accented character is perhaps produced by general mechanisms that give a less nice result.

Since xterm becomes entirely unusable for linguistic work where such heavily accented characters occur with high frequency, we have to fix this bug. Here is a patch (against xterm-271).

diff -ur xterm-271/precompose.c xterm-271a/precompose.c
--- xterm-271/precompose.c      2007-02-05 02:06:36.000000000 +0100
+++ xterm-271a/precompose.c     2012-05-03 23:57:36.000000000 +0200
@@ -1032,6 +1032,7 @@
 #define UNICODE_SHIFT 21
 
 int do_precomposition(int base, int comb) {
+#if 0
   int min = 0;
   int max = sizeof(precompositions) / sizeof(precompositions[0]) - 1;
   int mid;
@@ -1051,5 +1052,6 @@
     }
   }
   /* no match */
+#endif
   return -1;
 }
(The patch just turns do_precomposition into a dummy, so that no precomposition is done and the text stays as it is.)

Luit

There are further copy and paste problems in xterm caused by the use of luit. That is a potentially harmful program. Never use it unless you know what it does and that you need it.

Luit is a filter used by xterm. It parses its input as if it were ISO 2022, assembles escape sequences, etc. That is what one wants if one really needs ISO 2022 support. If not, most of this is harmless: luit just passes on what it sees. But some of it is harmful: if there was an ESC character in the recent past, luit thinks that it is assembling an escape sequence, and it corrupts non-ASCII UTF-8 sequences.

For me this meant that UTF-8 text would be corrupted in emacs shortly after the use of some ESC-X command (because the ESC puts luit into a state where it thinks that it is assembling an escape sequence, and it does not accept UTF-8 in that state).

Detection: run pstree and check whether the output contains xterm───bash───pstree or xterm───luit───bash───pstree. The relevant xterm options are -lc (use luit), +lc (don't use luit), and -lcc (specify the luit pathname).

Whether xterm uses luit depends on locale and xterm version (and compilation flags).