"Wget escapes the character ‘/’ and the control characters in the ranges 0–31 and 128–159. This is the default on Unix-like operating systems."

wget failure to handle UTF-8 filenames

The beautiful program wget works fine with ASCII filenames. But it fails terribly when the environment is UTF-8.

I just downloaded 50000 files. Firefox shows the correct filenames, but wget produces gibberish. For example, the filename 诘棋总动员 is saved as ??%98??%8B?%80??%8A??%91%98. The original is Unicode U+8bd8 U+68cb U+603b U+52a8 U+5458, that is, UTF-8 e8 af 98 e6 a3 8b e6 80 bb e5 8a a8 e5 91 98, but the saved filename consists of the hex bytes e8 af 25 39 38 e6 a3 25 38 42 e6 25 38 30 bb e5 25 38 41 a8 e5 25 39 31 25 39 38. We see that the bytes 80, 8a, 8b, 91, 98 have become 25 38 30, 25 38 41, 25 38 42, 25 39 31, 25 39 38, that is, the ASCII strings %80, %8A, %8B, %91, %98. Ach.

The question marks appear because the remaining byte sequences are no longer valid UTF-8, and the resulting filenames cannot be used on this system.
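To see the damage reproduced, here is a minimal Python sketch of the escaping rule as the documentation quoted at the top describes it. The function name escape_like_wget is my own invention, and this is an imitation of the documented behaviour, not wget's actual code. It escapes '/', the bytes 0-31 and the bytes 128-159 as %XX with uppercase hex, and leaves everything else alone.

```python
def escape_like_wget(name: bytes) -> bytes:
    """Percent-escape '/', bytes 0-31 and bytes 128-159; copy all other bytes.

    Hypothetical helper that imitates the documented default on Unix-like
    systems. It works on raw bytes, with no knowledge of UTF-8.
    """
    out = bytearray()
    for b in name:
        if b == 0x2F or b <= 0x1F or 0x80 <= b <= 0x9F:
            out += b"%%%02X" % b   # e.g. 0x98 becomes the three bytes "%98"
        else:
            out.append(b)
    return bytes(out)


# The example filename, written with escapes so that the encoding of this
# source file does not matter.
original = "\u8bd8\u68cb\u603b\u52a8\u5458".encode("utf-8")
mangled = escape_like_wget(original)

print(original.hex(" "))
# e8 af 98 e6 a3 8b e6 80 bb e5 8a a8 e5 91 98
print(mangled.hex(" "))
# e8 af 25 39 38 e6 a3 25 38 42 e6 25 38 30 bb e5 25 38 41 a8 e5 25 39 31 25 39 38
```

The second line of output is exactly the byte sequence of the saved filename quoted above. Five of the fifteen bytes happen to fall in the range 0x80-0x9F, and each of them takes a complete character down with it.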

Fix after the fact

I wrote a small utility wgetfix.c that recursively fixes the corrupted filenames. This is useful if you discover this wget problem after the fact.
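For readers who prefer a script, the idea behind such a repair can be sketched in a few lines of Python. This is not wgetfix.c and makes no claim to do what it does; the names unescape_name and fix_tree are mine, and the sketch assumes that every %XX with XX in 80-9F (uppercase hex) was put there by wget. A file whose real name contained such a sequence literally would be renamed wrongly, so the default is a dry run that only reports what it would do.

```python
import os
import re

# %XX where XX is an uppercase hex byte in the range 80-9F.
_ESCAPED = re.compile(rb"%([89][0-9A-F])")


def unescape_name(name: bytes) -> bytes:
    """Turn each %80..%9F escape back into the single byte it stands for."""
    return _ESCAPED.sub(lambda m: bytes([int(m.group(1), 16)]), name)


def fix_tree(root: bytes, dry_run: bool = True) -> list:
    """Walk root bottom-up and rename entries that unescape to valid UTF-8.

    Returns the list of (old_path, new_path) pairs. With dry_run=True
    nothing on disk is touched.
    """
    renames = []
    # Bottom-up, so the contents of a directory are handled before the
    # directory itself is renamed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            fixed = unescape_name(name)
            if fixed == name:
                continue
            try:
                fixed.decode("utf-8")
            except UnicodeDecodeError:
                continue  # not a clean repair, leave this one alone
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, fixed)
            if os.path.exists(dst):
                continue  # never overwrite an existing file
            renames.append((src, dst))
            if not dry_run:
                os.rename(src, dst)
    return renames
```

Passing root as bytes (for example fix_tree(b"mirror")) makes os.walk return byte strings, which matters here because the damaged names are not valid UTF-8. Look at the dry-run list first, and only then call it again with dry_run=False.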

A workaround

Read the docs. Aha, wget will by default destroy filenames, but leaves them alone if one gives the option --restrict-file-names=nocontrol. That is,

wget -r -np -nc URL

will produce junk if URL uses UTF-8 filenames and the local system is UTF-8 as well (as is rather common nowadays), and things are better with

wget -r -np -nc --restrict-file-names=nocontrol URL

(Reminds me of the old days, when ftp would by default destroy the files copied, and one had to say BINARY to get them undamaged.)

The name nocontrol is a misnomer. Escaping genuine control characters is not so bad, since they may well cause problems. But on a UTF-8 system the byte values 128-159, that is 0x80-0x9F, are not control characters but parts of ordinary multibyte characters, and escaping them is a bad idea.

The future

Sooner or later the wget maintainers will discover that not all the world is American and that some people use non-ASCII symbols. The proper behaviour is for wget to see whether the local system is UTF-8 (e.g., whether LC_CTYPE ends in .UTF8 or .UTF-8) and if so to omit the escaping of 0x80-0x9F.
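The test I have in mind is not difficult. Here is a rough Python sketch; the function name locale_is_utf8 is made up, and the env parameter exists only so the function can be tried without touching the real environment. It goes slightly beyond the LC_CTYPE check described above by also honouring the usual POSIX precedence, where LC_ALL overrides LC_CTYPE and LC_CTYPE overrides LANG.

```python
import os


def locale_is_utf8(env=None) -> bool:
    """Guess from the environment whether the character encoding is UTF-8.

    Hypothetical helper. Uses the first of LC_ALL, LC_CTYPE, LANG that is
    set and non-empty, and checks whether it ends in .UTF8 or .UTF-8
    (case-insensitively, ignoring a trailing @modifier).
    """
    env = os.environ if env is None else env
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            base = value.split("@", 1)[0]   # drop a modifier such as @euro
            return base.upper().endswith((".UTF8", ".UTF-8"))
    return False


print(locale_is_utf8({"LC_CTYPE": "zh_CN.UTF-8"}))            # True
print(locale_is_utf8({"LANG": "en_US.utf8"}))                 # True
print(locale_is_utf8({"LC_ALL": "C", "LANG": "en_US.UTF-8"}))  # False
```

A C program such as wget would more naturally ask the C library, via setlocale and nl_langinfo(CODESET), instead of parsing environment variables itself, but the decision is the same: if the answer is UTF-8, leave 0x80-0x9F alone.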