
17. Browsers

Web addresses are given by a URL (uniform resource locator). The syntax is specified by RFC 2396. A URL only contains the symbols A-Za-z0-9_-.!~*'()%;:/?@&=+$,#.

Characters outside this set (or inside this set, if desired) are encoded using the %xx triple, where % is the percent character, and the xx is the representation of the byte in hex. A multibyte character is represented by a sequence of %xx triples.
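As an illustration of the %xx encoding, here is a minimal sketch using Python's standard urllib.parse module. The sample strings are made up for the example.

```python
from urllib.parse import quote, unquote

# A character outside the allowed set is replaced by a %xx triple.
with_space = quote("a b")          # space is the byte 0x20
print(with_space)

# A multibyte character becomes one %xx triple per byte
# (quote() uses UTF-8 by default: U+00E9 is the two bytes C3 A9).
encoded = quote("é")
print(encoded)

# Characters inside the set may be encoded as well; decoding restores them.
decoded = unquote("%7Euser%2Fdocs")
print(decoded)
```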

17.1 Unicode

The world is slowly moving to standard character sets. First we had the 7-bit ASCII, good for English, then the 8-bit ISO 8859-1, good for Western Europe, and soon the (approximately) 16-bit Unicode (ISO 10646), good for the world.

What is this approximately? Originally Unicode used 16 bits and ISO 10646 used 32 bits, and the initial 16-bit part of this space (the BMP: Basic Multilingual Plane, Plane 0) coincided with the Unicode assignment. Currently it is agreed to use the range 000000-10FFFF, of slightly over 20 bits. (See also RFC 3629.)

Systems may use their own favorite encoding internally, maybe using a 16-bit encoding with escape characters, or maybe using a 24-bit encoding. But externally everything is represented by bytes, so one needs an encoding of Unicode in bytes.

Now C uses NUL as string terminator, and UNIX uses filenames separated into path components by the slash character / and terminated by NUL, and it would be inconvenient to have a multi-byte representation where NUL or slash might occur internally as part of a larger character. For this reason Ken Thompson devised the representation now known as UTF-8. It has the property that ASCII bytes are encoded by themselves and never occur as part of a multi-byte code. A convenient side effect is that conversion from ASCII to UTF-8 is trivial: nothing needs to be changed.

The UTF-8 code works as follows.

      Unicode range    |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

This Unicode Transformation Format is in principle capable of representing up to 31-bit numbers: 0xxxxxxx is good for 7 bits, 110xxxxx 10xxxxxx is good for 11 bits, 1110xxxx followed by two bytes 10xxxxxx gives 16 bits, 11110xxx followed by three gives 21 bits, 111110xx followed by four gives 26 bits, and 1111110x followed by five gives 31 bits. (And yes, one might even go further.)

Since originally ISO 10646 was to use 31- or 32-bit values, many UTF-8 encoders and decoders were written to also accept and produce these longer sequences.
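The scheme above can be sketched as a small encoder. This is an illustrative implementation of the original 1- to 6-byte scheme, not the code of any particular library; the function name utf8_encode is made up for the example. For values up to 10FFFF it should agree with Python's built-in encoder.

```python
def utf8_encode(value: int) -> bytes:
    """Encode an integer below 2**31 with the original 1- to 6-byte scheme."""
    if value < 0 or value >= 1 << 31:
        raise ValueError("value out of range")
    if value < 0x80:
        return bytes([value])          # ASCII is encoded by itself
    # (sequence length, exclusive upper limit, lead-byte marker)
    for nbytes, limit, lead in (
        (2, 1 << 11, 0xC0),
        (3, 1 << 16, 0xE0),
        (4, 1 << 21, 0xF0),
        (5, 1 << 26, 0xF8),
        (6, 1 << 31, 0xFC),
    ):
        if value < limit:
            break
    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (value & 0x3F))   # continuation byte 10xxxxxx
        value >>= 6
    return bytes([lead | value] + tail[::-1])

print(utf8_encode(0x20AC).hex())        # the euro sign, 3 bytes
print(len(utf8_encode(0x7FFFFFFF)))     # largest 31-bit value, 6 bytes
```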

Security aspects

Accessing the parent directory

Web servers present a public tree, but must not allow access to the rest of the system. In the old days, a very simple trick was using URLs involving /../ or \..\ to walk up to a parent directory.

Of course, servers nowadays check. But how well do they check? .. is not forbidden, as long as one remains inside the WWW tree. But if a server thinks that it suffices that during a pathname scan at no point the number of backup steps /.. exceeds the number of different steps /something, then maybe it is fooled by use of //...
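One way such a check might go wrong is sketched below. This is a guess at the kind of flaw meant here, not the code of any real server: the hypothetical naive_depth_ok counts the empty component produced by // as a step down, while the file system treats // like /. The document root /var/www is also an assumption.

```python
import posixpath

def naive_depth_ok(path: str) -> bool:
    """Flawed check: every component that is not '..' counts as a step down."""
    depth = 0
    for part in path.split("/")[1:]:
        if part == "..":
            depth -= 1
        else:
            depth += 1      # bug: "" (from "//") and "." are counted too
        if depth < 0:
            return False
    return True

def careful_depth_ok(path: str) -> bool:
    """Same idea, but empty components and '.' do not change the depth."""
    depth = 0
    for part in path.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            depth -= 1
            if depth < 0:
                return False
        else:
            depth += 1
    return True

request = "/a//../../secret"
print(naive_depth_ok(request))      # the flawed check accepts the request
print(careful_depth_ok(request))    # the careful check rejects it
# What the path resolves to below an assumed document root:
print(posixpath.normpath("/var/www" + request))
```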

Hiding ASCII

As we already remarked, an arbitrary byte can be coded in hexadecimal preceded by a percent character %. That is, tilde ~ becomes %7E and dot . becomes %2e and 0xC0 0xAF becomes %c0%af as part of a URL.

Web servers that checked for /../ and \..\ can now be tricked by use of for example /%2e./.
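A minimal sketch of the problem, assuming a hypothetical filter raw_check that looks at the undecoded path only. The path itself is made up.

```python
from urllib.parse import unquote

def raw_check(url_path: str) -> bool:
    """Hypothetical filter: accept unless a literal /../ or \\..\\ occurs."""
    return "/../" not in url_path and "\\..\\" not in url_path

path = "/scripts/%2e./secret"
decoded = unquote(path)

print(raw_check(path))       # the escaped form slips through
print(decoded)               # what the server acts on after decoding
print(raw_check(decoded))    # checking after decoding catches it
```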

Multiple UTF-8 representations

Many UTF-8 decoders are willing to decode UTF-8 sequences, also when they are not the shortest representation of the decoded value. This is what happens when one decodes e.g. the UTF-8 pair 110xxxxx 10yyyyyy into the 11-bit value xxxxxyyyyyy without any further check.

In an environment with such decoders, every Unicode character has multiple UTF-8 representations. E.g., the slash character 0x2F is also represented by 0xC0 0xAF, and also by the triple 0xE0 0x80 0xAF, and also by the quadruple 0xF0 0x80 0x80 0xAF. Even the 5- and 6-byte versions 0xF8 0x80 0x80 0x80 0xAF and 0xFC 0x80 0x80 0x80 0x80 0xAF might work.

Only the shortest representation is legal.
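The difference between a lax and a strict decoder can be sketched for the 2-byte case. Both function names are made up for the example; Python's own decoder is strict and rejects the overlong form.

```python
def lax_decode_2byte(data: bytes) -> int:
    """Decode 110xxxxx 10yyyyyy into xxxxxyyyyyy without any further check."""
    lead, cont = data
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a 2-byte sequence")
    return ((lead & 0x1F) << 6) | (cont & 0x3F)

def strict_decode_2byte(data: bytes) -> int:
    """As above, but reject values that fit in a single byte."""
    value = lax_decode_2byte(data)
    if value < 0x80:
        raise ValueError("overlong encoding")
    return value

print(hex(lax_decode_2byte(b"\xc0\xaf")))    # the lax decoder yields the slash
try:
    strict_decode_2byte(b"\xc0\xaf")
except ValueError as err:
    print("rejected:", err)
try:
    b"\xc0\xaf".decode("utf-8")               # the built-in decoder is strict
except UnicodeDecodeError:
    print("rejected by the built-in decoder")
```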

Using illegal UTF-8

If a web server is smart enough to recognize not only /../ but also ASCII variations using %xx escapes, it may fall for versions like /%c0%ae./ with an illegal UTF-8 representation of dot.

This particular trick (known as the "IIS/PWS Extended Unicode Directory Traversal vulnerability") was very successful against IIS 4 and IIS 5 around 2000/2001. It allowed one to read or write arbitrary files and execute arbitrary commands. Many viruses like Code Red and Nimda used some variation of it. The URL
would e.g. yield the response
Directory of c:\inetpub\scripts

10/01/2001  03:46p      <DIR>          .
10/01/2001  03:46p      <DIR>          ..
               0 File(s)              0 bytes
               2 Dir(s)   2,527,547,392 bytes free
(The scripts part here is to make sure that IIS will regard the URL as pointing to an executable.)

Microsoft nonstandardness

Microsoft IIS also allows one to encode a 16-bit value using the non-standard %uxxxx with four hexadecimal symbols x. Also this encoding may be used to avoid detection. See e.g. this advisory.
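To see why this helps an attacker, here is a sketch of a decoder for the %uxxxx form. The function decode_percent_u is written for this example and does not reproduce the IIS implementation. A standard decoder leaves the non-standard escape untouched, so a filter built on it sees no dots.

```python
import re
from urllib.parse import unquote

def decode_percent_u(text: str) -> str:
    """Replace each %uxxxx (four hex digits) by the character with that code."""
    return re.sub(
        r"%u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

path = "/scripts/%u002e%u002e/secret"
print(unquote(path))             # standard decoding does not touch %uxxxx
print(decode_percent_u(path))    # a %u-aware server sees /../
```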

Multiple decoding

When some outer security layer does decoding of %xx hex escapes in order to catch forbidden substrings like /../ and \..\, and this outer security layer was added later around an already existing setup, then chances are good that the existing setup will also decode such escapes. But then %255c will be decoded by the outer security layer to %5c, which upon further decoding becomes \.

For example, Microsoft IIS 4.0 and 5.0 had this vulnerability (2001) and e.g. upon double decoding the URL\ would allow one to execute the command DIR C:\.
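The two decoding steps can be followed with a short sketch. The path fragment is made up, and the two calls to unquote stand in for the outer layer and the existing setup.

```python
from urllib.parse import unquote

path = "..%255c.."

once = unquote(path)     # what the outer security layer inspects
twice = unquote(once)    # what the inner layer ends up using

print(once)
print("\\" in once)      # no backslash visible to the outer check
print(twice)
```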

Embedded NUL

One can embed a NUL character in a URL using %00. Often parts of the URL string will be handled by programs written in C for which NUL is a string terminator. Thus we have a multiple parses problem.

For example, when the URL is checked by Squid against its access control lists, then Squid finds no problem instead of denying the request. (Mar 2004. See also the next item.)
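A sketch of the multiple-parses problem, under made-up assumptions: a checker written in a language with counted strings applies a hypothetical rule (only names ending in .txt are allowed) to the full string, while a C routine stops at the first NUL. The helper c_string_view imitates the C view.

```python
from urllib.parse import unquote

def c_string_view(text: str) -> str:
    """What a C routine sees: everything up to the first NUL."""
    return text.split("\x00", 1)[0]

def extension_allowed(name: str) -> bool:
    """Hypothetical rule, applied to the full string including the NUL."""
    return name.endswith(".txt")

decoded = unquote("/docs/secret.cgi%00.txt")

print(extension_allowed(decoded))    # the checker sees a .txt name
print(c_string_view(decoded))        # the C code opens something else
```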

Hiding true identity of a URL

Writing things in hex is already a good start if one wants to hide the name of the site. Another trick is to use the user@host notation: something like refers to the site, not to Recently (Dec 2003) an IE bug was discovered that makes hiding easier. If in a URL characters %00 and/or %01 occur, the rest of the URL may not be shown in the status and location bar. Now one only sees Also Mozilla is somewhat vulnerable. This trick is already being actively exploited.

In the same vein one has the so-called `homograph attack' where the attacker (phisher) registers a domain name that closely resembles a well-known domain name. Earlier this was done using small misspellings, or mimicking 'l' using 'I'. This time it is done using symbols in the Unicode character set that very closely resemble the ASCII symbols in the name to be spoofed. For example, the Unicode symbol Cyrillic a looks just like ordinary a, at most taken from a different font, and has code U+0430, that is, &#1072;. Recent versions of Mozilla and Firefox recognize this.
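A minimal sketch of how such a name can be recognized, using Python's standard unicodedata module. The domain is made up, and the mixed-script test shown here is a simple heuristic for illustration, not the rule any browser actually applies.

```python
import unicodedata

genuine = "example.com"
spoof = "ex\u0430mple.com"      # Cyrillic a (U+0430) in place of Latin a

def scripts(name: str) -> set:
    """First word of each letter's Unicode name, e.g. LATIN or CYRILLIC."""
    return {unicodedata.name(ch).split()[0] for ch in name if ch.isalpha()}

print(spoof == genuine)                 # the strings differ
print(unicodedata.name("\u0430"))
print(sorted(scripts(genuine)))
print(sorted(scripts(spoof)))           # more than one script: suspicious
```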

17.2 Cross-site scripting

Cross-site scripting (XSS) is the situation where an attacker leaves data somewhere, such that if an innocent user passes by later his browser is tricked into executing commands hidden in the data.

(Note that CSS stands for Cascading Style Sheets, something entirely different.)

A recent example (quoting from a Bugtraq post): LiveJournal, an open source software package used to create popular Internet journals such as LiveJournal and DeadJournal, is vulnerable to an XSS vulnerability which allows an attacker to execute script code in a user's browser. The vulnerability arises out of insufficient sanitization of a user-supplied URL pointing to an image that they wish to display as their journal's background. If we were to use the string "" as our URL, the following would be inserted into our journal's stylesheet:

body { background-image: url(; }
While LiveJournal removes all markup from this string, it does not filter out parentheses or semicolons, thus allowing us to insert JavaScript code into the stylesheet. For example:
); background:url(javascript:alert("XSS!")
If we were to submit the above as our URL, this is what would be inserted into the stylesheet:
body { background-image: url(); background:url(javascript:alert("XSS!")); }

This allows the attacker to execute arbitrary JavaScript code in the victim's browser.
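The insertion can be reproduced with a few lines. This sketch only imitates the string handling described in the post; build_rule and is_safe_url are hypothetical names, and the whitelist in is_safe_url is one possible stricter check, not the fix that LiveJournal applied.

```python
import re

def build_rule(url: str) -> str:
    """Insert a user-supplied URL into the stylesheet rule, unchecked."""
    return "body { background-image: url(" + url + "); }"

def is_safe_url(url: str) -> bool:
    """Accept only http(s) URLs made of a conservative set of characters."""
    return re.fullmatch(r"https?://[A-Za-z0-9._~/%-]+", url) is not None

payload = ');' + ' background:url(javascript:alert("XSS!")'

print(build_rule(payload))        # parentheses and semicolon break out of url()
print(is_safe_url(payload))
print(is_safe_url("http://example.com/bg.png"))
```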

Example applications: Redirect the victim to a site of one's own choice. Steal her cookies. Change user preferences. Advertise. Let the victim's IP address be logged as the source of your hacking activity.

17.3 Hijack

In many cases it is possible to do bad things if a browser visits a malicious web site and then afterwards some other web site. Here is a demo that shows that malicious web sites can hijack windows with a known name visited later, replacing their contents with contents of their own choice. Here we attempt to hijack the login page of the Postbank. (This works on most browsers with default settings, 2005-01-01.)

17.4 Annoyances

Probably useful as a prank only, but it is easy to take over control of a GUI using JavaScript or Java or Shockwave Flash or so.

Many advertisers misuse this to force the user to look at their ads. As far as I can see, there is next to no protection today in browsers like Mozilla. Don't know about other browsers. (Tell me.)

Here is a JavaScript example. And if your browser is set to disallow moving a window, try this.

And a Java thread that doesn't want to die.

Exercise. This reminds me of Michal Zalewski's signature. Question: What does the bash command :(){ :|:&};: do?

17.5 The Java virtual machine
