Though one can indeed manually choose an encoding (and remember to disable that choice when visiting other sites), the web site really should have specified it correctly itself. Either the server or the web pages themselves should declare the encoding, for otherwise all the browser can do is make a best guess. And of course, if an encoding is specified, then the HTML document should actually use that encoding. The web site from the question fails on both counts, as shown below:
To see whether the web server specified anything, one needs to look at the HTTP response headers. Using the online service from web-sniffer.net to reveal those headers yields:
HTTP/1.1 200 OK
Date: Mon, 17 Aug 2009 17:47:03 GMT
Server: Apache
Last-Modified: Mon, 27 Nov 2006 23:38:49 GMT
ETag: "758b0606-1a316-4234309151440"
Accept-Ranges: bytes
Content-Length: 107286
Connection: close
Content-Type: text/html; charset=utf-8 (BOM UTF-16, little-endian)
The last line seems a bit odd: how can the server claim something to be both UTF-8 and UTF-16? The value of charset should be one of those registered with IANA (so, for example, plain UTF-8 without any comments). However, using the Wireshark packet sniffer rather than the online service reveals that the text (BOM UTF-16, little-endian) is in fact a comment added by the online service, not something sent by the web server.
So: the web server claims it's going to send us a UTF-8 encoded HTML document.
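As an aside, such a declared charset can also be read programmatically. A minimal sketch using only Python's standard library, which parses a Content-Type header value locally (the values below are just the ones from this example, not a live fetch):

```python
# Sketch: extract the charset parameter from a Content-Type header value,
# the same value that web-sniffer.net showed for this server.
from email.message import Message

def declared_charset(content_type: str):
    """Return the charset declared in a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset, or None
    return msg.get_content_charset()

print(declared_charset("text/html; charset=utf-8"))  # utf-8
print(declared_charset("text/html"))                 # None
```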
However, the HTML document that follows is wrong (edited for readability):
ÿþ<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Lesson 5</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link href="main.css" rel="stylesheet" type="text/css">
</head>
...
Above, the line specifying the content type should be the first element within the <head>
, for otherwise the browser would not know how to handle special characters in the <title>
. More importantly, the first two odd characters, ÿþ
, are in fact the bytes with hexadecimal codes FF and FE which, as the online service already noted, form the Byte-Order Mark for UTF-16, little-endian.
So: the web server promised to send UTF-8, but then sent markers indicating UTF-16 LE. And next, within the HTML document itself, it claims to be using UTF-8 again.
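A browser (or any BOM-aware tool) resolves such a conflict by trusting the leading bytes. A minimal sketch of that sniffing logic, using the BOM constants from Python's standard library (the function name is my own):

```python
# Sketch: decide the real encoding from the first bytes of a document,
# regardless of what the server's Content-Type header claimed.
import codecs

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading byte-order mark, if any."""
    # Check UTF-32 BOMs before UTF-16: the UTF-16 LE BOM (FF FE) is a
    # prefix of the UTF-32 LE BOM (FF FE 00 00).
    if data.startswith(codecs.BOM_UTF32_LE) or data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return None

# The document from the question starts with FF FE, then "<" as 3C 00:
print(sniff_bom(b"\xff\xfe<\x00!\x00"))  # utf-16-le
```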
Indeed, Wireshark shows that the actual HTML document is UTF-16 encoded. This implies that every character is sent using at least two bytes (octets). For example, the 6 characters of <html>
are sent as the 12 hexadecimal bytes 3C 00 68 00 74 00 6D 00 6C 00 3E 00
. However, this very web site could very well have been plain ASCII, as it doesn't seem to use any non-ASCII characters at all. Instead, the HTML source is full of numeric character references (NCRs), such as:
&#2351;&#2361; &#2342;&#2367;&#2354;&#2381;&#2354;&#2368;
&#2358;&#2361;&#2352; &#2361;&#2376;&#2404;
A browser displays the above as यह दिल्ली शहर है।. However, due to combining NCRs with UTF-16, the single character य (Unicode U+092F) requires as many as 14 bytes, 26 00 23 00 32 00 33 00 35 00 31 00 3B 00
, because it is written as the NCR &#2351;
while the 7 ASCII characters of that NCR are themselves encoded using UTF-16. Without NCRs, this single य would require 3 bytes in UTF-8 (E0 A4 AF
) and 2 bytes in UTF-16 (09 2F
).
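The arithmetic above is easy to verify; a small sketch comparing the byte counts for this single character in each of the forms discussed:

```python
# Sketch: byte counts for the single character य (U+092F) in the
# encodings discussed above.
ya = "\u092f"                      # the character य
ncr = "&#%d;" % ord(ya)            # its NCR "&#2351;", 7 ASCII characters

print(len(ncr.encode("utf-16-le")))  # 14: the NCR itself sent as UTF-16
print(len(ya.encode("utf-8")))       # 3: E0 A4 AF
print(len(ya.encode("utf-16-le")))   # 2: 2F 09 (little-endian byte order)
```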
For this HTML source, using UTF-16 is a total waste of bandwidth, and the server does not apply any compression either.
thanks! I'm not sure at all how I missed that one... – Babu – 2009-08-17T17:22:08.700
Strange enough, my Safari (on a Mac) does not even list UTF-16 as an option. (But, it renders fine, even when explicitly selecting Unicode (UTF-8), whereas Firefox does not display when selecting UTF-8. Maybe in Safari Unicode (UTF-8) is more like "UTF-8 if no BOM is found, otherwise use the BOM to decide on the Unicode encoding".) – Arjan – 2009-08-18T09:53:26.837