
Server info (DNS and IPs removed):

cat /proc/version && uname -a && java -version

Linux version 2.6.16.33-xenU (*************) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #2 SMP Wed Aug 15 17:27:36 SAST 2007
Linux ************* *************-xenU #2 SMP Wed Aug 15 17:27:36 SAST 2007 x86_64 x86_64 x86_64 GNU/Linux
java version "1.6.0_14"
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)

I have some PHP code that reads from an Excel file and does string comparisons. It fails on the server due to what seems to be a locale issue; on my local machine (OS X 10.8.5 Mountain Lion), however, it works!

On my local machine the locale is en_US.UTF-8. On the server the locale was POSIX, but I changed it to en_US.utf8 since there was no en_US.UTF-8 when I looked at locale -a (interestingly, the locales listed on the server are all lower case, while on my Mac they are all upper case, which is where this question stems from).

Is there a difference between the two that could affect string comparisons?

Also, as per this SF post I ran locale -v -a. On the server, en_US.utf8 uses the UTF-8 codeset (I'm assuming this is the same as what I normally call a charset?). However, on my local machine I seem unable to run locale -v -a, though locale and locale -a work fine.
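
For reference, here is the kind of check I am running, written as a minimal C sketch rather than my actual PHP code; the locale names are just the ones in question, so substitute whatever locale -a lists on your system:

    /* Minimal sketch: check whether each locale name is accepted at all and
     * how it affects string comparison.  Locale names are examples only;
     * substitute whatever `locale -a` lists on your system. */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    static void try_locale(const char *name)
    {
        /* setlocale() returns NULL if the name is not installed/recognised */
        if (setlocale(LC_COLLATE, name) == NULL) {
            printf("%-14s rejected\n", name);
            return;
        }
        /* strcoll() follows LC_COLLATE; strcmp() compares raw bytes.
         * Only the sign of the return values matters. */
        printf("%-14s accepted; strcoll(\"a\",\"B\") = %d, strcmp(\"a\",\"B\") = %d\n",
               name, strcoll("a", "B"), strcmp("a", "B"));
    }

    int main(void)
    {
        try_locale("en_US.UTF-8");   /* spelling on my Mac */
        try_locale("en_US.utf8");    /* spelling listed on the server */
        try_locale("POSIX");         /* plain byte-order collation */
        return 0;
    }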

Edit: A related question I asked on StackOverflow.

Matthew Herbst
2 Answers


TL;DR:

The codepage / character set .utf8 in en_US.utf8 is not officially recognised as far as I can tell. There is no IANA utf8 character set name. utf8 is likely generated by glibc - see final heading.

The IANA character set name is UTF-8.

  • The hyphen is important
  • The name is case-insensitive

Therefore, these are all valid:

  • en_US.utf-8
  • en_US.UTF-8
  • en_US.uTf-8

There is also a case-sensitive alias for the name UTF-8, namely csUTF8.

Therefore, this would also be valid:

en_US.csUTF8

But I have never seen this in the wild.
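
If you want to see what your C library actually makes of these spellings, here is a quick sketch; which spellings are accepted, and the exact codeset string reported, are implementation-dependent (on glibc I would expect every accepted UTF-8 spelling to report the same canonical codeset):

    /* Sketch: report the codeset of each locale spelling the C library accepts.
     * Accepted spellings and the reported codeset are implementation-dependent. */
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        const char *names[] = { "en_US.UTF-8", "en_US.utf-8", "en_US.uTf-8",
                                "en_US.utf8", "en_US.csUTF8" };
        for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i) {
            if (setlocale(LC_CTYPE, names[i]) == NULL) {
                printf("%-14s not accepted\n", names[i]);
                continue;
            }
            /* nl_langinfo(CODESET) names the codeset of the current locale */
            printf("%-14s -> codeset %s\n", names[i], nl_langinfo(CODESET));
        }
        return 0;
    }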

The details, with chapter and verse

UTF-8 is a valid IANA character set name, whereas utf8 is not. It's not even a valid alias.

POSIX.1-2017, section 8.2 Internationalization Variables says:

If the locale value has the form:

language[_territory][.codeset]

it refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.

The part in question here is [.codeset], which POSIX doesn't define, but IANA does.
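
To make the structure concrete, here is a toy decomposition of a locale name into those three fields (purely illustrative; this is not a POSIX API):

    /* Illustrative only: split a locale name into the
     * language[_territory][.codeset] fields quoted above. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char name[] = "en_US.utf8";            /* example from the question */

        char *codeset = strchr(name, '.');     /* optional ".codeset" */
        if (codeset)
            *codeset++ = '\0';
        char *territory = strchr(name, '_');   /* optional "_territory" */
        if (territory)
            *territory++ = '\0';

        printf("language=%s territory=%s codeset=%s\n",
               name,
               territory ? territory : "(none)",
               codeset ? codeset : "(none)");
        /* prints: language=en territory=US codeset=utf8 */
        return 0;
    }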

For the character set defined by RFC 3629 (UTF-8, a transformation format of ISO 10646), the IANA Character Sets registry lists the name as:

UTF-8

and the note at the top says:

These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation.

An alias csUTF8 is provided, about which RFC 2978 (IANA Charset Registration Procedures), section 2.3 says:

All other names are considered to be aliases for the primary name and use of the primary name is preferred over use of any of the aliases.

IANA Character Sets also says:

The "cs" stands for character set and is provided for applications that need a lower case first letter but want to use mixed case thereafter that cannot contain any special characters, such as underbar ("_") and dash ("-").

In the cs alias the case is significant, whereas the name itself is case-insensitive, as noted above.

Given the alias csUTF8, en_US.csUTF8 would also be valid, but I have never seen this format in the wild.

While case matters in aliases, for the names themselves IANA Character Sets says:

The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters.

So while en_US.utf-8 is valid (a lowercase version of the listed UTF-8), en_US.utf8 doesn't refer to an IANA character set, as it drops the -.

If it's not IANA, where does utf8 likely come from?

glibc's _nl_normalize_codeset() does the following:

  • Passes through only alphabetic characters and digits (goodbye, hyphen)

  • Converts letters to lowercase

    for (cnt = 0; cnt < name_len; ++cnt)
      if (__isalpha_l ((unsigned char) codeset[cnt], locale))
        *wp++ = __tolower_l ((unsigned char) codeset[cnt], locale);
      else if (__isdigit_l ((unsigned char) codeset[cnt], locale))
        *wp++ = codeset[cnt];
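
For illustration, here is a rough standalone rendering of that logic (not glibc's actual code), showing how the IANA name ends up as glibc's spelling:

    /* Standalone sketch of the same idea: keep only letters and digits and
     * lowercase the letters, so "UTF-8" normalises to "utf8". */
    #include <ctype.h>
    #include <stdio.h>

    static void normalize_codeset(const char *codeset, char *out)
    {
        for (; *codeset; ++codeset) {
            unsigned char c = (unsigned char)*codeset;
            if (isalpha(c))
                *out++ = (char)tolower(c);  /* letters: lowercased */
            else if (isdigit(c))
                *out++ = (char)c;           /* digits: kept as-is */
            /* anything else (e.g. the hyphen) is dropped */
        }
        *out = '\0';
    }

    int main(void)
    {
        char buf[32];
        normalize_codeset("UTF-8", buf);
        printf("UTF-8      -> %s\n", buf);  /* utf8 */
        normalize_codeset("ISO-8859-1", buf);
        printf("ISO-8859-1 -> %s\n", buf);  /* iso88591 */
        return 0;
    }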
    
Tom Hale

No difference. They are one and the same.

Joe Sniderman
    Thanks for the answer - do you have anything to support this? I realize it might just be a naming convention and there might be no docs. Thanks! – Matthew Herbst Aug 01 '14 at 00:23
  • It's just a naming convention. Expand both into their human-lang equivs: 'United States English, using the UTF-8 Charset' == 'United States English, using the UTF-8 Charset' – Joe Sniderman Aug 01 '14 at 00:42
  • @MatthewHerbst Well, "no difference" but look at the answer *that_should_be_accepted* by TomHale and why you should use "en_US.UTF-8" anyway. – Déjà vu Sep 02 '20 at 06:27
  • That is false on macOS: there, `en_US.UTF8` fails, whereas `en_US.UTF-8` succeeds. The former is basically wrong, and glibc happens to correct it and hide a latent bug. I've recently run into this issue. – Kuba hasn't forgotten Monica Sep 15 '20 at 03:13