Can the UTF-8 code page identifier (65001) be different on other computers?

I recently tried to explain to a friend how to create a simple one-line batch file:

subst t: "X:\Example"

On my machine this has worked fine for years, but on theirs I ran into an issue: their name contained non-ASCII characters (the Turkish characters ı and ç, to be exact), which weren't recognized properly.

The simple solution, I thought, would be to add

chcp 65001

at the top of the file to change the active codepage to the UTF-8 one.

But this didn't work. On their computer it crashed the command shell that was executing it. I had them try a few different values: 65000 crashed as well, but 10000 didn't, and all the lower values I tried worked too, although they did not correspond to the same code pages as the same values on my computer. Their default code page was different as well (857 instead of 850 on my computer; this makes sense because, according to MSDN, 857 is a Turkish code page and 850 a Western European one).

I know that some code pages can differ from computer to computer, but the MSDN page explicitly states that one should use UTF-8 because the other code pages may change (although there is a distressing lack of documentation on how and when they change).

Is that false? Can the value of 65001 change as well? If so, why would that cause a crash? Shouldn't it complain about "Invalid code page" at worst? And if it does change, how can one find out which value to use to get it or how else could I get it to accept non-ASCII characters?

I am using Windows 10 in English (the machine came preinstalled with Windows 8.1 in Italian), while my friend uses Windows 7 in Turkish.

Annonymus

Posted 2016-08-26T21:30:58.700

Reputation: 133

"at the top of the file": Did you save the batch file with UTF-8 encoding? – dxiv – 2016-08-27T04:55:52.437

@dxiv yes, I did (and made sure my friend did as well) – Annonymus – 2016-08-27T06:33:57.383

Answers

Basically, Windows cmd (and its batch script interpreter as well) relies on the active code page matching the encoding of the batch script. For instance, if you save a script from Notepad in the so-called ANSI encoding (which depends strongly on the Windows system locale), then you should run it under the corresponding code page; see the National Language Support (NLS) API Reference:

  • English (US): ANSI corresponds to ANSI code page (ACP) 1252, OEM code page 437,
  • English (UK): ANSI corresponds to ACP 1252, OEM code page 850,
  • Turkish: ANSI corresponds to ACP 1254, OEM code page 857,
  • Central Europe: ANSI corresponds to ACP 1250, OEM code page 852, etc.
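
The effect of such a mismatch can be reproduced outside of cmd. The following is only a minimal sketch in Python (not part of the batch solution; the folder name is made up for illustration): it encodes a Turkish name as UTF-8 and then decodes the same bytes with the OEM code pages listed above.

```python
# Hypothetical folder name containing the Turkish letters from the question.
name = "ıç klasör"

# Bytes as an editor would store them when saving the script as UTF-8.
utf8_bytes = name.encode("utf-8")


def decode_as(code_page: str) -> str:
    """Interpret the UTF-8 bytes as if they were text in the given code page."""
    return utf8_bytes.decode(code_page, errors="replace")


# Each OEM code page turns the multi-byte UTF-8 sequences into unrelated characters.
for cp in ("cp437", "cp850", "cp852", "cp857"):
    print(cp, repr(decode_as(cp)))

# Only a decoder that matches the file's encoding recovers the name.
print("utf-8", repr(utf8_bytes.decode("utf-8")))
```

None of the OEM code pages recovers the original name, which is why the script's encoding and the active code page have to agree.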

Your presumption is right:

The simple solution to this would be to add chcp 65001 at the top of the file to change the active codepage to the UTF-8 one. … But this didn't work.

Unfortunately, neither Windows cmd nor its batch interpreter recognizes the Byte Order Mark (BOM); both treat it as ordinary characters, regardless of the currently active code page.
Hence, the first line of a UTF-8 encoded file (the CHCP 65001 command in your case) is corrupted if a BOM is present; an attempt to run such a command leads to the error message ' CHCP' is not recognized as an internal or external command, operable program or batch file (errorlevel 9009).
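
To illustrate why the first command becomes unrecognizable, here is a small Python sketch (an illustration only, run outside of cmd). It assumes the active code page is 852, as in the output shown further down, and decodes the raw bytes of a first line that starts with a UTF-8 BOM.

```python
import codecs

# The three bytes a UTF-8 BOM consists of: EF BB BF.
bom = codecs.BOM_UTF8

# First line of a script saved as "UTF-8 with BOM", as raw bytes.
first_line = bom + b"CHCP 65001"

# cmd has not switched to UTF-8 yet when it reads this line, so the bytes
# are interpreted in the active OEM code page (assumed here: 852).
seen_by_cmd = first_line.decode("cp852")
print(seen_by_cmd)


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


print(has_utf8_bom(first_line))
```

The three BOM bytes become three visible characters glued to the front of CHCP, so the command name no longer matches any known command.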

Solution: save your script UTF-8 encoded without a BOM.
Workaround if you can't do that (Notepad always writes a BOM): use a dummy command as the first line of your script, e.g. as follows:

@rem if this line is visibly executed then BOM is present >NUL 2>&1
@echo OFF
    rem save current code page to the `_chcp` variable
for /F "tokens=2 delims=:" %%G in ('chcp') do set "_chcp=%%G"
    rem change active code page to UTF-8 (silently)
CHCP 65001 >NUL
    rem echo this is UTF-8 encoded batch file %~nx0
echo(
subst t: "D:\bat\Unusual Names\Türkçe (Türkiye)\çğüşöıĞÜİŞÇÖ"
subst
dir /B /S t:\*.txt
subst t: /D
echo(
echo(  works as well for characters from Unicode Basic Multilingual Plane
subst t: "D:\bat\Unusual Names\CJK\中文(繁體)"
subst
dir /B /S t:\*.txt
subst t: /D
echo(
echo(  works even for characters from Unicode Supplementary Multilingual Plane
subst t: "D:\bat\Unusual Names\"
subst
dir /B /S t:\*.txt
subst t: /D
    rem set active code page back to previously saved value (verbose)
echo(
CHCP %_chcp%

Output:

==> utf8.bat

==> ´╗┐@rem if this line is visibly executed then BOM is present  1>NUL 2>&1

T:\: => D:\bat\Unusual Names\Türkçe (Türkiye)\çğüşöıĞÜİŞÇÖ
t:\ĞÜİŞÇÖçğüşöı.txt

  works as well for characters from Unicode Basic Multilingual Plane
T:\: => D:\bat\Unusual Names\CJK\中文(繁體)
t:\chinese traditional.txt

  works even for characters from Unicode Supplementary Multilingual Plane
T:\: => D:\bat\Unusual Names\
t:\Mathematical Bold Script.txt

Active code page: 852
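
The save-and-restore step in the script depends on the format of the chcp output: tokens=2 delims=: keeps everything after the first colon, including the leading space. As a rough sketch of that parsing in Python (for illustration only; the localized line with a trailing period is an assumption, not taken from this thread), a slightly more tolerant variant takes the last run of digits instead:

```python
import re
from typing import Optional


def code_page_from_chcp(output: str) -> Optional[int]:
    """Extract the code page number from a line of chcp output.

    Takes the last run of digits in the line, so surrounding spaces or
    punctuation do not end up in the result.
    """
    numbers = re.findall(r"\d+", output)
    return int(numbers[-1]) if numbers else None


# What the batch loop stores: the text after the first colon, space included.
english = "Active code page: 852"
print(repr(english.split(":")[1]))

print(code_page_from_chcp(english))
# Assumed localized variant with a trailing period.
print(code_page_from_chcp("Aktive Codepage: 850."))
```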

Finally, you could remove the first line (the one containing the BOM) from your script using the more command as follows (note the chcp 65001 before running more +1 …):

==> chcp 65001
Active code page: 65001

==> more +1 utf8.bat > utf8noBOM.bat

==> utf8noBOM.bat

T:\: => D:\bat\Unusual Names\Türkçe (Türkiye)\çğüşöıĞÜİŞÇÖ
t:\ĞÜİŞÇÖçğüşöı.txt

  works as well for characters from Unicode Basic Multilingual Plane
T:\: => D:\bat\Unusual Names\CJK\中文(繁體)
t:\chinese traditional.txt

  works even for characters from Unicode Supplementary Multilingual Plane
T:\: => D:\bat\Unusual Names\
t:\Mathematical Bold Script.txt

Active code page: 65001

==>
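
If a scripting language is available, the BOM can also be removed without dropping a whole line. A minimal sketch in Python (an alternative to more +1, not something used in the session above; the function name is hypothetical):

```python
import codecs
from pathlib import Path


def strip_utf8_bom(src: Path, dst: Path) -> bool:
    """Copy src to dst without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file had
    none (in which case dst is a byte-for-byte copy of src).
    """
    data = src.read_bytes()
    had_bom = data.startswith(codecs.BOM_UTF8)
    if had_bom:
        # Drop only the three BOM bytes; the rest of the file is untouched.
        data = data[len(codecs.BOM_UTF8):]
    dst.write_bytes(data)
    return had_bom
```

For example, strip_utf8_bom(Path("utf8.bat"), Path("utf8noBOM.bat")) would write a BOM-free copy. Unlike more +1, this keeps the first line, so no dummy rem line is needed.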

JosefZ

Posted 2016-08-26T21:30:58.700

Reputation: 9 121