How to add CP-1252 to CentOS

0

We are processing files that our clients generated on their local Windows machines which use the CP-1252 character set. Occasionally, while processing one of these files in our backend (running on CentOS), we get runtime errors (it's a Java backend, so RuntimeExceptions). If we remote in to the server and rename the file (using UTF-8) and re-run it, the file processes perfectly fine.

Is there any way to "add" CP-1252 to CentOS's available character sets so that this stops happening?

pnongrata

Posted 2012-08-20T20:53:29.560

Reputation: 2 212

Can you post the Java run-time exception that you receive? And call stack? Is the issue that there is a CP-1252 character in the file name that is being processed by a Java program? – HeatfanJohn – 2012-08-20T21:40:35.627

@HeatfanJohn - I will need a few hours before I can get access to the appropriate logs to get the exact stacktrace, but yes, you nailed it. It happens when there is a CP-1252 character in the file name, and the system chokes on it. Simply SSHing in to the server, renaming the file and re-processing it fixes the problem, but that is a sub-optimal (manual!) solution. – pnongrata – 2012-08-20T21:42:48.980

Do you have any control over the code that creates that file that is processed by your Java back-end or over the source code to the Java application that processes the file? – HeatfanJohn – 2012-08-20T21:56:54.963

Only the backend but not the (client-side) file generator. But the Java backend is 100% under our control. – pnongrata – 2012-08-20T22:01:25.593

How come you can't fix the Java program to read the data as bytes and then pass it through a decoder? – Ignacio Vazquez-Abrams – 2012-08-21T05:20:11.510
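
To illustrate the suggestion in the last comment, here is a minimal sketch of decoding raw bytes with an explicit charset instead of relying on the platform default. It assumes the bytes really are CP-1252 and that the JVM provides the `windows-1252` charset (Java's canonical name for CP-1252); the class name and the sample bytes are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252Decode {

    // Assumption: the JVM ships the windows-1252 charset (it is not one of
    // the six charsets every Java implementation is required to support).
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode raw bytes as CP-1252, independent of the server's locale.
    static String decode(byte[] raw) {
        return new String(raw, CP1252);
    }

    public static void main(String[] args) {
        // Hypothetical sample: "caf" followed by byte 0xE9 (e-acute in CP-1252).
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};
        String text = decode(raw);

        // Re-encode as UTF-8 for whatever the backend does next.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(text.length() + " chars, " + utf8.length + " UTF-8 bytes");
    }
}
```

This only helps where the program itself holds the raw bytes (for example, file contents or a name received over the network). How the JVM decodes file names it reads from the file system still depends on the locale it was started with.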

Answers

1

See the Oracle bug report bug_id=4733494, which covers how Java handles the "default locale". Sun/Oracle considers this behavior to be by design rather than a bug. From the report:

In versions of the JDK prior to 1.4, we always forced the "C" locale to the ISO8859-1 character set. In releases 1.4 and later, we support the "C" locale which requires restriction to 7-bit ASCII.

The recommendation is to set the environment variable LC_ALL to en_US.ISO8859-1, or to whatever locale is appropriate for the system (es_ES.ISO-8859-1, etc.).

Adding the following line to the script that starts your Java back-end should resolve the problem:

export LC_ALL="en_US.ISO-8859-1"

This is also documented in SO question: https://stackoverflow.com/questions/5663709/how-to-fix-java-when-if-refused-to-open-a-file-with-special-charater-in-filename
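
To check whether the locale setting actually reached the JVM, a small diagnostic like the one below may help. It is only a sketch: the class name and the sample file name are hypothetical, and `sun.jnu.encoding` is an implementation-specific system property that may be absent on some JVMs. Also note that on newer JDKs the default charset may be UTF-8 regardless of the locale, so treat the output as a hint rather than proof.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FileNameEncodingCheck {

    // True if every character of the name can be represented in the charset.
    static boolean canEncode(String name, Charset charset) {
        return charset.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        // Hypothetical sample name containing a non-ASCII character (e-acute).
        String name = args.length > 0 ? args[0] : "r\u00e9sum\u00e9.txt";

        // Charset the JVM picked at startup.
        System.out.println("default charset  = " + Charset.defaultCharset());
        // Implementation-specific property; prints "null" if the JVM lacks it.
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));

        // Under the "C" locale (7-bit ASCII) the name cannot be represented.
        System.out.println("US-ASCII ok:   " + canEncode(name, StandardCharsets.US_ASCII));
        System.out.println("ISO-8859-1 ok: " + canEncode(name, StandardCharsets.ISO_8859_1));
    }
}
```

Running it once from the same script that starts the back-end, before and after adding the `export LC_ALL=...` line, shows whether the change took effect.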

HeatfanJohn

Posted 2012-08-20T20:53:29.560

Reputation: 443

Thanks @HeatfanJohn (+1) - quick followup: what's this "C" locale? I've never heard of it before or seen it referenced anywhere. What purpose does it serve? Thanks again! – pnongrata – 2012-08-21T01:59:11.620

@zharvey I didn't know what that was either. According to this web page, http://www.chemie.fu-berlin.de/chemnet/use/info/libc/libc_19.html#SEC324, it appears to be a legacy GNU C locale. – HeatfanJohn – 2012-08-21T02:42:29.763

@zharvey if you run locale from a command prompt on your Linux system, what is output? – HeatfanJohn – 2012-08-21T03:07:52.613

The "C" locale does no collation beyond bytewise comparison, no number or currency formatting, only primitive date and time formatting, and it does not translate the native strings used in an application. – Ignacio Vazquez-Abrams – 2012-08-21T05:17:02.437