Fix encoding of German umlauts in directories and filenames (ü = u╠ê and so on)

13

2

I have many zip-files where there are encoding errors for the German umlauts (äüöÄÜÖß). They show up in both the filename.zip as well as in the included directories and files like this:

  • Fünf = Fu╠ênf
  • Räuber = Ra╠êuber
  • Überfall = U╠êberfall

and so on. Usually I use Linux, but because of this issue I also tried a Windows 7 VM, which produced the same encoding mess. On Linux I played around with convmv and detox, but with no success.

When I use

  • convmv -f iso-8859-1 -t utf8 --replace --notest -r *

I get "Skipping, already UTF-8".

Any thoughts about this?

cider

Posted 2013-01-13T15:49:04.233

Reputation: 131

@cider try find -type f -print0 |xargs -r -n1 -0 convmv -f WINDOWS-1252 -t UTF-8 --notest This finds files from the current directory downward and runs convmv separately on each file. The filenames are passed as a null-terminated list. – Manwe – 2015-06-26T21:35:34.443

What antique system are you using? All current Linux distributions use UTF-8 now. – BatchyX – 2013-01-13T15:51:36.530

Could this be a filesystem problem? Perhaps it is not mounted in UTF? – terdon – 2013-01-13T15:59:16.257

I use Linux Mint 13 (based on Ubuntu 12.04 LTS with Kernel 3.2.0-23), so this is far from antique. And as I already wrote, I also tried those files on a Windows 7 VM. But of course I don't know what the person who created the zip files used. – cider – 2013-01-13T15:59:52.673

This encoding seems to be some kind of DOS encoding. Usually when I see issues with UTF-8 encoding, the German umlauts look like ä = ä or Ü = Ãœ – cider – 2013-01-13T16:05:16.830

Answers

2

The reason that you're getting the "already UTF-8" warning is that those strings really are already in UTF-8. The "ü" character was encoded OS X-style, in decomposed form, as a 'u' followed by the two bytes "\xCC" and "\x88". These two bytes together make up the UTF-8 representation of \u0308, the combining diaeresis.

If you look at the code page 437 listing here, you'll see the \xCC character as "╠" and the \x88 character as "ê".

Whatever it is that you're using to display those character sequences is not interpreting them as UTF-8 but rather as CP437.

A quick proof, if you read Ruby. The first command reproduces your garbled output by reinterpreting the bytes as CP437; the second displays as expected in my UTF-8 terminal:

$ ruby -e 'puts "u\xCC\x88"' | iconv -f cp437 -t utf-8
u╠ê
$ ruby -e 'puts "u\xCC\x88"'
ü
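If the garbled characters have actually been written into the names on disk (rather than only being displayed wrongly), the misinterpretation described above can be reversed. The following is a minimal sketch, not a tested tool: fix_mojibake is a hypothetical helper name, and it assumes a Ruby version that provides String#unicode_normalize (2.2 or later) and that the damage is exactly "UTF-8 bytes read as CP437". It only transforms strings; renaming files is left out.

```ruby
# Hypothetical helper: undo "UTF-8 bytes displayed as CP437" in a single name.
def fix_mojibake(name)
  # Map each displayed character back to the CP437 byte it came from.
  bytes = name.encode("IBM437")
  # Reinterpret those bytes as the UTF-8 sequence they originally were.
  utf8 = bytes.force_encoding("UTF-8")
  # If the bytes are not valid UTF-8, the name was not mangled this way.
  return name unless utf8.valid_encoding?
  # Compose "u" + combining diaeresis (U+0308) into a single "ü" (U+00FC).
  utf8.unicode_normalize(:nfc)
rescue Encoding::UndefinedConversionError
  # The name contains characters that do not exist in CP437; leave it alone.
  name
end

puts fix_mojibake("Fu╠ênf")
puts fix_mojibake("Ra╠êuber")
puts fix_mojibake("U╠êberfall")
```

Names that are already correct pass through unchanged, because a precomposed "ü" maps to a lone CP437 byte that is not valid UTF-8 on its own. The final normalization step is optional; it is included here on the assumption that precomposed characters are what Linux tools expect.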

S2VpdGgA

Posted 2013-01-13T15:49:04.233

Reputation: 71

0

First note that character encoding is its own section of hell. In the Windows world there still exists a nasty dualism between UTF-8 and M$ playing stupid for a long time and insisting on ISO-8859 (guess who came up with it). As mentioned above, it almost certainly has something to do with the file system. My solution is not a technical one, but one that has worked for me for many years now:

My personal bit of advice for file names is always the same: Just stick with alphanumerics plus dash ( - ) and underscore ( _ ). Write umlauts as ae, ue and oe. Don't use spaces or other special characters. It is a little bit inconvenient at first, but it will save you a lot of pain in unexpected places.
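The naming rule above could be applied to existing names with something like the following sketch. ascii_name and the UMLAUTS table are hypothetical names introduced for illustration; keeping the dot (so that extensions such as .zip survive) and mapping ß to ss are my assumptions and not part of the rule as stated.

```ruby
# Hypothetical mapping for the German umlauts and sharp s.
UMLAUTS = {
  "ä" => "ae", "ö" => "oe", "ü" => "ue",
  "Ä" => "Ae", "Ö" => "Oe", "Ü" => "Ue", "ß" => "ss"
}.freeze

# Hypothetical helper: rewrite a name using only letters, digits,
# dash, underscore and dot.
def ascii_name(name)
  # Fold decomposed sequences ("u" + U+0308) into single characters first.
  composed = name.unicode_normalize(:nfc)
  # Spell out umlauts as ae, oe, ue.
  spelled = composed.gsub(/[äöüÄÖÜß]/, UMLAUTS)
  # Replace every remaining disallowed character, including spaces, with "_".
  spelled.gsub(/[^A-Za-z0-9._-]/, "_")
end

puts ascii_name("Fünf Räuber.zip")
```

Note that this mapping is lossy: two different original names can end up identical, so a real renaming script would need to check for collisions before moving anything.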

As a side note: yes, this is sort of a nasty "hack", but if you work cross-platform you often have to fall back to the lowest common denominator. You would take it for granted that something as basic as character encoding would be a hard standard, but it turns out standards are a hard thing to get. This XKCD sums it up quite nicely

paradoxon

Posted 2013-01-13T15:49:04.233

Reputation: 596

I am sick and tired of encoding problems in file names when I try (and fail) to sync files between Mac, Windows and Linux (via Syncthing). I would adopt your advice, but Turkish has ç, ş, ı, ğ, ü, ö, which are not convenient to write with alphanumerics. I want to refrain from using cloud storage, but this problem forces me to do so. – Teo – 2018-10-02T18:28:02.660

0

My guess is that the problem is the filesystem on which you are attempting to decompress or manipulate the files. FAT32 isn't going to like your umlauts. Try copying these files off of the flash drive (or what have you) and then decompress the zip file to see what kind of characters the filenames produce.

Both NTFS (Windows) and Ext4 (Mint) shouldn't have a problem with the name encoding.

The name encoding of the zip files themselves on the FAT32 system is most likely not going to change or be fixed when you copy them to a properly supporting filesystem, but the subdirectories should be fine once decompressed.

CenterOrbit

Posted 2013-01-13T15:49:04.233

Reputation: 1 759