We have a web server running CentOS 5.8 that uses SVN for version control. When trying to switch to the latest revision, we got an error about the filenames of files in an upload directory:

svn: Error converting entry in directory 'adm/emails/upload' to UTF-8
svn: Valid UTF-8 data
(hex: 54 79)
followed by invalid UTF-8 sequence
(hex: f6 6b 69 72)

Upon investigating, we noticed there were some files that had broken filenames:

$ ls ~/public_html/adm/emails/upload/
Ty?el?m?trendit.csv
Ty?kirja1.csv

To get the update completed quickly, we simply moved the files into our home directory with mv. Surprisingly, their filenames looked fine in their new location:

$ ls ~/
Työelämätrendit.csv
Työkirja1.csv

After the update we moved them back to where they were and their filenames were broken again. What could cause this and how can we fix it? The system's locale is set to LANG=en_US.UTF-8.

  • Maybe you need to recreate the directory they're in (not rename, but recreate). After all, a directory is where the filenames are stored in. – Halfgaar Jun 18 '12 at 08:02
  • Are your home directories and your webroot on different filesystems? And if so, what filesystems are they? – Ladadadada Jun 18 '12 at 08:34
  • @Ladadadada: Good point, but it seems they're both on the same ext3 filesystem. – Kaivosukeltaja Jun 18 '12 at 09:42
  • @Halfgaar: I created a new directory and moved the problematic files there, but they still remain broken. – Kaivosukeltaja Jun 18 '12 at 09:43
  • it *seems* they're on the same FS? Can you post the output of `df -h` or `mount`? – Halfgaar Jun 18 '12 at 12:27
  • @Halfgaar: They're both on the `/` mount point. The two other mount points are for `/dev/shm` and `/tmp`. I can get the exact output tomorrow if still needed. – Kaivosukeltaja Jun 18 '12 at 14:26
  • a long shot: but could you run a fsck on the filesystem? You can do `tune2fs -T 20120101 /dev/sda1 (or whatever)` to set the last check time far in the past and then reboot. – Halfgaar Jun 18 '12 at 14:44

1 Answer

The x54 x79 is "Ty" in ASCII, which is valid in both ISO-8859-1 and UTF-8, but the xF6 x6B x69 x72 is "ökir" in ISO-8859-1 encoding and is NOT valid UTF-8. That it's being translated both ways is somewhere between creepy and brilliant, which raises the question of whether the filesystem is even involved.
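To see this for yourself, here is a small sketch that feeds the exact bytes from the error message through iconv. The helper names name_bytes and is_valid_utf8 are made up for illustration, and it assumes an iconv binary is installed:

```shell
# Bytes from the svn error: 54 79 f6 6b 69 72
# (\366 is octal for 0xf6; octal escapes work in any POSIX printf)
name_bytes() { printf 'Ty\366kir'; }

# Exit status 0 if stdin is valid UTF-8, non-zero otherwise
is_valid_utf8() { iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; }

if name_bytes | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi

# Reinterpret the same bytes as ISO-8859-1 and convert them to UTF-8
name_bytes | iconv -f ISO-8859-1 -t UTF-8
echo
```

The first check should fail on the f6 byte, while the second conversion should succeed, which matches the reading that the name was written as ISO-8859-1.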

Most Unix filesystems are pretty agnostic about character sets: they just store bytes. You could check both filesystems, if there are two (one might not be ext3), and the specifics of how they're mounted. Also dig into whether the path through ~/public_html/adm/emails/upload/ goes through NFS or something like it, which might be layering another filesystem character set over the underlying one. Samba would be a really interesting thing to find there, since it has explicit charset options.
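Since the filesystem only stores bytes, one way to take the terminal and locale out of the picture is to dump the raw bytes of each name in both locations and compare them. A rough sketch follows; name_hex and dump_names are hypothetical helper names, and it only assumes the standard od and tr tools:

```shell
# Print the bytes of a string as one run of hex digits
name_hex() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; }

# List every entry in a directory as "<hex bytes>  <name>"
dump_names() {
    for f in "$1"/*; do
        [ -e "$f" ] || continue          # skip the unexpanded glob in an empty dir
        base=${f##*/}                    # strip the directory part
        printf '%s  %s\n' "$(name_hex "$base")" "$base"
    done
}

# Run it once in the upload directory and once in the home directory
dump_names .
```

An ö stored as UTF-8 shows up as c3b6, while the ISO-8859-1 form is a lone f6. If the hex is identical in both directories, the difference is in how the names are displayed, not in how they are stored.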

Of course, checking to see if LC_CTYPE is set oddly is a good idea too:

$ touch Työelämätrendit.csv
$ ls T*
Työelämätrendit.csv
$ LC_CTYPE=C ls T*
Ty??el??m??trendit.csv
$

Perhaps LC_CTYPE wasn't set in the SVN process? That's not hard to have happen when it's being run indirectly by a webserver, batch job, etc.
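A quick way to compare what an interactive shell sees with what a stripped-down process might see is to ask locale for the active charmap in both cases. This is only a sketch: charmap_in_clean_env is a made-up helper name, and an empty environment is an approximation of a cron job or webserver child, not an exact reproduction of one:

```shell
# Character set the locale resolves to in the current shell
locale charmap

# Same query with the environment wiped (PATH kept so locale can be found),
# roughly what a process sees when nothing sets LANG or LC_* for it
charmap_in_clean_env() { env -i PATH="$PATH" locale charmap; }
charmap_in_clean_env
```

If the second result is not UTF-8, setting the variable explicitly for the command is one thing to try, for example LC_CTYPE=en_US.UTF-8 svn update, using the locale name from the question.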

Alex North-Keys