
I'm not sure I've grasped the issue here, so if I haven't, just say so and I'll edit the title.

My problem is the following:

I have an Ubuntu 12.04 server (UTF-8 locale) to which users upload files via a web app or through the shell, so I have no control over naming conventions. These names are then stored in a UTF-8 MySQL database table.

Unfortunately, it seems some of the file names contain special characters that my database does not like.

One such example would be é (e followed by U+0301) in place of é (U+00E9). My database does not enjoy this one bit and replaces such instances with e?. The shell itself has either displayed the names correctly when ls was used or has shown broken "nonexistent" symbols in the current folder path. I've also seen the likes of E?? in place of É (E followed by U+0301), which FYI should be É (U+00C9).

This is a headache as I can't even seem to run a find command on files with such characters.

So my first question is: is there a shell command I can use to convert filenames on upload (something I could run recursively on a folder)? Ideally it would convert them to the appropriate equivalent, but I don't mind if I have to replace any such Unicode sequences with an arbitrary character such as "_".

Thanks in advance.


1 Answer


I tried to answer this, but ultimately I ended up writing a small article on UTF-8 and character conversion (which is why I feel that this question is, regretfully, very close to off-topic).

The short version is that you can't do this in a sane fashion because you have no reliable way to coerce characters between encodings. HTTP and other encoding-aware protocols and formats supply the encoding as part of the payload. File names do not: there is no filesystem metadata that indicates how the name is encoded.

This is a process problem. You have no way of controlling how people uploading the files are going to use the bits of the characters in the file name, and thus cannot do anything with that other than working with the raw bytes you've been given.

You have three options:

  • Run an automated process that junks anything with invalid UTF-8 continuation characters in the filename. You will still end up with filenames that are incorrectly expressed for your encoding, but at least programs won't puke. Your database should ideally have a UTF-8 encoding.
  • Store the filenames in the database as-is and do not allow any coercion between UTF-8 and the target encoding in your database to take place. Your database must use a single-byte encoding, and these strings may be invalid if interpreted as UTF-8.
  • Rearchitect what you're doing entirely.
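As a rough illustration of the first option, here is a minimal Python sketch that walks a directory tree using raw bytes and reports entries whose names do not decode as UTF-8. The function names `is_valid_utf8` and `find_invalid_names` are hypothetical, chosen only for this example, and what you then do with the reported paths (rename, quarantine, delete) is left open.

```python
import os


def is_valid_utf8(raw_name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        raw_name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_invalid_names(root: bytes):
    """Yield full paths (as bytes) of entries whose names are not valid UTF-8."""
    # Passing a bytes path makes os.walk return bytes as well, so the
    # names are never decoded implicitly along the way.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)
```

Note that this only catches names that are not well-formed UTF-8. A name such as "e" followed by U+0301 is valid UTF-8 and passes this check, even though it is a different sequence of code points from U+00E9.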
Andrew B
  • Thanks for the input. Is there any way I can use iconv or something similar to just take the filename bit by bit and get something that my programs "won't puke" on, as per your first point? It doesn't matter if the filename is affected; it's still better than removing it altogether. – D.Mill Apr 07 '13 at 00:47
  • It's possible, but I don't know of any tools that do that offhand. It would be a program that scans the string from the beginning and discards any invalid multi-byte sequences. – Andrew B Apr 07 '13 at 00:56
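The kind of program described in the last comment could be sketched in Python roughly as follows. This is an assumption-laden example, not a tool either commenter mentioned: `sanitize_name` is a hypothetical helper, it assumes the bytes are meant to be UTF-8, and the final NFC normalization step is an optional extra that folds a base letter plus combining accent into its precomposed form where one exists.

```python
import unicodedata


def sanitize_name(raw_name: bytes, replacement: str = "_") -> str:
    """Decode a raw filename as UTF-8, substituting invalid sequences.

    Pass replacement="" to discard invalid sequences instead.
    """
    # errors="replace" turns every undecodable sequence into U+FFFD.
    text = raw_name.decode("utf-8", errors="replace")
    # Swap U+FFFD for the chosen placeholder. Caveat: a genuine U+FFFD
    # already present in the name is replaced too.
    text = text.replace("\ufffd", replacement)
    # Optional step (an assumption of this sketch): compose sequences
    # such as "e" + U+0301 into the single code point U+00E9.
    return unicodedata.normalize("NFC", text)
```

To apply this to files on disk, one would pair it with `os.rename` on bytes paths and decide how to handle two different names that sanitize to the same result; that part is deliberately left out here.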