Convert many files to the same encoding

1

1

I would like to make sure that all of the files in a big project repository are correctly encoded in UTF-8. Is there a tool for that, or a way to do it using Unix tools?

mnml

Posted 2009-11-17T11:33:20.177

Reputation: 1 391

Answers

1

In general, there is no way to do this with certainty. UTF-8 has no mandatory "magic number" or marker, so you can prove that a file is not UTF-8 (if it contains invalid byte sequences), but you cannot prove that it is.
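To illustrate the asymmetry, here is a minimal Python sketch. The helper name is_valid_utf8 and the sample word are my own choices for illustration, not part of any tool. It shows that Latin-1 bytes are rejected as UTF-8, while genuine UTF-8 bytes also decode without error as Latin-1, which is why passing a validity check is evidence rather than proof.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


word = "caf\u00e9"                    # "cafe" with an acute accent on the e
latin1_bytes = word.encode("latin-1")  # b'caf\xe9'
utf8_bytes = word.encode("utf-8")      # b'caf\xc3\xa9'

print(is_valid_utf8(latin1_bytes))  # False: a lone 0xE9 is an invalid sequence
print(is_valid_utf8(utf8_bytes))    # True

# The same UTF-8 bytes are also "valid" Latin-1; they just decode to the
# wrong text (5 characters instead of 4), so validity alone proves nothing.
misread = utf8_bytes.decode("latin-1")
print(len(word), len(misread))      # 4 5
```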

You can however use a heuristic approach. What exactly works will depend on your data.

One idea:

  • Make a list of all files that are text files and contain non-ASCII characters. The second part is easy to do with Perl or a similar tool; the first depends on what kinds of files you have. The Unix file command also detects non-ASCII characters, but it is less reliable because it only examines the start of each file.
  • If the list is small, check the files manually. Otherwise, test which of them are valid UTF-8 (again, Perl has modules for this, or you can use a tool such as iconv or recode). The files that validate are probably fine. The rest will have to be checked by hand, unless you know for certain how they are encoded.
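The two steps above could be sketched roughly as follows, here in Python rather than Perl. This is only a heuristic under stated assumptions: the names classify and scan are hypothetical, a file is treated as binary if it contains a NUL byte (which also misclassifies UTF-16 text as binary), and each file is read fully into memory.

```python
import os


def classify(path):
    """Classify one file as 'binary', 'ascii', 'utf8', or 'not-utf8'."""
    with open(path, "rb") as f:
        data = f.read()
    # Assumption: text files never contain NUL bytes.
    if b"\x00" in data:
        return "binary"
    try:
        data.decode("ascii")
        return "ascii"        # pure ASCII is already valid UTF-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf8"         # valid UTF-8, so probably OK
    except UnicodeDecodeError:
        return "not-utf8"     # definitely not UTF-8; check by hand


def scan(root):
    """Walk a directory tree and map each file path to its classification."""
    results = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata directories.
        dirnames[:] = [d for d in dirnames if d not in {".git", ".svn", ".hg"}]
        for name in filenames:
            path = os.path.join(dirpath, name)
            results[path] = classify(path)
    return results
```

Calling scan(".") at the top of the repository and printing the paths whose value is "not-utf8" gives the list of files that need manual inspection; the "utf8" entries are the ones that contain non-ASCII characters and passed validation.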

sleske

Posted 2009-11-17T11:33:20.177

Reputation: 19 887