I would like to make sure that all of my files are correctly encoded in UTF-8 in a big project repository. Is there a tool for that or a way to do it using unix tools?
In general, there is no way to do this. UTF-8 has no "magic number" or marker, so you can only prove that a file is not in UTF-8 (if it contains invalid sequences), but not that it is.
You can, however, use a heuristic approach. What exactly works will depend on your data.
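One idea:
1. Find all files that contain non-ASCII characters (pure ASCII files are already valid UTF-8). The file command will also check for non-ASCII characters, but it is less reliable because it only checks the start of the file.
2. Check whether those files are valid UTF-8 (for example with iconv or recode).
The valid UTF-8 files are probably OK. The rest will have to be checked by hand (unless you know for certain how they are encoded).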
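If a script is more convenient than chaining shell tools, the two steps above can be sketched in Python. This is only a rough sketch, not a replacement for iconv or recode: the names classify and scan are made up for illustration, and it assumes every file fits in memory and that you will filter out binaries and directories such as .git yourself. It relies on Python's strict UTF-8 decoder to reject invalid sequences.

```python
from pathlib import Path


def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'invalid' for a byte string."""
    try:
        data.decode("ascii")
        return "ascii"  # pure ASCII is trivially valid UTF-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict error handling is the default
        return "utf-8"  # valid, so probably (not provably) UTF-8
    except UnicodeDecodeError:
        return "invalid"  # definitely not UTF-8; check by hand


def scan(root: str) -> dict:
    """Group every regular file under root by its classification."""
    results = {"ascii": [], "utf-8": [], "invalid": []}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            results[classify(path.read_bytes())].append(path)
    return results
```

Calling scan(".")["invalid"] from the repository root would list the files that need a manual look. Files reported as "utf-8" only passed validation, so the heuristic caveat above still applies to them.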