How can I convert multiple files to UTF-8 encoding using *nix command line tools?

46

21

Possible Duplicate:
Batch-convert files for encoding or line ending

I have a bunch of text files that I'd like to convert from any given charset to UTF-8 encoding.

Are there any command line tools or Perl (or language of your choice) one liners I can use to do this en masse?

jason

Posted 2009-08-01T00:29:36.420

Reputation: 625

Question was closed 2010-02-25T22:00:13.570

Answers

56

iconv converts between many character encodings. With a small bash loop around it, we can write:

for file in *.txt; do
    iconv -f ascii -t utf-8 "$file" -o "${file%.txt}.utf8.txt"
done

This runs iconv -f ascii -t utf-8 on every file ending in .txt and writes the recoded output to a file with the same base name, ending in .utf8.txt instead of .txt.

This particular conversion will not actually change your files (ASCII is a subset of UTF-8, so the bytes come out identical), but it answers your question about how to convert between encodings: put the real source charset after -f.
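To make the ASCII point concrete, here is a minimal sketch that converts one pure-ASCII file and one Latin-1 file and compares the bytes. The scratch directory, the file names and the choice of ISO-8859-1 as the source charset are assumptions for illustration only. It uses shell redirection rather than -o, since not every iconv accepts -o.

```shell
#!/bin/sh
# Scratch directory so the demo never touches real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two sample inputs: pure ASCII, and Latin-1 with byte 0xE9 (e-acute).
printf 'plain\n' > ascii.txt
printf 'caf\351\n' > latin1.txt

# ASCII input: the output is byte-for-byte identical to the input.
iconv -f ASCII -t UTF-8 ascii.txt > ascii.utf8.txt
cmp ascii.txt ascii.utf8.txt && echo "ascii.txt: unchanged"

# Latin-1 input: the single byte 0xE9 becomes the two-byte sequence c3 a9.
iconv -f ISO-8859-1 -t UTF-8 latin1.txt > latin1.utf8.txt
od -An -tx1 latin1.utf8.txt
```

If the source charset is wrong (for example -f ASCII on the Latin-1 file), iconv stops with an error at the first byte it cannot interpret, so it is worth knowing the real encoding before running the loop over many files.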

Vinko Vrsalovic

Posted 2009-08-01T00:29:36.420

Reputation: 2 276

2If your version of iconv does not support the -o option, you can replace it with > to use shell redirection instead (>> would append to an existing output file rather than overwrite it). – rob – 2015-10-09T08:45:46.720

2You should quote the var $i, in order to handle filenames with spaces. – Richard Hoskins – 2009-08-01T01:47:05.857

It will do things, it'll add a BOM for one... – jason – 2009-08-01T01:58:14.553

Are you sure iconv will add a BOM? I was under the impression that it wouldn't with UTF-8. – Richard Hoskins – 2009-08-01T02:08:06.727

5I just tested this with iconv (GNU libiconv 1.11), and it did not add a BOM. It is my understanding that iconv will only add a BOM if one is present in the input, which it would not be in ASCII. BOM are problematic, and not necessary with UTF-8. – Richard Hoskins – 2009-08-01T02:31:40.083
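Anyone who wants to repeat that test on their own iconv build can look at the first three bytes of the output. This is only a sketch: the scratch directory and the file names in.txt and out.txt are made up for the demo. A UTF-8 BOM would show up as the bytes ef bb bf.

```shell
#!/bin/sh
# Scratch directory so the demo never touches real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# ASCII input without a BOM, converted the same way as in the answer.
printf 'hello\n' > in.txt
iconv -f ASCII -t UTF-8 in.txt > out.txt

# Read only the first three bytes of the output as hex (od -N limits the count).
first3=$(od -An -tx1 -N3 out.txt | tr -d ' \n')

# A UTF-8 BOM is exactly the byte sequence ef bb bf at the start of the file.
if [ "$first3" = "efbbbf" ]; then
    echo "out.txt: BOM present"
else
    echo "out.txt: no BOM (first bytes: $first3)"
fi
```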

FYI, Windows has a tendency to put a BOM in all Unicode files, even UTF-8. This can be seen with Notepad by choosing the encoding in the Save As dialog. It lists "Unicode", "Unicode big endian" and "UTF-8" in addition to the classic "ANSI" encoding. All but ANSI include a BOM. – RBerteig – 2009-08-01T08:37:33.143
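For files that arrive from Windows with such a BOM, one possible way to drop it is to skip the first three bytes when they match ef bb bf. This is a hedged sketch, not something iconv does for you: notepad.txt and notepad.nobom.txt are hypothetical names, and the input is simulated with printf.

```shell
#!/bin/sh
# Scratch directory so the demo never touches real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Simulate a Notepad-style UTF-8 file: BOM (ef bb bf) followed by text.
printf '\357\273\277hello\n' > notepad.txt

# If the first three bytes are a BOM, copy everything from byte 4 onward;
# otherwise copy the file unchanged.
if [ "$(od -An -tx1 -N3 notepad.txt | tr -d ' \n')" = "efbbbf" ]; then
    tail -c +4 notepad.txt > notepad.nobom.txt
else
    cp notepad.txt notepad.nobom.txt
fi

# Dump the result as hex to confirm the file now starts with the text itself.
od -An -tx1 notepad.nobom.txt
```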

iconv follows the principle of least surprise: no BOM on input, no BOM on output. – Vinko Vrsalovic – 2009-08-01T09:24:00.660