There are many plain text files that were encoded in various charsets.

I want to convert them all to UTF-8, but before running iconv, I need to know each file's original encoding. Most browsers have an Auto Detect option for encodings; however, I can't check those text files one by one because there are too many.

Only after the original encoding is known can I convert the texts with iconv -f DETECTED_CHARSET -t utf-8.

Is there any utility to detect the encoding of plain text files? It DOES NOT have to be 100% perfect; I don't mind if 100 out of 1,000,000 files are misconverted.
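A minimal sketch of the detect-then-convert loop, assuming the chardetect command that ships with the Python chardet package (discussed in the comments below). The awk parsing and the .utf8 output names are my own glue, and chardet's encoding names usually, but not always, match the names iconv accepts:

    #!/bin/bash
    # For each file: guess the charset with chardetect, then convert with iconv.
    for f in *.txt; do
        # chardetect prints e.g. "<stdin>: GB2312 with confidence 0.99";
        # field 2 is the detected encoding, or "no" when it prints "no result".
        enc=$(chardetect < "$f" | awk '{print $2}')
        if [ -n "$enc" ] && [ "$enc" != "no" ]; then
            iconv -f "$enc" -t UTF-8 "$f" > "$f.utf8"
        fi
    done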
Regarding speed: running chardet <(head -c4000 filename.txt) was much faster and equally successful for my use case. (In case it's not clear, this bash syntax sends only the first 4000 bytes to chardet.) – ndemou – 2015-12-26T19:32:10.780
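In loop form, ndemou's cutoff would replace the detection line in the sketch above with something like this (4000 bytes is his figure; whether it is enough depends on your data):

    # Detect from only the first 4000 bytes; still convert the full file.
    enc=$(head -c4000 "$f" | chardetect | awk '{print $2}')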
@ndemou I have chardet==3.0.4, and the command-line tool's actual executable is named chardetect, not chardet. – Devy – 2018-03-26T14:26:02.053

Yes, and it's already packaged as python-chardet in the Ubuntu universe repo. – Xiè Jìléi – 2011-06-25T06:21:19.510

Even when its guess isn't perfect, chardet still reports its best guess, like ./a.txt: GB2312 (confidence: 0.99), whereas Enca simply failed and reported 'Unrecognized encoding'. Sadly enough, though, chardet runs very slowly. – Xiè Jìléi – 2011-06-25T06:48:56.897
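If misdetections are a concern, the confidence figure can serve as a gate before converting; a sketch assuming chardetect's "with confidence N" output format and an arbitrary 0.5 cutoff:

    # Only convert when chardet is reasonably sure of its guess.
    out=$(chardetect < "$f")             # e.g. "<stdin>: GB2312 with confidence 0.99"
    enc=$(awk '{print $2}' <<<"$out")
    conf=$(awk '{print $NF}' <<<"$out")
    # awk handles the float comparison; 0.5 is an arbitrary threshold.
    if awk -v c="$conf" 'BEGIN { exit !(c >= 0.5) }'; then
        iconv -f "$enc" -t UTF-8 "$f" > "$f.utf8"
    fi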
@谢继雷: Have it run overnight or something like that. Charset detection is a complicated process. You could also try the Java-based jChardet, or ... the original chardet is part of Mozilla, but only the C++ source is available, with no command-line tool. – user1686 – 2011-06-25T12:13:28.867