How to auto-detect text file encoding?

I have many plain text files that were encoded in various charsets.

I want to convert them all to UTF-8, but before running iconv, I need to know their original encodings. Most browsers have an Auto Detect option for encodings; however, I can't check those text files one by one because there are too many.

Once I know the original encoding, I can convert the texts with iconv -f DETECTED_CHARSET -t utf-8.

Is there any utility to detect the encoding of plain text files? It DOES NOT have to be 100% perfect; I don't mind if 100 files out of 1,000,000 are misconverted.

Xiè Jìléi

Answers

Try the chardet Python module, which is available on PyPI:

pip install chardet

Then run chardetect myfile.txt.

Chardet is based on the detection code used by Mozilla, so it should give reasonable results, provided that the input text is long enough for statistical analysis. Do read the project documentation.

As mentioned in the comments, it is quite slow, but some distributions also ship the original C++ version, as @Xavier found in https://superuser.com/a/609056. There is also a Java version somewhere.
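
To batch-convert files with it, one can wire chardetect into iconv. Below is a minimal Bash sketch, assuming chardetect's usual "FILE: ENCODING with confidence N" output format; the .utf8 output suffix is an arbitrary choice for this example:

for f in *.txt; do
  # grab the second field of "FILE: ENCODING with confidence N"
  enc=$(chardetect "$f" | awk '{print $2}')
  iconv -f "$enc" -t utf-8 "$f" > "$f.utf8"
done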

user1686

Regarding speed: running chardet <(head -c4000 filename.txt) was much faster and equally successful for my use case. (In case it's not clear, this Bash syntax will send only the first 4000 bytes to chardet.) – ndemou – 2015-12-26T19:32:10.780

@ndemou I have chardet==3.0.4, and the command line tool's actual executable name is chardetect not chardet. – Devy – 2018-03-26T14:26:02.053

Yes, and it's already packaged as python-chardet in the Ubuntu universe repo. – Xiè Jìléi – 2011-06-25T06:21:19.510

Even if it can't make a perfect guess, chardet will still give the most likely one, like ./a.txt: GB2312 (confidence: 0.99). Compare Enca, which just failed and reported 'Unrecognized encoding'. However, sadly enough, chardet runs very slowly. – Xiè Jìléi – 2011-06-25T06:48:56.897

@谢继雷: Have it run overnight or something like that. Charset detection is a complicated process. You could also try the Java-based jChardet, or ... the original chardet is part of Mozilla, but only the C++ source is available, with no command-line tool. – user1686 – 2011-06-25T12:13:28.867

I would use this simple command:

encoding=$(file -bi myfile.txt)

Or if you want just the actual character set (like utf-8):

encoding=$(file -b --mime-encoding myfile.txt)
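
The variable can then be fed straight to iconv, as the question intends. A small sketch, assuming file reports a charset name iconv recognizes (for unrecognizable input it may print something like "unknown-8bit" or "binary", which iconv will reject):

encoding=$(file -b --mime-encoding myfile.txt)
iconv -f "$encoding" -t utf-8 myfile.txt > myfile.utf8.txt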

user103313

What if the extension is lying? – james.garriss – 2014-10-03T13:24:05.867

@james.garriss: a file's extension has nothing to do with its (text) content encoding. – MestreLion – 2014-11-28T12:18:08.180

Unfortunately, file only detects encodings with specific properties, such as UTF-8 or UTF-16. The rest -- the oldish ISO 8859 encodings and their MS-DOS and Windows counterparts -- are listed as "unknown-8bit" or something similar, even for files which chardet detects with 99% confidence. – user1686 – 2011-10-28T19:09:05.817

file showed me iso-8859-1 – cweiske – 2012-03-30T07:22:20.913

On Debian-based Linux, the uchardet package (Debian / Ubuntu) provides a command-line tool. See the package description below:

 universal charset detection library - cli utility
 .
 uchardet is a C language binding of the original C++ implementation
 of the universal charset detection library by Mozilla.
 .
 uchardet is a encoding detector library, which takes a sequence of
 bytes in an unknown character encoding without any additional
 information, and attempts to determine the encoding of the text.
 .
 The original code of universalchardet is available at
 http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
 .
 Techniques used by universalchardet are described at
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
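
The tool prints its best guess for the named file, so its output can be substituted straight into iconv. A minimal sketch (myfile.txt is just a placeholder):

uchardet myfile.txt
# prints the detected charset, e.g. UTF-8 or WINDOWS-1252

iconv -f "$(uchardet myfile.txt)" -t utf-8 myfile.txt > myfile.utf8.txt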

Xavier

Thanks! From the project's homepage it wasn't obvious to me that there was a CLI included. It's also available on OS X when installing uchardet via Homebrew. – Stefan Schmidt – 2013-07-06T14:47:18.503

I was a little confused at first because an ISO 8859-1 document was falsely identified as Windows-1252, but in the printable range Windows-1252 is a superset of ISO 8859-1, so conversion with iconv works fine. – Stefan Schmidt – 2013-07-06T14:56:51.930

For Linux there is enca, and for Solaris you can use auto_ef.
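
enca likes to be told the language of the files (the -L flag); if I recall its options correctly, it can also convert in place with -x, skipping iconv entirely. A sketch for Chinese text:

enca -L zh a.txt           # detect only, with a language hint
enca -L zh -x UTF-8 a.txt  # convert the file to UTF-8 in place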

cularis

uchardet failed (it detected CP1252 instead of the actual CP1250), but enca worked fine. (A single example; hard to generalize...) – Palo – 2015-11-16T20:52:33.047

Enca seems too strict for me: enca -d -L zh ./a.txt failed with the message ./a.txt: Unrecognized encoding Failure reason: No clear winner. As @grawity mentioned, chardet is more lax; however, it's still too slow. – Xiè Jìléi – 2011-06-25T07:06:24.793

Enca completely fails the "actually does something" test. – Michael Wolf – 2012-03-01T18:59:29.533

Those who regularly use Emacs might find the following useful (it lets you inspect and manually validate the transformation).

Moreover, I often find that Emacs's charset auto-detection is much more effective than the other charset auto-detection tools (such as chardet).

;; List the files to re-encode (file-truename expands each path).
(setq paths (mapcar 'file-truename '(
 "path/to/file1"
 "path/to/file2"
 "path/to/file3"
)))

;; Visit each file (Emacs auto-detects its current encoding), mark the
;; buffer to be written as UTF-8 with Unix line endings, and save it.
(dolist (path paths)
  (find-file path)
  (set-buffer-file-coding-system 'utf-8-unix)
  (save-buffer))

Then, a simple call to Emacs with this script as an argument (see the "-l" option) does the job.
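
For instance, assuming the script above were saved as reencode.el (a file name chosen just for this example), a non-interactive run could look like:

emacs --batch -l reencode.el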

Yves Lhuillier

Mozilla has a nice codebase for auto-detection in web pages:
http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/

Detailed description of the algorithm:
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

Martin Hennings

UTFCast is worth a try. It didn't work for me (maybe because my files are terrible), but it looks good.

http://www.addictivetips.com/windows-tips/how-to-batch-convert-text-files-to-utf-8-encoding/

Sameer

Getting back to chardet (Python 2 here), this call might be enough:

python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())' < file
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

Though it's far from perfect...

echo "öasd" | iconv -t ISO-8859-1 | python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())'
{'confidence': 0.5, 'encoding': 'windows-1252'}

estani

isutf8 (from the moreutils package) did the job.
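
For example, one might use its exit status to list just the files that still need converting. A minimal sketch:

for f in *.txt; do
  # isutf8 exits non-zero when a file is not valid UTF-8
  isutf8 "$f" > /dev/null 2>&1 || echo "$f needs conversion"
done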

Ronan

How? This answer isn't really helpful. – Moses – 2015-10-28T19:02:04.180

It's not exactly what was asked, but it's a useful tool. If the file is valid UTF-8, the exit status is zero; if the file is not valid UTF-8, or there is some error, the exit status is non-zero. – ton – 2016-02-16T17:34:50.263

Also, in case file -i gives you "unknown", you can use PHP to guess the charset, as below.

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

More accurate, using mb_list_encodings():

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

In the first example, you can see that I put a list of encodings (tried in list order) that might match. To get a more accurate result, you can pass all possible encodings via mb_list_encodings().

Note that the mb_* functions require php-mbstring:

apt-get install php-mbstring 

See this answer: https://stackoverflow.com/a/57010566/3382822

Mohamed23gharbi
