Is there a Linux command to find out if a file is UTF-8?

13

2

Joomla .ini files must be saved as UTF-8.

After editing, I'm not sure whether the files are still UTF-8.

Is there a Linux command, like file, or a combination of commands that would tell me whether a file is valid UTF-8?

Edward

Posted 2013-09-24T20:51:53.787

Reputation: 329

2You cannot tell the encoding of a file. You can only make a smart guess. You might mostly guess right, but sometimes guesses fail. file is an example of a program doing smart guesses. – Marco – 2013-09-24T21:17:15.210

1@Marco: It is possible to verify whether it is valid UTF-8 or not, however. There are some encodings which can mistakenly pass as valid UTF-8, but it almost never happens with ISO-8859-* or Windows-125* encodings/charsets. – user1686 – 2013-09-24T21:40:24.090
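
The validity check described in the comment above can be sketched with iconv. Note that iconv is not mentioned anywhere in this thread; it is swapped in here as an assumption because it is commonly available. Converting from UTF-8 to UTF-8 forces the whole input to be decoded, and iconv exits with a non-zero status when it meets an invalid byte sequence. The function name is_valid_utf8 is hypothetical.

```shell
# Hypothetical helper, assuming iconv is installed.
# Succeeds (exit status 0) only if every byte of the file decodes as UTF-8.
is_valid_utf8() {
    # Read the file on stdin, discard the converted output and any diagnostics;
    # only the exit status matters here.
    iconv -f UTF-8 -t UTF-8 < "$1" >/dev/null 2>&1
}

# Usage (the filename is a placeholder):
#   is_valid_utf8 en-GB.ini && echo "valid UTF-8" || echo "NOT valid UTF-8"
```

This only proves that the bytes form valid UTF-8. As noted above, a file in another encoding can occasionally pass by accident, and a pure ASCII file always passes.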

Answers

29

You can determine the file encoding with the following command:

file -bi filename
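
As a rough sketch of how the output could be used in a script: on Linux, file -bi prints a MIME string such as "text/plain; charset=utf-8", so the charset can be cut out and compared. The function names charset_from_mime and looks_like_utf8 are hypothetical, and treating us-ascii as acceptable is an assumption based on ASCII being a subset of UTF-8. The result is still only file's guess.

```shell
# Hypothetical helper: pull the charset out of a MIME string
# such as "text/plain; charset=utf-8".
charset_from_mime() {
    # Strip everything up to and including "charset=".
    printf '%s\n' "${1##*charset=}"
}

# Hypothetical helper: succeeds when file(1) guesses UTF-8 or plain ASCII.
# Assumption: us-ascii counts as acceptable because ASCII is a subset of UTF-8.
looks_like_utf8() {
    case $(charset_from_mime "$(file -bi "$1")") in
        utf-8|us-ascii) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (the filename is a placeholder):
#   looks_like_utf8 en-GB.ini && echo "probably UTF-8"
```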

Rik

Posted 2013-09-24T20:51:53.787

Reputation: 11 800

This answer should be accepted. The explanation for the -bi options is in the man page.

– Jérôme – 2016-01-13T14:04:13.280

Is it supposed to work on macOS as well? I get "regular file" on a file I thought was UTF-8. – nicolas – 2016-04-24T15:49:37.763

3@nicolas For MacOS you could try file -I filename (-I is a capital i). – Rik – 2016-04-24T16:07:11.743

@Rik I can confirm – nicolas – 2016-04-24T16:08:46.753

2Does this read the whole file? – ctrl-alt-delor – 2018-03-30T15:17:20.647

@ctrl-alt-delor What do you mean by "read the whole file"? It shouldn't have to, as the file encoding is probably placed in the header of the file. – kojow7 – 2018-04-20T15:33:09.357

@kojow7 UTF-8 has no header. Pure 7-bit ASCII is indistinguishable from UTF-8 (that is the point of it; a header would cause all sorts of problems). So if a file is ASCII for the first MB and then contains a single UTF-8 character, you will not know unless you read the whole file. – ctrl-alt-delor – 2018-04-21T16:41:14.590

@kojow7 because if you only read a few bytes (3 are enough for the UTF-8 BOM) then the rest of the file can be, say, a PNG and thus not a valid UTF-8 file. – Alexis Wilke – 2018-12-28T10:11:23.450

6

There is: use the isutf8 command from the moreutils package.

Source: How can you tell if a file is UTF-8 encoded or not?
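
A minimal sketch of using isutf8 in a script, assuming moreutils is installed and relying only on its exit status (zero when the file is valid UTF-8, non-zero otherwise). The function name report_utf8 and the message wording are hypothetical.

```shell
# Hypothetical helper: print one verdict per file and return 1 if any file fails.
# Returns 127 when isutf8 itself is not installed.
report_utf8() {
    if ! command -v isutf8 >/dev/null 2>&1; then
        echo "isutf8 not found (it ships with moreutils)" >&2
        return 127
    fi
    failed=0
    for f in "$@"; do
        # isutf8 prints the position of the first bad byte; silence it here.
        if isutf8 "$f" >/dev/null 2>&1; then
            echo "valid UTF-8: $f"
        else
            echo "NOT valid UTF-8: $f"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (the filenames are placeholders):
#   report_utf8 language/en-GB/*.ini
```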


Pablo Olmos de Aguilera C.

Posted 2013-09-24T20:51:53.787

Reputation: 351

@davidpostill I'm curious, is it bad practice to cite the author in the reference? – Pablo Olmos de Aguilera C. – 2016-08-28T20:26:56.110

No. However, it is good practice to make the link say where it leads me. Assume I'm reading only the blue text. After the edit, I can tell why and when I should click that. Before, I could not. (It wasn't me who made the edit but I'm like 94% sure that this is what it was about.) – Hermann Döppes – 2018-12-31T00:00:26.880

Nice, and it works well with find -type f -exec isutf8 {} +, because it also quotes the filename. (Using find ... -exec ... + is also fast.) – Tomasz Gandor – 2019-03-22T13:28:19.303

0

Yet another way is to use recode, which will exit with an error if it tries to decode UTF-8 and encounters invalid characters.

if recode utf8/..UCS < "$FILE" >/dev/null 2>&1; then
    echo "Valid utf8 : $FILE"
else
    echo "NOT valid utf8: $FILE"
fi

mivk

Posted 2013-09-24T20:51:53.787

Reputation: 2 270