LINUX: Can a file which is shown as ASCII text contain UTF-8 data?

Question

Linux version: Red Hat Enterprise Linux ES release 4

I need to confirm that an extract from a database has been written correctly with UTF-8 encoding. I created the file using the mechanism specified by the database vendor, but when I ran

$ file extract.txt

it returned

ASCII text, with very long lines

However, when I created a smaller file from part of the main extract file and ran

$ file sub_extract.txt

it returned

UTF-8 Unicode text, with very long lines

So is my file actually OK, and is this just a limitation of the file command? Is there a better way of checking whether a file contains UTF-8 data?

Are you making a backup? Load it and test it. You should always test your backups, and this way you kill two birds with one stone. Also, why did you call it "a database"? There is no need to be cryptic; you can tell us which type of database it is. — foocorpluser, May 10 '12 at 16:42

score 3 · Answer 1 · answered May 10 '12 at 17:08

The file command examines only the beginning of a file to determine its content (for performance reasons). If the beginning of your file contains only ASCII characters, file reports the file as ASCII.

If the extracted file contains non-ASCII UTF-8 characters near the beginning (or starts with a BOM), file reports it as UTF-8, as in your second example.

See the man page of file for further information on magic numbers and file headers.
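To answer the "better way of checking" part, one option is to scan the whole file rather than only its beginning. The following is a minimal Python sketch, not something the file command itself provides; the function name classify_encoding and the chunk size are illustrative assumptions. It reads every byte and reports whether the content is pure ASCII, valid UTF-8 with at least one non-ASCII byte, or not valid UTF-8 at all.

```python
import codecs


def classify_encoding(path, chunk_size=1 << 20):
    """Scan the whole file and return 'ascii', 'utf-8', or 'not utf-8'.

    Hypothetical helper for illustration; it reads every byte of the file.
    """
    # An incremental decoder copes with multi-byte characters that
    # straddle a chunk boundary.
    decoder = codecs.getincrementaldecoder("utf-8")()
    has_non_ascii = False
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            try:
                # final=True on the empty last read flags a truncated sequence.
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError:
                return "not utf-8"
            if not chunk:
                break
            if not chunk.isascii():  # bytes.isascii needs Python 3.7+
                has_non_ascii = True
    return "utf-8" if has_non_ascii else "ascii"
```

Calling classify_encoding("extract.txt") on the full extract would then show whether any non-ASCII bytes occur anywhere in it. A result of "ascii" still means the file is valid UTF-8, since ASCII is a subset of UTF-8.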

score 1 · Answer 2 · answered May 10 '12 at 17:50

If you export a database that contains only English text and common control characters, and any binary data is encoded in the export (e.g. as Base64), the ASCII and UTF-8 output will be exactly the same, byte for byte, unless an explicit BOM is present.
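This equivalence can be confirmed byte for byte. Here is a minimal Python sketch, with sample strings chosen purely for illustration:

```python
import codecs

plain = "Hello, world"   # only ASCII characters
accented = "caf\u00e9"   # "café": one non-ASCII character

# ASCII-only text encodes to identical bytes under both encodings.
assert plain.encode("ascii") == plain.encode("utf-8")

# A non-ASCII character becomes a multi-byte sequence in UTF-8.
assert accented.encode("utf-8") == b"caf\xc3\xa9"

# The optional UTF-8 BOM is the three bytes EF BB BF at the start.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert plain.encode("utf-8-sig") == codecs.BOM_UTF8 + plain.encode("ascii")
```

So a tool that sees only ASCII bytes has no way to tell the two encodings apart, and reporting "ASCII" for such a file is not an error.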

PS: UTF-16 is a different animal, especially because it can look like perfectly normal ASCII text to some tools, look blank to others, and confuse yet others to no end. (I have seen some versions of Perl read and write it fine and then completely fail to match the text with regexes.)
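The reason shows up in the raw bytes: UTF-16 stores each basic Latin character as two bytes, one of which is NUL. A short Python sketch, using an illustrative sample string:

```python
text = "abc"

utf16 = text.encode("utf-16-le")  # little-endian, no BOM
# Every ASCII character is followed by a NUL byte.
assert utf16 == b"a\x00b\x00c\x00"

# Tools that skip NUL bytes display "abc", but a byte-oriented
# search for "abc" finds nothing.
assert b"abc" not in utf16
assert utf16.replace(b"\x00", b"") == b"abc"
```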
