Verify that a file is ASCII with the file command in a shell script

5

I need to use the file command to check many files and determine whether each one is ASCII or some other format.

Sometimes I get from file command:

  file1: ASCII English text

And sometimes I get a different answer from the file command:

  file2: Non-ISO extended-ASCII English text, with very long lines

I am really not sure if there are other answers with different syntax

My question is:

I wrote the following ksh syntax to check whether a file is ASCII, but I am not sure it is the best way to verify the ASCII format:

   [[ `file "$some_file" | grep -c ASCII` = 1 ]] && print "you have ascii file for sure"

If someone has another suggestion for reliably verifying the ASCII format, I would be very glad to see it.

jennifer

Posted 2010-10-26T21:56:50.713

Reputation: 897

ASCII? In the days of internet and Unicode? You must be joking. – user1686 – 2010-10-26T22:23:15.317

You do realize that file is a heuristic guess and not a guarantee, right? yes | head -c $((2**20)) > blah; dd if=/dev/urandom bs=1 count=1024 >> blah; file blah says blah: ASCII text even though it's not. – ephemient – 2010-10-27T19:07:24.970

Yes, I understand, but what should I do if I want to select files by type? What is the best approach? Any ideas? – jennifer – 2010-10-27T20:21:20.020

Answers

8

if LC_ALL=C grep -q '[^[:print:][:space:]]' file; then
    echo "file contains non-ascii characters"
else
    echo "file contains ascii characters only"
fi
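
As a sketch, the same test can be wrapped in a small function so it can be applied to many files. The name is_ascii is hypothetical and not part of the answer above. The sketch assumes a grep that supports POSIX character classes and -q. It treats only grep's exit status 1 (no match) as "ASCII only", so a read error is not mistaken for a clean file.

```shell
# is_ascii FILE: succeed if FILE holds only printable ASCII and whitespace.
# (is_ascii is a hypothetical helper name used for illustration.)
is_ascii() {
    # LC_ALL=C restricts [:print:] and [:space:] to their ASCII meaning.
    LC_ALL=C grep -q '[^[:print:][:space:]]' "$1"
    # grep exits 0 on a match, 1 on no match, 2 on error;
    # only "no match" means the file is ASCII.
    [ $? -eq 1 ]
}

# Report on every file named on the command line.
for f in "$@"; do
    if is_ascii "$f"; then
        echo "$f: ASCII only"
    else
        echo "$f: non-ASCII bytes found (or file unreadable)"
    fi
done
```

Saved as a script, this could be run as sh check_ascii.sh /etc/hosts /etc/passwd, where check_ascii.sh is an assumed file name.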

ephemient

Posted 2010-10-26T21:56:50.713

Reputation: 20 750

Hi ephemient, please explain why LC_ALL=C comes before the grep command. – jennifer – 2010-10-26T22:43:53.830

LC_ALL=C forces grep to treat [[:print:]] as the "printable ASCII" character class. Otherwise it means "printable <whatever your current locale is>", which may be non-ASCII. For example, most Linux boxes are set up with UTF-8 locales, in which case [[:print:]] would match non-ASCII character sequences that are valid UTF-8 printable characters. – ephemient – 2010-10-26T22:49:18.820

@jennifer: name=value command is the syntax for temporarily setting an environment variable, in this case LC_ALL, for a single command. Setting locale to C makes sure [[:print:]] only matches ASCII characters (and not accented characters from your language). – user1686 – 2010-10-26T22:50:35.750
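
A minimal illustration of the name=value command form, assuming any POSIX shell: the child process sees the assignment, while the calling shell's own LC_ALL stays as it was.

```shell
# The child sh sees LC_ALL=C because the assignment prefixes the command.
LC_ALL=C sh -c 'echo "inside the command: LC_ALL=$LC_ALL"'

# The calling shell keeps whatever value (if any) it had before.
echo "in the calling shell: LC_ALL=${LC_ALL:-unset-or-empty}"
```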

Why do I get "file contains non-ascii characters" for /etc/hosts? As you know, the hosts file is an ASCII file. – jennifer – 2010-10-26T22:51:33.380

@jennifer: Fixed. Probably included a tab or something like that; I forgot [[:print:]] is [[:graph:] ] not [[:graph:][:space:]]. – ephemient – 2010-10-26T22:55:42.013

@ephemient Hi, I checked some files and found that your code returns "file contains non-ascii characters" for files where the file command reports: Non-ISO extended-ASCII English text, with very long lines. How can I support this? – jennifer – 2010-10-27T08:41:30.507

@ephemient Hi again. Following up on my previous remark: from my point of view, a file reported as "Non-ISO extended-ASCII English text" is also an ASCII file. Could you help me update your code to support this? – jennifer – 2010-10-27T10:03:06.850

@jennifer: "Non-ISO extended-ASCII" is not any specific encoding at all. I don't understand what you want to happen – it's clearly not ASCII. Note that there are many different ISO-8859-* and non-ISO variants of extended ASCII character sets that file does not differentiate between, and any attempt to determine the character set is (at best) a guess. – ephemient – 2010-10-27T13:19:20.113

@ephemient Hi, but in vi it looks like an ordinary text file with text and remarks. What's the difference? – jennifer – 2010-10-27T13:24:42.783

I really don't understand -:( Why do I get Non-ISO extended-ASCII for an ordinary text file with simple text lines? Maybe it is a bug in the file command? – jennifer – 2010-10-27T13:26:32.550

@jennifer: What does perl -ne 'END {print join($", sort {$a <=> $b} keys %c), $/} undef @c{map ord, split //}' say for this file? Any values (other than 9, 10, or 13) below 32 or above 126? What is the output of locale? – ephemient – 2010-10-27T13:31:36.797

hi @ephemient the output:9 10 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 91 93 94 95 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 124 126 147 148 – jennifer – 2010-10-27T13:41:53.570

perl -ne 'END {print join($", sort {$a <=> $b} keys %c), $/} undef @c{map ord, split //}' file_test 9 10 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 91 93 94 95 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 124 126 147 148 – jennifer – 2010-10-27T13:42:40.557

@ephemient Hi, do you have any conclusions about the output I sent? – jennifer – 2010-10-27T13:59:43.340

@jennifer: Yes, the 147 and 148 indicate that it's NOT ASCII. What is your locale? – ephemient – 2010-10-27T14:14:44.770
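
A shell-only way to list the offending byte values, as a sketch alongside the perl one-liner above. The helper name show_non_ascii is hypothetical. It deletes tab, LF, CR and printable ASCII (32 to 126) and prints whatever is left as decimal numbers. One plausible reading of 147 and 148, offered as an assumption because the file's real encoding is unknown, is that they are the curly double quotes of Windows-1252.

```shell
# show_non_ascii FILE: print the decimal value of every byte in FILE that is
# not tab (011), LF (012), CR (015) or printable ASCII (040-176 octal).
# (show_non_ascii is a hypothetical helper name.)
show_non_ascii() {
    LC_ALL=C tr -d '\11\12\15\40-\176' < "$1" | od -An -tu1
}

# Example call; "$1" is whatever file name the script was given.
if [ -n "$1" ]; then
    show_non_ascii "$1"
fi
```

An empty result means every byte in the file is within the allowed ASCII set.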

Very strange. This is a configuration file with parameters and values. My goal is to find all configuration files and, if they are ASCII files, update them with sed. – jennifer – 2010-10-27T14:20:39.710

So maybe I need to define a Non-ISO extended-ASCII configuration file as ASCII as well. – jennifer – 2010-10-27T14:21:54.517

@ephemient What do you think about the following: maybe I use the simple file command to check for ASCII or Non-ISO extended-ASCII and then edit those files?

[[ `file "$some_file" | grep -c ASCII` = 1 ]] || [[ `file "$some_file" | grep -c "Non-ISO extended-ASCII"` = 1 ]] && print "you have ascii file for sure" – jennifer – 2010-10-27T14:32:48.297

@ephemient Hi, do you agree with my solution? – jennifer – 2010-10-27T15:00:29.410

@jennifer: You haven't answered. What is your locale? In any case, if you are trying to detect whether you should sed a file or not by a skim of its contents, I believe your approach is fundamentally flawed. No, I do not agree at all. – ephemient – 2010-10-27T19:00:44.803

Sorry, but I don't understand what you mean by "locale". Do you mean whether the machine is Linux or Solaris? If so, my machine is a Linux machine. – jennifer – 2010-10-27T20:07:09.890

About what you said, that my approach is not very good: OK, but what is the other option? Do you have another idea? – jennifer – 2010-10-27T20:09:32.513

As I said, my goal is to edit files. Some files are binary, some are configuration files, and some are applications. I only need to edit the configuration files (ASCII). Do you have another suggestion? – jennifer – 2010-10-27T20:11:27.427

What is the output of the locale command? § If the list of files were purely advisory, then a heuristic seems okay, but if you're actually going to be mangling them, it would be better to keep a registry of which files are to be affected. Linux package managers like dpkg and rpm keep track of configuration files; you can tie into their system, or build your own. – ephemient – 2010-10-27T20:13:13.977
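
A minimal sketch of the "build your own" registry idea, assuming a plain-text manifest with one path per line. The manifest format and the function name edit_listed are hypothetical. Only files named in the manifest are touched, so nothing depends on guessing a file's type from its contents.

```shell
# edit_listed MANIFEST SED_SCRIPT: apply SED_SCRIPT to each file listed
# (one path per line) in MANIFEST. Blank lines and lines starting with #
# are ignored. (Function name and manifest format are assumptions.)
edit_listed() {
    manifest=$1
    script=$2
    while IFS= read -r path; do
        case $path in ''|'#'*) continue ;; esac
        tmp=$(mktemp) || return 1
        # Write to a temporary file first, so a failing sed
        # leaves the original file untouched.
        if sed "$script" "$path" > "$tmp"; then
            cat "$tmp" > "$path"   # overwrite in place, keeping permissions
        fi
        rm -f "$tmp"
    done < "$manifest"
}
```

A call might look like edit_listed /etc/myapp/config-files.list 's/old/new/', where the manifest path is made up for the example.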

the output: LANG=en_US LC_CTYPE="en_US" LC_NUMERIC="en_US" LC_TIME="en_US" LC_COLLATE="en_US" LC_MONETARY="en_US" LC_MESSAGES="en_US" LC_PAPER="en_US" LC_NAME="en_US" LC_ADDRESS="en_US" LC_TELEPHONE="en_US" LC_MEASUREMENT="en_US" LC_IDENTIFICATION="en_US" LC_ALL= – jennifer – 2010-10-27T20:26:09.723

and on my solaris machine: locale LANG=C LC_CTYPE="C" LC_NUMERIC="C" LC_TIME="C" LC_COLLATE="C" LC_MONETARY="C" LC_MESSAGES="C" LC_ALL= – jennifer – 2010-10-27T20:27:37.260

@jennifer: Take the LC_ALL=C part out and the command I gave will recognize those files on your Linux machine, but keep in mind that there will be unavoidable false positives, e.g. you can't tell the difference between ISO-8859-1 and ISO-8859-15 encodings, never mind foreign encodings like SJIS or GBK. – ephemient – 2010-10-27T20:38:30.740

So the final syntax to verify that the file is ASCII is: if grep -q '[^[:print:][:space:]]' file; then... (Am I right?) – jennifer – 2010-10-27T20:41:02.383

@ephemient Please give your final opinion on my last remark -:) – jennifer – 2010-10-27T21:24:12.327

To summarize: you are saying that the (grep -q '[^[:print:][:space:]]' file) syntax is safer than using the file command to match the ASCII string. Am I right? – jennifer – 2010-10-27T21:26:08.223

@jennifer: file doesn't actually look at the whole file; it looks at the beginning, maybe looks at the end, and makes a guess. This actually looks at the whole file, so I believe that this method is safer. However, just checking whether the contents of a file are consistent with a particular encoding (ASCII or otherwise) is still pretty meaningless on its own. – ephemient – 2010-10-27T21:28:54.123

OK, I will use your syntax (grep -q '[^[:print:][:space:]]' file) in my code. I hope everything will be OK. – jennifer – 2010-10-27T21:31:33.843

1

How about...

if file -ib "$file" | grep -Eqs '^text/plain(;|$)'; then
    echo "It's text/plain."
fi

I don't know how common --mime-type is; if it's available, use

if file -b --mime-type "$file" | grep -qs '^text/plain$'; then

Alternatively grep -qs '^text/' for any text type.
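
If the goal is ASCII specifically rather than any plain text, the charset that file -i prints after the MIME type can be checked as well. This is a sketch assuming a file whose -i output looks like "text/plain; charset=us-ascii", as on typical Linux systems; the function name is_ascii_by_file is hypothetical. It remains a guess by file, not a guarantee.

```shell
# is_ascii_by_file FILE: succeed if file(1) guesses a us-ascii charset.
# (is_ascii_by_file is a hypothetical helper name.)
is_ascii_by_file() {
    case $(file -bi "$1") in
        *charset=us-ascii*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example call; "$1" is whatever file name the script was given.
if [ -n "$1" ] && is_ascii_by_file "$1"; then
    echo "$1: file reports charset=us-ascii"
fi
```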

user1686

Posted 2010-10-26T21:56:50.713

Reputation: 283 655

0

Since you're parsing the output with code, I'd suggest using the -i option on file so it outputs MIME types instead of human-friendly strings. The MIME type output is more regular, which makes it a little easier to deal with in code.

As for the output types, a look at man file shows:

/usr/share/file/magic
    Default list of magic numbers

/usr/share/file/magic.mime
    Default list of magic numbers, used to output  mime types
    when the -i option is specified.

Take a look at those files for all the MIME types it can report to determine which types you'll care about when parsing the output from file. I suspect all you'll care about is that the MIME type starts with text/.
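
A sketch of that text/ prefix test applied to a list of files, assuming file -bi prints something like "text/plain; charset=us-ascii". The function name list_text_files is hypothetical.

```shell
# list_text_files FILE...: print the names of the files whose MIME type,
# as guessed by file -i, starts with text/.
# (list_text_files is a hypothetical helper name.)
list_text_files() {
    for f in "$@"; do
        mime=$(file -bi "$f")      # e.g. "text/plain; charset=us-ascii"
        case ${mime%%;*} in        # drop the "; charset=..." part
            text/*) printf '%s\n' "$f" ;;
        esac
    done
}

# Filter whatever file names the script was given.
list_text_files "$@"
```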

Ian C.

Posted 2010-10-26T21:56:50.713

Reputation: 5 383