I'm having trouble with a text file being marked as a binary

3

I have an executable that generates a text file as its output. The problem is that the text file comes out with a binary file flag of some sort. The result is something like this:

$ grep "grep string" output_file.txt
Binary file output_file.txt matches.

$ grep -a "grep string" output_file.txt
[correct results]

Some reading has indicated that grep looks for a null character in the first thousand or so bytes, then determines from that whether or not a file is 'binary', so my question is two-fold:

  1. Is there an easy way to strip null characters from my files (I can do this as part of my post-processing) to ensure that grep works correctly without the -a flag?

  2. Is there something obvious I should look for in my code to prevent null characters from being written to the file? I've looked through the code quite thoroughly and I don't see any obvious culprits.


brightwellcd

Posted 2011-08-18T16:02:31.583

Reputation: 95

Answers

5

I can answer at least the first question. If you're on Unix/Linux, you can use tr:

tr -d '\000' < filein > fileout

where \000 is the octal escape for the null character. You can also strip all non-printable characters, as shown in the examples in "Unix Text Editing: sed, tr, cut, od, awk".
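To check that the tr approach did what you expect, you can count the null bytes before and after. This is only a sketch: the file names sample_with_nuls.txt and sample_clean.txt are made up for the example, and the sample file is generated with printf so the snippet runs on its own.

```shell
# Build a small sample file with two embedded NUL bytes (hypothetical file name).
printf 'grep\000 string\000\n' > sample_with_nuls.txt

# Count NUL bytes: -c complements the set, so -cd deletes everything except NUL.
nul_before=$(tr -cd '\000' < sample_with_nuls.txt | wc -c)

# Strip the NUL bytes into a new file, as in the answer above.
tr -d '\000' < sample_with_nuls.txt > sample_clean.txt

# Count again on the cleaned file.
nul_after=$(tr -cd '\000' < sample_clean.txt | wc -c)

echo "NUL bytes before: $((nul_before)), after: $((nul_after))"
```

If the "after" count is zero, grep should no longer need -a for that file, provided null bytes were the only thing triggering the binary detection.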

Regarding your second question, I don't know which programming language you're using, but I'd look for uninitialized variables that could end up being printed to the output file.
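One way to narrow down where the nulls come from is to find where they sit in the output. A rough sketch using od, assuming a made-up sample file sample_output.txt that stands in for your program's output:

```shell
# Hypothetical sample: a field followed by three stray NUL bytes.
printf 'value=\000\000\000\nnext line\n' > sample_output.txt

# od -c prints each NUL as \0; -A d shows decimal byte offsets in the first column.
# grep -F keeps only the dump lines that contain a literal \0.
nul_lines=$(od -A d -c sample_output.txt | grep -F '\0')

echo "$nul_lines"
```

The offsets in the first column tell you which part of the file to look at, and the surrounding text usually points to the write call responsible.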

DrNoone

Posted 2011-08-18T16:02:31.583

Reputation: 1 267

I would vote this up if I could, but I'm apparently too new. :-/ – brightwellcd – 2011-08-18T16:53:18.983

I found a null string in my output. I ran this tr script and did a visual diff; quickly found the problem. I'll upvote this if/when I get enough reputation to do so. Thanks. – brightwellcd – 2011-08-18T17:58:08.740

4

I'm going to make a guess....

Your program writes the file in UTF-16, an encoding of Unicode that uses two bytes for each character. Every second byte is, most of the time, a null.

iconv -f utf-16 -t utf-8 < filein > fileout

will convert it to UTF-8, which most coreutils are comfortable with.
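Here is a small self-contained run of that conversion, as a sketch only. The file names are made up, and the sample is written as UTF-16LE without a byte order mark, so the example names the byte order explicitly (UTF-16LE) rather than relying on how a particular iconv build treats plain "utf-16" when no BOM is present.

```shell
# Hypothetical sample: "hi" plus a newline, encoded as UTF-16LE (no BOM).
printf 'h\000i\000\n\000' > sample_utf16le.txt

# Convert to UTF-8; the byte order is stated explicitly because there is no BOM.
iconv -f UTF-16LE -t UTF-8 < sample_utf16le.txt > sample_utf8.txt

# The converted file should contain no NUL bytes.
nul_left=$(tr -cd '\000' < sample_utf8.txt | wc -c)
echo "NUL bytes after conversion: $((nul_left))"

# grep should now treat the file as text.
grep "hi" sample_utf8.txt
```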

user1686

Posted 2011-08-18T16:02:31.583

Reputation: 283 655

Interesting, and I didn't know this about UTF-16. A question about this command - what exactly does the conversion remove from or do to the file? How will it behave in the use case of my question here?

– Hashim – 2018-10-10T21:37:46.477

@Hashim: It doesn't quite remove anything; it reads values in one representation and writes the same values out in another. (Much like converting between hex and octal, or between PNG and BMP.) UTF-16 represents each codepoint value as a fixed-length two-byte code (or a pair of two such codes), which naturally has to be padded with a 0x00 byte if the value is below 256, while UTF-8 represents the same value as a variable-length code which doesn't require null-padding. How it'll behave with your file depends on whether your file is UTF-16 to begin with. – user1686 – 2018-10-10T21:56:04.040

@Hashim Is there a way of determining whether a file is UTF-16? Doing file myfile.txt simply shows the file as data. – Hashim – 2018-10-10T21:58:51.270

If it's text and looks like text in your text editor, look at what encoding the editor has detected. Try to perform the conversion, and check if the result still looks like text in your text editor. Or do a hexdump of your file: if you see that "every second byte" is 0x00, that almost always means UTF-16. – user1686 – 2018-10-11T04:38:05.400
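For the hexdump check you don't need to read the whole file. A sketch with od, limited to the first 16 bytes via -N; the sample file name is made up and the sample is generated here so the snippet is runnable:

```shell
# Hypothetical sample: "hi" plus a newline in UTF-16LE.
printf 'h\000i\000\n\000' > sample_dump_utf16le.txt

# Dump only the first 16 bytes as hex, one byte per column.
# -N limits how much od reads, so this stays cheap on very large files.
dump=$(od -A d -t x1 -N 16 sample_dump_utf16le.txt)

echo "$dump"
```

For this sample every second byte in the dump is 00, which is the pattern described above.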

Unfortunately opening the file in an editor is out of the question as the files I'm working on are too large - all more than 10GB. If there are no NUL bytes in the second column of a hexdump is it safe to conclude that the file is definitely not UTF-16? – Hashim – 2018-10-11T18:05:48.817

It may be first or second column (depending on UTF-16LE or UTF-16BE). For mostly-Latin1 text it will be there consistently. For other scripts – Greek, Hangul, etc. – it won't be there because the values are ≥256 and both bytes are fully in use. So it's a possible test but not a definite conclusion. (Note that the conclusion might also be that it's not a plain text file in the first place!) – user1686 – 2018-10-12T03:21:28.923
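For files too large to open, the same idea can be turned into a rough count on a sample from the start of the file. This is a heuristic sketch, not a definitive test: the 4096-byte sample size, the 40% threshold, and the file name are all arbitrary choices for the example, and text in scripts with values of 256 or above will not trip it, as noted above.

```shell
# Hypothetical sample: "hello" plus a newline in UTF-16LE.
printf 'h\000e\000l\000l\000o\000\n\000' > sample_guess.txt

sample_size=4096   # arbitrary; only this many bytes are read from the file

# Number of bytes actually sampled, and how many of them are NUL.
total=$(head -c "$sample_size" sample_guess.txt | wc -c)
nuls=$(head -c "$sample_size" sample_guess.txt | tr -cd '\000' | wc -c)

# Arbitrary threshold: at least 40% NUL bytes suggests mostly-Latin text in UTF-16.
if [ "$((total))" -gt 0 ] && [ "$((nuls * 10))" -ge "$((total * 4))" ]; then
    verdict="possibly UTF-16"
else
    verdict="no strong sign of UTF-16"
fi

echo "sampled $((total)) bytes, $((nuls)) NUL: $verdict"
```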