Consequences of removing NUL characters from a text file?

I have quite a large text file (around 20GB) that I use as a simple database: each record is separated by a newline, and breaking this format will cause problems. The file also happens to contain some NUL characters, or at least that's what I suspect, since grep treats it as a binary file.

I've come across this question and answer, which states:

Some reading has indicated that grep looks for a null character in the first thousand or so bytes, then determines from that whether or not a file is 'binary'.
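
A quick test seems to bear this out (GNU grep; nul-demo.txt is just a throwaway example file):

printf 'hello\000world\n' > nul-demo.txt    # write a line containing a NUL byte
grep world nul-demo.txt

The second command prints "Binary file nul-demo.txt matches" rather than the matching line.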

For this reason, I'm thinking of stripping these characters from the file with something like:

tr -d '\000' < file-with-nulls > file-without-nulls

But I want to be sure that doing so won't break the formatting of the file. Is this a possibility at all?
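
For what it's worth, here is the kind of cheap sanity check I could run afterwards (a sketch, assuming GNU coreutils; since NUL is not a newline byte, tr -d '\000' should never touch a line break):

wc -l file-with-nulls file-without-nulls    # line counts should be identical
tr -dc '\000' < file-with-nulls | wc -c    # number of NUL bytes in the original
stat -c %s file-with-nulls file-without-nulls    # size difference should equal that NUL count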

Hashim

What about just doing it and saving to a new file? Then look at whether the new file works properly. Text files don't typically contain nulls, so we have no idea what function they might be serving. – fixer1234 – 2018-09-19T03:02:56.867

Are the null characters coming from mixing UTF-16 with UTF-8? UTF-16 text contains nulls. – matzeri – 2018-09-19T06:48:47.210

This answer says an encoding error may be involved. – Kamil Maciorowski – 2018-09-19T06:56:04.230

@KamilMaciorowski If it is the encoding error described in that answer, which seems to me the less likely case, is stripping the file of NUL characters likely to harm it? I assume the two are more or less independent, so that even if the encoding error does exist, stripping the NUL characters is theoretically unlikely to cause any more harm. – Hashim – 2018-09-19T19:01:37.990

@matzeri I didn't create the files; they were sourced from the internet and have likely been edited by tens of people, acquiring all manner of artifacts in the process. It could well be the case you describe, I just have no idea. – Hashim – 2018-09-19T19:01:51.773

@fixer1234 Because it would be impossible to definitively determine whether the formatting of a 20GB text file database with 1.2B lines and more than 1800 NUL characters is actually broken or not. Failure here would more than likely be silent, so that single records would simply merge into previous ones, and grepping for those records would give the misleading impression that no matches for them could be found. This is why I have to be sure that what I'm planning is theoretically sound, because a file this size can't be scrubbed through after the fact to check everything is in order. – Hashim – 2018-09-19T19:02:55.437

@fixer1234 Regarding the function of the NULs in a newline-delimited text file: it's true such a file wouldn't usually contain NULs, but to put my question another way, are they ever really needed in one? Are there any possible (theoretical, of course) ways that a newline-delimited text file would ever need to rely on NULs for its formatting, or can they safely be considered artifacts that can be stripped without affecting the position of newlines? – Hashim – 2018-09-19T19:07:10.557

What is the output of `file your-file-name`? – matzeri – 2018-09-19T19:07:21.787

@matzeri "data" – Hashim – 2018-09-19T19:08:20.630

"data" could mean that different text encodings were mixed together. If the NULs are coming from normal ASCII characters in UTF-16 representation, you should see alternating NUL/ASCII pairs. In that case, removing the NULs just converts the UTF-16 to ASCII. However, if you have UTF-16 code points that exceed the ASCII range, removing the NULs will just leave the other characters in the wrong encoding. – matzeri – 2018-09-19T19:29:28.863
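
(A quick way to check for that alternating pattern, as a rough sketch assuming GNU coreutils and substituting the real file name:

od -c file-with-nulls | head -n 5

ASCII text stored as UTF-16LE shows up in od -c as alternating pairs like H \0 e \0 l \0 l \0 o \0.)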

If it's plain ASCII text, it shouldn't contain anything but text characters and LF or CR/LF; there is no formatting other than line breaks. If the content contained an extended character set, I don't think stripping out nulls would change anything, as matzeri already suggested (the NUL doesn't alter the character next to it). However, you don't need to verify the effect of every last null. If they came from something like UTF-16, all of them will have the same effect when removed. Find the location of a few examples and verify those after cleanup (see the spot-check sketch below). (cont'd) – fixer1234 – 2018-09-19T20:23:43.353

That said, if it is absolutely critical that you not accidentally modify the data, don't remove the nulls. They apparently aren't a source of problems, and 1800 in a 20GB file won't make a real difference. – fixer1234 – 2018-09-19T20:23:49.453
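
A sketch of that spot-check (assuming GNU grep built with PCRE support, plus GNU coreutils; OFFSET is a placeholder for one of the byte offsets the first command reports):

grep -aPbo '\x00' file-with-nulls | head -n 3    # byte offsets of the first few NULs
dd if=file-with-nulls bs=1 skip=$((OFFSET - 40)) count=80 2>/dev/null | od -c    # OFFSET: pick one offset printed above; shows ~80 bytes of context around that NUL

Running the same dd window against the cleaned file (adjusting for any bytes removed before that offset) should show the identical text with the \0 gone.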

Answers

Your best bet is to not remove the NULs: they are most likely a core part of the file, and stripping them risks corrupting or even completely breaking it.

biddy smith
