Using grep on URLs not working

I've got a couple of text files (a.txt and b.txt) containing a bunch of URLs, each on a separate line. Think of these files as blacklists. I want to sanitize my c.txt file, scrubbing it of any of the strings in a.txt and b.txt. My approach is to rename c.txt to c_old.txt, and then build a new c.txt by grepping out the strings in a.txt and b.txt.

type c_old.txt | grep -f a.txt -v | grep -f b.txt -v > c.txt
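The whole batch file is essentially just that pipeline with a rename in front of it; roughly:

rem Rotate the current list, then rebuild it minus the blacklisted lines.
ren c.txt c_old.txt
type c_old.txt | grep -f a.txt -v | grep -f b.txt -v > c.txt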

For a long while, it seemed like my system was working just fine. However, lately, I've lost nearly everything that was in c.txt, and new additions are being removed despite not occurring in a.txt or b.txt. I have no idea why.

P.S. I'm on Windows 7, so grep has been installed separately. I'd appreciate it if there are solutions that don't require me to install additional Linux tools.


Update: I've discovered one mistake in my batch file. I used ren c.txt c_old.txt without realising that ren refuses to overwrite the target file if it exists. Thus, the type c_old.txt | ... always used the same data. This explains why new additions to c.txt were being wiped out, but it does not explain why so many entries that were in c.txt have gone missing.
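A quick way to reproduce the behaviour in isolation (the demo file names are just for illustration):

echo old> demo_old.txt
echo new> demo.txt
rem ren refuses to overwrite: this errors out and renames nothing
ren demo.txt demo_old.txt
rem still prints "old", because demo.txt was never renamed
type demo_old.txt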

gibson

Posted 2014-07-31T17:46:31.313

Reputation: 167

A single > causes the text file to be overwritten each time; you would use >> to append to an existing file. – Ƭᴇcʜιᴇ007 – 2014-07-31T18:15:03.500

If you'd broken that down (e.g. try echo sdfsd > c.txt), you'd have seen that > overwrites, and that it's thus not a grep problem. As techie said, use >>. – barlop – 2014-07-31T19:09:36.737

The > is intentional. Appending would never remove any entries from c.txt, thus failing to eliminate entries in c.txt that also exist in a.txt or b.txt. – gibson – 2014-07-31T19:31:38.763
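For anyone following along, the difference between the two operators is easy to demonstrate in cmd.exe (t.txt is a throwaway file):

echo one> t.txt
rem > truncates first, so t.txt now contains only "two"
echo two> t.txt
rem >> appends, so t.txt now contains "two" followed by "three"
echo three>> t.txt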

Thoughts: (1) Try grep -f a.txt -v c_old.txt | grep -f b.txt -v > c.txt, because type … | looks like cat … |. (2) Try grep -f a.txt -f b.txt -v c_old.txt > c.txt. (Neither of these should make a difference in the result, but they are stylistically simpler.) …

– Scott – 2014-07-31T21:19:34.187

… Then (3) Try adding -F (--fixed-strings) in case you’re getting any weird results where a . in a.txt or b.txt matches some other character in c.txt. (4) Check a.txt and b.txt to verify that neither of them has acquired a very short line (not a full URL) that’s matching lots of things. (5) Try to find a URL that’s getting stripped out, and find the line in a.txt or b.txt that’s causing that to happen. – Scott – 2014-07-31T21:20:18.270
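Combining suggestions (2) and (3), the rebuild step would look like this (same file names as in the question; GNU grep accepts multiple -f options):

rem -F treats each blacklist line as a literal string, so dots in URLs
rem no longer match arbitrary characters; -v keeps only non-matching lines
grep -F -v -f a.txt -f b.txt c_old.txt > c.txt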

Perhaps to help track down the problem: say you run that once a day? Then include in the bat file a pause before and after, and a check of the file size of c.txt. You can check it by eye to make sure you haven't lost stuff unexpectedly, and if so, catch it near the time it happens. – barlop – 2014-07-31T23:09:09.270
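A sketch of that idea inside a .bat file (the rebuild itself is unchanged; the %~z modifier expands to a file's size in bytes):

for %%F in (c.txt) do echo c.txt before: %%~zF bytes
pause
rem ... existing rename + grep pipeline goes here ...
for %%F in (c.txt) do echo c.txt after: %%~zF bytes
pause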

Answers

Well, I don't really have much data to go on, since there haven't been many new additions to a.txt and b.txt since I originally asked the question, but since fixing the ren issue (I replaced it with move /Y), things have been working smoothly.
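For reference, only the rotation step changed; the grep pipeline is as before:

rem move /Y forces the overwrite, so an existing c_old.txt no longer blocks the rotation
move /Y c.txt c_old.txt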

I'm still not sure how the initial data loss happened; it may be that I messed up at some point while editing the scripts and didn't do my test runs in a safe environment.

gibson

Posted 2014-07-31T17:46:31.313

Reputation: 167