Removing lines in a text document that end the same


We have a script that downloads a list of domains for our squid box to block, but we keep getting warnings like the following:

2015/03/02 17:08:47| WARNING: You should probably remove '.artnau.com' from the ACL named 'chat_domains'
2015/03/02 17:08:47| WARNING: '.artnau.com' is a subdomain of '.css.artnau.com'
2015/03/02 17:08:47| WARNING: because of this '.css.artnau.com' is ignored to keep splay tree searching predictable
2015/03/02 17:08:47| WARNING: You should probably remove '.artnau.com' from the ACL named 'chat_domains'
2015/03/02 17:08:47| WARNING: '.chatserve.com' is a subdomain of '.eagles.chatserve.com'
2015/03/02 17:08:47| WARNING: because of this '.eagles.chatserve.com' is ignored to keep splay tree searching predictable

Is there any way to go through the file, and remove the subdomains from existing domains in the list?

So: grab the first line, check whether any other lines in the file end with that text, and remove them?
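That suffix check can be sketched directly, for example with awk. This is only a sketch: domains.txt is a hypothetical filename, and it relies on every entry starting with a leading dot, so a suffix match always falls on a label boundary.

```shell
# Build a small sample list (hypothetical filename)
cat > domains.txt <<'EOF'
.artnau.com
.css.artnau.com
.chatserve.com
.eagles.chatserve.com
EOF

# Two passes over the same file: first remember every line, then print a
# line only if no other, shorter line in the list is a suffix of it.
awk 'NR==FNR { lines[NR] = $0; next }
     {
       keep = 1
       for (i in lines) {
         l = lines[i]
         if (l != $0 && length(l) < length($0) &&
             substr($0, length($0) - length(l) + 1) == l) {
           keep = 0
           break
         }
       }
       if (keep) print
     }' domains.txt domains.txt
```

On the sample above this prints only .artnau.com and .chatserve.com. The comparison is quadratic, so it would be slow on a very large list, but it shows the idea.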

Andrew

Posted 2015-03-03T02:07:35.057

Reputation: 45

Can you post part of the file, including the lines which are conflicting? – Tim – 2015-03-03T02:49:56.477

It's a 15k line text document that just lists domain names to be blocked by squid. Here's a link to the file: http://pastebin.com/6mLB2KfZ

– Andrew – 2015-03-03T03:03:43.640

Basically it would have a list like: .artnau.com .css.artnau.com .chatserve.com .eagles.chatserve.com

It would be great (for example) to take .chatserve.com from the list and remove any other lines in the text that end with .chatserve.com (but keep the original line) – Andrew – 2015-03-03T03:04:06.110

Answers


Here is how you would do it in Perl:

$ cat a.txt
.artnau.com
.bar.foo.example.org
.chatserve.com
.css.artnau.com
.eagles.chatserve.com
.example.com
.foo.example.org
$ cat a.txt | perl -ne 'BEGIN { my %h; } $h{$_} = ""; END { foreach (keys %h) { $orig = $_; $_ =~ s/^\..*?\./\./; print $orig if not exists $h{$_} } }' | sort > b.txt
$ cat b.txt 
.artnau.com
.chatserve.com
.example.com
.foo.example.org

The Perl one-liner loops through a.txt, adding every line as a key in a hash named %h. Once the whole file has been read, it goes through every key in the hash, strips the first label of the domain (everything from the leading period up to and including the next period, leaving a new leading period), and prints the original key if the resulting parent domain is not itself in the hash. The output is then piped through sort (you can probably guess what that does) and saved to b.txt.
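One limitation worth noting: the substitution s/^\..*?\./\./ strips only a single leading label, so a deeper subdomain such as .a.b.example.com survives when only .example.com is in the list. A variant of the same idea (sketched here in awk rather than Perl, with hypothetical sample data) keeps stripping labels until it either finds a listed parent or runs out of labels:

```shell
# Sample list; note .a.b.example.com sits two labels below .example.com
cat > list.txt <<'EOF'
.artnau.com
.a.b.example.com
.chatserve.com
.css.artnau.com
.eagles.chatserve.com
.example.com
.foo.example.org
EOF

# Remember every line, then for each line repeatedly strip the first
# label (".a.b.example.com" -> ".b.example.com" -> ".example.com") and
# drop the line as soon as one of those parents appears in the list.
awk '{ h[$0] = 1; l[NR] = $0 }
     END {
       for (i = 1; i <= NR; i++) {
         p = l[i]; keep = 1
         while (sub(/^\.[^.]*\./, ".", p))
           if (p in h) { keep = 0; break }
         if (keep) print l[i]
       }
     }' list.txt | sort
```

Here .a.b.example.com is removed along with the single-level subdomains, which the original one-liner would have kept.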

Tim

Posted 2015-03-03T02:07:35.057

Reputation: 1 375

Please excuse my ugly Perl, I'm sure there's a more elegant solution. – Tim – 2015-03-03T04:16:21.557

The useless cat is more offensive.

– tripleee – 2015-03-03T05:47:08.313


@tripleee So you're more of a dog person?

– Tim – 2015-03-03T19:08:59.397

Thanks for that... is there any way of doing the action on the file itself (without creating a new file)? – Andrew – 2015-03-10T05:55:20.170

Also, I need to run that command on every file called domains in a directory tree... I tried find BL -name "domains" -print0 | xargs -0 -I file cat file | perl -ne 'BEGIN { my %h; } $h{$_} = ""; END { foreach (keys %h) { $orig = $_; $_ =~ s/^\..*?\./\./; print $orig if not exists $h{$_} } }' | sort > file2 but it gave me one line in a file called file2. – Andrew – 2015-03-11T04:15:07.510
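Part of what goes wrong in that command is that the shell applies the > file2 redirection once, to the output of the whole pipeline, so every domains file in the tree is merged into a single file2 rather than processed separately. A sketch of a per-file loop (with sort -u as a stand-in for the real one-liner, and a throwaway sample tree named BL as in the comment; the temp-file-and-rename also edits each file "in place"):

```shell
# Throwaway sample tree for illustration (the real tree is BL/ above)
mkdir -p BL/cat1 BL/cat2
printf '.b.com\n.a.com\n.a.com\n' > BL/cat1/domains
printf '.d.org\n.c.org\n.c.org\n' > BL/cat2/domains

# Process every file named "domains" one at a time. "sort -u" is only a
# stand-in for the real filter; writing to a temp file and renaming it
# over the original leaves each file updated in place.
find BL -name domains | while IFS= read -r f; do
  sort -u "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```

This assumes no filenames contain newlines; otherwise find -exec with a small script is the safer form.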