Counting by number of occurrences in first column

0

I have an input file like this:

ATTACK-RESPONSES id check returned root
BACKDOOR ACKcmdC trojan scan
BACKDOOR hack-a-tack attempt 
BACKDOOR WinCrash 1.0 Server Active
ICMP Destination Unreachable Port Unreachable
ICMP Destination Unreachable Port Unreachable
ICMP Destination Unreachable Port Unreachable
SNMP trap tcp

Output:

1 ATTACK-RESPONSES id check returned root
3 BACKDOOR
3 ICMP Destination Unreachable Port Unreachable 
1 SNMP trap tcp 

I want to find and match the longest common substring in each line of the text and return the number of repetitions for each of them – so from input I'd need to get Output.

How could I do this?

Arash

Posted 2013-05-16T18:36:49.633

Reputation: 678

Please give a more specific example of what your expected output is. I don't understand what "the longest common substring in each line" is. (Please capitalize the pronoun "I" in your posts.) What actual problem do you need to solve? It seems like you want to parse your first file for something specific? What would that be? – slhck – 2013-05-16T19:09:18.703

"longest common substring" limited to complete words I guess. – Hauke Laging – 2013-05-16T19:11:17.073

By longest substring I mean the longest substring that different lines have in common. For example, ICMP Destination Unreachable Port Unreachable is the longest string that repeats across lines. – Arash – 2013-05-16T19:15:15.703

In your second output that line only occurs once. Please describe your actual problem (or at least show the expected output). This seems like a very contrived thing you're asking for. – slhck – 2013-05-16T19:17:11.247

Output 2 is extracted from output 1, and that is the result I want to reach. – Arash – 2013-05-16T19:20:43.887

That's not what your question said – or at least from what I've understood. I tried to clarify that part, hope it makes sense still? – slhck – 2013-05-16T19:22:17.777

@slhck: I think what he’s trying to say is “take all the lines that begin with the same word, and then find the longest string (or perhaps the longest string of words) that is an initial substring of all of those lines.” (And also count that group of lines that begin with the same word.) – Scott – 2013-05-16T19:25:28.200

I guess the task is this: (a) Strip off the irrelevant parts of the lines. (b) Sort the lines. (c) Repeat for every line: Is the line identical to the following line? If not: Leave out the last word and compare again. If they match: Compare the following lines to this string and take all matching lines out of the set. The sorting is non-trivial though: Shorter lines must be later than lines whose beginning matches the shorter line. – Hauke Laging – 2013-05-16T19:48:53.887

I think the easiest way to solve this, still, would be to find out what the actual issue is. Why would one need the count of said lines? – slhck – 2013-05-16T20:38:29.947

Answers

1

This is rather difficult with a single pass, and even more difficult if you don't assume that the start needs to be the same.

You could write a Perl script that matches regular expressions against previous lines, sort of like this:

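# Find the longest run of leading words that $current_line shares with $previous_line.
my @words_on_line = split ' ', $current_line;
my $substring = '';
for my $i (0 .. $#words_on_line) {
  # An array slice needs the @ sigil: @words_on_line[0 .. $i]
  my $expression = join(' ', @words_on_line[0 .. $i]);
  # \Q...\E quotes regex metacharacters; (?!\S) only accepts whole words.
  last unless $previous_line =~ m/^\Q$expression\E(?!\S)/;
  $substring = $expression;
}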

Then, you'd also have to check the next line and potentially reduce the substring match, e.g. when you have

FOO a b c
FOO a b
FOO d

The first match (line 2 against line 1) would give you FOO a b, but comparing with line 3 below, you'd only get FOO.

Which boils down to: you need to buffer your lines until you get a no-match line. So instead of printing, you'd do something like

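# $substring: leading words $current_line shares with $buffer_substring
# (computed as above, with $buffer_substring in place of $previous_line).
if (@buffer && $substring ne '') {
  # Same group: keep the line and shrink the common prefix if needed.
  push @buffer, $current_line;
  $buffer_substring = $substring;
} else {
  # No match: report the finished group, then start a new one.
  print scalar(@buffer), " $buffer_substring\n" if @buffer;
  @buffer = ($current_line);
  $buffer_substring = $current_line;
}
# After the last input line, print the final group the same way.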

And then you'd just combine this.
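
As a rough idea of what the combined script could look like, here is a minimal sketch. It assumes that lines belonging together are adjacent in the input (sort the file first if they are not) and that only whole words at the start of a line are compared. The names common_prefix, count_groups and count_prefix.pl are made up for this example, and instead of buffering the lines it only keeps a running count and the shared prefix.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the leading words shared by two word lists (passed as array refs).
sub common_prefix {
    my ($left, $right) = @_;
    my @shared;
    for my $i (0 .. $#$left) {
        last if $i > $#$right || $left->[$i] ne $right->[$i];
        push @shared, $left->[$i];
    }
    return @shared;
}

# Group adjacent lines that start with the same words and
# return one "count prefix" string per group.
sub count_groups {
    my @lines = @_;
    my (@report, @prefix);
    my $count = 0;
    for my $line (@lines) {
        my @words = split ' ', $line;   # also drops the trailing newline
        next unless @words;             # skip blank lines
        my @shared = $count ? common_prefix(\@prefix, \@words) : ();
        if (@shared) {
            # Same group: shrink the prefix to what is still shared.
            @prefix = @shared;
            $count++;
        } else {
            # New group: report the finished one, then start over.
            push @report, "$count @prefix" if $count;
            @prefix = @words;
            $count  = 1;
        }
    }
    push @report, "$count @prefix" if $count;   # last group
    return @report;
}

# Only reads input when file names are given on the command line.
if (@ARGV) {
    print "$_\n" for count_groups(<>);
}
```

Saved as count_prefix.pl, this would be run as perl count_prefix.pl yourfile. Because each line is compared against the running prefix of its group rather than only against the previous line, the FOO example above ends up as a single group with the prefix FOO.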

If it's not "first common starting from beginning of line", you'd have to check every possible sequence of words against every possible sequence of words in other lines, which is utterly complicated and which I will not reproduce here.

towo

Posted 2013-05-16T18:36:49.633

Reputation: 514

Thank you, but how can I run this? I tried running it this way: perl yourscript.pl my file, but it returns the error syntax error at sc.pl line 4, near "];". I also tried piping the file in, but the error is still there! :( – Arash – 2013-05-17T05:37:57.723

These are only parts of a script, not a whole solution. – towo – 2013-05-17T17:15:11.733