This is rather difficult in a single pass, and even more difficult if you don't assume that the common part has to start at the beginning of each line.
You could write a Perl script that matches regular expressions against previous lines, sort of like this:
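my @words_on_line = split ' ', $current_line;
my $i = 0; my $substring = ''; my $expression = '';
do {
    # an array slice needs the @ sigil; grow the candidate one word per pass
    $expression = join(' ', @words_on_line[0 .. $i++]);
    # \Q..\E quotes regex metacharacters; (?!\S) only accepts whole words
    if ($previous_line =~ m/^\Q$expression\E(?!\S)/) {
        $substring = $expression;
    }
    # stop at the first failed match, or when the line runs out of words
} until ($substring ne $expression or $i > $#words_on_line);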
Then, you'd also have to check the next line and potentially reduce the substring match, e.g. when you have
FOO a b c
FOO a b
FOO d
The first match (line 2 against line 1) would give you FOO a b, but comparing with the line below, you'd only get FOO.
Which boils down to: you need to buffer your lines until you get a no-match line. So instead of printing, you'd do something like
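if ($substring) {
    # still in the same group: keep the line, shrink the shared prefix if needed
    push @buffer, $current_line;
    $buffer_substring = $substring
        if $buffer_substring eq '' or length($substring) < length($buffer_substring);
} else {
    # no-match line: report the finished group, then start a new one
    print scalar @buffer, " $buffer_substring\n" if @buffer;
    @buffer = ($current_line);
    $buffer_substring = '';
}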
And then you'd just combine this.
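Combined, the two pieces might look roughly like the sketch below. It is only an illustration under a few assumptions: the input is already sorted, a group is a run of consecutive lines sharing at least the first word, and the helper names common_word_prefix and group_lines are made up for this example. It also swaps the regex for a word-by-word comparison and keeps a running prefix and a count instead of buffering the lines themselves.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Longest run of leading whole words shared by two lines.
sub common_word_prefix {
    my ($left, $right) = @_;
    my @l = split ' ', $left;
    my @r = split ' ', $right;
    my @common;
    while (@l && @r && $l[0] eq $r[0]) {
        push @common, shift @l;
        shift @r;
    }
    return join ' ', @common;
}

# Group consecutive lines that share at least their first word and
# return [count, shared prefix] pairs. Assumes the input is sorted.
sub group_lines {
    my @lines = @_;
    my @groups;
    my ($count, $prefix) = (0, '');
    for my $line (@lines) {
        my $shared = $count ? common_word_prefix($prefix, $line) : '';
        if ($shared ne '') {
            # still matching: the shared prefix can only shrink
            $prefix = $shared;
            $count++;
        } else {
            # no-match line: flush the finished group, start a new one
            push @groups, [$count, $prefix] if $count;
            ($count, $prefix) = (1, join ' ', split ' ', $line);
        }
    }
    push @groups, [$count, $prefix] if $count;
    return @groups;
}

# The example from above, plus one unrelated line.
my @sample = ('FOO a b c', 'FOO a b', 'FOO d', 'BAR x');
printf "%d %s\n", @$_ for group_lines(@sample);
```

For the sample lines this reports the three FOO lines as one group with the prefix FOO, and the BAR line as a group of its own.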
If it's not "first common substring starting from the beginning of the line", you'd have to check every possible sequence of words against every possible sequence of words in the other lines, which is utterly complicated and which I will not reproduce here.
Please give a more specific example of what your expected output is. I don't understand what "the longest common substring in each line" is. (Please capitalize the pronoun "I" in your posts.) What actual problem do you need to solve? It seems like you want to parse your first file for something specific? What would that be? – slhck – 2013-05-16T19:09:18.703
"longest common substring" limited to complete words I guess. – Hauke Laging – 2013-05-16T19:11:17.073
By longest sub-string I mean the longest sub-string that different lines have in common. For example, ICMP Destination Unreachable Port Unreachable is the longest possible string that repeats in different lines – Arash – 2013-05-16T19:15:15.703
In your second output that line only occurs once. Please describe your actual problem (or at least show the expected output). This seems like a very contrived thing you're asking for. – slhck – 2013-05-16T19:17:11.247
Output 2 is extracted from output 1, and that is what I want to get to – Arash – 2013-05-16T19:20:43.887
That's not what your question said – or at least from what I've understood. I tried to clarify that part, hope it makes sense still? – slhck – 2013-05-16T19:22:17.777
@slhck: I think what he’s trying to say is “take all the lines that begin with the same word, and then find the longest string (or perhaps the longest string of words) that is an initial substring of all of those lines.” (And also count that group of lines that begin with the same word.) – Scott – 2013-05-16T19:25:28.200
I guess the task is this: (a) Strip off the irrelevant parts of the lines. (b) Sort the lines. (c) Repeat for every line: Is the line identical to the following line? If not: Leave out the last word and compare again. If they match: Compare the following lines to this string and take all matching lines out of the set. The sorting is non-trivial though: Shorter lines must be later than lines whose beginning matches the shorter line. – Hauke Laging – 2013-05-16T19:48:53.887
I think the easiest way to solve this, still, would be to find out what the actual issue is. Why would one need the count of said lines? – slhck – 2013-05-16T20:38:29.947