Find the longest matching prefix in file?

For example, file a.txt:

/abc
/abc/def
/abc/xyz
/abcd
/fghi

Give input and expected results are:

/abc/dog     => /abc
/abc/def12   => /abc/def
/dog         => (NONE)

Is this possible using only shell commands or grep, sed, awk, etc. stuff?

Xiè Jìléi

Posted 2010-12-13T16:18:51.203

Reputation: 14 766

I think that the query /abc/dog should return /abc/d. – mrucci – 2010-12-13T20:27:05.677

1@mrucci: I believe the desired result is that the input should match the entire line in a.txt rather than partial matches. – Paused until further notice. – 2010-12-13T21:18:11.060

Answers

One way to do this is to somewhat reverse the idea of which is the input and use a.txt as the patterns to search for and what you're calling "input" (I'll call "file2") to be what is searched in:

grep -o -f a.txt file2

echo "/abc/dog" | grep -o -f a.txt

These won't output anything for "/dog", although the echo version will have a non-zero return code.

Edit:

This will more closely match your requested output:

while read -r line
do
    match=$(echo "$line" | grep -of a.txt)
    match=${match:-(NONE)}
    printf "%-12s => %s\n" "$line" "$match"
done < file2

You can force the search patterns to start at the beginning of the line like this:

grep -o -f <(sed 's/^/^/' a.txt) file2

Paused until further notice.

Posted 2010-12-13T16:18:51.203

Reputation: 86 075

This needs a.txt to be sorted descending though, but that is easily done using sort -r. – Arjan – 2010-12-13T20:04:40.860

Oh, and this would also match /dog/abc, unless the patterns in a.txt are prefixed with ^? (Like ^/abc) – Arjan – 2010-12-13T20:15:08.680

@Arjan: I added a way to anchor the patterns to the beginning of the line. Can you show me an example of sort -r being necessary? – Paused until further notice. – 2010-12-13T21:46:03.907

Yup, I saw the change but cannot upvote twice ;-). echo "/abc/def12/" | grep -o -f a.txt should yield /abc/def, right? Without sorting it would show /abc. (At least, on a Mac it does.) Maybe it should be sorted by length, not alphabetically, but I guess it has the same effect. – Arjan – 2010-12-13T21:48:50.037

@Arjan: On my Linux system, echo "/abc/def12/" | grep -o -f a.txt yields /abc/def without sorting. – Paused until further notice. – 2010-12-13T22:19:09.243

Hmmm, see my results. But well, the question asker will see if echo "/abc/def12/" | grep -o -f <(sort -r a.txt) is needed instead. (Nice answer!)

– Arjan – 2010-12-13T22:25:55.683

@Arjan: That's weird. I'm using grep 2.5.4. What locale are you using? That could possibly have an effect. I'm using en_US.utf8, but a few others I tried gave the same results. – Paused until further notice. – 2010-12-13T22:41:58.630

locale gives me the following. Not sure if that helps? LANG= LC_COLLATE="C" LC_CTYPE="UTF-8" LC_MESSAGES="C" LC_MONETARY="C" LC_NUMERIC="C" LC_TIME="C" LC_ALL= – Arjan – 2010-12-13T22:59:24.160

No, I tried it in the C locale and it worked. I can't reproduce the /abc result that you see. You're using GNU grep on a Mac? – Paused until further notice. – 2010-12-13T23:18:41.250

Yup, just the included thing. grep --version shows grep (GNU grep) 2.5.1 Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc. And no GREP specific environment variables... )I do have COMMAND_MODE=unix2003, maybe that matters? Using UNSET COMMAND_MODE makes no difference.) – Arjan – 2010-12-13T23:39:27.650

@Arjan: I'm surprised it's GNU grep since OS X is BSD-based. – Paused until further notice. – 2010-12-14T00:23:37.450

Sounds like a job for Perl, so here's an awk solution. Minimally tested.

#!/bin/sh
prefixes_file=$1
shift
awk -vprefixes_file="$prefixes_file" '
BEGIN {
    while (getline <prefixes_file) { ++prefixes[$0]; }
}
{
    for (n = length; n >= 0; --n) {
        if (prefixes[substr($0,1,n)]) {
            print $0, "=>", substr($0,1,n);
            break;
        }
    }
    if (n == -1) { print $0, "=>", "(NONE)"; }
}' "$@"

Gilles 'SO- stop being evil'

Posted 2010-12-13T16:18:51.203

Reputation: 58 319

A simple shell script should do the job:

#!/bin/sh

query=$1
file=$2

for i in $(seq 1 ${#query})
do
    current_query=$(echo $query | cut -b1-$i)
    grep -q "$current_query" "$file" || break;
    longest_match=$current_query
done

echo "$longest_match"

You can use it like:

longest_match.sh '/abc/dog' a.txt

and it will print the longest match of the query /abc/dog found in the file a.txt, i.e. /abc/d

mrucci

Posted 2010-12-13T16:18:51.203

Reputation: 8 398

That could be incredibly slow. – Paused until further notice. – 2010-12-13T21:15:30.297