How do you remove all occurrences of values in one list from another list?

I have a list of symbols such as...

wer
sfe
efo

How do I remove all instances of these (unique) symbols from another list of (non-unique) symbols?

So in the following list, the two lines starting with wer and the one line starting with sfe would be removed:

wer-alskjdfi
efr-4siosejf
rte-alskjdfs
wer-alskjsef
sfe-ooskjdfi

Every other line should be untouched, with the symbol and characters after "-" remaining:

efr-4siosejf
rte-alskjdfs

I need to do this using sed/awk/grep/bash or other command line tools. I know how to write a sed command to search and remove one value at a time, but how do I do this for 100k+ values?

barrrista

Posted 2012-11-20T19:02:23.543

Reputation: 1 519

Answers

What if file 2 has characters after each of those symbols? I want to do the same but keep the trailing characters.

OK, make a copy of file2 that has only the field that you want to filter on. And, if the current file2 has the “non-unique symbol” immediately followed by the “trailing characters” (e.g., efr-42, rte-17, etc.), make another copy of file2 where they are space-separated. Here are example commands based on the example data you provided:

sed 's/\(...\).*/\1/'        file2.sorted > file2.symbol_only
sed 's/\(...\)\(.*\)/\1 \2/' file2.sorted > file2.separated

or

sed 's/\([^-]*\)-.*/\1/'        file2.sorted > file2.symbol_only
sed 's/\([^-]*\)\(-.*\)/\1 \2/' file2.sorted > file2.separated

… based on the new data that you added to your question. Then use comm as before:

comm -13 file1.sorted file2.symbol_only > file2.no_match

… and join the symbols up with the trailing characters:

join file2.no_match file2.separated

If necessary, use another sed to remove the spaces you added.
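Put together, the steps might look like the sketch below. It is a variant rather than an exact transcription: the function name comm_join_filter and the temporary directory are made up for illustration, and the symbol column is deduplicated with sort -u. That way one copy of file1 is enough, and join (which pairs every matching line) does not emit a kept symbol several times when it occurs more than once in file2. It assumes the data lines contain no whitespace.

```shell
# Hypothetical helper: $1 = symbols to remove (file1), $2 = data lines (file2).
comm_join_filter() {
    work=$(mktemp -d) || return 1
    # LC_ALL=C keeps sort, comm and join agreeing on one byte-wise ordering.
    LC_ALL=C sort -u "$1" > "$work/file1.sorted"
    # Symbol column only, deduplicated.
    sed 's/\([^-]*\)-.*/\1/' "$2" | LC_ALL=C sort -u > "$work/file2.symbol_only"
    # "symbol -rest", sorted on the symbol field for join.
    sed 's/\([^-]*\)\(-.*\)/\1 \2/' "$2" | LC_ALL=C sort -k1,1 > "$work/file2.separated"
    # Symbols that occur in file2 but not in file1.
    LC_ALL=C comm -13 "$work/file1.sorted" "$work/file2.symbol_only" > "$work/file2.no_match"
    # Reattach the trailing characters, then drop the space added above.
    LC_ALL=C join "$work/file2.no_match" "$work/file2.separated" | sed 's/ //'
    rm -rf "$work"
}
```

Called as comm_join_filter file1 file2, it prints the surviving lines in sorted order, not in file2's original order.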

It occurs to me that you could build on this trick to get the output file back into file2’s original order.

1. Produce a copy of the original file2 with line numbers.
2. Shuffle the line numbers to the right of the symbols.
3. Apply the procedure above, starting with the sort commands.
4. Sort the output on the original line number.
5. Strip out the line numbers.
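
A rough sketch of the line-number idea, under some assumptions: the function name filter_keep_order is hypothetical, the data lines contain no whitespace, and the filtering step uses join -v 2 (print the lines of the second file that have no match in the first) in place of the comm and join pair, which keeps the example short.

```shell
# Hypothetical helper: $1 = symbols to remove (file1), $2 = data lines (file2).
filter_keep_order() {
    work=$(mktemp -d) || return 1
    LC_ALL=C sort -u "$1" > "$work/file1.sorted"
    # Number the lines and put the number to the right of the symbol:
    # "wer-abc" on line 4 becomes "wer 4 -abc". Then sort on the symbol.
    awk -F- '{ print $1, NR, substr($0, length($1) + 1) }' "$2" |
        LC_ALL=C sort -k1,1 > "$work/file2.numbered"
    # Keep lines whose symbol is not in file1 (swapped-in join -v 2),
    # restore the original order by sorting numerically on the line number,
    # then strip the line number and the spaces around it.
    LC_ALL=C join -v 2 "$work/file1.sorted" "$work/file2.numbered" |
        sort -k2,2n |
        sed 's/^\([^ ]*\) [0-9]* */\1/'
    rm -rf "$work"
}
```

Usage would be filter_keep_order file1 file2, with the result written to standard output.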

Let me know if you need help with this.

Scott

Posted 2012-11-20T19:02:23.543

Reputation: 17 653

Assuming your lists reside in files:

awk -F- 'NR==FNR {exclude[$1]++; next} !($1 in exclude)' list_of_symbols filename

grep is also an option:

grep -v -f <(sed 's/^/^/' list_of_symbols) filename

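The sed bit prefixes each symbol with the regexp anchor ^, so that it only matches at the beginning of a line. The <(...) process substitution needs a shell that supports it, such as bash.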
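
For trying both on the sample data from the question, here is a small sketch. The function names filter_awk and filter_grep are made up for illustration. The grep variant is slightly stricter than the command above: it also appends the "-" to each pattern, so a symbol wer would not remove a line starting with werx-. It assumes the symbols contain no regexp metacharacters, and it uses a temporary pattern file instead of process substitution so that it runs in plain sh.

```shell
# Hypothetical helpers: $1 = list_of_symbols, $2 = filename.

# awk: remember every symbol from the first file, then print only those
# lines of the second file whose first "-"-separated field was not seen.
filter_awk() {
    awk -F- 'NR==FNR {exclude[$1]++; next} !($1 in exclude)' "$1" "$2"
}

# grep: turn each symbol into the pattern "^symbol-" and drop matching lines.
filter_grep() {
    patterns=$(mktemp) || return 1
    sed 's/.*/^&-/' "$1" > "$patterns"
    grep -v -f "$patterns" "$2"
    status=$?
    rm -f "$patterns"
    return "$status"
}
```

Both keep the original order of the second file, e.g. filter_awk list_of_symbols filename > filtered.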

glenn jackman

Posted 2012-11-20T19:02:23.543

Reputation: 18 546

Do you need to retain the order of your second file? Can you state a maximum number of times that a line can be repeated? If the answers to both questions are “no”, I’d suggest comm:

sort file1 file1 > file1.sorted     sort file2 > file2.sorted
-------------------------------     -------------------------
efo                                 efr
efo                                 rte
sfe                                 sfe
sfe                                 wer
wer                                 wer
wer

comm -13 file1.sorted file2.sorted
efr
rte

Include enough copies of file1 in file1.sorted to cover the maximum number of occurrences of any string in file2.

Scott

Posted 2012-11-20T19:02:23.543

Reputation: 17 653

thanks Scott. What if file 2 has characters after each of those symbols? I want to do the same but keep the trailing characters. – barrrista – 2012-11-20T19:59:25.597

Without knowing anything about sed etc., the basic design in my personal pseudocode is:

Sort the list of strings to be removed (List A)

Sort the list of strings which contains items to be removed (List B)

For each Item (List A)
    Repeat until Item (List B) > Item (List A), or List B is exhausted
        If Item (List B) equals Item (List A)
            remove Item (List B)
        Next Item (List B)
Next Item (List A)

Note: "Removing" an item might be problematical - better to replace this line with one adding the item to a new
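
One possible reading of this design with the tools the question asks for is a single merge pass over the two sorted lists, copying the lines that survive to the output instead of removing anything in place. The sketch below is only that, a sketch: the name merge_filter is hypothetical, items are compared on the symbol before the first "-", and the result comes out in sorted order rather than in the original order of List B.

```shell
# Hypothetical helper: $1 = List A (symbols to remove), $2 = List B (data lines).
merge_filter() {
    a_sorted=$(mktemp) || return 1
    # Sort List A; sort List B on the symbol before the first "-".
    LC_ALL=C sort -u "$1" > "$a_sorted"
    LC_ALL=C sort -t- -k1,1 "$2" |
        LC_ALL=C awk -F- -v listA="$a_sorted" '
            # Read the next item of List A; haveA becomes 0 at end of list.
            function next_a() { haveA = ((getline a < listA) > 0) }
            BEGIN { next_a() }
            {
                key = $1 ""    # appending "" forces a string comparison
                # Advance List A until its item is no longer smaller.
                while (haveA && (a "") < key) next_a()
                # Copy the List B line unless List A holds the same symbol.
                if (!haveA || (a "") != key) print
            }'
    rm -f "$a_sorted"
}
```

Each list is read once after sorting, so the pass itself is linear in the combined length of the two lists.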

Fred

Posted 2012-11-20T19:02:23.543

Reputation: 11