Getting complement of two text files on the Unix command line


Is there a fast Unix-based utility (ideally written in C) for computing the intersection/union of tab-delimited text files in an efficient, streaming fashion?

For example, it should allow queries such as "give me all the entries in file A whose value in column K does not appear in column K of file B".

For example, if file A is:

bob sally sue
bob mary john

and file B is:

john sally sue
foo bar quux

then the complement of file A relative to B on column 2 would return "bob mary john", since that's the only row in file A whose value in column 2 does not appear in column 2 of file B.

I'd prefer not to use a database, but I would like a command-line based utility. Is AWK the answer or is there something simpler?

user46976

Posted 2011-12-04T20:58:04.407

Reputation: 1 078

It sounds as if the best way is to read those contents into an actual database, because DB queries are a perfect answer to your question. – Hennes – 2015-12-09T20:45:09.540

I suggest Python: read each file, split each line on whitespace into a list, turn the result into a set, and do set operations before writing out. – Eroen – 2011-12-04T21:06:19.287
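
The set-based idea in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested utility: the function names `column_values` and `complement` are made up for this sketch, columns are assumed to be tab-separated and numbered from 1, and rows too short to have the requested column are simply skipped. Only column K of file B is held in memory; file A is streamed line by line.

```python
def column_values(lines, column, sep="\t"):
    """Collect the values found in the 1-based `column` of every row."""
    values = set()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= column:  # assumption: skip rows lacking that column
            values.add(fields[column - 1])
    return values


def complement(lines_a, lines_b, column, sep="\t"):
    """Yield rows of A whose `column` value never occurs in that column of B."""
    seen_in_b = column_values(lines_b, column, sep)
    for line in lines_a:  # A is consumed one line at a time
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= column and fields[column - 1] not in seen_in_b:
            yield line.rstrip("\n")


# The two example files from the question, as in-memory lists of lines.
file_a = ["bob\tsally\tsue\n", "bob\tmary\tjohn\n"]
file_b = ["john\tsally\tsue\n", "foo\tbar\tquux\n"]

for row in complement(file_a, file_b, column=2):
    print(row)  # prints the single row: bob, mary, john (tab-separated)
```

Because both arguments are just iterables of lines, open file handles can be passed in place of the lists. Whether this is fast enough compared with a C or awk solution depends on the data and would have to be measured.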

@Eroen The user mentions speed and memory efficiency. Python with set operations is anything but. – Daniel Beck – 2011-12-04T21:20:55.603

This page describes a number of set operations on the Unix command line. I'm not sure how you want your columns handled, but if they are not important, a quick awk script could merge them into a single column. – Eroen – 2011-12-05T00:53:36.880

I've read your question three times by now, and even the example you gave does not help me very much. I've done joins and exclusive searches on columns many times with AWK, but, sorry, please help me understand your problem. E.g. "bob mary john" isn't a column, it's a row; and "column value K" versus "column K" makes no sense to me. – math – 2011-12-09T11:35:56.573

@math That part is quite easily solved by just transposing the contents of file B, and then comparing columns instead. I'm no awk ninja, but I'd be surprised if awk couldn't perform a matrix transposition. – Martin Frost – 2011-12-12T13:48:00.740

No answers