Grep to filter gigantic CSV file

5

I'm dealing with a CSV file that contains over 2 million lines. Quite a big one.

I need to use grep (or any better way) to retrieve the entire row if the value of the second column matches 'jpn' or 'por'.

I tried using grep '<\jpn\>' || grep '<\por\>' file1.csv > file2.csv but had no success so far. The major issue being that 'por' is a common occurrence on the third column, which generates over a million of undesired rows.

Would any more experienced user be se kind and lend me some help?

Thanks in advance! :D

YLeven

Posted 2014-10-17T02:28:35.710

Reputation: 123

Answers

11

What Linux OS are you on? Try using egrep if you want to use regular expressions in systems whose grep version is outdated (e.g. Solaris).

Anyway, here is an awk solution:

awk -F, '$2 ~ /jpn|por/ {print}' file1.csv > file2.csv

Explanations:

  • awk for column-based operations
  • -F, to define the column separator/tokenizer, in this case I use a comma
  • $2 ~ /jpn|por/ tests column #2 with expression /jpn|por/
    • $2 is column #2
    • /jpn|por/ is a regular expression to match jpn or por
  • {print} specifies what the awk should output if it found a matching line
    • print to print the whole input line (alternatively, print $3 will just print column #3)
  • ... file1.csv specifies to read from an input file instead of stdin

aff

Posted 2014-10-17T02:28:35.710

Reputation: 322

I'm using OS X 10.10. - Your tip worked like a charm! Thank you very much! ;) – YLeven – 2014-10-17T23:24:44.887

1

grep '; jpn;\|; por;' /path/to/file1.csv > file2.csv

Assuming that semicolon (;) is the separator.

Gaurav Joseph

Posted 2014-10-17T02:28:35.710

Reputation: 1 503

0

I'm not sure, but maybe this could work:

grep '^[^,]*,\(por\|jpn\)\(,\|$\)' file1.csv > file2.csv

Assuming that "," is the separator.

MinusFour

Posted 2014-10-17T02:28:35.710

Reputation: 166

0

This seems to work for me:

egrep '^.?,(jpn|por),.*$' file1.csv > file2.csv

DopeGhoti

Posted 2014-10-17T02:28:35.710

Reputation: 543