Find/replace on specific column in large file with multiple patterns

0

1

I have a 100GB CSV file. Looks something like this:

11,aa,gg,tt
ew,11,rggr,tr

And a second one which specifies substitutions (20,000 lines in this file):

gg,q
aaa,b

What I want to do is to go over every line in the first file and substitute the value of the third column if it appears in the second file (e.g. substitute gg with q in this case). Note that this is only if the whole string appears in the third column. So for the current example the result would be:

11,aa,q,tt
ew,11,rggr,tr

What would be a good way of achieving that?

jack

Posted 2016-11-17T16:01:45.207

Reputation: 3

Answers

1

Given the size of your files I would choose to write this in a language like Perl or Python so I could load the replacement file into memory (into a hash or dictionary for quick lookups), then parse through the main csv file line by line. Here is an example in Python of how to do that, assuming your csv files were called "replace.csv" for the replacements and "big.csv" for the main csv file.

#!/usr/bin/python
import csv
import sys

repl = dict()

# Load replacements into memory
with open("replace.csv", "r") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        if len(row) >= 2:
            repl[row[0]] = row[1]

# Do the replacements
with open("big.csv", "r") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    writer = csv.writer(sys.stdout)
    for row in reader:
        if len(row) >= 3:
            if row[2] in repl:
                    row[2] = repl[row[2]]
            writer.writerow(row)

virtex

Posted 2016-11-17T16:01:45.207

Reputation: 1 129