How to edit the header of a huge CSV file in-place?

I have several huge CSV files in which I want to swap two column names.

I do not want to modify/copy/rewrite the data.

The operation is very cheap in C: fopen the file, fgets the header, fseek or rewind, manipulate the header (preserving its length), fputs the new header, fclose the file.

This can also be done in ANSI Common Lisp (CLISP, SBCL or GCL):

 (with-open-file (csv "foo.csv" :direction :io
                      :if-exists :overwrite)
   (let ((header (read-line csv)))
     (print header)
     (file-position csv 0)
     (write-line (string-upcase header) csv)
     (file-position csv 0)
     (read-line csv)))

and takes a fraction of a second (sed takes a few minutes because it reads and re-writes the whole file even it you tell it to modify just the first line, ignoring the crucial information that the size of the header did not change).

How do I do that with the "standard unix tools" (e.g., perl)?

sds

Posted 2012-12-17T21:17:40.880

Reputation: 1 600

Answers

If you do not know the length of the header, head -n1 seems like a reasonable way to get the first line.

To write it in-place back to the head of the file, you can use dd:

head -n1 file.csv | ./do-some-processing | dd of=file.csv bs=1 conv=notrunc

the conv=notrunc is critical to leave the rest of the file intact, and bs=1 is to stop on byte boundary.

b0fh

Posted 2012-12-17T21:17:40.880

Reputation: 2 005

Nice, I never knew about notrunc. But note that ./do-some-processing must preserve the length of the header (as specified by the OP.) Just a warning for tl;dr folks (like me :) – Owen – 2015-05-08T19:41:14.230

I would suggest sed for this, you can specify to only make the substitution on the first line such as 1s/foo/bar/:

$ cat file
col1,col2,col3
1,2,3
3,2,1
...

$ sed -e '1s/col1/tmp/' -e '1s/col3/col1/'  -e '1s/tmp/col3/' file
col3,col2,col1
1,2,3
3,2,1
...

Use -i to store the change back to the file:

$ sed -i -e '1s/col1/tmp/' -e '1s/col3/col1/'  -e '1s/tmp/col3/' file

iiSeymour

Posted 2012-12-17T21:17:40.880

Reputation: 525

1This does modify the file as required, but this re-writes the data, i.e., time shows that the sed command takes about the same time (few minutes) as cp. – sds – 2012-12-18T15:01:47.393

If all you want is to swap two words, then all you need is in-place rewriting of a few bytes.

This is an easy task for a commandline hexadecimal editor.

I would recommend hexedit which I just used to edit a 30 Gb .csv file. The time spent on opening/saving the file was negligible (less than a second). In fact, my time was mostly spent looking up its keyboard shortcuts... (TAB to switch to ASCII display, Ctrl-X to save and exit).

Jealie

Posted 2012-12-17T21:17:40.880

Reputation: 101

The idea was to use a batch, not an interactive, process. – sds – 2017-07-12T14:14:01.347

-1

Or maybe "head" the file to remove the first line to a separate file.

Then change the heading file and merge the two back together.

Julian Knight

Posted 2012-12-17T21:17:40.880

Reputation: 13 389

So why mark this down and accept the same answer above? – Julian Knight – 2014-08-12T16:00:01.950

because the answer above is constant in time and this answer is not – sds – 2014-08-12T17:22:48.060

In all fairness, you didn't explain how to merge the files together. Using dd was the non-trivial insight. – b0fh – 2014-09-24T10:29:17.570

this copies the data, i.e., the time spent is proportional to the data size. – sds – 2012-12-18T14:37:13.227