As far as I know, there's no way to avoid reading the whole file into memory.
gci "C:\location" -Filter *.csv | % {
    (Get-Content $_.FullName | Select-Object -Skip 3) | Set-Content $_.FullName
    Add-Content -Path $_.FullName -Value ""
}
This PowerShell solution requires loading the whole file into memory. It will:
- search for every CSV in a location with `gci`,
- loop over the found CSV files with `foreach` (alias `%`),
- get their whole content (this can take some time) with `Get-Content`,
- select everything except the first 3 lines with `Select-Object -Skip 3`,
- and write that content back to the file with `Set-Content`.
- The last line appends a trailing newline to the file with `Add-Content`.
Edit: You can try to make this whole thing faster by adding the `-ReadCount` parameter to your `Get-Content` call.
-ReadCount (int)
Specifies how many lines of content are sent through the pipeline at a
time. The default value is 1. A value of 0 (zero) sends all of the
content at one time.
This parameter does not change the content displayed, but it does
affect the time it takes to display the content. As the value of
ReadCount increases, the time it takes to return the first line
increases, but the total time for the operation decreases. This can
make a perceptible difference in very large items.
Edit 2: I tested `Get-Content` with `-ReadCount`. Sadly, the largest text file I could find was 89 MB, but the difference is already significant:
PS C:\Windows\System32> Measure-Command { gc "C:\Pub.log" -readcount 0 }
Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 22
Ticks : 10224578
TotalDays : 1.18340023148148E-05
TotalHours : 0.000284016055555556
TotalMinutes : 0.0170409633333333
TotalSeconds : 1.0224578
TotalMilliseconds : 1022.4578
PS C:\Windows\System32> Measure-Command { gc "C:\Pub.log" -readcount 1 }
Days : 0
Hours : 0
Minutes : 0
Seconds : 10
Milliseconds : 594
Ticks : 105949457
TotalDays : 0.000122626686342593
TotalHours : 0.00294304047222222
TotalMinutes : 0.176582428333333
TotalSeconds : 10.5949457
TotalMilliseconds : 10594.9457
So `Get-Content $_.FullName -ReadCount 0` is the way to go.
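Putting it together, a sketch of the same loop with `-ReadCount 0` (assuming the same `C:\location` path as above). One caveat I should point out: with `-ReadCount 0`, `Get-Content` sends all lines down the pipeline as a single array, so the call needs its own set of parentheses to enumerate that array first — otherwise `Select-Object -Skip 3` would skip array objects rather than lines:

```powershell
gci "C:\location" -Filter *.csv | % {
    # -ReadCount 0 emits all lines as one array (faster); the inner
    # parentheses enumerate it so -Skip 3 skips lines, not arrays.
    # The outer parentheses finish reading before Set-Content writes.
    ((Get-Content $_.FullName -ReadCount 0) | Select-Object -Skip 3) |
        Set-Content $_.FullName
    Add-Content -Path $_.FullName -Value ""
}
```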
Not an answer to the question asked, but if you ever get to refactor: This is what a database does quite well. – Hennes – 2016-12-01T14:55:39.723
@Hennes: That would work if the first 3 lines were actual data lines, but they are random text. Edited my question to make this clearer. I formulated it badly earlier... – Wouter – 2016-12-01T15:00:30.827
Ah. I see plenty of solutions which include reading the whole file (I searched on "trim beginning of a file"). It will be interesting what comes up for Windows which does not read the unchanged parts. – Hennes – 2016-12-01T15:03:56.267
Editing by loading into RAM would of course be trivial :) – Wouter – 2016-12-01T15:08:02.467
Have you tried `Get-Content` along with `Set-Content` to get the first 3 lines and/or BaseStream and read/replace 3 lines? Unless you read in the entire file, neither suggestion would result in the entire file being read into memory. – Ramhound – 2016-12-01T16:32:12.943
Put GNU/Linux in a virtual machine and run it there – Neil McGuigan – 2016-12-01T20:13:42.140
You might be able to hex edit the 3 lines (or script a routine) into a proper row format with the proper number of field and record delimiters, and then use `FIRSTROW`. This would only require seeking a few bytes into the file. – Yorik – 2016-12-01T21:17:28.293