Remove Empty Second Line From Very Large CSV File in Windows 10

I have a very large CSV file (data for all the taxi trips in NYC during March). The second line is empty, which angers PostgreSQL's import. I can run

(gc file.csv) | ? {$_.trim() -ne "" } | set-content file_trimmed.csv

to get rid of the offending line, but since it checks every line it takes a very long time to run (the file is on an external drive and my computer is a potato).

Any way to make this faster?

rhaskett

Posted 2017-10-31T21:02:33.223

Reputation: 123

In-place is fine. – rhaskett – 2017-10-31T21:08:41.777

$x=gc file.csv; $x[0,2+3..($x.Count-1)] | set-content file_trimmed.csv i.e. removing just the second line (index 1) could be faster. Unfortunately, I can't Measure-Command it, not having a sufficiently large file :) – JosefZ – 2017-10-31T22:21:49.593
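For illustration, that indexing trick can be sanity-checked on a tiny throwaway file (file names here are made up for the demo):

# build a 5-line sample whose second line is empty
Set-Content sample.csv "header", "", "row1", "row2", "row3"
$x = Get-Content sample.csv
# 0,2 concatenated with 3..($x.Count-1): every index except 1 (the blank line)
$x[0, 2+3..($x.Count-1)] | Set-Content sample_trimmed.csv
Get-Content sample_trimmed.csv   # header, row1, row2, row3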

I can help you with that :) https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-03.csv Let me test your recommendation.

– rhaskett – 2017-10-31T22:33:32.737

@JosefZ Interestingly, it didn't seem to work on a much smaller file I ran as a test. https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2014-03.csv The 2nd line wasn't removed.

– rhaskett – 2017-10-31T22:45:47.960

Is PostgreSQL OK with there being extra spaces before/after the column names in the first row? If so, it'd be pretty easy to adjust a few bytes in place without needing to read in the whole file. – Ben N – 2017-10-31T23:00:11.610

@BenN I'm not sure but I'm happy to test. – rhaskett – 2017-10-31T23:13:48.010

There is 0x0A0A0D0A, i.e. LF,LF,CR,LF, i.e. two empty lines between the header line and the first data line… Use $x[0,3+4..($x.Count-1)] – JosefZ – 2017-10-31T23:56:34.160
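Given that byte layout, here is a rough PowerShell sketch of Ben N's patch-in-place idea: overwrite the surplus newline bytes with spaces, so the header line merely gains trailing blanks and the bulk of the file is never rewritten. The path is hypothetical, and it assumes the header and the blank lines sit within the first 4 KB and that the importer tolerates trailing spaces in the header; treat it as a sketch, not a drop-in solution:

$path = 'D:\test\file.csv'   # hypothetical path
$fs = [System.IO.FileStream]::new($path, 'Open', 'ReadWrite')
try {
    $buf  = New-Object byte[] 4096
    $read = $fs.Read($buf, 0, $buf.Length)
    # find the LF that terminates the header line
    $i = [System.Array]::IndexOf($buf, [byte]0x0A, 0, $read)
    if ($i -lt 0) { throw 'no line break found in the first 4 KB' }
    # overwrite every newline byte except the last with a space (0x20),
    # turning "header LF LF CR LF data" into "header SP SP SP LF data"
    while ($buf[$i + 1] -eq 0x0A -or $buf[$i + 1] -eq 0x0D) {
        $fs.Position = $i
        $fs.WriteByte(0x20)
        $i++
    }
} finally {
    $fs.Dispose()
}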

Play with the -ReadCount parameter of gc. – SimonS – 2017-11-01T08:07:50.953
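For example (a sketch; the batch size of 1000 is arbitrary, and each pipeline object is then a whole array of lines, hence the inner pipeline):

# -ReadCount 1000 sends lines down the pipeline in batches of 1000,
# cutting per-object overhead; each $_ is an array of lines
Get-Content file.csv -ReadCount 1000 |
    ForEach-Object { $_ | Where-Object { $_.Trim() -ne '' } } |
    Set-Content file_trimmed.csv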

Answers

Use

gc file.csv | ? {$_.trim() -ne "" } | set-content file_trimmed.csv

What's wrong with the original command (paraphrasing Delete all blank lines from a text file using PowerShell on Tim Curwick's PowerShell blog):

The parentheses around the Get-Content statement force it to finish loading the whole contents into an object before sending them down the pipeline. (If we are writing to a different file than we were reading from, we could speed up the command by eliminating the parentheses, thus allowing us to read from the one and write to the other simultaneously.)

The test script 1264263.ps1 measures merely reading a large file, omitting the write to an output file:

param (
    [Parameter()][string]$file = 'green_tripdata_2014-03.csv'
)

Push-Location 'D:\test'

# report the size of the input file
"$file`: {0:N3} KiB" -f $((Get-Item $file).Length / 1024)

# case #1: baseline - just read the file, no filtering
' GC $file                          :' + ' {0:N7} sec' -f (Measure-Command {
    $y = Get-Content $file
}).TotalSeconds

Start-Sleep -Seconds 1
# case #2: streamed pipeline, truthiness filter
' GC $file  | ? {$_.trim()}         :' + ' {0:N7} sec' -f (Measure-Command {
    $y = (Get-Content $file | 
        Where-Object {$_.trim()}) #| Set-Content "$file2"
}).TotalSeconds

Start-Sleep -Seconds 1
# case #3: streamed pipeline, the improved command
' GC $file  | ? {$_.trim() -ne ""}  :' + ' {0:N7} sec' -f (Measure-Command {
    $y = (Get-Content $file | 
        Where-Object {$_.trim() -ne "" }) #| Set-Content "$file2"
}).TotalSeconds

Start-Sleep -Seconds 1
# case #4: parenthesized Get-Content - the original, slow command
'(GC $file) | ? {$_.trim() -ne ""}  :' + ' {0:N7} sec' -f (Measure-Command {
    $y = (Get-Content $file) | 
        Where-Object {$_.trim() -ne ""} #| Set-Content "$file2"
}).TotalSeconds

Pop-Location

The output shows that the improved command (case #3) can run about ten times faster than the original one (case #4):

PS D:\PShell> D:\PShell\SU\1264263.ps1
green_tripdata_2014-03.csv: 197,355.560 KiB
 GC $file                          : 27.4584778 sec
 GC $file  | ? {$_.trim()}         : 59.2003851 sec
 GC $file  | ? {$_.trim() -ne ""}  : 61.0429012 sec
(GC $file) | ? {$_.trim() -ne ""}  : 615.8580773 sec
PS D:\PShell>

JosefZ

Posted 2017-10-31T21:02:33.223

Reputation: 9 121

Many thanks. On the larger yellow file, the elapsed time changed from longer than I had patience for to 10 minutes. – rhaskett – 2017-11-01T18:33:30.867