I have a very large CSV file (data for all the taxi trips in NYC during March). The second line is empty, which angers PostgreSQL's import. I can run
(gc file.csv) | ? {$_.trim() -ne "" } | set-content file_trimmed.csv
to get rid of the offending line, but since it checks every line it takes a very long time to run (the file is on an external drive and my computer is a potato).
Any way to make this faster?
In-place is fine. – rhaskett – 2017-10-31T21:08:41.777
$x=gc file.csv; $x[0,2+3..($x.Count-1)] | set-content file_trimmed.csv
i.e. removing just the second line (index 1) could be faster. Unfortunately, I can't Measure-Command it, not having a sufficiently large file :) – JosefZ – 2017-10-31T22:21:49.593

I can help you with that :) https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-03.csv Let me test your recommendation. – rhaskett – 2017-10-31T22:33:32.737
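Once that file is downloaded, a rough way to compare the two approaches would be to wrap each one in Measure-Command (the file names here are just placeholders):

Measure-Command {
    (gc file.csv) | ? { $_.Trim() -ne "" } | Set-Content file_trimmed.csv
}
Measure-Command {
    $x = gc file.csv
    $x[0,2+3..($x.Count-1)] | Set-Content file_trimmed2.csv
}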
@JosefZ Interestingly, it didn't seem to work on a much smaller file I ran as a test: https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2014-03.csv The 2nd line wasn't removed. – rhaskett – 2017-10-31T22:45:47.960

Is PostgreSQL OK with there being extra spaces before/after the column names in the first row? If so, it'd be pretty easy to adjust a few bytes in place without needing to read in the whole file. – Ben N – 2017-10-31T23:00:11.610
@BenN I'm not sure but I'm happy to test. – rhaskett – 2017-10-31T23:13:48.010
There is 0x0A0A0D0A, i.e. LF,LF,CR,LF, i.e. two empty lines between the header line and the first data line… Use $x[0,3+4..($x.Count-1)] – JosefZ – 2017-10-31T23:56:34.160
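Given that byte layout, here is a minimal sketch of Ben N's in-place idea, assuming the header line ends with a bare LF (as the byte dump above suggests) and that trailing spaces after the last column name are acceptable to PostgreSQL; the file name is a placeholder:

$path = (Resolve-Path 'file.csv').Path                 # placeholder file name
$fs = [System.IO.File]::Open($path, 'Open', 'ReadWrite')
try {
    $buf = New-Object byte[] 4096
    $read = $fs.Read($buf, 0, $buf.Length)

    # End of the header line = first LF in the file.
    $firstLf = [Array]::IndexOf($buf, [byte]0x0A, 0, $read)
    if ($firstLf -lt 0) { throw 'No newline found in the first 4 KB' }

    # Walk past the run of CR/LF bytes that follows it (the empty line(s)).
    $dataStart = $firstLf + 1
    while ($dataStart -lt $read -and ($buf[$dataStart] -eq 0x0A -or $buf[$dataStart] -eq 0x0D)) {
        $dataStart++
    }

    # Overwrite everything except the last LF before the data with spaces,
    # so the header just gains trailing blanks and the empty lines vanish.
    $padLen = $dataStart - 1 - $firstLf
    if ($padLen -gt 0) {
        $fs.Position = $firstLf
        $pad = [byte[]](@(0x20) * $padLen)
        $fs.Write($pad, 0, $pad.Length)
    }
}
finally { $fs.Dispose() }

Nothing but those few bytes is touched, so it finishes in milliseconds regardless of file size.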
Play with the readcount parameter of gc. – SimonS – 2017-11-01T08:07:50.953
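For example, something along these lines (the batch size of 5000 is a guess to tune):

# -ReadCount makes gc emit lines in batches (arrays) instead of one at a time,
# so far fewer objects cross the pipeline. Inside ForEach-Object, $_ is a whole
# batch of lines, which we filter for non-blank entries and pass on.
Get-Content file.csv -ReadCount 5000 |
    ForEach-Object { $_ | Where-Object { $_.Trim() -ne '' } } |
    Set-Content file_trimmed.csv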