I have a text file that is a couple of GBs. I am trying to shuffle this text file in a pipe.
For example, these are some of the commands I am using, but they are not efficient; in fact, the pipe does not seem to start until the whole file is read. Maybe I am wrong about that.
shuf HUGETEXTFILE.txt | some command
cat HUGETEXTFILE.txt | sort -R | some command
I also tried to use
split -n 1/numberofchunks HUGETEXTFILE.txt | sort -R | some command
But the pipe ends when the first chunk finishes.
I am trying to find an efficient way to shuffle a text file in a pipe, because I do not want to write hundreds of files every time I need a new shuffling or random distribution.
Thanks.
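One thing I have been wondering about, though I have not tested it, is whether GNU split's --filter option could shuffle the file chunk by chunk as it streams, since the filter command writes to stdout instead of to output files (the 1,000,000-line chunk size is just an arbitrary choice here):

# Untested sketch: shuffle in 1,000,000-line blocks as a stream.
# Note this only shuffles within each block, not across the whole file.
split --lines=1000000 --filter=shuf HUGETEXTFILE.txt | some command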
Have you tried to use shuf --input-range=$LO-$HI? Instead of split ... you can give to shuf the range in line numbers... – Hastur – 2014-07-25T22:31:38.770

Well, I am trying to shuffle the whole file at once if possible. This just sounds like it would shuffle a range from the input file. – yarun can – 2014-07-25T22:45:07.323
Also, that argument just creates a bunch of random numbers; I am not sure if that is what I need. Can you elaborate, please? – yarun can – 2014-07-25T22:48:09.857
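Hastur's suggestion appears to amount to a decorate-sort-undecorate pipeline: generate a random permutation of line numbers with shuf -i, paste it next to the file, sort on it, and strip it off again. An untested sketch, assuming GNU coreutils and bash (with some command standing in for the downstream consumer, as in the question):

# Untested sketch: prefix each line with a shuffled line number,
# sort numerically on that number, then cut it off again.
# <(...) is bash process substitution.
N=$(wc -l < HUGETEXTFILE.txt)
paste <(shuf -i 1-"$N") HUGETEXTFILE.txt | sort -k1,1n | cut -f2- | some command

GNU sort still has to see the whole file, but it spills to temporary files, so this should work for files larger than RAM.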
Have you tried using shuf with the --output option, then using cat outfile.txt | some command? I know you said you didn't want to write hundreds of files, but this is only one, and the name can be reused, so you should only ever have one. – Tyson – 2014-07-25T23:03:36.400
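That is, something like the following, where outfile.txt is just an illustrative name that can be overwritten on each run:

# Write the shuffled file once, then feed it to the pipeline.
shuf HUGETEXTFILE.txt --output=outfile.txt
some command < outfile.txt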
You do know that there simply is no "efficient" way to shuffle a multi-GB text file (i.e. one that doesn't fit in RAM) - shuffling is an intrinsically expensive operation. – Eugen Rieck – 2014-07-25T23:04:47.990
@Eugen Rieck, I do not mind a script solution that splits the work up in multiple steps either. I just do not want to deal with many hundreds of files if possible. – yarun can – 2014-07-25T23:42:38.307
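Following up on that exchange: a standard out-of-core approach that keeps the file count small is to scatter lines at random into a fixed number of temporary buckets, shuffle each bucket in RAM, and stream the results back out. An untested sketch, where K, the temp directory, and some command are all placeholders:

# Untested sketch of an external shuffle with a fixed number of temp files.
K=16                 # bucket count; pick it so each bucket fits in RAM
tmp=$(mktemp -d)
# Scatter each line into a uniformly random bucket file.
awk -v k="$K" -v dir="$tmp" 'BEGIN { srand() } { print > (dir "/bucket." int(rand()*k)) }' HUGETEXTFILE.txt
# Shuffle each bucket in memory and concatenate the results.
for f in "$tmp"/bucket.*; do shuf "$f"; done | some command
rm -rf "$tmp"

Because each line lands in a random bucket and each bucket is then fully shuffled, concatenating the buckets should give a uniform shuffle of the whole file, while only K temporary files ever exist at once.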