Read n random lines from a potentially huge file

17

4

This challenge is about reading random lines from a potentially huge file without reading the whole file into memory.

Input

An integer n and the name of a text file.

Output

n lines of the text file chosen uniformly at random without replacement.

You can assume that n is in the range 1 to the number of lines in the file.

Be careful that the n numbers you sample from the range are uniformly distributed. For example, rand()%n in C is not uniform. Every outcome must be equally likely.
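To illustrate the uniformity warning, here is a minimal Python sketch. It is not part of the original challenge, and the helper names modulo_counts and uniform_below are made up for this example. The first function counts how often each residue appears when a source with source_size equally likely values is reduced with %, and the second shows rejection sampling, one common way to avoid that bias.

```python
import random


def modulo_counts(source_size, n):
    """Count how often each residue 0..n-1 occurs when every value in
    range(source_size) is reduced with % n."""
    counts = [0] * n
    for r in range(source_size):
        counts[r % n] += 1
    return counts


def uniform_below(n, rand_bits=random.getrandbits):
    """Return an integer drawn uniformly from range(n) by rejection sampling.

    rand_bits(k) is assumed to return a uniform integer with k random bits.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    k = (n - 1).bit_length()          # fewest bits that can hold n - 1
    while True:
        r = rand_bits(k) if k else 0  # k == 0 only when n == 1
        if r < n:                     # discard out-of-range draws, then retry
            return r


if __name__ == "__main__":
    # A source with 8 outcomes reduced mod 3 favours residues 0 and 1.
    print(modulo_counts(8, 3))
    print(uniform_below(3))
```

With 8 equally likely source values and n = 3, residues 0 and 1 each occur three times but residue 2 only twice, which is the kind of bias the challenge rules out.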

Rules and restrictions

Each line of the text file will have the same number of characters and that will be no more than 80.

Your code must not read any of the contents of the text file except:

  • Those lines it outputs.
  • The first line to work out how many characters per line there are in the text file.

We can assume each character in the text file takes exactly one byte.

Line separators are assumed to be 1 byte long. Solutions may use 2 bytes long line separators only if they specify this need. You may also assume the last line is terminated by a line separator.

Your answer should be a complete program but you can specify the input in any way that is convenient.

Languages and libraries

You can use any language or library you like.

Notes

There was a concern about calculating the number of lines in the file. As nimi points out in the comments, you can infer this from the file size and the number of chars per line.
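The inference described in this note can be sketched in Python as follows. This is an ungolfed illustration under the challenge's assumptions (fixed-width lines, 1-byte separators, last line terminated), not a submitted answer; the function name sample_lines is chosen for this example.

```python
import os
import random


def sample_lines(path, n):
    """Return n distinct lines of a fixed-width text file, reading only the
    first line and the lines that are returned."""
    with open(path, "rb") as f:
        width = len(f.readline())               # line length including its 1-byte separator
        total = os.path.getsize(path) // width  # number of lines, from the file size alone
        lines = []
        for i in random.sample(range(total), n):    # n distinct line numbers
            f.seek(i * width)                       # jump straight to line i
            raw = f.read(width).rstrip(b"\n")
            lines.append(raw.decode("latin-1"))     # one byte per character
        return lines


if __name__ == "__main__":
    import sys
    for line in sample_lines(sys.argv[1], int(sys.argv[2])):
        print(line)
```

os.path.getsize asks the file system for the size, so no file contents are read to count the lines, and random.sample accepts a range without building the full list of line numbers.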

Motivation

In chat some people asked if this is really a "Do X without Y" question. I interpret this to ask if the restrictions are unusually artificial.

The task of randomly sampling lines from huge files is not uncommon and is in fact one I sometimes have to do. One way to do this is in bash:

shuf -n <num-lines>

This is however very slow for large files as it reads in the whole file.

user9206

Posted 2016-05-05T08:25:23.930

Why the downvote? – None – 2016-05-05T08:39:35.183

This is trivial in languages like C that have fseek, and impossible in others. Additionally, what if n is greater than the number of lines in the file? – Mego – 2016-05-05T08:47:45.077

How can I skip lines? How can I know how many lines there are if I do not read them all? – Leaky Nun – 2016-05-05T08:50:08.470

@KennyLau "Each line of the text file will have the same number of characters and that will be no more than 80." You can calculate the exact position to jump to. – None – 2016-05-05T08:50:41.733

I don't even know how many lines there are. How can I calculate any position to jump to? – Leaky Nun – 2016-05-05T08:52:31.873

What if I overshoot? – Leaky Nun – 2016-05-05T08:53:55.803

In essence, this is a Do X without Y challenge. Attack drones are en route. – Mego – 2016-05-05T09:00:38.373

@Mego: regarding your point b): you can calculate the number of lines by dividing the file size by the length of a line. – nimi – 2016-05-05T09:01:59.013

@nimi A very good point! – None – 2016-05-05T09:02:29.673

I think this is fine, personally would add only one thing: line separators assumed to be consistent 1 byte long. Solutions may use 2 bytes long line separators only if they specify this need. – manatwork – 2016-05-05T09:22:24.763

Can we assume that the last line also has a line break at the end? – Fatalize – 2016-05-05T09:23:56.277

@Fatalize Yes please do. – None – 2016-05-05T09:24:22.107

Do X without Y is a warning that starts with "This isn't always bad". The main problem is artificial restrictions like "don't use +" which gives advantage to languages which have a sum(). Not reading a file into memory is a clear and consistent restriction which is in no way arbitrary. It can be tested with a file larger than memory, which cannot be worked around by language differences. It also happens to have real world applications (although that isn't necessary for a golf...). – trichoplax – 2016-05-05T09:52:33.523

It looks like this is actually a restricted complexity code golf where the memory usage is limited despite potentially huge files. It's not about not having certain things in your code but a limitation on how the code may act. – xnor – 2016-05-05T14:18:03.287

Where does it say all the lines are the same length? – feersum – 2016-05-05T18:40:35.550

@feersum " Each line of the text file will have the same number of characters and that will be no more than 80." – None – 2016-05-05T19:01:25.530

Should each line be output at most once, or is repetition allowed? In other words, are we outputting line X (where X is the output from the RNG) N times, or are we doing the equivalent of shuffling the lines and outputting the first N? – Mego – 2016-05-07T09:29:24.743

@Mego The latter. It should be functionally equivalent to the example bash one-liner I gave. – None – 2016-05-07T09:42:07.090

Just to clarify, is it expected that the output be in random order as well as being random lines, or is the order of the lines not important? – Robert Benson – 2016-05-12T13:39:59.547

@RobertBenson The order is not important. – None – 2016-05-12T16:40:19.143

Answers

6

Dyalog APL, 63 bytes

⎕NREAD¨t 82l∘,¨lׯ1+⎕?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍞⎕NTIE 0

Prompts for file name, then for how many random lines are desired.

Explanation

⍞ Prompt for text input (file name)
⎕NTIE 0 Tie the file using next available tie number (-1 on a clean system)
t← Store the chosen tie number as t
83 80,⍨ Append [83,80] yielding [-1,83,80]
⎕NREAD Read the first 80 bytes of file -1 as 8-bit integers (conversion code 83)
10⍳⍨ Find the index of the first number 10 (LF)
l← Store the line length as l
(⎕NSIZE t)÷ Divide the size of file -1 by the line length
⎕ Prompt for numeric input (desired number of lines)
? X random selections (without replacement) out of the first Y natural numbers
¯1+ Add -1 to get 0-origin line numbers*
l× Multiply by the line length to get the start bytes
t 82l∘,¨ Prepend [-1,82,LineLength] to each start byte (creates list of arguments for ⎕NREAD)
⎕NREAD¨ Read each line as 8-bit character (conversion code 82)

Practical example

File /tmp/records.txt contains:

Hello
Think
12345
Klaus
Nilad

Make the program RandLines contain the above code verbatim by entering the following into the APL session:

∇RandLines
⎕NREAD¨t 82l∘,¨lׯ1+⎕?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍞⎕NTIE 0
∇

In the APL session type RandLines and press Enter.

The system moves the cursor to the next line, which is a 0-length prompt for character data; enter /tmp/records.txt.

The system now outputs ⎕: and awaits numeric input; enter 4.

The system outputs four random lines.

Real life

In reality, you may want to give filename and count as arguments and receive the result as a table. This can be done by entering:

RandLs←{↑⎕NREAD¨t 82l∘,¨lׯ1+⍺?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍵⎕NTIE 0}

Now you make MyLines contain three random lines with:

MyLines←3 RandLs'/tmp/records.txt'

How about returning just a single random line if count is not specified:

RandL←{⍺←1 ⋄ ↑⎕NREAD¨t 82l∘,¨lׯ1+⍺?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍵⎕NTIE 0}

Now you can do both:

MyLines←2 RandL'/tmp/records.txt'

and (notice absence of left argument):

MyLine←RandL'/tmp/records.txt'

Making code readable

Golfed APL one-liners are a bad idea. Here is how I would write it in a production system:

RandL←{ ⍝ Read X random lines from file Y without reading entire file
    ⍺←1 ⍝ default count
    tie←⍵⎕NTIE 0 ⍝ tie file
    length←10⍳⍨⎕NREAD 83 80,⍨tie ⍝ find first NL
    size←⎕NSIZE tie ⍝ total file length
    starts←lengthׯ1+⍺?size÷length ⍝ beginning of each line
    ↑⎕NREAD¨tie 82length∘,¨starts ⍝ read each line as character and convert list to table
}

*I could save a byte by running in 0-origin mode, which is standard on some APL systems: remove ¯1+ and insert 1+ before 10.

Adám

Posted 2016-05-05T08:25:23.930

Reputation: 37 779

Ahh.. APL :) Is there some way to test this code in linux? – None – 2016-05-12T06:15:35.493

@Lembik Sure, this code is cross platform. Download from dyalog.com – Adám – 2016-05-12T06:18:13.173

As I don't read APL, could you explain the code? The tricky parts are sampling lines without replacement and jumping directly to the right place in the file to read the lines. – None – 2016-05-12T10:30:18.633

@Lembik That part is easy. ⎕NREAD's argument is TieNumber ConversionCode BytesToRead [StartByte]. It reads just the required bytes. The rest is just figuring out what to read. – Adám – 2016-05-12T21:13:28.873

@Lembik I'm curious why my answer didn't win the bounty. – Adám – 2016-05-14T21:59:39.810

@Nᴮᶻ I can't speak for Lembik but according to the bounty FAQ page if the bounty is not manually awarded by whoever initiated it, it will be awarded to the answer with the highest score – Value Ink – 2016-05-15T06:26:57.967

I feel a little guilty (and did ask on chat about this) but I awarded the bounty to the answer I liked most. To be fair I have accepted your answer as well. – None – 2016-05-16T04:06:56.207

7

Ruby, 104 94 92 90 bytes

File name and number of lines are passed into the command line. For example, if the program is shuffle.rb and the file name is a.txt, run ruby shuffle.rb a.txt 3 for three random lines.

-4 bytes from discovering the open syntax in Ruby instead of File.new

f=open$*[0]
puts [*0..f.size/n=f.gets.size+1].sample($*[1].to_i).map{|e|f.seek n*e;f.gets}

Also, here's a 85-byte anonymous function solution that takes a string and a number as its arguments.

->f,l{f=open f;puts [*0..f.size/n=f.gets.size+1].sample(l).map{|e|f.seek n*e;f.gets}}

Value Ink

Posted 2016-05-05T08:25:23.930

Reputation: 10 608

Below 100 bytes! Maybe Ruby is the best golfing language after all. Does 'sample' avoid repetitions? – None – 2016-05-08T05:16:02.437

@Lembik http://ruby-doc.org/core-2.2.0/Array.html#method-i-sample It does avoid repetitions. Don't tell me... was I supposed to have repetitions? – Value Ink – 2016-05-08T06:03:19.107

No you are perfect :) – None – 2016-05-08T07:08:42.017

Can you save any bytes by reading from stdin? ruby shuffle.rb 3 < a.txt gives you a seekable stdin. IDK Ruby, though. – Peter Cordes – 2016-05-14T04:33:41.917

@PeterCordes no luck, you get "undefined method `size' for #<IO:<STDIN>> (NoMethodError)" because we need file size to determine number of lines. Also, reading from STDIN is probably reading the whole file into memory in order to pipe it in, so I think it's against the rules – Value Ink – 2016-05-14T08:03:31.243

I was imagining that Ruby might let you do f=stdin instead of f=open.... In Unix, standard input is just another file descriptor that's open when your process starts. Seeking on it works just like any other file descriptor: fails if it's not open on a seekable file. POSIX lseek(2) / C fseek(3) either works or fails, there's no read-and-discard fallback. (And if stdin is redirected from a file, it will work. With cat | ruby, it would of course fail.) Like I said, I don't know Ruby, just the underlying Unix syscalls. – Peter Cordes – 2016-05-14T08:10:52.427

@PeterCordes That makes sense, but as mentioned, the point of failure is Ruby being unable to read the file size of stdin, so it didn't work out. – Value Ink – 2016-05-14T08:15:53.900

5

Haskell, 240 224 236 bytes

import Test.QuickCheck
import System.IO
g=hGetLine
main=do;f<-getLine;n<-readLn;h<-openFile f ReadMode;l<-(\x->1+sum[1|_<-x])<$>g h;s<-hFileSize h;generate(shuffle[0..div s l-1])>>=mapM(\p->hSeek h(toEnum 0)(l*p)>>g h>>=putStrLn).take n

Reads filename and n from stdin.

How it works:

main=do
  f<-getLine                   -- read file name from stdin
  n<-readLn                    -- read n from stdin
  h<-openFile f ReadMode       -- open the file
  l<-(\x->1+sum[1|_<-x])<$>g h -- read first line and bind l to its length +1
                               -- sum[1|_<-x] is a custom length function
                               -- because of type restrictions, otherwise I'd have
                               -- to use "toInteger.length"
  s<-hFileSize h               -- get file size
  generate(shuffle[0..div s l-1])>>=
                               -- shuffle all possible line numbers
  mapM (\p->  ...  ).take n    -- for each of the first n shuffled line numbers
     hSeek h(toEnum 0)(l*p)>>  -- jump to that line ("toEnum 0" is short for "AbsoluteSeek")
     g h>>=                    -- read a line from current position
     putStrLn                  -- and print

It takes a lot of time and memory to run this program for files with many lines, because of a horribly inefficient shuffle function.
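One way to avoid shuffling every line number is Floyd's sampling algorithm, which picks k distinct integers using memory proportional to k rather than to the number of lines. The following is a Python sketch of that technique, not a translation of the Haskell answer; the name floyd_sample and the injectable randrange parameter are assumptions made for this example.

```python
import random


def floyd_sample(total, k, randrange=random.randrange):
    """Pick k distinct integers from range(total) with O(k) memory
    (Floyd's algorithm). Returns a set, so the order is not random."""
    chosen = set()
    for j in range(total - k, total):
        t = randrange(j + 1)                  # uniform in 0..j
        chosen.add(j if t in chosen else t)   # on collision, take j itself
    return chosen


if __name__ == "__main__":
    # Five distinct line numbers out of a trillion, without building the range.
    print(floyd_sample(10**12, 5))
```

Every k-element subset is equally likely, and each iteration adds exactly one new element, so there is no open-ended retry loop. The result is an unordered set, which is acceptable here because the question states that output order is not important.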

Edit: I missed the "random without replacement" part (thanks @feersum for noticing!).

nimi

Posted 2016-05-05T08:25:23.930

Reputation: 34 639

Haskell rocks :) – None – 2016-05-05T19:22:34.980

How does it avoid picking a line which was already chosen? – feersum – 2016-05-05T21:47:07.357

@feersum: oh, I missed that part. Fixed. – nimi – 2016-05-06T09:41:47.237

@Lembik: implementing shuffle by hand is far too long. Of all the libraries with shuffle functions I've found, Test.QuickCheck requires the least overhead. Oh, and taking elements from a list of random numbers and discarding duplicates isn't deterministic, so you can't go that way (but I guess it's longer anyway) – nimi – 2016-05-06T10:21:54.507

Maybe there should be a separate challenge on sampling without replacement in small space. – None – 2016-05-06T11:17:14.293

3

PowerShell v2+, 209 bytes

param($a,$n)
$f=New-Object System.IO.FileStream $a,"Open"
for(;$f.ReadByte()-ne10){$l++}
$t=$f.Length/++$l-1
[byte[]]$z=,0*$l
0..$t|Get-Random -c $n|%{$a=$f.Seek($l*$_,0);$a=$f.Read($z,0,$l-1);-join[char[]]$z}

Takes input $a as the filename and $n as the number of lines. Note that $a must be a full-path filename, and assumed to be ANSI encoding.

We then construct a new FileStream object, and tell it to access the file $a with Open privilege.

The for loop .Read()s through the first line until we hit a \n character, incrementing our line-length counter each character. We then set $t equal to the size of the file (i.e., how long the stream is) divided by how many characters per line (plus one so it counts the terminator), minus one (since we're zero-indexed). We then construct our buffer $z to also be line length.

The final line constructs a dynamic array with the .. range operator.¹ We pipe that array to Get-Random with a -Count of $n to randomly select $n line numbers without repetition. The lucky numbers are piped into a loop with |%{...}. Each iteration we .Seek to the particular location, and then .Read in a line's worth of characters, stored into $z. We re-cast $z as a char-array and -join it together, leaving the resultant string on the pipeline and output is implicit at the end of the program.

Technically we should end with a $f.Close() call to close out the file, but that costs bytes! :p

Example

a.txt:
a0000000000000000000000000000000000000000000000001
a0000000000000000000000000000000000000000000000002
a0000000000000000000000000000000000000000000000003
a0000000000000000000000000000000000000000000000004
a0000000000000000000000000000000000000000000000005
a0000000000000000000000000000000000000000000000006
a0000000000000000000000000000000000000000000000007
a0000000000000000000000000000000000000000000000008
a0000000000000000000000000000000000000000000000009
a0000000000000000000000000000000000000000000000010

PS C:\Tools\Scripts\golfing> .\read-n-random-lines.ps1 "c:\tools\scripts\golfing\a.txt" 5
a0000000000000000000000000000000000000000000000002 
a0000000000000000000000000000000000000000000000001 
a0000000000000000000000000000000000000000000000004 
a0000000000000000000000000000000000000000000000010 
a0000000000000000000000000000000000000000000000006 

¹ Technically, this means we can only support a maximum of 50,000 lines, as that's the largest range that can be dynamically created in this manner. :-/ But, we can't just loop a Get-Random command $n times, discarding duplicates each loop, since that's not deterministic ...

AdmBorkBork

Posted 2016-05-05T08:25:23.930

Reputation: 41 581

2

Python 3, 146 139 bytes

from random import*
i=input
f=open(i())
l=len(f.readline())
[(f.seek(v*l),print(f.read(l)))for v in sample(range(f.seek(0,2)//l),int(i()))]
#print is here^

Input: [filename]\n[lines]\n

This solution heavily stole from @pppery but is python3.5 only and is a complete program.

Edit: Thanks to @Mego for the inline range and python3.x compatibility. Edit2: Clarification of where the print is, because I got two comments about it. (The comment is obviously not part of the code or the byte count.)

Alexander Nigl

Posted 2016-05-05T08:25:23.930

Reputation: 121

Thank you! Which part is python 3.5 only? – None – 2016-05-09T09:01:46.697

r=range(f.seek(0,2)//l) will work, which shaves off 3 bytes and removes the need for 3.5. Even better, shave off 3 more bytes by inlining the range call in the sample call. Additionally, this is not a complete program - you need to actually print the list. – Mego – 2016-05-09T09:19:08.457

@Lembik: It was 3.5 only because I used r=[*range(f.seek(0,2)//l)] because I thought I couldn't sample a generator. Turns out I could.

@Mego: It is complete because it prints every line inside the list comprehension print(f.read(l)) – Alexander Nigl – 2016-05-09T10:56:31.730

You do need a print statement though. – None – 2016-05-09T10:58:24.993

print is inside the list comprehension. – Alexander Nigl – 2016-05-09T11:01:04.247

Oops! Sorry about that. – None – 2016-05-09T13:50:37.330

2

Lua, 126 122

r=io.read;f=io.open(r())c=2+f:read():len()for i=1,r()do f:seek("set",c*math.random(0,f:seek("end")/c-1))print(f:read())end

Uses 2 bytes for line breaks. Change the 2 to a 1 for 1. I only have it as 2 because that's what my test file had.

Got myself under the PHP entry, but still 2nd place (currently). Curse you, Ruby entry!

Blab

Posted 2016-05-05T08:25:23.930

Reputation: 451

Lua was the first programming language I learned, and even with all I've learned since then, it's still my favorite. It's just so versatile for its ease to write. – Blab – 2016-05-12T03:51:26.427

2

Bash (well, coreutils), 100 bytes

n=`head -1 $2|wc -c`;shuf -i0-$[`stat -c%s $2`/$n] -n$1|xargs -i dd if=$2 bs=$n skip={} count=1 2>&-

Explanation

This avoids reading the whole file by using dd to extract only the portions we need. Unfortunately it ends up quite large with all the options we have to specify:

if is the input file
bs is the block size (here we set it to $n, which is the length of the first line including its separator)
skip is set to the random integers produced by shuf, so dd skips skip*bs bytes before reading
count is the number of bs-length blocks to return
2>&- closes stderr to suppress the statistics dd normally prints
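For readers less familiar with dd, the effect of the bs, skip and count=1 options on a regular file can be approximated in Python as below. This is only an illustrative sketch; the helper name read_block is made up, and it assumes the input is a seekable file so the skipped bytes are jumped over rather than read.

```python
def read_block(path, bs, skip):
    """Rough equivalent of `dd if=path bs=bs skip=skip count=1` for a
    seekable file: jump over skip blocks of bs bytes, then read one block."""
    with open(path, "rb") as f:
        f.seek(bs * skip)      # move to the start of the wanted block
        return f.read(bs)      # read exactly one block (one line here)
```

With bs set to the line length including its separator, read_block(path, bs, i) returns line i, which is what each xargs invocation of dd produces.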

We get the line length using head -1 $2|wc -c and the filesize with stat -c%s $2.

Usage

Save above as file.sh and run using file.sh n filename.

Timings

time ~/randlines.sh 4 test.txt
9412647
4124435
7401105
1132619

real    0m0.125s
user    0m0.035s
sys     0m0.061s

vs.

time shuf -n4 test.txt
1204350
3496441
3472713
3985479

real    0m1.280s
user    0m0.287s
sys     0m0.272s

Times above for a 68MiB file generated using seq 1000000 9999999 > test.txt.

Thanks to @PeterCordes for his -1 tip!

Dom Hastings

Posted 2016-05-05T08:25:23.930

Reputation: 16 415

I always love a bash answer but can you explain how this doesn't read the whole file? – None – 2016-05-13T10:28:51.340

@Lembik added explanation! – Dom Hastings – 2016-05-13T10:35:59.923

You can bs= instead of ibs=, since setting obs as well is fine. I guess you can't replace if=$2 with <$2 though, since this is still xargs's command line. \<$2 doesn't work either (xargs uses exec directly, without a shell). – Peter Cordes – 2016-05-14T04:00:22.120

I hope this isn't too much but I sort of love this answer :) Just tested it with a 1GB file. – None – 2016-05-14T15:23:02.967

Thanks for that @PeterCordes! -1 byte! I've golfed it down a little further, I tried using od instead of dd, but I can't get the output in a nice format, I got close with xxd -r -p but got some weird results. I might play with it again tomorrow! Glad you liked it @Lembik! I've just made it a bit shorter, but I still can't compete with APL or Ruby... Yet... – Dom Hastings – 2016-05-14T22:05:57.347

re: redirecting stderr to stdin: You could also close stderr with 2>&-, so there's no danger of the output going anywhere (e.g. if stdin happened to be a read-write file descriptor). It works with GNU dd: It still produces its stdout before trying and failing to write to stderr. – Peter Cordes – 2016-05-14T22:16:22.320

@PeterCordes Nice! Didn't think to close it fully, even better! Updated! – Dom Hastings – 2016-05-14T23:05:22.737

1

Python 3 - 161 160 149 bytes

from random import*;
def f(n,g):f=open(g);l=len(f.readline());r=list(range(f.seek(0,2)//l));shuffle(r);[(f.seek(v*l),print(f.read(l)))for v in r[:n]]

This code defines a function which is called like f(10,'input.txt')

pppery

Posted 2016-05-05T08:25:23.930

Reputation: 3 987

The challenge requires a full program, so I'm afraid a function definition is not enough. – nimi – 2016-05-07T21:08:27.567

To save a byte remove space between import and *. – mriklojn – 2016-05-07T22:14:36.810

@nimi Requiring a full program for this challenge seems to be arbitrarily overriding the default code format rules – pppery – 2016-05-08T00:22:55.577

@ppperry: yes, maybe, but that's just how it is. – nimi – 2016-05-08T00:30:23.160

To get the length of the file you could f.seek(0,2), that makes import os and os.stat obsolete. – Alexander Nigl – 2016-05-09T08:18:16.857

1

C# 259 bytes without duplicates

class Program{static void Main(string[]a){int c=Convert.ToInt32(a[1]);var h=File.ReadLines(a[0]);HashSet<int>n=new HashSet<int>();while(n.Count<c)n.Add(new Random().Next(0,h.Count()));for(;c>0;c--)Console.WriteLine(h.Skip(n.ElementAt(c-1)).Take(1).First());}}

Ungolfed

class Program
{
    static void Main(string[] a)
    {
        int c = Convert.ToInt32(a[1]);
        var h = File.ReadLines(a[0]);
        HashSet<int> n = new HashSet<int>();
        while (n.Count < c)
            n.Add(new Random().Next(0, h.Count()));
        for (; c > 0; c--)
            Console.WriteLine(h.Skip(n.ElementAt(c-1)).Take(1).First());
    }
}

File.ReadLines is lazy. This has the additional benefit that the lines can have different lengths.

Running it would be:

sample.exe a.txt 10000

C# 206 bytes with duplicates

class Program{static void Main(string[]a){var n=new Random();int c=Convert.ToInt32(a[1]);var h=File.ReadLines(a[0]);for(;c>0;c--)Console.WriteLine(h.Skip((int)(n.NextDouble()*h.Count())).Take(1).First());}}

Ungolfed

class Program
{
    static void Main(string[] a)
    {
        Random n = new Random();
        int c = Convert.ToInt32(a[1]);
        var h = File.ReadLines(a[0]);
        for (; c > 0; c--)
            Console.WriteLine(h.Skip((int)(n.NextDouble()*h.Count())).Take(1).First());
    }
}

Master117

Posted 2016-05-05T08:25:23.930

Reputation: 389

I don't fully follow your solution. If all lines have different lengths then the task is impossible. Also, how exactly are you randomly sampling lines without replacement? I apologise, my C# isn't good enough. – None – 2016-05-12T10:22:15.457

@Lembik You are right, I didn't think about duplicates. And I can count the number of lines and extract lines by line number, which is why lines may be of variable length. – Master117 – 2016-05-12T10:27:02.147

But you have to jump to a location in the file only knowing the line number don't you? You can't tell where that is unless all the lines have the same length. – None – 2016-05-12T10:28:42.803

@Lembik File.ReadLines("pathToFile") create a Lazy enumeration on all Lines of the File, File.ReadLines("pathToFile").elementAt(19) returns the 19th Line of the File. Kinda like a Map of all Linestarts. – Master117 – 2016-05-12T10:41:29.723

I don't think Lazy enumeration jumps (or seeks) in the file sadly. So it doesn't fit the rules currently. – None – 2016-05-12T10:46:58.673

To be a little clearer, your code reads (many) lines that it doesn't output which violates the rules of the challenge. – None – 2016-05-12T11:16:43.170

1

Python (141 bytes)

Keeps each line with equal probability, and works with pipes too. It doesn't respect the skip-ahead restriction of the question though...

Usage: cat largefile | python randxlines.py 100 or python randxlines.py 100 < largefile (as @petercordes pointed out)

import random,sys
N=int(sys.argv[1])
x=['']*N
for c,L in enumerate(sys.stdin):
    t=random.randrange(c+1)
    if(t<N):x[t] = L
print("".join(x))
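For comparison, the textbook form of reservoir sampling (often called Algorithm R) can be sketched as follows. This is an ungolfed illustration rather than the answer's code: the name reservoir_sample and the injectable randrange parameter are assumptions for the example, and unlike the snippet above it copies the first n lines straight into the reservoir before any random replacement happens. Like the answer, it still reads the entire stream, so it does not meet the challenge's restriction.

```python
import random


def reservoir_sample(lines, n, randrange=random.randrange):
    """Return n items chosen uniformly without replacement from an iterable
    of unknown length, holding at most n items in memory (Algorithm R)."""
    reservoir = []
    for count, line in enumerate(lines):
        if count < n:
            reservoir.append(line)        # the first n items fill the reservoir
        else:
            t = randrange(count + 1)      # uniform in 0..count
            if t < n:
                reservoir[t] = line       # replace a slot with probability n/(count+1)
    return reservoir


if __name__ == "__main__":
    import sys
    sys.stdout.write("".join(reservoir_sample(sys.stdin, int(sys.argv[1]))))
```

Filling the first n slots directly means the reservoir can never contain empty entries once at least n lines have been seen.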

topkara

Posted 2016-05-05T08:25:23.930

Reputation: 179

The whole point of this question is that you have to seek in the input stream. You should probably say that that's the part of the question's restrictions you're ignoring (although the read-from-a-pipe example usage makes that pretty clear). Reading from stdin with python ./randxlines.py 100 < largefile would be fine for a proper answer, though: in that case stdin will be seekable. – Peter Cordes – 2016-05-14T04:31:49.977