How does `wc -l` work?


I have to read a large file, and before I start reading it I need to know the total number of lines in the file (which runs into the millions).

I have tried a number of solutions and settled on one. But during my search I wanted to look at how `wc -l` works, and I couldn't find anything on Google.

Though I have found a solution to my problem, I would still like to know how `wc -l` works, since it can count the lines of a file with 92 million lines in a few seconds!

How?

detraveller

Posted 2013-07-08T17:03:55.810

Reputation: 585

http://lingrok.org/xref/coreutils/src/wc.c – Arjan – 2013-07-08T17:17:08.640

Answers

20

It reads the entire file and counts the number of line endings. Counting line endings is really cheap; most of the time is spent reading the file. If the file happens to be (mostly) in the buffer cache, that will be cheap too. Otherwise, it will depend on the speed of your file storage.

In other words, there is no magic.
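The idea can be sketched in a few lines of C. This is not wc's actual source (see the wc.c link in the comments for that); it is a minimal illustration of "read every byte, count the newlines", and the function name `count_lines` is made up for this sketch:

```c
#include <stdio.h>

/* Minimal sketch, not wc's real code: read every byte through
 * stdio's buffered getc() and count the '\n' characters seen. */
long count_lines(FILE *f)
{
    long lines = 0;
    int c;

    while ((c = getc(f)) != EOF)
        if (c == '\n')
            lines++;

    return lines;
}
```

Real implementations read much larger blocks at a time instead of going byte by byte through stdio, but the work done is the same: one pass over the file, one counter.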

rici

Posted 2013-07-08T17:03:55.810

Reputation: 3 493

It reads the entire file and counts the number of line endings? To get to a line ending, doesn't it basically read the whole line until it reaches the end? And that would mean it reads the whole file, right? – detraveller – 2013-07-08T23:48:29.580

@detraveller: yes, it reads the entire file, like I said. It doesn't read it line by line, or all at once, but it reads every character and counts how many of those characters are line-end characters. – rici – 2013-07-09T00:31:25.223

7

`wc` just reads the file in blocks of raw bytes (preferably in multiples of the natural block size of the underlying filesystem on which the file is located).
Then it just scans through the buffer counting the end-of-line characters. (It also counts spaces, tabs, form feeds and other special characters, in case you want output other than `-l`.)

Reading from disk is the costly part in terms of speed. Scanning the buffer takes negligible time compared to that.

Say you've got 90 million lines with on average 100 characters per line.
That is around 9,000,000,000 characters, or roughly 8.4 GiB.
A SATA drive sustaining on the order of 100 MB/s will stream that in about a minute and a half, even with some other activity going on at the same time.
An SSD on SATA 6 Gb/s, or a file that is already (mostly) in the buffer cache, brings that down to a few seconds.
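The block-read-and-scan loop described above can be sketched like this. It is a simplified stand-in for wc's real loop (which also tracks words, characters, etc.); the function name `count_newlines_fd` and the 64 KiB buffer size are illustrative choices, not values taken from wc:

```c
#include <string.h>
#include <unistd.h>

/* Illustrative buffer size; real implementations pick a size
 * suited to the filesystem's natural block size. */
#define BUF_SIZE (64 * 1024)

/* Sketch of the block-read approach: read the file descriptor in
 * large raw chunks and use memchr() to count '\n' bytes per chunk. */
long count_newlines_fd(int fd)
{
    char buf[BUF_SIZE];
    long lines = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        const char *p = buf;
        const char *end = buf + n;

        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            lines++;
            p++;            /* continue scanning after this newline */
        }
    }
    return lines;           /* read errors ignored for brevity */
}
```

Because `memchr` is typically heavily optimized (often vectorized by the C library), the scan is far faster than the disk can deliver data, which is why the disk remains the bottleneck.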

Tonny

Posted 2013-07-08T17:03:55.810

Reputation: 19 919

"it just scans through the buffer counting the end-of-line (\n) characters" -- "-l, --lines print the newline counts\n" -- extracted from wc.c – Rahul Patil – 2013-07-12T18:24:52.407

@RahulPatil Most implementations do a lot more than just counting newlines. See the example mentioned in the top comment above. That is the source of wc as used in the Linux core utilities. – Tonny – 2013-07-12T19:16:19.097

Yes, I have seen that. I only mentioned it because the question is about `wc -l`. Sorry. – Rahul Patil – 2013-07-12T19:18:11.003

3

Welcome to the world of free software. You can always look at the source code.

Although I must admit I'm not a C programmer, so I'm not the one who can really explain the code for you (and I would be interested myself).

What I know is that since wc does not open the file itself, but asks the OS to do it, this largely depends on the OS and, of course, on how the file is stored. Apart from that, I'd expect correct programming practices to be in place, e.g. not trying to read the whole file at once, etc.

Alois Mahdal

Posted 2013-07-08T17:03:55.810

Reputation: 2 014

What do you mean by saying 'not trying to read the whole file at once'? – detraveller – 2013-07-08T17:36:38.587

I mean loading the file into memory, say, into a single string / array. In the Perl community this is called slurping, and it's a quick & dirty solution which is OK when you know you will be reading only a few lines, but feeding a really huge file into memory at once is seldom a good idea. – Alois Mahdal – 2013-07-08T17:43:43.707

On the other hand, you can read, say, 64 KiB, count newlines and throw it away, repeat... That way you'll eat up something over 64 KiB at most, no matter how huge the file is. (It's less easy when you realize that a newline can be 2 bytes and thus become split between 2 chunks; now that's where the fun starts) – Alois Mahdal – 2013-07-08T17:45:54.653

Not too important, but: "since wc does not open the file itself, but asks OS to do it" -- not sure what you mean by that, but I doubt this is correct. It's certainly reading all characters by itself. – Arjan – 2013-07-08T17:49:41.577

@Arjan I mean that in the sense of "opening", not reading the bytes. IOW: it does not talk to filesystem driver or move the hard drive heads to the correct position :) It does not even care where/how the file is stored or cached. Yes, finally it does chew the bytes with its own teeth :) – Alois Mahdal – 2013-07-08T17:54:12.487

@Arjan Although, to be really correct: excluding embedded systems, programs hardly ever do the reading themselves; the whole point of the kernel and OS is that they do that job for them. In fact, open(), close(), read() (be it Linux, Windows, socket or file) are all system calls whose inner workings programs know nothing about. – Alois Mahdal – 2013-07-08T17:58:58.587