3

I'm using the following to count the number of files in a directory, and its subdirectories:

find . -type f | wc -l

But I have half a million files in there, and the count takes a long time.

Is there a faster way to get a count of the number of files that doesn't involve piping a huge amount of text to something that counts lines? It seems like an inefficient way to do things.

aidan
  • 615
  • 4
  • 10
  • 23
  • 1
    Is ls -1fR | wc -l any faster? – Sirex Nov 23 '10 at 10:37
  • What OS are you using? – Déjà vu Nov 23 '10 at 13:54
  • 1
    On most Unices when counting files like that the bottleneck is in querying the filesystem inode tables. Multiple find commands, or different commands querying the filesystem will generally not run any faster than one. Counting lines of text here is not the slow part, walking the inodes tables is. – Demosthenex Apr 25 '12 at 19:02
  • Dupe: http://stackoverflow.com/questions/1427032/fast-linux-file-count-for-a-large-number-of-files Doesn't look like there's an ideal solution – aidan Nov 23 '10 at 10:00
  • Piping is fast enough. The problem here is that reading from the disk is too slow. – Thorbjørn Ravn Andersen Dec 05 '14 at 12:50

7 Answers

9

If you have this on a dedicated file-system, or if the number of other files on the file-system stays roughly constant, you may be able to get a close enough count of the number of files by looking at the number of inodes in use via "df -i":

root@dhcp18:~# df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1            60489728   75885 60413843    1% /

On my test box above I have 75,885 inodes allocated. However, these inodes are not just files, they are also directories. For example:

root@dhcp18:~# mkdir /tmp/foo
root@dhcp18:~# df -i /tmp 
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1            60489728   75886 60413842    1% /
root@dhcp18:~# touch /tmp/bar
root@dhcp18:~# df -i /tmp
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1            60489728   75887 60413841    1% /

NOTE: Not all file-systems maintain inode counts the same way: ext2/3/4 all work, but btrfs, for example, always reports 0.

If you have to differentiate files from directories, you're going to have to walk the file-system and "stat" each one to see if it's a file, directory, sym-link, etc... The biggest issue here is not the piping of all the text to "wc", but seeking around among all the inodes and directory entries to put that data together.

Other than the inode table as shown by "df -i", there really is no database of how many files there are under a given directory. However, if this information is important to you, you could create and maintain such a database by having your programs increment a number when they create a file in this directory and decrement it when deleted. If you don't control the programs that create them, this isn't an option.
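For scripting, the in-use inode figure can be pulled straight out of df. A minimal sketch, assuming GNU df and using /some/path as a placeholder for the mount point you care about:

# Rough count via the inode table (includes directories, symlinks, etc.).
# -P keeps the output on one line, so the third field is IUsed,
# matching the listings above.
df -Pi /some/path | awk 'NR == 2 { print $3 }'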

Sean Reifschneider
  • 10,370
  • 3
  • 24
  • 28
  • I was just writing almost exactly the same thing, but you beat me to it, with nice examples and everything. :) One minor addition is that if directories need to be differentiated but don't change often, that number can be cached, or if precision isn't necessary, estimated. – mattdm Nov 23 '10 at 13:38
  • That's the kind of thing I'm looking for! – aidan Nov 24 '10 at 18:03
  • there is a separate inode for each regular file, directory, symbolic link, and named pipe, so depending on what kind of files you need to count `df -i` may not be appropriate. – Michael Martinez Apr 24 '18 at 16:46
3

I wrote a custom file-counting program for this StackOverflow question: https://stackoverflow.com/questions/1427032/fast-linux-file-count-for-a-large-number-of-files

You can find the GitHub repo here if you'd like to browse, download, or contribute: https://github.com/ChristopherSchultz/fast-file-count

Christopher Schultz
  • 1,056
  • 1
  • 11
  • 20
2

If you want to recursively count the number of files in a directory, the locate command is the fastest one I know of, assuming you have an up-to-date database (sudo updatedb, which is normally run via a daily cron job). However, you can speed the command up further by avoiding a grep pipe and letting locate do the counting itself.

See man locate:

-c, --count
       Instead  of  writing  file  names on standard output, write the number of 
       matching entries only.

So the fastest command is:

locate -c -r '/path/to/dir'
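If you only want entries below a particular directory, the regex can be anchored; a small sketch, with /path/to/dir as a placeholder and assuming the database is fresh:

sudo updatedb                     # refresh the locate database first
locate -c -r '^/path/to/dir/'     # count only entries below that directory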
abu_bua
  • 121
  • 4
1

I would also try:

find topDir -maxdepth 3 -printf '%h %f\n'

Then process the output, reducing it to a count for each directory.

This is especially useful if you already know the directory structure.
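A minimal sketch of that reduction, assuming GNU find and adding -type f so only regular files are tallied (topDir and the depth are the placeholders from the command above):

# %h prints each file's parent directory; sort | uniq -c then gives a
# per-directory file count.
find topDir -maxdepth 3 -type f -printf '%h\n' | sort | uniq -c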

jscott
  • 24,204
  • 8
  • 77
  • 99
Oren
  • 11
  • 1
1

If you have locate installed, you can use

locate -c "$PWD"

For more from locate, you can also play with

locate '/' | grep -c "^$PWD"

or, to get statistics for the whole filesystem:

locate -S

It will be much faster than find if you have many files.

The only drawback is that it also counts directories.

I also recommend using plocate: https://plocate.sesse.net/

1

Parallelize it. Run a separate find command for each subdirectory and run them all at the same time. You can automate this using xargs.
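A sketch of what that could look like with GNU find and xargs; the 4-way parallelism is arbitrary, and files sitting directly in the top directory are not included:

# One find per top-level subdirectory, up to 4 running at once; each
# prints its own file count, and awk sums them.
find . -mindepth 1 -maxdepth 1 -type d -print0 \
  | xargs -0 -P 4 -I{} sh -c 'find "$1" -type f | wc -l' _ {} \
  | awk '{ total += $1 } END { print total }'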

Michael Martinez
  • 2,543
  • 3
  • 20
  • 31
  • This is an interesting idea, but its performance will be highly dependent upon the physical disk you are searching. If you do this on a spinning disk, performance will drop way down as each separate process asks the disk to perform seeks to uncoordinated locations, while a single `find` may benefit from sequential access. On an SSD, this technique is likely to improve performance. – Christopher Schultz Jun 17 '19 at 14:50
0

Try this handy little Python script to see if it's any faster.

from os import walk
print(sum(len(files) for (root, dirs, files) in walk('/some/path')))

Andrew

Andrew M.
  • 10,982
  • 2
  • 34
  • 29