
My sysadmin is telling me that we should remove old static files from a server and store them in a database instead because having too many files on a filesystem impacts the general performance of the system. Is the impact significant? We have about 20,000 files in a directory at the moment, and would expect to hit 100,000 sometime in the next few years. This is on a relatively recent Ubuntu LTS system. If 100,000 isn't significant, then what number would be?

Edit: This is different from Maximum number of files in one ext3 directory while still getting acceptable performance? because I don't care about directory performance, but rather about total system performance if the number of files on a system reaches an arbitrary number. In my specific case, the sysadmin is arguing that Apache will slow down due to the total number of files on the entire system.

samspot
  • As a side note, storing files in a DB sounds completely silly. Having a directory structure with fewer files per directory is the direction you probably want to look in (see the sketch after these comments). – Michael Hampton Sep 27 '13 at 19:28
  • @MichaelHampton I agree with you about the db. I edited my question to show that I am not concerned with directory performance at this time. – samspot Sep 27 '13 at 19:31
  • And, your sysadmin needs to go read that Q&A. – Michael Hampton Sep 27 '13 at 19:31
  • @MichaelHampton Any chance you can remove the duplicate tag now? – samspot Sep 27 '13 at 19:44
  • 2
    Your sysadmin is right. Sysadmins are always right when dealing with non sysadmins regarding sysadmin stuff they are in charge of. – TheCleaner Sep 27 '13 at 19:49
  • So at first you mention `having too many files on a filesystem impacts performance` but then `I don't care about directory performance` and finally `the sysadmin is arguing that apache will slow down due to the total number of files on the entire system.` So this is specifically about Apache? And furthermore, not about Apache seeking in directories that have a lot of files, but just Apache in general slowing down because the filesystem has a lot of files, outside of a web root? – Wesley Sep 27 '13 at 19:50
  • @Wesley The sysadmin is asserting that the entire filesystem slows down appreciably as the file count increases. So this last part is what I'm trying to discover the truth of: `Apache in general slowing down because the filesystem has a lot of files, outside of a web root?` I mentioned Apache to give context, though, so it is more of a general question about the filesystem. – samspot Sep 27 '13 at 20:04
  • @samspot Okay, I get it. So regardless of how files are distributed, you want to know if some arbitrary number of files on a volume will make performance lag. I edited the question just a little bit to add some more clarity and am voting to re-open. It takes a few more re-open votes. – Wesley Sep 27 '13 at 20:18
  • 2
    Also, can you inlcude what filesystem we're talking about? I want to assume ext4, but don't want to be over certain. Number of inodes used and available? Size of the volume? Type of volume (softrade/fakeraid, hardware RAID)? Type of Disks? – Wesley Sep 27 '13 at 20:26
  • @Wesley Thanks! Currently we are on ext4, although this discussion is for a hypothetical new system. I was hoping to get a more generalized answer if possible. The current system is using 26G of 62G. – samspot Sep 27 '13 at 20:36
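
To make the "fewer files per directory" comment above concrete, here is a minimal Python sketch of one common layout: hash each filename and spread the files over a two-level directory tree so no single directory grows large. The root path, the use of MD5, and the two-character bucket names are illustrative assumptions, not anything prescribed in this thread.

```python
# Sketch of a hashed directory layout (assumed: root path, MD5, two-level tree).
import hashlib
import os

STORE_ROOT = "/var/www/static"  # hypothetical root for the static files


def path_for(filename):
    """Map e.g. 'report.pdf' to '<root>/ab/cd/report.pdf' based on a hash of the name."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return os.path.join(STORE_ROOT, digest[:2], digest[2:4], filename)


def store(filename, data):
    """Write the bytes under the hashed path, creating the bucket directories as needed."""
    target = path_for(filename)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as f:
        f.write(data)


# 256 * 256 = 65,536 buckets, so even 100,000 files average one or two per directory.
print(path_for("report.pdf"))
```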

1 Answer


Since ext3, looking up a file in the file system is at least as fast as finding an indexed row in a database. The directory index used is called an HTree (many database indexes, for comparison, still use a B-tree).

http://en.wikipedia.org/wiki/HTree

Older file systems would start having problems at around 1,000 files per directory because the search was linear (start from the first entry and walk through the entire directory to find the file you were interested in).
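
To put a number on the claims above, here is a rough Python sketch (not a careful benchmark) that fills directories of increasing size and times a lookup by name; on an ext3/ext4 volume with dir_index the per-lookup cost should stay roughly flat. The file counts and names are arbitrary, and the kernel's dentry cache also helps, so treat any output as illustrative only.

```python
# Rough timing sketch: single-file lookup cost vs. directory size.
import os
import tempfile
import time


def avg_lookup_time(directory, n_files, probes=1000):
    """Create n_files empty files, then average the cost of an os.stat() by name."""
    for i in range(n_files):
        open(os.path.join(directory, f"file_{i:06d}"), "w").close()
    target = os.path.join(directory, f"file_{n_files // 2:06d}")
    start = time.perf_counter()
    for _ in range(probes):
        os.stat(target)  # the by-name lookup that hits the directory index
    return (time.perf_counter() - start) / probes


# Creating 100,000 files takes a little while; the counts mirror the question.
for count in (1_000, 20_000, 100_000):
    with tempfile.TemporaryDirectory() as d:
        print(f"{count:>7} files: {avg_lookup_time(d, count) * 1e6:.2f} µs per lookup")
```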

Why use a database, then?

PRO

You only need to transport the database from one computer to another (think of a cloud system...), which is especially convenient if you want to use automatic replication between computers.

CON

All the data you send to the database goes through the network, and everything you read comes back the same way. That can be a huge bottleneck. If you do not foresee using the replication feature of your database, then that alone is enough (for me) to avoid using the database: it will have a HUGE impact on your system. Use the file system directly, since the database will end up doing the same thing anyway: saving the data to files.
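
For a rough feel of that bottleneck, here is a hedged Python sketch that reads a file straight from disk and then fetches the same bytes through a loopback HTTP server standing in for "bytes coming out of a database over a connection". A loopback round trip still understates a real network hop, and the file size and request counts are made up for the example.

```python
# Compare a direct filesystem read with fetching the same bytes over a socket.
import functools
import http.server
import os
import socketserver
import tempfile
import threading
import time
import urllib.request

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "asset.bin")
with open(path, "wb") as f:
    f.write(os.urandom(1024 * 1024))  # 1 MiB of dummy static content


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, *args):  # silence per-request logging for the sketch
        pass


# Serve the directory over loopback HTTP in a background thread
# (a stand-in for pulling the bytes out of a remote database).
handler = functools.partial(QuietHandler, directory=workdir)
server = socketserver.TCPServer(("127.0.0.1", 0), handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

start = time.perf_counter()
for _ in range(100):
    with open(path, "rb") as f:  # direct read from the filesystem
        f.read()
print("direct file read:", time.perf_counter() - start, "s")

start = time.perf_counter()
for _ in range(100):  # same bytes via a socket round trip
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/asset.bin") as r:
        r.read()
print("over a socket   :", time.perf_counter() - start, "s")

server.shutdown()
```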

P.S. Your admin seems to be from the past...

P.P.S. "HTree indexes are available in ext3 when the dir_index feature is enabled." I use ext4, so I don't worry too much about that, although dir_index can also be turned off in ext4; hopefully it is turned ON on your server...
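
If you want to verify the dir_index flag on your own volume, tune2fs lists the enabled features; below is a small Python wrapper around it as an illustration. The device path is a placeholder (substitute the block device backing your filesystem), and tune2fs usually needs root.

```python
# Check whether dir_index is among the filesystem features reported by tune2fs.
import subprocess

DEVICE = "/dev/sda1"  # hypothetical device; substitute the one holding your files

out = subprocess.run(
    ["tune2fs", "-l", DEVICE],
    capture_output=True, text=True, check=True,
).stdout
features_line = next(
    line for line in out.splitlines() if line.startswith("Filesystem features:")
)
print("dir_index enabled?", "dir_index" in features_line)
```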

Alexis Wilke