Can storing 300k files in one folder cause problems?

1

I'm crawling a large website (over 200k pages) using wget (is there a better tool btw?). Wget is saving all the files to one directory.

The partition is HFS (I think). Will it cause problems if I have all the files in one dir? Assume I will access them only from the console (I know Finder has problems with dirs of more than 5k files).

Or is there perhaps a way to create a micro-partition that would be compressed and would allow fast, optimized access to this many files?

kolinko

Posted 2011-04-12T13:08:50.797

Reputation: 185

What flags are you using with wget? – Majenko – 2011-04-12T13:25:26.917

@Matt: -np, why do you ask? – kolinko – 2011-04-12T13:45:37.770

I usually specify -m - it keeps the file tree structure for me. I don't know the layout of the site you're scraping, but that might reduce the number of files in each directory. – Majenko – 2011-04-12T17:06:11.133

Answers

1

Regardless of whether the underlying file system can handle it, you REALLY should NEVER store that many files in one directory. When it comes time to browse the contents of that directory, you'll quickly discover that there is a HUGE amount of lag while the OS builds the file listing. It puts a significant amount of strain on the system.

Most tools that do any sort of "web archiving" will build a directory structure similar to the website's layout. Nearly all websites do not serve all their content straight off the root directory (i.e. mydomain.com/document-1); they have some logic behind it that splits things up into several paths (for a variety of reasons), e.g. images go in mydomain.com/images and stuff about goldfish goes in mydomain.com/goldfish/, etc.

There are several tools out there that can and will build this sort of directory structure for you; even wget has options to download an entire site. Personally, I've used httrack in the past, and it worked quite well. With wget, look at the -r (recursive) option, and make sure you set up your domain list so you don't follow links endlessly across multiple sites. Best do some reading up on the wget man page.
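
As a rough sketch (the flags you actually want depend on the site, and the domain and output directory below are just placeholders), a wget invocation that preserves the site's directory layout instead of dumping everything into one folder might look like:

wget --mirror --no-parent --convert-links --adjust-extension --wait=1 --directory-prefix=./mirror https://example.com/

Here --mirror implies recursive retrieval with timestamping, --no-parent keeps the crawl from climbing above the start URL, and --directory-prefix puts the whole tree under ./mirror with one subdirectory per path segment on the site.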

TheCompWiz

Posted 2011-04-12T13:08:50.797

Reputation: 9 161

Depends what you use to browse the directories. Any GUI client will probably be bad(TM), but I'm happy on Linux in a bash shell. – Pricey – 2011-04-12T14:15:23.197

@PriceChild I would agree... except it's not only GUIs... typically there are cron jobs that periodically run things like updatedb, and access over ftp/sftp/etc. can also use far more resources than necessary. It's amazing how much can be saved by simply splitting up a directory structure. Keep in mind... I did use a lot of "should" (TM) in this post. There are extenuating circumstances of course... but this is merely advice with an alternative solution. – TheCompWiz – 2011-04-12T14:20:34.410

Any suggestions on what to use instead? I'd like quick & easy access to the files from the console (I plan to run regexes and such on them) - I don't want to split the files into dirs, because writing shell scripts that analyse all the files would be a pain then. – kolinko – 2011-04-12T14:55:53.200

One word: egrep. Nearly all *nix tools have a recursive option to search all directories below a target... egrep -R some_word /some/path would search through every directory for "some_word" and return the appropriate results. Quick and easy are typically at odds: it can be quick but difficult to work with --==OR==-- easy but slow. It would help to know more about what exactly you're trying to accomplish. Perhaps a better option would be to throw the contents into an indexed database rather than working with raw files... – TheCompWiz – 2011-04-12T16:13:40.280
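
For instance (assuming the crawl sits in a single flat directory such as ./crawl and that the pages were saved with an .html extension - both are placeholders here), a recursive extended-regex search over the downloaded pages could look like:

egrep -R --include='*.html' 'some_pattern' ./crawl

or, if you prefer to stream the file list through another tool, find ./crawl -type f -print0 | xargs -0 egrep -l 'some_pattern' prints just the names of the matching files.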

You're right, egrep is what I need. Thanks, I will do as you say :) – kolinko – 2011-04-13T08:58:03.663

-1

Wikipedia states that HFS has a limit of 65,535 files per volume, so if your partition is indeed HFS you'll hit that well before 300k files.


From Wikipedia:

Additionally, the limit of 65,535 allocation blocks resulted in files having a "minimum" size equivalent to 1/65,535th the size of the disk. Thus, any given volume, no matter its size, could only store a maximum of 65,535 files. Moreover, any file would be allocated more space than it actually needed, up to the allocation block size. When disks were small, this was of little consequence, because the individual allocation block size was trivial, but as disks started to approach the 1 GB mark, the smallest amount of space that any file could occupy (a single allocation block) became excessively large, wasting significant amounts of disk space. For example, on a 1 GB disk, the allocation block size under HFS is 16 KB, so even a 1 byte file would take up 16 KB of disk space. This situation was less of a problem for users having large files (such as pictures, databases or audio) because these larger files wasted less space as a percentage of their file size. Users with many small files, on the other hand, could lose a copious amount of space due to large allocation block size. This made partitioning disks into smaller logical volumes very appealing for Mac users, because small documents stored on a smaller volume would take up much less space than if they resided on a large partition. The same problem existed in the FAT16 file system.

Pricey

Posted 2011-04-12T13:08:50.797

Reputation: 4 262

I believe this depends on the version of Mac OS that is being used. I think OS X (all versions) use a new partitioning system that mitigates this issue. – Joshua Nurczyk – 2011-04-12T13:17:20.527

5

Are you perhaps referring to HFS+? That has a max file count in the thousands of millions.

– Pricey – 2011-04-12T13:27:35.450

Yep, you got me, I was too lazy to look it up. That'll teach me. – Joshua Nurczyk – 2011-04-12T13:28:57.040

I'd probably be willing to bet 50p Merlin is using HFS+ rather than HFS though... :-) – Pricey – 2011-04-12T13:29:41.097

The drive is 300GB and was formatted recently, so it's most probably HFS+ :) – kolinko – 2011-04-12T13:44:42.530