82

I have a project that will generate a huge number of images: around 1,000,000 to start. They are not large images, so I will store them all on one machine at first.

How do you recommend storing these images efficiently? (The file system is currently NTFS.)

I am considering a naming scheme: to start, all the images will have incremental names from 1 up. I hope this will help me sort them later if needed, and throw them into different folders.

Which would be a better naming scheme:

a/b/c/0 ... z/z/z/999

or

a/b/c/000 ... z/z/z/999

Any ideas on this?

Jeff Atwood
s.mihai
  • Are they tied to specific users or just generic? –  Dec 17 '09 at 16:55
  • only generic. a bunch of images generated by some technical equipment. i am naming them incrementally from 1 up just to have an idea of a time reference. – s.mihai Dec 17 '09 at 16:57
  • how are they going to be used/accessed? through a bespoke app or what? – dove Dec 17 '09 at 16:58
  • i am storing in a database the Path of each image, when a users wants to access it the db gives me the path and then i load it from there. – s.mihai Dec 17 '09 at 16:59
  • what about limitations regarding the number of files stored in a single folder and that stuff ? – s.mihai Dec 17 '09 at 17:08
  • Is this you? http://i46.tinypic.com/1z55k7q.jpg –  Dec 17 '09 at 17:09
  • :)) yea... 1 mil. porn images :)) – s.mihai Dec 17 '09 at 17:11
  • Out of curiosity, is there any compelling reason to use NTFS rather than FAT16 or 32, in terms of performance? I am purely interested in knowing if you have run any benchmarking over various file systems to get the best. –  Dec 17 '09 at 17:36
  • no performance tests done. i just wanted to get this app going and this is what i had in hand. – s.mihai Dec 17 '09 at 17:53
  • Are you planning to delete lot of images? If so, is there a some pattern e.g. delete always oldest images, delete images randomly? – Juha Syrjälä Dec 17 '09 at 18:00
  • At this time there is no plan to delete files, ever. Deleting older files randomly to keep folders clean... i doubt there is an even distribution of this thing we call "random" – s.mihai Dec 17 '09 at 18:14
  • how do you intend to know how to locate a particular file? – janos erdelyi Dec 17 '09 at 18:47
  • like someone mentions below. i'll just recreate the filepath from the file name. look at "Juha Syrjälä" answers – s.mihai Dec 17 '09 at 18:53
  • @Mike, my point was actually that you create a sequence number for image (with database sequence or autoincrement column, for example) and then create filename and path from that sequence number. – Juha Syrjälä Dec 17 '09 at 19:00
  • oh i definitely advocate that idea (used similar in the past). if you do groupings of letters per directory branch, just be aware of density distribution and conditions for how many 'dangling' characters to use for the actual filename - and how to handle odd versus even length names. – janos erdelyi Dec 17 '09 at 19:01
  • Check out this question: http://stackoverflow.com/questions/1257415/best-way-to-store-retrieve-millions-of-files-when-their-meta-data-is-in-a-sql-dat You may gain some additional ideas over there... – Taptronic Dec 17 '09 at 19:05
  • @Juha - the filenames are already in sequence and unique – s.mihai Dec 17 '09 at 19:49
  • File systems are not generally a good storage mechanism for millions of small files as they tend to take up whole blocks of space (not just their true size.) If space becomes an issue AND if access will be infrequent and in batches, you may want to consider grouping a few hundred or thousand at a time into single files (.tar or .zip) – Chris Nava Dec 17 '09 at 22:09
  • @user28770 FAT will be a bad choice here. Since the file allocation table is linear, you have to search all possible entries to get a file. You also don't have an allocation bitmap which will result in significant fragmentation – phuclv Aug 15 '18 at 15:00

24 Answers

74

I'd recommend using a regular file system instead of a database. Using a file system is easier than using a database: you can use normal tools to access the files, and file systems are designed for this kind of usage. NTFS should work just fine as a storage system.

Do not store the actual path in the database. It is better to store the image's sequence number in the database and have a function that can generate the path from the sequence number, e.g.:

 File path = generatePathFromSequenceNumber(sequenceNumber);

This is easier to handle if you need to change the directory structure somehow. Maybe you need to move the images to a different location, maybe you run out of space and start storing some of the images on disk A and some on disk B, etc. It is easier to change one function than to change paths in the database.

I would use this kind of algorithm for generating the directory structure:

  1. First pad your sequence number with leading zeroes until you have at least a 12-digit string. This is the name for your file. You may want to add a suffix:
    • 12345 -> 000000012345.jpg
  2. Then split the string into 2- or 3-character blocks, where each block denotes a directory level. Have a fixed number of directory levels (for example 3):
    • 000000012345 -> 000/000/012
  3. Store the file under the generated directory:
    • Thus the full path and filename for the file with sequence id 12345 is 000/000/012/000000012345.jpg
    • For the file with sequence id 12345678901234 the path would be 123/456/789/12345678901234.jpg (a sketch of such a function follows below)
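
For illustration, a minimal Java sketch of such a generatePathFromSequenceNumber function, assuming three directory levels, three characters per level, a .jpg suffix, and an arbitrary base directory (all of these are assumptions added here, not part of the original answer):

import java.io.File;

public class ImagePaths {
    // Pads the sequence number to 12 digits, uses the first 9 digits as
    // three 3-character directory levels, and the padded number as the file name.
    public static File generatePathFromSequenceNumber(long sequenceNumber, File baseDir) {
        String name = String.format("%012d", sequenceNumber); // 12345 -> "000000012345"
        String dir = name.substring(0, 3) + File.separator
                   + name.substring(3, 6) + File.separator
                   + name.substring(6, 9);                     // -> "000/000/012"
        return new File(new File(baseDir, dir), name + ".jpg");
    }

    public static void main(String[] args) {
        System.out.println(generatePathFromSequenceNumber(12345L, new File("images")));
        // prints something like images/000/000/012/000000012345.jpg
    }
}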

Some things to consider about directory structures and file storage:

  • The above algorithm gives you a system where every leaf directory has a maximum of 1000 files (as long as you have fewer than 1,000,000,000,000 files in total)
  • There may be limits on how many files and subdirectories a directory can contain; for example, the ext3 file system on Linux has a limit of 31,998 subdirectories per directory.
  • Normal tools (WinZip, Windows Explorer, command line, bash shell, etc.) may not work very well if you have a large number of files per directory (> 1000)
  • The directory structure itself will take some disk space, so you do not want too many directories.
  • With the above structure you can always find the correct path for an image file just by looking at the filename, in case you happen to mess up your directory structures.
  • If you need to access files from several machines, consider sharing the files via a network file system.
  • The above directory structure will not work well if you delete a lot of files: it leaves "holes" in the directory structure. But since you are not deleting any files, it should be fine.
Juha Syrjälä
  • very interesting! splitting the filename ... i didn't thought of that. i assume this is the elegant way of doing it :-? – s.mihai Dec 17 '09 at 18:17
  • Using a hash (such as MD5) as the name of the file, as well as the directory distribution, would work. Not only would the integrity of the files be a side benefit to the naming scheme (easily checked), but you'll have a reasonably even distribution throughout the directory hierarchy. So if you have a file named "f6a5b1236dbba1647257cc4646308326.jpg" you'd store it in "/f/6" (or as deep as you require). 2 levels deep gives 256 directories, or just under 4000 files per directory for the initial 1m files. It would also be very easy to automate the redistribution to a deeper scheme. –  Dec 17 '09 at 19:41
  • +1 I just noticed this answer was similar to the one I just posted. – 3dinfluence Dec 17 '09 at 20:18
  • I definitely agree on using the filesystem and creating an artificial identifier to "slice" up into folder names. But you should also try to get a random distribution of identifiers, i.e. do not use a sequence number. That would allow you to have a more balanced tree of folders. In addition, with random distribution you can more easily partition the tree across multiple filesystems. I'd also use a ZFS based SAN with dedup turned on and a sparse volume for each filesystem. You could still use NTFS by using iSCSI to access the SAN. – Michael Dillon Aug 03 '10 at 15:32
  • If you go from right to left in step 2, the files are evenly distributed. Also you don't have to worry about not padding with enough zeroes, as you can have an unlimited number of files – ropo Jan 13 '16 at 08:45
  • I also use this technique, with two optional variations: 1. I convert the integer id to base-36 to have a more compact alphanumeric string; you can split the digits by two (1296 possibilities). 2. I reverse the string to have the less significant (right-most) digit at the top of the folder tree; in this way files fall evenly between folders – fustaki May 31 '16 at 10:21
  • It truly depends on your software architecture; some applications are better off with an equal distribution of files, while others are better with a sequence number. It really depends on the goal you're trying to achieve. – Sander Visser Sep 16 '16 at 11:41
  • I use md5sum 3-directory structure and it works great. As mentioned, it is a decent distribution and the added benefit of dedupe for exact duplicates rather than linking inodes of dupes every day as I used to do. – Aaron May 26 '17 at 14:01
  • this is a great answer and works well for a while, but if you're crazy about performance, it's important to group those "small files" into multiple blocks of bigger files, so you can have less disk access for things like file length, etc.. File.length can take 100ms on heavy used systems. – Rafael Sanches Nov 19 '17 at 07:13
  • @Juha, the zero prefix is an optional step, right? – Pacerier Nov 20 '17 at 22:22
  • @RafaelSanches I don't see how concatenating files into one big files saves you on disk access? Can you refer me to some article that goes in depth, perhaps one you wrote? – oligofren Dec 05 '17 at 10:35
  • An issue with this approach is that file names are predictable. If the files are accessible from the internet, a malicious hacker could enumerate them all with ease. – Juan Lanus Jan 26 '19 at 20:17
32

I'm going to put my 2 cents worth in on a piece of negative advice: Don't go with a database.

I've been working with image-storing databases for years: large (1 MB to 1 GB) files, often changed, multiple versions of the file, accessed reasonably often. The database issues you run into with large files being stored are extremely tedious to deal with, writing and transaction issues are knotty, and you run into locking problems that can cause major train wrecks. I have more practice in writing dbcc scripts and restoring tables from backups than any normal person should ever have.

Most of the newer systems I've worked with have pushed the file storage to the file system, and relied on databases for nothing more than indexing. File systems are designed to take that sort of abuse, they're much easier to expand, and you seldom lose the whole file system if one entry gets corrupted.

Satanicpuppy
  • yes. note taken ! – s.mihai Dec 17 '09 at 17:13
  • Have you looked at SQL 2008's FILESTREAM data type? It's a cross between database and file system storage. – NotMe Dec 17 '09 at 17:25
  • +1 on sticking with file server rather than a database as you are doing fast and infrequent IO operations. –  Dec 17 '09 at 17:33
  • What if you're just storing a few hundred docs or pics per database - any downside to using database for storage? – Beep beep Dec 18 '09 at 04:38
  • +1 ... a filesystem is kind of a "database" anyway (ntfs for sure), so why make it overly complicated. – akira Jun 01 '10 at 14:44
  • Not all filesystems are equal. Some are robust and identical to databases, some will give you corrupt data upon power loss. – Pacerier Nov 20 '17 at 22:25
13

I think most sites that have to deal with this use a hash of some sort to make sure that the files get evenly distributed in the folders.

So say you have a hash of a file that looks something like 515d7eab9c29349e0cde90381ee8f810. You could store it in the following location, using however many levels deep you need to keep the number of files in each folder low:
\51\5d\7e\ab\9c\29\349e0cde90381ee8f810.jpg

I've seen this approach taken many times. You still need a database to map these file hashes to a human-readable name and whatever other metadata you need to store. But this approach scales pretty well because you can start to distribute the hash address space between multiple computers and/or storage pools, etc.
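
A rough Java sketch of this idea, assuming the hash is the MD5 of the original file name and that two hex characters are used per directory level (the class, method, and parameter names are made up for illustration):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashedPaths {
    // Builds a relative path such as 51\5d\7eab9c29349e0cde90381ee8f810.jpg
    // from the MD5 hex digest of the original file name.
    public static String pathFor(String originalName, int levels) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(originalName.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            path.append(hex, 2 * i, 2 * i + 2).append('\\');   // one directory per hex pair
        }
        return path.append(hex.substring(2 * levels)).append(".jpg").toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pathFor("some-image-000001.jpg", 2));
    }
}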

3dinfluence
  • Git uses a similar approach: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects (to back this answer up) – mehov Oct 12 '16 at 21:08
12

Ideally, you should run some tests on random access times for various structures, as your specific hard drive setup, caching, available memory, etc. can change these results.

Assuming you have control over the filenames, I would partition them at the level of 1000s per directory. The more directory levels you add, the more inodes you burn, so there's a push-pull here.

E.g.,

/root/[0-99]/[0-99]/filename

Note, http://technet.microsoft.com/en-us/library/cc781134(WS.10).aspx has more details on NTFS setup. In particular, "If you use large numbers of files in an NTFS folder (300,000 or more), disable short-file name generation for better performance, and especially if the first six characters of the long file names are similar."

You should also look into disabling filesystem features you don't need (e.g., last access time). http://www.pctools.com/guides/registry/detail/50/
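
For reference, both of those NTFS settings can usually be toggled with fsutil from an elevated command prompt (this is an added hint, not part of the original answer; verify the exact behaviour on your Windows version before changing it):

fsutil behavior set disable8dot3 1
fsutil behavior set disablelastaccess 1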

Jason Yanowitz
  • +1 for disabling 8.3 filename generation and last access time; those were the first thing that came to mind when I read "huge number of [files]" and "NTFS" (Windows). – rob Dec 17 '09 at 22:23
  • link down........................ – Pacerier Nov 20 '17 at 22:27
8

Whatever you do, don't store them all in one directory.

Depending on the distribution of the names of these images, you could create a directory structure with single-letter top-level folders, each containing another set of subfolders for the second letter of the image name, and so on.

So:

Folder img\a\b\c\d\e\f\g\ would contain the images starting with 'abcdefg' and so on.

You can introduce whatever depth is appropriate.

The great thing about this solution is that the directory structure effectively acts like a hashtable/dictionary. Given an image file name, you will know its directory, and given a directory, you will know a subset of images that go there.
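
A small Java sketch of this prefix-based layout, assuming the first few characters of the file name are used as directory levels (class and method names are illustrative):

import java.io.File;

public class PrefixPaths {
    // "abcdefg0001.jpg" with depth 4 under root "img" -> img\a\b\c\d\abcdefg0001.jpg
    public static File prefixPath(File imgRoot, String fileName, int depth) {
        File dir = imgRoot;
        for (int i = 0; i < depth && i < fileName.length(); i++) {
            dir = new File(dir, String.valueOf(fileName.charAt(i)));
        }
        return new File(dir, fileName);
    }

    public static void main(String[] args) {
        System.out.println(prefixPath(new File("img"), "abcdefg0001.jpg", 4));
    }
}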

  • \a\b\c\d\e\f\ is what i am doing now; i was thinking there is a wiser way of doing this. – s.mihai Dec 17 '09 at 17:00
  • That's a generally accepted solution of how to physically store them. Clearly generating the image URL's is something that can be easily done dynamically based on the image file name. Also, to serve them up, you could even introduce img-a, img-b subdomains on the images server if you wanted to, to speed up loading times. –  Dec 17 '09 at 17:04
  • Wim - that's exactly what i am doing now, just thought there are some other folks who've hit this problem. – s.mihai Dec 17 '09 at 17:06
  • You might get better distribution by using the last character (or two, or three) rather than the first. – Mark Ransom Dec 17 '09 at 17:08
  • @Mark The point is illustrative. It depends on the distribution, as I mentioned. –  Dec 17 '09 at 17:10
  • And +1 for "don't store them all in one directory". I'm supporting a legacy system that has put over 47000 files on a server in a single folder, and it takes about a minute for Explorer just to open the folder. – Mark Ransom Dec 17 '09 at 17:10
  • Yep. Seen it too. :-o –  Dec 17 '09 at 17:12
  • Doing a\b\c\d\e\f\g makes the directory structure very deep and every directory contains only a few files. Better to use more than one letter per directory level, e.g. ab\cd\ef\ or abc\def\ . Directories also take up space on disk, so you do not want too many of them. – Juha Syrjälä Dec 17 '09 at 17:25
  • It's an illustration - the concept remains the same. It doesn't necessarily make your directory structure deep as it also depends on filename length. Just because I started with a,b,c doesn't mean we need 26 levels. –  Dec 17 '09 at 19:27
  • The problem with this approach is that you get hot spots. For instance almost no files will start with the letter z, u, q, etc. Better to use a file hash algorithm to evenly spread the files between your folders. – 3dinfluence Dec 17 '09 at 20:20
  • I had to support an application that had 4+million files all in one directory; it worked surprisingly well, but you could NEVER get explorer to open the folder, it would continually be sorting the new additions. +1 for NTFS being able to handle it without dying. – SqlACID Feb 23 '10 at 00:22
  • @SqlACID, 4 million in an NTFS folder? What's the max then? – Pacerier Nov 20 '17 at 22:29
6

We have a photo store system with 4 million images. We use the database only for metadata, and all images are stored on the file system using an inverted naming scheme, where folder names are generated from the last digit of the file name, then the next-to-last, and so on, e.g.: 000001234.jpg is stored in a directory structure like 4\3\2\1\000001234.jpg.

This scheme works very well with an identity index in the database, because it fills the whole directory structure evenly.
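
A small Java sketch of that inverted scheme, assuming four directory levels and a purely numeric file name with an extension (names and depth are illustrative):

import java.io.File;

public class ReversedDigitPaths {
    // "000001234.jpg" with 4 levels -> 4\3\2\1\000001234.jpg
    public static File pathFor(File root, String fileName, int levels) {
        String digits = fileName.substring(0, fileName.lastIndexOf('.'));
        File dir = root;
        for (int i = 0; i < levels; i++) {
            dir = new File(dir, String.valueOf(digits.charAt(digits.length() - 1 - i)));
        }
        return new File(dir, fileName);
    }

    public static void main(String[] args) {
        System.out.println(pathFor(new File("photos"), "000001234.jpg", 4));
        // -> photos/4/3/2/1/000001234.jpg (separator depends on the OS)
    }
}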

5

I would store these on the file system, but it depends on how fast the number of files will grow. Are these files hosted on the web? How many users would access these files? These are the questions that need to be answered before I could give you a better recommendation. I would also look at Haystack from Facebook; they have a very good solution for storing and serving up images.

Also, if you choose the file system you will need to partition these files into directories. I have been looking at this issue and proposed a solution, but it's not a perfect one by any means. I am partitioning by hash table and users; you can read more on my blog.

Lukasz
  • the images are not meant for frequent access. so there is no problem with this. their number will grow quite fast. i assume there will be the 1mil. mark in 1 month. – s.mihai Dec 17 '09 at 17:02
  • i'm interested in the programmer view so that i don't overthink this too much – s.mihai Dec 17 '09 at 17:03
  • So if you do not need fast access Haystack is probably not for you. Using Directories for Partitions is the simplest solution in my view. – Lukasz Dec 17 '09 at 17:06
5

The new MS SQL Server 2008 has a feature to handle such cases; it's called FILESTREAM. Take a look:

Microsoft TechNet FILESTREAM Overview

Padu Merloti
4

Quick point: you don't need to store a file path in your DB. You can just store a numeric value, if your files are named in the way you describe. Then, using one of the well-defined storage schemes already discussed, you can get the index as a number and very quickly find the file by traversing the directory structure.

Mr. Boy
4

Will your images need to be named uniquely? Can the process that generates these images produce the same filename more than once? It is hard to say without knowing what device is creating the filenames, but suppose that device is 'reset' and upon restart it begins naming the images as it did the last time it was 'reset' - if that is a concern.

Also, you say that you will hit 1 million images in one month's time. How about after that? How fast will these images continue to fill the file system? Will they top out at some point and level off at about 1 million TOTAL images, or will they continue to grow and grow, month after month?

I ask because you could begin designing your file system by month, then by image. I might be inclined to suggest that you store the images in a directory structure like this:

imgs\yyyy\mm\filename.ext

where: yyyy = 4 digit year
         mm = 2 digit month

example:  D:\imgs\2009\12\aaa0001.jpg
          D:\imgs\2009\12\aaa0002.jpg
          D:\imgs\2009\12\aaa0003.jpg
          D:\imgs\2009\12\aaa0004.jpg
                   |
          D:\imgs\2009\12\zzz9982.jpg
          D:\imgs\2010\01\aaa0001.jpg (this is why I ask about uniqueness)
          D:\imgs\2010\01\aab0001.jpg

Month, year, even day is good for security-type images. I am not sure if this is what you are doing, but I did that with a home security camera that snapped a photo every 10 seconds... This way your application can drill down to a specific time, or even a range, where you might think the image was generated. Or, instead of year and month, is there some other "meaning" that can be derived from the image file itself? Some other descriptors, other than the date example I gave?
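
A small Java sketch of the yyyy\mm layout above (the root directory and file name are only examples):

import java.io.File;
import java.time.LocalDate;

public class DatePaths {
    // Builds e.g. D:\imgs\2009\12\aaa0001.jpg from a date and a file name.
    public static File monthlyPath(File imgsRoot, LocalDate date, String fileName) {
        File dir = new File(new File(imgsRoot, String.format("%04d", date.getYear())),
                            String.format("%02d", date.getMonthValue()));
        return new File(dir, fileName);
    }

    public static void main(String[] args) {
        System.out.println(monthlyPath(new File("D:\\imgs"), LocalDate.of(2009, 12, 17), "aaa0001.jpg"));
    }
}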

I would not store the binary data in the DB. I've never had good performance or luck with that sort of thing, and I can't imagine it working well with 1 million images. I would store the filename and that is it. If they are all going to be JPGs then don't even store the extension. I would create a control table that stores a pointer to the file's server, drive, path, etc. This way you can move those images to another box and still locate them. Do you have a need to keyword-tag your images? If so, then you would want to build the appropriate tables that allow that sort of tagging.

You or others may have addressed these ideas while I was replying. Hope this helps.

Taptronic
  • 1. All files will be named uniquely. 2. The system will grow and grow; at first it will get to around 1 mil. images and then grow at a rate of a couple tens of thousands per month. 3. There will be some sort of tagging of the files at some point in the future; that's why I want to store some sort of identification data in the db. – s.mihai Dec 17 '09 at 18:29
3

While I haven't served pictures at that scale, I've previously written a small gallery app for serving ~25k pictures on a 400 MHz machine with 512 MB of RAM or so. Some experiences:

  • Avoid relational databases at all costs; while databases, no doubt, are smart about handling data, they aren't designed for this kind of use (we have specialized, hierarchical key-value databases for that, called file systems). While I have nothing more than a hunch, I'd wager that the DB cache goes out the window if you throw really large blobs at it. While my available hardware was at the small end, not touching the DB at all on image lookup gave orders of magnitude better speed.

  • Research how the file system behaves; on ext3 (or was it ext2 at the time - can't remember), the limit for being able to efficiently look up sub-directories and files was around the 256 mark, so keep only that many files and folders in any given folder. Again, a noticeable speedup. While I do not know about NTFS, something like XFS (which uses B-trees, as far as I remember) is extremely fast, simply because it can do lookups extremely fast.

  • Distribute data evenly; when I experimented with the above, I tried to distribute the data evenly over all directories (I took an MD5 of the URL and used that for directories; /1a/2b/1a2b...f.jpg). That way it takes longer to hit whatever performance limit there is (and the file system cache is void at such large datasets anyway). (Conversely, you might want to see where the limits are early on; then you would throw everything into the first available directory.)

Morten Siebuhr
3

I would be inclined to create a date-based folder structure, e.g. \year\month\day, and use timestamps for the filenames. If necessary, the timestamps can have an additional counter component in case the images are created so fast that there may be more than one within a millisecond. By using a most-significant-to-least-significant sequence for the naming, sorting, finding and maintenance are a breeze, e.g. hhmmssmm[seq].jpg

John Gardeniers
3

I am involved in a project that stores 8.4 million images over the course of a year to document the status of various devices. More recent images are accessed more frequently, and older images are rarely sought unless a condition is discovered that prompts someone to dig into the archives.

My solution, based on this usage, was to incrementally zip the images into compressed files. The images are JPGs, each approximately 20 kB, and they do not compress much, so the ZIP compression is set to none. This is done merely to concatenate them into one filesystem entry, which greatly helps NTFS in terms of speed when it comes to moving them from drive to drive or looking through the list of files.

Images older than a day are combined into a "daily" zip; zips older than a month are combined into their respective "monthly" zip; and finally anything over a year is no longer needed and consequently deleted.

This system works well because users can browse the files (either via the operating system or a number of client applications) and everything is named based on device names and timestamps. Generally a user knows these two pieces of information and can quickly locate any one of the millions of images.
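
A rough Java sketch of the daily-zip step, assuming one directory of JPGs per day and zero compression since the images barely compress (the paths and class name are illustrative):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class DailyArchiver {
    // Concatenates one day's JPGs into a single zip with no compression,
    // so the file system sees one entry instead of thousands.
    public static void archive(Path dayDir, Path zipFile) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            zos.setLevel(Deflater.NO_COMPRESSION);
            try (DirectoryStream<Path> images = Files.newDirectoryStream(dayDir, "*.jpg")) {
                for (Path image : images) {
                    zos.putNextEntry(new ZipEntry(image.getFileName().toString()));
                    Files.copy(image, zos);
                    zos.closeEntry();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        archive(Paths.get("D:\\imgs\\2009\\12\\17"), Paths.get("D:\\imgs\\2009-12-17.zip"));
    }
}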

I understand this is probably not related to your particular details, but I thought I would share.

JYelton
2

Might be late to the game on this, but one solution (if it fits your use case) could be file name hashing. It is a way to create an easily reproducible file path using the name of the file, while also creating a well-distributed directory structure. For example, you can use the bytes of the filename's hash code as its path:

String fileName = "cat.gif";
int hash = fileName.hashCode();
int mask = 255;
int firstDir = hash & mask;
int secondDir = (hash >> 8) & mask;

This would result in the path being:

/172/029/cat.gif

You can then find cat.gif in the directory structure by reproducing the algorithm.

Using HEX as the directory names would be as easy as converting the int values:

String path = new StringBuilder(File.separator)
        .append(String.format("%02x", firstDir))
        .append(File.separator)
        .append(String.format("%02x", secondDir))
        .toString();

Resulting in:

/AC/1D/cat.gif

I wrote an article about this a few years ago and recently moved it to Medium. It has a few more details and some sample code: File Name Hashing: Creating a Hashed Directory Structure. Hope this helps!

  • We store 1.8 billion items using something similar. It works well. Use a hash that's fast and has low collisions rates and you're set. – CVVS Apr 02 '18 at 15:41
2

Perhaps a creation-date-based naming scheme - either including all the info in the file name or (better for browsing later) splitting it up into directories. I can think of the following, depending on how often you generate images:

  • Several images generated each day: Year/Month/Day/Hour_Minute_Second.png
  • A couple a month: Year/Month/Day_Hour_Minute_Second.png

etc. You get my point... =)

Tomas Aschan
  • they are not continuously generated over time, so some folders will become fat and others stay... slim :)) – s.mihai Dec 17 '09 at 17:09
  • Well, you obviously don't have to create *each* folder, just because you're following this scheme. You could even have `Year/Month/Day/Hour/Minute` - decide how many levels of folders you need depending on how often the images are generated *when the rate is the highest* - and then just don't create folders that would be left empty. – Tomas Aschan Dec 17 '09 at 17:49
2

Are you considering disaster recovery?

Some of the proposed solutions here end up mangling the file name (such that if the physical file was moved you would lose track of what file it really is). I recommend maintaining a unique physical file name so that if your master list of file locations gets corrupted, you can regenerate it with a small shell, er, powershell, script ;)

From what I read here it sounds like all these files would be stored on one file system. Consider storing them across multiple file systems on multiple machines. If you have the resources, determine a system of storing each file on two different machines in case you lose a power supply and the replacement is 2 days out.

Consider what kinds of procedures you would need to create to migrate files between machines or file systems. The ability to do this while your system is live and online may save you considerable headache down the road.

You might consider using a GUID as a physical file name instead of an incremental number in case your incremental number counter (the database identity column?) gets messed up.
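
For example, in Java a GUID-based file name is one line (shown only to illustrate the suggestion; the .jpg extension is assumed):

import java.util.UUID;

public class GuidNames {
    // A physical file name that does not depend on any incremental counter.
    public static String newImageName() {
        return UUID.randomUUID().toString() + ".jpg";
    }

    public static void main(String[] args) {
        System.out.println(newImageName()); // e.g. 3f2504e0-4f89-41d3-9a0c-0305e82c3301.jpg
    }
}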

If appropriate, consider using a CDN such as Amazon S3.

Donald Byrd
1

I just ran a test on ZFS because I love ZFS, and I had a 500 GB partition with compression on. I wrote a script that generated 50-100k files and placed them in nested directories 1/2/3/4/5/6/7/8 (5-8 levels deep) and let it run for, I think, a week. (It wasn't a great script.) It filled up the disk and ended up with about 25 million files or so. Access to any one file with a known path was instant. Listing any directory with a known path was instant.

Getting a count of the list of files however (via find) took 68 hours.

I also ran a test putting a lot of files in one directory. I got up to about 3.7 million files in one directory before I stopped. Listing the directory to get a count took about 5 minutes. Deleting all the files in that directory took 20 hours. But lookup and access to any file was instant.

Stu
1

I see others mention a database, but I see no mention of it in your post. In any case, my opinion on this particular point is: either stick to a database or to the file system. If you have to mix the two, be careful about it; things get more complicated. But you might have to. Storing a million photos in a database does not sound like the best idea.

You might be interested by the following specification, most digital cameras follow it to manage file storage: https://en.wikipedia.org/wiki/Camera_Image_File_Format

Essentially, a folder is created, such as 000OLYMPUS, and photos are added to that folder (for example DSC0000.RAW). When the file name counter reaches DSC9999.RAW, a new folder is created (001OLYMPUS) and images are added again, resetting the counter, possibly with a different prefix (e.g. P_0000.RAW).

Alternatively, you could also create folders based on parts of the file name (already mentioned several times). For example, if your photo is named IMG_A83743.JPG, store it at IMG_\A8\3\IMG_A83743.JPG. It is more complicated to implement but will make your files easier to find.

Depending on the filesystem (this will require some research), you might be able to just dump all the images in a single folder, but, in my experience, this would usually cause performance issues.

Rolf
1

If you are on Windows, how about an exFAT file system?

http://msdn.microsoft.com/en-us/library/aa914353.aspx

It was designed with storing media files in mind, and it is available now.

Alex
1

If they are not ALL immediately required, you can generate them on the fly, and these are small images, why not implement an LRU memory or disk cache in front of your image generator?

This could save you the storage and keep the hot images served from memory.
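
A minimal in-memory LRU sketch in Java, using LinkedHashMap in access order (the key and value types are assumptions; a real cache would likely also bound memory by bytes rather than by entry count):

import java.util.LinkedHashMap;
import java.util.Map;

public class LruImageCache extends LinkedHashMap<Long, byte[]> {
    private final int maxEntries;

    public LruImageCache(int maxEntries) {
        super(16, 0.75f, true); // access-order gives LRU eviction behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruImageCache cache = new LruImageCache(10_000);
        cache.put(1L, new byte[]{ /* image bytes */ });
        byte[] image = cache.get(1L); // on a miss, regenerate or load from disk
    }
}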

Jé Queue
0

You might want to look at ZFS (the file system and volume manager from Sun).

0

A clean way to generate the path from a large number is to convert it into hex and then split it.

for example 1099496034834 > 0xFFFF121212 > FF/FF/12/12/12

public string GeneratePath(long val)
{  
    string hex = val.ToString("X");
    hex=hex.PadLeft(10, '0');
    string path="";
    for(int i=0; i<hex.Length; i+=2 )
    {
        path += hex.Substring(i,2);
        if(i+2<hex.Length)
            path+="/";
    }
    return path;
}

Store and load :

public long Store(Stream doc)
{
   var newId = getNewId();
   var fullpath = GeneratePath(newId);
   // store the stream at fullpath
   return newId;
}

public Stream Load(long id)
{
   var fullpath = GeneratePath(id);
   var stream = ... // open the file at fullpath for reading
   return stream;
}

Full source codes : https://github.com/acrobit/AcroFS

Ghominejad
-1

Unfortunately, filesystems are very bad at managing lots of small files (performance with many files per directory or deep directory trees, checking times on restart, reliability), so the solution above that involves ZIP files is best if you want to use a filesystem.

Using a database manager is by far the best option; a simple one like BDB or GDBM, for example; even a relational DBMS like MySQL would be better. Only lazy people who don't understand filesystems and databases (e.g. those who dismiss transactions) tend to use filesystems as databases (or, somewhat more rarely, vice versa).

-2

How about a database with a table containing an ID and a BLOB to store the image? Then you can add new table(s) whenever you want to associate more data elements with a photo.

If you're expecting to scale, why not scale now? You'll save time both now and later IMO. Implement the database layer once, which is fairly easy to start with. Or implement something with folders and filenames and blah blah blah, and later switch to something else when you start blowing up MAX_PATH.

jdmichal
  • Been there, done that, have the scars to prove it. Databases that store images in large numbers are cranky almost beyond belief, and require inordinate amounts of maintenance. Much better to store them in the file system unless you have a specific need that can only be answered by a database (ours was version tracking.) – Satanicpuppy Dec 17 '09 at 17:05
  • And there are lots of utilities to deal with files and file systems, few to none to deal with files within a database. – Mark Ransom Dec 17 '09 at 17:14
  • Oh God No. Please dont use a database as large BLOB storage. – Neil N Dec 17 '09 at 17:26
  • Eek. Didn't know that databases (still?) have so many problems with BLOBs. –  Dec 17 '09 at 18:56
  • How can such a bad solution that has so many comments still have a +1? no offence to the OP (I see it came from SO) but the downvote button is here for a reason! – Mark Henderson Jul 20 '10 at 23:08
  • Because it's not a universally bad solution. Sure, for most people I wouldn't recommend it, but there is a small subset of applications for which the benefits (yes there are some benefits) are actually quite important, like acid compliance, or replication, or as someone else mentioned having versioning. Though admittedly, this particular answer really didn't sell it very well and seemed to be recommending it for a the wrong reasons. – thomasrutter Feb 15 '13 at 13:07