9

We have to store basic information about 3 million products. Currently the info is in one 180 MB CSV file, which gets updated quarterly.

There will be about 30,000 queries per day, but each query is just a very simple key-value lookup. We only need to look up the product ID and display the rest of the information (which would all be in one record).

This is for the web, so fast performance is critical.

Should we use MySQL, even though we really don't need a relational database? Should we just generate 3 million static html files every quarter? Should we store a one line CSV for each product on something like Amazon S3 or Rackspace Cloud Files? What is the best way to do this?

Phil

9 Answers

15

Because MySQL is so widely supported and this is really quite a trivial thing to do, I would suggest going with it. Unless the server has at least a few GB of memory, I would stick with MySQL rather than use an in-memory system.

Once you start putting your data into a database, whether it's MySQL or something else, you will quite likely find more uses for it. Right now you're only talking about key/value pairs, but the rest of the data relating to your products must be stored somewhere. If that's not in a database, I can't imagine the data storage being very efficient.

Whatever you do, do not create those three million files. We have seen a number of questions here already resulting from the problems so many files create.

John Gardeniers
13

You can use a dedicated key-value type of NoSQL database, which is optimized for this kind of task. Have a look at:

  • Redis -- Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
  • MemcacheDB -- MemcacheDB is a distributed key-value storage system designed for persistence.
  • others (one such list can be found here: http://nosql-database.org/)

Of course you may use MySQL or any other relational database, but solutions specifically designed for key-value data should be a better fit (otherwise, what would be the point of designing them in the first place?), and they will likely also be a much smaller solution in terms of RAM and disk.
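
For example, with Redis each product could be stored as a hash keyed by its product ID. A minimal sketch using the redis-py client (the "product:<id>" key scheme and the field names are just illustrative assumptions):

    # Minimal sketch, assuming a local Redis server and the redis-py client.
    # The "product:<id>" key scheme and field names are illustrative only.
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Quarterly load: write each CSV row as a hash keyed by product ID.
    r.hset("product:1000001", mapping={
        "name": "Example product",
        "price": "19.99",
        "warehouse": "7",
    })

    # Per-request lookup: one round trip returns the whole record.
    record = r.hgetall("product:1000001")
    print(record)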

LazyOne
  • We could use Redis, but do you think this would work on a P4 with 2 gigs of RAM? – Phil Jul 03 '11 at 23:04
  • @Phil Considering your CSV file is around 180MB -- should be fine. Although we used it in a project (only once so far) with around 200K records and the server had 8GB of RAM, so it is difficult for me to compare. – LazyOne Jul 03 '11 at 23:11
6

And now for something completely different:

Given:

  • 180MB/3M products = 62 bytes/product on average.
  • 30,000 queries per day = 0.34 queries per second
  • Updated quarterly = essentially static data

An outside-the-box solution:

Dump each product as a TXT resource record and store it in the DNS, e.g.:

$ORIGIN products.example.com.

product_1_name IN TXT "product 1 description"
product_2_name IN TXT "product 2 description"
...
product_3000000_name IN TXT "product 3000000 description"
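
A client-side lookup is then a single TXT query. A minimal sketch using Python's dnspython package (record and zone names taken from the example above):

    # Minimal sketch, assuming dnspython 2.x and the example zone above.
    import dns.resolver

    answers = dns.resolver.resolve("product_1_name.products.example.com", "TXT")
    for rdata in answers:
        # A TXT record may carry several character strings; join them.
        print(b"".join(rdata.strings).decode())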

Benefits:

  • extremely reliable and trusted (you already depend on it every day)
  • can be built on pretty much any platform
  • pretty much every language has support for DNS queries in one form or another
  • open source and commercial servers support different kinds of backend databases
  • can be trivially replicated (just specify multiple name servers)
  • handles atomic updates, even when replicated across a dozen servers
  • can be cryptographically signed to ensure data integrity
  • can handle orders of magnitude higher query per second rates (10,000 queries per second are easily handled with commodity hardware)

Reasons why this might be a bad idea:

  • you need to search the data (DNS is purely key/value lookup)
  • you need to hide the data (DNS has no confidentiality)
Theobroma Cacao
  • If I could give bonus points for originality, this would get my vote. I wouldn't say DNS is reliable at all though, as on a typical home network it seems like magic if it works and a curse if it doesn't. – Martin Vilcans Jul 04 '11 at 21:25
  • I'm intrigued. I actually really like this idea, but for me, I'd go with something a bit more tried/tested like CouchDB – Tom O'Connor Jul 04 '11 at 23:53
  • Been watching some Monty Python? – Mark Henderson Jul 05 '11 at 03:18
  • Presumably this would be within an enterprise network. DNS reliability becomes an issue when packets have to brave the wilds of the Internet. Since, by default, DNS uses UDP, you have to rely on the DNS resolver's retransmission policy if a packet gets dropped. Within an enterprise network, the chances you'd get significant enough packet loss are (probably) negligible. And you can always force DNS to use TCP (albeit at a hit to performance, though not significant in this case). And I guarantee, the DNS gets more lookups than all CouchDB installations combined :-). – Theobroma Cacao Jul 08 '11 at 06:42
  • Captain Hindsight here. One word: blockchain. – datashaman Feb 25 '19 at 19:23
4

MySQL with MyISAM and some good indexes sounds perfect for this. There are a lot of other options, of course, but MySQL is very widely (if not universally) supported on any commercial web host. Depending on the speed you require, memcached might also be worth looking at, but without knowing the size of each key/value pair, storing 3 million of them in memory might be an even worse idea than a 180 MB CSV file (oh wait, it is a 180 MB CSV file, so we do know how big they are; they must be pretty small pairs, so memcached could be even better).

You do not want 3 million static HTML files; that will hurt your file system badly. A one-line CSV per product, even on S3, is going to have the same problem. Nobody wants 3 million files in a folder.
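
If you do pair MySQL with memcached, the usual pattern is cache-aside: check the cache first, fall back to the database, then populate the cache. A rough sketch in Python using the pymemcache and mysql-connector-python packages (the products table and its columns are assumptions):

    # Rough cache-aside sketch; assumes a `products` table keyed by product_id.
    # Uses pymemcache and mysql-connector-python; all names are illustrative.
    import json
    import mysql.connector
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    db = mysql.connector.connect(user="web", password="secret", database="catalog")

    def get_product(product_id):
        key = f"product:{product_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)

        cur = db.cursor(dictionary=True)
        cur.execute("SELECT * FROM products WHERE product_id = %s", (product_id,))
        row = cur.fetchone()
        cur.close()

        if row is not None:
            # Data only changes quarterly, so a long TTL is fine.
            cache.set(key, json.dumps(row, default=str), expire=86400)
        return row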

Mark Henderson
  • They are pretty small pairs...it's very basic data like price, manufacture date, warehouse number, etc. Less than 10 columns. So you think MySQL is the way to go, really? The server it's going to be running on is a P4 with 2 gigs of RAM- I think that should be fine? – Phil Jul 03 '11 at 22:57
  • @Phil - `So you think MySQL is the way to go, really?` - no, not really, but it's very flexible and as I mentioned, supported almost universally. However LazyOne has posted some good alternatives above. I couldn't remember the term NoSQL, but it was floating around in my brain somewhere – Mark Henderson Jul 03 '11 at 23:02
4

You could use Berkeley DB, which does exactly this sort of thing, even if it hasn't been hip since the dawn of Perl 5. Berkeley only supports key/value pairs; you tie the whole DB to a hash and access it as such.

Using Berkeley DB is well detailed in many of the older Perl references sitting on your shelf, or try the perldoc for the BerkeleyDB CPAN module. I generally avoid using Berkeley DB (although my employer has much ancient code in which it plays prominently, and some of the DBs are as large as yours), because it is no fun when your data gets more complex.
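
The same tie-a-hash-to-a-file idea also exists outside Perl; for instance, Python's standard library dbm module gives a dict-like interface over an on-disk key-value file. A minimal sketch (the file name and the pipe-separated record format are assumptions):

    # Minimal sketch: dict-like on-disk key-value storage via Python's stdlib dbm.
    # The file name and the pipe-separated record format are assumptions.
    import dbm

    # Quarterly build: write every product once.
    with dbm.open("products.db", "c") as db:
        db["1000001"] = "Example product|19.99|warehouse 7"

    # Per-request lookup (values come back as bytes).
    with dbm.open("products.db", "r") as db:
        print(db["1000001"].decode())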

brainbuz
  • BDB is old skool but *very* effective and appropriate for this situation. – womble Jul 04 '11 at 05:08
  • Beware of the license for Berkeley DB (http://en.wikipedia.org/wiki/Sleepycat_license): it requires ALL source code be made available, not just the DB part. – WolfmanJM Jul 04 '11 at 20:52
4

You've tagged your question with Amazon S3.

I'd like to draw your attention to one of their other related products called Amazon SimpleDB.
It sounds like the SimpleDB data model would fit well with your type of application.

This is not a plug for it, but it's worth looking at, especially if you're planning on using Amazon's cloud services.

The SDB data model resembles a spreadsheet.

See here for more info on it: http://aws.amazon.com/simpledb/ And the data model: http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/

hookenz
1

Even though 180 MB of data can be easily handled by any relational database, I would highly recommend MongoDB (http://www.mongodb.org/) over MySQL, Redis, MemcacheDB and other simpler key-value stores or relational databases. The reason is that for this sort of problem, MongoDB is the fastest, most expressive system to use, allowing super-fast dynamic updates with no schema restrictions, so your documents can have different formats if you like. I was at a presentation from guardian.co.uk the other day, and they have made a policy decision to ban all relational databases and use MongoDB exclusively for serving their news. You can get a feel for how fast their website is, and it has been online since 1995 (the oldest online newspaper in the UK). They have also gone through all sorts of bottlenecks in the past because of relational databases. For 180 MB, MongoDB will be serving everything from memory, so sub-millisecond response times are likely.
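
For what it's worth, the whole pipeline is only a few lines with the pymongo driver; a rough sketch (database, collection and field names are assumptions):

    # Rough sketch using pymongo; database/collection/field names are assumptions.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    products = client.catalog.products

    # Quarterly load: upsert each CSV row as a document keyed by product_id.
    products.create_index("product_id", unique=True)
    products.update_one(
        {"product_id": 1000001},
        {"$set": {"name": "Example product", "price": 19.99, "warehouse": 7}},
        upsert=True,
    )

    # Per-request lookup.
    print(products.find_one({"product_id": 1000001}))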

snez
0

There will be about 30,000 queries per day, but each query is just a very simple key-value lookup. We only need to look up the product ID and display the rest of the information (which would all be in one record).

You said that your queries are just simple key lookups. With binary search you need about 22 iterations in the worst case; with hashed keys your queries are even faster. Three million records is small as long as you avoid joins (or other Cartesian-product-type operations) and linear searches.
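
To make the hashed-keys point concrete: at this size you could even hold the whole CSV in an in-memory dict behind the web process. A toy sketch (assuming the first CSV column is the product ID):

    # Toy sketch: hashed-key lookups by loading the whole CSV into a dict.
    # Assumes the first column of products.csv is the product ID.
    import csv

    products = {}
    with open("products.csv", newline="") as f:
        for row in csv.reader(f):
            products[row[0]] = row

    # Each lookup is an O(1) hash probe; no disk access at all.
    record = products.get("1000001")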

I'd dare say pretty much anything would do fine. A load of 30,000 queries/day means that (assuming the load is constant throughout the day) you have a single query roughly every three seconds; that's not too bad.

I'd recommend implementing it first in the technology that you are most familiar with, and then measuring whether this is really the bottleneck of the system.

Lie Ryan
0

The best way to do this really depends on the quality and nature of your data and queries. For starters, 180 MB of data in a single table for products is not a problem, whichever way you look at it. And 30k queries per day is even less of a problem. With a properly configured database, any old desktop can handle this load.

Others have already pointed out your two major options, MySQL or a NoSQL database.

If you have a certain number of attributes that exist for every single product (such as manufacturer, price, warehouse number, etc.), then your best option is to have columns for these attributes and convert your key/value pairs into a flat table format, with the product ID as the primary key for that table. This will work very well even if some columns are only used by half of the rows, since for most products you will only need to run one query to retrieve all their attributes. Considering that this is data about products, I would guess that this is quite likely the structure of your data.
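
As a concrete illustration of that flat layout (column names here are made up from the attributes mentioned in the question, issued via mysql-connector-python):

    # Illustrative flat-table layout; column names are assumptions based on the
    # attributes mentioned in the question (price, manufacture date, warehouse).
    import mysql.connector

    db = mysql.connector.connect(user="web", password="secret", database="catalog")
    cur = db.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS products (
            product_id       INT UNSIGNED NOT NULL PRIMARY KEY,
            name             VARCHAR(255) NOT NULL,
            price            DECIMAL(10,2),
            manufacture_date DATE,
            warehouse_number SMALLINT UNSIGNED
        )
    """)

    # One indexed lookup per page view retrieves the whole record.
    cur.execute("SELECT * FROM products WHERE product_id = %s", (1000001,))
    print(cur.fetchone())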

If the attributes vary widely in presence and data type, then you might be better off using a NoSQL database, which handles this scenario more efficiently than traditional SQL databases.

Regarding performance: I previously worked for an e-commerce company where, for a long time, the website was served with data from a MySQL server. This server had 2 GB of RAM, the database was approximately 5 GB in total, and under peak load the server handled several thousand queries per second. Yes, we had done a lot of query optimization, but this is definitely doable.

wolfgangsz