Database scalability with write-heavy application

Question

I have a write-heavy application. The application is best compared to surveys - the customer creates custom questionares and this is saved to the database. Most of the requests are from their users submitting these forms. Later on our customers do complex reports and graphs on these submissions.

Making sure our application server (PHP) and the web server (Nginx) scales is quite easy, the trouble is scaling the database server onto multiple servers.

A lot of applications are more read heavy, so typically you'll have a master-slave replication setup where all writes go to a single master, but reads are distributed to the slaves. For us this doesn't work because we're doing writes most of the time.

I've seen mention of a master-master setup, but this typically hits a snag with auto incremented primary keys. The solution is typically to have one server do odd numbers, and the other do evens. I want to avoid that.

On some similar questions I've seen mention of the Tungsten Replicator and how it gives you a lot more flexibility with replication. Would this help me at all? What kind of benefits would this give me that MySQL's built in replication can not provide?

There is also MySQL Cluster, but this typically hits a snag with very large databases and complex queries (joins). I need to be able to run complex reports, so this probably won't work for me.

I'm looking for redundancy, automatic fail over, distributing requests, and data integrity.

Are there other RDMS that provide better solutions that are suitable for the web?

You may want to ask for recommendations on your database design on the [database SE site](http://dba.stackexchange.com/), as this doesn't sound like something that you should be having problems with unless there's a bad design or the hardware is woefully undersized. — Joe H., Sep 26 '11 at 14:05

score 7 · Accepted Answer · answered Sep 26 '11 at 08:32

There's no such a thing as a Grand Unified Database Layout. If there are custom questionaries, there, really, need to be custom tables. Otherwise you are on a quick path to a single-table-of-200-columns of VARCHAR(128)-with-no-primary-keys monstrosity out of thedailywtf.com, which is inefficient, unsupportable and will hurt you in the future.

Sharding, as recommended by toppledwagon may be a thing to consider, but first, double check, that your database is rationally designed. If it is not normalized, then have a very good, preferably backed by testing, reason, why it is not. If it has hundreds of tables, it's probably wrong. If it has single table, it is definitely wrong. Look at the ways you can divide your problem into independent sets. You will spend more effort up front, but the system will be better for it.

Million rows, with, let's say, 2k of data per row (which seems a lot of characters for a survey), is 2GB of memory. If you can throw a bit more hardware onto your problem, maybe you'll be able to keep your data set in RAM?

Which leads to the next question: What's your load in absolute numbers? Customer requests per second, translated to I/Os per second, divided into reads and writes per second, how many gigabytes of data, with what growth rate? How does your load scale with number of requests? Linearly? Exponentially? You don't have to publish your data, just write it down and think about it. What is it today, how do you think it is going to look in a year or two.

Wikipedia says a 15k rpm SAS drive will give you 175-210 IOps. How many do you need in RAID 10 to satisfy your current and projected load? How big is your data set? How many drives do you need to fit your dataset (probably a lot less than to meet the IOs requirement). Would buying a pair (or a dozen) of SSD be justifiable? Is local storage going to be just OK, or are you going to saturate two 8Gb fiber links to a high-end storage subsystem?

If currently you need 1k IOps, but have three 10k rpm HDDs in RAID 5, then there's no way your hardware will be able to satisfy your requirements. OTOH if your app has a user request per second and brings a 32 core 256 GB of RAM beast, backed by an enterprise-class storage to its knees, then chances are the problem lies not within hardware capabilities.

Sounds about right. Currently we have a single table that looks something like `survey_id`,`field_id`,`data`. The problem is `data` is a TEXT because it can be a text area or a drop down with numeric values. There are tens of millions of rows in this table, it's impossible to report on, and is just bad. We need a redesign. We're thinking about doing custom tables per survey, and custom db's per customer. Thinking about hiring a consultant dba. — Luke, Sep 26 '11 at 14:22
A consultant is good idea. You could also consider hiring a consultant IT Architect, who'd be able to create an overall design of the system. You cannot modify just the DB, because application won't agree with it. As you are going to modify the application as well, then having a rational design of the whole system will pay off in the future in terms of management and maintenance costs. — Paweł Brodacki, Sep 26 '11 at 14:32

score 1 · Answer 2 · answered Sep 26 '11 at 12:56

master-master setup, but this typically hits a snag with auto incremented primary keys

No - you just set up the auto-increment-increment and auto-increment-offset to avoid collisions

The solution is typically to have one server do odd numbers, and the other do evens. I want to avoid that.

Why? Surrogate keys, by their very nature are unrelated to the data they index. Assigning meaning to such values is very dangerous.

A quick look at the Tungsten link you provided does not reveal much about what it does - it does have a number of innacuracies (e.g. "you can do multiple masters replication, which is more than what you can do with MySQL native replication"). In the same paragraph it says that it can't handle conflicts. I'm not filled with confidence about the usefulness of this product.

Assuming that master-master replication (either with or without federation to limit replication) does not meet your requirements (but you need to re-examine your thinking about auto-increment field types) then you could shard the data between native clusters using mysqlproxy or use a nosql database.

score 0 · Answer 3 · answered Sep 26 '11 at 04:42

0

This sounds like a good case for sharding. If the data in one survey doesn't need immediate access to the data in another survey, then sharding your data will be easy. You'll setup a database that has basically a user ID key which points to a Survey DB. You can then setup multiple Survey DBs. Hopefully you'll also choose to set those up in a replicated tuples as well. Your application will need a bit of re-working.

Run your reports and do the joins in software. If that's also an option, sharding is the way to go.

answered Sep 26 '11 at 04:42

toppledwagon

4,215
24
15

Doing joins in PHP isn't an option. The database engine has to do the work. Theres a potential of doing a report on a million rows. Sounds like shading could be separating our customers into different databases, perhaps creating a custom table for each survey, then strategically hosting different databases on different servers. Is that a correct assumption? – Luke Sep 26 '11 at 04:57

Database scalability with write-heavy application

3 Answers3