I'm the administrator of a social game which uses MySQL (Percona 5.1.56, to be precise) for data storage (all tables are InnoDB). There are about 2 million players in the game, and the database is about 100 GB and gradually growing. A few tables already have more than 500 million records.

The game DB runs pretty smoothly, even unsharded, on a single sufficiently powerful non-virtualized Debian 6 Linux server (24 GB RAM, hardware Adaptec RAID-10, with a couple of read-only slaves). The problem is that from time to time (once a month or two) MySQL crashes with data corruption like the following:

 InnoDB: Database page corruption on disk or a failed
 InnoDB: file read of page XXXX.
 InnoDB: You may have to recover from a backup.

Restoring from such errors is quite a painful process. It usually requires promoting one of the slaves to be the new master, directing traffic to this new master, and then building a backup slave for it. There is some downtime, which makes players really mad...
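For reference, when I can't just fail over, the dump-and-reload recovery path looks roughly like this. This is a sketch: `innodb_force_recovery` and the dump-and-reload procedure are documented MySQL behavior, but the paths and the exact sequence here are illustrative, not my literal runbook.

```shell
# 1) Put the server into forced-recovery mode by adding to my.cnf, under [mysqld]:
#
#      innodb_force_recovery = 1
#
#    Start at 1 and raise the level only as far as needed to get mysqld up;
#    levels 4 and above can permanently lose data, so treat them as a last resort.

# 2) Restart mysqld and dump everything while it is in recovery mode
#    (dump path is illustrative):
mysqldump --all-databases > /backup/dump.sql

# 3) Stop mysqld, move the damaged tablespace files aside, remove the
#    innodb_force_recovery line from my.cnf, restart with a fresh
#    tablespace, and reload:
mysql < /backup/dump.sql
```

The forced-recovery levels make InnoDB skip progressively more of its normal startup work (crash recovery, purge, etc.) so that a server that crashes on startup can at least stay up long enough to be dumped.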

The Percona folks told me it was the hardware's fault, and at first I thought the hardware was to blame too, but after I've changed several servers I don't know what to think anymore.

Is there any chance it's MySQL itself corrupting the data? I've already started looking at alternatives (e.g. PostgreSQL, or even something radical like Cassandra), but of course I know that every new product has its own baggage of bugs and quirks, not to mention the cost of migration...

I'm pulling my hair out (today I faced yet another crash), so if you have any ideas, please share...

pachanga
  • In my answer, I was suggesting that the InnoDB log files may be paging out data before it is properly committed. That's not a bug. This could possibly happen if InnoDB needs to hold a lot of data that can be rolled back and creates room in the log files to do it. If the log files are not big enough, I see this as a possibility. Sorry for stepping on any toes. Next time, I will ask for the settings rather than conjecturing. – RolandoMySQLDBA May 26 '11 at 17:01
  • Seriously, do you really believe it's normal behavior for a DB to _damage_ data? And you call it "not a bug"? – pachanga May 26 '11 at 19:47

1 Answer


We have been running MySQL (and the Percona builds, in the past) for several years with databases of up to 300 million rows and multiple read slaves. The only times I have seen this sort of issue, it was related to hardware: most frequently bad drives, bad drive controllers, or bad RAID controllers.

What kind of storage are you using? If you are using commodity hard drives, even in a RAID configuration, then at your I/O levels you are going to exceed typical MTBF figures.
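Before blaming MySQL, it's worth ruling the hardware in or out directly. A rough checklist, assuming smartmontools and Adaptec's `arcconf` utility are installed (device names and controller numbers below are illustrative, and drives behind a hardware RAID controller may need one of smartctl's `-d` passthrough options to be visible):

```shell
# SMART overall health and error counters for a physical drive
smartctl -H /dev/sda
smartctl -A /dev/sda    # watch reallocated and pending sector counts

# Adaptec controller status: logical and physical device state
arcconf getconfig 1 ld
arcconf getconfig 1 pd

# I/O and memory errors often surface in the kernel log as well
dmesg | grep -i -E 'error|ecc|i/o'
```

A drive can pass the controller's own status check while still quietly returning bad reads, so growing reallocated/pending sector counts or kernel I/O errors are the more telling signals.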

Craig