
I have a script that runs each time Windows 7 starts up and backs up an MS SQL database to a Synology NAS loaded with 12 Seagate ST4000NM002A-2HZ101 hard drives. Every day at 3 am the backed-up database is "restored" to an MS SQL instance running in a Docker container on the Synology NAS. In the last 6 months I've had 8 out of the 12 hard drives fail. All failures have happened early in the morning, just after the database restore script has executed (restored the last backup to the MS SQL database running in the Docker container). All of the failed drives were from the original batch (no failures among the replacement drives). Have I received a dodgy batch of drives, or could restoring a corrupt database backup to a Docker container be causing the problem?

Dave M
Gavin

2 Answers


Ah...

In the last 6 months I've had 8 out of the 12 hard drives fail.

Ok, let's see...

12 Seagate ST4000NM002A-2HZ101 hard drives.

Classified as "Enterprise Drive for Bulk Data Applications"

I would be inclined to say that abusing them as performance database storage may not be smart. On the other hand, the data sheet claims "a 2 million hour MTBF rating and support workloads of 550TB per year" - confirming actual usage would require a SMART check, but those do not look like drives that should fail in 6 months to that degree.

If you bought them at the same time from the same shop, I would dare say you likely hit a very bad batch. They come with a significant warranty, so there should be no problem replacing them.

Yes, failure around a backup sounds normal - those are higher-stress situations - but 8 of 12 drives failing within 6 months is surreally high.
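To make "surreally high" concrete, here is the annualized failure rate (AFR) implied by the numbers in the question - plain arithmetic, nothing specific to the asker's setup:

```python
# Back-of-envelope annualized failure rate (AFR):
# failures per drive-year of operation.
failures = 8
drives = 12
years = 0.5  # six months in service

afr = failures / (drives * years)
print(f"AFR ≈ {afr:.0%} per drive-year")  # ≈ 133%
```

For comparison, a healthy fleet of enterprise drives is typically in the low single digits of percent per year.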

TomTom

An extraordinary causal claim - that a backup directly causes drive failures - requires more evidence than you have provided. The backup restore is likely the most strenuous workload on the array, but that doesn't tell you whether the failures stem from aging wear and tear, a manufacturing defect, or something else.

Hard drives are designed to last a few years and to keep uncorrected read errors below 1 per 10^14 bits. Backblaze data puts the annualized failure rate at scale at roughly 1%. One of your dozen drives failing in a year would be unremarkable; an annualized failure rate over 100% implies something is faulty.
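To put a number on "implies something is faulty": under a simple model where each drive fails independently at a ~1%/year fleet rate, 8 or more failures out of 12 in six months is astronomically unlikely. A sketch (the 1% figure and the independence assumption are simplifications):

```python
from math import comb

afr = 0.01                      # assumed fleet-average annualized failure rate
p = 1 - (1 - afr) ** 0.5        # per-drive failure probability over 6 months
n, k_min = 12, 8

# Binomial tail: probability of k_min or more failures among n drives,
# treating drive failures as independent events.
p_tail = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))
print(f"P(>= {k_min} of {n} fail in 6 months) ≈ {p_tail:.1e}")
```

The result is on the order of 10^-14, which is why a shared cause - a bad batch, backplane, or power problem - is the far more plausible explanation.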

Hardware replacement should eventually resolve hardware faults. Replacing the drives fixes it if the batch was faulty or the drives were aging. Replacing the NAS enclosure might help if the backplane is faulty; a UPS and a new power supply if power quality is poor. And so on.

John Mahowald
  • Thanks for taking the time. It's a RAID 6 array set up less than 12 months ago and connected to a UPS - only in production use since September 2020. Annoyingly, none of the replacement drives have failed, so it's looking like a dodgy batch. However, I have the exact same setup purchased at the exact same time and have had 1 failed drive on that array. The only difference is there isn't a mysql backup and restore task. – Gavin Apr 15 '21 at 19:02
  • Sorry, but you copy/paste irrelevant info. "keep uncorrected errors below 1 per 10^14" - NOT THOSE. 10^15 as per data sheet. "Backblaze data shows annual failure rate at scale to be roughly 1%." - they also do not use enterprise discs as they found this not worthy. Which on THEIR scale with THEIR redundancy makes sense, but this is not what is used here. My research shows a flat 0% failure rate within the expected lifespan for enterprise drives, assuming SMART usage. As in: Now the first (ever in 10 years) drives in a server array tells me it WILL fail. – TomTom Apr 17 '21 at 16:47
  • That was worst case data with consumer drives, which even by those standards this is a bad batch. If you have actual failure data from a statistically significant number of enterprise drives - enough drive hours so you see some failures - please share. – John Mahowald Apr 17 '21 at 21:01