8

I'm not an expert on SANs; I'm writing here to get some clues about a continuous and exasperating series of problems that our supplier seems unable to solve.

We own an ENHANCE ES3160P4 SAN with 16 × 2 TB disks, supplied for our video surveillance system. The supplier configured the SAN to use 14 disks in a RAID 5 array, with the remaining 2 disks as global spares. The RAID is usually divided into 2 virtual disks of equal size that together span the whole RAID space; each ends up being a bit more than 12 TB. Each virtual disk corresponds to a single LUN, attached to a single video server that continuously stores video data and lets users retrieve recordings when needed. The LUNs are formatted with NTFS and attached to Windows Server 2012 video servers through iSCSI. The video servers tend to fill nearly all the space available to them.
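For reference, a quick sketch of how the capacity works out (the formatted size actually reported will be slightly lower due to metadata and file-system overhead):

```latex
% Usable capacity of a 14-disk RAID5 built from 2 TB drives:
\underbrace{(14 - 1)}_{\text{data disks}} \times 2\,\text{TB} = 26\,\text{TB}
\qquad\Rightarrow\qquad
\frac{26\,\text{TB}}{2\ \text{virtual disks}} = 13\,\text{TB per LUN}
```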

With this configuration the SAN's disks keep failing again and again, and each time the SAN cannot rebuild the RAID because another disk fails in the meantime. We have lost the RAID about 4 times in the last few months.

This problem does not seem to be caused by a single bad unit, because we own three other machines of the same type, similarly configured, that show the same problems. Only one has no problems, but at the moment it is underused.

After some months of unspecified tests and checks, the supplier ended up saying that it is "well known" that the SAN should not be filled to 100% or it will degrade quickly, even physically, and that to solve the problem the virtual disks should be created leaving 10-15% of the total RAID space unallocated.

I searched the web for this problem and didn't find any specific statement supporting this claim. It seems to me that it would be more reasonable to create virtual disks spanning the whole RAID and then underuse the LUNs (that is, leave Windows some free space and avoid fragmentation). Otherwise, I don't understand why the ENHANCE SAN allows creating virtual disks that span the whole RAID if it's so "well known" that some free space must be left, or why the supplier configured the system like this in the first place... but that's another point.

In the end, we want to resolve this situation, and any suggestion is welcome. As I said, I'm not a SAN expert, but after so many problems I'd really like to understand whether the supplier knows what is going on, because we cannot accept this situation anymore.

Many thanks in advance! Regards

Edit (disk type): as the answer suggests this is relevant information, I'll add that the disks are all Western Digital model WD2001FYYG-01SL3.

z2k
    Any properly engineered system, if it needed reserve space in order to function properly, would reserve space without offering it for use by clients. Snapshots may need space and Copy-on-Write filesystems do, but those usually have a small reserve for those purposes. At least by default, which can of course be overridden by the users if they are willing to take the risk. – ptman Feb 11 '15 at 09:18
  • At least the disks look good, they are 24/7 SAS disks, but they shouldn't fail that often... – Sven Feb 11 '15 at 09:21
  • 4
    The issue is not free space, it is an idiotic configuration. 14 disks in a RAID 5 are not stable, mathematically, simple as that. Even RAID 6 may be taxed by it. Generally - a RAID with 2 TB disks is statistically not stable. Period. – TomTom Feb 11 '15 at 10:36
  • 1
    @TomTom: If you think it's simple mathematics, please do answer the question showing the math. My napkin math says the array is stable if reading 13*2TB to rebuild a degraded array is unlikely to fail. Raid 6 of course is better, that is stable if the rebuild is unlikely to encounter a double fault. – MSalters Feb 11 '15 at 12:11
  • @MSalters SERIOUSLY? Do some basic math - for example at http://www.raid-failure.com/raid5-failure.aspx ... 13x2 shows: The probability of successfully completing the rebuild is 12.5% .... something I would NOT call stable. – TomTom Feb 11 '15 at 12:18
  • @TomTom: That calculator actually says 100% chance, if you put in the WD2001FYYG error rates. I think you left the error rate setting at the default (which is the worst rate, for desktop drives) – MSalters Feb 11 '15 at 12:25
  • 4
    `With this configuration the disks of the SAN are failing and failing, and each time the SAN cannot recover the RAID because another disk fails in the meanwhile. We lost the RAID like 4 times in the last few months.` This is exactly because, as TomTom says, the disks are too big for RAID5. And probably RAID 6 too, FWIW. Your odds of a successful rebuild are nowhere near 100%, and you know this because you, yourself, stated that you've had "like 4" unsuccessful rebuilds in a matter of months. Your RAID config is idiotic and your vendor is incompetent, simple as that. – HopelessN00b Feb 11 '15 at 15:22
  • @MSalters Actually, it's *10* errors per 10^16 bits read, so 1 per 10^15. Which calculates to about 80% successful rebuilds. Quite a lot lower than would be desired. Otherwise, the disks are stated to have 1.2M hours MTBF... at 40C and 550 TB/year. Unless they are over-written daily (possible on a video server), hmm... how cool are they where they are installed? – Eugene Ryabtsev Feb 11 '15 at 16:46
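To put the napkin math from this thread in one place (a sketch, treating unrecoverable read errors as independent per-bit events): rebuilding the degraded 14-disk array means reading all 13 surviving disks, and the two URE rates discussed above yield exactly the two figures quoted:

```latex
% Bits read during a rebuild of the degraded array:
13 \times 2\times10^{12}\,\text{bytes} \times 8 \approx 2.1\times10^{14}\,\text{bits}

% WD2001FYYG datasheet rate (10 errors per 10^16 bits, i.e. 1 per 10^15):
P(\text{success}) = (1 - 10^{-15})^{2.1\times10^{14}} \approx e^{-0.21} \approx 80\,\%

% Desktop-class rate (1 per 10^14), the calculator's default:
P(\text{success}) = (1 - 10^{-14})^{2.1\times10^{14}} \approx e^{-2.1} \approx 12\,\%
```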

2 Answers

10

From what you describe, the main problem is that they decided to use RAID5 for such a large array, which is quite a bad choice for this setup, for exactly the reason you are experiencing: a second disk failing during recovery breaks everything, and with an array this large a second failure during the long rebuild is all too likely.

If they had used e.g. RAID6 instead, a second disk failing during recovery would not lead to a failed array and the rebuild could proceed normally, at the cost of one disk's worth of net storage capacity and a certain performance impact.
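To make the difference concrete, a rough sketch: while rebuilding one failed disk, a degraded RAID6 still has one disk of redundancy per stripe, so a single unrecoverable read error is repaired on the fly instead of killing the rebuild. The rebuild only fails if two faults overlap:

```latex
% RAID5, degraded: any single URE while reading the 13 survivors is fatal.
P_{\text{fail}}^{\text{RAID5}} \approx P(\text{any URE during rebuild})

% RAID6, degraded: a lone URE is corrected by the remaining parity;
% failure requires a second whole-disk death mid-rebuild, or two UREs
% landing on the same stripe -- both far less likely events.
P_{\text{fail}}^{\text{RAID6}} \approx P(\text{2nd disk dies during rebuild})
    + P(\text{two UREs on one stripe}) \ll P_{\text{fail}}^{\text{RAID5}}
```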

I can't see how leaving 15% free space would help with this problem at all, and while that might or might not be a good idea from a file-system performance point of view, it is clearly unrelated to the failing RAID. I call bullshit on that.

All that said, I can't help but wonder: having this happen multiple times over the course of a few months seems to be too much even for a RAID5 system. I would suggest looking into the disk types used - it just might be that your vendor used cheap desktop drives instead of 24/7 drives certified for use in such a system.
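As a rough check on that hunch (a sketch using the 1.2M-hour MTBF figure for these drives mentioned in the comments above):

```latex
% Annualized failure rate implied by the datasheet MTBF:
\text{AFR} \approx \frac{8760\ \text{h/year}}{1.2\times10^{6}\ \text{h}} \approx 0.7\,\%\ \text{per drive per year}

% Expected failures across all 16 drives over six months:
16 \times 0.007 \times 0.5 \approx 0.06\ \text{drives}
```

Several dead drives in a few months, across several units, is orders of magnitude above that, which points at the environment (heat, vibration, power) or a bad batch rather than bad luck.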

Sven
2

I fully understand this is an old post, but as I continue to see large RAID5 arrays in production, I would like to add my thoughts here.

  • disks failing too often are generally a sign of overheating and/or excessive vibration, which can be found in poorly-engineered systems or bad locations;

  • such large RAID5 arrays should be strongly avoided. As a general rule, it is much better to have a RAID6 array than a RAID5 + hotspare one. In the OP's case, rather than having 1x parity disk with 2x global hotspares, it would have been much better to have 2x parity disks in a RAID6 configuration;

  • it is key to have a reliable system for error and status reporting: an unknowingly degraded, unmonitored array is a recipe for disaster (see the sketch below for the kind of host-side check I mean).
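As an illustration of that last point, here is a minimal sketch of a host-side health poll, assuming a Linux host where the disks are directly visible and smartmontools is installed (a closed SAN appliance like the OP's would normally rely on its own SNMP/email alerting instead; the device names below are hypothetical):

```python
import subprocess
import sys

# Hypothetical device names -- adjust to whatever the host actually sees.
DISKS = [f"/dev/sd{letter}" for letter in "abcd"]

def is_healthy(dev: str) -> bool:
    """Return True if the drive's SMART overall-health check passes."""
    result = subprocess.run(
        ["smartctl", "-H", dev],  # '-H' prints only the overall-health verdict
        capture_output=True, text=True, check=False,
    )
    # ATA drives report "PASSED", SAS/SCSI drives report "OK".
    return "PASSED" in result.stdout or "OK" in result.stdout

failed = [d for d in DISKS if not is_healthy(d)]
if failed:
    print("ALERT: SMART health check failed on: " + ", ".join(failed), file=sys.stderr)
    sys.exit(1)  # non-zero exit code lets cron/monitoring raise an alert
print("All monitored disks report healthy.")
```

Run from cron, even something this simple turns a silently degraded array into a same-day alert.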

shodanshok
  • *continue to see large RAID5 arrays in production* "Bigger must be better!", right? I'd also add that such large arrays have ***HORRIBLE*** performance in general due to the poor geometry **and** contention between multiple LUNs shared from the same array, even if the arrays are built with RAID6. IME just about the largest arrays I'd recommend are 4+1 RAID5 and 8+2 RAID6. Some higher-end controllers can hide some performance issues with larger arrays, but the best controller ever won't help rebuild times. – Andrew Henle Feb 06 '18 at 11:39