
Note: This question is real-world, but to analyse it, please note I've begun from a "theoretical" starting point of device and bus capability, which I acknowledge will not usually be at all representative of in-use bandwidth utilisation.

I have an array of 18 x SAS3 mixed 8TB and 10TB enterprise drives, being configured as 6 sets of 3 way mirrors under ZFS (FreeBSD). Currently they are all hanging off a single 24 port HBA (9305-24i).

It's hard to know how many drives work at peak together, but assuming they were all in use for reading, I get the following calculation worst case (may not be realistic?):

SAS3 simplex bandwidth: (12 Gbit/sec) x (8/10 encoding) = 1.2 GB/sec raw data max
=> 18 x SAS3 maximum at peak: (1.2 x 18) = 21.6 GB/sec
But PCI-E 3.0 x 8 simplex bandwidth: 7.9 GB/sec

So at first glance, it seems that the array could be throttled very badly under demand, because the PCI-E link limits the array IO from 21.6 GB/sec down to 7.9 GB/sec each way: a loss of about 64% of HDD I/O capability.
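As a sanity check on the arithmetic above, here is a minimal Python sketch of the same theoretical calculation. The function and constant names are just illustrative, and the 128b/130b encoding figure for PCI-E 3.0 is my assumption for where the 7.9 GB/sec number comes from. These are link ceilings only, not what the drives will actually deliver.

```python
# Theoretical link ceilings only; real HDDs will not reach these figures.
SAS3_LINE_RATE_GBIT = 12.0      # SAS3 line rate per drive link, Gbit/s
SAS3_ENCODING = 8 / 10          # 8b/10b encoding, as in the question
PCIE3_LANE_RATE_GT = 8.0        # PCIe 3.0 rate per lane, GT/s
PCIE3_ENCODING = 128 / 130      # assumed 128b/130b encoding for PCIe 3.0


def sas3_link_gb_per_s() -> float:
    """Usable one-way bandwidth of a single SAS3 link in GB/s."""
    return SAS3_LINE_RATE_GBIT * SAS3_ENCODING / 8


def pcie3_slot_gb_per_s(lanes: int) -> float:
    """Usable one-way bandwidth of a PCIe 3.0 slot in GB/s."""
    return PCIE3_LANE_RATE_GT * PCIE3_ENCODING * lanes / 8


drives = 18
array_peak = drives * sas3_link_gb_per_s()   # all drive links saturated
slot_peak = pcie3_slot_gb_per_s(8)           # x8 slot used by the HBA
shortfall = 1 - slot_peak / array_peak       # fraction the slot cannot carry

print(f"array peak : {array_peak:.1f} GB/s")
print(f"slot peak  : {slot_peak:.1f} GB/s")
print(f"shortfall  : {shortfall:.0%}")
```

Changing `drives` or the lane count shows how the gap would move if the disks were split across two HBAs.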

On the other hand, the file server has 2 main consumers of its data: the file server itself, which needs to read and write at the highest speed as part of its file handling, and any other devices, which are linked by 10 GbE and hence can't consume more than about 2 GB/sec simplex even with 2-link aggregation. Therefore the network clients potentially can't use more than a fraction of the PCI-E link speed in any event.

(Even if I do some file management on the server itself via SSH, 2 GB/sec is still quite a good speed, and I might not complain.)

Also, whatever SAS3 might deliver in theory (12 Gbit/sec = 1.2 GB/sec), it seems unlikely an enterprise HDD can utilise the SAS bandwidth, even when reading flat out from its internal cache. SSDs yes, but HDDs? Less likely? Maximum read is usually quoted as around 200 - 300 MB/sec in datasheets.
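To put the datasheet figure in context, here is a rough sketch that assumes every one of the 18 drives streams sequentially at its quoted maximum at the same moment (an optimistic best case, not a measurement). The 7.9 GB/sec slot and 2 GB/sec network figures are the ones used above, and the names are only illustrative.

```python
DRIVES = 18
HDD_SEQ_MB_S = (200, 300)   # datasheet sequential range quoted above, MB/s
PCIE_SLOT_MB_S = 7900       # PCI-E 3.0 x8 figure used above, MB/s
NETWORK_MB_S = 2000         # aggregated 10 GbE figure used above, MB/s


def aggregate_mb_per_s(drives: int, per_drive_mb_s: int) -> int:
    """Best-case combined sequential rate if every drive streams at once."""
    return drives * per_drive_mb_s


for per_drive in HDD_SEQ_MB_S:
    total = aggregate_mb_per_s(DRIVES, per_drive)
    print(
        f"{per_drive} MB/s per drive -> {total} MB/s total, "
        f"fits in PCI-E slot: {total <= PCIE_SLOT_MB_S}, "
        f"exceeds network: {total > NETWORK_MB_S}"
    )
```

Under these assumptions the platters, not the SAS links, set the ceiling, and that ceiling sits between the network limit and the slot limit.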

My question is therefore, given the HBA can provide up to almost 8 GB/sec bandwidth across PCI-E, and the end users can consume at most 2 GB/sec, will there in fact be a throttling effect?

Put another way, does it matter that the disk array in theory is throttled from 22 GB/sec down to 8 GB/sec at the PCIE slot given the end users have a 2 GB/sec aggregated connection? Or will the PCI-E slot limitation still be an issue because the local system at times needs faster I/O than the end-device bandwidth would suggest?

If there is a limitation I can split the disks across 2 HBAs, but I'd like some idea how to assess if there's a real issue, before sacrificing a second PCIE slot to raise the bar on raw disk IO.

Stilez

1 Answer


Ah, did you ever bother to get away from the theoretical numbers? You state so nicely...

18 x SAS3 maximum at peak: (1.2 x 18) = 21.6 GB/sec

Yeah. Now show me a single hard disc (and you talk about HDD) that can deliver enough data to actually saturate its SAS3 link. Hint: the cache is not the disc.

Your argument breaks down when you look at the real data rates that hard discs can handle.

Quoting from "Max SAS throughput of a disk stack?":

So one SAS 10K HDD has ≈140 IOPS. With 8KB block it will be just 8 * 140 = 1120 KB/s of throughput.

Multiply that by 18 and you end up with a whopping 20160 KB/second. Rounded, 20 MB/second. That is about 0.1% of your theoretical bandwidth.
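For anyone who wants to replay that arithmetic, here is a small Python sketch. It takes the 140 IOPS and 8 KB figures from the quote as given, uses decimal units (1 GB = 1,000,000 KB), and compares against the 21.6 GB/sec theoretical SAS3 total from the question. The names are purely illustrative.

```python
IOPS_PER_DRIVE = 140        # random IOPS for one 10K SAS HDD (quoted figure)
BLOCK_KB = 8                # block size from the quote, KB
DRIVES = 18
SAS_AGGREGATE_GB_S = 21.6   # theoretical 18 x SAS3 total from the question


def random_io_kb_per_s(iops: int, block_kb: int, drives: int = 1) -> int:
    """Throughput in KB/s when every I/O is one block of block_kb."""
    return iops * block_kb * drives


per_drive = random_io_kb_per_s(IOPS_PER_DRIVE, BLOCK_KB)
total = random_io_kb_per_s(IOPS_PER_DRIVE, BLOCK_KB, DRIVES)
# Decimal units assumed: 1 GB = 1,000,000 KB.
share = total / (SAS_AGGREGATE_GB_S * 1_000_000)

print(f"per drive : {per_drive} KB/s")
print(f"18 drives : {total} KB/s")
print(f"share of theoretical SAS3 bandwidth: {share:.2%}")
```

Swapping in a larger block size shows how quickly the picture changes for sequential workloads, which is the case the comments below discuss.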

There is a reason SAS is good enough for handling hundreds of discs.

Now, this DOES change when you add SSD to the mix, and yes, then you are quite likely to blow out SAS. WITH ONE SSD. Which is why the U.2 form factor actually uses PCIe lanes PER SSD, and there are cases that handle 24 of them.

But as long as you do not talk SSD you basically are ignoring the fact that the throughput of the protocol layer (which is standardized) is absolutely irrelevant because your end devices are not capable of coming even close to start saturating this.

You do not have a throttling effect due to your bandwidth limit - you have one because you basically have a highway for some lonely trucks.

TomTom
  • I started at the theoretical, and I saw up front the point you make so well. The problem is, that this kind of thing has a lot of subtleties that I could miss. For example, latency issues? 4k or random or mixed IO issues? Stuff that happens if traffic is duplex not simplex? Disks running at 200+ MB/sec IO sequential, or SAS protocol overheads if they're running a lot of 4k? Stuff that happens if disks are reading from or writing to cache not platters? I can't begin to guess if any of those might mean that the obvious answer is incorrect in real world cases, as a result..... – Stilez Jul 07 '20 at 14:58
  • Nothing. See, cache is small on discs. VERY small. In RAID multi-user scenarios there is no sequential IO. SAS was made for large storage with a lot of parallel processes using it, and all sequential IO totally breaks down then. Sequential IO in this scenario (as well as cache) is a funny joke - a ghost you never see. The caches overload the moment someone looks at them ;) And without a UPS to back things up (like the RAID controller would have) you can not even use them as proper write caches. – TomTom Jul 07 '20 at 15:22
  • TomTom - thank you :) I know on the surface the question seemed naive - after all, who worries about an 8G PCIE limit when actual maximum HDD data rates are so low by comparison (and the point about cache size is really helpful, I'd never thought about 256MB as "tiny", but I guess it's under 1 sec of data so it is?) But there are enough uncertainties, including transient burst rate behaviours, 4k vs seq, RW mix, protocol/PCIE err detect/other data, and perhaps other unknown pipeline behaviours, to make it "maybe not actually a naive question to ask". I really appreciate the insight + confirmation – Stilez Jul 08 '20 at 10:40
  • I should say that like measuring router throughput in terms of minimal packets not bytes, I'd be looking for rate limiting at worst case - for example sequential streaming by a single client across 10G point to point. That would lead to parallel 1MB seq reads across all HDDs rather than 8K near-random (remember the pool is striped across 6 x 3 way mirrors so it might try to read seq from all at once). But I think your answer suggests that even in that case, and allowing for all worst cases of everything else, HDD limits will ensure we never get close to the 8G limit even so. Is that correct? – Stilez Jul 08 '20 at 10:49
  • Yes. Even in that extreme edge case you will not hit the limits. – TomTom Jul 08 '20 at 11:40
  • Well, that's one PCIE3 x8 slot usage, and a 2nd 9305-8i HBA, avoided ;-) – Stilez Jul 08 '20 at 14:08