30

We will have a machine at work that, at peak performance, should be able to push 50 ("write heads") × 75 GB of data per hour. That's a peak write speed of ~1100 MB/s. To get that from the machine, it requires two 10 Gbit lines. My question is: what kind of server + technology can handle/store such a data flow?
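
For a quick sanity check, the arithmetic behind those figures looks like this (a back-of-the-envelope sketch; the ~1100 MB/s above just uses slightly different rounding):

```python
# Back-of-the-envelope check of the stated requirement.
write_heads = 50
gb_per_head_per_hour = 75                               # GB written by each head per hour

total_gb_per_hour = write_heads * gb_per_head_per_hour  # 3750 GB/h
required_mb_s = total_gb_per_hour * 1000 / 3600         # ~1042 MB/s sustained during peak

# One 10 Gbit/s link is roughly 1250 MB/s before protocol overhead, so a single
# link is marginal; two bonded links (not just failover) give comfortable headroom.
per_link_mb_s = 10_000 / 8

print(f"required ~{required_mb_s:.0f} MB/s, per 10GbE link ~{per_link_mb_s:.0f} MB/s")
```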

Currently we use ZFS for data storage, although write speed has never been a concern (we are nowhere near these speeds). Would ZFS (ZFS on Linux) be an option? We also need to store a lot of data; the "IT guide" suggests somewhere between 50 and 75 TB in total. So it probably can't be all SSDs unless we want to offer our first-born child.

Some additions based on the excellent replies:

  • the maximum of 50 × 75 GB/hour applies only during peaks lasting less than 24 h (most likely < 6 h)
  • we don't expect to reach that soon; most likely we will run at 5-10 × 75 GB/hour
  • it's a pre-alpha machine, but the requirements should be met (even though a lot of question marks are still in play)
  • we would use NFS as the connection from the machine to the server (see the mount sketch after this list)
  • layout: generating machine -> storage (this one) -> (safe RAID 6) -> compute cluster
  • read speed is therefore not essential, but it would be nice to read the data from the compute cluster (this is completely optional)
  • most likely the data will be large files (not many small ones)
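
Since NFS is the planned transport, here is a minimal sketch of what the wiring could look like; all hostnames, paths and options (storage.lab, generator.lab, /data, async) are placeholders, not the actual setup:

```python
# Hypothetical NFS setup between the generating machine and the storage server.
import subprocess

# On the storage server: export /data to the generating machine.
# 'async' favours throughput; 'sync' is safer but pushes latency onto the client.
with open("/etc/exports", "a") as f:
    f.write("/data generator.lab(rw,async,no_subtree_check)\n")
subprocess.run(["exportfs", "-ra"], check=True)

# On the generating machine: mount with large transfer sizes for streaming writes.
subprocess.run(["mount", "-t", "nfs", "-o",
                "vers=4.1,rsize=1048576,wsize=1048576,noatime",
                "storage.lab:/data", "/mnt/data"], check=True)
```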
SvennD
  • MB as in megabit or megabyte? Please use Mbit, MiB, MByte or MB to denote the unit. Also, two 10 Gbit lines will give you 2400 MByte/s – mzhaase Jan 04 '17 at 14:12
  • the 2 lines are most likely for "failover" (?) I adapted my question as requested, thanks. – SvennD Jan 04 '17 at 14:32
  • It is more clear now, thanks. Some more questions: peak performance is 1.1 GB/s, but what is the average? How long do these spikes last? And what is the *minimum* continuous throughput you are willing to accept? Is the write one large file or multiple small ones? What kind of protocol will be used? What kind of redundancy do you want? It sounds like some kind of medical or scientific equipment; can you maybe link the datasheet? Since you are already using ZFS, you could get in contact with a ZFS-specialized storage company, of which there are a couple. They could spec out a system for you. – mzhaase Jan 04 '17 at 14:38
  • It's a "pre-alpha machine", so there are no realistic specs available; most likely we expect to run **at best** 20-25x, and the first few years will likely be 5-10 × 75 GB/h. The files will most likely be single large files. We have different storage servers we will use for long-term storage; it's also probably cheaper to rerun than to store the data ... (so redundancy is not a priority) – SvennD Jan 04 '17 at 15:01
  • As you are going to use NFS, consider using [pNFS](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch09s02.html) (v4.1) for increased scalability – shodanshok Jan 04 '17 at 18:03
  • Two 10 Gb links in LACP using MLAG, or multiple link partners? I ask because I've seen it set up both ways and only one of them gives you 20 Gb total. – Aaron Jan 04 '17 at 18:41
  • Does this really have to be done with a single machine? Load balancing to multiple machines could make this easier. You could use shared storage or consolidate the data later. On AWS you might use an ELB, auto scaling, a number of EC2 instances, and EFS, but it sounds like you want hardware. Your question doesn't describe the use case very well. – Tim Jan 04 '17 at 19:09
  • Just use a SAN, or a DAS, or well, ask a vendor :) – yagmoth555 Jan 04 '17 at 20:02
  • @shodanshok pNFS seems to be for multiple clients & multiple servers; this is a single server <-> single client setup. But I need to read up on pNFS, thanks! – SvennD Jan 04 '17 at 21:03
  • @Aaron I contacted a network engineer for the setup of the network; I should ask. (Anything above gigabit links is new to me.) – SvennD Jan 04 '17 at 21:04
  • @Tim Due to the pre-alpha stage, yes; also due to NDA and other restrictions it has to be a server, and due to space constraints a single server would fit best. (server, switch, UPS, ...) – SvennD Jan 04 '17 at 21:05
  • @yagmoth555 I haven't got experience with one of those, any suggestions ? – SvennD Jan 04 '17 at 21:06
  • I'm used to EqualLogic SANs; as the I/O is split between all the disks, SSD or not, it can handle a large load, but we're talking about $25k of gear minimum. – yagmoth555 Jan 04 '17 at 21:19
  • Tie two M.2 SSDs in a RAID, and you will have all the read/write speed you need. Then buffer it to disks. – cybernard Jan 05 '17 at 01:51
  • Get 20 of http://www.newegg.com/Product/Product.aspx?Item=N82E16820147566 for $28,000 and you're done: 4 TB × 20 = 80 TB. As long as you get an awesome RAID card, your worries are over. – cybernard Jan 05 '17 at 02:00
  • Or 40 of these: http://www.newegg.com/Product/Product.aspx?item=N82E16820147441 . The price is the same and the write cycles are spread out even more, ensuring SSD wear leveling is not an issue for ages. Also, 40 × 500 MB/s = 20 GB/s approx. Done and done. – cybernard Jan 05 '17 at 02:05
  • Just a note: you don't need "peak" performance, you need "sustained" performance of 1.1 GB/s – jsbueno Jan 06 '17 at 15:17
  • @jsbueno You are correct; however, we can choose how many write heads to activate, so 1 GB/s is the "worst case", but considering that it might run for hours, it is effectively sustained performance. – SvennD Jan 06 '17 at 20:31

8 Answers

23

For such extreme write speeds, I suggest against ZFS, BTRFS or any CoW filesystem. I would use XFS, which is extremely efficient at large/streaming transfers.

A lot of information is missing (how do you plan to access the data? Is read speed important? Are you going to write in large chunks? etc.) to give specific advice; however, some general advice (a rough sketch of a few of these steps follows the list):

  • use XFS on top of a raw partition or a fat LVM volume (do not use thin volumes)
  • tune the I/O block size to efficiently cope with large data writes
  • use a hardware RAID card with a power-loss-protected write cache; if hardware RAID is out of the question, use a software RAID10 scheme (avoiding any parity-based RAID mode)
  • use two 10 Gb/s network interfaces with LACP (link aggregation)
  • be sure to enable jumbo frames
  • as you are going to use NFS, consider using pNFS (v4.1) for increased scalability
  • surely many other things...
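
A rough sketch of a few of these steps, assuming a hypothetical RAID10 block device /dev/md0, a mount point /data and a 10GbE interface named eth0 (all placeholders):

```python
# Sketch only: these commands need root and real device names.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# XFS on the raw device; su/sw should match the RAID stripe geometry
# (256 KiB stripe unit and 8 data disks are assumptions).
run(["mkfs.xfs", "-f", "-d", "su=256k,sw=8", "/dev/md0"])

# Mount options geared towards large streaming writes.
run(["mount", "-o", "noatime,largeio,inode64", "/dev/md0", "/data"])

# Jumbo frames on the 10GbE interface (the switch must allow MTU 9000 as well).
run(["ip", "link", "set", "dev", "eth0", "mtu", "9000"])
```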
shodanshok
  • Also, if using XFS, put the journal on an SSD RAID1 pair. – T. B. Jan 04 '17 at 16:26
  • If using a RAID card with a powerloss-protected writeback cache, the journal can be left on the main array: the write cache will absorb and coalesce the journal writes. Moreover, from what the OP describes, metadata load should be quite low compared to the data-streaming one. – shodanshok Jan 04 '17 at 17:53
  • ZFS would work just fine, and can go way faster than XFS. Sure, you'll need to set it up right, and have RAM and SSDs for the ZIL and SLOG, but that probably doesn't matter with the required speeds. – John Keates Jan 04 '17 at 20:55
  • While ZFS is a very good filesystem, I strongly doubt it can compete with XFS on a pure speed basis. For large/streaming writes, XFS is extremely efficient, showing basically zero overhead. Moreover, a CoW filesystem can show unexpected behavior when rewriting files, and is more prone to fragmentation. – shodanshok Jan 04 '17 at 21:08
  • This seems to be one of the easiest solutions; our ZFS experience was "okay" so far, but we never had any issues with XFS. The extra protection ZFS offers is also not really needed, as it's write/process/remove. I'm just wondering if a good RAID card can keep up with 1 GB/s write speed and if it's "boostable" using SSDs. Thanks for your answer. – SvennD Jan 04 '17 at 21:12
  • I view XFS on Linux as old technology. The OP could just as easily run ZFS atop hardware RAID. The reason I recommend ZFS is to allow incoming NFS synchronous writes to be absorbed by the SLOG at low latency without needing an all-SSD pool. – ewwhite Jan 04 '17 at 23:16
  • The write rate at peak would saturate a hardware RAID controller's cache (e.g. a 2GB write cache on an HP Smart Array controller), forcing most writes to go to disk. If these are synchronous writes due to the way the NFS clients are configured, the ZFS approach would offer a significant advantage over standard RAID. While a bunch of large SAS disks would satisfy the throughput requirement, latency and service time would be poor... Again, this depends on how the NFS exports and clients are configured (_sync versus async_) – ewwhite Jan 05 '17 at 06:54
  • Modern hardware RAID controllers have special provisions to efficiently handle large streaming writes without trashing the entire cache. Moreover, a speedy disk array should quite easily absorb sequential writes at a very fast rate. Sure, XFS does not have the same feature set as ZFS; however, when talking about streaming performance, it is very fast. – shodanshok Jan 05 '17 at 08:27
  • A Shelby Cobra is "old technology" but it can still smoke most cars out of the gate. ZFS was never designed as a high-performing filesystem to begin with, and although it is feasible to tune it such that it is blisteringly fast with a particular workload, it is not designed for it by default. It will take more hardware, a lot more memory, and a lot of tuning to get it to beat what XFS gives you for free with a few mount and formatting options. – T. B. Jan 05 '17 at 18:38
  • In addition, if we are talking platter hard drives for data, then putting the journal on SSD makes more sense than just using a hardware RAID write-back controller, assuming you have more than one I/O bus handling the load, so journal writes go out in parallel with data writes. In addition, SSDs generally have buffering as well, so such a setup will absolutely smoke a single-channel setup, assuming we are talking about a system that isn't tuned to the hilt, which again takes a lot of effort and time. XFS is the best solution for this, assuming you want to have some time to eat lunch. – T. B. Jan 05 '17 at 18:38
  • @T.B. _"What XFS gives you for free..."_ Nah, not really. See the spec listed in [my answer](http://serverfault.com/a/824113/13325). That's not a tremendous amount of hardware. ZFS always outperforms my XFS systems because of strategic use of CPU, RAM and caching resources. And you get compression. While I was a proponent of XFS going back to 2003, the workload described by the OP is an easy ZFS win. – ewwhite Jan 05 '17 at 21:30
  • My experience tells me otherwise, so I suppose we are at an impasse, and I'll leave it to the OP. – T. B. Jan 06 '17 at 02:39
19

Absolutely... ZFS on Linux is a possibility if architected correctly. There are many cases of poor ZFS design, but done well, your requirements can be met.

So the main determinant will be how you're connecting to this data storage system. Is it NFS? CIFS? How are the clients connecting to the storage? Or is the processing, etc. done on the storage system?

Fill in some more details and we can see if we can help.

For instance, if this is NFS with synchronous mounts, then it's definitely possible to scale ZFS on Linux to meet the write performance needs and still maintain the long-term storage capacity requirement. Is the data compressible? How is each client connected? Gigabit Ethernet?


Edit:

Okay, I'll bite:

Here's a spec that's roughly $17k-$23k and fits in a 2U rack space.

HP ProLiant DL380 Gen9 2U Rackmount
2 x Intel E5-2620v3 or v4 CPUs (or better)
128GB RAM
2 x 900GB Enterprise SAS OS drives 
12 x 8TB Nearline SAS drives
1 or 2 x Intel P3608 1.6TB NVMe drives

This setup would provide you 80TB usable space using either hardware RAID6 or ZFS RAIDZ2.

Since the focus is NFS-based performance (assuming synchronous writes), we can absorb all of those easily with the P3608 NVMe drives (striped SLOG). They can accommodate 3GB/s in sequential writes and have a high enough endurance rating to continuously handle the workload you've described. The drives can easily be overprovisioned to add some protection under a SLOG use case.

With the NFS workload, the writes will be coalesced and flushed to spinning disk. Under Linux, we would tune this to flush every 15-30 seconds. The spinning disks could handle this and may benefit even more if this data is compressible.

The server can be expanded via 4 more open PCIe slots and an additional slot for dual-port 10GbE FLR adapters, so you have networking flexibility.
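
For illustration, a hedged sketch of how the pool described above could be assembled (all device paths are made up, and the exact layout is of course up for debate):

```python
# Sketch of the ZFS layout: 12 x 8 TB NL-SAS in RAIDZ2 plus a striped NVMe SLOG.
import subprocess

data_disks = [f"/dev/disk/by-id/scsi-nl-sas-{i}" for i in range(12)]  # placeholder IDs
slog_devices = ["/dev/nvme0n1", "/dev/nvme1n1"]                       # P3608 namespaces

# Log vdevs listed without 'mirror' are striped, matching the "striped SLOG" above.
subprocess.run(["zpool", "create", "tank", "raidz2", *data_disks,
                "log", *slog_devices], check=True)

# lz4 costs little CPU and helps if the instrument data is compressible.
subprocess.run(["zfs", "set", "compression=lz4", "tank"], check=True)

# Flush dirty data to the spinning disks every ~15 s, per the tuning note above.
with open("/sys/module/zfs/parameters/zfs_txg_timeout", "w") as f:
    f.write("15\n")
```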

ewwhite
  • Thanks ewwhite; we would use NFS, and there is only one client (the machine); optionally we would use it as a read device from our cluster (but what processing, or how, is unknown). We have the "space" available on RAID 6 storage servers. – SvennD Jan 04 '17 at 15:22
  • @SvennD If it's NFS and with synchronous mounts, then it's definitely possible to scale ZFS on Linux to meet the write performance needs and still maintain the long-term storage capacity requirement. Is the data compressible? That's another factor. However, the scope of this is beyond the advice I could give on an online forum for free. My contact information is available in my [ServerFault profile](http://serverfault.com/users/13325/ewwhite?tab=profile). Contact me if you need to discuss further. – ewwhite Jan 04 '17 at 15:25
  • ZFS is more than capable of what you're asking for. The first issue is going to be making certain your actual *hardware* is capable of it. It's going to be pretty easy to accidentally create a bottleneck tighter than 1GB/sec at the adapter or backplane level, if you're not careful. Make sure you get THAT part right, then ask about how to avoid gotchas on the ZFS side. – Jim Salter Jan 04 '17 at 18:07
  • @SvennD Edited with a basic design specification and rough costs. – ewwhite Jan 05 '17 at 00:42
  • I think I'd recommend an [Oracle X6-2L](http://www.oracle.com/us/products/servers/x6-2ldatasheet-2900788.pdf) over an HP server. The Oracle server comes with four 10GB network ports out-of-the-box. And in my experience HP nickel-and-dimes you to death for ILOM, licensing ILOM software, etc., to the point an HP server is more expensive than an equivalent Oracle box. My experience also tells me that the Oracle box will outperform the HP box - and be a lot less likely than the HP box to have one of those hardware bottlenecks that @JimSalter mentions. Yes, buying from Oracle can be painful. – Andrew Henle Jan 06 '17 at 22:59
4

25 Gbps Ethernet is already borderline-mainstream, while PCIe-based NVMe will lap up that traffic easily.

For reference, I recently built a small 'log capture' solution using four regular dual-Xeon servers (HPE DL380 Gen9s in this case), each with 6 NVMe drives. I used IP over InfiniBand, but those 25/40 Gbps NICs would be the same, and we're capturing up to 8 GB/s per server - works a treat.

Basically it's not cheap, but it's very doable these days.

Chopper3
  • Yeah, but how do you store ~50 TB on NVMes? Spinners are cheap, so how do we merge to keep the speed up to par... – SvennD Jan 04 '17 at 15:35
  • Good point; realistically you're only going to get 4 x 4 TB in one server. I use multiple servers; presumably you can't? Otherwise it's just loads of 2.5" 10krpm drives in R10. – Chopper3 Jan 04 '17 at 15:38
  • Don't want is more like it; we won't need those specs except to get in the door, and I don't want the nightmare of the overhead of multiple servers for just one machine. Would R10 be fast enough? (hardware RAID?) – SvennD Jan 04 '17 at 15:41
  • We have a Windows 2012R2 box that we built from spare kit that wasn't being used, we use it as a NAS, it's got 6 x 400GB SAS SSDs internally, 8 x D2600 shelves each with 25 x 900GB 10k SAS disks and a D6000 shelf with 70 x 4TB disks and that can flood a 10Gbps NIC easily - not tried it with a 25Gb NIC yet tbh. – Chopper3 Jan 04 '17 at 15:46
  • @Chopper3 If I remember correctly, 2.5" enterprise mechanical disks are limited to 2 TB. For such a project, I would use 3.5" 8/10 TB SAS or SATA disks, as they should give better density. – shodanshok Jan 04 '17 at 18:00
  • @shodanshok: High-capacity drives are shingled and don't have high transfer rates. And since you'd have just 5 drives at 10 TB each, the aggregate transfer rate would be even further impacted. – MSalters Jan 04 '17 at 18:54
  • @MSalters There are a number of 8/10 TB PMR (non-SMR) drives with transfer rates in the range of 200 MB/s. A 12- or 16-drive array, both in RAID10 and RAID6, should easily exceed the required 1.1 GB/s transfer speed. – shodanshok Jan 04 '17 at 19:19
2

Doesn't sound like a big deal. Our local hardware supplier has this as a standard product - apparently it can push 1400MB/s sustained in CCTV recording mode, which should be harder than your peak requirements.

(Link is to default 12GB config, but they note 20x4TB is also an option. No personal experience with this particular model server.)

MSalters
  • Well, by "standard product" you mean a "black software box" with 20 x 600 GB 15k SAS and 3 enterprise SSDs. It's a fair offer; we got a similar one from our hardware vendor, but the licensing cost to me is crazy for something that is basically free (ZFS). Thanks for sharing the build! (nice link) – SvennD Jan 04 '17 at 20:55
2

Sequential writes at 1100MB/s are not an issue with modern hardware. Anecdotally, my home setup with 8x5900 RPM laptop drives, 2x15000 RPM drives and 2x7200 RPM drives sustains 300 MB/s with a 16GB one-off payload.

The network is a 10GbE with fiber cables, 9000 MTU on ethernet, and the application layer is Samba 3.0. The storage is configured in raid50 with three stripes over three 4-drive raid5 volumes. The controller is LSI MegaRAID SAS 9271-8i with up to 6Gb/s per port (I have an additional, slower port-multiplier).

Talk to any seasoned sysadmin and they should be able to tell you exactly which controller(s) and drives would meet your requirements.

I think you can try with any 12Gb/s controller and configure two mirrored stripes of eight 7200 RPM drives each (almost any drive should do). Start 3-4 TCP connections to saturate the link and if a single pair of 10GbE cards can't handle it, use four cards.

2

Something of a tangent, but consider using InfiniBand instead of dual 10GbE links. You can get 56Gbps Infiniband cards quite cheap, or 100Gbps ones for not too much more, and on Linux it's easy to use NFS with RDMA over IB, which will give you extremely low latency and near theoretical line speed throughput (if your underlying storage can handle it). You don't need a switch, just two InfiniBand cards and a direct attach cable (or an InfiniBand fiber cable if you need longer distances).

A single-port Mellanox 56Gbps card (8x PCIe 3.0) like the MCB191A-FCAT is less than 700 bucks, and a 2-meter copper direct attach cable is like 80 dollars.

Performance will generally blow 10GbE out of the water in all use cases. There are no downsides, unless you need to access the server from lots of different clients that can't all use InfiniBand (and even then, Mellanox' switches can bridge 10GbE and 40GbE to IB, but that is a bit more of an investment, of course).
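
For what it's worth, the client side of NFS over RDMA is roughly a module load plus a mount option; the address and paths below are placeholders, and option details vary by distribution:

```python
# Rough sketch of an NFS-over-RDMA mount on the client (needs root).
import subprocess

# Load the RDMA transport for the NFS client (module name on most distributions).
subprocess.run(["modprobe", "xprtrdma"], check=True)

# 20049 is the conventional NFS/RDMA port; 192.168.100.1 is a made-up IPoIB address.
subprocess.run(["mount", "-t", "nfs", "-o", "proto=rdma,port=20049",
                "192.168.100.1:/data", "/mnt/data"], check=True)
```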

1

Doing this with ZFS is possible; however, consider using FreeBSD, as FreeBSD has the faster network stack. This would possibly allow 100 Gbit on a single machine.

1100 MB/s sounds like a lot, but you can realistically achieve this using only regular hard drives. You say you need 75 TB of space, so you could use 24 × 8 TB hard drives in mirrors. This would give you 12x the write speed of a single drive and 24x the read speed. Since these drives can write at more than 100 MB/s, this should easily handle the bandwidth. Make extra sure not to get SMR drives, as those have hugely slower write speeds.
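
Making that arithmetic explicit (the per-drive write speed is an assumption, not a measured figure):

```python
# Rough numbers for the mirrored layout suggested above.
drives = 24
drive_size_tb = 8
per_drive_write_mb_s = 150            # conservative guess for an 8 TB PMR drive

mirror_pairs = drives // 2                        # 12 two-way mirrors
usable_tb = mirror_pairs * drive_size_tb          # 96 TB, above the 75 TB target
write_mb_s = mirror_pairs * per_drive_write_mb_s  # ~1800 MB/s aggregate sequential write
read_mb_s = drives * per_drive_write_mb_s         # reads can use both sides of each mirror

print(usable_tb, write_mb_s, read_mb_s)
```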

ZFS creates checksums for every block. This is implemented single-threaded, so you should have a CPU with a reasonably fast clock rate so that checksumming does not become a bottleneck.

However, the exact implementation hugely depends on further details.

mzhaase
1

We have pegged a 10G NIC dumping data to a Gluster cluster over their FUSE client. It takes a little tuning, but you wouldn't believe the performance it can achieve since 3.0.