
Background: We need an HA server in a small office environment and are looking at DRBD to provide it. We only have about 100 GB that needs to be on the HA server, and server load will be extremely low. The data will probably grow about 10%-25% per year if we archive older office data, and 50%-75% per year if we don't.

The point is that we use a mix of consumer-grade and used enterprise-grade hardware, which WILL be a problem if we don't plan for it preemptively; and pre-built quality servers DO fail, so redundant servers seem like the way to go.

The Plan: We are thinking it would be good to find two of the best bang-for-our-buck used servers and synchronize them. We simply need SATA/SAS-capable servers with space for as many drives as the budget allows. Such servers seem to go for $100-$200 (plus some parts and additional drives) if you catch a deal.

This would theoretically mean that a server could fail and, as long as we didn't have another coincidental failure, things would still hum along for days until our IT department (me) could get to it. We would use Debian as the OS.

Some Questions

  1. (A) How does DRBD handle drive or controller failure? DRBD sits in the stack in front of the storage driver, so what happens when the controller fails and writes dirty data, or a drive fails but doesn't die immediately? Is the data mirrored to the other server or not, and is there a risk of data corruption spreading across servers in cases like these?

  2. (B) What are the failure points for DRBD? Theoretically, as long as one server is up and running there should be no issues EVER, but we know there will be issues, so what are the failure modes when using DRBD, given that most of them should theoretically be in software?

  3. If we are going to have two servers for this, would it be reasonable to run VMs on each, with MySQL and Apache, for database and web server replication? (I am assuming so.)

  4. Is DRBD reliable enough? If not, is the unreliability isolated to certain tasks, or is it more random? Searching turned up people with various issues, but this IS the internet, with seemingly more bad info than good.

  5. If data is being synchronized over the LAN, does DRBD use double the bandwidth? That is, should we double up on NICs and do some link aggregation and trunking? Then maybe put them on separate routers on separate circuits and UPSes in separate rooms, and now you really have some redundancy!

  6. Is this too crazy for an office in terms of server management? Is there a simpler REAL-TIME alternative (granted, DRBD seems simple in theory)?

We already have a server, so it seems to me that a second USED server with a dedicated drive for DRBD could easily be had for around $150-$250 with some smart shopping. Add a second router, more drives, more NICs (used), and two UPSes, and we're talking $1,000 +/-. That is relatively cheap! I am hoping this would mainly buy us time during a server fault. Drive failures seem like the easier thing to handle with RAID these days; it's other hardware failures like controllers, memory, or power supplies, which might require downtime to diagnose and fix, that are the concern.

For us, redundant servers mean that used hardware becomes more viable, with more uptime and more flexibility for me to fix things when my schedule allows, versus having to stop everything to repair the server.

Hopefully I didn't miss that these questions have easily searchable answers. I did a quick search and didn't find what I was looking for.

Damon
  • You don't want a "drive" in these servers. You want an **array** of drives, two of them at the very least, configured as RAID1. – EEAA Aug 24 '13 at 13:02
  • @EEAA Which is better, single drive in redundant servers or one server with RAID? We plan on RAID eventually in them for the HA data, but we see redundant servers as more reliable than only a RAID on one server. Do you disagree? And good backups take priority over all of this. – Damon Aug 24 '13 at 14:42

1 Answer


First, you need to define what you really mean by "HA". What are you protecting against, and what are the costs of an outage of type X and duration Y? How would it affect your organization? What is your role in this organization anyway, and what is your time worth? How much time can you spend on this? After that, you have to decide whether these requirements allow this kind of solution or whether you need something else.

Second: In my world, the sentences "I need HA" and "I am going to buy crappy used servers for $200" cannot possibly fit together (in fact, for me, buying used crap and professional use of any kind don't fit together at all).

Anyway, your questions:

  1. If you write completely new data to the DRBD block device, it will be written correctly on the node with the non-broken controller. DRBD is a completely transparent layer in front of the actual disks, just like software RAID or LVM. However, if you have data corruption on the primary node due to a broken controller or read errors from the disk, this can easily propagate to the secondary node: write operations are often read-modify-write cycles, so a block of corrupted data is read on the primary node and a write of that block is sent to both nodes. This brings up the most important point when using DRBD: just like RAID, it is in no way a replacement for a good and reliable backup. (See the configuration sketch after this list for how DRBD is told to handle a failing local disk.)

  2. I don't understand what you mean here.

  3. If using VMs is useful in a single-node setup, it will be useful in the two-node setup as well, and you'll have the added advantage of possible live migration when done right.

  4. In my experience, yes. You should test it thoroughly in your environment though, spend a lot of time simulating the various failure states the system can get into, and learn and document how to recover from them. While it's reliable, DRBD is not self-healing and requires a good understanding of the situation to recover from a failure condition.

  5. You really want a dedicated connection between the nodes. In a two-node setup, this can be a point-to-point connection without a switch or anything else in between; everything else might be technically possible but is just nonsense. Depending on your usage pattern, using trunking or faster NICs (e.g. 10G Ethernet or InfiniBand) for this dedicated link might be beneficial, but if most or all of the data to read or write comes in over the LAN interface, this won't help, as you are limited by the LAN anyway. (The sketch after this list shows such a dedicated replication link.)

  6. This comes back to my first paragraph: what do you expect from it and what do you consider HA? For an experienced system administrator, it can be a cheap and reliable way to protect against a range of failures, but it requires a lot of fundamental understanding of how the parts fit together. Many small shops without such an experienced full-time SA are better off with quality hardware and a good support contract.
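
To make points 1 and 5 more concrete, here is a minimal sketch of what a single DRBD resource could look like on Debian. This is illustrative only: the resource name r0, the hostnames alpha/bravo, the /dev/sdb1 backing disks and the 10.0.0.x replication addresses are placeholders, and the exact option set depends on your DRBD version. The point is that `on-io-error detach` tells DRBD to drop a failing local disk and keep serving I/O from the peer, and `verify-alg` enables online verification runs that can catch silently diverged blocks; neither of these turns DRBD into a backup.

```
# /etc/drbd.d/r0.res -- illustrative sketch, adjust to your DRBD version
resource r0 {
  protocol C;                  # synchronous: a write completes only when both nodes have it

  disk {
    on-io-error detach;        # on a local disk/controller error, detach the disk and run from the peer
  }

  net {
    verify-alg sha1;           # allows "drbdadm verify r0" to checksum-compare both copies
  }

  # one section per node; note the dedicated point-to-point replication link
  on alpha {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7789;
    meta-disk internal;
  }
  on bravo {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
```

With only two nodes, that 10.0.0.x replication network can literally be a direct cable between two spare NICs, so replication traffic never touches the office LAN.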

Finally: don't try to retrofit any HA solution onto your current hardware. As I wrote, you need time to experiment with the setup and its failure conditions. This requires a lot of downtime and can't reasonably be done on your production hardware.
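
As a starting point for those experiments, the commands below are the sort of thing you would be rehearsing and documenting. They assume the placeholder resource r0 from the sketch above and DRBD 8.x on Debian; exact invocations differ slightly between DRBD versions, so check them against your installed documentation.

```
# Watch connection, role and sync state (DRBD 8.x exposes this via /proc/drbd)
cat /proc/drbd

# Run an online verification pass (requires verify-alg in the resource config)
drbdadm verify r0

# Example drill: after a simulated failure that ends in split brain, decide
# which node's data you are willing to discard and run on that node:
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# ...then re-connect on the surviving node:
drbdadm connect r0
```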

Sven
  • To your second point, I definitely understand that probably 99.9% of professionals out there would say the same thing for GOOD reason. On "2.": in our RAID setup we are focusing on drive failure, controller failure, software errors, and user errors and plan accordingly. What would a similar list look like using DRBD? Is it the same? As to "4.": we are in the planning stage for all of this. As we grow, our data will become more important and downtime will become a bigger and bigger problem, so we are trying to be preemptive with planning now so we can start purchasing pieces to build our infrastructure. – Damon Aug 24 '13 at 14:12
  • "6." I simply want and need more uptime for our files, database, and webfront end AND for the ability for me to be in the field during normal hardware failures (memory, PS, controller, etc) and still be up and running. We started by just turning a desktop into a server and that has worked for years without issues. BUT, THIS WILL NOT LAST :). So, the plan is to probably find a used IBM, HP, or Dell server (high availability and cheap) but this doesn't solve the problem of hardware failure because over the years, they will have issues too; just with a higher interval between failure. – Damon Aug 24 '13 at 14:27
  • To your last point: absolutely. This will take a lot of time to implement, but we need to know where we are going so we purchase the right hardware and begin integrating it, so that when we have the business volume these things are already figured out. We run tight margins, so we are simply trying to use our options to lower cost and increase production. In this case, buying a new high-priced single server seems more expensive than two used servers with DRBD, and it would give more uptime at a lower equipment price. SA costs are definitely a factor, and I think it still ends up cheaper in our case. – Damon Aug 24 '13 at 14:38
  • Oh, and Google's starting model was a version of what we are thinking of: Google used consumer-grade hardware and built in redundancy; WHEN something fails you just swap it out. I am thinking of a similar move for our IT infrastructure. We already do this with our equipment. As an example, we purchase used Honda HR214 lawn mowers for $50-$100 (instead of ~$1,000 for a similar new commercial model) and we keep 3 where we need 2; when a mower fails we just swap it out and repair it when we have time. But we service our equipment in-house, and for DRBD to work, we will need to do much the same. – Damon Aug 24 '13 at 15:31
  • You have to be aware that DRBD is only one comparatively small part of an HA setup, where it takes the place of a SAN for shared block devices. The *much more* complicated pieces of an HA setup are those that make automatic failover possible. You will need cluster filesystems, cluster management, fencing, etc. to build a full HA setup that will continue to work, without immediate human intervention, when a node fails. As long as you consider DRBD as just a kind of "RAID over LAN", things are easy, but that is only the very first step. – Sven Aug 24 '13 at 16:00
  • Also, you are not Google, and I guess you are in an entirely different business. What makes sense for Google or Facebook doesn't necessarily make sense for your business. – Sven Aug 24 '13 at 16:15
  • Definitely aware that implementation is more involved than planning and will require a whole slew of things to make DRBD possible, but right now we are trying to figure out whether it is possible. And can't you have DRBD without a SAN and get automatic failover without human intervention? We will probably stay with internal storage or DAS. Also, to your second comment: this does hinge on us maintaining it in-house, via myself or a smartly shopped IT person as it grows. There is no way this is viable if we call any ol' IT company and pay to have this set up and maintained. – Damon Aug 24 '13 at 16:33
  • DRBD is the "cheap" replacement for a SAN: it allows you to have a simulated shared block device "connected" to more than one machine, but it doesn't help you at all with application failover. If you run a file server, it runs on only one machine at a time while the data is kept in sync on both machines. However, if the primary file server fails, there is nothing in DRBD that would switch this file server over to the second machine. This is the domain of cluster management software, and implementing that in a secure and reliable way is *much more* complicated than DRBD itself. – Sven Aug 24 '13 at 16:50