
I'm having a discussion with a work colleague. I'm saying that a network with 100 elements will have roughly 10 times as many failures as a network with 10 elements, i.e., a tech will need to replace faulty hardware 10 times as often. He suggests that failures don't go up in a linear fashion, and that the larger network will see significantly fewer than 10x the failures, in fact only slightly more. To be clear, this is not about the probability of an outage; we are just talking about the amount of hardware a tech would need to swap out in a given time frame.

MikeKulls
  • As I recall, you are both partially correct... but it is the inverse of his response: it goes up exponentially, to a certain degree. Off to search for what I am thinking about. – AthomSfere Nov 04 '13 at 01:22
  • 1
    As a whole system, one failure of a single router could be equal to total failure. Like if you have 100 servers load balanced for a website and only one router to connect them to the internet. The router failure means total failure in that regard. The system as a whole is only as reliable as the weakest link. – hookenz Nov 04 '13 at 02:53
  • @Matt, I am not talking about the chance of a user experiencing an outage, more about the amount of work a tech would need to do to fix all the faults that arise. – MikeKulls Nov 04 '13 at 03:38

1 Answer


Of course it's linear, assuming identical components with identical reliability and identical environmental conditions.
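
This is just linearity of expectation: if each element fails independently at some constant rate, the expected number of replacements over a period is (number of elements) × (failure rate) × (time), so 100 elements generate ten times the swap-outs of 10 elements. Here is a minimal Monte Carlo sketch of that claim in Python (the per-day failure probability is a made-up figure, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

    def expected_swaps(num_elements, days, daily_failure_prob, trials=10_000):
        # Each element-day is an independent Bernoulli trial, so the total
        # number of failures in one simulated period is a draw from
        # Binomial(num_elements * days, daily_failure_prob).
        failures = rng.binomial(n=num_elements * days,
                                p=daily_failure_prob, size=trials)
        return failures.mean()

    # Hypothetical figures: 0.05% failure chance per element per day, one year.
    small = expected_swaps(10, 365, 0.0005)    # ~1.8 swaps/year
    large = expected_swaps(100, 365, 0.0005)   # ~18.3 swaps/year
    print(f"ratio: {large / small:.1f}x")      # ~10x

Note that the sketch assumes independent failures and immediate replacement; correlated failures (e.g. a batch of identical parts wearing out together, as mentioned in the comments below) are a separate issue.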

But, it is rare to be able to make an apples-to-apples comparison between an installation of 10 servers and an installation of 100 servers. Small groups of servers, routers, switches, etc. are often subjected to inappropriate environments such as unventilated closets where they may be exposed to inappropriately high temperatures, dust, and lint. They may also be inappropriately connected directly to grid power that may expose equipment to power irregularities such as spikes, surges, and brownouts. On the other hand, typical "datacenter" environments have proper controls for temperature/humidity, clean air, clean power, etc. It is also important to bear in mind that a large-scale operator may be more likely to specify truly professional-grade equipment.

Equipment may be more reliable in a datacenter than in a broom closet, but that isn't due to some magical law of the universe that gives equipment safety in numbers. Instead, it is due to the deliberate optimization of many controllable factors.

Skyhawk
  • And if they are all new and many of the elements are identical, the likelihood of them failing within the same time frame is greatly increased. – hookenz Nov 04 '13 at 02:50
  • In our case it is actually a network that was once under much higher usage and now has far more equipment than is required (dial-up internet). All of it is battery-backed in air-conditioned exchanges. Because we are talking about pulling out 90% of it, for the purpose of the comparison it is carrier-grade equipment in both cases. Hence an apples-to-apples comparison is pretty valid, except that the load will increase. I accept this increased load could result in a greater number of failures, but that is a different topic of discussion. – MikeKulls Nov 04 '13 at 03:34
  • @Matt you appear to be correct. They are all around 12 years old and all appear to be failing at a great rate. – MikeKulls Nov 04 '13 at 03:35