Sunday, June 17, 2012

Carrier-Grade: Five Nines, the Myth and the Reality


Notes from the paper below on carrier-grade systems and availability in both the hardware/telco and software worlds. It offers good lessons on what availability means, how it differs from a customer-oriented SLA, where the five-nines reliability figure originated, and how multiple components contribute to overall availability. A paper worth reading.

Carrier-Grade: Five Nines, the Myth and the Reality

·         "Availability" is oftentimes not the total time up, but rather related to a customer-oriented SLA
Availability was computed from the time a ticket was opened until it was closed. Then came the exclusions: only customer-opened tickets counted, so proactive restorations did not count (if the tree fell and no one noticed before it was propped up, it did not fall); then the outage start time was taken as the time the ticket was opened (often well after the problem physically occurred); and lastly, any time the ticket was transferred to the customer (“on the customer’s clock”) did not count. On top of all this, scheduled maintenance downtime did not count toward availability. The Voice NOC had seemingly qualified away rigor in order to meet the mandated goal of five-nines. (Once again proving the rule that you get what you measure.)
Essentially the Voice NOC was not wrong, nor the Data division right. What the NOC was measuring was compliance with a customer-oriented SLA, a Service Level Agreement, and not really an availability measure. The Data division was measuring availability as “measured uptime,” or the probability the network could be used at any specific time the customer wished to use it.
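To make the difference between the two measures concrete, here is a minimal sketch that applies the exclusions described above to a made-up set of outages. The `Outage` record, its field names, and the sample numbers are all hypothetical illustrations, not data or definitions from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Outage:
    """One outage, with times in hours since the start of the reporting period (hypothetical record)."""
    fault_start: float                      # when the problem physically occurred
    restored: float                         # when service was restored
    ticket_opened: Optional[float] = None   # None if no customer ever opened a ticket
    customer_clock: float = 0.0             # hours the ticket sat "on the customer's clock"
    scheduled: bool = False                 # scheduled maintenance window


def measured_availability(outages: List[Outage], period_hours: float) -> float:
    """'Measured uptime': every hour of downtime counts, from the physical fault onward."""
    down = sum(o.restored - o.fault_start for o in outages)
    return 1 - down / period_hours


def sla_availability(outages: List[Outage], period_hours: float) -> float:
    """SLA-style figure: apply the exclusions described in the text."""
    down = 0.0
    for o in outages:
        if o.scheduled or o.ticket_opened is None:
            continue  # scheduled maintenance and proactive restorations are excluded
        # Clock starts at ticket open and stops while the ticket is with the customer.
        down += max(0.0, o.restored - o.ticket_opened - o.customer_clock)
    return 1 - down / period_hours


# Hypothetical 30-day month (720 hours) with three outages.
outages = [
    Outage(fault_start=10, restored=12, ticket_opened=10.5, customer_clock=0.5),
    Outage(fault_start=100, restored=101),                  # fixed before anyone noticed
    Outage(fault_start=200, restored=204, scheduled=True),  # maintenance window
]
measured = measured_availability(outages, 720)
sla = sla_availability(outages, 720)
print(f"measured: {measured:.5f}  SLA-style: {sla:.5f}")
```

With these invented numbers, seven hours of real downtime shrink to one counted hour under the SLA rules, which is the gap between the Voice NOC's figure and the Data division's.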

·         Origin: hardware reliability numbers
It is likely that the origin of five-nines availability comes from design specifications for network elements. It also seems likely that the percent measure came before the current MTBF (Mean Time Between Failures) and MTTF (Mean Time To Fix) measurement, since it is a simply expressed figure and the MTTF requirements often match the % calculation while being expressed as an odd-ish number. In practice, these numbers are mostly estimates that the manufacturer makes about the reliability of their equipment. Because telecom equipment frequently is expensive, not deployed in statistically significant sample set sizes, and rushed into service as soon as it passes laboratory tests, the manufacturer estimates its reliability.
It is not clear when the measure of reliability of an individual network element became the measure of overall network availability – but it did: customers don’t care if one element fails, or dozens fail. Customers only care that the service they are paying for and rely on works as offered.

·         Availability: average or a probability?
Is availability an average or a probability? Generally, if you are building a network element, you would define availability as a “probability of failure”. But if you are an engineer in a NOC running a report, you would define availability as an average. According to Cisco, by contrast, software availability is calculated from the number of defects per million, and time does not enter directly into the calculation.

·         High-availability: averaged out among all system elements?
If one system crashes for 24 hours, and all others work without interruption for the month, the 24-hour outage might be averaged across a large base of installed units. When the number of installed units is large enough, this yields a number that remains within the required five nines of uptime. More specifically, if a service provider installs a product described as "carrier grade" or providing five-nines availability, should the service provider expect that this is the reliability standard for every single network element, or should the service provider expect that some of the elements may perform at a much degraded level, and that it is only the worldwide “law of large numbers" that is used to measure carrier-grade? You see, it isn’t just that one bad apple in a group of network elements can skew the overall numbers – there are actual customers and real traffic affected by that “bad apple” element. Way beyond some theoretical measure, the effect on customers might be quite severe – certainly outside the guarantees of any customer SLA, and almost certainly extracting a penalty payment from the carrier, and likely attracting a fine from the regulator as well.
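As a rough illustration of the averaging effect, the sketch below computes how large an installed base has to be before a single 24-hour outage disappears inside a fleet-wide five-nines average. The function names and the 30-day (720-hour) month are assumptions made for this example, not figures from the paper.

```python
import math


def fleet_availability(units: int, down_hours: float, period_hours: float) -> float:
    """Average availability across the fleet when one unit is down for down_hours."""
    return 1 - down_hours / (units * period_hours)


def min_units_to_hide(down_hours: float, period_hours: float,
                      max_unavailability: float = 1e-5) -> int:
    """Smallest fleet size at which the fleet-wide average still meets the target."""
    return math.ceil(down_hours / (period_hours * max_unavailability))


PERIOD = 720.0  # assumed 30-day month, in hours
needed = min_units_to_hide(24, PERIOD)
bad_unit = 1 - 24 / PERIOD  # what customers on the failed unit actually experienced

print(f"units needed to average out one 24 h outage: {needed}")
print(f"fleet average at that size: {fleet_availability(needed, 24, PERIOD):.7f}")
print(f"availability of the failed unit itself: {bad_unit:.4f}")
```

Under these assumptions, a few thousand installed units are enough for the fleet average to stay at five nines, while the customers behind the failed unit saw roughly 96.7% availability for the month.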

·         System availability vs. component availability - or, why 5 nines was chosen.
Measuring system availability is actually different from measuring individual element availability. If a string of network elements is directly dependent on each other, say a common circuit which passes through each one, then the availability of the system follows the law of multiplying probabilities. In a 4-node system in which each node is five-nines reliable, the availability would be .99999 * .99999 * .99999 * .99999 = .99996. This partially explains why five-nines was chosen: it is so close to unity/one that, even after the degradation caused by directly dependent probabilities, the system is still very reliable. So the network effect is reduced. (Choosing 0.95 reliability as a standard, for example, would mean that a string of 12 dependent nodes would be in failure mode about half of the time, since 0.95 multiplied by itself 12 times is about 0.54.) But with everything built to five-nines, if just one bad apple exists (say with even a stated four-nines reliability), then the string of four nodes as a group becomes .99987, very close to the reliability of the lowest performing element. In fact, in this case the close-to-unity of the other devices nearly removes them from the equation; the dependent string will always be very near the value of the bad apple, which can be very bad if the apple is actually rotten. In this situation all of the careful design and investment in carrier-grade devices of five-nines reliability becomes economically worthless.
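The multiplication rule above is easy to check numerically. The following is a small sketch that reproduces the three figures quoted in the text; `series_availability` is a name chosen for this example, and it assumes the elements fail independently.

```python
from math import prod
from typing import Iterable


def series_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain of directly dependent elements: the product of the parts.

    Assumes element failures are independent of each other.
    """
    return prod(availabilities)


four_good = series_availability([0.99999] * 4)                  # four five-nines nodes
one_bad_apple = series_availability([0.9999] + [0.99999] * 3)   # one four-nines node
twelve_weak = series_availability([0.95] * 12)                  # twelve 0.95 nodes

print(f"4 x five-nines:            {four_good:.5f}")
print(f"1 x four-nines + 3 x five: {one_bad_apple:.5f}")
print(f"12 x 0.95:                 {twelve_weak:.2f}")
```

The middle case shows the bad-apple effect: the chain's result sits almost exactly at the weakest element's value, because the other three factors are so close to one.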

