Picture This: A Visual Guide to Failure Values
As you study for a variety of certification exams (such as CompTIA’s Cloud+, A+, Security+, etc.), there are a number of “failure”-related values you will need to learn. Typically, these are associated with hardware components such as the hard drive. You will be tested repeatedly on the same set of failure values — or a subset thereof.
The formulas for computing the values, where they exist, are given in the descriptions that follow, along with other pertinent information. Make sure that you thoroughly understand the definition for each of the values, and the differences between these topics, before writing any exam. The images are a play upon the cards used in Monopoly (© 1936 by Parker Brothers) and are fitting because the first rule of any game is to know that you’re in one. Trying to keep IT systems up and running, with availability high and failures at a minimum, can indeed be viewed as a game — and a taxing one at that.
Mean Time To Restore (MTTR)
The simplest of the values to understand is MTTR. The first three letters always stand for Mean Time To, and the “R” varies between Restore, Repair and Recovery. Whichever word is used for the final letter, MTTR measures how long it takes to repair a system or component once a failure has occurred. If the MTTR is 20 minutes for a particular entity, then it takes, on average, 20 minutes to fix that entity when it breaks and return it to service.
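Because MTTR is an average, it can be worked out from a record of past repairs. The following is a minimal sketch; the repair times and the function name mean_time_to_restore are hypothetical, chosen only so that the result matches the 20-minute example above.

```python
def mean_time_to_restore(repair_minutes):
    """Return the average repair duration, in minutes."""
    if not repair_minutes:
        raise ValueError("need at least one repair record")
    return sum(repair_minutes) / len(repair_minutes)


# Hypothetical repair times (in minutes) for five separate incidents.
repairs = [12, 25, 18, 30, 15]

mttr = mean_time_to_restore(repairs)
print(f"MTTR: {mttr:.0f} minutes")
```

Note that no single repair in this sample took exactly 20 minutes. An MTTR figure says nothing about the longest repair you might face, which is worth remembering when reading an SLA.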
While MTTR is considered a common measure of maintainability, you need to be careful when evaluating it during negotiations with a vendor on a service level agreement (SLA), because it doesn’t always include the time needed to acquire a component and have it shipped to your location. I once worked with a national vendor who thought the MTTR acronym stood for mean time to respond. A technician would show up on site within the time the contract called for but would only begin to look at the problem and then make a list of any needed supplies as well as get coffee, make a few phone calls, and so on. That vendor’s actual time to restore always far exceeded the contracted MTTR number. Make sure any contract agreement you commit to spells out exactly what your client can expect. In general, the lower this number, the better.
Figure One: MTTR is the average time it takes to correct the failure.
The time it takes to resume normal operations can depend on a wide range of variables, but two of the biggest factors are the availability of substitutes and knowledge. If a hard drive in a rack fails, for example, and there is a spare nearby, then it can take less than one minute to swap one for the other. If there is no spare, or no one working at 3 a.m. who knows how to swap the drives, then the repair naturally takes longer. To reduce this number, keep spare parts on hand and have a knowledgeable person readily available who can respond to problems and fix them.
When faced with exam questions on this variable, common sense is your best guide.
Mean Time To Failure (MTTF)
The Mean Time To Failure (MTTF) is the average time to a nonrepairable failure of a component. It can sometimes be used in place of MTBF (discussed next), but that is an improper use and there is a distinct difference between them. The easiest way to think of MTTF is to equate it with the life of the item. Almost every computing component has an MTTF associated with it and devices that commonly fail include hard drives, power supplies and memory. Devices that fail a little less commonly would include network cards, controllers, fans, motherboards and the like. The higher the MTTF on a device, the better.
Figure Two: The Mean Time to Failure (MTTF) represents time when the system is running in the absence of a problem.
The more components you are working with, the lower the effective MTTF of the system as a whole. As a simplified example, assume that in your entire organization (a little one-man office) you have only one hard drive, and it is a SATA drive with an MTTF of 1 million hours. A year contains 8,760 hours, so when you first start out, the odds of your hard drive failing in the first year are only slightly greater than 0.8 percent (8,760 divided by 1 million is about 0.88 percent). Those odds get worse as the hard drive ages, but still stay fairly small.
Now assume that business needs grow and you move from the single hard drive to an array of 32 drives, each with the same rating. Since any one of those drives could fail, the failure rate of the array as a whole is 32 times that of a single drive, and you’ve gone from less than a 1 percent likelihood of a problem during the year to roughly 25 percent. In short: the more interconnected components you have, the more possibilities exist that something can go wrong.
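The arithmetic behind these percentages can be sketched in a few lines. This sketch assumes a constant failure rate (an exponential model) and drives that fail independently of one another. Neither assumption comes from a drive vendor, and the model deliberately ignores the aging effect mentioned above. The function name failure_probability is illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def failure_probability(mttf_hours, hours, drives=1):
    """Probability that at least one of `drives` fails within `hours`.

    Assumes a constant failure rate and independent drives, so the
    combined failure rate is simply drives / mttf_hours.
    """
    combined_rate = drives / mttf_hours
    return 1 - math.exp(-combined_rate * hours)


single = failure_probability(1_000_000, HOURS_PER_YEAR)
array = failure_probability(1_000_000, HOURS_PER_YEAR, drives=32)

print(f"one drive, first year: {single:.2%}")
print(f"32 drives, first year: {array:.2%}")
```

Under these assumptions the single drive comes out just under 0.9 percent and the 32-drive array a little over 24 percent. The array figure is slightly less than 32 times the single-drive figure because more than one drive can fail in the same year, and the calculation counts that year only once.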
Mean Time Between Failures (MTBF)
Often confused with MTTF, the Mean Time Between Failures (MTBF) is the measure of the anticipated incidence of failure for a system or component, or how frequently a component will fail. The word “between” implies that the failures are recoverable. This measurement is often used in the industry for hard drives, but failures there are not recoverable and MTTF is the more accurate variable. If the MTBF of a cooling system is one year, you can anticipate that the system will last for a one-year period; this means you should be prepared to rebuild the system once a year. If the system lasts longer than the MTBF, your organization receives a bonus. Like the other variables, MTBF is helpful in evaluating a system’s reliability and life expectancy.
Figure Three: Like MTTF, MTBF decreases as you add more possible things to go wrong.
Logically, MTBF is the sum of MTTF and MTTR. Put another way, the time between recoverable failures is equal to the time to the failure of a nonrecoverable component plus the time to fix it. For example, consider a server in which the power supply fails. The MTBF is equal to all the time the power supply functioned properly (its MTTF) plus the time it took to replace it (its MTTR).
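The relationship between the three values can be checked against a log of failure cycles. In this minimal sketch each cycle records how long a server ran before a failure and how long the repair took. The figures are hypothetical and are not taken from any real hardware.

```python
# Hypothetical failure cycles for one server:
# (hours running before the failure, hours needed to repair it)
cycles = [(4_000, 2), (5_200, 4), (4_600, 3)]


def average(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


mttf = average(up for up, _ in cycles)            # mean running time
mttr = average(fix for _, fix in cycles)          # mean repair time
mtbf = average(up + fix for up, fix in cycles)    # mean full cycle

print(f"MTTF: {mttf} h, MTTR: {mttr} h, MTBF: {mtbf} h")
```

Averaging the full cycles gives the same number as adding the two averages together, which is the sense in which MTBF contains both MTTF and MTTR.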
Recovery Time Objective (RTO)
The Recovery Time Objective (RTO) is the maximum amount of time that a process or service can be down while the consequences are still considered acceptable. Beyond this window, the break in business continuity is considered to negatively affect the business. The RTO is agreed on while the business impact analysis (BIA) is being created.
Figure Four: The RTO is a goal you’d like to achieve in terms of the time it takes to return to operations.
The BIA is simply a study of the possible impact if a disruption to a business’s vital resources were to occur. This analysis isn’t typically concerned with external threats or vulnerabilities, but focuses on the impact a loss would have on the organization. It can be useful in identifying the true loss potential and may help you in your fight for a more substantial IT budget.
Recovery Point Objective (RPO)
The Recovery Point Objective (RPO) is similar to RTO, but it defines the point to which the system would need to be restored. It can be expressed in terms of time: if the RPO can be two days before a crash occurred, then whip out the old backup tapes, but if it needs to be five minutes before the crash occurred, then you need to rely on journaling. As a general rule, the closer the RPO is to the time of the crash, the more expensive it is to achieve.
Figure Five: RPO represents an amount of loss you are willing to live with.
RPO and RTO are quite often expressed together. For example, in the event of a failure, it may be that the BIA specifies you need to be back up and operating within two hours (RTO) with a maximum loss of the last 10 minutes of transactions (RPO).
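The two-hour and 10-minute figures in that example can be turned into a simple check after an incident. This is a sketch only: the function name meets_objectives and the incident timestamps are hypothetical, and real recovery reporting involves far more than two subtractions.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=2)      # maximum acceptable downtime
RPO = timedelta(minutes=10)   # maximum acceptable window of lost data


def meets_objectives(failed_at, restored_at, last_recoverable_at,
                     rto=RTO, rpo=RPO):
    """Return (rto_met, rpo_met) for a single incident."""
    downtime = restored_at - failed_at
    data_loss_window = failed_at - last_recoverable_at
    return downtime <= rto, data_loss_window <= rpo


# Hypothetical incident: service restored after 95 minutes, but the
# newest recoverable data is from 25 minutes before the failure.
failed = datetime(2024, 1, 15, 3, 0)
rto_met, rpo_met = meets_objectives(
    failed_at=failed,
    restored_at=failed + timedelta(minutes=95),
    last_recoverable_at=failed - timedelta(minutes=25),
)

print(f"RTO met: {rto_met}, RPO met: {rpo_met}")
```

In this invented incident the service came back inside the RTO, but 25 minutes of transactions were lost, so the RPO was missed. The two objectives are independent, and meeting one says nothing about the other.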
Summing it Up
These five values (MTBF, MTTF, MTTR, RPO and RTO) are used to describe and calculate the failure characteristics of systems and components. All five feature prominently in several certification exams, most notably the entry-level certification tests from CompTIA.