MTTR vs. MTBF vs. MTTD vs. MTTF Failure Metrics
You’re awoken in the middle of the night by your phone loudly ringing. The screen shows 1:37am, and the CTO is calling. Your network is down, and the nighttime skeleton crew hasn’t been able to pinpoint why. The next 72 hours are a whirlwind, and while the immediate fire has been extinguished, the jury is still out on whether you keep your job.
Preventing network downtime is plan A, but if that fails, optimizing your recovery procedure becomes very important. Different network failure metrics can be used to measure your incident response and show progress each time calamity strikes.
Different organizations will define uptime differently. A completely dead system, an inability to complete any number of operations, incorrect information delivered or under-performance can all be considered service outages.
The most popular way to measure uptime is by system availability. This metric is considered one of the leading industry standards for networking professionals and will be used for this article, but the basic premise applies to other forms of service failures as well.
All these metrics are calculated as arithmetic means, where the central tendency is the sum of all measurements divided by the number of observations in the data set. The result of this calculation is usually expressed in hours or thousands of hours (or in minutes for detection times). Whether a lower or higher number is better depends on the metric. These can be used to track KPIs, inform SLAs or maintenance contracts and explain results in reporting and analytics.
Here is a quick glossary of related terms:
- MTTR: Mean Time to Recovery – Mean Time to Recovery is the average time it takes to go from a failure back to full working order. Learn how to calculate MTTR.
- MTBF: Mean Time Between Failures – Mean Time Between Failures is the average time you can expect a component or service to operate between failures. Learn how to calculate MTBF.
- MTTD: Mean Time to Detect – Mean Time to Detect is how long, on average, it takes for an outage to be flagged in your system. Learn how to calculate MTTD.
- MTTF: Mean Time to Failure – Mean Time to Failure is how long, on average, you can expect a service to run in a normal operating state before a disruption occurs. Learn how to calculate MTTF.
You will want to reduce MTTR and MTTD and increase MTBF and MTTF, which can be achieved through IT hardware monitoring and thoughtful planning of your preventative maintenance processes.
What Is Mean Time to Recovery (MTTR)?
Mean Time to Recovery is the average time needed to restore your system after a failure. Repair and recovery are coupled with system outages and failures. In the fast-moving IT industry, understanding how quickly an item of equipment can be repaired is vital. Your incident response time will reflect on the performance of your team, your organization and the profitability of the business.
How to Calculate MTTR
For a single component, we only care about the time from when the outage begins to when the system operates normally. Any time spent operating correctly is not included. Diagnostic time is included in the calculation but not lead time on ordering parts or supply chain/procurement waiting time.
The intent is to measure the time between when the disruption is first discovered and when the system returns to operation. The exact definition can vary between data center hardware maintenance SLAs, so it is important to clarify how each organization interprets “MTTR.”
Here are some examples:
- An asset has failed exactly once, the failure was immediately discovered and reported, and on-site technicians repaired the failed asset in 24 hours. The MTTR would be 24 hours.
- An asset failure, a power outage, and an ISP failure all happen within three hours. Each system failure was reported immediately, and a local technician was notified and recovered the failed system within 3 hours the first time, within 4 hours the second time and within 2 hours the third time. The MTTR would be (3+4+2)/3 = 3 hours.
- Performance degradation due to over-congestion was reported but could not be acted on for 24 hours due to operational problems elsewhere. It then took a further 24 hours to diagnose the exact problem and another 24 to fix it. The MTTR would be 72 hours. If the same issue then recurs with the same cause and takes only 12 hours to fix, the MTTR comes down to (72+12)/2 = 42 hours.
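The three examples above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `mttr` and the idea of passing one recovery duration per incident are illustrative choices, not part of any particular monitoring product.

```python
def mttr(recovery_hours):
    """Mean time to recovery: total recovery time divided by incident count.

    `recovery_hours` holds one duration (in hours) per incident, measured
    from when the outage was reported to when service was restored.
    """
    if not recovery_hours:
        raise ValueError("at least one incident is required")
    return sum(recovery_hours) / len(recovery_hours)


# Example 1: a single failure repaired in 24 hours.
single_failure = mttr([24])

# Example 2: three failures recovered in 3, 4 and 2 hours.
three_failures = mttr([3, 4, 2])

# Example 3: a 72-hour incident (24 waiting + 24 diagnosing + 24 fixing),
# followed by a 12-hour recurrence.
congestion = mttr([24 + 24 + 24, 12])

print(single_failure, three_failures, congestion)
```

Note that the waiting and diagnostic time in the third example is counted, while procurement lead time would be left out under the definition used here.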
What Is Mean Time Between Failures (MTBF)?
MTBF shows how often a system fails and can be used to calculate the expected longevity of that system.
How to Calculate MTBF
Rather than looking at the time between when a service outage happens and when it is restored, we look at the total operational (online) hours of the service and divide that period by the number of times it has gone down.
This means that in our last example above, the first recovery that took a long time isn’t as significant in the calculation. For example, a system may have been running for 21,000 hours and been disrupted 3 times, giving an MTBF of 7,000 hours.
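As a rough sketch, the same calculation in Python looks like this. The function name `mtbf` is an illustrative assumption, and the inputs are the figures from the example above.

```python
def mtbf(operational_hours, failure_count):
    """Mean time between failures: total online hours divided by failures.

    Only time spent operating counts; downtime during repairs is excluded.
    """
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return operational_hours / failure_count


# 21,000 hours of operation interrupted 3 times.
example_mtbf = mtbf(21_000, 3)
print(example_mtbf)  # 7000.0
```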
MTTR vs. MTBF
In the examples above you can see how recoveries might radically change each of the metrics. A service that has an MTBF of 10,000 hours but an MTTR of 24 hours might be valued differently from a service with an MTBF of 8,000 hours but an MTTR of 1 hour. Generally, when considering Mean Time to Recovery vs. Mean Time Between Failures, the first metric is more important, as it represents the action an engineer will take to reduce downtime and the out-of-pocket expenses the business incurs from lost service.
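One way to compare the two services is to convert each pair of figures into an availability percentage. The sketch below uses the common steady-state approximation, availability = MTBF / (MTBF + MTTR). That formula is not stated in this article and is brought in here as an assumption, and the function name `availability` is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: share of time the service is up.

    Assumes each failure cycle is MTBF hours of uptime followed by
    MTTR hours of downtime.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


service_a = availability(10_000, 24)  # longer MTBF, slow recovery
service_b = availability(8_000, 1)    # shorter MTBF, fast recovery

# Expected downtime per year implied by each availability figure.
downtime_a = (1 - service_a) * HOURS_PER_YEAR
downtime_b = (1 - service_b) * HOURS_PER_YEAR

print(f"Service A: {service_a:.4%} available, ~{downtime_a:.1f} h down per year")
print(f"Service B: {service_b:.4%} available, ~{downtime_b:.1f} h down per year")
```

Under this approximation the second service comes out ahead (roughly 1 hour of downtime a year against roughly 21), even though it fails more often, which illustrates why recovery time tends to dominate.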
What Is Mean Time to Detect (MTTD)?
MTTD, sometimes also referred to as Mean Time to Acknowledge, is the metric you use to determine how effective you are at noticing and responding to outages and notifications. The closer it is to 0, the better. It can be measured for a single component or for an entire service in one or multiple data centers.
How to Calculate MTTD
To calculate, take the sum of the incident detection times in minutes and divide it by the number of incidents in a given period, such as a year or a month. For example: There were 5 incidents that took 16, 21, 4, 8 and 36 minutes respectively to detect. (16+21+4+8+36)/5 = 85/5 = 17 minutes MTTD.
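The arithmetic is easy to check in code. In this minimal sketch the function name `mttd` is an illustrative assumption, and the input is the list of five detection times from the example: they sum to 85 minutes, so the mean is 17 minutes.

```python
def mttd(detection_minutes):
    """Mean time to detect: total detection time divided by incident count."""
    if not detection_minutes:
        raise ValueError("at least one incident is required")
    return sum(detection_minutes) / len(detection_minutes)


# Detection times, in minutes, for the five incidents in the period.
incidents = [16, 21, 4, 8, 36]
example_mttd = mttd(incidents)
print(example_mttd)  # 17.0
```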
MTTR vs. MTBF vs. MTTD
While a high MTTD and a low MTBF are key reasons to take corrective action to fix gaps in service, it is still most important to track MTTR KPIs.
What Is Mean Time to Failure (MTTF)?
Mean Time to Failure describes a whole system of assets within an organization: it is the average time one of those assets operates before it fails.
How to Calculate MTTF
To calculate this, take the total number of operating hours (uptime) and divide it by the number of monitored assets that failed. For example, if a system of 20 assets, treated as a single unit, had 99.988% uptime over a year (8,760 hours) with 3 disruptions, the MTTF would be (8,760 × 0.99988)/3 ≈ 2,919.65 hours.
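The example can be worked through as follows. This sketch assumes the 20 assets are treated as one system whose uptime is 99.988% of a 365-day year (8,760 hours). That reading is an interpretation of the example, and the function name `mttf` is illustrative. The result is within rounding of the figure quoted above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def mttf(uptime_hours, failure_count):
    """Mean time to failure: operating hours divided by number of failures."""
    if failure_count <= 0:
        raise ValueError("MTTF is undefined without at least one failure")
    return uptime_hours / failure_count


# Assumption: the system as a whole was up 99.988% of the year.
uptime_hours = HOURS_PER_YEAR * 0.99988
example_mttf = mttf(uptime_hours, 3)

print(f"Uptime: {uptime_hours:.2f} h, MTTF: {example_mttf:.2f} h")
```

If each of the 20 assets were instead counted separately, the total operating hours would be 20 times larger, so it is worth stating which convention a reported MTTF uses.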
Maximize Uptime with the Right Network Monitor
The demands of your network are unique, but the right network fault management software can help you tailor your ideal alert system. Entuity Software doesn’t operate in a fractional, a la carte capacity that requires you to piece together a coherent monitoring system. Our comprehensive network monitoring software helps you design event, syslog and trap management for your distinctive organization and minimize the business impact of component failures.
Quickly measure and respond to network failures with Entuity – contact us to schedule a demo today!