What Is IT Resilience? – Strategy and Planning for Technology Resiliency

Park Place Hardware Maintenance

Lee Walker August 15, 2022

Businesses large and small rely on technology to support critical business operations. IT ecosystems have evolved to deliver always-on capabilities that support organizations’ goals and ensure service availability for employees and customers alike.

The servers and storage systems that run your business’s applications and house your critical data also rely on stable network infrastructure in order to communicate. If the network goes down, your business is effectively crippled. Unfortunately, servers and storage systems can fail, network downtime can and does happen, and the cost to your organization can be significant.

IT resilience plans help reduce both outage frequency and severity.

What Is IT Resilience?

IT resilience is a measure of an organization’s ability to continue operating amid disruptions in underlying systems, as well as its ability to mitigate and recover from outages. It means maintaining acceptable levels of service and access despite software and hardware failures, inadvertent failures caused by configuration changes (read “human error”), and occasional increases in demand that can tip the system over the edge.

A solid resilience strategy is essential for the modern digital business, and should encompass the following:

  • Sufficient capacity to cope with the day-to-day, and occasional spikes in demand
  • Continuous monitoring to provide real-time insights and enable proactive action against outages and poor user experience
  • Change control and detection, with reviews for correctness and policy conformance
  • Security provisions to protect against intrusion or malicious attacks
  • High Availability for services where no downtime can be tolerated
  • Being prepared for swift recovery when (not if) failures occur, e.g.
    • Maintaining active maintenance contracts for your hardware, or stocking spares
    • Keeping backups of critical system configurations, for rapid reimaging or rollback
    • Having a checklist of tests to quickly validate system readiness

Resilience vs. High Availability

Resilience and high availability (HA) are often confused, but they are not the same thing. Admittedly, they sound similar when discussed in terms of IT strategies, and they work toward the same goal of achieving higher reliability within an organization.

HA works through redundancy, whereby a system can fail over from primary to secondary components if the primary components fail. Additionally, it can ensure that data is replicated across multiple data stores to avoid any single point of failure. HA can be costly to implement, typically requiring two (or more) sets of components and dedicated connectivity between them to achieve one overall system. As such, HA is often reserved for those services where no downtime can be tolerated.
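
To make the trade-off concrete, here is a minimal sketch of how redundancy affects availability. It assumes component failures are independent and that failover is instantaneous and always succeeds, which real systems rarely achieve. The 99% figure and the function names are hypothetical, chosen only for illustration.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` parallel components where any one can serve.

    Assumption: failures are independent and failover is perfect.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one component
    for copies in (1, 2, 3):
        a = redundant_availability(single, copies)
        print(f"{copies} copies: availability {a:.6f}, "
              f"about {downtime_hours_per_year(a):.2f} hours of downtime per year")
```

Each extra copy sharply reduces expected downtime under these assumptions, but it also multiplies hardware and connectivity costs, which is why HA tends to be reserved for the most critical services.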

However, as outlined above, high availability is just one key aspect of a broader IT resilience plan.
Component failure is not always the cause of overall service outage or degradation. Poor configuration and insufficient capacity are often just as much to blame, and therefore require equal attention to ensure overall service resilience. In turn, failing over to secondary components is only useful if they’re ready and able to deliver the same service as the primary, which is often not the case in practice. For example, failing over from a fast link to a slower backup link may not be sufficient to deliver the service users expect.

It’s also important to remember that when a highly available system does fail over, the overall service is no longer covered by redundancy until the failover is detected and the failed components are identified, replaced and restored. Until then, you aren’t truly providing a high availability system. This makes continuous monitoring, hardware maintenance, and the backup and restoration of your system’s configuration critical pieces of your overall resilience strategy.

3 Steps Toward an IT Resilience Strategy

So how do you achieve IT resilience? What does data center or application resiliency look like in terms of strategy? And what does resilience mean with reference to your IT architecture? Taking a top-down approach to assess your business and its IT could look something like this:

1. Identify Essential Business Services and Determine the Impact of Failure

First, you should define what normal business operation requirements are. What services does your business utilize or provide? Which are “essential” for your business to operate, and which are not? What are the minimum acceptable service levels you must achieve?

In turn, you should aim to determine the business impact of service failure or degradation. In other words, what is the cost of a service outage over a given time period: 15 minutes, one hour, one day? And remember, it’s not just about the immediate costs, such as hardware replacement, staff required to restore the system, and lost productivity. Try to also consider how any outage may affect the customer journey and customer satisfaction. How does that translate into brand damage or loss of market share? The true cost of service outages is often far greater than initial estimates.
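
As a starting point, the immediate costs can be roughed out with simple arithmetic. The sketch below is one possible model, not an established formula: every field name and figure is a hypothetical placeholder to be replaced with your own numbers, and it deliberately leaves out harder-to-quantify effects such as brand damage.

```python
from dataclasses import dataclass


@dataclass
class OutageCostModel:
    """Hypothetical inputs for estimating the immediate cost of an outage."""
    revenue_per_hour: float     # revenue that depends on the service
    staff_affected: int         # employees who cannot work normally
    staff_cost_per_hour: float  # loaded hourly cost per affected employee
    productivity_loss: float    # fraction of work lost during the outage (0 to 1)
    fixed_recovery_cost: float  # parts, call-out fees, overtime, etc.

    def cost(self, outage_minutes: float) -> float:
        """Estimated immediate cost of an outage lasting `outage_minutes`."""
        hours = outage_minutes / 60.0
        lost_revenue = self.revenue_per_hour * hours
        lost_productivity = (self.staff_affected * self.staff_cost_per_hour
                             * self.productivity_loss * hours)
        return lost_revenue + lost_productivity + self.fixed_recovery_cost


if __name__ == "__main__":
    # All figures below are made up for illustration.
    model = OutageCostModel(
        revenue_per_hour=10_000,
        staff_affected=50,
        staff_cost_per_hour=40,
        productivity_loss=0.5,
        fixed_recovery_cost=2_000,
    )
    for label, minutes in (("15 minutes", 15), ("one hour", 60), ("one day", 24 * 60)):
        print(f"{label}: {model.cost(minutes):,.0f}")
```

Treat the result as a floor rather than a ceiling, since customer churn and reputational effects are not captured here.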

Customer-facing systems and internal collaboration tools are likely to rank highly, whilst the company news intranet and online training systems are not going to stop business moving forward if they’re unavailable for a while.

Trying to solve everything all at once is never easy, and budgets are not unlimited, so answering these questions can help you to focus on those services which are most important to your business. The answers will also act as a guide when considering how much it is sensible to spend on making these services resilient, and provide good leverage when putting together an associated business case.

2. Identify People, Processes, and Infrastructure that Support Your Essential Services

Now that you’ve identified and prioritized those services that have the greatest impact on your business outcomes, you must drill one layer deeper to uncover service dependencies: the people, processes, and infrastructure (internal and external) that are essential to support these services.

Regarding infrastructure, you can conduct a manual IT infrastructure audit, or you can use tools to automatically inventory all of your IT assets, determine physical and logical connections, and run inline path analysis to determine application dependencies, thus reducing the opportunity for human error. Manual audits are generally not feasible or cost-effective on larger systems.
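
Once an inventory exists, resolving what a service ultimately depends on is a graph traversal. The following is a minimal sketch, assuming the inventory can be exported as a simple mapping from each item to the items it directly depends on. The service and device names are invented, and this is not how any particular discovery tool works internally.

```python
from collections import deque


def all_dependencies(graph: dict[str, list[str]], service: str) -> set[str]:
    """Return every direct and indirect dependency of `service`.

    `graph` maps an item to the items it directly depends on.
    Uses breadth-first search; the `seen` set keeps it safe on cyclic graphs.
    """
    seen: set[str] = set()
    queue = deque(graph.get(service, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


if __name__ == "__main__":
    # Hypothetical inventory export.
    inventory = {
        "web-shop": ["app-server", "load-balancer"],
        "app-server": ["database", "core-switch"],
        "load-balancer": ["core-switch"],
        "database": ["storage-array"],
    }
    for dependency in sorted(all_dependencies(inventory, "web-shop")):
        print(dependency)
```

Running the same traversal for each essential service shows which components, such as the shared core switch in this example, sit underneath several services at once.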

Your people and processes are more specific to your organization, so you may need to consult org charts and SOP documents for your IT Department (and even extend to your Procurement and SecOps) to get a full view of the variables in your service delivery.

3. Identify Weaknesses and Opportunities to Improve IT Resiliency

Having determined all components that make up your essential services, you can start to look for weaknesses and opportunities to make them more resilient to failure or degradation, e.g.

  • Are all key components covered by active warranty, with suitable resolution timeframes?
  • Are data backup processes in place and active?
  • Are critical components deployed in a High Availability fashion?
  • Is there sufficient redundancy in the network? Have failover scenarios been tested?
  • Are all components being actively monitored?
  • Are there bottlenecks along application paths that may limit performance?
  • Is future capacity catered for, based on historical service usage data?
  • Is the software/firmware running on all components up to date and secure?
  • Do we have the staff necessary to manage the services effectively?
  • etc…
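
Several of these questions can be answered mechanically once component data is available. The sketch below shows one way to flag gaps, assuming each component record carries a handful of yes/no attributes. The record structure, field names, and sample devices are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical inventory record for one infrastructure component."""
    name: str
    critical: bool
    under_maintenance_contract: bool
    monitored: bool
    config_backed_up: bool
    redundant: bool


def find_gaps(components: list[Component]) -> dict[str, list[str]]:
    """Map each component name to the resilience gaps found for it."""
    gaps: dict[str, list[str]] = {}
    for c in components:
        issues = []
        if not c.under_maintenance_contract:
            issues.append("no active warranty or maintenance contract")
        if not c.monitored:
            issues.append("not actively monitored")
        if not c.config_backed_up:
            issues.append("configuration not backed up")
        # Redundancy is only flagged for components marked as critical.
        if c.critical and not c.redundant:
            issues.append("critical but not deployed in a High Availability fashion")
        if issues:
            gaps[c.name] = issues
    return gaps


if __name__ == "__main__":
    estate = [
        Component("core-switch", True, True, True, True, False),
        Component("lab-router", False, False, True, True, False),
        Component("storage-array", True, True, True, True, True),
    ]
    for name, issues in find_gaps(estate).items():
        print(f"{name}: {'; '.join(issues)}")
```

Questions about bottlenecks, future capacity, and staffing need richer data than simple flags, so a check like this covers only part of the list.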

This step can seem a little daunting at first, but it doesn’t need to be a drawn-out manual process. Employing the right tools and services, and working with trusted technology partners who offer remote hands to augment existing staff, can all help make this process more manageable.

Alternatively, you may want to consider simply engaging an IT infrastructure managed services provider to proactively monitor your data center hardware for you, enabling fact-based decision-making and helping to drive IT strategy and transformation.

The following activities are just some examples of opportunities to improve IT resilience within your organization.

Make Network Observability Data Actionable

Data is critical to building effective IT resilience plans, but if that information is not actionable, it does little good. You must be able to do more than simply access data about your network. You need to be able to act on that information in time to mitigate pending failures, bottlenecks, and situations that affect service/operations and network availability.

Achieving network observability and making data actionable will require more than just taking it out of any silos that exist within the organization. The first step involves collecting, correlating and visualizing data in order to garner insights into what’s going on in your IT estate. Enterprise network monitoring software tools have this covered. In turn, the use of AI and/or machine learning can take that to another level by highlighting correlations and patterns that humans fail to spot. It’s all about ensuring that information can be used to glean insights, discover issues, and plan your environment correctly.

Build an Environment to Withstand Demand Emergencies

Demand continues to increase across IT environments, and that is not going to change. In fact, it will only continue to escalate. Take the early 2021 GameStop short squeeze as an example. As the stock price continued to rise and more investors tried to get skin in the game, platform resources became scarcer, to the point that customers were unable to access their accounts and make trades. Entire platforms ultimately crashed.

Organizations must build environments that can withstand such demand emergencies. To this end, enterprise monitoring tools can help you ensure future demand is catered for, based on historical service usage data and trend analysis techniques, as well as highlight existing hotspots and bottlenecks. Additionally, virtualization technologies can provide elastic capacity for unforeseen demand emergencies, avoiding the need and cost to over-provision permanently.
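
The simplest form of trend analysis is a straight-line fit to historical usage. The sketch below assumes usage is sampled at regular intervals and grows roughly linearly. That assumption will not predict sudden spikes, and commercial monitoring tools may use more sophisticated models. The function names and sample figures are hypothetical.

```python
from typing import Optional


def linear_trend(values: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for values sampled at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_capacity(values: list[float], capacity: float) -> Optional[float]:
    """Periods after the last sample until the fitted trend reaches `capacity`.

    Returns None when the trend is flat or falling, and 0.0 when the
    trend line is already at or above capacity.
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    x_at_capacity = (capacity - intercept) / slope
    return max(0.0, x_at_capacity - (len(values) - 1))


if __name__ == "__main__":
    monthly_peak_gbps = [4.1, 4.6, 5.0, 5.7, 6.1, 6.8]  # made-up history
    remaining = periods_until_capacity(monthly_peak_gbps, capacity=10.0)
    print(f"Months until the link is projected to be full: {remaining}")
```

A projection like this indicates when permanent capacity should be added, while elastic capacity remains the safety net for spikes the trend cannot anticipate.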

Leverage Automation

Automation is one of the hallmarks of modern IT, but too few data center decision-makers realize how critical it is to their IT resilience strategy. When it comes to technology resilience, network automation can help streamline M&A activity, reduce manual effort and human error, and enable zero-touch remediation, thereby helping to avoid the critical situations that lead to outages in the first place.

Easily resolved, high-frequency tickets are popular use cases for automation. If your organization is spending many hours addressing recurring, bite-sized problems, then a small investment in automation today can lead to large cost savings and improved service in the future.
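
One way to find those candidates is to mine your ticket history. The sketch below assumes tickets can be exported as (category, minutes to resolve) pairs. It keeps categories that recur and are quick to fix, then ranks them by total time consumed. The thresholds, function name, and sample data are hypothetical.

```python
from collections import defaultdict


def automation_candidates(tickets, max_avg_minutes=15.0, min_count=2):
    """Rank recurring, quickly resolved ticket categories by total time spent.

    `tickets` is an iterable of (category, minutes_to_resolve) pairs.
    Returns a list of (category, count, total_minutes), largest total first.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for category, minutes in tickets:
        counts[category] += 1
        totals[category] += minutes

    candidates = [
        (category, counts[category], totals[category])
        for category in counts
        if counts[category] >= min_count
        and totals[category] / counts[category] <= max_avg_minutes
    ]
    return sorted(candidates, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    history = [  # made-up ticket export
        ("password reset", 10), ("password reset", 10), ("password reset", 10),
        ("account unlock", 5), ("account unlock", 5),
        ("disk full", 30),
        ("vlan change", 45), ("vlan change", 45),
    ]
    for category, count, total in automation_candidates(history):
        print(f"{category}: {count} tickets, {total:.0f} minutes in total")
```

Categories that take longer to resolve may still be worth automating, so the thresholds are best treated as a first filter rather than a rule.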

Remove Knowledge Bottlenecks and Lone Wolf Operators

There is a reason that many of today’s organizations have moved to a flat hierarchy. It is because when knowledge and capabilities are hoarded, the business suffers. The same is true when it comes to application and server resiliency. Bottlenecks and lone-wolf operators hamper critical infrastructures and hoard key assets, leading to more severe outages.

Organizations must try to avoid pooling critical knowledge and capabilities within the hands of just a few key people. When this happens, the handful of people with the knowledge and capabilities to solve critical problems end up overburdened with other responsibilities, hampering their ability to respond quickly. Instead, democratize those skills and use outages as learning opportunities to build capabilities in other employees.

Building relationships with the right data center professional services partners can help you analyze your current IT infrastructure resilience and document the knowledge you need in a pinch.

Proactively Maintain and Monitor

Too often, organizations take a reactive stance to IT challenges and growing demand. It is an “if it’s not broke, don’t fix it” mentality that ultimately leads to damage and increased service outages. A proactive stance instead allows organizations to spot challenges before they become serious issues and adjust before outages occur. This applies not just to the maintenance of critical systems, but also to active monitoring for precursors to detrimental events.

Achieve IT Resilience with the Right Data Center Networking and Optimization Partner

Achieving adequate standards for IT resilience is critical for organizations to successfully compete in today’s connected and highly digital world. Fortunately, Park Place Technologies is here to help! As a holistic, global data center networking and optimization firm, we can help you optimize every aspect of your IT resilience strategy.

From the initial audit of your current IT infrastructure to building out an environment that can withstand increased demand, to ongoing hardware maintenance services on post-warranty equipment and even automating the monitoring, management, and optimization of your enterprise network, we are here to assist you at every step of the way.

For more information about improving IT resilience for your organization, contact Park Place Technologies today!

About the Author

Lee Walker, Chief Technology Officer - Entuity