Surviving Planned or Unplanned Outages: Zero Downtime Business Applications

Overview

This is not merely another iteration of conventional business continuity methodologies such as disaster recovery, active-passive, or active-active. Instead, the emphasis lies on crafting, constructing, and implementing a fault-tolerant solution capable of confronting and autonomously resolving infrastructure failures without human intervention. The solutions and themes discussed in this document are not meant to encompass an exhaustive list of possibilities.

Achieving zero downtime for your application hinges on constructing an infrastructure capable of withstanding failures with minimal or no external recovery efforts required. While this article primarily targets the NonStop community, many of the principles discussed are equally applicable across other platforms with slight adjustments tailored to those environments. I aim to present the information within the context of a previously established system, highlighting the specific challenges and achievements unique to that implementation.

However, it’s important to note that every situation is distinct, and this is not a one-size-fits-all endeavor. Each client is likely to encounter different constraints, necessitating alternative solutions. The approach outlined in this article serves as a framework to embark on your journey, recognizing the need for flexibility and customization based on individual circumstances.

The NonStop platform has undoubtedly made many of us, myself included, accustomed to streamlined processes and procedures for managing unforeseen events. However, it has also led us to believe that there might be a simple solution to eradicate downtime in business applications. The reality is that this platform has laid the groundwork for integrating fault tolerance into business applications, yet this opportunity often goes unrecognized. In his 1962 speech about the moon landing, John F. Kennedy famously said, “…and do the other things, not because they are easy, but because they are hard…”. I liken his words to the effort required in building an application resilient to various failure scenarios; it will be challenging but certainly not insurmountable. So, are you ready to embark on this lunar-like journey?

To start, we will dissect the different elements comprising your IT infrastructure. Each of the following will be addressed individually in subsequent sections:

  • Hardware
  • Network
  • Application
  • Database

Before delving into the specifics, I would like to emphasize that these individual components are not presented in any predetermined sequence. The manner in which you opt to implement them—whether individually, iteratively, or as a cohesive endeavor—will depend solely on your available resources and the potential impact on regular business operations.

Hardware

For a brief moment, let’s contemplate the fundamental essence of the NSK platform—it’s engineered to isolate and withstand multiple points of failure without significant interruption. This very principle should guide us in mitigating the impact of catastrophic or complete system outages on our application. Such outages can stem from power failures, human error, natural disasters, system software upgrades, and so forth. The aim of segregating portions of the hardware platform from the rest of the system is to allow outages to occur with minimal disruption to business operations.

The initial hardware aspect to consider is ensuring peak performance to uphold the agreed-upon Service Level Agreements (SLAs) for business services. Secondly, determine the minimum level of business functionality that can be sustained during failure scenarios—essentially, which business functions or SLAs can be compromised. Additionally, anticipate any future business growth, as this presents an ideal opportunity to incorporate flexibility for forthcoming changes in the business landscape. These parameters collectively define the hardware requirements necessary to support the business effectively.

Now is the moment to reconsider the hardware configuration. It is essential to keep in mind a key objective: to segment subsections of the platform from one another, mitigating the impact of a catastrophic failure. For instance, consider a cluster of four 4-CPU NSK systems, which offer the same processing capacity as a single 16-CPU configuration. By ensuring that each system within the cluster is physically and geographically isolated from the others, you accomplish the primary objective of isolating multiple hardware failures from a single event.

Adopting a default configuration of at least three systems within a cluster is strongly advised to minimize the impact of any single node failure. This recommendation stands unless Service Level Agreements (SLAs) and acceptable levels of reduced business functionality can withstand a complete failure of one of the nodes within a two-node cluster.
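
To put the two-node versus three-or-more-node recommendation in concrete terms, here is a small illustrative calculation, assuming equal-sized nodes and a 16-CPU total; the figures are illustrative only and are not taken from the original project.

```python
# Illustrative arithmetic only: how much processing capacity survives the
# loss of one node, assuming the total CPU count is split evenly.
def surviving_capacity(total_cpus: int, nodes: int) -> float:
    """Fraction of total CPU capacity left after losing a single node."""
    cpus_per_node = total_cpus / nodes
    return (total_cpus - cpus_per_node) / total_cpus

for nodes in (2, 3, 4):
    pct = surviving_capacity(16, nodes) * 100
    print(f"{nodes}-node cluster: {pct:.0f}% of capacity survives one node failure")
# 2 nodes -> 50%, 3 nodes -> 67%, 4 nodes -> 75%
```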

  • Pros:
    • Once implemented, any planned or unplanned outages will only impact a fraction of the available system resources.
    • With proper planning and consideration for hardware expansion, new systems and/or additional CPUs can be added to the network without impacting normal business operations.
    • Easier to perform hardware upgrades with little or minimized business impact.
    • Once the application has been engineered to systemically adapt to changing hardware configurations (planned or unplanned), the software can be installed and configured on the new node independent of the rest of the system.
  • Cons:
    • Increased system support is needed for the additional systems within the expanded ecosystem.
    • Multiple operating system upgrades will now be required; one for each node in the configuration.
    • These configurations can potentially result in additional Initial License and Software License fees.

Network

Traditionally, I have entrusted experts in this domain to deploy the solutions required to address business needs. For this discussion, attention will be directed towards three (3) key areas: the internal network configuration between the nodes, external access to other computers, and external access for users interacting with the application.

  • Internal network setup: Given the NonStop platform’s history of more than four decades, configuring internal networks might seem straightforward. However, the internal network should be arranged so that all systems can function as peers to one another. Furthermore, multiple connections between these machines must be established, spanning various communication processes, hardware controllers, and network routers. This setup helps mitigate the risk of multiple failures stemming from a single event.
  • Similar considerations should also be extended when constructing your external network configuration. Incorporate multiple lines, paths, and even service providers into your planning. For instance, in a previous scenario, we had established a dedicated T1 trunk line between two of our central systems and each client site (two sites). Each trunk line was configured with a single line handler controlling all connections.
    However, through our re-engineering efforts, we broke this configuration down into multiple 56 Kbps circuits distributed across various systems, employing multiple line handlers and engaging multiple service providers. The objective behind these adjustments was to reduce the impact of a single network line failure from at least 25% of traffic to roughly 1%. Further elaboration follows on how to eliminate outages resulting from a data communications failure.
  • The final aspect of network considerations is user access to the system. In the aforementioned scenario, the users of this software were devices connecting to our radio network and accessing the application via a network line handler. Therefore, the devices were reconfigured to connect to any available connection through a Network Load Balancer.
    This same methodology can be applied to TCPs (Terminal Control Processes) within the PATHWAY environment. Start by configuring multiple TCP/IP listeners per system, ideally one per CPU, to maximize coverage. Each TCP is then configured for terminal connectivity, handling a subset of terminals from each listener process. Distribute these IP connections across multiple service providers as well. With this configuration, the Load Balancer will manage individual outages at the network layer, the listener process, the TCP process, or the entire system; a small sketch of this failover behavior follows below.

The essential component required at the network layer is the Load Balancer.
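
As a rough sketch of what the load balancer (or equivalent client-side fallback logic) provides, the fragment below tries a list of listener endpoints in turn until one accepts the connection. The endpoint names and ports are hypothetical, and doing this in application code is an assumption for illustration; in the implementation described here the distribution was handled by the network load balancer itself.

```python
import socket

# Hypothetical listener endpoints spread across nodes, CPUs, and providers.
LISTENERS = [
    ("nodea-listener1.example.net", 5001),
    ("nodea-listener2.example.net", 5002),
    ("nodeb-listener1.example.net", 5001),
    ("nodec-listener1.example.net", 5001),
]

def connect_with_failover(endpoints, timeout=3.0):
    """Return a socket connected to the first reachable listener."""
    last_error = None
    for host, port in endpoints:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:      # listener process, CPU, or node is down
            last_error = err
    raise ConnectionError(f"all listeners unreachable: {last_error}")
```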

  • Pros:
    • Dynamic routing around network failures without human intervention.
    • When utilizing a network Load-Balancer, workload is more evenly distributed.
    • Configurations can be dynamically customized to accommodate changing network needs.
    • Improved utilization of invested hardware.
  • Cons:
    • More network components will need to be monitored and maintained.
    • Potentially increased hardware costs.

Application

Given that this article centers on NonStop, we will assume that PATHWAY is the application environment in use. Designing an application to withstand various infrastructure failures may appear daunting, but it is more manageable than one might think. One of the initial steps is to challenge traditional approaches to software development and environmental configurations for the NonStop platform. Let’s commence by adjusting the PATHWAY configuration to incorporate duplicate server classes of the same type; deploying multiple PATHWAY environments within a single system; removing “hardcoded” PATHSEND commands from the application; strategically employing Context-Sensitive and Context-Free servers; enhancing tolerance for PATHSEND errors; and, finally, integrating data-driven and black-box logic into the application.

  1. It’s crucial to configure multiple identical PATHWAY environments on each system within the network of systems. When appropriately set up, this ensures an alternative access path to the application and its associated data in case of issues with the primary PATHWAY. The configuration and utilization of alternate PATHWAYs will depend on your business requirements and available resources. However, in the discussed example, a secondary PATHWAY was employed to facilitate dynamic modifications to the application without causing an application outage. New or removed functionality would be configured in the secondary PATHWAY, and the associated message routing tables would be updated to point to the newly configured PATHWAY. All servers would then be instructed to re-read their message routing tables for updates. Once all traffic had been successfully routed to the other PATHWAY, the original servers could be shut down. Further discussion may be necessary to fully understand the various possibilities.
  2. Typically, the PATHSEND command within an application is hardcoded with specific values for node, PATHMON process name, and server class name. To enhance flexibility and resilience, the application was adapted to use variables for these values, sourcing them from a database ‘routing’ table. This approach enables the application to access multiple systems and/or server classes via alternate paths, ensuring normal functionality while tolerating outages. Refer to item #4 for further details.
  3. In our application, we chose to maximize the implementation of context-free servers (black-box logic), reserving sensitivity to the context of messages only at transaction endpoints where information such as sender, receiver, purpose, and location was required for appropriate actions. Message routing was based on message type rather than message content. In our scenario, these endpoints primarily consisted of Line Handlers used for managing remote client computer systems or handheld radio/satellite devices. The extent to which Context-Sensitive/Free servers are employed in your business will depend on your individual business environment. An added advantage of utilizing Context-Free servers is that all messages are treated as “fire and forget.” It is the message type that dictates the subsequent destination, including routing any responses back to the originator.
  4. The main objective of these modifications is to detect failures and dynamically reroute around them seamlessly, without the end users being aware of any issues. To achieve this, we implemented the changes described in item #2 and moved the server-to-server message routes (i.e., the PATHSEND attributes) into a database. Additionally, alternate routes were loaded to enable retry logic within the application to keep the message progressing. The breadth and depth of these configurations depend on the specific requirements of the organization. In our case, we consistently employed four (4) distinct paths between servers, spanning multiple nodes, to ensure reliable message delivery. For this approach to be effective, the application needs to be adjusted to exhaust all available routes before declaring a failure; a sketch of this routing-and-retry logic appears after the Pros and Cons below.
  • Pros:
    • Dynamic routing around outages within the application.
    • New application modifications can be deployed in real-time.
    • Interactive ‘Routing’ modifications can be introduced as needed.
    • With proper configuration, all available system resources can be utilized simultaneously, maximizing the corporate capital investment.
    • Depending upon the rest of the system configuration, application-level load balancing can occur.
  • Cons:
    • Application modifications are needed to facilitate support for dynamic routing.
    • Extra effort must be put toward a good Capacity Planning model to ensure that enough system resources are available for each failure scenario.
    • Additional system resources are required to support multiple application environments.
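
As a sketch of the routing-table and retry logic described in items 2 and 4, the Python fragment below (the original implementation was not Python) looks up the PATHSEND attributes for a message type and walks through every configured alternate route before declaring a failure. The table contents, node and process names, and the `pathsend` helper are hypothetical; on NonStop the actual send would go through the Pathsend interface (e.g., SERVERCLASS_SEND_).

```python
from dataclasses import dataclass

@dataclass
class Route:
    node: str          # hypothetical node name, e.g. "\\PROD1"
    pathmon: str       # hypothetical PATHMON process, e.g. "$PM01"
    serverclass: str   # hypothetical server class, e.g. "BILL-SVR"

# Hypothetical routing table keyed by message type; in the system described
# here these entries lived in a database table and were re-read on demand.
ROUTING_TABLE = {
    "BILLING": [
        Route("\\PROD1", "$PM01", "BILL-SVR"),
        Route("\\PROD1", "$PM02", "BILL-SVR"),
        Route("\\PROD2", "$PM01", "BILL-SVR"),
        Route("\\PROD3", "$PM01", "BILL-SVR"),
    ],
}

def pathsend(route: Route, message: bytes) -> bytes:
    """Placeholder for the platform send call (SERVERCLASS_SEND_ on NonStop)."""
    raise NotImplementedError

def send_with_failover(msg_type: str, message: bytes) -> bytes:
    """Try every configured route for a message type before declaring failure."""
    errors = []
    for route in ROUTING_TABLE.get(msg_type, []):
        try:
            return pathsend(route, message)
        except Exception as err:   # path, PATHMON, or node unavailable
            errors.append((route, err))
    raise RuntimeError(f"all routes exhausted for {msg_type}: {errors}")
```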

Database

The significance of corporate data assets is frequently disregarded and undervalued; nevertheless, a corporation’s data typically stands out as its most valuable asset. Among the four infrastructure areas covered in this document, managing databases poses the greatest challenge in designing a business continuity solution. Numerous solutions exist today for ensuring access to these assets. Consistent with the overall theme of this document, this section will delve into how we ensured immediate data availability despite any failures.

Our initial goal was to define, document, and distribute a comprehensive list of known or anticipated failure scenarios, along with the business rules to apply in each situation. These rules received consensus at every level, from management to technicians. Additionally, we explored various solutions, considering factors such as time, budget, human resources, and technological capabilities, before ultimately opting to design and implement our own database solution tailored to our specific requirements.

The most straightforward solution would have been to employ one of the numerous data replication software packages available. Nevertheless, due to the expense and the time required to establish and configure a performance-driven implementation that met our standards, management opted against this approach. The cost of a third-party software solution, combined with the potential for unintended delays and the risk of data loss during a catastrophic failure, was deemed unacceptable.

Upon evaluating the distinct categories of data managed within the application, such as auditing, billing, customer data, logs, registration, routing, timers, etc., we designed and implemented several partitioned NonStop SQL/MP tables. The Customer table was identified as the backbone of the business, crucial to the functioning of nearly all other processes. To ensure resilience, we created two copies of the Customer table, each partitioned so that half of its data resided on each of two different systems. The partitions were arranged so that each system housed one half from each copy, giving every system a complete set of our customer information. This necessitated modifying the application to read from and write to both tables.
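
One way to picture the placement (an assumption about naming, not the original DDL): each copy is split into a low and a high key-range partition, positioned so that every system ends up holding a complete customer data set.

```python
# Illustrative placement only; system and partition names are hypothetical.
# Each copy is split into a low and a high key-range partition, placed so
# that every system holds a complete customer data set across the two copies.
CUSTOMER_LAYOUT = {
    "SYSTEM-A": {"COPY1": "LOW",  "COPY2": "HIGH"},
    "SYSTEM-B": {"COPY1": "HIGH", "COPY2": "LOW"},
}

for system, parts in CUSTOMER_LAYOUT.items():
    assert set(parts.values()) == {"LOW", "HIGH"}, f"{system} is missing a key range"
```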

In the case of a write operation, the application was designed to proceed without interruption as long as at least one of the write operations succeeded. To prevent the use of outdated data following a failed write attempt, we incorporated a last-modification timestamp into the tables. The application was also engineered to read from both tables, comparing the last-modification timestamps to verify that the data was accurate. If the timestamps were identical, the records were deemed synchronized; otherwise, the record with the more recent timestamp was considered the data of record.
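
A minimal sketch of that dual-write and timestamp-comparison logic, assuming two table copies reached through separate accessor objects; the `upsert`/`fetch` helpers and the error handling are assumptions for illustration, not the original application code.

```python
from datetime import datetime, timezone

def write_customer(record: dict, copies: list) -> None:
    """Write to both table copies; succeed as long as at least one write lands."""
    record["last_modified"] = datetime.now(timezone.utc)
    successes = 0
    for copy in copies:
        try:
            copy.upsert(record)        # hypothetical per-copy accessor
            successes += 1
        except IOError:                # that copy or partition is unavailable
            pass
    if successes == 0:
        raise IOError("customer write failed on every table copy")

def read_customer(key, copies: list) -> dict:
    """Read both copies and treat the most recently modified row as the record."""
    rows = []
    for copy in copies:
        try:
            row = copy.fetch(key)      # hypothetical per-copy accessor
            if row is not None:
                rows.append(row)
        except IOError:
            pass
    if not rows:
        raise LookupError(f"customer {key!r} unavailable on all copies")
    return max(rows, key=lambda r: r["last_modified"])
```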

The registration table was uniquely structured as a single table with multiple partitions across six machines, each located in one of our radio network control centers. This table contained device information specific to that location. The application was programmed to search for the device in its last known location first. If it failed to find the device there, it would scan the entire table to determine if the device had roamed to a different location. To provide redundancy, each radio controller was connected to a secondary backup NonStop system. This ensured that in the event of a system failure, devices could still be located and utilized without any disruption.
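
The roaming-device lookup follows the same pattern: check the partition for the device’s last known control center first, then scan the remaining partitions. The partition names and the `fetch` accessor below are hypothetical.

```python
def find_device(device_id: str, last_known: str, partitions: dict):
    """Search the last known control center first, then the other partitions."""
    home = partitions.get(last_known)
    if home is not None:
        row = home.fetch(device_id)    # hypothetical partition accessor
        if row is not None:
            return row
    for name, partition in partitions.items():
        if name == last_known:
            continue                   # already checked the home location
        row = partition.fetch(device_id)
        if row is not None:
            return row                 # device has roamed to this location
    return None                        # not registered anywhere
```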

Our assessment concluded that the application could absorb the extra I/O requests needed to guarantee that the data recorded in the database remained current for other users and processes. Although this approach did not reduce or remove the additional I/O operations, it enabled the application to detect when and where a failure occurred, allowing it to address the disruption automatically and in real time.

  • Pros:
    • Managing distributed data within the application provides an immediate, dynamic response that keeps application transactions completing successfully.
    • Real-time alerts to support personnel are built into the application.
    • Reduced operating costs.
    • No external third-party cost, training, or monitoring.
  • Cons:
    • Additional time and testing of application changes will be necessary.
    • Depending upon the design selection, additional I/O operations may occur within the transaction path.
    • Deeper upfront analysis is required to identify opportunities for failures and the necessary business rules to apply to each failure scenario.

Summary

This solution is best envisioned as a unified enterprise system composed of a cluster of nodes, where each node operates independently yet collaboratively shares access to a single database instance. Although the process of building a fault-tolerant or zero-downtime application can seem daunting, it can be mastered if taken in steps. As you may have noticed, the database is the most complex and challenging component you will face, but that should not stop you from building out the other parts of this framework while you tackle the database.

About the Author

Greg Hunt is a solutions architect with more than 30 years of industry experience in programming, systems management, data analytics, and query/database design and performance for NonStop and other platforms. After working for nearly a decade at HPE, Greg brought his extensive knowledge of architectural practices, data management, and database platform experiences to Odyssey Information Services in 2016, where he helped Odyssey’s clients improve overall data utilization and database performance.

Acknowledgments

While the content of this document originates from a past implementation I participated in, I’ve sought the input of a select group of friends, coworkers, and trusted confidants to ensure accuracy, validity, and readability. To each of them, I extend my appreciation for their valuable comments and contributions to this endeavor. My heartfelt thanks go out to:

  • Allison, Jeff
  • Ballard, Ben
  • Stanford, Donna
  • Wetzel, Ronald

“We cannot solve our problems with the same thinking we used when we created them.” – Albert Einstein
