
What is NSFTI all about?

NonStop offers multiple ways of writing fault-tolerant applications. One of the earliest is the process pair. Writing an application with process-pair technology can be daunting, because it expects the programmer to have a deep understanding of several NonStop-specific programming aspects, such as data checkpointing, primary and backup processes, the File System (FS) API, and handling process-pair signals, to name a few. NSFTI (NonStop Fault Tolerant Interface) is a C++ library that hides this complexity behind a few C++ classes. The classes are designed to be usable by any C++ programmer, even one with no NonStop programming experience.

Is process pair technology relevant today?

The short answer is YES. Look around and you will see products, even outside NonStop, trying to achieve fault tolerance through process pairs. The term process pair might not be familiar, but the technology is. Let me substantiate with an example: Redis. When deployed in HA (high availability) mode, Redis runs as a process pair in which the primary is termed ‘master’ and the backup is termed ‘slave’ (‘replica’ in current Redis terminology). The primary checkpoints data to the backup to keep it up to date. There are Redis client libraries that seamlessly switch operations over to the backup when the primary fails.

Many other popular products try to achieve high availability using the same principles that NonStop’s process-pair technology employs.

OK! Got it. Tell me how the C++ classes help

An application programmer should only worry about programming application logic and handling data and not worry about infrastructural tasks such as creating a backup process, data checkpointing, and signal handling. NSFTI enables a programmer to focus on data and application logic and hides the infrastructural tasks.

NSFTI takes a data-centric approach to process-pair programming. It exposes in-memory data structure(s) for the programmer to keep critical data (that needs to survive process failure). The implementation ensures that the data is synchronized with the backup process, data consistency is never compromised, and the infrastructural tasks are all hidden from the programmer.

Really? How does it work?

Ah! If you have come this far and are still interested in NSFTI, then read on for more details on the library.

NSFTI for the technologist

The value of NSFTI is well understood by programmers who have written fault-tolerant applications using process-pair technology. It’ll be better appreciated by new-age programmers who must design and code fault-tolerant applications but have no knowledge of process-pairs.

NOTE: Fault tolerance can be achieved in ways other than process pairs. Here our focus is purely on process-pair technology.

Decoding process pairs

What does it take to create a fault-tolerant application using a process pair? At its core, a process pair is about (a) not losing data and (b) always having at least one process executing business logic. Let’s peek into the underpinnings of such an application.

Always have a backup

An application that is designed as a process pair starts off not as a pair but as a single process. That single process, let’s call it the primary, decides that it needs a backup to carry on its work in the event of its own failure, and so it creates a copy of itself, which we will call the backup. If either process fails, the survivor assumes the role of primary and ensures that a new backup is promptly started. Creating the backup and communicating between the two happens through the NonStop File System (FS) API. Multiple functions need to be invoked to create the backup process, to receive messages from each other, to watch out for signals, and the like. To an untrained eye, the functions in the API look like hieroglyphs, and learning them can be quite daunting.
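The lifecycle described above can be pictured with a small in-memory model. This is only a sketch: no real processes are created, the "process ids" are plain counters, and `PairModel`, `Role`, and `fail` are hypothetical names that do not come from the NonStop FS API or from NSFTI.

```cpp
#include <cassert>

// Toy, in-memory model of a process pair. Nothing here talks to the
// operating system; ids are just counters. All names are hypothetical.
enum class Role { Primary, Backup };

class PairModel {
public:
    // The application starts as a single process (id 1), which then
    // creates its backup (id 2).
    PairModel() : primary_(next_id_++), backup_(next_id_++) {}

    int primary() const { return primary_; }
    int backup() const { return backup_; }

    // Simulate the failure of one member of the pair. Either way, the
    // survivor ends up as primary and a new backup is started.
    void fail(Role who) {
        if (who == Role::Primary) {
            primary_ = backup_;   // the backup takes over the primary role
        }
        backup_ = next_id_++;     // a fresh backup replaces the lost process
    }

private:
    int next_id_ = 1;   // declared first so it is initialized before the ids
    int primary_;
    int backup_;
};

// Returns true if the pair is whole again after both kinds of failure.
bool demo_pair_survives_failures() {
    PairModel pair;                 // primary = 1, backup = 2
    pair.fail(Role::Primary);       // 2 takes over, 3 is the new backup
    const bool took_over = (pair.primary() == 2 && pair.backup() == 3);
    pair.fail(Role::Backup);        // 2 stays primary, 4 is the new backup
    return took_over && pair.primary() == 2 && pair.backup() == 4;
}
```

The point of the model is that the pair is symmetric with respect to failure: whichever process is lost, the application ends up with one primary and one freshly started backup.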

Keep the backup fully informed

A backup is of no use if it does not know what the primary is doing. How does it find out? The primary must share its state with the backup and update it each time the state changes. This is called data checkpointing. The primary communicates its state to the backup through calls to the NonStop FS API. The primary decides what data the backup needs in order to pick up exactly where the primary left off in the event of a failure.

In the event of a failure, the backup assumes the role of the primary, looks at the (checkpointed) data it has, and starts from where the primary left off. Along with this it starts a new backup and checkpoints its entire state to the new backup.
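To make checkpointing and takeover concrete, here is a minimal sketch that simulates both inside a single program. It assumes a made-up workload (summing a list of numbers) and a hypothetical `Checkpoint` struct; copying that struct into a variable stands in for the message a real primary would send to its backup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The state the backup needs in order to resume: the index of the next work
// item and a running total. Hypothetical; a real application decides what
// belongs in its checkpoints.
struct Checkpoint {
    std::size_t next_item = 0;
    long total = 0;
};

// Process items starting from `start`. After each item the new state is
// "checkpointed" by copying it into backup_copy. Processing stops after
// fail_after items to simulate a primary failure; pass items.size() (or
// more) for a run without failure.
Checkpoint process(const std::vector<int>& items, Checkpoint start,
                   Checkpoint& backup_copy, std::size_t fail_after) {
    Checkpoint state = start;
    std::size_t done = 0;
    while (state.next_item < items.size() && done < fail_after) {
        state.total += items[state.next_item];
        state.next_item += 1;
        backup_copy = state;   // checkpoint: the backup now knows this state
        ++done;
    }
    return state;
}

long demo_takeover_total() {
    const std::vector<int> items = {10, 20, 30, 40};

    Checkpoint at_backup;                        // what the backup has received
    process(items, Checkpoint{}, at_backup, 2);  // primary fails after two items

    // The old backup takes over: it sends its entire state to a new backup
    // and resumes from the last checkpoint it holds.
    Checkpoint at_new_backup = at_backup;
    const Checkpoint final_state =
        process(items, at_backup, at_new_backup, items.size());
    return final_state.total;                    // each item counted exactly once
}
```

Because the checkpoint is taken after every item, the new primary neither repeats nor skips work. How often to checkpoint, and what to include, is the design decision the text above leaves to the primary.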

And there are plentiful pitfalls

What should the primary do if the backup fails to start? What if the CPU on which the backup is to run is not started? How does the primary create the exact same environment as itself for the backup to run? And more.

A well-programmed process pair handles all of these cases, and programming for them requires in-depth knowledge of the NonStop API and of the process-pair programming paradigm.
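As an illustration of one such pitfall, the sketch below picks a CPU for the backup when some CPUs may be down. It is an assumption-laden toy: `pick_backup_cpu` is a hypothetical helper, CPUs are modeled as a vector of up/down flags, and it assumes the backup should run on a different CPU from the primary so that one CPU failure cannot take out both.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: choose a CPU for the backup process.
// cpu_up[i] is true when CPU i is running; primary_cpu must be a valid index.
// Returns the chosen CPU number, or -1 when no other CPU is available.
int pick_backup_cpu(int primary_cpu, const std::vector<bool>& cpu_up) {
    const int count = static_cast<int>(cpu_up.size());
    // Search the other CPUs, starting just after the primary's and wrapping
    // around. step never reaches count, so the primary's CPU is never chosen.
    for (int step = 1; step < count; ++step) {
        const int candidate = (primary_cpu + step) % count;
        if (cpu_up[candidate]) {
            return candidate;
        }
    }
    return -1;  // caller has to run without a backup for now and retry later
}
```

The `-1` case is the interesting one: the primary has to keep serving without a backup and try again later, which is exactly the kind of handling that makes hand-written process pairs laborious.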

The summary is…

Process-pair programming requires in-depth knowledge of the NonStop platform and is not for the faint of heart. The application programmer has two tasks to fulfill: coding the business logic, and coding the infrastructural tasks that achieve fault tolerance using process-pair technology.

The emergence of NSFTI

If you have followed the arguments so far and are wondering, “Why not spare the programmer the very complex infrastructural tasks and allow a focus on the business logic?”, then congratulations! You have understood the primary goal of the NSFTI library.

Data-centric view of process pairs

As we have already seen, conventional process-pair programming requires the programmer to understand a whole lot of NonStop technology. With NSFTI, the programmer can focus solely on business logic and leave the infrastructural tasks to the library. The library handles all process-pair tasks, such as creating the backup, checkpointing data to the backup, detecting peer failure, restarting the peer, and signal handling.

What, then, should the programmer do to use the library? The only thing the programmer needs to do is keep the state information in the data structure provided by the NSFTI library.

A Map for storing application state

The library provides a Map data structure whose contents are always kept in sync with the backup. The programmer just needs to store the application data that needs to be checkpointed in the Map. The Map is implemented as a C++ class with method signatures very similar to those of the C++ STL (Standard Template Library).

Outline of an NSFTI application

If you are a programmer reading this article, you will want to know what an NSFTI program looks like. Quite a few code snippets and example applications are shipped along with the product. If you are impatient, the following few paragraphs give you a brief outline of an NSFTI application. For more details, please refer to the online documentation of NSFTI [1].

Initialize the library

The first thing the application should do is initialize the library. Amongst other things, during initialization, the library will create the backup process and prime the backup process to receive checkpoint data from the primary.

std::shared_ptr<FTILib> initRet = FTILib::initialize();

Create data structure(s) to store application state

The application should then create data structure classes to store the application state. NSFTI provides functions to create ‘named’ data structures.

std::shared_ptr<FTILib> ftiLibInstance = FTILib::getInstance();

/* Get the MapFactory for creating the map */
std::shared_ptr<MapFactory> factory = ftiLibInstance->getMapFactory();

/* Create a Map by name APP_STATE to store state */
auto state = factory->createNamedMap<std::string, std::string>("APP_STATE");

Perform business logic

The application then executes business logic just as it normally would, without being aware that it’s running as a process pair. While doing so, the application should persist state changes to the NSFTI data structures.

/* insert state information in the map */

state->insert("key", "pack my box with five dozen liquor jugs");

Any function that modifies the contents of the map internally checkpoints the change to the backup. Hence, when the function ‘insert’ returns successfully, the backup has already been updated with the new state.
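One way to picture a map with this guarantee is the toy class below. It is not the NSFTI implementation: `MirroredMap` and `send_to_backup` are hypothetical names, the "backup" is an ordinary `std::map` in the same program, and `insert` here overwrites existing keys purely to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy stand-in for a map whose changes are checkpointed before the call
// returns. NOT the NSFTI implementation; every name is hypothetical.
template <typename K, typename V>
class MirroredMap {
public:
    // Delivers one key/value change to the backup and returns true once the
    // backup has acknowledged it.
    using Sender = std::function<bool(const K&, const V&)>;

    explicit MirroredMap(Sender send_to_backup)
        : send_to_backup_(std::move(send_to_backup)) {}

    // Insert or overwrite (a simplification). The local copy changes only
    // after the backup has accepted the change, so a true result means that
    // both copies hold the new value.
    bool insert(const K& key, const V& value) {
        if (!send_to_backup_(key, value)) {
            return false;   // backup not updated: leave local state alone
        }
        local_[key] = value;
        return true;
    }

    bool contains(const K& key) const { return local_.count(key) != 0; }
    std::size_t size() const { return local_.size(); }

private:
    Sender send_to_backup_;
    std::map<K, V> local_;
};

bool demo_backup_matches_after_insert() {
    std::map<std::string, std::string> backup_copy;  // stands in for the backup
    MirroredMap<std::string, std::string> state(
        [&backup_copy](const std::string& k, const std::string& v) {
            backup_copy[k] = v;
            return true;
        });
    const bool ok =
        state.insert("key", "pack my box with five dozen liquor jugs");
    return ok && backup_copy.count("key") == 1 && state.contains("key");
}
```

Sending the change to the backup before applying it locally is one possible ordering. It is chosen here because it keeps the two copies from diverging when the checkpoint fails; the library may well do this differently.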

Graceful shutdown

NSFTI provides a function for a graceful shutdown of the application. Without a graceful shutdown, the backup would treat the primary’s exit as a failure, take over as primary, and start a new backup.

/* API to be called for the graceful exit of the application */
ftiLibInstance->shutdown(EXIT_SUCCESS);
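Why a dedicated shutdown call matters can be seen from the backup's point of view. The sketch below is a guess at the general idea, not a description of NSFTI's protocol: `BackupMonitor` and its methods are hypothetical, and the "notice" is simply a flag set before the primary goes away.

```cpp
#include <cassert>

// Toy model of how a backup might react when its primary disappears.
// Names are hypothetical and the real library's protocol may differ.
enum class BackupAction { ExitQuietly, TakeOver };

class BackupMonitor {
public:
    // The primary announces a planned shutdown before it exits.
    void on_shutdown_notice() { shutdown_announced_ = true; }

    // Called when the backup learns that the primary is gone.
    BackupAction on_primary_gone() const {
        return shutdown_announced_ ? BackupAction::ExitQuietly
                                   : BackupAction::TakeOver;
    }

private:
    bool shutdown_announced_ = false;
};
```

Without the notice, the backup cannot tell a deliberate exit from a crash, so it does what a backup is meant to do and keeps the application alive.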

Getting started

We have created four sample applications and posted them on NonStop’s GitHub repository. Each sample demonstrates different capabilities of the library and is accompanied by a detailed README with instructions on how to run it. You can find the samples at https://github.com/HewlettPackard/NonStop

Before you try out the samples you need to have the L21.06 version of the OS and T1150L01^AAE installed.

Tell us what you think

We in NonStop engineering are always on the lookout for ways to help our customer and partner community. If you have any thoughts that you would like to share with us, do drop us a mail at Sridhar.Neelakantan@hpe.com or Suveer.Nagendra@hpe.com

[1] Online documentation for NSFTI can be found at NonStop Fault Tolerance Interface (NSFTI) Library

Author

Suveer Nagendra

Suveer has close to 24 years of industry experience, of which about 21 years have been spent working on NonStop and related technologies. He has been with HPE for 15 of those years. He has a good understanding of the entire platform, and especially of the middleware portfolio of products. He has written applications using NS Tuxedo for customers such as Rabobank in the Netherlands, consulted for large customers such as the Bombay Stock Exchange on performance tuning of their trading application, and architected customer solutions such as a payment switch from Opus in India. Suveer has architected popular middleware products such as the servlet container NSJSP, the application server NSASJ, the in-memory caching server NSIMC, and, more recently, the API Gateway. Along with expertise in NonStop technologies, Suveer has years of experience in Java technologies, in both the application and infrastructure areas. He is a Master Technologist working in the NED lab in Bengaluru, India.


The Connection

A Journal for the HPE NonStop Business Technology Community

NSFTI for the impatient: A 2-minute introduction to NSFTI