What is NSFTI all about?
NonStop offers multiple ways of writing fault-tolerant applications. One of the earliest is the process pair. Writing an application with process-pair technology can be daunting, because it expects the programmer to have a deep understanding of several NonStop-specific programming aspects, such as data checkpointing, primary and backup processes, the File System (FS) API, and handling process-pair signals, to name a few. NSFTI (NonStop Fault Tolerant Interface) is a C++ library that hides the process-pair programming complexity behind a few C++ classes. The classes are designed for use by any C++ programmer, with no NonStop programming experience required.
Is process pair technology relevant today?
A short answer to this question is YES. Look around and you will see products even outside NonStop trying to achieve fault tolerance through process pairs. The term process pair might not be familiar, but the technology is. Let me substantiate with an example: Redis. When deployed in HA (high availability) mode, Redis runs much like a process pair, where the primary is termed ‘master’ and the backup is termed ‘slave’ (newer Redis releases call the latter a ‘replica’). The primary checkpoints data to the backup to keep the backup up to date. There are Redis client libraries that seamlessly switch operations over to a backup when the primary fails.
Many other popular products achieve high availability by using the very same principles that NonStop’s process-pair technology employs.
OK! Got it. Tell me how the C++ classes help
An application programmer should only worry about programming application logic and handling data and not worry about infrastructural tasks such as creating a backup process, data checkpointing, and signal handling. NSFTI enables a programmer to focus on data and application logic and hides the infrastructural tasks.
NSFTI takes a data-centric approach to process-pair programming. It exposes in-memory data structure(s) for the programmer to keep critical data (that needs to survive process failure). The implementation ensures that the data is synchronized with the backup process, data consistency is never compromised, and the infrastructural tasks are all hidden from the programmer.
Really? How does it work?
Ah! If you have come this far and are still interested in NSFTI, then read on for more details on the library.
NSFTI for the technologist
The value of NSFTI is well understood by programmers who have written fault-tolerant applications using process-pair technology. It’ll be better appreciated by new-age programmers who must design and code fault-tolerant applications but have no knowledge of process-pairs.
NOTE: Fault tolerance can be achieved in ways other than process pairs. Here our focus is purely on process-pair technology.
Decoding process pairs
What does it take to create a fault-tolerant application using a process pair? At its very core, a process pair is about (a) not losing data and (b) always having at least one process executing business logic. Let’s peek into the underpinnings of such an application.
Always have a backup
An application that is designed as a process pair starts off not as a pair but as a single process. That single process, which we will call the primary, decides that it needs a backup to carry on its work in the event of its own failure, and so it creates a copy of itself, which we will call the backup. If either one fails, the survivor assumes the role of primary (if it is not the primary already) and ensures that a new backup is promptly started. Creating the backup and communicating between the two happen through a set of NonStop File System (FS) API functions. Multiple functions need to be invoked to create the backup process, to receive messages from each other, to watch out for signals, and the like. To an untrained eye, the functions in the API look like hieroglyphs and can be daunting to learn.
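To make the role changes concrete, here is a minimal sketch that models the life cycle described above entirely in memory. It calls no NonStop API; the class `PairModel` and its numbered "processes" are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Hypothetical in-memory model of a process pair. No NonStop API is called;
// "processes" are just numbered records.
class PairModel {
public:
    // The application starts as a single process (the primary), which then
    // creates its backup.
    PairModel() : nextId_(1), primaryId_(nextId_++), backupId_(nextId_++) {}

    // Simulate the failure of one member. Returns false if `id` is not a
    // live member of the pair.
    bool fail(int id) {
        if (id == primaryId_) {
            primaryId_ = backupId_;  // the backup takes over as primary
        } else if (id != backupId_) {
            return false;
        }
        backupId_ = nextId_++;       // the survivor starts a new backup
        assert(primaryId_ != backupId_);
        return true;
    }

    int primaryId() const { return primaryId_; }
    int backupId() const { return backupId_; }

private:
    int nextId_;     // declared first so it is initialized first
    int primaryId_;
    int backupId_;
};
```

Whichever member fails, the model always ends up with one primary and one freshly started backup, which is the invariant a real process pair works to maintain.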
Keep the backup fully informed
A backup is of no use if it does not know what the primary is doing. How does it know? It needs the primary to share its state and to send an update each time that state changes. This is called data checkpointing. The primary communicates its state to the backup through function calls in the NonStop FS API. The primary decides what data the backup needs in order to pick up exactly where the primary left off in the event of a failure.
In the event of a failure, the backup assumes the role of the primary, looks at the (checkpointed) data it has, and starts from where the primary left off. It also starts a new backup and checkpoints its entire state to that new backup.
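The checkpoint-and-resume idea can be sketched with a toy workload: summing a list of numbers. The primary checkpoints its progress after each item, and after a simulated crash the backup resumes from the last checkpoint it received. Everything here (`Checkpoint`, `runUntil`, `runWithTakeover`) is a hypothetical illustration, not NonStop or NSFTI code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical state that the primary decides the backup needs.
struct Checkpoint {
    std::size_t nextIndex = 0;  // first work item not yet processed
    long total = 0;             // result accumulated so far
};

// Process items starting from `start`, checkpointing after every item.
// Stops (simulated crash) before processing the item at index `crashAt`.
// Returns the last checkpoint the backup received.
inline Checkpoint runUntil(const std::vector<int>& items, Checkpoint start,
                           std::size_t crashAt) {
    Checkpoint backupCopy = start;  // what the backup currently holds
    Checkpoint state = start;
    while (state.nextIndex < items.size() && state.nextIndex < crashAt) {
        state.total += items[state.nextIndex];
        state.nextIndex += 1;
        backupCopy = state;  // checkpoint: send the new state to the backup
    }
    return backupCopy;
}

// Run to completion with a simulated primary failure at `crashAt`:
// the backup takes over and resumes from its checkpointed copy.
inline long runWithTakeover(const std::vector<int>& items, std::size_t crashAt) {
    Checkpoint atFailure = runUntil(items, Checkpoint{}, crashAt);
    Checkpoint done = runUntil(items, atFailure, items.size());
    return done.total;
}
```

Because the backup resumes from `nextIndex`, no item is processed twice and none is skipped, so the result is the same regardless of where the crash happens.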
And there are plenty of pitfalls
What should the primary do if the backup fails to start? What if the CPU on which the backup is to run is not started? How does the primary create the exact same environment as itself for the backup to run? And more.
A well-programmed process pair handles all of these cases, and programming for them requires in-depth knowledge of the NonStop API and of the process-pair programming paradigm.
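As one small illustration of the kind of decision involved, the sketch below picks a CPU for the backup when the preferred one may be down. The policy (prefer one CPU, otherwise fall back to the first CPU that is up and is not the primary's own) is an assumption made for this example, and `CpuStatus` and `chooseBackupCpu` are hypothetical names rather than part of any NonStop API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of one CPU in the system.
struct CpuStatus {
    int number;
    bool up;
};

// Pick a CPU for the backup: prefer `preferred`, otherwise the first CPU
// that is up and is not the primary's own CPU. Returns -1 if none qualifies,
// in which case the primary would have to run without a backup and retry later.
inline int chooseBackupCpu(const std::vector<CpuStatus>& cpus, int primaryCpu,
                           int preferred) {
    // The backup is meant to survive a failure of the primary's CPU.
    assert(primaryCpu != preferred);
    int fallback = -1;
    for (const CpuStatus& cpu : cpus) {
        if (!cpu.up || cpu.number == primaryCpu) continue;
        if (cpu.number == preferred) return preferred;
        if (fallback == -1) fallback = cpu.number;
    }
    return fallback;
}
```

A hand-written process pair has to make this choice, act on it through the FS API, and handle the case where no CPU is available.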
The summary is…
Process-pair programming requires deep knowledge of the NonStop platform and is not for the faint of heart. The application programmer has two tasks to fulfill: coding the business logic, and coding the infrastructural tasks that achieve fault tolerance through process-pair technology.
The emergence of NSFTI
If you have followed the arguments so far and are wondering, “Why not spare the programmer the very complex infrastructural tasks and allow a focus on the business logic?”, then congratulations! You have understood the primary goal of the NSFTI library.
Data-centric view of process pairs
As we have already seen, conventional process-pair programming requires the programmer to understand a great deal of NonStop technology. With NSFTI, the programmer can focus solely on business logic and leave the infrastructural tasks to the library. The library handles all process-pair tasks, such as creating the backup, checkpointing data to the backup, detecting peer failure, restarting the peer, and handling signals.
What, then, should the programmer do to use the library, you may ask. The only thing the programmer needs to do is keep the state information in the data structure provided by the NSFTI library.
A Map for storing application state
The library provides a Map data structure whose contents are always kept in sync with the backup. The programmer just needs to store in the Map the application data that needs to be checkpointed. The Map is implemented as a C++ class with method signatures very similar to those of the C++ STL (Standard Template Library).
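To give a feel for the idea (and only the idea), here is a sketch of a map whose mutating methods mirror every change into a second copy. It is not the NSFTI class: `MirroredMap` and `StateMap` are hypothetical names, and the "backup" is simply another `std::map` in the same process rather than a separate process.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a checkpointed map. This is NOT the NSFTI class;
// the "backup" here is just a second std::map in the same process.
template <typename K, typename V>
class MirroredMap {
public:
    explicit MirroredMap(std::map<K, V>& backupCopy) : backup_(backupCopy) {
        assert(backup_.empty());  // start with both copies in sync
    }

    // Like std::map::insert, an existing key is left untouched.
    bool insert(const K& key, const V& value) {
        bool inserted = local_.insert(std::make_pair(key, value)).second;
        if (inserted) backup_[key] = value;  // mirror the change
        return inserted;
    }

    bool erase(const K& key) {
        bool erased = local_.erase(key) > 0;
        if (erased) backup_.erase(key);  // mirror the change
        return erased;
    }

    // Reads touch only the local copy.
    const V* find(const K& key) const {
        typename std::map<K, V>::const_iterator it = local_.find(key);
        return it == local_.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return local_.size(); }

private:
    std::map<K, V> local_;
    std::map<K, V>& backup_;
};

// String-to-string state, matching the key and value types used in this article.
using StateMap = MirroredMap<std::string, std::string>;
```

The application only ever calls `insert`, `erase`, and `find`; keeping the second copy current is entirely the container's job, which is the division of labor the data-centric approach aims for.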
Outline of an NSFTI application
If you are a programmer reading this article, you would want to know what an NSFTI program looks like. We have quite a few code snippets and example applications that are shipped along with the product. If you are impatient then the following few paragraphs give you a brief outline of an NSFTI application. For more details, please refer to the online documentation of NSFTI.
Initialize the library
The first thing the application should do is initialize the library. Amongst other things, during initialization, the library will create the backup process and prime the backup process to receive checkpoint data from the primary.
std::shared_ptr<FTILib> initRet = FTILib::initialize();
Create data structure(s) to store application state
The application should then create data structure classes to store the application state. NSFTI provides functions to create ‘named’ data structures.
std::shared_ptr<FTILib> ftiLibInstance = FTILib::getInstance();
/* Get the MapFactory for creating the map */
std::shared_ptr<MapFactory> factory = ftiLibInstance->getMapFactory();
/* Create a Map by name APP_STATE to store state */
auto state = factory->createNamedMap<std::string, std::string>("APP_STATE");
Perform business logic
The application then executes business logic just as it normally would, without being aware that it’s running as a process pair. While doing so, the application should persist state changes to the NSFTI data structures.
/* insert state information in the map */
state->insert("key", "pack my box with five dozen liquor jugs");
Any function that modifies the contents of the map internally results in the data being checkpointed to the backup. Hence, when the function ‘insert’ returns with success, the backup has already been updated with the new state.
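One way such a guarantee could be achieved is to checkpoint first and commit locally second. The sketch below illustrates that ordering; it is an assumption made for illustration, not a description of NSFTI's internals, and `BackupChannel` and `checkpointedInsert` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical channel to the backup; `reachable` simulates delivery success.
struct BackupChannel {
    bool reachable = true;
    std::map<std::string, std::string> received;

    // Returns true only once the backup holds the update.
    bool send(const std::string& key, const std::string& value) {
        if (!reachable) return false;
        received[key] = value;
        return true;
    }
};

// Checkpoint first, commit locally second: the caller sees success only
// after the backup already has the new value.
inline bool checkpointedInsert(std::map<std::string, std::string>& local,
                               BackupChannel& backup, const std::string& key,
                               const std::string& value) {
    if (local.count(key) != 0) return false;     // key already present
    if (!backup.send(key, value)) return false;  // backup not updated: no change
    local.emplace(key, value);
    return true;
}
```

With this ordering, a failed checkpoint leaves the local map unchanged, so the two copies never disagree about an insert that was reported as successful.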
NSFTI provides a function for a graceful shutdown of the application. Without a graceful shutdown, the backup would treat the shutdown as a failure, take over as primary, and start a new backup.
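The difference between a planned stop and a failure can be sketched from the backup's point of view. This article does not name the NSFTI shutdown function, so the sketch does not call it; `PeerEvent`, `BackupAction`, and `BackupMonitor` are hypothetical, and the "shutdown notice" is an assumed mechanism used only for illustration.

```cpp
#include <cassert>

// Hypothetical events a backup could observe about its primary.
enum class PeerEvent { ShutdownNotice, Disappeared };
// Hypothetical reactions of the backup.
enum class BackupAction { None, ExitQuietly, TakeOverAndStartNewBackup };

class BackupMonitor {
public:
    BackupAction onEvent(PeerEvent event) {
        switch (event) {
            case PeerEvent::ShutdownNotice:
                shutdownAnnounced_ = true;  // remember the stop is planned
                return BackupAction::None;
            case PeerEvent::Disappeared:
                return shutdownAnnounced_
                           ? BackupAction::ExitQuietly
                           : BackupAction::TakeOverAndStartNewBackup;
        }
        assert(false && "unhandled PeerEvent");
        return BackupAction::None;
    }

private:
    bool shutdownAnnounced_ = false;
};
```

If the primary simply exits without announcing it, the backup cannot tell the exit apart from a crash, so it does what a backup is supposed to do and keeps the application alive.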
We have created four sample applications, each demonstrating different capabilities of the library. Each sample is accompanied by a detailed README with instructions on how to run it. You can find the samples in NonStop’s GitHub repository: https://github.com/HewlettPackard/NonStop
Before you try out the samples, you need to have the L21.06 version of the OS and T1150L01^AAE installed.
Tell us what you think
We in NonStop engineering are always on the lookout for ways to help our customer and partner community. If you have any thoughts that you would like to share with us, do drop us an email at Sridhar.Neelakantan@hpe.com or Suveer.Nagendra@hpe.com.
Online documentation for NSFTI can be found under the title “NonStop Fault Tolerance Interface (NSFTI) Library”.