NonStop – Focusing on Wins
If you follow my column (and if you do, thank you), you know that I tend to favor the ‘Trends’. I confess the technology does hold more fascination for me, but it is high time to discuss some NonStop success.
First, some context. NonStop is a unit inside Mission-critical systems, which is a business unit inside High Performance Computing, which all rolls up to Compute within HPE, which is still just one of several major business divisions. I say this because I don’t want the percentages I am going to disclose to be interpreted as a recommendation to buy or not to buy HPE stock. That is between you and your financial broker. Also, the percentages I am discussing in this issue refer to North America.
That being said, and without disclosing any real numbers, I am okay relaying that so far this year NonStop revenue is up about 15% over last year (Y/Y). This quarter NonStop is up over 30% (Q/Q). And for the year so far we are tracking over budget. So wins are definitely happening. Since I do not have customer permission to disclose any specific names, let me just say that major banks, major retailers and major credit card companies continue to be pleased and are investing in their NonStop mission-critical systems.
The x86 systems are the predominant platform that is selling, and the latest systems (HPE Gen 10, or for us X-3) take full advantage of the HPE silicon root of trust, making it the most secure x86 architecture available. See this article on HPE cyber security or search the website for ‘cybersecurity’ or ‘silicon root of trust’. Pretty cool stuff. If you have an existing Gen 8 or 9 system (X-1 or X-2), you may want to consider a processor swap to the latest Gen 10 (X-3) processors. They are faster and have the added security built in (see link above).
If you saw my last article on the old syndromic surveillance pilot, I am happy to say that there is renewed interest in that solution both in the United States and overseas. We are seeing the criticality of such systems as Covid-19 crests in the United States and, hopefully, begins to decline.
So NonStop is selling well in 2020. I thought I would write a bit about the NonStop fundamentals this issue: you know, the boring Availability and Scalability that are on every NonStop PowerPoint deck. I think, through repetition, they lose some of the magic over time.
NonStop is best known for availability. I mean, we call it “NonStop”, right? But is it still better than other solutions? We weren’t the only AL4 system in the marketplace. There is a core difference in design that has allowed NonStop to maintain a lead in this area. The general design principle, which is absolutely unique in the industry, is the premise that everything in the system will fail. It has never been a Tandem or a NonStop design principle to create an MTBF (Mean Time Between Failures) so good that we didn’t have to worry about that component. Even if the MTBF were measured in centuries, NonStop developers would still design around the question ‘What happens when this unit fails?’ NonStop assumes failure of everything, and I believe that has made all the difference.
I know that in clustered systems the last thing you want is for the system to actually fail over to the backup system. That usually involves a lot of crossed fingers: will it work, are the backup system components working correctly, do the startup scripts work, and so on. NonStop, as you all know, fails fast because we know failover works. If anything looks the least bit funny, fail, because we know the system will continue.
I recall way back in the Tandem days we used to do a demo in Cupertino where we would pull the CPU board out of an active system, pull a chip off that board and stick the board back in. The management tools back then would pinpoint the pulled chip as bad, and we’d pop the board again and replace the chip. This was to show a CPU outage and a recovery.
Generally we would also pull the power plug out of a modem and watch the traffic from Cupertino to the New York system switch and go through Chicago to get to New York. We’d plug the modem back in and the traffic would shift back to the ‘best’ route, which was directly to New York. I was there when a major bank was watching this demo and someone said it must have been expensive to create a phantom network for the demo. The presenter said it wasn’t a phantom network; it was our actual live Tandem network we were doing the demo on. The bank attendees mentioned later that they were not sure which was more impressive: the demo itself or the fact that we were confident enough to do it on our live network. We were, and are, confident because every component of a NonStop system has a backup that we know works, because it is already working.
The other availability maxim was that all components in the system are used. There aren’t any components sitting around waiting for a failure. We use both Primary and Mirror disks. We use all processors and manage the load so that a failure can be absorbed. We use both the X and the Y fabric, at all times. The failure is to a known good, active component.
Another big differentiator is performance. Back in the day, many systems would consider themselves ‘up’ if they could get a system prompt on a console. Jimmy (James Treybig, founder, President & CEO of Tandem) was famously asked by a consultant when Tandem considered a system ‘down’. Jimmy answered that ‘A system was considered down if a customer thought the system was down’. A pretty high standard. That meant that the system and the applications running on it had to appear normal even during a failure. If a processor went down and response time started to become compromised, Pathway would automatically spawn more servers (application instances). This technique of spawning additional resources is becoming the de facto standard for containers: create stateless containers and, if needed, fire more up. It provides seamless performance, but I do not believe container companies necessarily equate this with availability as we did and do.
A properly designed system should perform as well during a failure as it does normally. What I am saying is that availability has several dimensions, all of which have been accounted for by the development maxim that everything will fail. So on a NonStop the performance, the workload, the concurrency, and the volume and velocity of transactions on the system are maintained, even during a failure.
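For readers who like to see the idea in code, here is a minimal Python sketch of the ‘spawn more servers when response time suffers’ rule described above. To be clear, this is not Pathway’s actual algorithm or any container platform’s API; the function name, the 100 ms target, the cap of 8 servers and the response-time samples are all hypothetical values chosen only for illustration.

```python
def next_server_count(current: int, response_ms: float,
                      target_ms: float, max_servers: int) -> int:
    """Add one server instance when response time exceeds the target.

    Illustrative rule only (an assumption, not Pathway's real logic):
    grow by one instance per check, never beyond max_servers.
    """
    if response_ms > target_ms and current < max_servers:
        return current + 1
    return current


# Hypothetical response-time samples (ms) observed after a processor failure.
samples = [80, 140, 130, 110, 90]

servers = 4  # hypothetical starting number of application instances
for observed in samples:
    servers = next_server_count(servers, observed, target_ms=100, max_servers=8)

print(f"server instances after the failure: {servers}")
```

The point of the sketch is simply that the trigger is what the customer experiences (response time), not whether a component is technically ‘up’.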
The same multi-dimensionality applies to NonStop scalability. We can scale up (faster processors and, of course, our Dynamic System Capacity, which adds cores in real time), but NonStop has always favored scale out, owing to the now popular parallel processing, shared-nothing architecture that almost everyone copies nowadays. The thing that hasn’t been copied is the operating system behind NonStop scalability, which allows online expansion. The database size can increase dramatically. The transaction rate of a system can double and double again and again without changing any code. The performance of a system can be grown predictably. A four-processor system will be twice as fast as a two-processor system. That four-processor system can be doubled again by adding four more. The math is pretty easy.
Additionally, the more you add, the better the availability gets. If I have a two-processor system and one of them fails, everything shifts to the surviving processor. If I have a sixteen-processor system and a processor fails, the load can be assimilated by the surviving 15 processors. If all processors were running at 60%, each survivor only has to take on 4 percentage points more (the failed processor’s 60% spread across 15 survivors, bringing each to 64%) to handle the load of the failed processor. It would be unnoticeable. Just as it was designed to be.
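The failover arithmetic above is easy to check for yourself. Here is a small Python sketch, assuming the load of the failed processor is spread perfectly evenly across the survivors (a simplification; the function name is mine, not anything from the NonStop tooling).

```python
def utilization_after_failure(processors: int, utilization: float,
                              failed: int = 1) -> float:
    """Per-processor utilization once the failed processors' load is
    spread evenly over the survivors (an idealized assumption)."""
    survivors = processors - failed
    if survivors <= 0:
        raise ValueError("at least one processor must survive")
    total_load = processors * utilization  # total work in 'processor units'
    return total_load / survivors


for n in (2, 4, 16):
    after = utilization_after_failure(n, 0.60)
    print(f"{n:>2} processors at 60%: each survivor runs at {after:.0%}")
```

With sixteen processors at 60%, each survivor ends up at 64%. Under the same even-spread assumption, a two-processor system at 60% would need 120% from the lone survivor, which is why small configurations have to run with more headroom to absorb a failure.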
I know these ideas are well known in this community, including through actual experience on NonStop, but every now and again it’s good to hear the classics, isn’t it?