NonStop Trends and Wins

System Outages - 2025 Update


Dr. Jim Gray worked for Tandem in the early days and was responsible for many of its strategic directions, especially concerning availability and databases. He was instrumental in the creation of the TPC (https://www.tpc.org/), an effort to create fair benchmarks customers could use to evaluate system performance. I was very fortunate to have known Jim. He was a genius and a wonderful, kind human being.

Dr. Jim Gray’s seminal 1985 analysis of system outages (“Why Do Computers Stop and What Can Be Done About It?”) revealed that hardware failures accounted for 18% of outages, software for 25%, and environment for 14%. The leading cause of failure, at 42%, was what he termed administration, which we can call human error. This information was groundbreaking at the time and was one of the first selective studies on computer outages.

At Tandem, at the time, we focused on availability: on eliminating, or riding through, various failures. Hardware components were always paired. Power supplies, for example, were architected in pairs, each supplying 50% of the power for a specific unit. If one power supply failed, the other would ramp up to 100% of the power, keeping the unit from failing, or from even knowing there was an outage. A hardware failure would not (and will not) take a NonStop system down.

Software was a more serious issue and accounted, even back then, for a much larger percentage of failures. Many application errors would not take a system down, but NonStop’s definition of an “up system” meant not only the operating system but the customer’s application as well. The concept of process pairs was created to try to eliminate some application errors. To quote Jim’s paper, “The key to software fault-tolerance is to hierarchically decompose large systems into modules, each module being a unit of service and a unit of failure. A failure of a module does not propagate beyond the module.” That was 1985, but it sounds a lot like microservice architecture. Jim suggested breaking large applications into smaller elements so that a failure would have less impact on the entire environment.

It was Jim’s contention that software bugs fell into one of two realms. There were the hard bugs, which were easily detected and always failed in the execution stream. These were mostly caught in application testing. The other kind, what Jim referred to as a soft bug, would only fail under certain conditions and was very hard to catch. The process-pair technology developed by NonStop was intended to ride through these soft bugs, since in a process failure the backup process was on a different CPU, and therefore in a different environment, than the failed primary process.

The environment issues were covered either by the battery systems sold with the early NonStop systems or, later, by RDF, which allowed disaster failover. The biggest issue was human error. NonStop systems tried to keep humans out of the loop through fast failover, but even so, configuration mistakes, fat fingers, lack of training, and educated guesses would most often account for a downed system.
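The process-pair idea described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not NonStop's actual implementation or API: the `Worker` and `ProcessPair` classes, the `faulty_cpus` parameter, and the dictionary-based checkpoint are all hypothetical names invented for illustration. It models a soft bug as a fault tied to the environment (the CPU) rather than to the request, so the same request succeeds when the backup retries it elsewhere.

```python
from dataclasses import dataclass, field


class TransientFault(Exception):
    """A 'soft' bug: fails only under conditions tied to one environment."""


@dataclass
class Worker:
    """One half of a hypothetical process pair on a simulated CPU."""
    cpu: int
    state: dict = field(default_factory=dict)

    def apply(self, key, value, faulty_cpus):
        # The fault depends on the environment (the CPU), not on the request.
        if self.cpu in faulty_cpus:
            raise TransientFault(f"soft bug triggered on CPU {self.cpu}")
        self.state[key] = value


class ProcessPair:
    """Primary handles requests; backup holds the last checkpointed state."""

    def __init__(self, primary_cpu, backup_cpu):
        self.primary = Worker(primary_cpu)
        self.backup = Worker(backup_cpu)
        self.takeovers = 0

    def request(self, key, value, faulty_cpus=frozenset()):
        try:
            self.primary.apply(key, value, faulty_cpus)
        except TransientFault:
            # Takeover: the backup becomes primary and retries the request
            # from its last checkpoint, in a different environment.
            # (A real pair would also start a fresh backup; omitted here.)
            self.primary, self.backup = self.backup, self.primary
            self.takeovers += 1
            self.primary.apply(key, value, faulty_cpus)
        # Checkpoint: copy the primary's state to the backup.
        self.backup.state = dict(self.primary.state)
        return self.primary.state[key]


pair = ProcessPair(primary_cpu=0, backup_cpu=1)
pair.request("balance", 100)
# CPU 0 now hits a soft bug; the caller still gets an answer.
result = pair.request("balance", 250, faulty_cpus={0})
print(result, pair.takeovers, pair.primary.cpu)  # 250 1 1
```

A hard bug, by contrast, would fail on both CPUs, and the retry inside `request` would raise again. That matches the point above that process pairs ride through soft bugs rather than deterministic ones.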

Modern data shows software rising to become the leading cause of outages, with software and human factors together now dominating:

2025 Outage Causes Analysis

  1. Software Failures (45-55%)
    Now the leading cause, amplified by:
  • Cascading failures in microservices architectures [5]
  • Unpatched vulnerabilities and flawed updates (e.g., the 2024 CrowdStrike outage affecting 8.5M systems [7])
  • Insufficient chaos engineering practices [5]
  2. Human Error (25-35%)
    Administrative mistakes now surpass 1980s levels due to:
  • Cloud configuration errors [6]
  • Inadequate DevOps training [5]
  • Pressure for rapid deployment overriding safeguards [6]
  3. Network/Infrastructure Issues (15-20%)
    Modern complexities include:
  • Multi-cloud integration failures [5]
  • DNS/DDoS attacks [6]
  • Bandwidth saturation from IoT devices [5]
  4. Third-Party Service Failures (10-15%)
    A new category reflecting:
  • Cloud provider outages [5]
  • API dependency chain reactions [6]
  • Supply chain attacks [7]
  5. Cyberattacks (8-12%)
    Includes ransomware, zero-day exploits, and state-sponsored attacks disrupting critical infrastructure [6][7].

Era             Hardware  Software  Human  Network  Third-Party  Cyber
1985 [1][4]     18%       25%       42%    N/A      N/A          N/A
2025 [5][6][7]  <5%       50%       30%    18%      12%          10%
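The shift between the two eras is easier to see as a change in percentage points. The short sketch below simply redoes the arithmetic on the figures in the table above; the variable names are illustrative, and treating "<5%" as 5 is an assumption (an upper bound). Note that the 1985 row totals 85% because the 14% environment share mentioned earlier has no column, while the 2025 row totals more than 100%, which suggests the modern categories overlap (an inference from the numbers, not something the sources state).

```python
# Shares copied from the table above as (1985, 2025) percentages.
# None marks a category the 1985 study did not break out.
table = {
    "Hardware": (18, 5),      # 2025 is "<5%"; 5 is used as an upper bound
    "Software": (25, 50),
    "Human": (42, 30),
    "Network": (None, 18),
    "Third-Party": (None, 12),
    "Cyber": (None, 10),
}


def change_in_points(then, now):
    """Percentage-point change, or None if the category is new."""
    return None if then is None else now - then


changes = {name: change_in_points(*pair) for name, pair in table.items()}
total_1985 = sum(then for then, _ in table.values() if then is not None)
total_2025 = sum(now for _, now in table.values())

for name, delta in changes.items():
    label = "new category" if delta is None else f"{delta:+d} points"
    print(f"{name:<12} {label}")
print(f"1985 total: {total_1985}%, 2025 total: {total_2025}%")
```

On these figures, software's share doubles (+25 points) while hardware falls by at least 13 points.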

Modern Case Study: 2024 CrowdStrike Outage

This software update failure caused $10B+ in losses through:

  • Faulty security patch deployment [7]
  • Manual recovery requirements [7]
  • Cascading impacts across supply chains [7]

While hardware reliability has improved as Jim predicted [1][3], new challenges have emerged from the complexity of distributed systems. Modern solutions emphasize automated rollbacks [6], immutable infrastructure [5], and chaos engineering [5], yet organizations still resist investing in resilience, which is surprising given the financial losses [5].
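To make "automated rollback" concrete, here is a minimal sketch of the pattern: activate a new version, probe its health, and revert without waiting for a human if a probe fails. It is not taken from any particular deployment tool. The function `deploy_with_rollback` and its `activate` and `is_healthy` callbacks are hypothetical stand-ins for whatever a real pipeline would use.

```python
from typing import Callable


def deploy_with_rollback(
    current: str,
    candidate: str,
    activate: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    checks: int = 3,
) -> str:
    """Activate `candidate`; revert to `current` if any health check fails.

    Returns the version left running.
    """
    activate(candidate)
    for _ in range(checks):
        if not is_healthy(candidate):
            # Automatic rollback: no human in the loop.
            activate(current)
            return current
    return candidate


# Simulated environment: `history` records every activation.
history = []

# A flawed update ("v2" never passes its health check) is reverted.
live = deploy_with_rollback("v1", "v2", history.append, lambda v: v != "v2")
print(live, history)  # v1 ['v2', 'v1']

# A good update ("v3") passes every check and stays live.
live_ok = deploy_with_rollback("v1", "v3", history.append, lambda v: True)
print(live_ok)  # v3
```

The design choice worth noting is that the rollback path is exercised by the same code that deploys, which removes the manual recovery step that made the incident above so costly.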

According to the InfoQ article cited above, “outages are commonplace in most organizations, with 55% of companies reporting weekly and 14% reporting daily outages. Staggering 100% of survey participants experienced revenue losses due to outages, with some companies (8%) reporting losses of USD 1 million or higher over the last 12 months.” That’s a lot of outages and a lot of money. I know there is a lot of pressure within organizations whenever a NonStop renewal comes up. NonStop is not as well known as Linux and Windows. Sometimes it’s not a ‘corporate’ standard. People want more integration with existing tools and strategies, which is available, but that’s a different article. Remember, most NonStop systems were initially acquired for their availability and scale, which NonStop still has. Over the life of your NonStop, how many times has your company experienced an outage on the platform?

When Jim wrote his article, he didn’t really split out networking, since that was usually just a comm controller on NonStop. He did not envision third-party providers (cloud suppliers) as a potential source of outages. Once you go cloud, it’s out of your hands. If they have an outage, you can only wait for them to identify and correct it. Cyber is also a new and scary addition to outages. In Jim’s day, one would need physical access to the data center to crash a system. With the Internet, a whole new category of outages has emerged. NonStop is working very hard on digital resilience and the ability to circumvent a ransomware attack, so stay tuned.

NonStop remains the premier system if you want to have your applications available 24x7x365.

Author

  • Justin Simonds is a Master Technologist for the Americas Enterprise Solutions and Architecture group (ESA) under the mission-critical division of Hewlett Packard Enterprise. His focus is on emerging technologies, business intelligence for major accounts and strategic business development. He has worked on Internet of Things (IoT) initiatives and integration architectures for improving the reliability of IoT offerings. He has been involved in the AI/ML HPE initiatives around financial services and fraud analysis and was an early member of the Blockchain/MC-DLT strategy. He has written articles and whitepapers for internal publication on TCO/ROI, availability, business intelligence, Internet of Things, Blockchain and Converged Infrastructure. He has been published in Connect/Converge and Connection magazine. He is a featured speaker at HPE’s Technology Forum, at HPE’s Aspire and Bootcamp conferences, and at industry conferences such as the XLDB Conference at Stanford, IIBA, ISACA and the Metropolitan Solutions Conference.

