Essential and accidental complexity
The topic of essential versus accidental complexity goes back to ancient Greece and a classification made by Aristotle. But we can go just 30 years back instead of 3,000 to find it well illustrated in software development by Fred Brooks, a Turing Award winner. In his 1987 essay “No Silver Bullet” he introduces two concepts: essential complexity and accidental complexity. Essential complexity is inherent to the problem the software needs to solve and cannot be removed. For example, if the user needs a program to do 30 different things, then those are essential requirements and the program must be architected to do those 30 things. Accidental complexity, on the other hand, is a by-product of how the program was developed: unintended complexity that can potentially be reduced. The essay goes on to explain how high-level languages, for example, help reduce accidental complexity. The Python language comes to mind as a good example, universally acclaimed as it is for its simplicity and immediate productivity gains.
Further reading on the topic shows that although accidental complexity is recognized as unintentional, the problem is still placed in the developers’ court: it is their responsibility to fix it. Yet how reachable is that goal, given all the constraints in place? Before jumping to blame the developers, maybe we should understand the context. Often, if not always, developers have to work within established boundaries. Organizational boundaries shape how an IT line of business designs software, which may simply mirror how the organization is structured. Budget boundaries drive investment toward code that supports the current business use case and only that. And in fact, isn’t the agile model of small incremental code changes an illustration of focusing solely on selected “user stories” (*)? In that context, optimizing code for the sake of architectural elegance and reduced complexity becomes much harder to justify. It is seen as added cost for an unpredictable benefit. How can companies find and justify time and money for this today? Product boundaries may also exist when a solution is assembled from multiple sources, and today assembling multiple packages appears to be the norm. Why would you create new code when libraries already exist, readily available, sometimes with an approval stamp such as being labeled an “Apache project” or enjoying a million GitHub downloads as indisputable proof of their quality? However, each product may not provide the same quality attributes, leading to gaps that the end users have to address.
(*) A user story is a tool used in agile software development to capture a description of a software feature from an end-user perspective.
Database vendors only solve database problems.
A very good example of this problem has unfolded in the database world over the last decade. Many database products had to come up with better scaling and availability. From Oracle RAC to Microsoft Failover Cluster Instances, from Cassandra to MongoDB, from Google Spanner to CockroachDB, they all implemented capabilities ranging from very basic clustering to distributed SQL in order to address the higher demand for data compute. Such products are generally built to provide the same experience whether they are deployed on Microsoft Windows, Red Hat or SUSE Linux. This in turn means they rarely take full advantage of underlying OS features, such as a clustered file system, since that would restrict the product to a specific platform. This is a major boundary that database vendors cannot easily overcome, even with the smartest developers. So database vendors release their own clustering and their own availability features. Network load balancing, storage availability, clustering of applications and integration with operating system features are often left as someone else’s problem. This lack of integration with the OS or other elements of the software stack simply means the burden of figuring out the integration falls on the customer.

At this point the level of accidental complexity can increase further, because additional constraints have to be factored into the mix. What products are my existing DBAs familiar with? Do I need a senior DBA who knows how to set up clustering? When a failure happens, do I fail over the OS, just the database, or the whole system? Wouldn’t it be better to have my DBAs focus on securing systems? And what if my super-skilled DBA misses one single point of failure because, after all, they are human? The example of availability is particularly unforgiving, as availability is only as strong as its weakest link. All your efforts to keep the database highly available can be ruined by a single human mistake.

But assuming you have all of the above solved, designed and paid for, there is still a big elephant in the room. You still have to assemble the cluster itself: installing Linux on each node, setting up an IP address for each node, deploying and securing the nodes, creating users, installing the clustering software, and so on. None of those tasks can easily be solved by the database vendor alone.
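To make the scale of that assembly work concrete, here is a minimal sketch, with hypothetical node names and a simplified task list (not any vendor’s actual tooling), of the per-node work an assembled database cluster typically requires before the database itself even enters the picture:

```python
# Illustrative only: hypothetical node names and a simplified task list,
# not any vendor's actual provisioning tool.

CLUSTER_NODES = ["db-node-01", "db-node-02", "db-node-03"]

# Every step has to be performed, verified and kept in sync on every node.
PER_NODE_STEPS = [
    "install the operating system",
    "assign a static IP address",
    "harden and secure the node",
    "create service users",
    "install the clustering software",
    "install the database software",
    "join the node to the cluster",
]

def provision(node: str) -> None:
    """Stand-in for the real provisioning work on a single node."""
    for step in PER_NODE_STEPS:
        print(f"{node}: {step}")

for node in CLUSTER_NODES:
    provision(node)

# The accidental complexity grows with the product of nodes and steps,
# and each of these node-level tasks can drift or fail independently.
print(len(CLUSTER_NODES) * len(PER_NODE_STEPS), "node-level tasks to keep consistent")
```

None of this work has anything to do with the data the business actually cares about; it exists only because the cluster has to be assembled by hand.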
How can we solve this database accidental complexity?
One of the key quality attributes of a cluster is its ability to present a single system image. How do you get a number of compute nodes to appear as a single system to the end user? When you assemble a cluster, maybe the database comes from one vendor, the OS is open source, and the clustering software is whichever one your DBA knows best. Sometimes, even from a single vendor, the products are not fully integrated. For example, Microsoft SQL Server includes a high availability feature called Always On availability groups and another, built on Windows Server Failover Clustering, called Failover Cluster Instances (FCI). As it happens, and as is even noted in the SQL Server 2019 documentation, the two do not work seamlessly together. How can you achieve a consistent single system image across the whole solution in such conditions? As of today, mainstream Linux systems are not clusters by default. Even if you can cluster the database, the Linux servers are not natively clustered, so the remaining gaps are addressed with quick-and-dirty solutions such as sharing non-database files from an NFS server. Yes, very ugly!
This is where the story is radically different when you look at NonStop systems. A key design principle of NonStop systems is that the cluster’s single system image is implemented at the kernel level. Everything applications interact with, such as the file system, process management or network interfaces, is seen as a single entity from all the nodes in the cluster. With a NonStop system, all the CPUs (also known as compute nodes) are internal, anonymous elements that you do not have to manage. All those CPUs appear to the database, and to everyone else, as a single system, so there is no clustering for the database to implement: the OS has already done it. This is in contrast with assembled clusters, where you have to install each node, assign an IP address, manage the node, patch the node, and repeat the same operations as many times as you have nodes. For a NonStop system with 16 CPUs, you install software only once, and you manage only one system and only one database instance. This capability was very well defined about 20 years ago by Greg Pfister in his book “In Search of Clusters”, in which he says that “kernel level SSI management is the most desirable solution to manage clusters”. I recommend this book, still very relevant, in which Greg Pfister of IBM compares IBM Sysplex, OpenVMS clusters and NonStop systems along with “assembled clusters”.
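As a rough illustration of the difference from the administrator’s point of view, here is a minimal sketch contrasting the two models; the classes and numbers are illustrative assumptions, not any product’s actual interface:

```python
# Illustrative comparison only; not any vendor's actual management interface.
from dataclasses import dataclass

@dataclass
class AssembledCluster:
    nodes: int

    def systems_to_manage(self) -> int:
        # Each node is separately installed, addressed, patched and monitored.
        return self.nodes

@dataclass
class KernelSSICluster:
    cpus: int

    def systems_to_manage(self) -> int:
        # CPUs are internal, anonymous elements behind a kernel-level single
        # system image: software is installed once and managed once.
        return 1

print(AssembledCluster(nodes=16).systems_to_manage())  # 16 systems to install, patch and manage
print(KernelSSICluster(cpus=16).systems_to_manage())   # 1 system image, regardless of CPU count
```

The point of the sketch is simply that with a kernel-level single system image, the number of separately managed systems does not grow with the number of CPUs.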
NonStop beats accidental complexity where others hit a wall
Kernel-level SSI is a good example of how to eliminate accidental complexity. Instead of having every database, every middleware product and every application build its own cluster, with varying levels of success, let the OS address this once and for all, for everything running in the software stack.
And then everything becomes so much simpler!
For example, when you install HPE NonStop SQL on a NonStop cluster, you install it just once, configure it just once, and if you add new compute nodes in the future, they are immediately available to the database engine. Readers familiar with NonStop know that other advantages come with the OS and can also be transparently adopted by databases and applications, such as the high availability of the storage engine. The benefit? End users do not need to create replicas. Talk about simplifying and reducing costs! NonStop was designed as an always-on cluster from day one, and it therefore solves a problem that is very difficult for other vendors to solve in isolation. Database vendors have a strong need for clustering, but they do not own the OS, where clustering is best implemented, so they are doomed to reinvent their own solutions, which are very often complex, limited in scope, costly and unable to achieve 100% availability. Indeed, as Fred Brooks outlined in 1987, it looks like there is no silver bullet for those vendors to solve accidental complexity. NonStop solved this many years ago within our OS and has been the silver-bullet solution for our customers. If you have not considered NonStop yet, take another look at how it can remove a great deal of complexity that others cannot address.