The past several years have seen another explosion of data creation. As predicted, the amount of data captured today is at least an order of magnitude more than what was captured just 5-10 years ago, and this trend is expected to continue for at least another decade. But capturing lots of bits in a bucket is a relatively easy job for most IT organizations. Log files and data capture mechanisms have improved to the point where facial, spatial, and other data flow passively into collective storage.
This huge and ever-increasing harvest of data has accelerated Artificial Intelligence (AI) by exposing both near-term and aggregate data to software that can leverage the information for learning and on-the-fly event processing. Data drives the new age in both positive and negative ways: amazing new discoveries and helpful enablers are created almost daily, while at the same time nefarious players are able to manipulate people, processes, and activities in morally and socially undesirable ways. We all hope and assume that, in the long run, the continuation of this trend will make the world better and allow tremendous advancement around the world.
This data is starting to make its way into AI learning and decision tools that reside and interact where the data is created, at what is called “the edge” of computing. Because of this, the way we manage data is once again changing the way we design our system architectures. Let’s take a look at how we got here, because the evolution tells us a lot about where we might be in a few years. This insight may help system architects, developers, and users better appreciate the new data age in computing. I will walk us through this history briefly so that we can embrace these trends as part of a collective maturation of hardware and computer science.
In the Beginning
The first computer data was transient and quite literally ephemeral. Stepping past the history of initial-load and reloadable technologies like paper tape and Hollerith cards, we recall that many early systems could only sometimes (by development and user design) place the data bits onto a linear medium (magnetic tape) in order to retrieve the data later. This meant that every time the data was to be accessed or changed, it had to be read in linear fashion and either modified on-the-fly, or loaded into memory and re-written, linearly, back to a storage device. This is interesting because we still do similar data manipulations today. We will discuss this later, but the use of log files and similar linear data has become interesting again due to advancements in computer science and newer software tools.
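To make the linear constraint concrete, here is a minimal sketch (with an invented fixed-width record layout) of what updating a single record on tape-style media implies: the entire data set must be read and rewritten in order, because there is no in-place random access.

```python
# Sketch: updating one record on linear media (hypothetical fixed-width records).
import io

RECORD_LEN = 16  # assumed record width for this illustration


def update_linear(tape: bytes, record_no: int, new_record: bytes) -> bytes:
    """Read every record in order, swap in the changed one, and write the
    result back out linearly -- the only option on tape-style media."""
    assert len(new_record) == RECORD_LEN
    out = io.BytesIO()
    for i in range(0, len(tape), RECORD_LEN):
        record = tape[i:i + RECORD_LEN]
        out.write(new_record if i // RECORD_LEN == record_no else record)
    return out.getvalue()


tape = b"ACCT0001 100.00 " + b"ACCT0002 250.00 "
tape = update_linear(tape, 1, b"ACCT0002 275.00 ")  # rewrite the whole "tape"
```

Even this toy version shows why a single-field change was so expensive in the linear world: the cost is proportional to the size of the whole data set, not the size of the change.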
As non-volatile random access devices evolved into what we now call “disks”, designers were freed to determine their own access to the data. This might seem simple, but it turns out that it is not. The physical storage does not map directly to the desired logical usage of the data. The entire data set may exist on a medium that allows access, but that access provides nothing more than random start and stop points unless we add a level of abstraction to the storage at some point. In other words, disk storage that supports direct access (instead of linear access) to locations on the disk is no better than the linear mode if it does not provide a set of rules for where the data will reside. In the diagram below, the physical layer (lower left) must be leveraged through several levels of abstraction, supported by everything from hardware to interfaces and compilers.
If you do not provide the rules, then the software has to go directly to the hardware (as shown in the next diagram). This is a contrived example; it would not work, but the point is that without a scheme for storing information, you cannot just write the way you did with magnetic tape.
The reason I mention this history is that it is important to recognize that for the last 40 years the computer industry has experimented with several ways of accomplishing this data abstraction. Interestingly, aside from the linear tape merge, the industry continued to add to the list of possible techniques without completely eliminating the prior ones. This is because each of these derivations supports specific uses. And, almost always, those uses remain viable even after a new approach is created.
ARCANE PHYSICAL MAPPING EXAMPLE
We are now used to some standard models for this level of abstraction. In the NonStop world, we were proud and excited to be early users of the indexed sequential access method (ISAM). Tandem, at the time, offered several options that did not force all designs to use ISAM: users could let the operating system deliver raw blocks of data, blocks of a certain length addressed by a relative byte offset (a favorite of mine), key-sequenced files, and options for alternative paths to the data (alternate keys). ISAM, relative offset, raw sequential blocks, and alternate keys are still popular and used on many systems today, well beyond NonStop. Together, these data access mechanisms allowed Tandem to deliver high-performance, reliable transaction processing systems.
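The relative-byte-offset idea is simple enough to sketch in a few lines. This is an illustration of the addressing concept only, not a NonStop API; the file name and record width are invented. With fixed-length records, a record number maps directly to a byte position, so no index is needed at all:

```python
# Sketch of relative-byte-offset access with fixed-length records.
import os
import tempfile

RECORD_LEN = 32  # assumed fixed record width


def write_record(f, rec_no: int, data: bytes):
    f.seek(rec_no * RECORD_LEN)        # offset is computed, not searched
    f.write(data.ljust(RECORD_LEN))    # pad to the fixed width


def read_record(f, rec_no: int) -> bytes:
    f.seek(rec_no * RECORD_LEN)
    return f.read(RECORD_LEN).rstrip()


path = os.path.join(tempfile.gettempdir(), "relfile.dat")
with open(path, "w+b") as f:
    write_record(f, 0, b"alpha")
    write_record(f, 5, b"zulu")        # gaps are fine; no scan is ever done
    rec5 = read_record(f, 5)
```

The appeal is predictability: every read or write is one seek plus one transfer, regardless of how large the file grows.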
The relational model revolution began in the 1970’s: the famous research by Codd was published in 1969 and was subsequently promoted by E. F. Codd (1923-2003) and Chris Date (b. 1941) in books, reaching delivered technology in the 1980s. The relational model provided a way to define mathematical models for data that could then be leveraged in software logic. The industry embraced a new domain language called “SQL” as an easier, more flexible way to deliver much of what ISAM and other mechanisms provided. And with some sophisticated techniques, several vendors differentiated their tools to support an ever-growing list of specialized features.
Both ISAM and SQL were very good at delivering predictable transactional performance, where the data writes and reads are driven by a consistent access method. But when the access plan is not predictable, the relational model has some limitations in the consistency of its access performance. The capabilities of the SQL language and the commercial products expanded amid fierce competition that continues to the present day.
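For readers who have lived mostly in the ISAM world, here is a minimal taste of the declarative style SQL brought, using Python’s built-in sqlite3 module (table and column names are invented for illustration). The query states *what* is wanted; the engine chooses the access path:

```python
# A minimal relational-model example using Python's built-in sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, account TEXT, amount REAL)"
)
con.executemany(
    "INSERT INTO orders (account, amount) VALUES (?, ?)",
    [("A-100", 25.0), ("A-100", 75.0), ("B-200", 10.0)],
)

# Declarative: describe the result set, not the file traversal.
rows = con.execute(
    "SELECT account, SUM(amount) FROM orders GROUP BY account ORDER BY account"
).fetchall()
```

Compare this with an ISAM design, where the programmer would pick the key, position to it, and loop over the records by hand.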
In the 1990’s, several alternatives to the relational model were used in specific situations where speed, size, flexibility, and other mitigating factors made relational databases less useful. This led to startup companies forming to support alternative models such as:
- Key-value and in-memory data stores. Examples: Redis, Couchbase, Hazelcast, and memcached.
- Columnar (column-oriented) databases. Examples: Vertica, HBase, and Cassandra.
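The row-versus-column distinction is easy to show with plain Python data structures (the data set here is invented). An analytic query that touches only one column reads just that column’s array in the columnar layout, which is the core appeal of the column-oriented stores named above:

```python
# Illustrative sketch: the same data in row-oriented vs column-oriented layout.

row_store = [                              # row-oriented: one record per entry
    {"id": 1, "city": "Austin", "temp": 31},
    {"id": 2, "city": "Boston", "temp": 24},
    {"id": 3, "city": "Austin", "temp": 29},
]

col_store = {                              # column-oriented: one array per column
    "id":   [1, 2, 3],
    "city": ["Austin", "Boston", "Austin"],
    "temp": [31, 24, 29],
}

# Averaging one column scans every whole record in the row layout ...
avg_row = sum(r["temp"] for r in row_store) / len(row_store)
# ... but touches only a single contiguous array in the columnar layout.
avg_col = sum(col_store["temp"]) / len(col_store["temp"])
```

The answers are identical; only the amount of data that must be read differs, and at scale that difference dominates analytic performance.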
The Evolution to Hadoop
In the early 2000’s (around 2004), Google started what was to become the MapReduce revolution. With the success of Google and the need to handle super-large clusters of user-defined data structures, Google introduced a clustered file system that supported its own mechanisms for user data abstraction. MapReduce drove the open source Apache Hadoop project to immense popularity because it provided very large scale-out and very strong analytics processing capabilities (which was the original intent of the Google project).
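The MapReduce pattern itself fits in a few lines. This is the textbook word-count shape on toy input: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The real systems distribute these phases across a cluster; the sketch below runs them in one process:

```python
# A toy word count in the MapReduce style: map -> shuffle -> reduce.
from collections import defaultdict


def map_phase(doc: str):
    for word in doc.split():
        yield (word, 1)                      # emit one (word, 1) pair per word


def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)            # group all values by key
    return groups


def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}


docs = ["data drives data", "data at the edge"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
```

Because map and reduce are independent per key, each phase can be spread over thousands of machines, which is what made the model such a good fit for Google-scale analytics.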
ETL, Messaging, and Replication
And now for something completely different. Well, not completely different, but somewhat different. The trouble with this evolution is that over time, you might want to take data and migrate it from one perspective to another. For NonStop users, an example is moving back and forth between Enscribe and SQL. For other situations, it might be moving data from SQL to a column database, or even just from one SQL implementation to a different one.
For this, the industry came up with another term, even before the MapReduce “revolution”. An entire industry was built in the 1980’s and 1990’s around the manipulation and movement of data into different databases and data abstraction designs. Extract, Transform, and Load (ETL for short) was the means by which data was massaged and moved into each of these different perspectives. Going all the way back to the 1990’s, databases have had this problem of formatting and moving data from one database to another. Usually, this is accomplished in one of three ways:
- Direct user software that converts the data on-the-fly (within the transaction), as a trickle feed (in groups of updates), or in batch processing (on a schedule).
- Replication of the data as part of the database update function using tools for data replication.
- As a function of a data bus of some kind like an enterprise service bus (ESB) or similar technology.
The function of an ETL process might be simply to extract and load, or it can enhance, enrich, or filter the data as it loads it. For example, name-format normalization: “K. Moore” becomes last name “Moore”, first initial “K”. ETL is popular for managing many copies of the data in order to support the different perspectives I described above.
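The name-normalization example above can be sketched as a tiny transform step inside an equally tiny ETL loop (the field names and the splitting rule are simplifying assumptions; real ETL tools carry far more configuration):

```python
# Sketch of the transform step: normalize "K. Moore" into separate fields
# as the data is loaded.


def transform_name(raw: str) -> dict:
    first, last = raw.rsplit(" ", 1)           # split on the final space
    return {"last": last, "first_initial": first.rstrip(".")[0]}


def etl(rows):
    # Extract is simulated by iterating the source rows; load by appending
    # to an in-memory target list.
    target = []
    for row in rows:
        target.append(transform_name(row["name"]))   # transform while loading
    return target


loaded = etl([{"name": "K. Moore"}, {"name": "Jane Doe"}])
```

The essential point is where the work happens: the data is reshaped in flight, between the source perspective and the target perspective.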
The other value of ETL is that it lets you take transactional data and make it fit into a non-transactional database.
The latest technologies are moving more and more toward fluidity of data storage and access. It is assumed that all of the above database techniques remain useful for different reasons and for legacy support needs. But there needs to be a better way to combine the data access methods with the ETL functionality. In addition, it is desirable to perform some analytical work (such as aggregation, summarization, and time slicing) on the consumption side of the processing. In other words, providing these functions as the data arrives is more efficient than providing them after the data has settled into a singular database domain, as has been done for decades.
Data Movement by Store and Forward
Welcome data streams into this story. The latest technique embraces an eventual end model where the use case can dictate the database, as has been done for decades. However, instead of just providing an ‘after the fact’ replication or messaging service for data, the data is funneled into a service that any consumer might use.

The concept leverages a unified log. A unified log receives the flow of data and allows actions to be taken as the log is being committed to storage. This includes data applications that want to put the data into whatever database is best for later processing. For transactional data, the user can subscribe to just the content that is desired, not necessarily to everything. In some ways, this is like a simple publish-and-subscribe model; in other ways, it is like a filter; and from another perspective, it can be an aggregation engine.

There are a few data streaming tools available today. Kafka is the most popular, but others are emerging in this space, and several special-feature frameworks enhance the capabilities of the streaming design. Unified log and streaming services have opened the door for many new, innovative tools and techniques to filter and cull data directly at the time of inception. This makes the model attractive for Internet of Things (IoT) data harvesting, analytics processing, artificial intelligence (AI), and transactional processing.
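The unified-log idea can be sketched in a few lines of plain Python. This is a toy stand-in for Kafka-style tools, in spirit only, with invented event fields: producers append events to an ordered log, and each subscriber sees every event as it is committed, applying its own filter so that it consumes only the content it wants:

```python
# Minimal unified-log sketch: an append-only log with filtering subscribers.


class UnifiedLog:
    def __init__(self):
        self.log = []                      # the ordered, append-only log
        self.subscribers = []              # (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        self.subscribers.append((predicate, callback))

    def append(self, event: dict):
        self.log.append(event)             # commit the event to the log
        for predicate, callback in self.subscribers:
            if predicate(event):           # each consumer filters the flow
                callback(event)


log = UnifiedLog()
big_spends = []
# This consumer subscribes only to the content it cares about.
log.subscribe(lambda e: e["amount"] > 100, big_spends.append)

log.append({"account": "A-100", "amount": 50})
log.append({"account": "B-200", "amount": 250})
```

Every event lives once in the log; the publish-and-subscribe, filter, and aggregation behaviors are all just different consumers attached to the same flow.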
The advantages are not universal. Some activities cannot and should not go into the stream (the funnel shown above). However, there are often enrichment and enhancement tools that make it worth leveraging this sort of design.
Aggregation, filtering, and time slicing are examples of add-on services that can be used with the streaming data service itself. Examples of the add-on tools are Spark, Flink, Storm, Apex, Flume, and several others. Each of these provides a specific function that eases and/or enriches the data used by the consumer software.
NonStop and Streams
HPE NonStop can leverage this new model in some interesting ways. It can leverage the tools to deliver enriched and aggregated source data into new transactional features. It can also feed into enterprise infrastructure as part of an enterprise stream of data that others can consume. Think of a NonStop application that is processing credit authorizations and at the same time can see the current count of similar activities, or can see whether the same account was accessed within a sliding window of time. All of this can happen even without being the database of record for those other sources of data. This is what this database access model can provide: a shared, universal, yet singular perspective based upon the user’s need. The data does not have to be fully replicated to the NonStop, nor does it have to be provided by a service bus or message system. Instead, the tool can provide the information as part of an ongoing transaction state as the data flows into the NonStop. It is like an enterprise bus for databases.
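The sliding-window check described above is a simple stream-side computation. Here is a hypothetical sketch (account names, timestamps, and the window size are all illustrative) of counting how often an account has been seen in the last N seconds of the stream, without ever being the database of record:

```python
# Sliding-window activity counter over a stream of account events.
from collections import defaultdict, deque


class SlidingWindowCounter:
    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = defaultdict(deque)   # account -> timestamps, oldest first

    def observe(self, account: str, now: float) -> int:
        q = self.events[account]
        q.append(now)
        while q and q[0] < now - self.window:
            q.popleft()                    # evict events outside the window
        return len(q)                      # activity seen within the window


counter = SlidingWindowCounter(window_seconds=60)
counter.observe("A-100", now=0.0)
counter.observe("A-100", now=30.0)
hits = counter.observe("A-100", now=90.0)  # the event at t=0 has aged out
```

A transactional application can consult a counter like this as each authorization flows past, getting a cross-stream signal in constant time per event.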
Indeed, your data has been moving around for years. The latest technologies and tools accept this premise and provide a sophisticated way to exchange information without all of the point-to-point connections or complete replication of data. Where the data will go next is anyone’s guess.