How Striim can easily unlock the value of your data without disruption to existing applications and databases
I recently had the opportunity to work with a new partner product for HPE NonStop systems, named Striim (pronounced “stream”). As you might guess from its name, Striim is designed to process streams of data content. Striim describes its product as “a distributed data integration and intelligence platform that can be used to design, deploy, and run data movement and data streaming pipelines”. In practical terms, this means Striim can access data from many diverse sources, act on that data to derive intelligence and add value, and then deliver the results to any one of many diverse targets, all in real-time, with high volume and velocity, and without disruption to existing applications and databases.
The founders of Striim had the foresight to predict that data consumption trends would extend beyond the confines of disk-bound data sources to include an array of streaming source points – from change-data-capture of traditional OLTP databases to edge-based sensors streaming events at high volumes – and everything in between. I think they got it right. They created a product that is easy to work with and very powerful in its capabilities, yet extensible for custom processing requirements.
The need to extract information from data has become insatiable. The quest for ever more data has evolved from simple files and relational databases, through operational data stores, data marts, data warehouses, and “no-SQL” big data lakes. Simultaneously, the quest to obtain data in real-time has produced new data architectures such as data meshes and fabrics. Simply put, companies need the ability to access data within their enterprise wherever it exists, do it quickly and easily, and transform it into meaningful “products” for further consumption. This is the new paradigm for data.
Striim is a product that addresses these concerns, and its design makes it easy for a typical business analyst to use – one need not have deep knowledge of data formats, databases, or programming. Striim provides an intuitive web-based GUI through which a user assembles an application by moving around “components”, the fundamental building blocks of the application. These processing components create a data flow that accomplishes the objectives of the application. It is all very modular – the output stream of one component becomes the input stream of one or more subsequent components.
The user selects specific components from a palette of objects (Fig. 1) grouped by functions, such as data sources and targets, continuous queries, windows, enrichers, and event transformers. The user combines these components to produce the desired effect for the data stream – to enrich, join, split, merge, partition, obfuscate, encrypt, or aggregate elements – all done in memory and in real-time. Parameter selections and settings through an intuitive configuration interface allow customization. “Programming”, as far as it goes, consists of manipulating data elements using a simple SQL language.
Once the user is satisfied with their design, the application is deployed for testing or production use. Striim takes care of the details of deployment, instantiating processes as needed, compiling queries, and starting all components. This all happens “behind the scenes.” The user simply focuses on the application objectives and the assembly of the data flow, while Striim takes care of the hard parts. The result is that customers can design and deploy sophisticated applications with ease and with a degree of productivity significantly greater than what traditional programming methods deliver.
Striim provides over 150 connectors – these are built-in adapters that interface to data sources and targets such as the HPE NonStop audit trail for change-data-capture (for all audited file types), and direct access to SQL/MP and SQL/MX databases. Striim also provides adapters for many other databases and file formats, event distributors, HTTP, Kafka, or TCP/IP data streams, and for the most common public cloud providers. Customers requiring access to a unique data source or target can integrate their own custom-developed Java adapter into the Striim application.
Figure 1. Striim application design GUI
Components from the “palette” on the right are used to construct the application flow, in the center.
The output of one component is the input stream for one or more subsequent components.
Accessing data on HPE NonStop systems
Striim offers two methods to access data on NonStop. The first and primary method is change-data-capture (CDC), in which Striim uses NonStop APIs to read data from the TMF audit trail – a real-time capture of all changes made to the database. The second method is batch-oriented, where data is sequentially read from, or written to, databases on NonStop. The batch method is used in several ways: to replicate data content from a source to a target, to load data into Striim in-memory tables, to explore and prepare data, and to capture results. As I explain later, batch access is also a convenient way to process data for more general operations – similar to having a general-purpose data utility tool.
Using Striim for change-data-capture on HPE NonStop systems
Striim provides CDC adapters for Enscribe, SQL/MP, and SQL/MX databases. TMF keeps data changes in the audit trail only for audited files and tables. Customers using unaudited files or tables may consider using a product such as AutoTMF to enable auditing by TMF, which facilitates the use of Striim.
To obtain database changes from the audit trail, the user configures an agent, which Striim deploys onto the NonStop system. The agent reads CDC data from the audit trail, filters the content for the files or tables of interest and sends the results into the Striim application for processing.
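Conceptually, the agent's filtering step works as sketched below. This is an illustrative Python sketch of the idea, not Striim's agent code; the event field names are assumptions.

```python
# Illustrative sketch (not Striim code): filter a stream of CDC events,
# keeping only the audit-trail changes for tables the application wants.

def filter_cdc_events(events, tables_of_interest):
    """Yield only the change events for tables of interest."""
    for event in events:
        if event["table"] in tables_of_interest:
            yield event

# Hypothetical audit-trail change records (field names are assumptions).
audit_events = [
    {"table": "SALES.ORDERS",   "op": "INSERT", "key": 1001},
    {"table": "SALES.PAYMENTS", "op": "UPDATE", "key": 2002},
    {"table": "HR.EMPLOYEES",   "op": "DELETE", "key": 3003},
]

filtered = list(filter_cdc_events(audit_events, {"SALES.ORDERS", "SALES.PAYMENTS"}))
print([e["table"] for e in filtered])  # the HR change is filtered out
```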
Using Striim to access databases on HPE NonStop systems
To read or write SQL/MP or SQL/MX tables, Striim uses the SQL/MX JDBC type-4 driver along with the MXCS connectivity subsystem for database access. It is simply a matter of providing the URL connection details (IP address, port, catalog, schema, datasource, and user credentials) and specifying a table and an optional SQL statement. The SQL statement can perform table joins or any other basic pre-processing required, such as filtering data; otherwise, the table is read or written as-is. A SQL/MP alias provides access to a SQL/MP table from the MX run-time environment.
High-level Striim architecture
With the exception of the CDC agents that obtain audit changes and the MXCS servers that access a database, all processing for a Striim application runs on one or more Linux systems (Fig. 2).
Figure 2. High-level architecture of Striim run-time environment
This architecture accomplishes several objectives:
- Offloads all application-related processing (except for data access) from the NonStop system, which minimizes the use of resources. Striim can supplement existing or new business applications with sophisticated real-time analytical processing without requiring any application code changes or disrupting NonStop service level agreements.
- Facilitates high scalability. Customers can deploy Striim applications across multiple Linux systems to partition the workload across all participants, thereby scaling-out capacity to match workload demands.
- Ensures fault tolerance. The Striim multi-node architecture provides high availability to applications. Striim employs a reliable check-pointing mechanism to ensure continued processing in the event of a node failure, following the principle of “exactly once” processing. This ensures Striim replicates data from a source to a target with no duplicate or missing data, even in the event of a failure.
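The “exactly once” principle can be illustrated with a minimal sketch: the target's checkpoint records the last event applied, so a replay after a failure skips anything already delivered. This is an illustration of the principle under simplified assumptions, not Striim's implementation (a real system must make the write and the checkpoint update atomic).

```python
# Minimal sketch of checkpoint-based "exactly once" delivery.

applied = []             # stands in for the target database
checkpoint = {"seq": 0}  # durable marker of the last event applied

def apply_events(events):
    """Apply events in order, skipping anything already checkpointed."""
    for seq, payload in events:
        if seq <= checkpoint["seq"]:
            continue              # already applied before the failure
        applied.append(payload)   # write to the target ...
        checkpoint["seq"] = seq   # ... then advance the checkpoint

stream = [(1, "a"), (2, "b"), (3, "c")]
apply_events(stream[:2])   # node fails after delivering event 2
apply_events(stream)       # recovery replays the stream from the start
print(applied)             # each event lands exactly once
```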
How customers can use Striim with HPE NonStop
Before building a Striim application, customers typically will explore their data to understand the content and data patterns. Often, they will construct a profile of the data, and examine data relationships. It is critical to understand the underlying content before building an analytics application.
Users often will build simple applications with Striim to examine their data. This is part of the data exploration/preparation process. It may involve loading simple data files into SQL tables, cleansing data, or writing the results of data profiling to output files. The point is that while Striim can produce very sophisticated analytical applications, it is also a rather convenient utility for database loads, extracts, and general data manipulation, i.e., ETL-oriented work. Additionally, users can build applications with Striim to generate data and test their mainline applications.
Data from CSV or other common file formats can be loaded into a NonStop SQL database quickly and easily with Striim. Within literally a few minutes, a user can configure a Striim application to read and load external source data into a database.
Furthermore, the user can implement any required data transformations, such as data normalization, filtering, enriching, cleansing, or obfuscation, using simple SQL statements or built-in Striim data masking functions. For example, one of the built-in functions masks credit card values, transforming a value such as “1234-5678-9012-3456” into “xxxxxxxxxxxxxxx3456”.
There are additional masking functions to manage emails, phone numbers, and social security numbers, as well as generic and custom masking capabilities.
Should the volume of data be large enough to benefit from parallelization, the user can extend the Striim application in multiple ways to parallelize processing (even without using multiple Linux systems). The scaling result is very similar to using a product like DataLoader on NonStop, but with a fraction of the effort, and notably, with no programming. Productivity is very high.
Data extracts work in a similar manner. A user can configure an application to mask data elements or transform data in any way, and write the results to a variety of file formats, to an event processor/distributor such as Kafka, or directly into a foreign database, either on-premises or in the cloud. Striim includes adapters for the Amazon Web Services, Google Cloud, and Microsoft Azure cloud providers.
Taken together, these capabilities accelerate the process of data exploration and preparation.
Some customers will use Striim for data replication, perhaps to foreign platforms or databases. To keep a replicated file or table in sync with the latest changes, they first replicate an initial copy and then configure a Striim CDC agent to apply changes from the TMF audit trail.
A replication stream can include any of the previously discussed data manipulation techniques that filter, enrich, or otherwise transform the data prior to delivery to the target. For example, customers may be required to obfuscate columns such as social security and credit card numbers prior to delivery to offsite targets. These and other transformations are easily included in the replication stream.
Although Striim makes file access a breeze with its suite of built-in adapters, the real power of Striim – in my opinion – is in its ability to perform sophisticated analytics in real-time (Fig. 3).
Striim makes it easy to process real-time event streams using continuous queries (CQs). These are SQL queries contained within processing components that operate on each row of data as it arrives. CQs filter, transform, and obfuscate data columns in preparation for downstream processing.
CQs can also enrich data with supplementary information by combining the input stream with in-memory tables through a SQL join. For example, by joining the zip code from an input stream to an in-memory table containing enhanced address information, the CQ enriches the data stream with the city, state, and perhaps GPS coordinates, which enhance visualization in the dashboard display.
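The zip-code enrichment described above amounts to a lookup join against an in-memory table. The sketch below shows the idea in Python; the field names and table contents are illustrative assumptions, not Striim's syntax.

```python
# Sketch of stream enrichment: join each event against an in-memory
# lookup table keyed by zip code, merging in city/state details.

zip_lookup = {
    "95134": {"city": "San Jose", "state": "CA"},
    "10001": {"city": "New York", "state": "NY"},
}

def enrich(event):
    """Merge supplementary address fields into the event by zip code."""
    extra = zip_lookup.get(event["zip"], {})
    return {**event, **extra}

sale = {"store": 17, "zip": "95134", "amount": 42.50}
print(enrich(sale))  # the event gains city and state fields
```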
Real-time analytics usually requires a variety of aggregate metrics based on a moving time interval. For example, a retail application may need to produce aggregated sales metrics every 5 minutes, by store location and/or product category. This involves SQL aggregate functions such as average, maximum, standard deviation, and count.
Computing such aggregates requires a bounded event stream. Striim provides options, such as sliding and discrete windows, where the window size is limited by event counts or by time duration. For example, a user may select a “jumping” window to produce discrete 5-minute periods as one way to aggregate results. Alternatively, the user could use a sliding window of time (or event counts) to produce moving averages, running counts, etc.
To produce such results, the user simply adds and configures a Striim window component to the application flow and follows it with a CQ component that computes the desired aggregates. It is that simple.
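The jumping-window case can be pictured with a small Python sketch: events are assigned to discrete, non-overlapping 5-minute buckets, and an aggregate is computed per bucket. This illustrates the window semantics only; it is not how a Striim window component is written.

```python
# Sketch of a "jumping" 5-minute window feeding an aggregating query:
# events fall into discrete time buckets, and a total is emitted per bucket.

from collections import defaultdict

WINDOW = 5 * 60  # window size in seconds

def jumping_window_totals(events):
    """events: (timestamp_seconds, amount) -> sales total per bucket."""
    buckets = defaultdict(float)
    for ts, amount in events:
        buckets[ts // WINDOW] += amount  # discrete, non-overlapping periods
    return dict(buckets)

sales = [(10, 5.0), (70, 7.5), (310, 2.0)]  # 310s lands in the second window
print(jumping_window_totals(sales))         # {0: 12.5, 1: 2.0}
```

A sliding window would differ only in that each event participates in every window that overlaps its timestamp, producing moving rather than discrete aggregates.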
- CDC input stream
- CQ: Data preparation (data typing, filtering)
- In-memory lookup tables (data enrichment)
- 5-minute Jumping window (data grouping)
- CQ: Data aggregation and enrichment
- CQ: Data ranking/categorizing
- CQ: Alert evaluation
- Output target for alerts
- CQ: Data preparation for dashboard
- Output target for dashboard
Figure 3. Complete point-of-sale analytics application flow showing CDC capture and real-time analytics with alerts and visualization
The Striim product includes the ability to create real-time graphical dashboards (Fig. 4) using a variety of common displays, such as bars, lines, bubble charts, and heat maps. The dashboard receives events from the application flow as quickly as the upstream components can produce them. In the case of the above-mentioned retail application, the dashboard is updated every 5 minutes with the latest sales data. Users can drill down to view the details of the sales data by location or other included dimensions.
Figure 4. Dashboard visualization for point-of-sale analytics application
More sophisticated analytical applications use machine learning (ML) algorithms to find obscure, yet valuable data relationships and to build evaluation/prediction models for real-time execution. For example, ML models can evaluate a retail transaction stream for anomalies and for fraud detection – emitting alerts as necessary, or analyze real-time trends proactively to circumvent inventory shortages or supply chain concerns based on dynamic time-based purchase patterns.
Striim provides several built-in linear and non-linear regression functions for building ML models easily. A user typically constructs a model from samples of static training data, but static models can sometimes “age out” and become ineffective as data evolves. A more robust approach uses Striim functions to create an in-memory adaptive model that defines the frequency of model updates and retraining. As changes occur in data patterns of the input stream, Striim updates the model automatically to remain current.
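The adaptive idea can be sketched as an online regression: instead of fitting once on static training data, running sums are updated as each event arrives, so the fit tracks the evolving stream. This Python sketch illustrates the concept; it is not one of Striim's built-in functions.

```python
# Sketch of an adaptive (online) linear regression model y = a + b*x:
# each new observation folds into running sums, keeping the fit current.

class OnlineLinearModel:
    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        """Fold one observation into the running sums."""
        self.n += 1
        self.sx += x; self.sy += y
        self.sxx += x * x; self.sxy += x * y

    def predict(self, x):
        """Least-squares estimate from the sums seen so far (needs n >= 2)."""
        b = (self.n * self.sxy - self.sx * self.sy) / (self.n * self.sxx - self.sx ** 2)
        a = (self.sy - b * self.sx) / self.n
        return a + b * x

model = OnlineLinearModel()
for x, y in [(1, 3), (2, 5), (3, 7)]:  # stream follows y = 2x + 1
    model.update(x, y)
print(model.predict(10))               # 21.0
```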
In addition to its built-in ML functions, Striim applications can use powerful third-party ML libraries like Weka and Massive Online Analysis (MOA). These algorithms are well suited for large-scale machine learning and predictive modeling. Together, they provide algorithms for clustering, classification, regression, visualization, outlier/change detection, pattern mining, and recommender systems. Plenty of choices from which to construct effective and sophisticated real-time models.
Given the modularity of the Striim product, customers can initially deploy core applications to gain immediate value and later include more sophisticated capabilities, such as ML-based processing. With this approach, customers gain immediate value from the mainline Striim application while taking the time necessary to study exactly how to design and deploy their advanced analytics.
Striim: Final thoughts
This very powerful and exciting new product allows customers to expose valuable data easily from their core databases, especially content locked away within legacy applications – without any disruption to those applications and databases.
I was very impressed with how intuitive the product is to use, and how quickly I was able to construct applications that accessed and manipulated data – especially considering how long the same work would have taken with typical programming methods.
To me, the essential value of Striim is in its ability to design and deploy very sophisticated analytical applications against many types of event streams, bringing the capabilities of ML to NonStop applications. Striim clearly has the potential to enhance NonStop applications, whether they be older legacy applications based on Enscribe or the latest based on SQL/MX. It certainly warrants a close examination by anyone with such interests.
For more information, please contact your local HPE NonStop sales team, or refer to the Striim website.