
Fortunately, the open-source community has attacked this area with a plethora of tools for storing, searching, and visualizing operational data. Unfortunately, the number of options is overwhelming. Fortunately again, the APIs and interfaces have become well defined, leaving a small set of choices for robust delivery of information into these enterprise observability software stacks. In other words, it is easier than ever to deliver enterprise observability for all systems, including and especially NonStop!
What is “Observability”?
Observability is a term used to describe the function of “observing” information about the IT systems in an enterprise. As part of “DevOps” processes, it covers the discovery of operational information and its influence on operational change. In this article, the term refers to the entirety of the operational information that can be derived and collected from all systems in the enterprise.
Key Concepts
Introduction to the technology
Because the tools that expose observational data have grown up around a long history and a diverse set of interface requirements, there are dozens (hundreds?) of options for collecting, searching, and visualizing this sort of data. The key need is for a few common interfaces through which every platform can deliver its data for ingestion and presentation.
Fortunately, a few tools and interfaces have become common and can be used as de facto standards.
The Operational Process
Our team at HPE has come up with a brief description of what it takes to become “observable”. While this is our own interpretation, it is intended to align with industry practice on how to participate and how to leverage as many combinations of tools as possible. The choice of tools is user dependent; the general perspective is universal.
1) Produce
Data on systems is produced by application software, operating systems, subsystems, remote activity, and other various sources. In general, there are three types of useful data for operational visibility:
Metric data: As implied, this is metered information such as counters, gauges, and histograms.
Trace data: Data that represents a trace of processing. Examples include both internal trace data (e.g. procedures and functions inside a program) and external trace data (i.e. multiple services that combine into one logical traceable activity, e.g. debit + credit = a single transfer trace).
Log data: Generic or specific time-sequenced information of any kind. Examples are system event data (like EMS) and security events.
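To make the three data types concrete, here is a minimal Python sketch of how each might be modeled. The class names, field names, and sample values are our own illustrative assumptions; they are not taken from any particular tool or from the NonStop platform.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
import time


@dataclass
class Metric:
    """A metered value such as a counter, gauge, or histogram sample."""
    name: str
    kind: str  # "counter", "gauge", or "histogram"
    value: float
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class Span:
    """One step of a trace; spans sharing a trace_id form one logical activity."""
    trace_id: str
    span_id: str
    name: str
    parent_span_id: Optional[str] = None


@dataclass
class LogRecord:
    """A time-sequenced event, for example a system or security event."""
    timestamp: float
    severity: str
    body: str


# Hypothetical example: a transfer traced as debit + credit under one trace_id.
transfer = Span(trace_id="t1", span_id="s1", name="transfer")
debit = Span(trace_id="t1", span_id="s2", name="debit", parent_span_id="s1")
credit = Span(trace_id="t1", span_id="s3", name="credit", parent_span_id="s1")

# Hypothetical metric and log samples.
cpu_busy = Metric(name="cpu_busy_percent", kind="gauge", value=42.5, labels={"cpu": "0"})
event = LogRecord(timestamp=time.time(), severity="INFO", body="process started")
```

The shared trace_id is what lets a visualization tool stitch the debit and credit steps back into a single transfer.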
2) Capture
The data then needs to be captured in a consistent way. Capture is often overlooked, yet it is the most critical piece of the puzzle, and deciding on the capture (and distribution) tool is possibly the most important decision to be made. Fortunately, the industry is finally settling on a common capture tool and API called OpenTelemetry. The name is written as one word but combines two ideas: “open”, for open access to the data, and “telemetry”, for the measuring and collecting of operational data.
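As an illustration of what a consistent capture format looks like, the following Python sketch builds a minimal log payload in the JSON encoding of the OpenTelemetry protocol (OTLP). It uses only the standard library and does not send anything. The function name, service name, scope name, and message are hypothetical, and this is a sketch under those assumptions rather than a description of how Frigg delivers its data.

```python
import json
import time


def otlp_log_payload(service_name: str, severity: str, message: str,
                     time_unix_nano: int) -> dict:
    """Build a minimal OTLP/JSON logs request body holding one log record."""
    return {
        "resourceLogs": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}}
                ]
            },
            "scopeLogs": [{
                "scope": {"name": "example.sketch"},  # hypothetical scope name
                "logRecords": [{
                    # 64-bit integers are carried as strings in the JSON encoding.
                    "timeUnixNano": str(time_unix_nano),
                    "severityText": severity,
                    "body": {"stringValue": message},
                }],
            }],
        }]
    }


# Hypothetical usage: one informational event from a demo service.
payload = otlp_log_payload("nonstop-demo", "INFO", "process started", time.time_ns())
body = json.dumps(payload)
```

A collector that accepts OTLP over HTTP would typically receive a body like this on its logs endpoint (by convention port 4318, path /v1/logs); in practice an OpenTelemetry SDK or agent usually builds and sends it for you.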
3) Visualize
The flashy part is the visualization. This is what most people think of with regards to observability. There are tools for selecting, searching, aggregating, filtering, highlighting, and visualizing all the above data types. These tools are very powerful: they can be integrated into AI systems and presented as dashboards for the entire enterprise, with drill-down and drill-in capabilities for metric, trace, and log data.
The Design and Capabilities
A key goal of this project is to provide the most flexibility and the most open access to the widest possible choices for all three steps of the process. For Produce, we chose to deliver with standard capabilities of the NonStop platform itself, minimizing any new specialty system software or user application intervention. For Capture, we chose the OTEL Collector (described below) but also tested and leveraged alternatives such as Logstash. For Visualize (and distribution), we continually rearchitect and deploy different tools. The flexibility in the visualization suite of options is impressive: we have used dozens of different configurations and visualization tools.
The Collector
The current design leverages a single collector. This is the recommended approach and is representative of most large enterprise designs. We chose to use the OpenTelemetry Collector (OTEL Collector) for several reasons:
- OpenTelemetry is the most common and most well-defined open API and collector design in the industry.
- The current CNCF guide embraces OTEL as a full observability stack, and the collector component seems to be emerging as the recommended collector (and API) standard (see: CNCF Landscape in references).
- OTEL integrates with nearly everything we have considered. It allows the use of specialized visualization data structures and visualization tools independent of the first-level OTEL API.
- The OTEL libraries for Java and Python work on NonStop today, and future library availability will enable TS/MP services to be instrumented for trace data observation.
The collector is the critical piece of the design, but the choice of which one to use is not as critical. In the Frigg project design, it is designated to be the single place to send all information from any of the systems. Our demonstration includes up to 10 different servers, but an enterprise would likely have far more than that, and when you add in the number of containers executing within the enterprise, the collector becomes the single best vault to monitor, research, and document the enterprise execution state. This powerful capability is one of the key reasons to leverage an observability stack of any kind. One final note about the collector: while we chose to use the OTEL Collector in this design, we could also use alternatives or combinations. The impact of changing the collector is minimal to any of the systems that send to it.
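One way to keep the impact of a collector change small is to isolate the delivery step behind a narrow interface. The Python sketch below illustrates that idea only; the exporter classes, their output formats, and the report_event function are hypothetical, the in-memory lists stand in for a network connection, and none of this describes the actual Frigg implementation.

```python
import json
from typing import List, Protocol


class Exporter(Protocol):
    """Anything that can deliver a record to some collector."""
    def export(self, record: dict) -> None: ...


class JsonLinesExporter:
    """Hypothetical exporter: one JSON object per line."""
    def __init__(self) -> None:
        self.sent: List[str] = []  # stands in for a network connection

    def export(self, record: dict) -> None:
        self.sent.append(json.dumps(record, sort_keys=True))


class KeyValueExporter:
    """Hypothetical exporter: space-separated key=value pairs."""
    def __init__(self) -> None:
        self.sent: List[str] = []  # stands in for a network connection

    def export(self, record: dict) -> None:
        self.sent.append(" ".join(f"{k}={v}" for k, v in sorted(record.items())))


def report_event(exporter: Exporter, system: str, message: str) -> None:
    """Producer code: identical regardless of which exporter is plugged in."""
    exporter.export({"system": system, "message": message})


# The same producer call works against either delivery path.
json_path, kv_path = JsonLinesExporter(), KeyValueExporter()
for exporter in (json_path, kv_path):
    report_event(exporter, "node1", "started")
```

Swapping the collector then means swapping one exporter object (or, more commonly, one configuration entry), while the code that produces the data stays untouched.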
Aggregation, Indexing, Distribution and Visualization
It is beyond the scope of this article to describe each of the components that we leverage. We encourage users to embrace the standards of their organization. The Frigg project attempts to demonstrate as many tools as possible, without regard to what would normally be used in an enterprise. In the normal use case, a single visualization tool or a few different ones would be used in combination with a collector and possibly additional tools for aggregation and indexing (e.g. Prometheus). The most important consideration for the project is flexibility of choice.
Frigg currently demonstrates the use of Prometheus for metrics data, with Grafana and SigNoz used for dashboard presentation. We have also used other tools and leverage different software stacks depending on the resources available and which stacks seem desirable.
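For readers unfamiliar with Prometheus, metrics are commonly exchanged in a simple line-oriented text exposition format. The Python sketch below renders a single gauge in that format. The function names, metric name, help text, and label are hypothetical examples, and the sketch deliberately skips the escaping of special characters in label values that a real exporter would need.

```python
from typing import Dict


def prometheus_line(name: str, value: float, labels: Dict[str, str]) -> str:
    """Render one sample line; label values are assumed to need no escaping."""
    if labels:
        rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{rendered}}} {value}"
    return f"{name} {value}"


def prometheus_gauge(name: str, help_text: str, value: float,
                     labels: Dict[str, str]) -> str:
    """Render a gauge with its HELP and TYPE comment lines."""
    return "\n".join([
        f"# HELP {name} {help_text}",
        f"# TYPE {name} gauge",
        prometheus_line(name, value, labels),
    ]) + "\n"


# Hypothetical metric: CPU busy percentage for one processor.
text = prometheus_gauge("cpu_busy_percent", "CPU busy percentage", 42.5, {"cpu": "0"})
print(text, end="")
```

Because the format is plain text, even a short script can produce it, which is part of why metrics are a low-effort entry point into an observability stack.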
The team welcomes interaction with NonStop users to better understand the common needs of our enterprise users. We intend to add more tools and dashboards based upon user requests and availability of the various software stacks. Further, we encourage NonStop users to see the demonstration in person. A picture is worth, well, you know…
Conclusion
The Frigg team was surprised at how quickly, easily, and flexibly the observability ecosystem was deployed. For log data, no development was required. For much of the trace data, minimal software effort was required, and for metrics, some of the data was readily available and sent to this ecosystem via scripts. The goals of the project were met and continue to be enriched as we learn more and follow this explosive area of IT. A key takeaway is that it takes very little work to instrument NonStop to use these observability resources.
For more information
Frigg is being presented and demonstrated at regional Connect user meetings. We welcome interaction and input on where to take the project next. Please feel free to contact any of the team members with questions and further needs in the observability space.