What’s New in CEP and Stream Analytics

Roy Schulte   21 November 2023

We note seven trends that are changing the way organizations use streaming event data and complex-event processing (CEP). The practice of CEP, including stream analytics and streaming extract-transform-load (ETL), is evolving and spreading. In this context, we use CEP to mean stateful computation on multiple event objects to produce abstractions (complex events), regardless of how it is accomplished: through custom-tailored application code; an event stream processing (ESP) platform such as Flink, ksqlDB, Microsoft Azure Stream Analytics, or Spark Streaming; or any other software tool.

The seven trends:

  • Demand for real-time analytics continues to grow.
  • The cost of streaming data is dropping, and its volume is exploding.
  • Stream processing is being implemented at multiple levels in the network from the edge to the cloud.
  • Log-based “streaming” messaging systems, such as Kafka, are becoming ubiquitous.
  • Open source and open core ESP platforms are proliferating.
  • Organizations are starting to use more-advanced practices such as stream metadata management and lineage tracking.
  • More kinds of products, including analytics tools and DBMSs, are adding good support for streaming data (i.e., real-time CEP is not limited to ESP platforms).

These trends are happening simultaneously, and every organization is likely to be affected by some or all of them. Here is further explanation:

Demand for real-time analytics continues to grow.

Corporations and consumers alike have come to expect their devices to be smart, and their business dashboards or smart phone apps to provide up-to-the-minute situation awareness about everything from the state of their call center to when their packages will arrive. Not all real-time analytics involves the use of streaming event data and CEP, but much of it does. If some of the input is a stream (an unbounded sequence) of data records that document things that happen (e.g., state changes), typically with a time stamp on each record, it’s an event stream. If the computation involves simple aggregation (e.g., counting, summing or averaging); selection (e.g., maximum, minimum, first, last); or pattern detection among a set of multiple events (records), then it’s CEP. Every medium and large organization has real-time (or, technically, near-real-time or soft real-time) applications that use streaming data.
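
To make the definition concrete, here is a minimal sketch in plain Python rather than any particular ESP platform; the event values, window size, and threshold are invented for illustration. It computes a per-window average (simple aggregation) and flags a run of consecutive high readings (pattern detection) over a timestamped stream:

    from collections import defaultdict

    # Illustrative event stream: (timestamp_seconds, sensor_id, temperature).
    events = [
        (0, "s1", 20.0), (12, "s1", 21.5), (35, "s1", 80.2),
        (61, "s1", 81.0), (70, "s1", 82.3), (95, "s1", 19.8),
    ]

    WINDOW = 60   # tumbling-window size in seconds (assumed)
    HIGH = 75.0   # "high temperature" threshold (assumed)

    # Simple aggregation: average temperature per tumbling window.
    windows = defaultdict(list)
    for ts, sensor, temp in events:
        windows[ts // WINDOW].append(temp)
    for w, temps in sorted(windows.items()):
        print(f"window {w}: avg={sum(temps) / len(temps):.1f}")

    # Pattern detection: emit a complex event (an alert) after
    # three consecutive readings above the threshold.
    streak = 0
    for ts, sensor, temp in events:
        streak = streak + 1 if temp > HIGH else 0
        if streak == 3:
            print(f"ALERT: {sensor} ran high 3 times in a row at t={ts}")

The computation is stateful in both halves: the windows accumulate values across records, and the streak counter carries context from one event to the next, which is what distinguishes CEP from record-at-a-time filtering.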

The cost of streaming data is dropping, and its volume is exploding.

The volume of streaming data from physical devices, i.e., IoT data, is exploding because sensors don’t cost much now and because so many new devices have embedded chips and network connectivity. It’s not just smart phones; it’s also Internet-connected TVs, fitness and medical devices, refrigerators, cars, airplanes, factory machines, and heavy equipment. The volume of data from non-IoT sources, such as e-commerce sites and social media, is also expanding rapidly. All of this is made possible by the low cost of integrated circuits and advances in fiber networks, 5G and other communication technologies.

Stream processing is being implemented at multiple levels in the network.

In an IoT scenario, much of the initial processing of event data occurs on the edge in endpoint devices or near-edge gateways and servers. Most of this is performed in custom low-level code, but the use of ESP platforms and other edge analytics tools is growing. Further processing occurs in data centers and in private and public clouds. Cloud ESP brings the generic benefits of cloud: elastic scalability, self-provisioning, continuous patching and versioning, and offloading most of the work required for high availability and disaster recovery. This can reduce the cost of stream analytics, particularly if workloads vary over time, and makes CEP available to a broader range of applications and organizations.
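
A minimal sketch of the edge-side pattern described above, in plain Python (the batch size, threshold, and record layout are assumptions, not any product’s API): raw readings are summarized locally so that only compact aggregates and anomalies travel upstream to the data center or cloud:

    import statistics

    def edge_summarize(readings, batch_size=10, anomaly_threshold=3.0):
        """Illustrative edge-side preprocessing: forward compact summaries
        (and outliers) upstream instead of every raw reading.
        Parameter names and thresholds here are assumptions."""
        for i in range(0, len(readings), batch_size):
            batch = readings[i:i + batch_size]
            mean = statistics.mean(batch)
            stdev = statistics.pstdev(batch)
            outliers = [r for r in batch
                        if stdev and abs(r - mean) / stdev > anomaly_threshold]
            # Only the summary and any anomalies cross the network.
            yield {"count": len(batch), "mean": mean, "outliers": outliers}

    # Usage: 100 raw readings become 10 small summary records.
    for summary in edge_summarize([20.0 + (i % 7) * 0.1 for i in range(100)]):
        print(summary)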

Log-based “streaming” messaging systems are becoming ubiquitous.

The market for message-oriented middleware (MOM) has expanded to include a relatively new kind of message broker that is particularly well suited to support streaming event data. We are referring, of course, to the log-based messaging systems, sometimes called streaming messaging systems, such as Amazon Kinesis, Apache Kafka, Apache Pulsar, Google Cloud Pub/Sub, Microsoft Azure Event Hubs, and the many other products that are based on Kafka or Pulsar. For example, Gartner Inc. is tracking Confluent and more than 30 other companies that offer either on-premises Kafka, cloud-based Kafka as a service, or both. Most large messaging vendors offer both log-based messaging products and queue-based messaging products to serve different kinds of messaging situations. In a related trend, some vendors now offer multi-model messaging products that provide aspects of both log-based and queue-based messaging. Examples include Solace, Synadia Communications (NATS), TIBCO, VMware Tanzu (RabbitMQ), and Pulsar suppliers such as DataStax and StreamNative.
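
The architectural difference from a conventional queue can be shown in a toy sketch (plain Python, not any vendor’s API): a log-based system appends records to a durable, ordered log, and each consumer group tracks its own read offset, so multiple consumers can read and replay the same stream independently, whereas a queue typically removes each message once one consumer takes it:

    class ToyLog:
        """Toy append-only log in the spirit of Kafka-style topics.
        Records are retained; each consumer group keeps its own offset."""
        def __init__(self):
            self._records = []
            self._offsets = {}            # consumer_group -> next offset

        def append(self, record):
            self._records.append(record)  # records are not removed on read

        def poll(self, group, max_records=10):
            start = self._offsets.get(group, 0)
            batch = self._records[start:start + max_records]
            self._offsets[group] = start + len(batch)
            return batch

    log = ToyLog()
    for i in range(3):
        log.append(f"event-{i}")

    # Two independent consumer groups each see the full stream.
    print(log.poll("analytics"))   # ['event-0', 'event-1', 'event-2']
    print(log.poll("audit"))       # ['event-0', 'event-1', 'event-2']

That retained, replayable log is what makes these systems a natural transport for event streams: late-added consumers and reprocessing jobs can start from any offset.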

Open source and open core ESP platforms are proliferating.

Early ESP platforms were closed-source products, and some of today’s leading products are still closed source, including Microsoft Azure Stream Analytics, Oracle GoldenGate Stream Analytics, SAS Event Stream Processing, TIBCO Streaming, and others. However, open-source ESP platforms and open-core ESP platforms (a combination of community open source and proprietary vendor extensions) have taken a major role in the market. For example, Flink, Kafka Streams and Spark Streaming are each available as free open-source software, but each is also the basis of products from multiple vendors.

Organizations are starting to use more-advanced practices for event processing.

As users and vendors have gained experience with streaming event data, they have come to see the need for better management of message metadata. Most of the early messaging products did not even include native facilities to define message schemas. Many of the newer messaging products have schema registries that provide standard ways for developers to document message formats, easing development and facilitating version control. Beyond schema registries, the next generation of metadata management tools, such as event portals and stream or event catalogs, is emerging on the market. These may include data-catalog-style information, such as domain names, label names, descriptions, and attributes. Some are integrated with other general-purpose data catalogs, and others are adding connections to API registries. A related set of features helps track lineage: the flow of messages between applications, e.g., which kind of message (or “topic”) led to the subsequent production of which other kind of message.
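
As a rough illustration of what these facilities manage (the class and method names below are invented, not any product’s API), a stream catalog pairs versioned schemas per topic with topic-to-topic lineage edges:

    class ToyStreamCatalog:
        """Toy catalog: versioned schemas per topic plus lineage edges.
        Illustrative only; real schema registries and event portals add
        compatibility checks, access control, search, and more."""
        def __init__(self):
            self.schemas = {}   # topic -> list of schema versions
            self.lineage = []   # (source_topic, derived_topic) edges

        def register_schema(self, topic, schema):
            versions = self.schemas.setdefault(topic, [])
            versions.append(schema)
            return len(versions)           # version number, 1-based

        def record_lineage(self, source_topic, derived_topic):
            self.lineage.append((source_topic, derived_topic))

        def upstream_of(self, topic):
            return [src for src, dst in self.lineage if dst == topic]

    catalog = ToyStreamCatalog()
    catalog.register_schema("orders", {"order_id": "string", "amount": "double"})
    catalog.record_lineage("orders", "order-alerts")
    print(catalog.upstream_of("order-alerts"))   # ['orders']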

More kinds of products are adding support for streaming data.

Stream analytics has long been associated with ESP platforms, which remain the best tools for demanding, high-volume applications that perform low-latency CEP (e.g., multistage computation on events in moving time windows). However, ESP platforms are not well suited to every streaming application; workloads with different requirements are often better addressed by other kinds of tools that also support streaming event data (including aspects of CEP):

  • Analytics and BI (ABI) platforms, including Microsoft Power BI, Tableau and TIBCO Spotfire, have been enhanced to handle streaming data. Some ABI platforms can process streaming data directly (“in motion”), while others support streams by landing them in a database and then processing them with minimal delay (“data at rest”, although still in near-real-time).
  • Several different kinds of data stores also support streaming data (at rest) well. These include in-memory data grids (IMDGs); in-memory caching-style DBMSs; time series DBMSs; search-oriented data stores (often used for log data); and multi-model DBMSs that support SQL and NoSQL data models.
  • Most data integration products that historically focused on batch ETL pipelines have added support for near-real-time streaming ETL, including changed-data capture (CDC), real-time replication, and streaming data transformation and ingestion (a minimal streaming-ETL sketch follows this list). These compete with a newer generation of “big data” stream data integration products designed specifically for continuous stream processing (while also remaining capable of batch data integration).
  • Finally, we have seen the emergence of “unified real-time platforms” that combine powerful ESP/CEP capabilities with data management and analytics/business-logic-hosting capabilities. These can be used for real-time stream analytics and, in some cases, generalized real-time application processing.
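
Here is the streaming-ETL sketch promised in the data integration bullet above: change events are captured, transformed record by record, and loaded continuously rather than in periodic batches. It is plain Python with invented field names; a real pipeline would read a DBMS transaction log or a message topic:

    def capture_changes(change_log):
        """Stand-in for changed-data capture (CDC): yield change events
        as they appear, rather than extracting a periodic batch."""
        for change in change_log:
            yield change

    def transform(changes):
        """Streaming transformation: normalize each record in flight."""
        for op, row in changes:
            if op in ("insert", "update"):
                yield {"id": row["id"], "email": row["email"].strip().lower()}

    def load(records, sink):
        """Continuous ingestion into the target store (a list, for the demo)."""
        for record in records:
            sink.append(record)

    # Illustrative change log; field names and operations are assumptions.
    change_log = [
        ("insert", {"id": 1, "email": " Alice@Example.COM "}),
        ("update", {"id": 1, "email": "alice@example.com"}),
        ("delete", {"id": 2, "email": ""}),
    ]
    warehouse = []
    load(transform(capture_changes(change_log)), warehouse)
    print(warehouse)   # two normalized records; the delete is filtered out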
