Event Stream Processing Helps AI and Vice Versa, Part 2
W. Roy Schulte
This is Part 2 of a two-part series. Part 1 provides context, defines terms, and explains how AI helps Event Stream Processing (ESP). Part 2 explains how ESP helps AI by enabling streaming data pipelines.
ESP and AI software tools are complementary. To be sure, ESP is often implemented without AI tools, and AI and other analytics are often implemented without ESP tools. However, using them together provides the best available solution for a growing number of business problems because of the high information value and ubiquity of streaming data.
This synergy operates in three ways:
- AI and other analytical software components can be executed within online production ESP applications to make smart operational decisions (see Part 1).
- Generative (Gen) AI-based copilots can be used at build time to help develop and test ESP applications (see Part 1).
- ESP can be used to ingest and prepare streaming data that helps engineers design, build, and train AI and other analytical solutions (see below).
Key takeaways
- There is an important difference between
  - offline activities, when AI and analytics solutions are built and long-term decisions are made using historical data, and
  - online activities, when AI and analytics are executed to make real-time and near-real-time, repeatable, operational decisions using current data.

  ESP plays a role in both kinds of activities.
- In most cases, streaming data should be ingested and processed through continuous, near-real-time, streaming data engineering pipelines fed by messaging subsystems such as Kafka. In a few cases, streaming data is landed and processed through batch pipelines but this has several drawbacks.
- The tools and techniques that are used for AI and other analytics on streaming data are partly different from the conventional tools and techniques for AI and analytics on non-streaming data.
- Incoming event streams are filtered and abstracted to generate complex events. Complex events present the information in a form that is more effective for analytics (data scientists call this abstraction process “feature generation”).
ESP Helps AI/Analytics at Build Time
It’s probably obvious, but almost all AI and other analytical models are built directly or indirectly from data:
- All machine learning (ML) systems, including Gen AI, use historical data directly for training purposes, i.e., for identifying correlations, patterns, and other relationships in the input data.
- By contrast, symbolic-reasoning AI/analytical models, such as those based on rules or optimization algorithms, often don’t require historical data that is specific to the problem at hand. Analysts, management scientists, or other experts can design explicit symbolic models using first principles, personal experience, tribal knowledge, regulations, or other information. Nevertheless, historical data often plays a major indirect role in building symbolic models because experts draw on insights and analyses (sometimes including statistics and other ML results) derived from data lakes, data warehouses, business intelligence (BI) reports, and other information sources.
When raw data (streaming or not) is collected, it usually is not in a form that can be used immediately to design or train analytical models, so it must be processed through data engineering pipelines. Organizations may turn petabytes of raw input data each day into a few terabytes of more-relevant and more-valuable analytical data, i.e., features and other curated data that are used to build analytical models.
Conventional Pipelines
Conventional data engineering (“data prep”) pipelines are often described using the three-stage Medallion Architecture (see path “a” in Figure 1 below).
- Raw data is extracted and landed in a “bronze” zone (staging area) which may be a low-cost cloud object store (e.g., data lake).
- Next it is filtered, cleaned, de-duplicated, validated, and enriched to become “silver” data. This may involve things like data format conversion (to/from Avro, CSV, JSON, Parquet, Protobuf, ORC, XML, etc.) and data type conversion (integer, float, string, etc.), syntactic and semantic transformation, and enrichment (augmentation, e.g. via lookups).
- Finally, it is further transformed via feature generation (robust forms of abstraction) and reorganized into analytically-appropriate data structures to become “gold” data that is input to various AI/analytics model-building tools or used as semantic models for BI reporting purposes. This may involve multiple steps such as grouping (e.g., by location, time windows, or organizational unit), joins, aggregation (called “roll ups” in the BI world, such as sums, counts, or averages), pattern matching, and other calculations. If the data is unstructured, data engineering may involve classifying images or parsing, chunking, and vectorizing text data for Gen AI large language models (LLMs).
Figure 1. Data Engineering Pipelines

The bronze, silver, and gold data stores in a data engineering pipeline may all be part of a lakehouse cloud service or some other implementation of a cloud, on-premises, or hybrid virtual data warehouse. Pipelines for non-streaming data are now typically implemented with a batch extract/load/transform (ELT) approach, where “extract and load” applies to the bronze zone and most transformation is applied in the subsequent silver and gold zones. However, traditional batch ETL pipelines, where some transformations are applied before loading the data, are still widely used. (Note: batch pipelines (“a”) can also be implemented using certain ESP platforms, although this is rare; see Appendix A, “Processing Batch Data as a Stream.”)
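To make the three stages more concrete, the sketch below shows a minimal batch medallion pipeline (path “a”) written in PySpark. The bucket paths and column names are illustrative assumptions rather than part of any particular product, and a production pipeline would add schema enforcement, error handling, and scheduling.

```python
# Minimal batch medallion sketch (path "a") in PySpark.
# Bucket paths and column names (order_id, amount, order_ts, store_id)
# are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch_medallion_sketch").getOrCreate()

# Bronze: raw files landed as-is in a low-cost object store (e.g., a data lake).
bronze = spark.read.json("s3://example-bucket/bronze/orders/")

# Silver: de-duplicate, validate, and convert data types.
silver = (
    bronze.dropDuplicates(["order_id"])
          .filter(F.col("amount").isNotNull() & (F.col("amount") > 0))
          .withColumn("amount", F.col("amount").cast("double"))
          .withColumn("order_ts", F.to_timestamp("order_ts"))
)
silver.write.mode("overwrite").parquet("s3://example-bucket/silver/orders/")

# Gold: group by time window and aggregate ("roll up") into analytics-ready features.
gold = (
    silver.groupBy(F.window("order_ts", "1 hour"), "store_id")
          .agg(F.sum("amount").alias("hourly_sales"),
               F.count("*").alias("order_count"))
)
gold.write.mode("overwrite").parquet("s3://example-bucket/gold/hourly_sales/")
```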
Streaming Data Pipelines
Streaming data needs most of the same kinds of data engineering as conventional, non-streaming data, but there are some differences. Like conventional pipelines, streaming pipelines may do data format and data type conversion, filtering, syntactic and semantic transformation, enrichment, and so forth. Unstructured streaming data may still need classifying or parsing, chunking, and vectorizing for Gen AI. However, differences arise because of the continuous nature of streaming data and the kind of feature engineering, i.e., complex event processing (CEP), that is applied.
Three alternative ways to process streaming data (“b”, “c”, and “d”) are shown in Figure 1. A messaging subsystem (e.g., Apache Kafka, Confluent Cloud, or other messaging middleware) may essentially serve as the raw (“bronze”) data store for all of these paths. Event streams are held in topics or other message queues and can be replayed whenever desired.
Streaming Data in Batch Pipeline (“b”)
Some developers default to treating streaming data mostly like non-streaming data. They use a database connector or user application to land raw streaming data from a messaging subsystem or CDC connector into an object store, other file, or DBMS and then process it later. Developers can use their familiar batch data engineering procedures and tools. However, this approach introduces latency because the batch pipeline may take anywhere from a couple of hours to a couple of weeks to complete, depending on how it is set up. Moreover, developers may have a hard time computing some kinds of features/complex events, especially temporal or spatial patterns, from the raw data because traditional analytics tools in the later stages of a batch pipeline generally are not well suited for some CEP (see “d” below).
Basic Streaming Pipeline (“c”)
A better, and more-common, approach is to continuously process streaming data as it appears (path “c”). Basic streaming tools such as Kafka Connectors, NIFI, Benthos, and others can be used to execute filtering, validation, range checks, and stateless transforms such as format and data type conversions, column reordering, masking, some kinds of enrichment (lookups), and simple semantic transformations based on arbitrary expressions or code. The resulting data is roughly on the level of a silver data store. This level of refinement is already enough for some simple reporting or analytical purposes, in which case the output data goes straight to a tool or an analytical data store. However, many analytic applications need the data to be further abstracted, i.e., they need stateful feature generation/CEP to prepare it for analytics. Basic streaming tools are not well suited to accomplish this, so another processing step by an ESP tool is required to generate “gold” level features/complex events. Sometimes this is done in a batch mode (path “c”) but a better solution is to use a full streaming pipeline (“d”).
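To illustrate the kind of stateless work a basic streaming pipeline (“c”) performs, here is a minimal sketch that reads raw events from one Kafka topic, drops malformed or out-of-range readings, converts data types, masks a sensitive field, and republishes the result as “silver”-level records. It assumes the open-source kafka-python client, and the topic names, field names, and valid range are invented for the example.

```python
# Stateless "silver" step (path "c"): filter, convert, mask, republish.
# Topic names, field names, and the valid range are illustrative assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

consumer = KafkaConsumer(
    "raw-sensor-readings",                                  # hypothetical bronze topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for msg in consumer:
    event = msg.value
    try:
        temp = float(event["temperature_c"])                # data type conversion
    except (KeyError, TypeError, ValueError):
        continue                                            # drop malformed records
    if not -40.0 <= temp <= 125.0:                          # range check: drop spurious readings
        continue
    silver_event = {
        "device_id": str(event.get("device_id", "unknown")),
        "temperature_c": temp,
        "site": str(event.get("site", "")).upper(),         # simple semantic normalization
        "operator": "***masked***",                         # mask a sensitive field
    }
    producer.send("silver-sensor-readings", silver_event)   # hypothetical silver topic
```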
Full Streaming Data Engineering Pipeline (“d”)
An ESP platform, Data Streaming Platform (DSP), stream data integration (SDI) tool, streaming DBMS or other stateful, ESP-capable software product is the heart of a full streaming data engineering pipeline (see Appendix B in Part 1 of this series for more explanation of these products).
New data arrives continuously (streams are unbounded) so ESP software must demarcate the beginning and end boundaries of moving time windows. ESP software may re-order records if they arrive out of order (based on timestamps). In some cases, it will recompute results to reflect late-arriving data that belongs in a previous time window. ESP software may perform aggregation, joins, and spatial or temporal pattern matching based on the order of events, the absence of events, and relative timing (how much time elapsed between events).
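The following dependency-free Python sketch shows, in simplified form, what such stateful windowing involves: events are assigned to one-minute tumbling windows by event-time timestamp, a bounded amount of out-of-order arrival is tolerated, and an aggregate is emitted once the watermark passes the end of a window. The window size, lateness allowance, and fields are assumptions; ESP platforms such as Flink provide this machinery with far richer semantics (late-data recomputation, sliding and session windows, fault-tolerant state, and so on).

```python
# Plain-Python sketch of event-time tumbling windows with bounded out-of-orderness.
# Window size, allowed lateness, and the aggregate are illustrative assumptions.
from collections import defaultdict

WINDOW_SECONDS = 60          # one-minute tumbling windows
ALLOWED_LATENESS = 10        # tolerate events up to 10 seconds out of order

windows = defaultdict(list)  # window start time -> values collected so far
watermark = 0.0              # highest event time seen minus allowed lateness

def on_event(event_time: float, value: float, emit) -> None:
    """Add one event; emit (window_start, count, average) for completed windows."""
    global watermark
    window_start = int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS
    if event_time >= watermark:                  # very late data would be dropped (or rerouted)
        windows[window_start].append(value)
    watermark = max(watermark, event_time - ALLOWED_LATENESS)
    for start in sorted(windows):
        if start + WINDOW_SECONDS <= watermark:  # window complete: aggregate and emit
            values = windows.pop(start)
            emit((start, len(values), sum(values) / len(values)))
        else:
            break

# Two in-order events, one out-of-order event, then an event that advances the
# watermark far enough to close (and emit) the first window.
for t, v in [(10, 20.0), (25, 21.0), (18, 19.5), (140, 22.0)]:
    on_event(t, v, emit=print)
```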
Streaming pipelines perform different operations depending on the business problem. For example, if the incoming data is text, the ESP software may compute chunks and then invoke an API on an external Gen AI embedding model (e.g., from OpenAI or Voyage AI) to convert the data to vectors. A continuously-updated vector database is especially relevant for custom LLMs and other systems that manage a specific business’s private data. (Detail: NIFI can also enable vector embeddings and perform some stateful operations, although it is more limited than Flink and other full-blown ESP platforms.)
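As a sketch of that text-to-vector path, the code below chunks an incoming document and hands the chunks to an embedding step before upserting the vectors into a vector store. The chunking parameters are arbitrary, and embed_chunks and vector_store.upsert are hypothetical stand-ins for whichever embedding model (e.g., from OpenAI or Voyage AI) and vector database a given pipeline actually uses.

```python
# Sketch of a streaming text-to-vector step: chunk, embed, upsert.
# Chunk sizes are arbitrary; embed_chunks() and vector_store are hypothetical
# stand-ins for a real embedding API client and vector database.
from typing import List

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    """Split a document into overlapping character chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed_chunks(chunks: List[str]) -> List[List[float]]:
    """Hypothetical call to an external embedding model (e.g., OpenAI or Voyage AI)."""
    raise NotImplementedError("Replace with a real embedding API call.")

def process_document_event(doc_id: str, text: str, vector_store) -> None:
    """Handle one streaming document event: chunk, embed, and upsert the vectors."""
    chunks = chunk_text(text)
    vectors = embed_chunks(chunks)
    for i, (chunk, vector) in enumerate(zip(chunks, vectors)):
        vector_store.upsert(id=f"{doc_id}-{i}", vector=vector, metadata={"text": chunk})
```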
ESP output is at a “gold” level, ready for AI/analytics. It may be put into a feature store or other virtual data warehouse data table, or it may be made available to the AI/analytics tool through a messaging subsystem or API.
Advantages of Streaming Pipelines
Streaming data engineering pipelines (especially “d”, but “c” to a lesser extent) are almost always preferable to batch pipelines (“b”) for several reasons.
Streaming pipelines:
- Reduce storage overhead. Streaming data is typically far more voluminous than non-streaming data. It often contains irrelevant or redundant data, such as market data where prices don’t change often or device data where temperature readings don’t change for many readings over a long period of time. It can be quite expensive or even entirely impractical to store all of the raw data in a data lake. It is better to filter and condense streaming data immediately so that low-quality raw data is never stored in a database or file. Basic streaming pipelines (“c”) can do some of this because they can transform and enrich incoming data, sample the data, and filter out clearly bad data. For example, IoT data tends to contain spurious, out-of-range readings that should be ignored because of errors in sensors or in communication networks. Full streaming pipelines (“d”) can do much more than basic pipelines (“c”) to reduce the volume of data because they may turn thousands or hundreds of thousands of incoming raw events into a handful of higher-level complex events.
- Reduce network overhead. In highly-distributed topologies, such as fleets of many physical IoT devices, edge ESP filters and preprocesses data locally, sending only relevant events, periodic samples, or deltas (changes to data) to centralized systems for further analysis. A minimum amount of data is transmitted over wide area networks to regional or central collection points. Tools such as MiNiFi are designed specifically to do a distributed form of “c.” Distributing full ESP (“d”) can reduce network overhead even further.
- Reduce processing overhead. Continuous stream processing can reduce the number of times the same data is read and recalculated. In batch data engineering scenarios (“a” and “b”), the system will generally do a full table scan (including updates and deletes) and then rewrite the data after augmenting it with SQL, PySpark, or some other logic to produce a silver table. Then it will do another full table scan to turn the silver table into a gold table by joining tables, creating temporary tables, and rewriting the result. By contrast, a continuous streaming pipeline (“d”) can often handle all of this logic incrementally, immediately, and in memory (see the sketch after this list).
- Simplify pipeline architecture. All of the steps in a streaming pipeline (“d”) from ingestion to feature generation run on one platform with one failure recovery mechanism which is easier to manage. By contrast, a batch pipeline (“a” or “b”) consists of several independent, periodically-scheduled components; failure of one component disrupts subsequent steps of the pipeline.
- Reduce latency. Streaming data is most valuable when it is fresh and reflects current conditions. Freshness is especially important for run-time, near-real-time production decisions (see Repeatable Online Operational Decisions section below). However, it is sometimes a factor even offline (at build time) if conditions are changing rapidly so the analytical models need to be re-trained frequently.
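The processing-overhead point above can be illustrated with a small sketch: a batch job rereads and re-aggregates the whole table on every run, whereas a streaming operator keeps compact in-memory state and updates it once per event. The metric (a running average per device) and the event fields are illustrative assumptions.

```python
# Sketch contrasting batch recomputation with incremental streaming aggregation.
# The metric (running average per device) and the event fields are assumptions.

def batch_average(all_rows):
    """Batch style ("a"/"b"): rescan and re-aggregate every stored row on each run."""
    totals, counts = {}, {}
    for device_id, value in all_rows:
        totals[device_id] = totals.get(device_id, 0.0) + value
        counts[device_id] = counts.get(device_id, 0) + 1
    return {d: totals[d] / counts[d] for d in totals}

class RunningAverage:
    """Streaming style ("d"): small in-memory state, updated once per event."""
    def __init__(self):
        self.totals, self.counts = {}, {}

    def update(self, device_id, value):
        self.totals[device_id] = self.totals.get(device_id, 0.0) + value
        self.counts[device_id] = self.counts.get(device_id, 0) + 1
        return self.totals[device_id] / self.counts[device_id]

agg = RunningAverage()
for device_id, value in [("dev-1", 20.0), ("dev-2", 18.5), ("dev-1", 21.0)]:
    print(device_id, agg.update(device_id, value))  # no rescan of historical data
```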
Building AI/analytics Models
Once data has been processed and stored in an analytical (“gold” level) data store, AI engineers, data scientists, data engineers, process modelers, BI analysts, or other builders create analytical models that are appropriate for the decision at hand. A full explanation of building analytical models is outside the scope of this article. However, we will note that the process of building a model varies considerably depending on whether the decision algorithm will be based on business rules, statistical data science, LLMs or other neural networks, optimization (e.g., Simplex), or some other AI/analytics technique or combination of techniques.
Modeling is a time-consuming, typically iterative process. For example, data scientists may run many experiments to identify the most significant features, choose the most accurate statistical algorithm, calculate coefficients, and perform hyperparameter tuning to build ML models. Or process modelers may develop and test sets of business rules using historical data to verify that the rules make the desired decisions. Batch or streaming data engineering pipelines will usually be modified multiple times to try new iterations of feature engineering (and complex events) and algorithms (See Figure 2).
Figure 2. Building AI/analytics Models is an Iterative Process

Using AI/analytics Models
The result of build-time efforts is an AI/analytical model that is either used to make major offline business decisions or is deployed into a production application to make (near) real-time, repeatable operational decisions.
Offline Business Decisions
AI/analytics models can be used to help make offline business decisions, such as one-time strategic or tactical decisions, or major operational decisions that are not very time-sensitive. Strategic decisions include corporate decisions with long-term and widespread impact, such as acquiring another company, entering a new geographic market, or revising overall product strategy. Tactical decisions have a narrower scope but generally still impact multiple aspects of a company and apply for extended periods of time. Examples of tactical decisions include things like changing price lists or customer service or lending policies.
Strategic decisions and almost all tactical decisions cannot be fully automated and may take hours, days, weeks, or longer to make. They generally require discussion among many people, multiple approval levels, and several iterations of data collection, research, and modifications to analytical models (e.g., “what if” analysis). The relevant analytical models serve as decision support (or “decision augmentation”) to help the people who make the final decision. The point is that these analytics are customized for a particular decision and are not put into an online operational production application.
Note that even streaming data, if it is historical (i.e., data from past days, weeks or months), is relevant for some one-time strategic and tactical decisions. For example, records of product sales (which begin as streams of purchase records) affect product strategy and tactical pricing decisions. They also may play a role in strategic decisions to acquire or divest other companies. Copies of financial transactions and the associated gains and losses affect tactical fraud mitigation policies and business rules. Analytics on streaming telemetry data from sensors on machines may lead managers to make a decision to acquire a new machine (a decision that is not automated in an application).
Repeatable Online Operational Decisions
The other way to use an AI/analytical model is to deploy an executable version of the model into a production application to make (near) real-time, repeatable operational decisions on streaming data as it arrives (See lower half of Figure 3 below). The decisions may be automated in business applications or they may be made by people using real-time operational dashboards. For more explanation and examples of this, see Part 1 of this series, “Event Stream Processing Helps AI and Vice Versa, Part 1.”
Figure 3. AI/analytics Deployed into Online Operational Systems

Sharing Event Streams Between Online Operations and Offline Analytics
Some (complex) business events that are generated from real-time streaming sources during operations can be used by multiple online production applications (see Figure 4). Moreover, some of the same real-time business events may also be stored into analytical databases for offline purposes (see “e” in Figure 4) because they contain information that is relevant to management dashboards, BI reports, or the design and training of DS/ML models, rules, neural networks or other AI/analytics models. For example, Confluent’s Tableflow utility makes this (“e”) easier to implement by loading Kafka records into Iceberg analytical data stores.
Figure 4. Re-using Business Events for Both Operations and Analytics

Publishing re-usable business events on a corporate backbone (e.g., using Kafka or a Kafka-like messaging subsystem) is a key aspect of the enterprise nervous system concept. Enterprise nervous systems maximize situation awareness by sharing near-real-time information across geographic, departmental, and system boundaries for operational and/or offline analytical purposes.
Conclusion
A large and growing percentage of enterprise data is streaming data. AI engineers, data scientists, and data engineers should use ESP tools to prepare streaming data for use in designing and building AI and other analytical solutions. Some AI/analytics models built on streaming data are used for offline strategic, tactical, and not-time-sensitive operational decisions. Other AI/analytics models are deployed into production applications to make (near) real-time operational decisions on streaming data as it arrives. Architects and engineers need to understand how ESP, AI, and other analytics tools complement each other to enable more-effective business decisions.
Appendix A. Processing Batch Data as a Stream
Some ESP platforms that are typically used for stream processing can also be used to implement batch data engineering pipelines by invoking them periodically rather than running them continuously. This is basically an alternative implementation of path “a” in Figure 1. A clock or a person manually starts the system to read files, databases, or event streams that were stored in messaging topic stores.
As has been noted elsewhere, batch processing can be seen as a special case of streaming. However, the architectures of traditional streaming software were impractical for high-volume batch work. Some modern streaming systems, such as Flink and Google Cloud Dataflow, have added optimizations that make them practical for both batch and streaming.
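As a minimal illustration of that idea, the sketch below replays a bounded, stored dataset through the same kind of per-event logic a continuous pipeline would run; the file path and column names are hypothetical.

```python
# Sketch: replaying a bounded, stored dataset through streaming-style logic.
# The file path and column names are hypothetical; process_event() stands in
# for whatever operator logic the continuous pipeline normally runs.
import csv

def process_event(event: dict) -> None:
    """Stand-in for the per-event logic of the streaming pipeline."""
    print(event["device_id"], event["value"])

def replay_file(path: str) -> None:
    """Drive the streaming logic with a finite dataset, e.g., on a nightly schedule."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            process_event(row)

# replay_file("/data/archived_readings.csv")   # started by a clock or manually
```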
If you have comments or corrections for this article, please contact me at urpfeedback@gmail.com.