There has been an explosion of innovation in open source stream processing over the past few years. Frameworks such as Apache Spark and Apache Storm give developers stream abstractions on which they can develop applications; Apache Beam provides an API abstraction, enabling developers to write code independent of the underlying framework, while tools such as Apache NiFi and StreamSets Data Collector provide a user interface abstraction, allowing data engineers to define data flows from high-level building blocks with little or no coding.
The easiest way to explain stream processing is in relation to its predecessor, batch processing. Much data processing in the past was oriented around processing regular, predictable batches of data – the nightly job that, during “quiet” time, would process the previous day’s transactions; the monthly report that provided summary statistics for dashboards, etc. …
As businesses demanded more timely information, batches grew smaller and were processed more frequently. As the batch size tended towards a single record, stream processing emerged. In the stream processing model, events are processed as they occur. This more dynamic model brings with it more complexity.
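To make the contrast concrete, here is a minimal sketch in Python of the same calculation done both ways. The function names and the sample transaction amounts are illustrative assumptions, not taken from any of the frameworks mentioned above.

```python
from typing import Iterable, Iterator, List


def process_batch(transactions: List[float]) -> float:
    """Batch model: wait until the whole batch is available, then compute once."""
    return sum(transactions)


def process_stream(transactions: Iterable[float]) -> Iterator[float]:
    """Stream model: update and emit the result as each event arrives."""
    running_total = 0.0
    for amount in transactions:
        running_total += amount
        yield running_total


# Hypothetical transaction amounts, used only for illustration.
events = [10.0, 5.0, 20.0]

batch_total = process_batch(events)          # one answer, after the fact
stream_totals = list(process_stream(events)) # an updated answer per event
```

The batch version yields a single figure once all the data is in; the streaming version has an up-to-date figure after every event, which is the timeliness businesses were asking for.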
Often, stream processing is unpredictable, with events arriving in bursts, so the system has to be able to apply back-pressure, buffer events for processing, or, better yet, scale dynamically to meet the load. More complex scenarios require dealing with out-of-order events, heterogeneous event streams, and duplicated or missing event data. …
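A rough sketch of two of these concerns, in Python using only the standard library: a bounded buffer that refuses new events when full (a simple form of back-pressure), and a helper that drops duplicates and restores timestamp order. The `Event` fields, the `BoundedBuffer` class and the capacity of 3 are assumptions made for illustration; real frameworks handle this with far more sophistication, for example by bounding how late an event may arrive.

```python
import queue
from typing import List, NamedTuple


class Event(NamedTuple):
    event_id: str     # hypothetical unique identifier, used for de-duplication
    timestamp: float  # hypothetical event time, in seconds
    payload: str


class BoundedBuffer:
    """Fixed-capacity buffer; a rejected offer is the back-pressure signal."""

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[Event]" = queue.Queue(maxsize=capacity)

    def offer(self, event: Event) -> bool:
        """Return False when the buffer is full so the producer can slow down."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False

    def drain(self) -> List[Event]:
        """Remove and return everything currently buffered."""
        drained: List[Event] = []
        while True:
            try:
                drained.append(self._queue.get_nowait())
            except queue.Empty:
                return drained


def dedupe_and_order(events: List[Event]) -> List[Event]:
    """Drop repeated event IDs, then sort the survivors by event time."""
    seen = set()
    unique = []
    for event in events:
        if event.event_id not in seen:
            seen.add(event.event_id)
            unique.append(event)
    return sorted(unique, key=lambda e: e.timestamp)


# A burst of four events arrives out of order, with one duplicate.
buffer = BoundedBuffer(capacity=3)
incoming = [
    Event("b", 2.0, "second"),
    Event("a", 1.0, "first"),
    Event("b", 2.0, "second"),  # duplicate delivery
    Event("c", 3.0, "third"),   # arrives when the buffer is already full
]
accepted = [buffer.offer(e) for e in incoming]
ordered = dedupe_and_order(buffer.drain())
```

Here the fourth event is rejected because the buffer is full; what the producer does next (retry, slow down, or trigger scaling) is a design decision this sketch leaves open.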
Fast-forward to today and Flink and Spark Streaming are just two examples of streaming frameworks. Streaming frameworks allow developers to build applications to address near real-time analytical use cases such as complex event processing (CEP). CEP combines data from multiple sources to identify patterns and complex relationships across various events. One example of CEP is analyzing parameters from a set of medical monitors such as temperature, heart rate and respiratory rate, across a sliding time window to identify critical conditions, such as a patient going into shock.
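The medical monitor example might look something like the following plain-Python sketch of a sliding time window. Everything specific here is an assumption made for illustration: the `Reading` fields, the 60-second window, the rule that two abnormal readings trigger an alert, and above all the thresholds, which are placeholders and not clinical guidance. The sketch also assumes readings arrive in timestamp order; it does not use the Flink or Spark Streaming APIs.

```python
from collections import deque
from typing import Deque, NamedTuple


class Reading(NamedTuple):
    timestamp: float       # seconds since monitoring started
    temperature_c: float
    heart_rate: int        # beats per minute
    respiratory_rate: int  # breaths per minute


class SlidingWindowDetector:
    """Flags a condition when enough abnormal readings fall inside the window."""

    def __init__(self, window_seconds: float, min_abnormal: int) -> None:
        self.window_seconds = window_seconds
        self.min_abnormal = min_abnormal
        self._window: Deque[Reading] = deque()

    @staticmethod
    def _is_abnormal(reading: Reading) -> bool:
        # Placeholder thresholds for illustration only, not medical advice.
        abnormal_temperature = (
            reading.temperature_c > 38.5 or reading.temperature_c < 36.0
        )
        return (
            reading.heart_rate > 120
            and reading.respiratory_rate > 24
            and abnormal_temperature
        )

    def observe(self, reading: Reading) -> bool:
        """Add a reading, evict expired ones, and report whether to alert."""
        self._window.append(reading)
        cutoff = reading.timestamp - self.window_seconds
        while self._window and self._window[0].timestamp < cutoff:
            self._window.popleft()
        abnormal = sum(1 for r in self._window if self._is_abnormal(r))
        return abnormal >= self.min_abnormal


detector = SlidingWindowDetector(window_seconds=60, min_abnormal=2)
readings = [
    Reading(0, 37.0, 80, 16),     # normal
    Reading(20, 39.0, 130, 28),   # first abnormal reading
    Reading(40, 39.2, 135, 30),   # second abnormal reading within 60 seconds
    Reading(130, 37.1, 85, 18),   # earlier readings have left the window
]
alerts = [detector.observe(r) for r in readings]
```

The alert fires only on the third reading, when two abnormal readings sit inside the same window, and clears once those readings age out.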
DCL: A good quick overview of the current state of stream processing systems, easy to read.