Keynote | Technical

Towards a Unified API for Spark and the IIoT

Thursday 16th | 15:10 - 15:40 | Theatre 18

One-liner summary:

Structured Streaming is a game changer for Apache Spark because it offers a unified API for both batch and real-time processing. Moreover, its support for “event time” and watermarking simplifies its deployment in IIoT-related projects. In this workshop, we will get hands-on with Spark's Structured Streaming API, focusing on its advantages for the IIoT domain.

Keywords defining the session:

- Structured Streaming

- Apache Spark

- IIoT


The Industrial Internet of Things (IIoT) is a rapidly growing trend with significant implications for the global economy. It spans industries including manufacturing, mining, agriculture, oil and gas, and utilities. It also encompasses companies that depend on durable physical goods to conduct business, such as organizations that operate hospitals, warehouses and ports or that offer transportation, logistics and healthcare services. Not surprisingly, the IIoT's potential payoff is enormous. One specific example is the predictive maintenance of assets, which saves money compared with scheduled repairs, reducing overall maintenance costs and eliminating breakdowns. For example, Thames Water, one of the largest providers of water in the UK, is using sensors and real-time data to anticipate equipment failures and respond more quickly to critical situations, such as leaks or adverse weather events.

However, analyzing such large quantities of real-time data, which usually arrives out of order from different sensors and systems, is a real challenge with current real-time Big Data analytics frameworks. To help address use cases like the one described above, where real-time processing of IIoT data is needed, the efforts of the Spark community have led to the development of the Structured Streaming API, which provides a unified view of the real-time and batch processing paradigms. This API benefits from the latest performance gains of the Catalyst optimizer presented in Spark 2.0, and its support for event time and for late, out-of-order data using watermarking techniques enables easier development of IIoT projects, where data usually reaches the cloud late and out of order. Moreover, the latest release (Spark 2.2) supports maintaining custom state between the micro-batches of real-time data, enabling more complex use cases.
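
To make the idea of event time and watermarking more concrete, here is a minimal toy model in plain Python. It is not Spark code and does not use any Spark API: the names `tumbling_window_start` and `aggregate_with_watermark` are hypothetical, timestamps are plain integers (seconds), and the watermark is updated after every single event, whereas Spark updates it per micro-batch. The drop rule used here (an event is discarded once the end of its window is at or before the watermark) is an assumption of this sketch. In Spark itself the same intent is declared with `withWatermark` on the event-time column rather than implemented by hand.

```python
from collections import defaultdict


def tumbling_window_start(event_time, window_size):
    """Return the start of the fixed-size window that contains event_time."""
    return event_time - (event_time % window_size)


def aggregate_with_watermark(events, window_size, allowed_lateness):
    """Sum sensor values per event-time window, dropping data that is too late.

    events: iterable of (event_time, value) pairs in ARRIVAL order.
    The watermark is the largest event time seen so far minus allowed_lateness.
    Returns (sums keyed by window start, list of dropped events).
    """
    sums = defaultdict(float)
    dropped = []
    max_event_time = None

    for event_time, value in events:
        # Watermark computed from the events that arrived before this one.
        watermark = None if max_event_time is None else max_event_time - allowed_lateness
        start = tumbling_window_start(event_time, window_size)
        window_end = start + window_size

        if watermark is not None and window_end <= watermark:
            # The window is already considered final: the event is too late.
            dropped.append((event_time, value))
        else:
            # On time, or late but still within the allowed lateness.
            sums[start] += value

        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time

    return dict(sums), dropped


# Simulated sensor readings, listed in the order they reach the cloud.
readings = [(1, 1.0), (12, 2.0), (8, 3.0), (27, 4.0), (9, 5.0), (15, 6.0), (21, 7.0)]
sums, dropped = aggregate_with_watermark(readings, window_size=10, allowed_lateness=5)
print(sums)     # {0: 4.0, 10: 2.0, 20: 11.0}
print(dropped)  # [(9, 5.0), (15, 6.0)]
```

In this run the reading with event time 8 arrives late but is still counted, because the watermark at that point is 7 and its window ends at 10. The readings at 9 and 15 arrive after the watermark has advanced to 22, so their windows are already closed and they are dropped.
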
In this hands-on workshop, we will use Apache Zeppelin, Apache Kafka, Apache Avro, Apache Cassandra and Spark's Structured Streaming API to solve challenges related to IIoT projects, such as handling late, out-of-order data. More specifically, we will go through real code that ingests simulated late, out-of-order data from Kafka, processes it with Spark's Structured Streaming API and saves the results of those computations to Cassandra. Along the way, we will explore the Structured Streaming API: its main features, such as the unified view of real-time and batch analytics and its support for processing late, out-of-order data using watermarking; its benefits for Spark developers; and its current limits for this kind of project. All the workshop code and the Docker-based developer environment will be available on GitHub right after the workshop.