Journeys from Kafka to Parquet

Thursday 15th

14:35 | 15:15

Theatre 19


Keywords defining the session:

- Stream processing

- Flink

- Kafka

Takeaway points of the session:

- Understanding streaming fault tolerance can save us from surprises.

- Stream processing can be much more complicated than a batch solution.


Last year the Measuring 2.0 team started measuring user behavior on the website. Of course, we wanted to make the data widely accessible within the whole organization. We had user behavior data in Kafka, a distributed messaging queue, but Kafka is not a suitable data source for every possible client application (e.g. for generic batch processing systems). Given this, we decided to store the historical data in Parquet files, a columnar format suited for analytics. It seemed like a simple task to sink data from Kafka to Parquet, but we struggled with multiple solutions.

At first, we created a Flink Streaming job. Flink has built-in tools for reading from Kafka and handling out-of-order data, so it seemed like a good fit. However, we struggled a lot trying to get fault tolerance right and finally ended up with a much simpler batch job that reads Kafka and writes the data as files to HDFS on a daily basis. We didn’t want to just dump the data; we had several requirements to fulfill.

Firstly, we wanted a scalable solution. We are talking about 10,000 records/minute, and this number is constantly growing as we measure more things and the number of visitors grows.

Secondly, every record that appears in Kafka must appear exactly once in the files. We don’t want to lose data or have duplicates.

Thirdly, we wanted to have files in event-time windows. Data in Kafka might be out of order, but we want it ordered in the files. If we created the files based on the system time of the processing machines (“processing time”), the file for data between 18:00-18:59 might contain data about a user clicking around at 17:34 or 19:14. We don’t want this.
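The event-time bucketing idea can be sketched in plain Python (the record shape and timestamps here are hypothetical, not our actual schema):

```python
from datetime import datetime, timezone

def hourly_bucket(event_ts_ms: int) -> str:
    """Map a record's event timestamp (ms since epoch, UTC) to its hourly bucket.

    The bucket depends only on when the event happened, never on when the
    processing machine saw it, so late-arriving records still land in the
    right file.
    """
    dt = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d-%H")

# A click at 17:34 UTC belongs to the 17:00-17:59 bucket even if it is
# only processed at 18:05.
bucket = hourly_bucket(1521135240000)  # 2018-03-15 17:34 UTC
# → "2018-03-15-17"
```

Assigning records to buckets by event time is exactly what windowing frameworks do under the hood; the hard part is deciding when a bucket is complete and safe to close.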

Finally, we want to store the data in a columnar format. In contrast to a row-oriented format, where we store the data row by row, a columnar format stores it column by column. Because the data is so rich, most consumers of the data will not need all columns. By using Parquet, most processing systems will read only the columns they need, leading to really efficient I/O. There will be many consumers of the data, so using a more optimized format pays off.
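A toy in-memory illustration of the row-versus-column layout (hypothetical column names; real Parquet adds encoding and compression on top of this idea):

```python
# Row-oriented: each record is stored together, so reading one field
# still touches every record in full.
rows = [
    {"user_id": 1, "page": "/home",   "duration_ms": 120},
    {"user_id": 2, "page": "/search", "duration_ms": 340},
    {"user_id": 3, "page": "/home",   "duration_ms": 95},
]

# Column-oriented: each column is stored together, so a consumer that
# only needs "page" can skip the other columns entirely.
columns = {key: [row[key] for row in rows] for key in rows[0]}

pages = columns["page"]  # only this column's data is read
# → ["/home", "/search", "/home"]
```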

We went through several solutions. At first, we used Flink windows, but Flink (mostly) keeps the records in memory before writing them, and since we just want to read and write the data, using that much memory is unnecessary. Then we tried the so-called “bucketing sink”, which writes records directly and can imitate windowing for our needs without keeping data in memory. Unfortunately, the bucketing sink relied on truncating files in failure scenarios, which is not possible for columnar formats like Parquet, so we tried closing files at every checkpoint, but then we had too many files. In the end we realized we could use a daily batch job to read the Kafka topic and write the data specifically for that day. This solution made fault tolerance much simpler and satisfied all our requirements.
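The core of the daily batch approach can be sketched as a filter over the topic’s records (the `(event_ts_ms, payload)` record shape is a hypothetical simplification):

```python
from datetime import datetime, timezone

def records_for_day(records, day: str):
    """Keep only the records whose event time falls on the given UTC day.

    Rerunning the job for a day is idempotent: it always produces the same
    output, so a failed run can simply be retried and the day's file
    overwritten. That is what makes fault tolerance so much simpler than
    in the streaming variants — no checkpoint or truncation machinery.
    """
    out = []
    for event_ts_ms, payload in records:
        dt = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc)
        if dt.strftime("%Y-%m-%d") == day:
            out.append(payload)
    return out

records = [
    (1521135240000, "click /home"),    # 2018-03-15 17:34 UTC
    (1521226800000, "click /search"),  # 2018-03-16 19:00 UTC, next day
]
todays = records_for_day(records, "2018-03-15")
# → ["click /home"]
```

In the real job the records come from the Kafka topic and the output goes to one Parquet file per day on HDFS, but the fault-tolerance argument is the same: the unit of retry is a whole day.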

To conclude, doing everything in streaming/real-time sounds cool, but most of the time batch processing suffices and it’s simpler. There might be several different solutions to this problem, but why not start with the simplest one and then optimize if needed? However, I cannot say we wasted our time struggling with the more complex solutions. We learned a lot about how Flink windowing and checkpointing work and about the columnar Parquet format. I hope this can serve as a lesson for others too.