← Back to the schedule

Infinite Topic Backlogs with Apache Pulsar

Calendar icon

Thursday 15th

Time icon

10:45 | 11:25

Location icon

Theatre 19


Keywords defining the session:

- Messaging

- Storage


Takeaway points of the session:

- The ability to build infinite backlogs allows your business to extract more value from both your data and from your messaging system.

- You no longer have to settle for just PubSub from your messaging system. With Pulsar, the messaging system can be much more.


Messaging systems are an essential part of any real time analytics engine. A common pattern is to feed a user event stream into a processing engine, show the result to the user, capture feedback from the user, push the feedback back into the event stream, and so on. The quality of the result shown to the user is often a function of the amount of data in the event stream, so the more your event stream scales, the better you can serve your users.

Messaging systems have recently started to push into the field of long term data storage and event stores, where you cannot compromise on retention. If data is written to the system, it must stay there.

Infinite retention can be challenging for a messaging system. As data grows for a single topic, you need to start storing different parts of the backlog on different sets of machines without losing consistency.

In this talk I will describe how Pulsar uses Apache BookKeeper in its segment oriented architecture. BookKeeper provides a unit of concensus called a ledger. Pulsar strings together a number of BookKeeper ledgers to build the complete topic backlog. Each ledger in the topic backlog is independent of all previous ledgers with regards to location. This allows us to scale the size of the topic backlog simply by adding a more machines. When the storage node is added to a Pulsar cluster, the brokers will detect it, and gradually start writing new data to the new node. There’s no disruptive rebalancing operation necessary.

Of course adding more machines will eventually get very expensive. This is where tiered storage comes in. With tiered storage, parts of the topic backlog can be moved to cheaper storage such as Amazon S3 or Google Cloud Storage. I will also discuss the architecture of tiered storage, and how it is a natural continuation of Pulsar’s segment oriented architecture.

Finally, if you start storing data for a long time in Pulsar, you may want a mean to query it. I will introduce our SQL implementation, based on the Presto query engine, which allows users to easily query topic backlog data, without having do read the whole thing.