Big Data Spain

15th ~ 16th OCT 2015 MADRID, SPAIN #BDS15




Thursday 15th

from 10:15 am to 11:00 am

Room 25



At LinkedIn, we ingest more than 1 Trillion events per day pertaining to user behavior, application and system health etc. into our pub-sub system (Kafka). Another source of events are the updates that are happening on our SQL and No-SQL databases. For e.g. every time a user changes their linkedIn profile, a ton of downstream applications need to know what happened and need to react to it. We have a system (DataBus) which listens to changes in the database transaction logs and makes them available for down stream processing. We process ~2.1 Trillion of such database change events per week.

Read more

We use Apache Samza for processing these event-streams in real time. In this presentation we will discuss some of challenges we faced and the various techniques we used to overcome them.

Challenge 1: What! the IT budget is limited ?

Sooner or later we realize that the architecture we choose can have a big impact on cost. Turns out that the mechanism used to access data/state during event processing can make a big difference. Here we will discuss two different types of data/state that an application has to access/modify to process events.

a. State/data that is derived as an output of event processing.

Some people store this state in a remote no-sql database or cache. This works at smaller event processing rates, but as the scale increases the network i/O and the serialization and deserialization costs rocket up.

The solution we employ is to have a local (RocksDB based store) on every processing machine. This store is backed up with a durable remote store that is lazily written into, to recover from machine failures.

b. Accessing adjunct data.

To process an event, inevitably an application needs to lookup adjunct data that is not embedded in the event. The common solution is to read this state from a remote database. In addition to issues around performance this approach also has a risk of accidentally DOSing the database when a backlog is being processed. To solve this our stream processing jobs liste to the change stream on the remote database and saves the changes to the local RocksDB store. As a result all access is now a local lookup. This approach works for datasets that are a few terabytes or smaller.

Challenge 2: Yes this is a stream processing application, but my logic changed and I want to process everything again.. yes everything !

One thing that is guaranteed about an application is that the logic will change over time. For an application processing a stream of changes from a database, some logic changes could require reprocessing all the records in the database. Many people do this kind of re-processing offline using Hadoop or Spark. But that implies that the core application logic has to be reimplemented in the batch processing system. In this presentation we will discuss how we solve this problem in our real time stream processing system.

Challenge 3: What! you want the correct answer every time ??

A real time stream processing application has to generate results in a reasonably some short period of time. Shorter time windows can vastly magnify the effect of events arriving late, arriving out of order etc. Most companies use what has been called as the Lambda architecture where the stream processing application is augmented with a offline batch application which recomputes the result set to fix inaccuracies in the stream processing application.

In this presentation we will discuss how we address these issues at LinkedIn without using the Lambda architecture.

Kartik Paramasivam foto

Kartik Paramasivam

LinkedInEngineering Manager