
Keynote | Technical

Disaster Recovery for Big Data

Thursday 16th | 12:55 - 13:25 | Theatre 20

One-liner summary:

All modern Big Data solutions, such as Hadoop, Kafka, and the rest of the ecosystem tools, are designed as distributed processes and as such include some form of redundancy for High Availability. For mission-critical data this is often not good enough, and enterprises seek a way to avoid having all their eggs in one basket. In this talk we will give examples of possible Disaster Recovery options, show how to integrate them with existing solutions, and explain the benefits of having them.

Keywords defining the session:

- Disaster Recovery

- Big Data

- Engineering


Big Data is all about unifying data in one place. Enterprises spend a lot of time setting up data flows from their production environments to these new platforms to ensure that every piece of relevant information is stored in a single location. This has the benefit of providing a convenient centralized location for data access and a much broader view of what is happening, and it allows for more efficient data cleansing and ultimately better analytics. However, there is a downside: the Big Data platform becomes mission-critical and, if kept in a single location, can become a very vulnerable spot in the company's structure.

We usually associate Disaster Recovery with the question "What if your datacenter gets blown up by a bomb?", but a broken water pipe or a power cable cut by accident is much more common and can still take a long time to recover from. Big Data solutions need to be ready to be launched in a different datacenter if needed, just like the rest of the company's solutions.

So how is the Big Data platform different from the rest of the IT services, such that it cannot simply be incorporated into an existing Disaster Recovery solution? First, production Big Data solutions generally run in their own environment: dedicated hardware, a dedicated network, and a specific platform management solution such as Cloudera Manager or Apache Ambari. Secondly, the sheer volume of data and the complexity of the internal processes in these clusters add to the difficulty of designing a DR solution for Big Data, so generally only the simplest scripts (with lots of manual intervention) are implemented.
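As an illustration of the "simplest scripts" mentioned above, cross-datacenter HDFS mirroring is often just a scheduled DistCp run. The sketch below builds such a command; the namenode hosts and paths are hypothetical placeholders, and a real deployment would wrap this with monitoring and retry logic.

```shell
#!/bin/sh
# Sketch of a minimal HDFS disaster-recovery sync: mirror a critical
# dataset from the primary cluster to a standby cluster in another
# datacenter. Hostnames and paths are illustrative assumptions.

build_dr_sync_cmd() {
  # Same relative path on both clusters keeps the mirror symmetric.
  src="hdfs://nn-primary.dc1.example.com:8020${1}"
  dst="hdfs://nn-standby.dc2.example.com:8020${1}"
  # -update copies only files that changed since the last run;
  # -delete removes files on the standby that no longer exist on
  # the primary, keeping the mirror exact.
  echo "hadoop distcp -update -delete ${src} ${dst}"
}

# A cron job would execute the command; here we only print it.
build_dr_sync_cmd /data/critical
```

Scheduling this from cron gives basic protection, but it is asynchronous: anything written between the last successful run and the disaster is lost, which is exactly why more elaborate DR designs are worth discussing.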