
Keynote | Technical

Keeping your Enterprise’s Big Data Secure

Thursday 16th | 13:40 - 14:10 | Theatre 25

One-liner summary:

Security is a tradeoff between usability and safety and should be driven by the perceived threats.

Keywords defining the session:

- Security

- Encryption

- Metadata discovery

Takeaway points of the session:

- How to think about your big data security needs.

- When and how to use encryption.

Enterprises are putting all of their business-critical information in Hadoop. When your data analytics team is small, security is an afterthought, but as your data and team grow, security becomes imperative. Implementing security always requires tradeoffs between protection and convenience, so it is important to understand the threats and how to protect against them.


The Apache big data ecosystem consists of a large, diverse set of tools, each with its own security features. To bring some order to the chaos, this talk works from the top down, starting with how to think about security by analyzing the threat vectors. We then discuss how to use the tools' various features to establish a coherent security policy and governance.


Apache Hadoop has strong authentication with Kerberos, audit logs, and encryption for data at rest. However, since the file system doesn’t interpret the data, authorization checks apply only at the level of entire files. Apache Hive lets users query their data using SQL and also provides fine-grained control over which rows and columns are visible to each user. LLAP, a new service from the Hive project, not only provides an in-memory cache for the hot columns and partitions, but also lets other frameworks such as Apache Spark and Apache Pig use Hive’s fine-grained authorization. Apache Ranger provides a single, consistent control point for the security of your data and extends Hive with dynamic data masking. Finally, Apache Atlas provides the discovery tools to find and track data through your enterprise.
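
To make the idea of row filtering, column hiding, and dynamic masking concrete, here is a minimal conceptual sketch in plain Python. It is not the Hive or Ranger API; the policy layout, the role names, and helpers such as `apply_policy` and `mask_last4` are hypothetical and exist only to illustrate how one query can return different views to different users.

```python
# Conceptual sketch only: illustrates per-user row filtering, column hiding,
# and dynamic masking. All names here are hypothetical, not Hive/Ranger APIs.

def mask_last4(value):
    """Replace all but the last four characters with '*'."""
    s = str(value)
    return "*" * (len(s) - 4) + s[-4:]

# Hypothetical policies keyed by role.
POLICIES = {
    "analyst": {
        "row_filter": lambda row: row["region"] == "EU",  # rows the role may see
        "hidden": {"salary"},                              # columns removed entirely
        "masks": {"card_number": mask_last4},              # columns shown masked
    },
    "auditor": {
        "row_filter": lambda row: True,
        "hidden": set(),
        "masks": {},
    },
}

def apply_policy(rows, role):
    """Return the view of `rows` that `role` is allowed to see."""
    policy = POLICIES[role]
    visible = []
    for row in rows:
        if not policy["row_filter"](row):
            continue  # row-level filtering
        out = {}
        for col, val in row.items():
            if col in policy["hidden"]:
                continue  # column-level authorization
            mask = policy["masks"].get(col)
            out[col] = mask(val) if mask else val  # dynamic masking
        visible.append(out)
    return visible

ROWS = [
    {"name": "Ada", "region": "EU", "salary": 100, "card_number": "4111111111111111"},
    {"name": "Bob", "region": "US", "salary": 90, "card_number": "5500000000000004"},
]

if __name__ == "__main__":
    print(apply_policy(ROWS, "analyst"))
    print(apply_policy(ROWS, "auditor"))
```

In this sketch the policy is applied at read time and the stored rows are never modified, which is the sense in which the masking is dynamic.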


Upcoming features, including column-level encryption in the ORC columnar format, will also be covered. By encrypting particular columns, enterprises can control which users have access to sensitive columns that contain personally identifiable or financial information.


We will provide an overview of the features recently added to each of these Apache projects and show how enterprises can use them to build a robust security and governance model for their data lakes.