Keynote | Technical

Make the elephant fly, once again

Thursday 16th | 16:35 - 17:05 | Theatre 18

Keywords defining the session:

- Cloud

- Data Quality


Description: Our company has been an early Hadoop user: since 2008, when our first cluster was built for a recommendation algorithm. Since then, Hadoop usage has spread to many different teams in the company, supporting Business Intelligence as well as operational processes and Machine Learning jobs. Our on-premises Hadoop cluster has now become a victim of its own success: we struggle to run ever more jobs on ever more data without investing heavily in hardware and in the Infrastructure team. That is why we decided in 2017 to move our entire Hadoop environment to the Cloud, not only to get more hardware easily, but also to use new technologies that outperform our current installation.
In this session, I will present the considerations, tips, and problems we encountered when migrating a heavily used Hadoop cluster to the Cloud, in order to make the elephant fast again. First, I will briefly talk about the Hadoop infrastructure we have and why we want to switch to the Cloud.
Then I will focus on Data Quality, a problem that is often overlooked when dealing with large amounts of data. On that topic, I will describe an open source project I have built: hive_compared_bq, a Python program that compares two (SQL-like) tables and graphically shows the rows and columns that differ, if any are found. I first faced this problem several years ago when working at Hortonworks, where our team developed several programs to tackle it and thus ensure that a migration project succeeds. I’ll discuss those first approaches, the problems they have, and the advantages we get with hive_compared_bq: it is straightforward to use, does not require moving the data, scales to the full datasets, and gives developers an easy way to see where the errors are. A demo will be given. The second Data Quality tool I want to present is DataPrep, which lets people quickly view their data, get an idea of its distribution, and spot outliers or errors (also with a small demo).
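To give a rough feel for the kind of table comparison mentioned above, here is a minimal Python sketch. It is not hive_compared_bq's actual implementation: the function names, the bucketing scheme, and the sample rows are all assumptions made for illustration. The idea is to reduce each table to a small set of per-bucket, per-column checksums, so that only those checksums need to be compared rather than the full data.

```python
import hashlib
from collections import defaultdict


def _hash_int(value):
    """Stable 64-bit integer hash of a value's string form (hypothetical helper)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_checksums(rows, key, columns, num_buckets=4):
    """Reduce a table (list of dicts) to one checksum per (bucket, column)."""
    sums = defaultdict(int)
    for row in rows:
        # Rows are assigned to a bucket by hashing their key
        bucket = _hash_int(row[key]) % num_buckets
        for col in columns:
            # Hash key and value together so that values swapped
            # between two rows of the same bucket are still detected
            sums[(bucket, col)] += _hash_int((row[key], row[col]))
    return dict(sums)


def compare_tables(left, right, key, columns, num_buckets=4):
    """Return the sorted (bucket, column) pairs whose checksums differ."""
    a = bucket_checksums(left, key, columns, num_buckets)
    b = bucket_checksums(right, key, columns, num_buckets)
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


# Made-up sample data standing in for the same table in two systems
hive_rows = [
    {"id": 1, "name": "ann", "price": 10.0},
    {"id": 2, "name": "bob", "price": 20.0},
    {"id": 3, "name": "cat", "price": 30.0},
]
bq_rows = [
    {"id": 1, "name": "ann", "price": 10.0},
    {"id": 2, "name": "bob", "price": 20.5},  # one value differs
    {"id": 3, "name": "cat", "price": 30.0},
]

diffs = compare_tables(hive_rows, bq_rows, "id", ["name", "price"])
print(diffs)  # the (bucket, column) pairs that need a closer look
```

In a real migration the aggregation step would presumably run as a query inside each engine, so that only the small checksum results leave the cluster; the in-memory lists here merely stand in for those tables.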
Another topic I will discuss is Security, in particular PII data. Our Hadoop cluster was a no-go for storing such data because it was too complicated for our infrastructure team to secure it to the standards required by our Security team. I’ll show the advantages we found in the Cloud (encryption, authorization model, advanced logs/audit) and the reasons why Security now allows us to store PII data there, and under which conditions (the BigData components of the Cloud provider do not have the same