**17:15** ~ 18:00,

**November 18th**

*Sala 5*

## Alex Fernández

*Senior Developer, MediaSmart Mobile*

- Technical

## THANK YOU FOR AN AMAZING CONFERENCE!

#BDS14

The 3rd edition of Big Data Spain in Nov 2014 was a resounding success.

- Analytics
- English

Creating an analytics database from scratch is not an easy challenge, and it only gets more interesting when working for a startup with limited resources. Instead of a full-scale army we have to think like guerrilla combatants: move fast and do as much damage as possible. At MediaSmart Mobile we set out to create a robust system with a particular set of requirements: it should handle many billions of entries per day and be blazing fast (no query should take more than two seconds), while not requiring a full-time administrator or costing thousands of dollars per month.

We used Amazon Redshift, a managed AWS service that works as an OLAP database and scales to petabytes of storage, and which should require little to no administration; scaling is always a couple of clicks away. After a couple of months we have created a system that can go through billions of records in under a second, at a fraction of the cost of similar commercial solutions. Along the way we have found some interesting mathematical problems and experimented with multiple designs.

We had to find a way to store billions of requests per day, and the solution is as obvious as it is often overlooked: aggregate as much as possible. Most requests are quite similar, and the number of distinct combinations of values is much smaller than one might imagine. Is it possible to estimate the number of combinations for a given volume of requests? Interestingly, this is known as the "generalized coupon collector problem", and the answer is not straightforward; luckily, an approximate heuristic solution was enough for our purposes.
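To make the estimation question concrete, here is a minimal sketch in Python. It is not the heuristic used at MediaSmart Mobile, which the abstract does not describe. It assumes requests are independent draws from a fixed distribution over combinations, in which case the expected number of distinct combinations after n requests is the sum over combinations of 1 - (1 - p_i)^n. The function names, the `(country, device)` fields and the skewed distribution are all hypothetical.

```python
from collections import Counter


def expected_distinct(probabilities, n_requests):
    """Expected number of distinct combinations seen in n_requests draws.

    Assumption: requests are independent and probabilities[i] is the
    chance that any single request carries combination i.
    """
    return sum(1.0 - (1.0 - p) ** n_requests for p in probabilities)


def aggregate(requests):
    """Collapse identical requests into one row per combination with a count."""
    return Counter(requests)


# Hypothetical traffic: each request is a (country, device) combination.
requests = [("ES", "ios"), ("ES", "ios"), ("FR", "android"), ("ES", "ios")]
rows = aggregate(requests)  # 2 aggregated rows instead of 4 raw requests

# Hypothetical skewed distribution over 1000 possible combinations:
# a few are very common, most are rare.
weights = [1.0 / (rank + 1) for rank in range(1000)]
total = sum(weights)
probabilities = [w / total for w in weights]
estimate = expected_distinct(probabilities, n_requests=10_000)
```

With a skewed distribution like this one, the estimate stays below the number of possible combinations even when requests far outnumber them, which is why aggregated tables can be much smaller than the raw request log.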

We then had to figure out which design was better in terms of both storage (which can hike up the cost quite a bit) and query speed; the two pull against each other in the classic space vs time trade-off and can be hard to reconcile. We figured out usage patterns and again aggregated our data until the speed was acceptable. The final challenge was to handle tables with lots of redundant data. At the time of writing this is still an open problem: we are making progress every day and have an acceptable solution right now, but will hopefully improve it in time for the talk.
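Aggregating further according to usage patterns can be sketched as a roll-up: if queries only ever filter on a few dimensions, a smaller table can be precomputed that keeps just those dimensions and sums the rest away. The snippet below is an illustrative assumption, not the actual schema; the dimension names (`hour`, `country`, `device`) and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict


def roll_up(rows, keep):
    """Sum the 'count' metric over every dimension not listed in keep."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[dimension] for dimension in keep)
        totals[key] += row["count"]
    return dict(totals)


# Hypothetical fine-grained aggregated rows.
rows = [
    {"hour": 10, "country": "ES", "device": "ios", "count": 3},
    {"hour": 10, "country": "ES", "device": "android", "count": 5},
    {"hour": 11, "country": "FR", "device": "ios", "count": 2},
]

# Coarser table for queries that only ask about country.
by_country = roll_up(rows, keep=("country",))
# by_country == {("ES",): 8, ("FR",): 2}
```

The coarser table costs extra storage but answers the common query by scanning fewer rows, which is the space vs time trade-off in miniature.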

In the process of designing and implementing this system we have learned some valuable lessons that we want to share with the audience:

- Building analytics systems is hard: there are many pieces to orchestrate.
- Building analytics systems for big data is doubly hard: we are near the operating limits of current technology all the time.
- Performance must always be the key driver of any customer-facing system.
- Good engineering requires many iterations to find a solution that is good enough.
