from 13:30 to 14:15
Last year, during BDS14, two Allegro engineers shared their experience by presenting pitfalls and mistakes made when implementing data ingestion pipelines in our company. This time, Maciej Arciuch presents a brand new design and shows how taking good advice can lead to a drastic design change, and how much making mistakes can teach us.
The design presented during BDS14 as an anti-pattern was an HDFS-centric system storing semi-structured but poorly documented JSON documents. Moreover, there was no buffer between incoming traffic and HDFS. The data was not monitored well enough, and low-level, error-prone APIs such as Hadoop MapReduce in Java were used to manipulate it. Data was available only in hourly or daily batches, and a large number of small files caused the name nodes to choke.
Sounds like data hell? Not really: all of these problems have fairly simple solutions that you can apply to your own system. Let us show you what happened when we applied them to our pipeline.
The new design tries to address all the issues of the previous one. Spark Streaming and Kafka let you process data with low latency, as soon as it arrives. Spark Streaming, although not a perfect tool, is a decent choice for stream processing, while the robust Kafka message queue makes a great data hub, and the two integrate very well. But wait, there’s more. The data is no longer a pile of cryptic JSON documents scattered in dark corners of HDFS. It is stored in Avro, a space-efficient binary format that forces you to define a schema for your data. Schemas now live in a schema repository, and Hive tables are created simply by pointing to it. Schema information is propagated to the Hive metastore, so every column now has a correct description. The data becomes more accessible to your coworkers, both real-time zealots and batch dinosaurs. Avro-serialized data also takes up less space, which, contrary to popular belief, is not infinite. The high-level Spark API lets us express our business logic in fewer lines of code and in an elegant functional manner: no more tedious and error-prone coding in Java MapReduce. The same Spark code can be applied to both batch and streaming jobs.
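The contrast between low-level MapReduce and a high-level functional API can be sketched in a few lines. The snippet below is an illustration, not the Allegro code: it counts events per type in the map/reduce-by-key style a Spark job would use, but runs on a plain Python list so it is self-contained (the event records and field names are made up for the example):

```python
from collections import Counter
from functools import reduce

# Sample events as they might arrive from a Kafka topic
# (in the real pipeline their shape would be enforced by an Avro schema).
events = [
    {"type": "click", "user": "a"},
    {"type": "view", "user": "b"},
    {"type": "click", "user": "c"},
]

# High-level functional style: map each event to a (key, 1) pair,
# then reduce by key. In Spark this would be roughly
# events.map(e => (e.type, 1)).reduceByKey(_ + _); here we emulate it locally.
pairs = map(lambda e: (e["type"], 1), events)
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}), pairs, Counter())

print(dict(counts))  # {'click': 2, 'view': 1}
```

The same two-step expression (a map followed by a keyed reduce) would take dozens of lines of boilerplate as a Java MapReduce job, which is the point the talk makes about high-level APIs.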
If you have a legacy system and would like to move forward, this talk is for you. No buzzwords, just a simple recipe: a problem and a solution, from people who also make mistakes. Mistakes are fine as long as you draw the proper conclusions!
Analytics has undoubtedly changed the way businesses operate. It has made it clearer than ever that what cannot be measured cannot be managed. However, about 80% of Big Data projects still rely merely on descriptive analytics. While clever visualizations of business data can be of great aid in the decision-making process, there is much more value to be explored through deeper analytical processes. Whenever information about business rules and costs is available, prescriptive analytics can recommend efficient courses of action to optimize costs or revenues.
In this talk, I will present the key ideas that make prescriptive analytics stand out from the usual analytics approaches. This kind of analytics builds on top of the previous levels: descriptive analytics helps us understand the past and current situation of our business through visualizations and KPIs, while predictive analytics drafts probable future outcomes of our business decisions, as well as estimates of unknown data such as customer intent or hidden influential factors. By combining all of this data, we obtain a picture of how and why our business is working. The next step is adding another layer of intelligence that can test different hypotheses and explore possible business decisions, estimating costs and revenues in order to prescribe the best courses of action. This further empowers the decision process, providing more solid insights into the impact such decisions might have on the business.
I will also give several examples of applying prescriptive analytics in fields such as customer service, smart pricing and retail logistics. The process of handling customer complaints can be accelerated by a predictive process that identifies the topic of a complaint and sends it to the appropriate expert, but also redirects it to a human reader if the predictive model is not confident enough in its own identification. Changes in the pricing strategy can likewise be suggested by predicting the impact such a change will have on customer demand, generating hypotheses about the revenue-optimal price and updating them in real time with feedback from clients. Moreover, logistics costs in retail chains can be optimized by predicting future customer demand and scheduling stock transfers at the right times, in order to minimize the number of trips and storage costs.
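As a toy illustration of the pricing case, the prescriptive step can be framed as a search over candidate prices against a predicted demand curve, recommending the revenue-maximizing one. The linear demand model and its coefficients below are invented assumptions for the sketch, not figures from the talk:

```python
# Hypothetical predictive model: demand falls linearly as price rises.
# The coefficients (1000 units baseline, 8 units lost per currency unit)
# are illustrative assumptions, not real data.
def predicted_demand(price):
    return max(0.0, 1000.0 - 8.0 * price)

def revenue(price):
    return price * predicted_demand(price)

# Prescriptive step: explore candidate price hypotheses and
# recommend the one that maximizes predicted revenue.
candidates = [p / 2 for p in range(40, 241)]  # 20.0 to 120.0 in 0.5 steps
best_price = max(candidates, key=revenue)

print(best_price, revenue(best_price))  # 62.5 31250.0
```

In a real system the demand model would be fitted from data and refreshed with live customer feedback, but the shape of the loop, predict then search then recommend, is the same.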
While in some of these settings attaining a prescriptive level of analytics follows easily from existing analytic processes, others require a significant degree of technique and expert knowledge, while producing benefits not achievable by other means. It is in those settings that prescriptive analytics makes the greatest difference, adding new value on top of increasingly commoditized standard Big Data solutions.
NoSQL databases threw out SQL for querying, while their authors focused on solving problems of scale, speed and availability.
The trouble is, the need for rich query never went away. Neither did SQL; it was only resting.
Today, non-relational databases are bringing back SQL-like languages and other query mechanisms to help them integrate with existing data query layers (e.g. Hibernate) and to fit in with the overwhelming weight of 40 years of database query practice.
In this talk I’ll cover the journey from simple key-value access, through novel ideas such as JSONiq, to where we are now: academic papers proposing a SQL++ language, and projects including Cassandra, Couchbase and Aerospike providing their own SQL-like languages.
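The shift the talk describes can be made concrete with a toy sketch (my own illustration, not any vendor's API): raw key-value access requires knowing the key in advance, while a SQL-like layer adds declarative filtering over all documents, in the spirit of SELECT name FROM store WHERE type = 'user' AND age > 40:

```python
# A toy document store: key -> JSON-like document.
# Keys and documents are invented for the example.
store = {
    "u1": {"type": "user", "name": "Ada", "age": 36},
    "u2": {"type": "user", "name": "Grace", "age": 45},
    "o1": {"type": "order", "total": 99.5},
}

# Key-value access: fast, but you must already know the key.
doc = store["u1"]

# A SQL-like layer: declarative selection over every document,
# no keys required. Real systems compile such queries and use
# secondary indexes; this linear scan is only the idea in miniature.
def select(store, field, where):
    return [d[field] for d in store.values() if where(d)]

names = select(store, "name",
               lambda d: d.get("type") == "user" and d.get("age", 0) > 40)
print(names)  # ['Grace']
```

The gap between these two access styles is exactly what languages like SQL++ and the vendors' SQL dialects are built to close.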