16:10 | 16:50
Keywords defining the session:
- Machine learning
- Apache Flink
Takeaway points of the session:
- What online machine learning algorithms are, and which challenges must be tackled to build a scalable implementation of one of them
On the one hand, machine learning algorithms have become the cornerstone for solving most data analytics problems. On the other hand, after the emergence of multiple technologies for batch processing of large volumes of data, stream processing has become the new goal in data analytics: it is no longer enough to process each night the data generated during the day; data must be processed as they are collected in order to extract the maximum benefit from them. The combination of these two clear trends often does not perform satisfactorily. Deploying machine learning algorithms in production environments involves designing architectures that acquire data to train machine learning models and then deploy those models in real environments, for example to make predictions in real time. This split in the data workflow forces more complex architectures, with more points of failure and, above all, solutions where the most up-to-date model is not available, since the model in production is often trained on data collected in previous hours, days or weeks.
Scalable online machine learning can be the solution to this problem. These algorithms allow us to train the model and exploit it (i.e. to make predictions or classify data) within the same real-time data workflow. In this way we can simplify streaming architectures and always have the most up-to-date model available. Evidently it is not an easy task, and there are important challenges to solve before online machine learning algorithms are production-ready. The first is a mathematical one: knowing which machine learning algorithms are suitable for this setting. We need algorithms that can train a model while processing each sample only once. Not every machine learning algorithm works under this constraint; some need the complete dataset to train a model and are not valid for this purpose. In addition, there are technical challenges in this kind of data architecture. Since the workflow of a stream processing application is parallelized, we need a component to share the model between the workers and to store the most up-to-date model possible. There are currently some experimental solutions, but these technologies still need to be explored in depth and matured.
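To make the one-pass constraint concrete, here is a minimal sketch (not the session's actual code) of an online learner: a linear model trained by stochastic gradient descent that updates from each sample exactly once and then discards it, so it never needs the full dataset in memory. The class and parameter names are illustrative; sharing the resulting weights between parallel workers is the separate technical challenge mentioned above and is not shown here.

```python
import random

class OnlineLinearRegression:
    """Minimal online learner: one SGD update per incoming sample,
    each sample is seen exactly once (one-pass training)."""

    def __init__(self, n_features, lr=0.05):
        self.w = [0.0] * n_features  # model weights
        self.b = 0.0                 # intercept
        self.lr = lr                 # learning rate

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def learn_one(self, x, y):
        # Single gradient step on the squared error of this one sample.
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

# Simulate a stream whose true relation is y = 2*x + 1 plus small noise;
# samples arrive one at a time and are discarded after the update.
random.seed(0)
model = OnlineLinearRegression(n_features=1)
for _ in range(2000):
    x = [random.uniform(-1.0, 1.0)]
    y = 2.0 * x[0] + 1.0 + random.gauss(0.0, 0.01)
    model.learn_one(x, y)
```

After consuming the stream, `model.w[0]` and `model.b` approach the true coefficients 2 and 1, and `model.predict` can be called at any point mid-stream, which is what lets training and serving live in the same workflow.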
Finally, an example of where this kind of algorithm can be applied is shown. Early defect detection is a meaningful problem in manufacturing industries, worth millions of euros. Concretely, the prototype implemented to solve a defect detection problem for a steel maker will be explained. The complete prototype predicts defects in a real-time environment and performs data analysis over historical data. The presented solution uses Apache Flink, Kafka, Couchbase and HDFS to implement a real-time Lasso algorithm able to detect steel coil flatness defects, which will allow defective coils to be rejected in the factory.
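As a rough illustration of the algorithmic core (a sketch under assumptions, not the prototype described above, which runs on Flink with Kafka, Couchbase and HDFS), a Lasso model can be trained online by following each SGD step with a soft-thresholding (proximal) step for the L1 penalty. The feature setup below is invented for the example: three simulated sensor readings, of which only the first actually drives the target, so the L1 penalty should shrink the irrelevant weights toward zero.

```python
import random

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrinks v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

class OnlineLasso:
    """One-pass Lasso sketch: an SGD step on the squared error of the
    current sample, followed by soft-thresholding for the L1 penalty."""

    def __init__(self, n_features, lr=0.05, l1=0.05):
        self.w = [0.0] * n_features
        self.lr = lr   # learning rate
        self.l1 = l1   # L1 regularization strength

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def learn_one(self, x, y):
        err = self.predict(x) - y
        # Gradient step, then shrink each weight by lr * l1.
        self.w = [soft_threshold(wi - self.lr * err * xi, self.lr * self.l1)
                  for wi, xi in zip(self.w, x)]

# Simulated stream: only feature 0 influences the target measure.
random.seed(1)
model = OnlineLasso(n_features=3)
for _ in range(3000):
    x = [random.uniform(-1.0, 1.0) for _ in range(3)]
    y = 1.5 * x[0] + random.gauss(0.0, 0.01)
    model.learn_one(x, y)
```

In a streaming deployment, `learn_one` would run inside the stream operator on each incoming record, while `predict` serves real-time scoring from the same continuously updated weights; note that the L1 penalty deliberately biases the surviving weight slightly below its true value.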