from 14:25 to 15:05
Data stream processing technologies address multiple scenarios that traditional database systems and batch processing technologies cannot handle properly. Through these technologies it is possible to build systems capable of handling huge amounts of data in real time, opening new scenarios and horizons for big data.
Although the current trend is oriented towards improving stream processing, batch processing is still present. This is partly because many algorithms need to perform several iterations over the data, making them unsuitable for implementation on stream processing engines. Several other factors also hinder their implementation, e.g. data being distributed across multiple nodes or the high cost of accessing distributed memory. Many of these algorithms belong to the field of descriptive statistics, especially those related to the calculation of central moments, such as the standard deviation or variance. These statistical operations are widely used for quality control and for early alarm detection and management in real-time processes (climatology, metallurgy, banking, stock markets, et cetera).
Formulas for well-known statistical problems, such as variance computation, lack the robust implementations required to deal with big data, due to its distributed nature. Moreover, robust formulations have yet to be established for more complex data science computations such as central statistical moments, correlation factors and machine learning results.
These formulas need to be transformed from their theoretical form into robust algorithms able to deal with the issues of numerical computation (floating point arithmetic introduces numerical instability, cancellation and loss of precision). Such issues become more relevant as errors accumulate continuously throughout a big data process, so suitable tools must be provided that allow users to detect these gaps and act accordingly.
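The cancellation problem mentioned above is easy to reproduce with the variance. A minimal sketch (not part of the library described here): the textbook formula E[x²] − E[x]² suffers catastrophic cancellation when the mean is large relative to the spread, while Welford's well-known incremental update stays stable and never stores the data.

```python
def naive_variance(xs):
    """Single-pass 'sum of squares' formula -- numerically fragile."""
    n = len(xs)
    s = sum(xs)
    sq = sum(x * x for x in xs)
    return (sq - s * s / n) / n

def welford_variance(xs):
    """Welford's incremental update: one value at a time, no stored data."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the already-updated mean
    return m2 / n

# Large offset, small spread: the true population variance of
# (4, 7, 13, 16) is 22.5, and shifting by 1e8 does not change it.
data = [1e8 + x for x in (4.0, 7.0, 13.0, 16.0)]
print(naive_variance(data))    # can be wildly off, or even negative
print(welford_variance(data))  # close to 22.5
```

This is exactly the kind of accumulated floating point error that a streaming statistics algorithm must be designed around.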
Our team focuses on the research, design and implementation of incremental data science algorithms under the streaming paradigm. We have developed an open source library, on top of Apache Flink, that provides parallel and scalable algorithms for data science, focusing on descriptive and inferential statistics.
These algorithms consume streams of data and update their results in parallel without needing to store the processed data, allowing them to scale both spatially and temporally. They produce accurate results when computing correlation factors and statistical moments of arbitrary order.
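What makes this parallel scaling possible is that the partial aggregates behind such statistics can be merged exactly. A hypothetical sketch (the function names are ours, not the library's API) using the pairwise combination rule of Chan et al. for the second central moment:

```python
def partial(xs):
    """Welford aggregate (n, mean, M2) for one substream."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge(a, b):
    """Exactly combine two aggregates computed on disjoint substreams."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

# Two parallel workers, each seeing half the stream:
left = partial([4.0, 7.0])
right = partial([13.0, 16.0])
n, mean, m2 = merge(left, right)
print(mean, m2 / n)  # same mean and variance as one sequential pass
```

Because `merge` is associative, substreams can be combined in any tree shape, which is what lets the computation be distributed across Flink operators.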
The library can also be used to process batch data, obtaining approximate partial results from the whole set and reducing the time needed to extract value from that data.
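The idea of early partial results from a batch can be illustrated with a toy generator (illustrative only, not the library's interface): scanning the batch incrementally lets a consumer read an approximate statistic long before the full pass finishes.

```python
def running_mean(batch, report_every):
    """Yield (count, mean) partial results while scanning a batch."""
    n, mean = 0, 0.0
    for x in batch:
        n += 1
        mean += (x - mean) / n
        if n % report_every == 0:
            yield n, mean  # partial result available before the pass ends

for n, mean in running_mean(range(1, 11), report_every=5):
    print(n, mean)
```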
Next steps involve:
- Producing online implementations of machine learning algorithms that meet the scalability requirements of the incremental approach and deal with the numerical computation issues mentioned above, related to floating point arithmetic and error accumulated over incremental steps.
- Improving the solution architecture by adding a new security layer to make stream processing more robust and reliable, protecting all data in a real-time process.