from 12:10 to 12:50
In the age of "big data" we have been observing unprecedented growth in the amount of data available for analysis. At the same time, handling unstructured and semi-structured data is a challenging task that prompts organisations to discard substantial amounts of data. Artificial Neural Networks (ANNs) have been used successfully to impose structure on unstructured data by means of unsupervised feature extraction and non-linear pattern detection.
Restricted Boltzmann Machines (RBMs), for example, have been shown to have a wide range of applications in this context: they can be used as generative models for dimensionality reduction, classification, collaborative filtering, extraction of semantic document representations and more. RBMs also serve as building blocks for the multi-layer learning architecture known as the Deep Belief Network.
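To make the model concrete (this is an illustrative sketch of the standard algorithm, not the implementation discussed in this session, and all names such as cd1_step are hypothetical), a single contrastive-divergence (CD-1) training step for a Bernoulli-Bernoulli RBM can be written in a few lines of NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_v, b_h, lr=0.1, rng=None):
    """One CD-1 update for a Bernoulli-Bernoulli RBM on a batch v0.

    v0 : (batch, n_visible) binary data
    W  : (n_visible, n_hidden) weights; b_v, b_h : bias vectors
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Positive phase: hidden probabilities given the data, then a sample
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and back up
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Approximate gradient: data statistics minus reconstruction statistics
    n = v0.shape[0]
    W = W + lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_v = b_v + lr * (v0 - p_v1).mean(axis=0)
    b_h = b_h + lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h
```

Each update is dominated by a handful of dense matrix products, which is precisely why the training cost explodes at big-data scale and why the computation parallelises naturally.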
Training RBMs on big data sets, however, turns out to be problematic. When operating with millions or even billions of parameters, the parameter estimation process for a conventional, non-parallelized RBM can take weeks. In addition, the constraint of fitting the model on a single machine further limits scalability.
Numerous attempts have been made to overcome these limitations, most of them involving GPU-based computation. Studies have shown that this approach can reduce the training time for an RBM-based deep belief network from several weeks to a single day.
On the other hand, GPU-based training presents its own challenges. GPUs impose a limit on the amount of memory available for the computation, which in turn limits the size of the model. Stacking multiple GPUs together tends to be inefficient due to communication-induced overhead and increased economic cost. There are further limitations arising from memory transfer times and thread synchronization.
Thinking about the current state of RBMs in the context of big data, we couldn't help but wonder whether we could implement a CPU-based, parallelized version of the Restricted Boltzmann Machine, and how effective it would be at processing big data sets in a distributed fashion.
At City University London we conducted extensive research to answer this question. We created a custom implementation of the Restricted Boltzmann Machine that runs on top of Apache SystemML, the declarative large-scale machine learning platform initially developed by IBM Research and donated to Apache in 2015. We carried out a number of tests on various data sets, using RBMs as feature extractors and feeding their outputs to different classification algorithms (Support Vector Machines, Decision Trees, Multinomial Logistic Regression and others).
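As a toy analogue of that feature-extraction pipeline (assuming nothing about the actual SystemML code; every name below is illustrative), the hidden-unit activation probabilities of a trained RBM can be used as features for any downstream classifier. Here a trivial nearest-centroid classifier stands in for the SVMs, decision trees and logistic regression used in the tests:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_features(V, W, b_h):
    """Map visible vectors to hidden-unit activation probabilities."""
    return sigmoid(V @ W + b_h)

def nearest_centroid_fit(F, y):
    """Compute the per-class mean of the feature vectors."""
    classes = np.unique(y)
    centroids = np.stack([F[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(F, classes, centroids):
    """Assign each feature vector to the class of its closest centroid."""
    d = ((F[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]
```

The key design point is the decoupling: the RBM is trained without labels, and the resulting feature map can be swapped in front of any classifier.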
In this session we would like to present the Restricted Boltzmann Machine and the current state of this stochastic ANN model in the context of big data. We will also talk about SystemML, the way it alleviates certain big data challenges (e.g. through cost-based optimisation of distributed matrix operations), and why we chose it as the foundation for our machine learning problem. We will discuss the challenges we faced throughout the project, and we will also share some exciting results and future research plans.
IBM Analytics, Data Scientist in the IBM Big Data Technical Team – Europe