From 16:00 to 16:45
Apache Spark has successfully built on Hadoop infrastructure to encompass real-time processing, moving from rigid MapReduce operations to general-purpose functional operations distributed across a cluster of machines. However, data storage has become a black box: the source data for a query must be retrieved in full and sent through the analysis pipeline, rather than processed where it is stored as in traditional database systems. This introduces significant cost, both in network utilisation and in the time taken to produce a result.
Valo is a new system built to directly integrate the computation graph with data repositories, free from dependencies on existing Hadoop infrastructure. In this talk I'll show, through several examples, how easy algorithm implementation becomes:
I'll show how algorithms, once defined using a simple interface and annotations in Scala code, can be reused in queries across both real-time and historical contexts, as well as switching seamlessly between the two. I'll show how we are developing support to extend this further, factoring time ranges out of query definitions entirely and specifying gap filling and interpolation independently. This will allow the user to focus completely on developing their analysis against a clean data stream.
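As a flavour of the idea, here is a minimal sketch of how such a context-agnostic algorithm might look. The `distributable` annotation, the `Algorithm` trait, and `RunningMean` are all hypothetical illustrations, not Valo's actual interface: the point is that the algorithm is defined once as a fold over events and never mentions whether those events are live or historical.

```scala
// Hypothetical marker annotation telling the engine an algorithm may be
// reused across real-time and historical contexts (illustrative, not Valo's API).
class distributable extends scala.annotation.StaticAnnotation

// An algorithm as a fold: an initial state, an update per event, and a final result.
trait Algorithm[In, State, Out] {
  def initial: State
  def update(state: State, event: In): State
  def result(state: State): Out
}

// Example: a running mean, oblivious to where its events come from.
@distributable
object RunningMean extends Algorithm[Double, (Double, Long), Double] {
  def initial: (Double, Long) = (0.0, 0L)
  def update(s: (Double, Long), x: Double): (Double, Long) = (s._1 + x, s._2 + 1)
  def result(s: (Double, Long)): Double = if (s._2 == 0) 0.0 else s._1 / s._2
}

// The same object can fold a stored batch or be fed one live event at a time.
def runBatch[I, S, O](alg: Algorithm[I, S, O], events: Seq[I]): O =
  alg.result(events.foldLeft(alg.initial)(alg.update))
```

Because the algorithm is just state plus transitions, the engine is free to run it incrementally over a stream or as a batch fold over a repository.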
I'll explain how we can then distribute the query, utilising both the annotated algorithms and user-defined policies that place incoming data across the cluster based on timestamps in the payloads. I'll show how this fits into the wider Valo architecture: a highly available, eventually consistent, truly real-time system that maintains computation in the face of individual node failure.
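A timestamp-based placement policy of this kind can be sketched as follows. `PartitionPolicy`, `Reading`, and `DailyBuckets` are illustrative names of my own, assuming a simple scheme that buckets payloads by day so a query over a time range only needs to touch the nodes owning those buckets.

```scala
import java.time.Instant

// Hypothetical user-defined policy deciding where a payload is stored
// (illustrative interface, not Valo's actual API).
trait PartitionPolicy[E] {
  def nodeFor(event: E, nodes: Vector[String]): String
}

case class Reading(timestamp: Instant, value: Double)

// Bucket payloads by calendar day of their timestamp, spread round-robin
// over the cluster, so time-ranged queries map to a small set of nodes.
object DailyBuckets extends PartitionPolicy[Reading] {
  def nodeFor(event: Reading, nodes: Vector[String]): String = {
    val day = event.timestamp.getEpochSecond / 86400 // days since the epoch
    nodes((day % nodes.size).toInt)
  }
}
```

A historical query for a given time range can then be routed only to the nodes whose buckets overlap that range, rather than broadcast to the whole cluster.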
I'll introduce the two repositories we currently have in Valo.
Valo is capable of "pushing down" historical queries into the repositories themselves. This applies to standard operations (counts, univariate and bivariate statistics) as well as to more complex analysis, such as fuzzy text search using Lucene syntax and running the algorithms defined earlier in the talk.
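The shape of push-down can be illustrated with a toy in-memory repository. `Query`, `Count`, `TermFilter`, and `Repository` are hypothetical names of mine, and the real fuzzy Lucene search is stood in for by a plain substring filter; the point is only that the aggregate is evaluated next to the data and a single number crosses the network, not the rows.

```scala
// Illustrative query language (not Valo's): a count and a simple text filter
// standing in for richer operations like Lucene-syntax fuzzy search.
sealed trait Query
case object Count extends Query
case class TermFilter(term: String) extends Query

// Toy repository: pushDown evaluates the query where the rows live and
// returns only the aggregate, instead of shipping every row to the caller.
class Repository(rows: Seq[String]) {
  def pushDown(q: Query): Long = q match {
    case Count         => rows.size.toLong
    case TermFilter(t) => rows.count(_.contains(t)).toLong
  }
}
```

The contrast with the black-box model in the introduction is that there, all of `rows` would be retrieved before the count or filter ever ran.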
I'll explain how Valo maintains knowledge of the schema and structure of the data at all stages in the analysis pipeline. This includes metadata about the original source, such as its IP address or region, which can easily be incorporated into analysis. I'll create live-updating sets of sources and apply them to the same query to compare results.
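The idea of scoping one query by different, live-updating sets of sources can be sketched like this. `Event`, `SourceSet`, and `totalFor` are illustrative names, assuming only that each event carries source metadata (here an IP and region) and that set membership can change while the query definition stays fixed.

```scala
// Events carry metadata about their original source alongside the payload.
case class Event(sourceIp: String, region: String, value: Double)

// A set of sources whose membership can be updated live, while any query
// built against it keeps working unchanged.
class SourceSet(var members: Set[String]) {
  def contains(ip: String): Boolean = members.contains(ip)
}

// One query definition, scoped by a source set, so two sets of sources
// can be run through the same analysis and their results compared.
def totalFor(events: Seq[Event], sources: SourceSet): Double =
  events.filter(e => sources.contains(e.sourceIp)).map(_.value).sum
```

Applying the same `totalFor` query to two different sets gives directly comparable results, and updating a set's members changes what the query sees without touching the query itself.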
I'll also give a brief overview of how this knowledge of the schema is driving research in our team into automatically applying relevant visualisations to query results.
Software Developer, ITRS Group