Productionizing ML Pipelines with PFA

Wednesday 14th

13:20 | 14:00

Theatre 20


Keywords defining the session:

- Machine Learning

- Model Export


Takeaway points of the session:

- PFA helps solve a major pain point in taking machine learning to production, particularly within the Apache Spark community.

- Open standards enable true model portability across languages, frameworks and runtimes.


The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering and model selection, while important, is only one aspect. A critical missing piece is the deployment and management of models, as well as the integration between the model creation and deployment phases.

Despite this, there are few, if any, widely accepted standard solutions for deploying end-to-end machine learning pipelines to production. This issue affects all machine learning frameworks, but it is a particular challenge when deploying Apache Spark ML pipelines for low-latency scoring. MLlib’s DataFrame API is powerful and elegant, yet it is relatively ill-suited to the needs of many real-time predictive applications, in part because it is tightly coupled with the Spark SQL runtime.

In this talk I will introduce the Portable Format for Analytics (PFA) for portable, open and standardized deployment of data science pipelines and analytic applications. Because a PFA document encapsulates the serialized model together with its execution logic and schema definitions, models serialized to PFA are portable across languages, frameworks, versions and runtimes. Furthermore, model producers and consumers can become independent of one another, which dramatically simplifies the interaction between typical model “producer teams” (e.g. data scientists and machine learning researchers) and model “consumer teams” (e.g. production engineers and application developers).
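
To make the idea of one document carrying schema, model state and execution logic more concrete, here is a minimal sketch in Python. It is not taken from the talk. The top-level fields (`input`, `output`, `cells`, `action`) and the `model.reg.linear` function follow my reading of the PFA specification and should be checked against it. The document name, the helper `build_linear_pfa` and the coefficient values are hypothetical. The code only builds the document and round-trips it through JSON; it does not score anything, which would require a PFA engine.

```python
import json


def build_linear_pfa(coeff, const):
    """Build a PFA-style document (as a plain dict) for a linear model.

    Assumption: field names and the "model.reg.linear" function follow the
    PFA specification as I understand it; verify before relying on them.
    """
    return {
        # Hypothetical document name, for illustration only.
        "name": "LinearRegressionSketch",
        # Schema definitions: Avro-style types for input and output.
        "input": {"type": "array", "items": "double"},
        "output": "double",
        # Model serialization: the fitted parameters live in a typed cell.
        "cells": {
            "model": {
                "type": {
                    "type": "record",
                    "name": "Model",
                    "fields": [
                        {"name": "coeff",
                         "type": {"type": "array", "items": "double"}},
                        {"name": "const", "type": "double"},
                    ],
                },
                "init": {"coeff": list(coeff), "const": const},
            }
        },
        # Execution logic: apply the linear model to each input record.
        "action": [
            {"model.reg.linear": ["input", {"cell": "model"}]}
        ],
    }


# Hypothetical coefficients standing in for a trained model.
doc = build_linear_pfa([0.5, -1.25], 3.0)

# The producer side writes JSON; the consumer side only needs to read it.
serialized = json.dumps(doc, indent=2, sort_keys=True)
restored = json.loads(serialized)
print(serialized)
```

Because the result is plain JSON, a consumer team would only need a PFA-compliant scoring engine to load it, with no dependency on the framework that trained the model.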

I will also cover open-source work we have done in the Aardpfark project to support export of Apache Spark ML pipelines to PFA, and explore PFA export for other popular ML frameworks (such as scikit-learn). Finally, I will compare PFA with available alternatives, including PMML, MLeap, ONNX and Apple’s CoreML, and explore the relevance of PFA in the new era of deep learning and AI technologies.