13:20 | 14:00
Keywords defining the session:
- Machine Learning
Takeaway points of the session:
- New workflow tooling and engineering processes are emerging for machine learning, filling gaps that Agile does not address.
- Issues concerning "repeatable science" affect both scientific research and IT practices in industry, which now need to learn from each other.
Apache Spark is a popular Big Data framework, which evolved to handle large scale workflows for ETL, analytics, machine learning, and other needs. Precursors such as Cascading, Apache Pig, etc., also provided workflow abstractions. However, a newer generation of workflow tooling is emerging, and along with it come newer forms of software engineering process. Notions of “Agile” — which served well for web development — do not necessarily meet the needs of machine learning in production.
This talk reviews learnings from a large 2018 survey about ML adoption in enterprise worldwide, conducted by O’Reilly Media. Organizations with more experience running ML in production show important contrasts — when compared with less sophisticated teams — in how they handle engineering process, team priorities, key metrics, and other areas of decision making. Notably, there is a shift away from “Agile”, and while best practices are emerging, no distinct name has gained traction yet.
An important related theme in 2018 surfaced through Project Jupyter, concerning “reproducible science” and how workflows must be repeatable with consistent results. Science faces stark criticism over the lack of reproducibility in published work, e.g., around climate science funding in the US. Industry faces a similar challenge: data science projects must be reproducible across teams to gain the trust of the overall organization. One outcome is that scientific research clearly has much to learn from open source practices, while IT has much to learn from science. How are those meeting in the middle?
Another emerging theme concerns the impact of “uncertainty” on software engineering, which is not necessarily a bad thing. ML models are neither software nor data. They introduce new needs and concerns in engineering, which in turn require new approaches to engineering. To wit, building an ecommerce web site has analogies to a construction project for new apartments; however, ML in production is more akin to the long-term investment of managing high-end vineyards.
This talk reviews contemporary approaches to ML workflow and process, where the circa-2014 focus on cluster throughput gives way to circa-2018 priorities: devops practices, clarifying team roles, managing code+data+feature versioning, hyperparameter optimization, and ML model interpretation, to name a few. We’ll compare accepted notions of Agile software development with how the discipline of machine learning now compels a different engineering process.