Keynote | Technical
Thursday 16th | 12:55 - 13:25 | Theatre 25
This talk will look at the architecture of PySpark as it is today, how it's changing (and why), and what this means in the "Big Data" space.
Keywords defining the session:
- Big data
Takeaway points of the session:
- Python users will learn how to use systems like Spark more effectively and understand how the design is changing.
- JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Many of the recent big data systems, like Hadoop, Spark, Kafka, and more, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. This talk will look at how we can bridge the gap today using PySpark, how other systems (like Kafka Streams) can bridge or are bridging the gap, and the challenges of pure Python solutions like Dask.
Apache Spark is one of the most popular general-purpose distributed systems of the past few years. Apache Spark has APIs in Scala, Java, and Python, and more recently there have been a few different attempts to provide support for R, C#, and Julia. This talk focuses on the interactions of Python and Spark, and aims to generalize to other non-JVM languages and big data systems. In essence, parts of this talk could be considered “the impact of design decisions from years ago and how to work around them,” as well as a promise that the future is getting better, while being honest about the parts that are likely to remain bad in the near future.
To this end, we will start with the current architecture of PySpark and how it evolved. From there we look to the future of things like Arrow-accelerated interchange for Python functions. We will look at how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models.
From there we will explore what other similar systems are doing, as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to use systems like Spark more effectively and understand how the design is changing. JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
A basic background with Apache Spark will probably make the talk more exciting or depressing, depending on your point of view, but for those new to Apache Spark just enough to understand what's going on will be covered at the start. The presenter would of course encourage you to buy and read her books on the topic (“Learning Spark” & “High Performance Spark”), because which presenter doesn’t do that?