from 13:40 to 14:20
This talk will give you a well-balanced mix of theory and practice for doing machine learning with Apache Spark. It is lighter on theory and somewhat heavier on coding, but gentle on newcomers.
In the first part of the talk, I will give a quick overview of Apache Spark and its architecture. You will learn about the different Spark cluster types; the main processes in a Spark cluster; job and resource scheduling; Spark's web UI; RDDs (resilient distributed datasets), DataFrames, and Datasets, the three main abstractions in Spark; and the components that make up Spark's rich API. Then you will hear about Spark's machine learning APIs (MLlib and ML) in more detail.
In the second part of the talk, I will show examples of building machine learning algorithms in a Spark shell, using example data sets. I will cover the following algorithms:
- linear regression as an example of a regression algorithm
- random forest as an example of a classification algorithm
- k-means clustering as an example of a clustering algorithm
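To give a flavor of the theory behind the first of these algorithms: simple linear regression fits a line y = a·x + b by minimizing squared error. The talk demonstrates this with Spark's ML API; the sketch below is only the underlying closed-form arithmetic for a single feature, in plain Python with no Spark involved, and the function name `fit_line` and the sample points are illustrative choices, not from the talk.

```python
# Toy illustration of ordinary least squares for one feature (y = a*x + b),
# using only the Python standard library. The talk itself uses Spark's ML API;
# this just shows the math that the library solves at scale.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); intercept follows from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Points lying exactly on y = 2x + 1 are recovered exactly.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```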
Each algorithm comes with a short theory introduction that provides context for the examples, followed by concrete examples of training and using machine learning models. Because data cleaning and preparation are a big part of any machine learning project, you will first learn how to clean and prepare the data using Spark: how to scale and normalize features, what to do with missing values, and how to handle categorical features. Then you will train the models and learn how to evaluate their performance. Different performance metrics will be explained, as well as underfitting and overfitting (the bias-variance tradeoff).
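Two of the preparation steps mentioned above, mean imputation of missing values and z-score standardization, can be sketched in a few lines. In the talk these are done with Spark; the snippet below is framework-free Python showing only the underlying arithmetic, and the helper names `impute_mean` and `standardize` are my own, not Spark API.

```python
import math

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# The missing entry is filled with the mean of the observed values,
# then the whole column is rescaled to mean 0 and standard deviation 1.
feature = impute_mean([1.0, None, 3.0, 4.0])
scaled = standardize(feature)
```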