Apache Spark is a fast, unified analytics engine for big data and machine learning. It was originally developed at UC Berkeley in 2009. Spark adds in-memory compute for ETL, machine learning, and data science workloads to Hadoop.

What Apache Spark Does

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.

Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine; its Java, Scala, and Python APIs offer a platform for distributed ETL application development.

Additional libraries, built atop the core, allow diverse workloads for streaming, SQL, and machine learning.

Spark is designed for data science, and its abstractions make data science easier. Data scientists commonly use machine learning – a set of techniques and algorithms that can learn from data. These algorithms are often iterative, and Spark's ability to cache a dataset in memory greatly speeds up such iterative processing, making Spark an ideal engine for implementing these algorithms.

Spark also includes MLlib, a library that provides a growing set of machine learning algorithms for common data science techniques: classification, regression, collaborative filtering, clustering, and dimensionality reduction.

Spark’s ML Pipeline API is a high-level abstraction for modeling an entire data science workflow. The ML pipeline package in Spark models a typical machine learning workflow and provides abstractions such as Transformer, Estimator, Pipeline, and Parameters. This abstraction layer makes data scientists more productive.

Install Jupyter Notebook

$ pip3 install jupyter

# You can run a regular jupyter notebook by typing:
$ jupyter notebook

Install pyspark

$ pip3 install pyspark

Open Jupyter Notebook

Configure PySpark driver

Update PySpark driver environment variables: add these lines to your ~/.bashrc (or ~/.zshrc) file.
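The variables typically set for this are shown below; they make the `pyspark` command launch inside a Jupyter notebook rather than the plain Python shell (this assumes `jupyter` is on your PATH).

```shell
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```

After editing, reload the file with `source ~/.bashrc` (or `source ~/.zshrc`) so the new variables take effect.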


Creating a Spark context

from pyspark import SparkConf, SparkContext

# Configure and create the SparkContext (local mode shown here)
conf = SparkConf().setAppName("notebook").setMaster("local[*]")
sc = SparkContext(conf=conf)

Run a simple add operation on Spark

sum_add = sc.parallelize(range(100))
sum_add.reduce(lambda x, y: x + y)  # 4950

countApprox() is an approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.

rdd = sc.parallelize(range(1000), 10)
rdd.countApprox(1000, 1.0)  # timeout of 1000 ms, confidence 1.0

Thanks for reading.