
# Apache Spark Standalone Cluster on Docker

The project was featured in an article on the official MongoDB tech blog! 😱

The project also got its own article on the Towards Data Science Medium blog!

## Introduction

This project gives you an Apache Spark cluster in standalone mode with a JupyterLab interface, built on top of Docker. Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) APIs by running the Jupyter notebooks, which include examples of how to read, process and write data.
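
For instance, a typical read-process-write round trip from a PySpark notebook looks roughly like the sketch below. The master URL, file paths and column name are illustrative placeholders, not values taken from this repository; check your docker-compose.yml and data for the real ones.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Connect the notebook to the standalone cluster.
# "spark://spark-master:7077" is an assumed service name/port inside the
# Docker network; check your docker-compose.yml for the actual values.
spark = (
    SparkSession.builder
    .appName("readme-example")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

# Read: load a CSV file (path is illustrative) into a DataFrame.
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)

# Process: a simple aggregation, distributed across the workers.
summary = df.groupBy("category").agg(F.count("*").alias("rows"))

# Write: persist the result back to disk as Parquet.
summary.write.mode("overwrite").parquet("data/example_summary.parquet")

spark.stop()
```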


## TL;DR

```bash
curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
docker-compose up
```
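
Once the containers are up, JupyterLab is served at localhost:8888 (see the cluster overview below).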

## Contents

## Quick Start

### Cluster overview

| Application | URL | Description |
| ----------- | --- | ----------- |
| JupyterLab | localhost:8888 | Cluster interface with built-in Jupyter notebooks |
| Spark Driver | localhost:4040 | Spark Driver web UI |
| Spark Master | localhost:8080 | Spark Master node |
| Spark Worker I | localhost:8081 | Spark Worker node with 1 core and 512m of memory (default) |
| Spark Worker II | localhost:8082 | Spark Worker node with 1 core and 512m of memory (default) |
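
As a rough sketch of how a notebook session maps onto this layout, the snippet below sizes its executors to fit the default 1-core/512m workers and runs a trivial job you can watch on the Spark Master UI (localhost:8080) and the driver UI (localhost:4040). The spark://spark-master:7077 master URL is an assumption; check docker-compose.yml for the real service name and port.

```python
from pyspark.sql import SparkSession

# Ask for executors that fit inside a single default worker (1 core, 512m).
# The master URL below is an assumed service name/port inside the Docker network.
spark = (
    SparkSession.builder
    .appName("cluster-overview-check")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "512m")
    .config("spark.executor.cores", "1")
    .getOrCreate()
)

# A trivial job: while it runs, it shows up under "Running Applications" on the
# Spark Master UI (localhost:8080) and exposes the driver UI on localhost:4040.
print(spark.sparkContext.parallelize(range(100)).sum())

spark.stop()
```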

### Prerequisites

### Download from Docker Hub (easier)

1. Download the docker compose file;

   ```bash
   curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
   ```

2. Edit the docker compose file with your favorite tech stack version (check the supported versions of the apps);
3. Start the cluster with `docker-compose up` (the same command as in the TL;DR above);