
# Apache Spark Standalone Cluster on Docker

The project was featured in an article on the official MongoDB tech blog! 😱

The project also got its own article on the Towards Data Science Medium blog!

## Introduction

This project gives you an Apache Spark cluster in standalone mode with a JupyterLab interface, built on top of Docker. Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) APIs by running the Jupyter notebooks, which include examples of how to read, process and write data.
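
For instance, a typical read-process-write round trip from a PySpark notebook looks roughly like the sketch below. The master URL, file paths and column name are illustrative placeholders, not values taken from this repository; check your docker-compose.yml and data for the real ones.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Connect the notebook to the standalone cluster.
# "spark://spark-master:7077" is an assumed service name/port inside the
# Docker network; check your docker-compose.yml for the actual values.
spark = (
    SparkSession.builder
    .appName("readme-example")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

# Read: load a CSV file (path is illustrative) into a DataFrame.
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)

# Process: a simple aggregation, distributed across the workers.
summary = df.groupBy("category").agg(F.count("*").alias("rows"))

# Write: persist the result back to disk as Parquet.
summary.write.mode("overwrite").parquet("data/example_summary.parquet")

spark.stop()
```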


## TL;DR

```bash
curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
docker-compose up
```
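
Once the containers are up, JupyterLab is served at localhost:8888 (see the cluster overview below).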

## Contents

## Quick Start

### Cluster overview

| Application | URL | Description |
| ----------- | --- | ----------- |
| JupyterLab | localhost:8888 | Cluster interface with built-in Jupyter notebooks |
| Spark Driver | localhost:4040 | Spark Driver web UI |
| Spark Master | localhost:8080 | Spark Master node |
| Spark Worker I | localhost:8081 | Spark Worker node with 1 core and 512m of memory (default) |
| Spark Worker II | localhost:8082 | Spark Worker node with 1 core and 512m of memory (default) |
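
As a rough sketch of how a notebook session maps onto this layout, the snippet below sizes its executors to fit the default 1-core/512m workers and runs a trivial job you can watch on the Spark Master UI (localhost:8080) and the driver UI (localhost:4040). The spark://spark-master:7077 master URL is an assumption; check docker-compose.yml for the real service name and port.

```python
from pyspark.sql import SparkSession

# Ask for executors that fit inside a single default worker (1 core, 512m).
# The master URL below is an assumed service name/port inside the Docker network.
spark = (
    SparkSession.builder
    .appName("cluster-overview-check")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "512m")
    .config("spark.executor.cores", "1")
    .getOrCreate()
)

# A trivial job: while it runs, it shows up under "Running Applications" on the
# Spark Master UI (localhost:8080) and exposes the driver UI on localhost:4040.
print(spark.sparkContext.parallelize(range(100)).sum())

spark.stop()
```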

### Prerequisites

### Download from Docker Hub (easier)

1. Download the docker compose file;

   ```bash
   curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
   ```

2. Edit the docker compose file with your favorite tech stack version (check the supported versions of the apps);
3. Start the cluster with `docker-compose up` (the same command as in the TL;DR above);