# Assignment F: Graph Data   (10 Pts)
### Challenges
- [Challenge 1:](#1-challenge-understanding-graph-data) Understanding Graph Data
- [Challenge 2:](#2-challenge-representing-graph-data-in-python) Representing Graph Data in Python
- [Challenge 3:](#3-challenge-implementing-the-graph-in-python) Implementing the Graph in Python
- [Challenge 4:](#4-challenge-implementing-dijkstras-shortest-path-algorithm) Implementing Dijkstra's Shortest Path Algorithm
- [Challenge 5:](#5-challenge-run-for-another-graph) Run for Another Graph
Points: [1, 1, 2, 4, 2]
 
### 1.) Challenge: Understanding Graph Data
A *[Graph](https://en.wikipedia.org/wiki/Graph_theory)*
is a pair G = (N, E) of a set of nodes (vertices) N and a set of edges E connecting nodes.
A *weighted Graph* has a *weight* (number) associated with each edge.
A *[Path](https://en.wikipedia.org/wiki/Path_(graph_theory))*
is a subset of edges that connects a subset of nodes.
We consider connected graphs where every node can be reached from any other
node by at least one path (no disconnected subgraphs).
Graphs may have cycles (paths that lead back to nodes visited before), and
paths may join at nodes that are part of other paths.
Traversal is the process of visiting each node of a graph exactly once.
Multiple visits of graph nodes caused by cycles or joins must be detected,
either by marking visited nodes (not preferred, since it alters the data set)
or by keeping a separate record of visits.
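The visit bookkeeping can be sketched in Python. The adjacency dict below is a made-up example graph containing a cycle `A-B-C-A` and a join at node `D`:

```python
def traverse(edges, start):
    """Depth-first traversal: visits each reachable node exactly once."""
    visited = set()        # separate record of visits; graph data unchanged
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue       # node reached again via a cycle or join -> skip
        visited.add(node)
        order.append(node)
        stack.extend(edges.get(node, []))
    return order

# made-up example: cycle A-B-C-A, join at D (reached from A and from C)
edges = {"A": ["B", "D"], "B": ["C"], "C": ["A", "D"], "D": []}
print(traverse(edges, "A"))    # each node appears exactly once
```

Keeping the `visited` set outside the graph data means the graph itself stays read-only during traversal.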
Write two properties that distinguish graphs from trees.
(1 Pt)
 
### 2.) Challenge: Representing Graph Data in Python
Python has no built-in data type that supports graph data.
Separate packages may be used, such as
[NetworkX](https://networkx.org/).
In this assignment, we focus on basic Python data structures.
1. How can graphs be represented in general?
1. How can these be implemented using Python base data structures?
1. Which data structure would be efficient, given that in the
example below the graph is constant and only traversal operations
are performed?
(1 Pt)
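As a starting point for these questions, here are two common general representations, sketched for a tiny made-up weighted graph (node names are illustrative only):

```python
# adjacency "list" as a dict of dicts: compact for sparse graphs and
# fast to iterate over a node's neighbors during traversal
adj_list = {
    "A": {"B": 2, "D": 8},
    "B": {"A": 2, "D": 5},
    "D": {"A": 8, "B": 5},
}

# adjacency matrix: O(1) edge lookup, O(|N|^2) memory; a sentinel such
# as float("inf") would mark missing edges in a sparse graph
nodes = ["A", "B", "D"]
adj_matrix = [
    [0, 2, 8],    # weights of edges from A to A, B, D
    [2, 0, 5],    # weights of edges from B
    [8, 5, 0],    # weights of edges from D
]

# both representations yield the same weight for edge A-D
i, j = nodes.index("A"), nodes.index("D")
print(adj_list["A"]["D"], adj_matrix[i][j])    # 8 8
```

For a graph that is constant and only traversed, a read-only adjacency dict is usually a good fit, since traversal mainly iterates over a node's neighbors.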
 
### 3.) Challenge: Implementing the Graph in Python
Watch the video and understand how
[Dijkstra's Shortest Path Algorithm](https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm)
(1956) works and which information it needs.
*Edsger W. Dijkstra* (1930-2003,
[bio](https://en.wikipedia.org/wiki/Edsger_W._Dijkstra))
was a Dutch computer scientist, programmer and software engineer.
He was a professor of Computer Science at the University of Texas at Austin
and received numerous awards, including the
[Turing Award](https://en.wikipedia.org/wiki/Turing_Award)
in 1972.
<!--
[video (FelixTechTips)](https://youtu.be/bZkzH5x0SKU?si=n8Z2ZIfbB73_v1TE)
<img src="../markup/img/graph_2a.jpg" alt="drawing" width="640"/>
-->
[Video (Mike Pound, Computerphile)](https://youtu.be/GazC3A4OQTE?si=ZuBEcWaBzuKmPMqA)
<img src="../markup/img/graph_1.jpg" alt="drawing" width="640"/>
Node `S` forms the start of the algorithm, node `E` is the destination.
Draw a sketch of the data structures needed to represent the graph with
nodes, edges and weights and also the data needed for the algorithm.
Create a Python file `shortest_path.py` with
- declarations of data structures you may need for the graph and
information for the algorithm and
- data to represent the graph in the video with nodes: {A ... K, S} and
the shown edges with weights.
(2 Pts)
&nbsp;
### 4.) Challenge: Implementing Dijkstra's Shortest Path Algorithm
Implement Dijkstra's Algorithm.
Output the shortest path as a sequence of nodes, followed by an analysis and
the shortest distance.
```
shortest path: S -> B -> H -> G -> E
analysis:
S->B(2)
B->H(1)
H->G(2)
G->E(2)
shortest distance is: 7
```
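A minimal sketch of one possible implementation with Python's `heapq` (function and variable names are illustrative, not prescribed); for brevity it is shown on the smaller graph from Challenge 5 rather than the video graph:

```python
import heapq

def dijkstra(edges, start, end):
    """Return (path, distance) of a shortest path from start to end.

    edges: dict mapping node -> list of (neighbor, weight) pairs.
    Assumes end is reachable from start.
    """
    dist = {start: 0}      # best known distance per node
    prev = {}              # predecessor on the best path found so far
    visited = set()
    queue = [(0, start)]   # priority queue of (distance, node)
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue       # stale queue entry, node already settled
        visited.add(node)
        if node == end:
            break
        for neighbor, weight in edges.get(node, []):
            nd = d + weight
            if neighbor not in dist or nd < dist[neighbor]:
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(queue, (nd, neighbor))
    # reconstruct path by walking predecessors backwards from end
    path, node = [end], end
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[end]

# graph from Challenge 5, stored as an undirected adjacency dict
weights = [("A","B",2), ("B","E",6), ("E","C",9), ("A","D",8),
           ("B","D",5), ("D","E",3), ("D","F",2), ("E","F",1), ("F","C",3)]
edges = {}
for a, b, w in weights:
    edges.setdefault(a, []).append((b, w))
    edges.setdefault(b, []).append((a, w))

path, distance = dijkstra(edges, "A", "C")
print("shortest path:", " -> ".join(path))
print("shortest distance is:", distance)    # shortest distance is: 12
```

Note that this graph also contains a second path of equal cost 12 (A -> B -> E -> F -> C, 2+6+1+3); which of the two a Dijkstra implementation reports depends on its tie-breaking.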
(4 Pts)
&nbsp;
### 5.) Challenge: Run for Another Graph
Run your algorithm for another graph G: {A ... F} with weights:
```
G: {A, B, C, D, E, F}, start: A, end: C
Weights:
AB(2), BE(6), EC(9), AD(8), BD(5),
DE(3), DF(2), EF(1), FC(3)
```
Output the result:
```
shortest path: A -> B -> D -> F -> C
analysis:
A->B(2)
B->D(5)
D->F(2)
F->C(3)
shortest distance is: 12
```
(2 Pts)
# Assignment G: Docker &nbsp; (18 Pts)
This assignment sets up Docker. If you already have it, simply run the challenges and answer the questions.
Docker is a popular software packaging, distribution and execution infrastructure using containers.
- Docker runs natively on Linux only (LXC); Mac and Windows have built adapter technologies.
- Windows uses an internal Linux VM to run the Docker engine ( *dockerd* ).
- Client tools (CLI, GUI, e.g.
[Docker Desktop](https://docs.docker.com/desktop/install/windows-install/)
for Windows) are used to manage and execute containers.
Docker builds on Linux technologies:
- stackable layers of filesystem images that each contain only a diff to an underlying image.
- tools to build, manage and distribute layered images ("ship containers").
- Linux LXC technology to “execute containers” as groups of isolated processes on a Linux system (create/run a new container, start/stop/join container).
Solomon Hykes, PyCon 2013, Santa Clara CA: *"The Future of Linux Containers"* ([watch](https://www.youtube.com/watch?v=9xciauwbsuo), 5:21min).
### Challenges
- [Challenge 1:](#1-challenge-docker-setup-and-cli) Docker Setup and CLI
- [Challenge 2:](#2-challenge-run-hello-world-container) Run *hello-world* Container
- [Challenge 3:](#3-challenge-run-minimal-alpine-python-container) Run minimal (Alpine) Python Container
- [Challenge 4:](#4-challenge-configure-alpine-container-for-ssh) Configure Alpine Container for *ssh*
- [Challenge 5:](#5-challenge-build-alpine-python-container-with-ssh-access) Build Alpine-Python Container with *ssh*-Access
- [Challenge 6:](#6-challenge-setup-ide-to-develop-code-in-alpine-python-container) Setup IDE to develop Code in Alpine-Python Container
- [Challenge 7:](#7-challenge-run-jupyter-in-docker-container) Run *Jupyter* in Docker Container
Points: [2, 2, 2, 4, 4, 2, 2]
&nbsp;
### 1.) Challenge: Docker Setup and CLI
[Docker Desktop](https://docs.docker.com/desktop)
bundles all components necessary to run Docker on your
system (Windows, Mac, Linux). It comes with a GUI that makes using Docker
easier and is recommended for beginners.
Components can also be installed individually (e.g. "Docker Engine"), but this
may involve installation of dependencies such as the WSL virtual machine on Windows.
The **Docker CLI** is the Docker command-line interface needed to run `docker`
commands in a terminal.
After setting up Docker Desktop, open a terminal and type commands:
```sh
> docker --version
Docker version 20.10.17, build 100c701
> docker --help
...
> docker ps                 # dockerd is not running
error during connect: This error may indicate that the docker daemon is not running.
> docker ps                 # dockerd is now running, no containers yet
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
```
If you can't run the `docker` command, the client-side **Docker CLI** (Command-Line Interface) may not be installed or not on the PATH. If `docker ps` says "can't connect", the **Docker engine** (server-side: *dockerd* ) is not running and must be started.
(2 Pts)
&nbsp;
### 2.) Challenge: Run *hello-world* Container
Run the *hello-world* container from Docker-Hub: [hello-world](https://hub.docker.com/_/hello-world):
```sh
> docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
2db29710123e: Pull complete
Digest: sha256:62af9efd515a25f84961b70f973a798d2eca956b1b2b026d0a4a63a3b0b6a3f2
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
```
Show the container image loaded on your system:
```sh
> docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
hello-world latest feb5d9fea6a5 12 months ago 13.3kB
```
Show that the container is still present after the end of execution:
```sh
> docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
da16000022e0 hello-world "/hello" 6 min ago Exited(0) magical_aryabhata
```
Re-start the container with an attached (-a) *stdout* terminal.
Refer to the container either by its ID ( *da16000022e0* ) or by its
generated NAME ( *magical_aryabhata* ).
```sh
> docker start da16000022e0 -a      # or: docker start magical_aryabhata -a
Hello from Docker!
This message shows that your installation appears to be working correctly.
```
Re-running `docker run` creates a new container and executes it. `docker ps -a` will then
show two containers created from the same image.
```sh
> docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
> docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
da16000022e0 hello-world "/hello" 6 min ago Exited(0) magical_aryabhata
40e605d9b027 hello-world "/hello" 4 sec ago Exited(0) pedantic_rubin
```
"Run" always creates new containers while "start" restarts existing containers.
(2 Pts)
&nbsp;
### 3.) Challenge: Run minimal (Alpine) Python Container
[Alpine](https://www.alpinelinux.org) is a minimal base image that has become
popular for building lean containers (a few MB as opposed to hundreds of MB or GBs).
Being mindful of resources is important for container deployments in cloud
environments where large numbers of containers are deployed and resource use
is billed.
Pull the latest Alpine image from Docker-Hub (no container is created with just
pulling the image). Mind image sizes: hello-world (13.3kB), alpine (5.54MB).
```sh
> docker pull alpine:latest
latest: Pulling from library/alpine
Digest: sha256:bc41182d7ef5ffc53a40b044e725193bc10142a1243f395ee852a8d9730fc2ad
Status: Image is up to date for alpine:latest
docker.io/library/alpine:latest
> docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
hello-world latest feb5d9fea6a5 12 months ago 13.3kB
alpine latest 9c6f07244728 8 weeks ago 5.54MB
```
Create and run an Alpine container executing an interactive shell `/bin/sh`
attached to the terminal (`-it`). The shell runs commands inside the Alpine
container.
```sh
> docker run -it alpine:latest /bin/sh
# ls -la
total 64
drwxr-xr-x 1 root root 4096 Oct 5 18:32 .
drwxr-xr-x 1 root root 4096 Oct 5 18:32 ..
-rwxr-xr-x 1 root root 0 Oct 5 18:32 .dockerenv
drwxr-xr-x 2 root root 4096 Aug 9 08:47 bin
drwxr-xr-x 5 root root 360 Oct 5 18:32 dev
drwxr-xr-x 1 root root 4096 Oct 5 18:32 etc
drwxr-xr-x 2 root root 4096 Aug 9 08:47 home
drwxr-xr-x 7 root root 4096 Aug 9 08:47 lib
drwxr-xr-x 5 root root 4096 Aug 9 08:47 media
drwxr-xr-x 2 root root 4096 Aug 9 08:47 mnt
drwxr-xr-x 2 root root 4096 Aug 9 08:47 opt
dr-xr-xr-x 179 root root 0 Oct 5 18:32 proc
drwx------ 1 root root 4096 Oct 5 18:36 root
drwxr-xr-x 2 root root 4096 Aug 9 08:47 run
drwxr-xr-x 2 root root 4096 Aug 9 08:47 sbin
drwxr-xr-x 2 root root 4096 Aug 9 08:47 srv
dr-xr-xr-x 13 root root 0 Oct 5 18:32 sys
drwxrwxrwt 2 root root 4096 Aug 9 08:47 tmp
drwxr-xr-x 7 root root 4096 Aug 9 08:47 usr
drwxr-xr-x 12 root root 4096 Aug 9 08:47 var
# whoami
root
# uname -a
Linux aab69035680f 5.10.124-linuxkit #1 SMP Thu Jun 30 08:19:10 UTC 2022 x86_64
# exit
```
Commands after the `#` prompt (*root* prompt) are executed by the `/bin/sh` shell
inside the container.
`# exit` ends the shell process and returns to the surrounding shell. The container
will go into a dormant (inactive) state.
```sh
> docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
aab69035680f alpine:latest "/bin/sh" 9 min ago Exited boring_ramanujan
```
The container can be restarted with any number of `/bin/sh` shell processes.
Containers are executed by **process groups** - so-called
[cgroups](https://en.wikipedia.org/wiki/Cgroups) used by
[LXC](https://wiki.gentoo.org/wiki/LXC) -
that share the same environment (filesystem view, ports, etc.), but are isolated
from process groups of other containers.
Start a shell process in the dormant Alpine container to re-activate it.
The start command will execute the default command that is built into the container
(see the COMMAND column: `"/bin/sh"`). The option `-ai` attaches *stdout* and *stdin*
of the terminal to the container.
Write *"Hello, container"* into a file: `/tmp/hello.txt`. Don't leave the shell.
```sh
> docker start aab69035680f -ai
# echo "Hello, container!" > /tmp/hello.txt
# cat /tmp/hello.txt
Hello, container!
#
```
Start another shell in another terminal for the container. Since it refers to the same
container, both shell processes share the same filesystem.
The second shell can therefore see the file created by the first and append another
line, which again will be seen by the first shell.
```sh
> docker start aab69035680f -ai
# cat /tmp/hello.txt
Hello, container!
# echo "How are you?" >> /tmp/hello.txt
```
First terminal:
```sh
# cat /tmp/hello.txt
Hello, container!
How are you?
#
```
To perform commands other than the default command in a running container,
use `docker exec`.
Execute command: `cat /tmp/hello.txt` in a third terminal:
```sh
docker exec aab69035680f cat /tmp/hello.txt
Hello, container!
How are you?
```
The execution creates a new process that runs in the container, seeing its filesystem
and other resources.
Explain the next command:
- What is the result?
- How many processes are involved?
- Draw a sketch with the container, processes and their stdin/-out connections.
```sh
echo "echo That\'s great to hear! >> /tmp/hello.txt" | \
docker exec -i aab69035680f /bin/sh
```
When all processes have exited, the container will return to the dormant state.
It will preserve the created file.
(2 Pts)
&nbsp;
### 4.) Challenge: Configure Alpine Container for *ssh*
Create a new Alpine container with name `alpine-ssh` and configure it for
[ssh](https://en.wikipedia.org/wiki/Secure_Shell) access.
```sh
docker run --name alpine-ssh -p 22:22 -it alpine:latest
```
Instructions for installation and configuration can be found here:
["How to install OpenSSH server on Alpine Linux"](https://www.cyberciti.biz/faq/how-to-install-openssh-server-on-alpine-linux-including-docker) or here:
["Setting up a SSH server"](https://wiki.alpinelinux.org/wiki/Setting_up_a_SSH_server).
Add a local user *larry* with *sudo*-rights, install *sshd* listening on the
default port 22.
Write down commands that you used for setup and configuration to enable the
container to run *sshd*.
Verify that *sshd* is running in the container:
```sh
# ps -a
PID USER TIME COMMAND
1 root 0:00 /bin/sh
254 root 0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
261 root 0:00 ps -a
```
Show that *ssh* is working by logging in as *larry* from another terminal:
```sh
> ssh larry@localhost
Welcome to Alpine!
The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <http://wiki.alpinelinux.org/>.
You can setup the system with the command: setup-alpine
You may change this message by editing /etc/motd.
54486c62d745:~$ whoami
larry
54486c62d745:~$ ls -la
total 32
drwxr-sr-x 1 larry larry 4096 Oct 2 21:34 .
drwxr-xr-x 1 root root 4096 Oct 2 20:40 ..
-rw------- 1 larry larry 602 Oct 5 18:53 .ash_history
54486c62d745:~$ uname -a
Linux 54486c62d745 5.10.124-linuxkit #1 SMP Thu Jun 30 08:19:10 UTC 2022 x86_64 Linux
54486c62d745:~$
```
(4 Pts)
&nbsp;
### 5.) Challenge: Build Alpine-Python Container with *ssh*-Access
The official [`python:latest`](https://hub.docker.com/_/python/tags) image is ~340MB while [`python:3.9.0-alpine`](https://hub.docker.com/_/python/tags?name=3.9-alpine&page=1) is ~18MB. The Alpine version builds on minimal Alpine Linux while the official version builds on Debian. "Minimal" means the set of commands and tools available inside the container is restricted. Only basic tools are available; required additional tools need to be installed into the container.
Build a new ```alpine-python-sshd``` container based on the ```python:3.9.0-alpine``` image that includes Python 3.9.0 and ssh-access so that your IDE can remotely connect to the container and run/debug Python code inside the container, which is the final challenge.
Copy the file [print_sys.py](https://github.com/sgra64/cs4bigdata/blob/main/A_setup_python/print_sys.py) from Assignment A into larry's ```$HOME``` directory and execute it.
```sh
> ssh larry@localhost
Welcome to Alpine!
54486c62d745:~$ python print_sys.py
Python impl: CPython
Python version: #1 SMP Thu Jun 30 08:19:10 UTC 2022
Python machine: x86_64
Python system: Linux
Python version: 3.9.0
54486c62d745:~$
```
(4 Pts)
&nbsp;
### 6.) Challenge: Setup IDE to develop Code in Alpine-Python Container
Set up your IDE to run/debug Python code inside the `alpine-python-sshd` container. In Visual Studio Code (with extensions for "Remote Development", "Docker" and "Dev Containers"), go to the Docker side-tab, right-click the running container and select "Attach Visual Studio Code". This opens a new VSCode window with a view from inside the container, showing file [print_sys.py](https://github.com/sgra64/cs4bigdata/blob/main/A_setup_python/print_sys.py) from the previous challenge.
Run this file in the IDE connected to the container. Output will show it running under Linux, Python 3.9.0 in `/home/larry`.
<!-- ![Remote Code](Setup_img01.png) -->
<img src="../markup/img/G_docker_img01.png" alt="drawing" width="640"/>
(2 Pts)
&nbsp;
### 7.) Challenge: Run *Jupyter* in Docker Container
Set up a Jupyter server from the [Jupyter Docker Stack](https://jupyter-docker-stacks.readthedocs.io/en/latest/index.html). Jupyter Docker Stacks are a set of ready-to-run Docker images containing Jupyter applications and interactive computing tools.
[Selecting an Image](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html) decides about the features preinstalled for Jupyter. Configurations exist for *all-spark-notebook* building on *pyspark-notebook* building on *scipy-notebook*, which builds on a *minimal-* and *base-notebook*, which builds on an *Ubuntu LTS* distribution. Other variations exist for *tensorflow-*, *datascience-*, or *R-notebooks*.
![Remote Code](https://jupyter-docker-stacks.readthedocs.io/en/latest/_images/inherit.svg)
Pull the image for the *minimal-notebook* (415 MB, [tags](https://hub.docker.com/r/jupyter/minimal-notebook/tags/) ) and start it.
```sh
docker pull jupyter/minimal-notebook:latest
docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
jupyter/minimal-notebook latest 33f2fa3eb079 18h ago 1.39GB
```
Create the container with Jupyter's default port 8888 exposed to the host environment.
Watch for the URL with the token in the output log.
```sh
docker run --name jupyter-minimal -p 8888:8888 jupyter/minimal-notebook
Entered start.sh with args: jupyter lab
Executing the command: jupyter lab
[I 2022-10-10 21:53:22.855 ServerApp] jupyterlab | extension was successfully linked.
...
To access the server, open this file in a browser:
http://127.0.0.1:8888/lab?token=6037ff448a79463b97e3c29af712b9395dd8548b71d77769
```
After the first access with the token included in the URL, the browser
opens with http://localhost:8888/lab.
<!-- ![Remote Code](Setup_img02.png) -->
<img src="../markup/img/G_docker_img02.png" alt="drawing" width="640"/>
Start to work with Jupyter. A Jupyter notebook is a web form composed
of cells where Python commands can be entered. Execution is triggered
by `SHIFT + Enter`. Run the code below (copy & paste from file *print_sys.py* ).
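If the file is not at hand, a cell with similar output can be sketched (assuming *print_sys.py* prints platform information via Python's `platform` module; see the linked repository for the exact file contents):

```python
import platform

# print interpreter and system information, similar to print_sys.py
print("Python impl:", platform.python_implementation())
print("Python version:", platform.python_version())
print("Python machine:", platform.machine())
print("Python system:", platform.system())
```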
<!-- ![Remote Code](Setup_img03.png) -->
<img src="../markup/img/G_docker_img03.png" alt="drawing" width="640"/>
The notebook is stored **inside** the container under *Untitled.ipynb*.
Shut down the Jupyter-server with: File -> Shutdown.
Restart the (same) container as a "daemon" process running in the background (the container remembers flags given at creation: `-p 8888:8888`). The flag `-a` attaches a terminal to show log lines with the token-URL.
```sh
docker start jupyter-minimal -a
```
After login, the prior notebook with the code above is still there under *Untitled.ipynb*. Open and re-run the notebook.
A container preserves its state, e.g. files that get created. Docker
simply adds them in another image layer. Therefore, the size of a
container is only defined by state changes that occurred after the
creation of the container instance.
Shut down the Jupyter server with: File -> Shutdown, not with:
```sh
docker stop jupyter-minimal
```
(2 Pts)
#######################################################################
# Dockerfile to build image for Python-Alpine container that also has
# ssh access.
#
# Use python:alpine image and install needed packages for ssh:
# - openrc, system services control system
# - openssh, client- and server side ssh
# - sudo, utility to enable root rights to users
#
FROM python:3.9.0-alpine
RUN apk update
RUN apk add --no-cache openrc
RUN apk add --update --no-cache openssh
RUN apk add --no-cache sudo
# adjust sshd configuration
RUN echo 'PasswordAuthentication yes' >> /etc/ssh/sshd_config
RUN echo 'PermitEmptyPasswords yes' >> /etc/ssh/sshd_config
RUN echo 'IgnoreUserKnownHosts yes' >> /etc/ssh/sshd_config
# add user larry with empty password
RUN adduser -h /home/larry -s /bin/sh -D larry
RUN echo -n 'larry:' | chpasswd
# add larry to sudo'ers list
RUN mkdir -p /etc/sudoers.d
RUN echo '%wheel ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers.d/wheel
RUN adduser larry wheel
# generate host key
RUN ssh-keygen -A
# add sshd as service, start on boot [default], touch file to prevent error:
# "You are attempting to run an openrc service on a system which openrc did not boot."
RUN rc-update add sshd default
RUN mkdir -p /run/openrc
RUN touch /run/openrc/softlevel
# sshd is started in /entrypoint.sh
#
#######################################################################
ENTRYPOINT ["/entrypoint.sh"]
EXPOSE 22
COPY entrypoint.sh /
#!/bin/sh
# ssh-keygen -A
exec /usr/sbin/sshd -D -e "$@"
#######################################################################
# Dockerfile to build image for Alpine container that has sshd daemon.
#
# Use bare Alpine image and install all needed packages:
# - openrc, system services control system
# - openssh, client- and server side ssh
# - sudo, utility to enable root rights to users
#
FROM alpine:latest
RUN apk update
RUN apk add --no-cache openrc
RUN apk add --update --no-cache openssh
RUN apk add --no-cache sudo
# adjust sshd configuration
RUN echo 'PasswordAuthentication yes' >> /etc/ssh/sshd_config
RUN echo 'PermitEmptyPasswords yes' >> /etc/ssh/sshd_config
RUN echo 'IgnoreUserKnownHosts yes' >> /etc/ssh/sshd_config
# add user larry with empty password
RUN adduser -h /home/larry -s /bin/sh -D larry
RUN echo -n 'larry:' | chpasswd
# add larry to sudo'ers list
RUN mkdir -p /etc/sudoers.d
RUN echo '%wheel ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers.d/wheel
RUN adduser larry wheel
# generate host key
RUN ssh-keygen -A
# add sshd as service, start on boot [default], touch file to prevent error:
# "You are attempting to run an openrc service on a system which openrc did not boot."
RUN rc-update add sshd default
RUN mkdir -p /run/openrc
RUN touch /run/openrc/softlevel
# sshd is started in /entrypoint.sh
#
#######################################################################
ENTRYPOINT ["/entrypoint.sh"]
EXPOSE 22
COPY entrypoint.sh /
### Docker-compose
[Docker-compose](https://docs.docker.com/compose/features-uses) is a tool set
for Docker to automate building, configuring and running containers from a single file:
[docker-compose.yaml](https://docs.docker.com/compose/compose-file) specification.
Containers are referred to as *services* in the Docker-compose specification.
When a specified image does not exist and the build-tag of a service refers to a
directory where the container can be built from a
[Dockerfile](https://docs.docker.com/engine/reference/builder) (which must be in
the same directory as `docker-compose.yaml`):
```
docker-compose up -d
```
will automatically perform these steps (`-d` starts container in background):
1. build the image from `Dockerfile`,
1. register the image locally,
1. create a new container from the image,
1. register the container locally and
1. start it.
Multiple containers can be specified in a single `docker-compose.yaml` file and
started in a defined order expressed by dependencies (`depends_on`-tag, e.g. to
express that a database service must be started before an application service
that is depending on it).
To stop all services specified in a `docker-compose.yaml` file:
```
docker-compose stop
```
To (re-)start services and show their running states:
```
docker-compose start
docker-compose ps
```
The `alpine-sshd` container can therefore always be fully reproduced from the
specifications in this directory.
Images and containers should always be reproducible. They can be deleted any
time and recovered from specifications.
Container specifications are therefore common in code repositories controlling
automated *build-* and *deployment*-processes.
The principle implies that state ("data" such as databases) should not be
stored in containers and rather reside on outside volumes that are
[mounted](https://docs.docker.com/storage/volumes)
into the container.
Build and start `alpine-sshd` container from scratch:
```
docker-compose up
[+] Running 0/1
- alpine-sshd Error 2.5s
[+] Building 0.3s (23/23) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 32B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/alpine:latest 0.0s
=> [ 1/18] FROM docker.io/library/alpine:latest 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 34B 0.0s
=> CACHED [ 2/18] RUN apk update 0.0s
=> CACHED [ 3/18] RUN apk add --no-cache openrc 0.0s
=> CACHED [ 4/18] RUN apk add --update --no-cache openssh 0.0s
=> CACHED [ 5/18] RUN apk add --no-cache sudo 0.0s
=> CACHED [ 6/18] RUN echo 'PasswordAuthentication yes' >> /etc/ssh/sshd 0.0s
=> CACHED [ 7/18] RUN echo 'PermitEmptyPasswords yes' >> /etc/ssh/sshd_c 0.0s
=> CACHED [ 8/18] RUN echo 'IgnoreUserKnownHosts yes' >> /etc/ssh/sshd_c 0.0s
=> CACHED [ 9/18] RUN adduser -h /home/larry -s /bin/sh -D larry 0.0s
=> CACHED [10/18] RUN echo -n 'larry:' | chpasswd 0.0s
=> CACHED [11/18] RUN mkdir -p /etc/sudoers.d 0.0s
=> CACHED [12/18] RUN echo '%wheel ALL=(ALL) NOPASSWD: ALL' >> /etc/sudo 0.0s
=> CACHED [13/18] RUN adduser larry wheel 0.0s
=> CACHED [14/18] RUN ssh-keygen -A 0.0s
=> CACHED [15/18] RUN rc-update add sshd default 0.0s
=> CACHED [16/18] RUN mkdir -p /run/openrc 0.0s
=> CACHED [17/18] RUN touch /run/openrc/softlevel 0.0s
=> CACHED [18/18] COPY entrypoint.sh / 0.0s
=> exporting to image 0.1s
=> => exporting layers 0.0s
=> => writing image sha256:5664d856423d679de32c4b58fc1bb55d5973acb62507d 0.0s
=> => naming to docker.io/library/alpine-sshd 0.0s
Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
[+] Running 2/2
- Network alpine-sshd_default C... 0.1s
- Container alpine-sshd-alpine-sshd-1 Created 0.2s
Attaching to alpine-sshd-alpine-sshd-1
alpine-sshd-alpine-sshd-1 | Server listening on 0.0.0.0 port 22.
alpine-sshd-alpine-sshd-1 | Server listening on :: port 22.
```
Show running container:
```
docker-compose ps
NAME COMMAND SERVICE STATUS PORTS
alpine-sshd-alpine-sshd-1 "/entrypoint.sh" alpine-sshd running 0.0.0.0:22->22/tcp
```
Log in as user *larry* that was configured when the container was built from
the `Dockerfile`:
```
ssh larry@localhost
```
Output:
```
ssh larry@localhost
The authenticity of host 'localhost (::1)' can't be established.
ED25519 key fingerprint is SHA256:5ZZ4bnRJh3DxlDaWJooC1qYjKj00U+pHCuNGEWZPVqA.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Could not create directory '/cygdrive/c/Sven1/svgr/.ssh' (No such file or directory).
Failed to add the host to the list of known hosts (/cygdrive/c/Sven1/svgr/.ssh/known_hosts).
Welcome to Alpine!
The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <http://wiki.alpinelinux.org/>.
You can setup the system with the command: setup-alpine
You may change this message by editing /etc/motd.
85dbbb7c316a:~$ ls -la
total 16
drwxr-sr-x 1 larry larry 4096 Nov 1 12:48 .
drwxr-xr-x 1 root root 4096 Oct 7 18:31 ..
-rw------- 1 larry larry 7 Nov 1 12:48 .ash_history
85dbbb7c316a:~$ whoami
larry
85dbbb7c316a:~$ pwd
/home/larry
85dbbb7c316a:~$
```
Stop container:
```
docker-compose stop
- Container alpine-sshd-alpine-sshd-1 Stopped 0.3s
docker-compose ps
NAME COMMAND SERVICE STATUS PORTS
alpine-sshd-alpine-sshd-1 "/entrypoint.sh" alpine-sshd exited (0)
```
Restart same container:
```
docker-compose start
[+] Running 1/1
- Container alpine-sshd-alpine-sshd-1 Started 0.4s
docker-compose ps
NAME COMMAND SERVICE STATUS PORTS
alpine-sshd-alpine-sshd-1 "/entrypoint.sh" alpine-sshd running 0.0.0.0:22->22/tcp
```
#######################################################################
# Create, start/stop container with docker-compose.
#
# Build image and container (once):
# - docker-compose up -d
# creates/builds image from Dockerfile, creates and runs container:
# - new local image: alpine-sshd from Dockerfile
# - new container created from image and started.
# --> container is running
#
# From the image, the (same) container can be restarted and stopped:
# - docker-compose start
# - docker-compose stop
#
#######################################################################
services:
  alpine-sshd:
    build: .
    image: alpine-sshd
    ports:
      - "22:22"     # host-env:container
    # command: ["exec", "/usr/sbin/sshd"]
#!/bin/sh
# ssh-keygen -A
exec /usr/sbin/sshd -D -e "$@"
# /etc/init.d/sshd start
# service sshd restart
#######################################################################
# Steps to set up a bare Alpine container for ssh access.
#
# create and run bare Alpine container with name "alpine-ssh"
docker run --name alpine-ssh -p 22:22 -it alpine:latest
#######################################################################
# update package list and install all needed packages
# - openrc, system services control system
# - openssh, client- and server side ssh
# - sudo, utility to enable root rights to users
#
apk update
apk add --no-cache openrc
apk add --update --no-cache openssh
apk add --no-cache sudo
# adjust sshd configuration
echo 'PasswordAuthentication yes' >> /etc/ssh/sshd_config
echo 'PermitEmptyPasswords yes' >> /etc/ssh/sshd_config
echo 'IgnoreUserKnownHosts yes' >> /etc/ssh/sshd_config
# add user larry with empty password
adduser -h /home/larry -s /bin/sh -D larry
echo -n 'larry:' | chpasswd
# add larry to sudo'ers list
mkdir -p /etc/sudoers.d
echo '%wheel ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers.d/wheel
adduser larry wheel
# generate host key in /etc/ssh, e.g. /etc/ssh/ssh_host_rsa_key.pub
ssh-keygen -A
# add sshd as service, start on boot [default], touch file to prevent error:
# "You are attempting to run an openrc service on a system which openrc did not boot."
# rc-update add sshd default
rc-update add sshd
mkdir -p /run/openrc
touch /run/openrc/softlevel
# start sshd - ssh larry@localhost now working
# /etc/init.d/sshd start
# service sshd start
# ---- exec prevents shell as parent process
# exec /usr/sbin/sshd -D -e &
exec /usr/sbin/sshd &
#
#######################################################################
#
# stop alpine-ssh container
docker stop alpine-ssh
# start alpine-ssh container, create root sh (exec requires a started container)
docker start alpine-ssh
docker exec -it alpine-ssh /bin/sh
# start sshd in container
/etc/init.d/sshd restart
service sshd restart
/etc/init.d/sshd status
service sshd status
#######################################################################
# build docker container "alpine-sshd" from Dockerfile, entrypoint.sh
# image file "alpine-sshd" is 18.5 MB
docker build -t alpine-sshd .
docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
alpine-sshd latest 0d286d424c80 1 min ago 18.5MB
docker run --name alpine-sshd -p 22:22 -it -d alpine-sshd:latest
docker start alpine-sshd
docker exec -it alpine-sshd /bin/sh
docker stop alpine-sshd
#######################################################################
# References:
# How to enable and start services on Alpine Linux
# https://www.cyberciti.biz/faq/how-to-enable-and-start-services-on-alpine-linux
# How to install OpenSSH server on Alpine Linux
# https://www.cyberciti.biz/faq/how-to-install-openssh-server-on-alpine-linux-including-docker
# https://wiki.alpinelinux.org/wiki/Setting_up_a_SSH_server
# How To Set Up a Firewall with Awall on Alpine Linux
# https://www.cyberciti.biz/faq/how-to-set-up-a-firewall-with-awall-on-alpine-linux/
# Add, Delete And Grant Sudo Privileges To Users In Alpine Linux
# https://ostechnix.com/add-delete-and-grant-sudo-privileges-to-users-in-alpine-linux/
#
#######################################################################
# Jupyter with Docker Compose
This repository contains a simple docker-compose definition for launching the popular Jupyter Data Science Notebook.
You can define a password with the script ```generate_token.py -p S-E-C-R-E-T``` and generate SSL certificates as described below.
## Control the container:
* ```docker-compose up``` mounts the directory and starts the container
* ```docker-compose down``` destroys the container
## The compose file: docker-compose.yml
```yaml
version: '3'
services:
datascience-notebook:
image: jupyter/datascience-notebook
volumes:
- ${LOCAL_WORKING_DIR}:/home/jovyan/work
- ${LOCAL_DATASETS}:/home/jovyan/work/datasets
- ${LOCAL_MODULES}:/home/jovyan/work/modules
- ${LOCAL_SSL_CERTS}:/etc/ssl/notebook
ports:
- ${PORT}:8888
container_name: jupyter_notebook
command: "start-notebook.sh \
--NotebookApp.password=${ACCESS_TOKEN} \
--NotebookApp.certfile=/etc/ssl/notebook/jupyter.pem"
```
## Example with a custom user
```YAML
version: '2'
services:
datascience-notebook:
image: jupyter/base-notebook:latest
volumes:
- /tmp/jupyter_test_dir:/home/docker_worker/work
ports:
- 8891:8888
command: "start-notebook.sh"
user: root
environment:
NB_USER: docker_worker
NB_UID: 1008
NB_GID: 1011
CHOWN_HOME: 'yes'
CHOWN_HOME_OPTS: -R
```
## The environment file .env
```bash
# Define a local data directory
# Set permissions for the container:
# sudo chown -R 1000 ${LOCAL_WORKING_DIR}
LOCAL_WORKING_DIR=/data/jupyter/notebooks
# Generate an access token like this
# import IPython as IPython
# hash = IPython.lib.passwd("S-E-C-R-E-T")
# print(hash)
# You can use the script generate_token.py
ACCESS_TOKEN=sha1:d4c78fe19cb5:0c8f830971d52da9d74b9985a8b87a2b80fc6e6a
# Host port
PORT=8888
# Provide data sets
LOCAL_DATASETS=/data/jupyter/datasets
# Provide local modules
LOCAL_MODULES=/home/git/python_modules
# SSL
# Generate cert like this:
# openssl req -x509 -nodes -newkey rsa:2048 -keyout jupyter.pem -out jupyter.pem
# Copy the jupyter.pem file into the location below.
LOCAL_SSL_CERTS=/opt/ssl-certs/jupyter
```
# Version Conflicts
Make sure you have the latest versions installed; you can upgrade from the notebook's browser interface.
```python
pip install -U jupyter
```
# pip install -r ~/work/requirements.txt
# export PYTHONPATH=$PYTHONPATH:/home/jovyan/work
version: '3'
#
# adopted from:
# https://github.com/dsmits/jupyter-docker-compose/blob/main/docker-compose.yml
# using jupyter/minimal-notebook image 1.47 GB
#
# must enable: Docker Desktop -> Settings -> Resources -> File Sharing
# path: C:/Sven1/svgr/workspaces/cs4bigdata/B_setup_docker/jupyter
#
# last started with:
# http://localhost:8888/lab?token=cbc1aa82b9144579334507e07607d4c83d8dfa9e11473df8
#
services:
jupyter:
image: jupyter/minimal-notebook
volumes:
- .:/home/jovyan/work
- ./configure_environment.sh:/usr/local/bin/before-notebook.d/configure_environment.sh
ports:
- 8888:8888
#
# alternative:
# https://github.com/stefanproell/jupyter-notebook-docker-compose
# using jupyter/datascience-notebook image 4.56 GB
#
# services:
# datascience-notebook:
# # image: jupyter/datascience-notebook
# image: jupyter/minimal-notebook
# volumes:
# - ${LOCAL_WORKING_DIR}:/home/jovyan/work
# - ${LOCAL_DATASETS}:/home/jovyan/work/datasets
# - ${LOCAL_MODULES}:/home/jovyan/work/modules
# - ${LOCAL_SSL_CERTS}:/etc/ssl/notebook
# ports:
# - ${PORT}:8888
# container_name: jupyter_notebook
# command: "start-notebook.sh \
# --NotebookApp.password=${ACCESS_TOKEN} \
# --NotebookApp.certfile=/etc/ssl/notebook/jupyter.pem"
#!/usr/bin/env sh
# import IPython as IPython
# import bcrypt
import hashlib
# python generate_token.py --password=password
#
# Generate an access token
# Copy this line into the .env file:
# ACCESS_TOKEN=5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
if __name__ == "__main__":
    print("Generate an access token")
    from argparse import ArgumentParser
    parser = ArgumentParser()
    parser.add_argument("-p",
                        "--password",
                        dest="password",
                        help="The password you want to use for authentication.",
                        required=True)
    args = parser.parse_args()
    print("\nCopy this line into the .env file:\n")
    # hash = IPython.lib.passwd(args.password)
    dataBase_password = args.password
    hash = hashlib.sha1(dataBase_password.encode()).hexdigest()
    print("ACCESS_TOKEN=" + hash)
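Note: the script prints a bare SHA-1 hexdigest, while the `ACCESS_TOKEN` example in the `.env` file uses Jupyter's salted `sha1:<salt>:<hash>` notation. A hedged sketch of producing the salted form (the 12-character salt width mimics IPython's `passwd`; this is an illustration, not the official implementation):

```python
import hashlib
import secrets

def notebook_passwd(password, algorithm="sha1"):
    # build the salted "algo:salt:hash" notation shown in the .env example
    salt = secrets.token_hex(6)  # 12 hex characters
    h = hashlib.new(algorithm)
    h.update(password.encode("utf-8") + salt.encode("ascii"))
    return "%s:%s:%s" % (algorithm, salt, h.hexdigest())

print(notebook_passwd("S-E-C-R-E-T"))
```

Whichever form you use must match what `--NotebookApp.password` expects for your Jupyter version.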
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQCgq1Hakw3KI+uv
d3LSJWGdl+uUX0FkVInUuFzQGDsyToPOo9qIIsYP5Avo3Tn7AXoQKoYz/zR0MOms
WBqkztah9ahS8uUkDkwHcJ6BIFZ9UApWQ+klEnHkBgSzDjfZhDZT/sIlEdCLOYsi
agpUdZtnDKmkpWTAAn4Tc6vLMj5Xi7mqLDybbsVlyZ/Iy/nvG1pZknmE/R7HOs8g
IkuZ7KKvbaaN6nOfujeCOCCmTyCefjU0famZxzVJxAbUwo40BMT6w7VVmUOwoNS0
LUtcVUDZ+alINrO59fu7oKyQgVr2yeVyoqUJPD/FgS/fFXSAoEtf5e1ekMXgcvl4
EGacVtwfAgMBAAECggEAGmYLvO4MhfoA74OgygZ6U3pyqp48EFATlW/1T/urPkjI
P1uMvHF6OYIussQmkqdbduyFwGVeKPkga8DOH+YcPeAvF/Hw1EvFEjPe1ziI/W35
RNNDq2OsctrKSuE7K/IdOw/QtmaG7Vk3EyB5Mgdg0T2zYeoK88F1FZ0bzPckZx27
X9Df5AXJdAEx2ikK5+vG6hDpKgDGHa5e3+96PRTqlv0AJqw9FDGLpheCNZ6F/TnR
KQWhyG3F8njQ/B97L3yiY7K7fZ/XQbgnvxFtKdSwdCkptKjQCq53alwh09Qb9YWo
l5J+TZa3xpgZX/0QEwzpSsh7kcRFZ2Zu+8fVpw14AQKBgQDTZ7D69DCblv367+c7
7PmWQkQjEGUlRxi0GRoBAgQ/FcfNnKhJVKSYThjtPcAt7E8YM/A8VlVtOiTrZnS7
Z+7CT+jRIK03jmegDgAegERO/oiH3jIRwLLFYFlOzVj3/XlwEbtRjQR2IdZkaeFC
mx9wqu5y5Vo7ml98H2Vsg8O0XwKBgQDCj8oA3wZOePSWH2Lo+BU9rP3/2SjmIzZZ
EhpO3ARa8tuxS37wFXuRtWDPb2NO2b+yzo9zxrA0Fbyx8kl3Jg03iH+eNQh12A/m
HikUoK6PgdKPZcYjONKRcH7+SUsBvxOkO1y5zqRzWwQvsAK+QaDky14IHSnRIOWx
BrkEZJPwQQKBgFPqDuguUbUQ5FPdMm4pDJFGUIGSmnOHmxix9g58XG8mGB9Xlb01
6ffC2EYjgss3x9WVmEB7DIHE2K7QBnn1MWLUEVghnmA1GJEBva5dv7+TbWJxInLF
iLCsJAcRn8UgSjnf7/jY/vJdUBqfpJiptnskfm4A+CY8irZcSAgg7WgFAoGBAKeX
PCWr9r65qdV2i7ipmYJa9R/haz1xr2riEQ9Ereu5rkv2AA3GM367gfysshpFrr7S
9vZ/e2AiKTwOvAGKIXBof6VDgVohFvDdof1Gu5aZ+UnUHOxSEe99u6ZGc/m5Ia4i
BCl5OmazS9PYBUTlOzZZh1Ht7QtbDv+CDvUdveEBAoGBALFBGDBB+syXZfnFxD6r
TZ8E3ZH9uGKWFnSLjMY1jmz0g1knFya6yn7O9N7/KoFZwMf4qUvpKBzSO8l6ZJ/w
oh7YGKrZQrcIneHm2t12BAH+7LoaNLGNXEaA5q4D2+jGC00Ma0GL/qjOHMGXT9oO
mLrQZttUIVEfOe7Qi+JnINK2
-----END PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
MIIEGTCCAwGgAwIBAgIUWiant8BjxGVzc2j/FvGP05SFmdYwDQYJKoZIhvcNAQEL
BQAwgZsxCzAJBgNVBAYTAkRFMRQwEgYDVQQIDAtCcmFuZGVuYnVyZzEVMBMGA1UE
BwwMS2xlaW5tYWNobm93MQwwCgYDVQQKDANCSFQxEzARBgNVBAsMCkluZm9ybWF0
aWsxEDAOBgNVBAMMB2p1cHl0ZXIxKjAoBgkqhkiG9w0BCQEWG3N2ZW4uZ3JhdXBu
ZXJAYmh0LWJlcmxpbi5kZTAeFw0yMjEyMTgyMDQ4MDRaFw0yMzAxMTcyMDQ4MDRa
MIGbMQswCQYDVQQGEwJERTEUMBIGA1UECAwLQnJhbmRlbmJ1cmcxFTATBgNVBAcM
DEtsZWlubWFjaG5vdzEMMAoGA1UECgwDQkhUMRMwEQYDVQQLDApJbmZvcm1hdGlr
MRAwDgYDVQQDDAdqdXB5dGVyMSowKAYJKoZIhvcNAQkBFhtzdmVuLmdyYXVwbmVy
QGJodC1iZXJsaW4uZGUwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCg
q1Hakw3KI+uvd3LSJWGdl+uUX0FkVInUuFzQGDsyToPOo9qIIsYP5Avo3Tn7AXoQ
KoYz/zR0MOmsWBqkztah9ahS8uUkDkwHcJ6BIFZ9UApWQ+klEnHkBgSzDjfZhDZT
/sIlEdCLOYsiagpUdZtnDKmkpWTAAn4Tc6vLMj5Xi7mqLDybbsVlyZ/Iy/nvG1pZ
knmE/R7HOs8gIkuZ7KKvbaaN6nOfujeCOCCmTyCefjU0famZxzVJxAbUwo40BMT6
w7VVmUOwoNS0LUtcVUDZ+alINrO59fu7oKyQgVr2yeVyoqUJPD/FgS/fFXSAoEtf
5e1ekMXgcvl4EGacVtwfAgMBAAGjUzBRMB0GA1UdDgQWBBS4ojNVtQI3kPq8uVKh
lMsZPSpQUzAfBgNVHSMEGDAWgBS4ojNVtQI3kPq8uVKhlMsZPSpQUzAPBgNVHRMB
Af8EBTADAQH/MA0GCSqGSIb3DQEBCwUAA4IBAQBoEUE2m4EeC7SIrv3SESbiAQc4
qtXpJkS8jdQq3cuBx0zvDf41nN84JmxCSGLYsYDnQKjY5KgW9Azkf0jANEvxM6re
Ay4nyZAU9SmQWj4tQAVD5hnmtN3mE7bWwjaRr7B/D407NpmhhfamX2k+hFKc3Kr6
ilUNNgQoOqSoo3hEhavrpPIRhVDWza/glOd2coPLycbt4D7yYT7QrMcadCotiLnr
yBmPpKgaNQaNAHKVSkVN2aKKKtzLZkVutaQRL5W4OgWe5+J5dy4WeiNFB4jxqR46
C33htIGVmZW5Isa5kIQn0SOSaK3Nb+C4WTlJU87tTdmZnL8MDTugEN8YY8jq
-----END CERTIFICATE-----
# Assignment H1: Solve the Shakespeare Challenge &nbsp; (8 Pts)
[William Shakespeare](https://en.wikipedia.org/wiki/William_Shakespeare) (1564 ‐ 1616)
has written many plays that have become popular for text analysis.
- [Project Gutenberg](http://gutenberg.org) is one project that has compiled
Shakespeare’s plays into a single text file
[Shakespeare.txt](data/Shakespeare.txt).
- [Processing Shakespeare](https://lmackerman.com/AdventuresInR/docs/shakespeare.nb.html)
is a project that aims at visualizing Shakespeare’s texts.
&nbsp;
---
Task: [H1_Analyzing_Shakespeare.pdf](H1_Analyzing_Shakespeare.pdf)
- use: [Shakespeare.txt](data/Shakespeare.txt).
# Apache Spark Standalone Cluster on Docker
> The project was featured in an **[article](https://www.mongodb.com/blog/post/getting-started-with-mongodb-pyspark-and-jupyter-notebook)** at **MongoDB** official tech blog! :scream:
> The project just got its own **[article](https://towardsdatascience.com/apache-spark-cluster-on-docker-ft-a-juyterlab-interface-418383c95445)** at **Towards Data Science** Medium blog! :sparkles:
## Introduction
This project gives you an **Apache Spark** cluster in standalone mode with a **JupyterLab** interface built on top of **Docker**.
Learn Apache Spark through its **Scala**, **Python** (PySpark) and **R** (SparkR) API by running the Jupyter [notebooks](build/workspace/) with examples on how to read, process and write data.
<p align="center"><img src="docs/image/cluster-architecture.png"></p>
![build-master](https://github.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/workflows/build-master/badge.svg)
![sponsor](https://img.shields.io/badge/patreon-sponsor-ff69b4)
![jupyterlab-latest-version](https://img.shields.io/docker/v/andreper/jupyterlab/3.0.0-spark-3.0.0?color=yellow&label=jupyterlab-latest)
![spark-latest-version](https://img.shields.io/docker/v/andreper/spark-master/3.0.0?color=yellow&label=spark-latest)
![spark-scala-api](https://img.shields.io/badge/spark%20api-scala-red)
![spark-pyspark-api](https://img.shields.io/badge/spark%20api-pyspark-red)
![spark-sparkr-api](https://img.shields.io/badge/spark%20api-sparkr-red)
## TL;DR
```bash
curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
docker-compose up
```
## Contents
- [Quick Start](#quick-start)
- [Tech Stack](#tech-stack)
- [Metrics](#metrics)
- [Contributing](#contributing)
- [Contributors](#contributors)
- [Support](#support)
## <a name="quick-start"></a>Quick Start
### Cluster overview
| Application | URL | Description |
| --------------- | ---------------------------------------- | ---------------------------------------------------------- |
| JupyterLab | [localhost:8888](http://localhost:8888/) | Cluster interface with built-in Jupyter notebooks |
| Spark Driver | [localhost:4040](http://localhost:4040/) | Spark Driver web ui |
| Spark Master | [localhost:8080](http://localhost:8080/) | Spark Master node |
| Spark Worker I | [localhost:8081](http://localhost:8081/) | Spark Worker node with 1 core and 512m of memory (default) |
| Spark Worker II | [localhost:8082](http://localhost:8082/) | Spark Worker node with 1 core and 512m of memory (default) |
### Prerequisites
- Install [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/install/), check **infra** [supported versions](#tech-stack)
### Download from Docker Hub (easier)
1. Download the [docker compose](docker-compose.yml) file;
```bash
curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
```
2. Edit the [docker compose](docker-compose.yml) file with your favorite tech stack version, check **apps** [supported versions](#tech-stack);
3. Start the cluster;
```bash
docker-compose up
```
4. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala, PySpark and SparkR examples;
5. Stop the cluster by typing `ctrl+c` on the terminal;
6. Run step 3 to restart the cluster.
### Build from your local machine
> **Note**: Local build is currently only supported on Linux OS distributions.
1. Download the source code or clone the repository;
2. Move to the build directory;
```bash
cd build
```
3. Edit the [build.yml](build/build.yml) file with your favorite tech stack version;
4. Match those versions on the [docker compose](build/docker-compose.yml) file;
5. Build up the images;
```bash
chmod +x build.sh ; ./build.sh
```
6. Start the cluster;
```bash
docker-compose up
```
7. Run Apache Spark code using the provided Jupyter [notebooks](build/workspace/) with Scala, PySpark and SparkR examples;
8. Stop the cluster by typing `ctrl+c` on the terminal;
9. Run step 6 to restart the cluster.
## <a name="tech-stack"></a>Tech Stack
- Infra
| Component | Version |
| -------------- | ------- |
| Docker Engine | 1.13.0+ |
| Docker Compose | 1.10.0+ |
- Languages and Kernels
| Spark | Hadoop | Scala | [Scala Kernel](https://almond.sh/) | Python | [Python Kernel](https://ipython.org/) | R | [R Kernel](https://irkernel.github.io/) |
| ----- | ------ | ------- | ---------------------------------- | ------ | ------------------------------------- | ----- | --------------------------------------- |
| 3.x | 3.2 | 2.12.10 | 0.10.9 | 3.7.3 | 7.19.0 | 3.5.2 | 1.1.1 |
| 2.x | 2.7 | 2.11.12 | 0.6.0 | 3.7.3 | 7.19.0 | 3.5.2 | 1.1.1 |
- Apps
| Component | Version | Docker Tag |
| -------------- | ----------------------- | ---------------------------------------------------- |
| Apache Spark | 2.4.0 \| 2.4.4 \| 3.0.0 | **\<spark-version>** |
| JupyterLab | 2.1.4 \| 3.0.0 | **\<jupyterlab-version>**-spark-**\<spark-version>** |
## <a name="metrics"></a>Metrics
| Image | Size | Downloads |
| -------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| [JupyterLab](https://hub.docker.com/r/andreper/jupyterlab) | ![docker-size-jupyterlab](https://img.shields.io/docker/image-size/andreper/jupyterlab/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/jupyterlab) |
| [Spark Master](https://hub.docker.com/r/andreper/spark-master) | ![docker-size-master](https://img.shields.io/docker/image-size/andreper/spark-master/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-master) |
| [Spark Worker](https://hub.docker.com/r/andreper/spark-worker) | ![docker-size-worker](https://img.shields.io/docker/image-size/andreper/spark-worker/latest) | ![docker-pull](https://img.shields.io/docker/pulls/andreper/spark-worker) |
## <a name="contributing"></a>Contributing
We'd love some help. To contribute, please read [this file](CONTRIBUTING.md).
## <a name="contributors"></a>Contributors
A list of amazing people that somehow contributed to the project can be found in [this file](CONTRIBUTORS.md). This
project is maintained by:
> **André Perez** - [dekoperez](https://twitter.com/dekoperez) - andre.marcos.perez@gmail.com
## <a name="support"></a>Support
> Support us on GitHub by starring this project :star:
> Support us on [Patreon](https://www.patreon.com/andreperez). :sparkling_heart:
# Assignment H: PySpark &nbsp; (10 Pts)
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with:
- implicit data parallelism and
- fault tolerance.
&nbsp;
---
### Challenges
1. [Challenge 1:](#1-challenge-1-get-pyspark-containers) Get PySpark Containers - (3 Pts)
1. [Challenge 2:](#2-challenge-2-set-up-simple-pyspark-program) Set-up simple PySpark Program - (2 Pts)
1. [Challenge 3:](#3-challenge-3-run-on-pyspark-cluster) Run on PySpark Cluster - (3 Pts)
1. [Challenge 4:](#4-challenge-4-explain-the-pyspark-environment) Explain the PySpark Environment - (2 Pts)
&nbsp;
---
### 1.) Challenge 1: Get PySpark Containers
Set up PySpark as a Spark Standalone Cluster with Docker:
[https://github.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker](https://github.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker)
The setup looks like this:
![text](../markup/img/H_spark_cluster_architecture.png)
One simple command will:
- fetch all needed Docker images (~1.5GB).
- create containers for: Spark-Master, 2 Worker-Processes, Jupyter-Server.
- launch all containers at once.
Clone the project and use as project directory:
```
git clone https://github.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker
```
Fetch images, create and launch all containers with one command:
```
docker-compose up
```
It will launch the following containers:
![text](../markup/img/H_img01.png)
Open Urls:
<table>
<thead>
<tr>
<th>Application</th>
<th>URL</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>JupyterLab</td>
<td><a href="http://localhost:8888/" rel="nofollow">localhost:8888</a></td>
<td>Cluster interface with built-in Jupyter notebooks</td>
</tr>
<tr>
<td>Spark Driver</td>
<td><a href="http://localhost:4040/" rel="nofollow">localhost:4040</a></td>
<td>Spark Driver web ui</td>
</tr>
<tr>
<td>Spark Master</td>
<td><a href="http://localhost:8080/" rel="nofollow">localhost:8080</a></td>
<td>Spark Master node</td>
</tr>
<tr>
<td>Spark Worker I</td>
<td><a href="http://localhost:8081/" rel="nofollow">localhost:8081</a></td>
<td>Spark Worker node with 1 core and 512m of memory (default)</td>
</tr>
<tr>
<td>Spark Worker II</td>
<td><a href="http://localhost:8082/" rel="nofollow">localhost:8082</a></td>
<td>Spark Worker node with 1 core and 512m of memory (default)</td>
</tr>
</tbody>
</table>
&nbsp;
---
### 2.) Challenge 2: Set-up simple PySpark Program
Understand the simple PySpark program `pyspark_pi.py`:
```py
from __future__ import print_function
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PyPi").getOrCreate()
slices = 1
n = 100000 * slices
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0
count = spark.sparkContext.parallelize(range(1,n+1), slices).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()
```
What is the output of the program?
What happens when the value of variable ‘slices’ increases from 1 to 2 and 4?
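For intuition, the same Monte Carlo estimate can be run without Spark; `slices` only controls how many partitions the `range` is split across, while the sample count stays `100000 * slices`. A plain-Python sketch (not part of the assignment code):

```python
from random import random, seed

def estimate_pi(n):
    # count random points in [-1, 1]^2 that fall inside the unit circle
    inside = sum(1 for _ in range(n)
                 if (random() * 2 - 1) ** 2 + (random() * 2 - 1) ** 2 <= 1)
    return 4.0 * inside / n

seed(0)
print(estimate_pi(100_000))  # roughly 3.14
```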
&nbsp;
---
### 3.) Challenge 3: Run on PySpark Cluster
Open Jupyter, [http://localhost:8888](http://localhost:8888/) and paste the code
into the cell.
Execute the cell.
![text](../markup/img/H_img02.png)
&nbsp;
---
### 4.) Challenge 4: Explain the PySpark Environment
Briefly describe the essential parts a PySpark environment consists of and explain the concepts:
- RDD, DF, DS
- Transformation
- Action
- Lineage
- Partition
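The lazy-transformation / eager-action distinction can be mimicked with plain-Python generators (a rough analogy only, not Spark itself): building the pipeline costs nothing until a terminal operation consumes it.

```python
data = range(1, 5)

# "transformation": builds a lazy pipeline, nothing is computed yet
squared = (x * x for x in data)

# "action": forces evaluation of the whole lineage and returns a result
result = sum(squared)
print(result)  # -> 30
```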
%% Cell type:code id:c3bac98d tags:
``` python
from __future__ import print_function
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession
import os
import pyspark.sql.functions as f
spark = SparkSession.builder.appName("PyPi").getOrCreate()
df_all = spark.read.option('lineSep', r'(THE\sEND)').text("./data/Shakespeare.txt")
```
%% Output
23/01/26 09:55:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
%% Cell type:code id:bb3abe55 tags:
``` python
df = df_all.withColumn('value', f.regexp_replace('value', r'<<[\w\s\d\n()\,.-]{495}>>', ''))\
    .withColumn('value', f.explode(f.split('value', r'THE\sEND', -1))) \
    .withColumn('index', f.monotonically_increasing_id())\
    .filter("index > 1")\
    .filter('index < 38') \
    .withColumn("title", f.regexp_extract('value', r'(.*)\n*by', 0))\
    .withColumn("value", f.regexp_replace('value', r'([A-Z ,]*)\n*by William Shakespeare', ''))\
    .withColumn("title", f.regexp_replace('title', r'\n*by', ''))\
    .withColumn("year", f.regexp_extract('value', r'\d{4}', 0))\
    .withColumn("value", f.regexp_replace('value', r'\d{4}', ''))\
    .withColumn('value', f.trim('value'))\
    .withColumn('value', f.regexp_replace('value', r' {2,}', ' '))\
    .withColumn('value', f.regexp_replace('value', r'\n{2,}', ''))\
    .withColumn('wordCount', f.size(f.split('value', ' ')))\
    .withColumn('lineCount', f.size(f.split('value', r'\n')))\
    .orderBy(f.col("lineCount").desc())
```
%% Cell type:code id:3991296f tags:
``` python
def play_counts(df):
    results = df.select('title', 'lineCount', 'wordCount').collect()
    for r in results:
        print(f"{r['title']}, {r['lineCount']} lines, {r['wordCount']} words.")

# df.filter("index == 37").collect()
play_counts(df)
```
%% Output
THE TRAGEDY OF HAMLET, PRINCE OF DENMARK, 3947 lines, 32079 words.
KING RICHARD III, 3914 lines, 31193 words.
THE TRAGEDY OF CORIOLANUS, 3691 lines, 29293 words.
CYMBELINE, 3649 lines, 28870 words.
THE TRAGEDY OF ANTONY AND CLEOPATRA, 3587 lines, 26552 words.
THE TRAGEDY OF OTHELLO, MOOR OF VENICE, 3479 lines, 27986 words.
THE TRAGEDY OF KING LEAR, 3433 lines, 27585 words.
THE HISTORY OF TROILUS AND CRESSIDA, 3431 lines, 27623 words.
KING HENRY THE EIGHTH, 3327 lines, 25886 words.
THE WINTER'S TALE, 3249 lines, 26059 words.
THE LIFE OF KING HENRY THE FIFTH, 3147 lines, 27498 words.
THE SECOND PART OF KING HENRY THE SIXTH, 3133 lines, 26840 words.
SECOND PART OF KING HENRY IV, 3101 lines, 27689 words.
THE TRAGEDY OF ROMEO AND JULIET, 3089 lines, 25857 words.
THE THIRD PART OF KING HENRY THE SIXTH, 3012 lines, 25873 words.
THE FIRST PART OF KING HENRY THE FOURTH, 2926 lines, 25783 words.
THE FIRST PART OF HENRY THE SIXTH, 2852 lines, 22883 words.
KING RICHARD THE SECOND, 2851 lines, 23363 words.
MEASURE FOR MEASURE, 2740 lines, 22947 words.
LOVE'S LABOUR'S LOST, 2735 lines, 22987 words.
KING JOHN, 2659 lines, 21776 words.
THE TRAGEDY OF JULIUS CAESAR, 2629 lines, 20930 words.
THE TRAGEDY OF TITUS ANDRONICUS, 2625 lines, 21701 words.
THE TAMING OF THE SHREW, 2616 lines, 22243 words.
THE MERCHANT OF VENICE, 2609 lines, 22309 words.
THE MERRY WIVES OF WINDSOR, 2579 lines, 23411 words.
AS YOU LIKE IT, 2543 lines, 22860 words.
THE LIFE OF TIMON OF ATHENS, 2437 lines, 19691 words.
MUCH ADO ABOUT NOTHING, 2425 lines, 22501 words.
THE TRAGEDY OF MACBETH, 2396 lines, 18246 words.
TWELFTH NIGHT; OR, WHAT YOU WILL, 2353 lines, 21208 words.
THE TEMPEST, 2328 lines, 17498 words.
THE TWO GENTLEMEN OF VERONA, 2196 lines, 18327 words.
A MIDSUMMER NIGHT'S DREAM, 2119 lines, 17306 words.
THE COMEDY OF ERRORS, 1815 lines, 15464 words.
A LOVER'S COMPLAINT, 283 lines, 2579 words.
%% Cell type:code id:676fab8d tags:
``` python
```
%% Cell type:code id:9d323a4b tags:
``` python
pip install pandas
```
%% Output
Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: numpy>=1.20.3 in /usr/local/lib/python3.9/dist-packages (from pandas) (1.24.1)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2022.7.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Note: you may need to restart the kernel to use updated packages.
%% Cell type:code id:fc4a8512 tags:
``` python
from __future__ import print_function
import sys
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer
import pandas as pd
def remove_header(df):
    w = Window().orderBy(F.lit('value'))
    df = df.withColumn("rowNum", F.row_number().over(w))
    df = df.filter(df.rowNum > 244)
    return df

def remove_filler(df):
    return df.filter(~ df.value.startswith("<<THIS") & \
                     ~ df.value.startswith("SHAKESPEARE IS") & \
                     ~ df.value.startswith("PROVIDED BY") & \
                     ~ df.value.startswith("WITH PERMISSION") & \
                     ~ df.value.startswith("DISTRIBUTED") & \
                     ~ df.value.startswith("PERSONAL USE") & \
                     ~ df.value.startswith("COMMERCIALLY.") & \
                     ~ df.value.startswith("SERVICE THAT"))

def get_play_rows(df):
    return df.filter(df.value.rlike('[0-9]{4}')).drop('value')

def partition_by_play(df):
    line_ids = get_play_rows(df)
    splits = [x['rowNum'] for x in line_ids.collect()]
    splits.append(float('Inf'))
    bucketizer = Bucketizer(splits=splits, inputCol="rowNum", outputCol="playNum")
    df = bucketizer.setHandleInvalid("keep").transform(df)
    return df.repartition(df.playNum)

def count_words(df):
    return df.withColumn('words', F.size(F.split(F.col('value'), ' ')))

def format_play(play, id, words):
    txt = "Play {}, words: {}, lines: {}"
    return txt.format(play, words[id]['sum(words)'], words[id]['count(value)'])
spark = SparkSession.builder.appName("PyPi").getOrCreate()
data = spark.read.text("data/Shakespeare.txt")
data = remove_header(data)
data = remove_filler(data)
data = partition_by_play(data)
data = count_words(data)
result = data.groupBy(F.col('playNum')).agg(F.sum('words'), F.count('value')).sort('playNum').collect()
play_names = [x.strip() for x in open('data/plays.txt').readlines()]
play_results = [format_play(x, id, result) for id, x in enumerate(play_names)]
[print(x) for x in play_results]
spark.stop()
```
%% Output
23/01/26 12:48:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/26 12:48:20 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
23/01/26 12:48:25 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Play THE SONNETS, words: 26469, lines: 2634
Play ALLS WELL THAT ENDS WELL, words: 37196, lines: 3199
Play THE TRAGEDY OF ANTONY AND CLEOPATRA, words: 45757, lines: 4167
Play AS YOU LIKE IT, words: 35589, lines: 2939
Play THE COMEDY OF ERRORS, words: 18817, lines: 2080
Play THE TRAGEDY OF CORIOLANUS, words: 46589, lines: 4253
Play CYMBELINE, words: 46143, lines: 4140
Play THE TRAGEDY OF HAMLET, PRINCE OF DENMARK, words: 52365, lines: 4489
Play THE FIRST PART OF KING HENRY THE FOURTH, words: 39843, lines: 3323
Play SECOND PART OF KING HENRY IV, words: 41774, lines: 3555
Play THE LIFE OF KING HENRY THE FIFTH, words: 42167, lines: 3603
Play THE FIRST PART OF HENRY THE SIXTH, words: 38962, lines: 3377
Play THE SECOND PART OF KING HENRY THE SIXTH, words: 41811, lines: 3614
Play THE THIRD PART OF KING HENRY THE SIXTH, words: 40832, lines: 3484
Play KING HENRY THE EIGHTH, words: 40894, lines: 3733
Play KING JOHN, words: 33625, lines: 2997
Play THE TRAGEDY OF JULIUS CAESAR, words: 33572, lines: 2984
Play THE TRAGEDY OF KING LEAR, words: 48177, lines: 3954
Play LOVE'S LABOUR'S LOST, words: 35286, lines: 2984
Play THE TRAGEDY OF MACBETH, words: 30479, lines: 2868
Play MEASURE FOR MEASURE, words: 35260, lines: 3095
Play THE MERCHANT OF VENICE, words: 34144, lines: 2952
Play THE MERRY WIVES OF WINDSOR, words: 35980, lines: 3104
Play A MIDSUMMER NIGHT'S DREAM, words: 29784, lines: 2424
Play MUCH ADO ABOUT NOTHING, words: 33536, lines: 2757
Play THE TRAGEDY OF OTHELLO, MOOR OF VENICE, words: 49358, lines: 3865
Play KING RICHARD THE SECOND, words: 36350, lines: 3204
Play KING RICHARD III, words: 49737, lines: 4509
Play THE TRAGEDY OF ROMEO AND JULIET, words: 41590, lines: 3584
Play THE TAMING OF THE SHREW, words: 35010, lines: 3001
Play THE TEMPEST, words: 28460, lines: 2612
Play THE LIFE OF TIMON OF ATHENS, words: 31097, lines: 2820
Play THE TRAGEDY OF TITUS ANDRONICUS, words: 34396, lines: 2961
Play THE HISTORY OF TROILUS AND CRESSIDA, words: 44082, lines: 3951
Play TWELFTH NIGHT; OR, WHAT YOU WILL, words: 32840, lines: 2767
Play THE TWO GENTLEMEN OF VERONA, words: 27622, lines: 2527
Play THE WINTER'S TALE, words: 40806, lines: 3563
Play A LOVER'S COMPLAINT, words: 3334, lines: 395
%% Cell type:code id:3f802133 tags:
``` python
```