How to install Spark locally?
-> You don’t!
You can find easy guides for sure, but they will still lead you through a forest of opaque steps: installing Java, Scala, downloading and installing Spark, and setting environment variables and paths, only to find out you have the wrong versions. Good luck and have fun!
Kubernetes in production? Kubernetes locally!
The closer you develop to your production environment, the better. You run Spark on Kubernetes in 2024? Then run Kubernetes locally with Rancher Desktop. After installation, the k3s “cluster” is available through kubectl:
$ kubectl get nodes
NAME                   STATUS   ROLES                  AGE     VERSION
lima-rancher-desktop   Ready    control-plane,master   7m20s   v1.29.0+k3s1
Spark Operator
Managing Spark applications on Kubernetes is much simpler with Spark-Operator. Install it with Helm:
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm upgrade --install spark-operator \
  spark-operator/spark-operator \
  --namespace spark-operator \
  --values ./values.yaml \
  --create-namespace
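To verify the installation, check that the operator pod is running (the exact pod name will vary):

$ kubectl get pods -n spark-operator

The operator pod should report a Running status before you submit any applications.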
With the following values.yaml file:
# https://github.com/kubeflow/spark-operator/blob/master/charts/spark-operator-chart/values.yaml
serviceAccounts:
  spark:
    create: true
    name: "spark"
  sparkoperator:
    create: true
    name: "spark-operator-spark"
webhook:
  enable: true
  port: 8080
sparkJobNamespace: default
Now that Spark-Operator is installed, you can learn here how to submit a SparkApplication.
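As a sketch, a minimal SparkApplication manifest looks like this; it is modeled on the operator's spark-pi example, and the jar path inside the image is an assumption for the Spark 3.5.0 distribution:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: ghcr.io/apache/spark-docker/spark:3.5.0
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: "3.5.0"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    instances: 1
    cores: 1
    memory: 512m

Submit it with kubectl apply and follow the driver pod with kubectl logs.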
We also need a webhook for the operator:
apiVersion: batch/v1
kind: Job
metadata:
  name: sparkoperator-init
  namespace: spark-operator
  labels:
    app.kubernetes.io/name: spark-operator
    app.kubernetes.io/version: v1beta2-1.3.0-3.1.1
spec:
  backoffLimit: 3
  template:
    metadata:
      labels:
        app.kubernetes.io/name: spark-operator
        app.kubernetes.io/version: v1beta2-1.3.0-3.1.1
    spec:
      serviceAccountName: spark-operator-spark
      restartPolicy: Never
      containers:
        - name: main
          image: ghcr.io/googlecloudplatform/spark-operator:v1beta2-1.3.8-3.1.1
          imagePullPolicy: IfNotPresent
          command: ["/usr/bin/gencerts.sh", "-p"]
---
kind: Service
apiVersion: v1
metadata:
  name: spark-webhook
  namespace: spark-operator
spec:
  ports:
    - port: 443
      targetPort: 8080
      name: webhook
  selector:
    app.kubernetes.io/name: spark-operator
    app.kubernetes.io/version: v1beta2-1.3.0-3.1.1
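After applying the manifest, you can read the resources back to confirm the init Job completed and the webhook Service is up (names follow the manifest above):

$ kubectl get job sparkoperator-init -n spark-operator
$ kubectl get svc spark-webhook -n spark-operator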
Also, we need a service account in the default namespace:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: spark-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
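To check that the role binding works, impersonate the service account with kubectl's built-in access check; it should print yes:

$ kubectl auth can-i create pods -n default --as=system:serviceaccount:default:spark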
Develop with Jupyter
For small code iterations, I enjoy working with Jupyter notebooks. You can choose any Jupyter image with Spark, for example jupyter/pyspark-notebook:spark-3.5.0. The deployment below:
- mounts a local repository as a volume into the notebook container
- exposes the container port 8888 locally with a service
- adds a headless service, so Spark can find the notebook when running in client mode
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter
  labels:
    app: jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: jupyter
          image: jupyter/pyspark-notebook:spark-3.5.0
          resources:
            requests:
              memory: 512Mi
            limits:
              memory: 512Mi
          env:
            - name: JUPYTER_PORT
              value: "8888"
          ports:
            - containerPort: 8888
          volumeMounts:
            - mountPath: /home/jovyan/work
              name: src-local
              readOnly: false
      volumes:
        - name: src-local
          hostPath:
            path: /Users/joostdobken/repos/spark-app # local path to your repository
      serviceAccount: spark
      serviceAccountName: spark
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter
spec:
  type: LoadBalancer # LoadBalancer exposes this port to your local machine
  selector:
    app: jupyter
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter-headless
spec:
  clusterIP: None
  selector:
    app: jupyter
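Apply the manifests and grab the login token from the container logs (jupyter.yaml is an assumed file name for the manifests above):

$ kubectl apply -f jupyter.yaml
$ kubectl logs deploy/jupyter | grep token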
And find the notebook at http://localhost:8888 (TODO: disable the token).
Spark Session
Run a Spark session configured to launch an application on the cluster:
import os

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("JupyterApp")
    .master("k8s://https://kubernetes.default.svc.cluster.local:443")
    .config("spark.submit.deployMode", "client")
    .config("spark.executor.instances", "1")
    .config("spark.executor.memory", "1G")
    .config("spark.driver.memory", "1G")
    .config("spark.executor.cores", "1")
    .config("spark.kubernetes.namespace", "default")
    .config(
        "spark.kubernetes.container.image", "ghcr.io/apache/spark-docker/spark:3.5.0"
    )
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.kubernetes.driver.pod.name", os.environ["HOSTNAME"])
    .config("spark.driver.bindAddress", "0.0.0.0")
    .config("spark.driver.host", "jupyter-headless.default.svc.cluster.local")
    .config(
        "spark.kubernetes.executor.volumes.hostPath.src-local.mount.path",
        "/home/jovyan/work",
    )
    .getOrCreate()
)
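As a quick smoke test, run any small job; it forces executor pods to be scheduled, which you can watch with kubectl get pods in the default namespace:

# trivial aggregation that runs on the executors
df = spark.range(1000).selectExpr("sum(id) AS total")
df.show()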
Q&A:
- How to mount a volume to the executor? Read here; a sketch follows below.
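A sketch, assuming the same hostPath volume as in the Jupyter deployment: Spark's Kubernetes volume properties come in pairs, one for the volume source and one for the mount, so add both .config calls to the session builder above:

# Both keys are needed for a hostPath executor volume; "src-local" is the
# volume name used throughout this post.
.config(
    "spark.kubernetes.executor.volumes.hostPath.src-local.options.path",
    "/Users/joostdobken/repos/spark-app",
)
.config(
    "spark.kubernetes.executor.volumes.hostPath.src-local.mount.path",
    "/home/jovyan/work",
)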