How to install Spark locally?

-> You don’t!

You can certainly find easy guides, but they will still lead you through a forest of opaque steps: installing Java and Scala, downloading and installing Spark, setting environment variables and paths, only to find out you have the wrong versions. Have fun, good luck!

Kubernetes in production? Kubernetes locally!

The closer your development environment is to production, the better. Running Spark on Kubernetes in 2024? Then run Kubernetes locally with Rancher Desktop. After installation, the k3s “cluster” is available through kubectl:

$ kubectl get nodes
NAME                   STATUS   ROLES                  AGE     VERSION
lima-rancher-desktop   Ready    control-plane,master   7m20s   v1.29.0+k3s1

Spark Operator

Managing Spark applications on Kubernetes is much simpler with the Spark Operator.

helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm upgrade --install spark-operator \
    spark-operator/spark-operator \
    --namespace spark-operator \
    --values ./values.yaml \
    --create-namespace

Using the following values.yaml:

# https://github.com/kubeflow/spark-operator/blob/master/charts/spark-operator-chart/values.yaml
serviceAccounts:
  spark:
    create: true
    name: "spark"
  sparkoperator:
    create: true
    name: "spark-operator-spark"
webhook:
  enable: true
  port: 8080
sparkJobNamespace: default
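
Check that the operator pod is up before continuing:

$ kubectl get pods --namespace spark-operator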

Now that the Spark Operator is installed, you can submit a SparkApplication; a minimal example follows after the webhook and service account below.

We also need a webhook for the operator (it mutates the driver and executor pods, for example to mount volumes):

apiVersion: batch/v1
kind: Job
metadata:
  name: sparkoperator-init
  namespace: spark-operator
  labels:
    app.kubernetes.io/name: spark-operator
    app.kubernetes.io/version: v1beta2-1.3.0-3.1.1
spec:
  backoffLimit: 3
  template:
    metadata:
      labels:
        app.kubernetes.io/name: spark-operator
        app.kubernetes.io/version: v1beta2-1.3.0-3.1.1
    spec:
      serviceAccountName: spark-operator-spark
      restartPolicy: Never
      containers:
        - name: main
          image: ghcr.io/googlecloudplatform/spark-operator:v1beta2-1.3.8-3.1.1
          imagePullPolicy: IfNotPresent
          command: ["/usr/bin/gencerts.sh", "-p"]
---
kind: Service
apiVersion: v1
metadata:
  name: spark-webhook
  namespace: spark-operator
spec:
  ports:
    - port: 443
      targetPort: 8080
      name: webhook
  selector:
    app.kubernetes.io/name: spark-operator
    app.kubernetes.io/version: v1beta2-1.3.0-3.1.1
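
Apply these manifests and wait for the init job to complete (webhook-init.yaml is an assumed file name for the two manifests above):

$ kubectl apply -f webhook-init.yaml
$ kubectl wait --for=condition=complete job/sparkoperator-init --namespace spark-operator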

We also need a service account in the default namespace, with permission to manage pods, services, and configmaps:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: spark-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
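
With the operator, webhook, and service account in place, you can submit a first SparkApplication. Below is a minimal sketch based on the spark-pi example from the operator repository; the image and jar path are illustrative and depend on your Spark version:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: ghcr.io/apache/spark-docker/spark:3.5.0 # illustrative
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: 3.5.0
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    instances: 1
    cores: 1
    memory: 512m

Submit it with kubectl apply -f spark-pi.yaml and follow the driver with kubectl logs -f spark-pi-driver.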

Develop with Jupyter

For small code iterations I enjoy working with Jupyter notebooks. Any Jupyter image with Spark will do, for example jupyter/pyspark-notebook:spark-3.5.0:

  • mount a local repository as a volume in the notebook container
  • expose container port 8888 locally with a service
  • add a headless service, so Spark can find the notebook when running in client mode

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter
  labels:
    app: jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: jupyter
          image: jupyter/pyspark-notebook:spark-3.5.0
          resources:
            requests:
              memory: 512Mi
            limits:
              memory: 512Mi
          env:
            - name: JUPYTER_PORT
              value: "8888"
          ports:
            - containerPort: 8888
          volumeMounts:
            - mountPath: /home/jovyan/work
              name: src-local
              readOnly: false
      volumes:
        - name: src-local
          hostPath:
            path: /Users/joostdobken/repos/spark-app # local path to your repository
      serviceAccount: spark
      serviceAccountName: spark
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter
spec:
  type: LoadBalancer # LoadBalancer exposes this port to your local machine
  selector:
    app: jupyter
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter-headless
spec:
  clusterIP: None
  selector:
    app: jupyter
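
Apply the deployment and both services (jupyter.yaml is an assumed file name):

$ kubectl apply -f jupyter.yaml
$ kubectl get pods -l app=jupyter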

And find the notebook at http://localhost:8888 (TODO: disable the token).
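
Until then, copy the token from the pod logs:

$ kubectl logs deployment/jupyter | grep token

To disable it instead, a sketch assuming the docker-stacks start-notebook.sh entrypoint (an empty token turns authentication off, which is only acceptable for local development): add this to the jupyter container spec:

args: ["start-notebook.sh", "--ServerApp.token="]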

Spark Session

Run a Spark session configured to launch executors on the local Kubernetes cluster in client mode:

import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("JupyterApp")
    .master("k8s://https://kubernetes.default.svc.cluster.local:443")
    .config("spark.submit.deployMode", "client")
    .config("spark.executor.instances", "1")
    .config("spark.executor.memory", "1G")
    .config("spark.driver.memory", "1G")
    .config("spark.executor.cores", "1")
    .config("spark.kubernetes.namespace", "default")
    .config(
        "spark.kubernetes.container.image", "ghcr.io/apache/spark-docker/spark:3.5.0"
    )
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.kubernetes.driver.pod.name", os.environ["HOSTNAME"])
    .config("spark.driver.bindAddress", "0.0.0.0")
    .config("spark.driver.host", "jupyter-headless.default.svc.cluster.local")
    .config("spark.kubernetes.executor.volumes.hostPath.src-local.mount.path", "/home/jovyan/work")
    .getOrCreate()
)
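
A quick smoke test: this should start an executor pod in the default namespace (watch with kubectl get pods -w) and print 1000:

print(spark.range(1000).count())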

Q&A:

  • How do I mount a volume on the executors? Via the spark.kubernetes.executor.volumes.* options shown in the Spark session above; the full list is in the Spark on Kubernetes documentation.