Local Spark

How do you install Spark locally? -> You don’t! You can find plenty of easy-looking guides, but they will still lead you through a forest of opaque steps: installing Java and Scala, downloading and installing Spark, setting environment variables and paths, only to find out you have the wrong versions. Good luck, have fun! Kubernetes in production? Kubernetes locally! The closer you can develop to your production environment, the better. In 2024 you run Spark on Kubernetes?...
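As a rough, hedged sketch of that direction (not the post's exact setup), the snippet below starts a PySpark session in client mode against a local Kubernetes cluster. The API server URL, container image, and executor count are assumptions for illustration; a real setup also needs a namespace and a service account allowed to create executor pods.

```python
from pyspark.sql import SparkSession

# Sketch only: assumes a local cluster (e.g. kind or minikube) whose API server
# is reachable at this URL, and a Spark image the cluster can pull.
spark = (
    SparkSession.builder
    .master("k8s://https://127.0.0.1:6443")                                # assumed local API server
    .appName("local-spark-on-k8s")
    .config("spark.kubernetes.container.image", "apache/spark-py:v3.5.0")  # assumed image tag
    .config("spark.executor.instances", "2")
    .getOrCreate()
)

spark.range(10).show()
```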

April 15, 2024 · 3 min · 571 words · Joost

Interactive Scala with Almond

Almond is a Scala kernel for Jupyter. Some features: Ammonite, a Scala REPL implementation, and Coursier, an artefact manager. You can deploy Almond on Kubernetes with the following manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: almond
  labels:
    app: almond
spec:
  replicas: 1
  selector:
    matchLabels:
      app: almond
  template:
    metadata:
      labels:
        app: almond
    spec:
      containers:
        - name: almond
          image: almondsh/almond:0.13.11
          resources:
            requests:
              memory: 384Mi
            limits:
              memory: 384Mi
          ports:
            - containerPort: 8888
---
kind: Service
apiVersion: v1
metadata:
  name: almond
spec:
  type: ClusterIP
  selector:
    app: almond
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
---
kind: Service
apiVersion: v1
metadata:
  name: almond-headless
spec:
  clusterIP: None
  selector:
    app: almond
```

Port forward:...

May 7, 2023 · 1 min · 193 words · Joost

Overview of Spark configurations

I find myself looking for an overview too often, so let’s create a rough overview of commonly used Spark configuration. As a start, create a Spark session with the default config:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master(SPARK_MASTER) \
    .appName("app name") \
    .getOrCreate()
```

The Spark Context represents the connection to the cluster; it communicates with the lower-level APIs and RDDs. Some resource settings on the driver: ... .config("spark.driver.memory", "8g") ... .config("spark.cores.max", "4") ....
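To make the elided settings a bit more concrete, here is a hedged sketch of a session with a few commonly used resource options; the values are illustrative, not the post's recommendations. Keep in mind that spark.driver.memory is usually best set at launch time (spark-submit or spark-defaults.conf), since it may not take effect once the driver JVM is already running.

```python
from pyspark.sql import SparkSession

# Illustrative values only; the local master URL is an assumption for this example.
spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("config-overview")
    .config("spark.driver.memory", "8g")            # driver JVM heap; prefer setting at launch time
    .config("spark.cores.max", "4")                 # total cores across executors (standalone mode)
    .config("spark.executor.memory", "4g")          # heap per executor
    .config("spark.executor.cores", "2")            # cores per executor
    .config("spark.sql.shuffle.partitions", "200")  # partitions used for shuffles and joins
    .getOrCreate()
)

# Inspect what actually got applied.
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)
```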

November 8, 2021 · 3 min · 431 words · Joost

Provide Spark with cross-account access

If you need to give Spark access to resources in a different AWS account, I found that quite tricky to figure out. Let’s assume you have two AWS accounts: the alpha account, where you run Python with the IAM role alpha-role and have access to the Spark cluster; and the beta account, which holds the S3 bucket you want to access. You could grant S3 read access to the alpha-role directly, but it is more durable and easier to manage to create an access-role in the beta account that can be assumed by the alpha-role....
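The post's full walkthrough is truncated above; the snippet below is one hedged sketch of the idea, assuming the beta account exposes an access-role that the alpha-role may assume. It obtains temporary credentials via STS and hands them to Spark's S3A connector. The role ARN, bucket, and path are placeholders, and the temporary credentials expire (one hour by default), so long-running jobs need a refresh strategy such as Hadoop's AssumedRoleCredentialProvider.

```python
import boto3
from pyspark.sql import SparkSession

# Placeholder ARN for the access-role in the beta account that the alpha-role may assume.
BETA_ACCESS_ROLE_ARN = "arn:aws:iam::<beta-account-id>:role/access-role"

# Running as alpha-role, assume the beta-account role to obtain temporary credentials.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=BETA_ACCESS_ROLE_ARN,
    RoleSessionName="spark-cross-account",
)["Credentials"]

spark = (
    SparkSession.builder
    .appName("cross-account-read")
    # Hand the temporary credentials to the S3A filesystem used by Spark.
    .config("spark.hadoop.fs.s3a.access.key", creds["AccessKeyId"])
    .config("spark.hadoop.fs.s3a.secret.key", creds["SecretAccessKey"])
    .config("spark.hadoop.fs.s3a.session.token", creds["SessionToken"])
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .getOrCreate()
)

# Read from the bucket that lives in the beta account (placeholder path).
df = spark.read.parquet("s3a://beta-bucket/some/path/")
```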

August 21, 2020 · 2 min · 413 words · Joost