Local Spark

How do you install Spark locally? -> You don’t! You can find plenty of easy-looking guides, but they will still lead you through a forest of opaque steps: installing Java and Scala, downloading and installing Spark, setting environment variables and paths, only to find out you have the wrong versions. Good luck, have fun! Kubernetes in production? Kubernetes locally! The closer you can develop to your production environment, the better. In 2024 you run Spark on Kubernetes?...
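As a rough, hedged sketch of that direction (not the post's exact setup), the snippet below starts a PySpark session in client mode against a local Kubernetes cluster. The API server URL, container image, and executor count are assumptions for illustration; a real setup also needs a namespace and a service account allowed to create executor pods.

```python
from pyspark.sql import SparkSession

# Sketch only: assumes a local cluster (e.g. kind or minikube) whose API server
# is reachable at this URL, and a Spark image the cluster can pull.
spark = (
    SparkSession.builder
    .master("k8s://https://127.0.0.1:6443")                                # assumed local API server
    .appName("local-spark-on-k8s")
    .config("spark.kubernetes.container.image", "apache/spark-py:v3.5.0")  # assumed image tag
    .config("spark.executor.instances", "2")
    .getOrCreate()
)

spark.range(10).show()
```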

April 15, 2024 · 3 min · 571 words · Joost

Interactive Scala with Almond

Almond is a Scala kernel for Jupyter. Some features: Ammonite, a Scala REPL implementation, and Coursier, an artefact manager. You can deploy Almond on Kubernetes with the following manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: almond
  labels:
    app: almond
spec:
  replicas: 1
  selector:
    matchLabels:
      app: almond
  template:
    metadata:
      labels:
        app: almond
    spec:
      containers:
        - name: almond
          image: almondsh/almond:0.13.11
          resources:
            requests:
              memory: 384Mi
            limits:
              memory: 384Mi
          ports:
            - containerPort: 8888
---
kind: Service
apiVersion: v1
metadata:
  name: almond
spec:
  type: ClusterIP
  selector:
    app: almond
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
---
kind: Service
apiVersion: v1
metadata:
  name: almond-headless
spec:
  clusterIP: None
  selector:
    app: almond
```

Port forward:...

May 7, 2023 · 1 min · 193 words · Joost

Overview of Spark configurations

I find myself looking for an overview too often, so let’s create a rough overview of commonly used Spark configuration. As a start, create a Spark session with the default config:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master(SPARK_MASTER) \
    .appName("app name") \
    .getOrCreate()
```

The Spark Context represents the connection to the cluster; it communicates with the lower-level APIs and RDDs. Some resource settings on the driver: ... .config("spark.driver.memory", "8g") ... .config("spark.cores.max", "4") ....
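To make the elided settings a bit more concrete, here is a hedged sketch of a session with a few commonly used resource options; the values are illustrative, not the post's recommendations. Keep in mind that spark.driver.memory is usually best set at launch time (spark-submit or spark-defaults.conf), since it may not take effect once the driver JVM is already running.

```python
from pyspark.sql import SparkSession

# Illustrative values only; the local master URL is an assumption for this example.
spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("config-overview")
    .config("spark.driver.memory", "8g")            # driver JVM heap; prefer setting at launch time
    .config("spark.cores.max", "4")                 # total cores across executors (standalone mode)
    .config("spark.executor.memory", "4g")          # heap per executor
    .config("spark.executor.cores", "2")            # cores per executor
    .config("spark.sql.shuffle.partitions", "200")  # partitions used for shuffles and joins
    .getOrCreate()
)

# Inspect what actually got applied.
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)
```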

November 8, 2021 · 3 min · 431 words · Joost

Provide Spark with cross-account access

If you need to give Spark access to resources in a different AWS account, I found that quite tricky to figure out. Let’s assume you have two AWS accounts: the alpha account, where you run Python with the IAM role alpha-role and have access to the Spark cluster; and the beta account, which holds the S3 bucket you want to access. You could grant S3 read access to the alpha-role directly, but it is more durable and easier to manage to create an access-role in the beta account that can be assumed by the alpha-role....
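The post's full walkthrough is truncated above; the snippet below is one hedged sketch of the idea, assuming the beta account exposes an access-role that the alpha-role may assume. It obtains temporary credentials via STS and hands them to Spark's S3A connector. The role ARN, bucket, and path are placeholders, and the temporary credentials expire (one hour by default), so long-running jobs need a refresh strategy such as Hadoop's AssumedRoleCredentialProvider.

```python
import boto3
from pyspark.sql import SparkSession

# Placeholder ARN for the access-role in the beta account that the alpha-role may assume.
BETA_ACCESS_ROLE_ARN = "arn:aws:iam::<beta-account-id>:role/access-role"

# Running as alpha-role, assume the beta-account role to obtain temporary credentials.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=BETA_ACCESS_ROLE_ARN,
    RoleSessionName="spark-cross-account",
)["Credentials"]

spark = (
    SparkSession.builder
    .appName("cross-account-read")
    # Hand the temporary credentials to the S3A filesystem used by Spark.
    .config("spark.hadoop.fs.s3a.access.key", creds["AccessKeyId"])
    .config("spark.hadoop.fs.s3a.secret.key", creds["SecretAccessKey"])
    .config("spark.hadoop.fs.s3a.session.token", creds["SessionToken"])
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .getOrCreate()
)

# Read from the bucket that lives in the beta account (placeholder path).
df = spark.read.parquet("s3a://beta-bucket/some/path/")
```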

August 21, 2020 · 2 min · 413 words · Joost