Process datasets that cannot be handled on a single pc; Spark. Run queries on very large datasets. Grew from the Hadoop ecosystem, a software platform for distributed storage and processing of very large data sets.

We are going to deploy Spark on our cluster, specifaclly Spark 3.0.0 was officially released on 18th June 2020. Why Spark 3? Compared to Spark 2, it has improbed performance and it will be easier to extend it with libraries that run on GPU nodes, there is support for Cypher, a query language for graph integration. These are not features that we’ll need rightaway, but we might need them in the near future.

PySpark Style guide

The Palantir’s Style Guide for PySpark includes basic rules to improve your PySpark code including good reasons why this is a best practice.

Performence & Optimization

Notes from the article Tuning Apache Spark Jobs the Easy Way.

Running Spark jobs is not very difficult, it is not always easy to tell whether we can run it optimally.

  • Jobs: operations that physically move data in order to produce some result; are decomposed into “stages” by separating where a shuffle is required.
  • Stage: segments of work that run from date input, or data read from a previous shuffle, through a set of operations named tasks.
  • task: one task per data partition.

The Stage Detail View gives a report with various hints how to optimize.

The Event Timeline with all tasks in the stage, grouped by executor. If less cores are used than available in the cluster, this can be optimized.

  • If there are too many partitions, the last ending tasks are limiting. This can also be too few partitions or a data skew.
  • with too many partitions, leading to too many tasks; many short tasks dominated by time spent in non-compute

You want the “computing time” to be at least 70% of the time, or you have too many tasks/partitions or the partitions are too small or require too little work.

The Summary Metrics for Competed Tasks show variance in compute metrics; using skew in the min and max values, you can find partitions too small or too big.

The Aggregated Metrics by Executor gives a set of summerized statistics, if one executor is consistently off, this could mean that there is something wrong with the JVM, the hosting node, data locally trouble.

The Task List includes all the tasks already shown on the event timeline, but now with more details.

How many partitions depends really on so many variables – workload, cluster configuration, data sources – but a ball park figure: each partition to contain 100-200MB of data and each task to take 50-200ms.