Polars has been making waves in the Python data world — and for good reason. It’s fast, expressive, and built with performance-first principles. If you’re dealing with Parquet files in an S3 bucket and care even a little about speed or memory, this post is for you.
Let’s talk about three powerful components working together:
- 🪣 S3 bucket storing Parquet files
- 📦 PyArrow-style datasets
- ⚡ Polars doing its thing — lazily and efficiently
🔍 The Case for Lazy Reading
When you reach for `read_parquet`, you get everything. That’s fine… until it’s not.
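For contrast, the eager path looks roughly like this (the bucket path and `storage_options` values are placeholders):

```python
import polars as pl

# Eager read: the whole file is downloaded and decoded into memory,
# whether or not you need every column and row.
df = pl.read_parquet(
    "s3://your-bucket-name/path/to/file.parquet",
    storage_options={"aws_region": "eu-west-1"},
)
```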
Instead, `scan_parquet` gives you a lazy frame, and that changes everything.
```python
import polars as pl

# storage_options holds your S3 credentials/region,
# e.g. {"aws_region": "eu-west-1", ...}
df = (
    pl.scan_parquet(
        "s3://your-bucket-name/path/to/file.parquet",
        storage_options=storage_options,
    )
    .with_columns([
        pl.lit("example").alias("some_column")
    ])
    .select([
        "column_you_need_1",
        "column_you_need_2",
    ])
    .collect()
)
```
This approach defers execution until `.collect()` is called. In the meantime, Polars builds a computation graph. With `scan_parquet`, Polars can (see the sketch after this list):
- Push down filters: Only scan rows that match your criteria.
- Select only required columns: Avoid wasting bandwidth and memory on unused data.
- Avoid intermediate materializations: Keep your pipeline lean and mean.
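Here is a minimal sketch of those optimizations in action, with placeholder paths, column names, and credentials; `explain()` prints the optimized plan so you can see the predicate and projection being pushed into the scan:

```python
import polars as pl

lazy = (
    pl.scan_parquet(
        "s3://your-bucket-name/path/to/file.parquet",
        storage_options={"aws_region": "eu-west-1"},
    )
    .filter(pl.col("column_you_need_1") > 100)            # pushed down to the scan
    .select(["column_you_need_1", "column_you_need_2"])   # only these columns are read
)

# Inspect the plan before any data is fetched, then execute it.
print(lazy.explain())
df = lazy.collect()
```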
Write Parquet to S3
Polars makes writing Parquet just as slick, and when you’re targeting S3 with partitioning, a little pyarrow finesse goes a long way.
Here’s how you do it right:
```python
# target_path and settings.pyarrow_s3file_system are defined elsewhere:
# the S3 destination and a pyarrow.fs.S3FileSystem instance, respectively.
df.write_parquet(
    target_path,
    compression="snappy",
    use_pyarrow=True,
    pyarrow_options={
        "partition_cols": ["year", "month"],
        "existing_data_behavior": "overwrite_or_ignore",
        "filesystem": settings.pyarrow_s3file_system,
        "coerce_timestamps": "ms",
    },
    retries=4,
)
```
✨ Let’s break this down
- `use_pyarrow=True`: Enables advanced write options like partitioning, timestamp coercion, and direct S3 access via PyArrow’s filesystem abstraction.
- `partition_cols`: Partitions your data by column (e.g., `year=2025/month=04/`), which is super useful for downstream performance in Spark, DuckDB, or Polars’ own `scan_parquet`.
- `existing_data_behavior`: Tells PyArrow how to deal with existing files. `overwrite_or_ignore` is a nice compromise: it replaces what it needs to and skips the rest.
- `filesystem`: A PyArrow S3 filesystem object, typically created with `pyarrow.fs.S3FileSystem(region="eu-west-1")` (see the sketch below).
- `coerce_timestamps`: Timestamps can be tricky. Setting `"ms"` (milliseconds) keeps them consistent and portable.
- `compression="snappy"`: Lightweight and fast, basically the default for modern Parquet workflows.
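To make the `filesystem` piece concrete, here is a minimal sketch of building the PyArrow S3 filesystem and then lazily scanning the partitioned output back in. The region, bucket, and prefix are placeholders; `hive_partitioning=True` tells Polars to turn the `year=.../month=.../` directories back into columns so filters can prune whole partitions.

```python
import pyarrow.fs
import polars as pl

# The object you would pass as pyarrow_options["filesystem"] above;
# the region is a placeholder.
fs = pyarrow.fs.S3FileSystem(region="eu-west-1")

# Reading the partitioned layout back: year=2025/month=04/... directories
# become regular columns, and the filter below skips non-matching partitions.
df = (
    pl.scan_parquet(
        "s3://your-bucket-name/path/to/dataset/**/*.parquet",
        storage_options={"aws_region": "eu-west-1"},
        hive_partitioning=True,
    )
    .filter((pl.col("year") == 2025) & (pl.col("month") == 4))
    .collect()
)
```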