Polars has been making waves in the Python data world — and for good reason. It’s fast, expressive, and built with performance-first principles. If you’re dealing with Parquet files in an S3 bucket and care even a little about speed or memory, this post is for you.
Let’s talk about three powerful components working together:
- 🪣 S3 bucket storing Parquet files
- 📦 PyArrow-style datasets
- ⚡ Polars doing its thing — lazily and efficiently
🔍 The Case for Lazy Reading
When you reach for `read_parquet`, you get everything. That’s fine… until it’s not.
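For contrast, the eager path looks roughly like this (the bucket path and `storage_options` values are placeholders):

```python
import polars as pl

# Eager read: the whole file is downloaded and decoded into memory,
# whether or not you need every column and row.
df = pl.read_parquet(
    "s3://your-bucket-name/path/to/file.parquet",
    storage_options={"aws_region": "eu-west-1"},
)
```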
Instead, `scan_parquet` gives you a lazy frame, and that changes everything.
```python
import polars as pl

# storage_options holds your S3 credentials/region,
# e.g. {"aws_region": "eu-west-1", ...}
df = (
    pl.scan_parquet(
        "s3://your-bucket-name/path/to/file.parquet",
        storage_options=storage_options,
    )
    .with_columns([
        pl.lit("example").alias("some_column")
    ])
    .select([
        "column_you_need_1",
        "column_you_need_2",
    ])
    .collect()
)
```
This approach defers execution until `.collect()` is called. In the meantime, Polars builds a computation graph. With `scan_parquet`, Polars can (see the sketch after this list):
- Push down filters: Only scan rows that match your criteria.
- Select only required columns: Avoid wasting bandwidth and memory on unused data.
- Avoid intermediate materializations: Keep your pipeline lean and mean.
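Here is a minimal sketch of those optimizations in action, with placeholder paths, column names, and credentials; `explain()` prints the optimized plan so you can see the predicate and projection being pushed into the scan:

```python
import polars as pl

lazy = (
    pl.scan_parquet(
        "s3://your-bucket-name/path/to/file.parquet",
        storage_options={"aws_region": "eu-west-1"},
    )
    .filter(pl.col("column_you_need_1") > 100)            # pushed down to the scan
    .select(["column_you_need_1", "column_you_need_2"])   # only these columns are read
)

# Inspect the plan before any data is fetched, then execute it.
print(lazy.explain())
df = lazy.collect()
```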
Write Parquet to S3
Polars makes writing Parquet just as slick, and when you’re targeting S3 with partitioning, a little pyarrow finesse goes a long way.
Here’s how you do it right:
```python
# target_path and settings.pyarrow_s3file_system are defined elsewhere:
# the S3 destination and a pyarrow.fs.S3FileSystem instance, respectively.
df.write_parquet(
    target_path,
    compression="snappy",
    use_pyarrow=True,
    pyarrow_options={
        "partition_cols": ["year", "month"],
        "existing_data_behavior": "overwrite_or_ignore",
        "filesystem": settings.pyarrow_s3file_system,
        "coerce_timestamps": "ms",
    },
    retries=4,
)
```
✨ Let’s break this down
- `use_pyarrow=True`: Enables advanced write options like partitioning, timestamp coercion, and direct S3 access via PyArrow’s filesystem abstraction.
- `partition_cols`: Partitions your data by column (e.g., `year=2025/month=04/`), which is super useful for downstream performance in Spark, DuckDB, or Polars’ own `scan_parquet`.
- `existing_data_behavior`: Tells PyArrow how to deal with existing files. `overwrite_or_ignore` is a nice compromise: it replaces what it needs to and skips the rest.
- `filesystem`: A PyArrow S3 filesystem object, typically created with `pyarrow.fs.S3FileSystem(region="eu-west-1")` (see the sketch below).
- `coerce_timestamps`: Timestamps can be tricky. Setting `"ms"` (milliseconds) keeps them consistent and portable.
- `compression="snappy"`: Lightweight and fast, basically the default for modern Parquet workflows.
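To make the `filesystem` piece concrete, here is a minimal sketch of building the PyArrow S3 filesystem and then lazily scanning the partitioned output back in. The region, bucket, and prefix are placeholders; `hive_partitioning=True` tells Polars to turn the `year=.../month=.../` directories back into columns so filters can prune whole partitions.

```python
import pyarrow.fs
import polars as pl

# The object you would pass as pyarrow_options["filesystem"] above;
# the region is a placeholder.
fs = pyarrow.fs.S3FileSystem(region="eu-west-1")

# Reading the partitioned layout back: year=2025/month=04/... directories
# become regular columns, and the filter below skips non-matching partitions.
df = (
    pl.scan_parquet(
        "s3://your-bucket-name/path/to/dataset/**/*.parquet",
        storage_options={"aws_region": "eu-west-1"},
        hive_partitioning=True,
    )
    .filter((pl.col("year") == 2025) & (pl.col("month") == 4))
    .collect()
)
```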