Polars has been making waves in the Python data world — and for good reason. It’s fast, expressive, and built with performance-first principles. If you’re dealing with Parquet files in an S3 bucket and care even a little about speed or memory, this post is for you.

Let’s talk about three powerful components working together:

  • 🪣 S3 bucket storing Parquet files
  • 📦 PyArrow-style datasets
  • ⚡ Polars doing its thing — lazily and efficiently

🔍 The Case for Lazy Reading

When you reach for read_parquet, you get everything: the whole file, read eagerly into memory. That’s fine… until it’s not. scan_parquet instead gives you a LazyFrame, and that changes everything.

import polars as pl

# storage_options is a dict of S3 credentials/config (region, keys, endpoint, ...)
df = (
    pl.scan_parquet(
        "s3://your-bucket-name/path/to/file.parquet",
        storage_options=storage_options
    )
    .with_columns([
        pl.lit("example").alias("some_column")
    ])
    .select([
        "column_you_need_1",
        "column_you_need_2"
    ])
    .collect()
)

This approach defers execution until .collect() is called. In the meantime, Polars builds a computation graph. With scan_parquet, Polars can:

  • Push down filters: Only scan rows that match your criteria.
  • Select only required columns: Avoid wasting bandwidth and memory on unused data.
  • Avoid intermediate materializations: Keep your pipeline lean and mean.
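To make those pushdowns concrete, here’s a minimal sketch of filter plus projection pushdown. The bucket path, column names, and the shape of storage_options are placeholders you’d adapt to your own setup.

import polars as pl

# Hypothetical credentials/config; in practice these often come from the
# environment or an assumed IAM role rather than being hard-coded.
storage_options = {
    "aws_region": "eu-west-1",
    # "aws_access_key_id": "...",
    # "aws_secret_access_key": "...",
}

lazy = (
    pl.scan_parquet(
        "s3://your-bucket-name/path/to/*.parquet",
        storage_options=storage_options
    )
    .filter(pl.col("event_date") >= pl.date(2025, 1, 1))  # pushed down into the scan
    .select(["event_date", "user_id", "amount"])          # only these columns are read
)

print(lazy.explain())  # inspect the optimized plan before any data is downloaded

df = lazy.collect()

The plan printed by .explain() shows the filter and the column selection sitting inside the Parquet scan itself, which is exactly what keeps the S3 traffic small.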

📝 Write Parquet to S3

Polars makes writing Parquet just as slick, and when you’re targeting S3 with partitioning, a little pyarrow finesse goes a long way.

Here’s how you do it right:

df.write_parquet(
    target_path,  # destination prefix for the partitioned dataset
    compression="snappy",
    use_pyarrow=True,
    pyarrow_options={
        "partition_cols": ["year", "month"],
        "existing_data_behavior": "overwrite_or_ignore",
        "filesystem": settings.pyarrow_s3file_system,  # a pyarrow.fs.S3FileSystem (see below)
        "coerce_timestamps": "ms",
    },
    retries=4,
)

✨ Let’s break this down

  • use_pyarrow=True: This enables advanced write options like partitioning, timestamp coercion, and direct S3 access via PyArrow’s filesystem abstraction.

  • partition_cols: Partitions your data by column (e.g., year=2025/month=04/), which is super useful for downstream performance in Spark, DuckDB, or Polars’ own scan_parquet (see the read-back sketch after this list).

  • existing_data_behavior: Tells PyArrow how to handle files already in the target location. overwrite_or_ignore is a nice compromise: it overwrites files that clash with what you’re writing and leaves everything else in place.

  • filesystem: A PyArrow S3 filesystem object, typically initialized like so:

    import pyarrow.fs

    # Credentials resolve via the standard AWS chain (env vars, config files, IAM role)
    fs = pyarrow.fs.S3FileSystem(region="eu-west-1")
    
  • coerce_timestamps: Timestamps can be tricky. Setting "ms" (milliseconds) keeps them consistent and portable.

  • compression="snappy": Lightweight and fast — basically the default for modern Parquet workflows.