Posts

Polars and Parquet

Polars has been making waves in the Python data world — and for good reason. It’s fast, expressive, and built with performance-first principles. If you’re dealing with Parquet files in an S3 bucket and care even a little about speed or memory, this post is for you. Let’s talk about three powerful components working together: 🪣 S3 bucket storing Parquet files 📦 PyArrow-style datasets ⚡ Polars doing its thing — lazily and efficiently 🔍 The Case for Lazy Reading When you reach for read_parquet, you get everything. That’s fine… until it’s not. Instead, scan_parquet gives you a lazy frame — and that changes everything. ...

Comparison of Python Clients for Object Storage

Basics: botocore is the low-level interface for practically all other clients. boto3 official Amazon Python client minio standalone alternative to botocore/boto3, does not natively support asynchronous operations as of now. Async: aiobotocore async support for botocore aioboto3 wrapper for aiobotocore s3fs wrapper for aiobotocore Boto3 import boto3 from botocore.config import Config # Initialize the S3 client with a custom endpoint s3 = boto3.client( "s3", endpoint_url="https://custom-endpoint.com", aws_access_key_id="your-access-key", aws_secret_access_key="your-secret-key", config=Config(signature_version="s3v4"), ) # Upload a file s3.upload_file("local/path/file.txt", "bucket-name", "destination/path/file.txt") # Download a file s3.download_file("bucket-name", "source/path/file.txt", "local/path/file.txt") Minio from minio import Minio # Initialize the Minio client with a custom endpoint client = Minio( "custom-endpoint.com", access_key="your-access-key", secret_key="your-secret-key", secure=True, # Set to False if not using HTTPS ) # Upload a file client.fput_object("bucket-name", "destination/path/file.txt", "local/path/file.txt") # Download a file client.fget_object("bucket-name", "source/path/file.txt", "local/path/file.txt") aioboto3 import aioboto3 # Create an async session and client session = aioboto3.Session( aws_access_key_id="your-access-key", aws_secret_access_key="your-secret-key", ) async with aioboto3.Session().client( "s3", endpoint_url=endpoint_url, aws_access_key_id=access_key, aws_secret_access_key=secret_key, ) as s3: # Upload the file with open(file_path, "rb") as file: await s3.upload_fileobj(file, bucket_name, object_name) S3fs Looks the most clean: ...

Azure Pipelines

I can firmly say that I am well experienced in Continuous Integration / Continuous Deployment (CI/CD) pipelines, and in general simple is always better dan complex. If you need something complex, there is a different place for your process: CICD should be nothing more than: BUILD TEST RELEASE DEPLOY For a client I am working on Azure Devops Pipelines and so far my initial experiences have left me somewhat skeptical, as I find that it often imposes a level of abstraction that veers away from my preference for simplicity. ...

Async Pandas

Pandas is great for Python because it offers efficient data manipulation and analysis capabilities, leveraging the speed of the underlying NumPy library. How does it behave with asyncio since I could not find much about it. Have an enourmnes dataset call an API with a throughput of 10call at once. The simple example pandas.DataFrame consists of 100 rows of lorem text: import lorem import pandas as pd df = pd.DataFrame({"Text": [lorem.text() for _ in range(100)]}) >>> df.head() Text 0 Labore quisquam neque adipisci labore non quae... 1 Aliquam etincidunt dolore dolore voluptatem. A... 2 Aliquam consectetur dolor dolorem dolorem ipsu... 3 Labore non aliquam numquam sed. Eius neque con... 4 Voluptatem ipsum modi amet tempora tempora eti... Asyncio If we want to sent every row to an API and that call takes about a second. Let’s consider this method reverses the text and returns the final three letters: ...

Local Spark

How to install Spark locally? -> You don’t! You can find easy guides for sure, they will still guide you through a forrest of opaque steps: installing Java, Scala, downloading and installing Spark, setting environmental variables and paths. Then to find out you have the wrong versions. Have fun good luck! Kubernetes in production? Kubernetes locally! The closer you can develop to your production environment the better; In 2024 you run Spark on Kubernetes? Run kubernetes locally with Rancher Desktop. After installation the k3s “cluster” is available through kubectl: ...