
s3: reading parquet files with a subset of columns as argument; does this reduce data transfer?


I have a set of large tables stored as Parquet files on S3. In Python, I'm using:

pd.read_parquet(..., columns=columns)

I'm reading the files directly from S3, without any database engine in between for preprocessing.

My question is: will the columns argument allow me to reduce data transfer to my remote Dask cluster workers by specifying just the subset of columns I'm interested in, or will the workers load the full Parquet files first and then extract the columns? I suspect the latter is the case.
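To make the setup concrete, each worker ends up running something like the following; the bucket, key, and column names are placeholders for my actual tables:

import pandas as pd

# Placeholder path and columns, standing in for my real tables.
# Reading s3:// paths with pandas requires s3fs to be installed.
columns = ["col_a", "col_b"]
df = pd.read_parquet(
    "s3://my-bucket/tables/large_table/part-0.parquet",
    columns=columns,  # only these two columns are actually needed
)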

Looking for a solution, I found S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html

I think I could use boto3 and SQL syntax to read subsets of columns directly on S3, similar to what is done here: https://www.msp360.com/resources/blog/how-to-use-s3-select-feature-amazon/
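A minimal sketch of what I have in mind, assuming a placeholder bucket/key and a column col_a (I haven't benchmarked this):

import boto3

s3 = boto3.client("s3")

# Placeholder bucket/key. The SELECT projects a single column server-side,
# so in principle only that column's data should leave S3.
resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="tables/large_table/part-0.parquet",
    ExpressionType="SQL",
    Expression="SELECT s.col_a FROM s3object s",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; concatenate the Records payloads.
data = b"".join(
    event["Records"]["Payload"]
    for event in resp["Payload"]
    if "Records" in event
)
print(data.decode("utf-8"))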

But what I would really like to have is a version of pd.read_parquet that does this in the background. What I found is the awswrangler library: https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html

I suspect that awswrangler does exactly what I want, but I did not find an example that shows this. Does anybody know how it works?
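From the docs, the call I imagine would look like this; again the path and column names are placeholders, and I'm not sure whether the column pruning happens before or after the bytes leave S3:

import awswrangler as wr

# Placeholder path; columns= is the knob I hope prunes the transfer.
df = wr.s3.read_parquet(
    path="s3://my-bucket/tables/large_table/",
    columns=["col_a", "col_b"],
    dataset=True,  # treat the prefix as one dataset of many Parquet files
)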

Thanks!

