Quick tip: Easily find data on the data lake when using AWS Glue Catalog

Finding data on the data lake can sometimes be a challenge. At my current workplace (ZipRecruiter) we have hundreds of tables on the data lake, and the number grows every day. We store the data on AWS S3 and use the AWS Glue Catalog as the metadata store for our Hive tables. But even with the Glue Catalog, finding data on the data lake can still be a hassle. Let’s say I am trying to find a certain type of data, like ‘clicks’ for example. It would be very nice to have an easy way to get all the clicks-related tables (including aggregation tables, join tables and so on) to choose from. Or perhaps I would like to know which tables were generated by a specific application. There is no easy way to find these tables by default. But here is something pretty cool that I recently found out about the Glue Catalog that can help: if you add properties to Glue tables, you can search tables based on those properties. For example, if you add the property “clicks” to all the clicks-related tables, you can get all of those tables back by searching for the phrase “clicks”. Continue reading Quick tip: Easily find data on the data lake when using AWS Glue Catalog
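A minimal sketch of the idea with boto3 (the `analytics` database, `clicks_daily_agg` table, and `domain` property name are made-up examples): one call tags a table with a property, and `search_tables` then finds every table carrying that text.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Tag a table with a custom property. update_table requires a full TableInput,
# so copy the existing definition and drop the read-only fields that TableInput
# does not accept (the exact set may vary by boto3 version).
table = glue.get_table(DatabaseName="analytics", Name="clicks_daily_agg")["Table"]
read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
             "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
table_input = {k: v for k, v in table.items() if k not in read_only}
table_input["Parameters"] = {**table.get("Parameters", {}), "domain": "clicks"}
glue.update_table(DatabaseName="analytics", TableInput=table_input)

# Search the catalog: SearchText matches against table metadata, including
# property values, so every table tagged "clicks" shows up.
token = None
while True:
    kwargs = {"SearchText": "clicks"}
    if token:
        kwargs["NextToken"] = token
    resp = glue.search_tables(**kwargs)
    for t in resp["TableList"]:
        print(t["DatabaseName"], t["Name"])
    token = resp.get("NextToken")
    if not token:
        break
```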

The right way to use Spark and JDBC

A while ago I had to read data from a MySQL table, do some manipulation on that data, and store the results on disk. The obvious choice was to use Spark: I was already using it for other stuff and it seemed super easy to implement. This is more or less what I had to do (I removed the manipulation part for the sake of simplicity; see the sketch below). Looks good, only it didn’t quite work. Either it was super slow or it crashed entirely, depending on the size of the table. Tuning Spark and the cluster properties helped a bit, but it didn’t solve the problem. Since I was using AWS EMR, it made sense to give Sqoop a try, since it is one of the applications supported on EMR. Sqoop performed much better almost instantly; all you need to do is set the number of mappers according to the size of the data, and it worked perfectly. Since both Spark and Sqoop are based on the Hadoop MapReduce framework, it was clear that Spark could work at least as well as Sqoop. I only needed to find out how to do… Continue reading The right way to use Spark and JDBC
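The original snippet is not included in this excerpt; below is a minimal PySpark sketch of what a naive JDBC read like this typically looks like (the connection URL, table name, credentials, and output path are all placeholders). Without partitioning options, Spark pulls the whole table through a single connection into a single partition, which matches the slow-or-crashing behavior described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-read").getOrCreate()

# Naive JDBC read: no partitioning options, so Spark reads the entire
# table through one connection into a single partition.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/mydb")  # placeholder host/db
    .option("dbtable", "my_table")                    # placeholder table
    .option("user", "user")                           # placeholder credentials
    .option("password", "password")
    .option("driver", "com.mysql.jdbc.Driver")
    .load()
)

# ... manipulation omitted, as in the post ...

df.write.parquet("s3://my-bucket/output/")  # placeholder output path
```

For comparison, Sqoop’s `--num-mappers` flag is what splits the import across parallel tasks; Spark’s JDBC source has the analogous `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options, which is presumably where the full post is heading.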

How to properly collect AWS EMR metrics?

Working with AWS EMR has a lot of benefits, but when it comes to metrics, AWS currently does not supply a proper solution for collecting cluster metrics from EMRs. Well, there is AWS CloudWatch of course, which works out of the box and gives you loads of EMR metrics. The problem with CloudWatch is that it doesn’t give you the ability to follow metrics per business unit or tag, only per specific EMR cluster ID. This simply means you cannot compare the metrics over time, only look at specific EMRs. Let me explain the problem again. A common use of EMR is that you write some kind of code that will be executed inside an EMR cluster and triggered at a given interval, let’s say every 5 hours. This means that every 5 hours a new EMR cluster, with a new ID, will be spawned. In CloudWatch you can see each of these clusters individually but not in a single graph, which is definitely a disadvantage. Just to note, I am referring only to machine metrics, like memory, CPU and disk. Other metrics, like JVM metrics or business metrics, are usually collected by the process itself and… Continue reading How to properly collect AWS EMR metrics?
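The excerpt cuts off before the post’s actual solution, but one way to get tag-level graphs (an illustration of the idea, not necessarily the approach the post takes) is to publish machine metrics yourself with a stable dimension, such as the business unit, instead of the cluster ID. The namespace, metric name, and business-unit value below are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region is an assumption

def publish_memory_metric(used_percent: float, business_unit: str) -> None:
    """Publish a memory gauge keyed by business unit rather than cluster ID,
    so successive EMR clusters land on the same CloudWatch time series."""
    cloudwatch.put_metric_data(
        Namespace="Custom/EMR",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "MemoryUsedPercent",
                "Dimensions": [{"Name": "BusinessUnit", "Value": business_unit}],
                "Value": used_percent,
                "Unit": "Percent",
            }
        ],
    )

publish_memory_metric(63.5, "clicks-pipeline")  # example values
```

Because the dimension stays constant across cluster launches, every 5-hourly EMR run appends to the same series, and the metrics become comparable over time in a single graph.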

Best code convention syndrome

Developers often tend to think that one coding convention is better than another in terms of readability. Some people think that putting a line break before the curly braces is more coherent. Some like camelCase, some hate it. The fact of the matter is, there isn’t any proof that one convention gives better readability than another. Now don’t misunderstand me: I am all for coding conventions, coding conventions are a good thing, but the holy wars over which convention is better seem redundant to me. If indeed one format is better than another, it has a lot less effect on your performance than you would like to think. The truth is that the convention you prefer gives you better performance simply because you are used to it. The best format to use, the one that gives the best results, is the one you are most comfortable with, even if it is not objectively the best one (if there is indeed a best one). Our brain always searches for familiar structures so it has to work less. When you read a piece of code, your mind finds templates that exist in your memory to ease the job… Continue reading Best code convention syndrome