Spark – aviyehuda.com

Data Engineering: Strategies for data retrieval on multi-dimensional data

Posted on 20/11/2023

You’ve likely heard about the benefits of partitioning data by a single dimension to boost retrieval performance. It’s a common practice in relational databases, NoSQL databases, and, notably, data lakes. For example,…

Spark and Small Files

Posted on 12/03/2022

In my previous post I showed a short code example and asked what might be the problem with that code, assuming that the input (my_website_visits) is very big…

Coalesce with care…

Coalesce Vs. Repartition in SparkSQL

Posted on 10/01/2022

Here is a quick Spark SQL riddle for you: what do you think could be problematic in the following Spark code (assume that the Spark session was configured in an ideal way)? Hint1:…

The right way to use Spark and JDBC

Posted on 17/12/2018

A while ago I had to read data from a MySQL table, do a bit of manipulation on that data, and store the results on disk. The obvious choice was to use…