You’ve likely heard about the benefits of partitioning data by a single dimension to boost retrieval performance. It’s a common practice in relational databases, NoSQL databases, and, notably, data lakes. For example,…
Category: Spark
Spark and Small Files
In my previous post I showed a short code example and asked what might be wrong with it, assuming that the input (my_website_visits) is very big…
Coalesce with care…
Coalesce Vs. Repartition in SparkSQL
Here is a quick Spark SQL riddle for you: what do you think could be problematic in the following Spark code (assume that the Spark session was configured in an ideal way)? Hint 1:…
The right way to use Spark and JDBC
A while ago I had to read data from a MySQL table, manipulate that data a bit, and store the results on disk. The obvious choice was to use…