Coalesce with care…

Coalesce Vs. Repartition in SparkSQL

Here is a quick Spark SQL riddle for you: what do you think could be problematic in the following Spark code (assume that the Spark session is configured in an ideal way)? A sketch of the kind of job in question follows below.

Hint 1: the input data (my_website_visits) is quite big.
Hint 2: we filter out most of the data before writing.

I'm sure you got it by now: if the input data is big and Spark is configured in an ideal way, the job has a lot of tasks, which means the writing is also done from many tasks. The result is most likely a large number of very small parquet files. Small files are a well-known problem in the big data world: it takes an unnecessarily large amount of resources to write them, and, more importantly, it takes even more resources to read them back (more IO, more memory, more runtime…).

In the Spark UI, this job shows 165 tasks, which means we can end up with as many as 165 output files.

How would you improve this? Instead of writing from multiple workers, let's write from a single worker.
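The exact snippet isn't shown here, so this is only a minimal sketch of the kind of job described above, with a hypothetical parquet source, filter column, and output path:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("small-files-riddle").getOrCreate()

# A large input table (hypothetical path): reading it produces many
# partitions, and therefore many tasks -- 165 in the example above.
visits = spark.read.parquet("/data/my_website_visits")

# The filter drops most of the rows (hypothetical condition), so each task is
# left with only a handful of records.
filtered = visits.where(F.col("country") == "IL")

# Each task writes its own file, so the output directory can end up with up
# to 165 tiny parquet files.
filtered.write.mode("overwrite").parquet("/data/visits_from_il")
```

One common way to get a single writer is to collapse the result to a single partition just before the write. The title hints that coalesce is the tool under discussion, and also that it needs care: coalesce(1) avoids a shuffle, but without a shuffle boundary it can drag the upstream read and filter into that one task as well, whereas repartition(1) keeps the upstream work parallel at the cost of a shuffle. A sketch, continuing from the code above:

```python
# Single partition, single task, single output file. Because coalesce is a
# narrow transformation, the read and the filter may now also run in one task.
filtered.coalesce(1).write.mode("overwrite").parquet("/data/visits_from_il")

# Alternative: force a shuffle so the read and filter stay parallel, and only
# the final write runs in a single task.
# filtered.repartition(1).write.mode("overwrite").parquet("/data/visits_from_il")
```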
