Spark and Small Files

In my previous post I showed a short code example (sketched below) and asked what the problem with that code might be, assuming the input (my_website_visits) is very big and that the 'where' clause filters most of it out. The answer, of course, is that this piece of code may result in a large number of small files.

Why? Because we are reading a large input, the number of tasks will be quite large. When we filter out most of the data and then write it, the number of tasks remains the same, since no shuffle has taken place. This means each task writes only a small amount of data, which means small files in the target path. Say Spark created 165 tasks to handle our input: even after filtering out most of the data, the output of this process will be at least 165 files, each only a few KB in size.

What is the problem with a lot of small files? First of all, the writing itself is inefficient: more files means unneeded overhead in resources and time.
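Here is a minimal sketch of the kind of code in question. The Parquet format, the paths, and the 'country' filter are assumptions purely for illustration; the original example may have differed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("small-files-example")
  .getOrCreate()

// Reading a very large input: Spark creates roughly one task per input partition.
val visits = spark.read.parquet("/data/my_website_visits") // hypothetical path

// The where clause drops most of the rows...
val filtered = visits.where(col("country") === "IL") // hypothetical condition

// ...but the write involves no shuffle, so the partition count is unchanged:
// each of the original tasks writes its own, now tiny, output file.
filtered.write.parquet("/data/filtered_visits") // hypothetical path
```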
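To confirm that the filter does not change the parallelism, you can compare partition counts before and after the 'where' clause. Continuing the sketch above (the numbers in the comments are illustrative):

```scala
// where() is a narrow transformation: it preserves the existing partitioning,
// so the write that follows produces one small file per original partition.
println(visits.rdd.getNumPartitions)   // e.g. 165 for a large input
println(filtered.rdd.getNumPartitions) // still 165 after the filter
```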