Spark and Small Files

In my previous post I showed a short code example and asked what might be wrong with it, assuming that the input ( my_website_visits ) is very big and that we filter most of it out with the ‘where’ clause. The answer, of course, is that the code may produce a large number of small files. Why? Because we are reading a large input, the number of tasks will be quite large. Filtering out most of the data and then writing it leaves the number of tasks unchanged, since no shuffle was done. This means each task writes only a small amount of data, which means small files on the target path. If, in the example above, Spark created 165 tasks to handle our input, then even after filtering most of the data, the output of this process will be at least 165 files with only a few KB in each. What is the problem with a lot of small files? Well, first of all, the writing itself is inefficient: more files means unneeded overhead in resources and time. If you’re storing your output on the… Continue reading Spark and Small Files
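The original code block did not survive into this excerpt, but the mechanics the post describes can be modeled in plain Python. This is a toy stand-in for Spark partitions, not Spark code: a narrow transformation like where/filter runs per partition, so the partition count, and therefore the output file count, is unchanged by the filter.

```python
# Toy model (plain Python, not Spark): 165 partitions standing in for 165 tasks.
partitions = [list(range(i * 100, (i + 1) * 100)) for i in range(165)]

# Filter out most rows, exactly like the 'where' clause in the post.
# The filter is applied inside each partition; no partitions are merged.
filtered = [[row for row in part if row % 1000 == 0] for part in partitions]

print(len(filtered))                  # still 165 -- one (tiny) file per task
print(sum(len(p) for p in filtered))  # 17 -- only a handful of rows survived
```

Each of the 165 surviving partitions would still be written as its own file, most of them nearly empty.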

Coalesce with care…

Coalesce Vs. Repartition in SparkSQL

Here is a quick Spark SQL riddle for you: what do you think could be problematic in the next Spark code (assume that the Spark session was configured in an ideal way)? Hint 1: the input data (my_website_visits) is quite big. Hint 2: we filter out most of the data before writing. I’m sure you got it by now: if the input data is big, and Spark is configured in an ideal way, it means that my Spark job has a lot of tasks, which means that the writing is also done from multiple tasks. This probably means that the output will be a large number of very small Parquet files. Small files are a known problem in the big data world. It takes an unnecessarily large amount of resources to write this data, but more importantly, it takes a large amount of resources to read it (more IO, more memory, more runtime…). This is how it looks in the Spark UI. In this case we have 165 tasks, which means that we can have up to 165 output files. How would you improve this? Instead of writing from multiple workers, let’s write from a single worker. How… Continue reading Coalesce with care…
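The single-worker fix being hinted at is coalesce. The key difference between the two operations in the post's title is that coalesce merges existing partitions without a shuffle, while repartition redistributes every row. A toy plain-Python model of that difference (this is an illustrative simplification, not Spark's actual merge strategy):

```python
def coalesce(partitions, n):
    """Merge existing partitions into n buckets; rows stay grouped by their
    original partition, mimicking how Spark's coalesce avoids a shuffle."""
    n = min(n, len(partitions))
    buckets = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        buckets[i % n].extend(part)  # whole partitions are merged, never split
    return buckets

def repartition(partitions, n):
    """Redistribute every individual row across n partitions (a full shuffle)."""
    buckets = [[] for _ in range(n)]
    for i, row in enumerate(r for part in partitions for r in part):
        buckets[i % n].append(row)
    return buckets

parts = [[1, 2], [3], [4, 5, 6], [7]]
print(len(coalesce(parts, 2)))     # 2 output "files", no rows moved between groups
print(len(repartition(parts, 2)))  # 2 output "files", but every row was reshuffled
```

In Spark itself, `df.coalesce(1)` before the write would collapse the 165 tasks into a single writing task, which is exactly why it should be used with care on large data.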

Quick tip: Easily find data on the data lake when using AWS Glue Catalog

Finding data on the data lake can sometimes be a challenge. At my current workplace (ZipRecruiter) we have hundreds of tables on the data lake, and the number grows every day. We store the data on AWS S3 and we use the AWS Glue Catalog as the metadata store for our Hive tables. But even with the Glue Catalog, finding data on the data lake can still be a hassle. Let’s say I am trying to find a certain type of data, like ‘clicks’ for example. It would be very nice to have an easy way to get all the clicks-related tables (including aggregation tables, join tables and so on…) to choose from. Or perhaps I would like to know which tables were generated by a specific application. There is no easy way to find these tables by default. But here is something pretty cool that I recently found out about the Glue Catalog that can help. If you add properties to Glue tables, then you can search tables based on those properties. For example, if you add the property “clicks” to all the clicks-related tables, then you can get all of those tables as a result by searching the phrase “clicks”… Continue reading Quick tip: Easily find data on the data lake when using AWS Glue Catalog
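Glue exposes this search through its SearchTables API. A minimal boto3 sketch of the idea is below; it assumes valid AWS credentials and an existing Glue catalog, so treat it as illustrative wiring rather than a drop-in script:

```python
import boto3

glue = boto3.client("glue")

# Search for tables whose names or properties (table Parameters) match "clicks".
# Table properties can be set in the Glue console or via glue.update_table().
resp = glue.search_tables(SearchText="clicks")
for table in resp["TableList"]:
    print(table["DatabaseName"], table["Name"])
```

For large catalogs the response is paginated, so a real script would follow the `NextToken` field in the response.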

The right way to use Spark and JDBC

A while ago I had to read data from a MySQL table, do a bit of manipulation on that data, and store the results on disk. The obvious choice was to use Spark; I was already using it for other stuff and it seemed super easy to implement. This is more or less what I had to do (I removed the part that does the manipulation for the sake of simplicity). Looks good, only it didn’t quite work: either it was super slow or it totally crashed, depending on the size of the table. Tuning Spark and the cluster properties helped a bit, but it didn’t solve the problems. Since I was using AWS EMR, it made sense to give Sqoop a try, since it is one of the applications supported on EMR. Sqoop performed much better almost instantly; all you need to do is set the number of mappers according to the size of the data, and it worked perfectly. Since both Spark and Sqoop are based on the Hadoop MapReduce framework, it was clear that Spark could work at least as well as Sqoop, I only needed to find out how to do… Continue reading The right way to use Spark and JDBC
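The fix the post is heading toward is Spark's partitioned JDBC read: passing `column`, `lowerBound`, `upperBound` and `numPartitions` to `spark.read.jdbc` so the table is read by many tasks in parallel instead of one. Under the hood Spark turns those options into one WHERE clause per partition; here is a simplified plain-Python model of that splitting (the exact clause text is my approximation of Spark's documented behavior, not code copied from Spark):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Approximate how Spark splits [lowerBound, upperBound) on a numeric
    partitionColumn into one SQL predicate per parallel JDBC read."""
    if num_partitions < 2:
        return []  # a single partition reads the whole table, no predicate
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lo + stride
        if i == 0:
            # first partition also picks up NULLs and anything below lowerBound
            preds.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # last partition is open-ended, covering anything above upperBound
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {hi}")
    return preds

for p in jdbc_partition_predicates("id", 0, 1000, 4):
    print(p)
```

Note that the bounds only shape how the work is split; rows outside [lowerBound, upperBound) are still read, by the first and last partitions. Picking a well-distributed partition column is what makes the parallel read fast.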