Quick tip: Easily find data on the data lake when using AWS Glue Catalog

Finding data on the data lake can sometimes be a challenge. At my current workplace (ZipRecruiter) we have hundreds of tables on the data lake, and the number grows every day. We store the data on AWS S3 and use the AWS Glue Catalog as the metadata store for our Hive tables. But even with the Glue Catalog, finding data on the data lake can still be a hassle. Let’s say I am trying to find a certain type of data, like ‘clicks’ for example. It would be very nice to have an easy way to get all the clicks-related tables (including aggregation tables, join tables and so on) to choose from. Or perhaps I would like to know which tables were generated by a specific application. There is no easy way to find these tables by default. But here is something pretty cool that I recently found out about the Glue Catalog that can help: if you add properties to Glue tables, you can search tables based on those properties. For example, if you add the property “clicks” to all the clicks-related tables, you can get all of those tables back by searching for the phrase “clicks”. Continue reading Quick tip: Easily find data on the data lake when using AWS Glue Catalog
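A minimal sketch of the idea with boto3 (the `analytics` database, `clicks_daily_agg` table, and `domain` property name are made-up examples): one call tags a table with a property, and `search_tables` then finds every table carrying that text.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Tag a table with a custom property. update_table requires a full TableInput,
# so copy the existing definition and drop the read-only fields that TableInput
# does not accept (the exact set may vary by boto3 version).
table = glue.get_table(DatabaseName="analytics", Name="clicks_daily_agg")["Table"]
read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
             "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
table_input = {k: v for k, v in table.items() if k not in read_only}
table_input["Parameters"] = {**table.get("Parameters", {}), "domain": "clicks"}
glue.update_table(DatabaseName="analytics", TableInput=table_input)

# Search the catalog: SearchText matches against table metadata, including
# property values, so every table tagged "clicks" shows up.
token = None
while True:
    kwargs = {"SearchText": "clicks"}
    if token:
        kwargs["NextToken"] = token
    resp = glue.search_tables(**kwargs)
    for t in resp["TableList"]:
        print(t["DatabaseName"], t["Name"])
    token = resp.get("NextToken")
    if not token:
        break
```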

The right way to use Spark and JDBC

A while ago I had to read data from a MySQL table, do some manipulation on that data, and store the results on disk. The obvious choice was to use Spark: I was already using it for other stuff and it seemed super easy to implement. This is more or less what I had to do (I removed the manipulation part for the sake of simplicity; see the sketch below). Looks good, only it didn’t quite work. Either it was super slow or it crashed entirely, depending on the size of the table. Tuning Spark and the cluster properties helped a bit, but it didn’t solve the problem. Since I was using AWS EMR, it made sense to give Sqoop a try, since it is one of the applications supported on EMR. Sqoop performed much better almost instantly; all you need to do is set the number of mappers according to the size of the data, and it worked perfectly. Since both Spark and Sqoop are based on the Hadoop MapReduce framework, it was clear that Spark could work at least as well as Sqoop. I only needed to find out how to do… Continue reading The right way to use Spark and JDBC
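The original snippet is not included in this excerpt; below is a minimal PySpark sketch of what a naive JDBC read like this typically looks like (the connection URL, table name, credentials, and output path are all placeholders). Without partitioning options, Spark pulls the whole table through a single connection into a single partition, which matches the slow-or-crashing behavior described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-read").getOrCreate()

# Naive JDBC read: no partitioning options, so Spark reads the entire
# table through one connection into a single partition.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/mydb")  # placeholder host/db
    .option("dbtable", "my_table")                    # placeholder table
    .option("user", "user")                           # placeholder credentials
    .option("password", "password")
    .option("driver", "com.mysql.jdbc.Driver")
    .load()
)

# ... manipulation omitted, as in the post ...

df.write.parquet("s3://my-bucket/output/")  # placeholder output path
```

For comparison, Sqoop’s `--num-mappers` flag is what splits the import across parallel tasks; Spark’s JDBC source has the analogous `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options, which is presumably where the full post is heading.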

How to properly collect AWS EMR metrics?

Working with AWS EMR has a lot of benefits, but when it comes to metrics, AWS currently does not supply a proper solution for collecting cluster metrics from EMRs. Well, there is AWS CloudWatch of course, which works out of the box and gives you loads of EMR metrics. The problem with CloudWatch is that it doesn’t give you the ability to follow metrics per business unit or tag, only per specific EMR cluster ID. This simply means you cannot compare the metrics over time, only look at specific EMRs. Let me explain the problem again. A common use of EMR is that you write some kind of code that will be executed inside an EMR cluster and triggered at a given interval, let’s say every 5 hours. This means that every 5 hours a new EMR cluster, with a new ID, will be spawned. In CloudWatch you can see each of these clusters individually but not in a single graph, which is definitely a disadvantage. Just to note, I am referring only to machine metrics, like memory, CPU and disk. Other metrics, like JVM metrics or business metrics, are usually collected by the process itself and… Continue reading How to properly collect AWS EMR metrics?
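The excerpt cuts off before the post’s actual solution, but one way to get tag-level graphs (an illustration of the idea, not necessarily the approach the post takes) is to publish machine metrics yourself with a stable dimension, such as the business unit, instead of the cluster ID. The namespace, metric name, and business-unit value below are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region is an assumption

def publish_memory_metric(used_percent: float, business_unit: str) -> None:
    """Publish a memory gauge keyed by business unit rather than cluster ID,
    so successive EMR clusters land on the same CloudWatch time series."""
    cloudwatch.put_metric_data(
        Namespace="Custom/EMR",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "MemoryUsedPercent",
                "Dimensions": [{"Name": "BusinessUnit", "Value": business_unit}],
                "Value": used_percent,
                "Unit": "Percent",
            }
        ],
    )

publish_memory_metric(63.5, "clicks-pipeline")  # example values
```

Because the dimension stays constant across cluster launches, every 5-hourly EMR run appends to the same series, and the metrics become comparable over time in a single graph.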

Best code convention syndrome

Developers often tend to think that one coding convention is better than another in terms of readability. Some people think that putting a line break before the curly braces is more coherent. Some like camelCase, some hate it. The fact of the matter is, there isn’t any proof that one convention gives better readability than another. Now don’t misunderstand me: I am all for coding conventions, coding conventions are a good thing, but the holy wars over which convention is better seem redundant to me. If indeed one format is better than another, it has a lot less effect on your performance than you would like to think. The truth is that the convention you prefer gives you better performance simply because you are used to it. The best format to use, the one that gives the best results, is the one you are most comfortable with, even if it is not objectively the best one (if there is indeed a best one). Our brain always searches for familiar structures so it has to work less. When you read a piece of code, your mind finds templates that exist in your memory to ease the job… Continue reading Best code convention syndrome