You’ve likely heard about the benefits of partitioning data by a single dimension to boost retrieval performance. It’s a common practice in relational databases, NoSQL databases, and, notably, data lakes. For example,…
Category: Data Lake
Spark and Small Files
In my previous post I have showed this short code example: And I asked what may be the problem with that code, assuming that the input ( my_website_visits ) is very big…
Quick tip: Easily find data on the data lake when using AWS Glue Catalog
Finding data on the data lake can sometimes be a challenge. At my current workplace (ZipRecruiter) we have hundreds of tables on the data lake and it’s growing each day. We store…