Skip to content

aviyehuda.com

Menu
  • Open Source
  • Android
  • Java
  • Others
  • Contact Me
  • About Me
Menu

Quick tip: Easily find data on the data lake when using AWS Glue Catalog

Posted on 15/01/2021

Finding data on the data lake can sometimes be a challenge. At my current workplace (ZipRecruiter) we have hundreds of tables on the data lake and it’s growing each day. We store the data on AWS S3 and we use AWS Glue Catalog as meta data for our Hive tables.

But even with Glue Catalog, finding data on the data lake can still be a hustle. Let’s say I am trying to find a certain type of data, like ‘clicks’ for example. It would be very nice to have an easy way to get all the clicks related tables (including aggregation tables, join tables and so on..) so i could choose from. Or perhaps I would like to know which tables were generated by a specific application. There is no easy way to find these table by default.

But here is something pretty cool that I recently found about Glue Catalog that can help.
If you add properties to glue tables, then you can search tables based on those properties.

For example, if you would add the property “clicks” to all the job related tables, then you can get all of those tables as a result by searching the phrase “clicks” in GlueCatalog.

You can also add property like “Application: ClicksGenerator” to all of the tables that were generated by the ClicksGenerator application.
Other ideas for labels may be: team names, last update date, data lag, data update frequency, and so on…

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *


About Me

REFCARD – Code Gems for Android Developers

Categories

  • Android
  • AWS
  • AWS EMR
  • bluetooth
  • Chrome extension
  • ClientSide
  • Clover
  • Coding Coventions
  • Data Lake
  • General
  • GreaseMonkey
  • Hacks
  • hibernate
  • hibernate validator
  • HTML5
  • HtmlUnit
  • Image Manipulation
  • Java
  • Java Technologies
  • JavaScript
  • Java_Mail
  • JEE/Network
  • Job searching
  • Open Source
  • Pivot
  • projects
  • Pure Java
  • software
  • Spark
  • Trivia
  • Web development

Archives

  • March 2022 (1)
  • January 2022 (1)
  • January 2021 (1)
  • December 2018 (1)
  • August 2018 (1)
  • October 2013 (1)
  • March 2013 (1)
  • January 2013 (2)
  • July 2012 (1)
  • April 2012 (1)
  • March 2012 (1)
  • December 2011 (1)
  • July 2011 (1)
  • June 2011 (1)
  • May 2011 (2)
  • January 2011 (1)
  • December 2010 (1)
  • November 2010 (3)
  • October 2010 (4)
  • July 2010 (1)
  • April 2010 (2)
  • March 2010 (1)
  • February 2010 (2)
  • January 2010 (5)
  • December 2009 (10)
  • September 2009 (1)
 RSS Feed
6dc4b508c301309b9e838ba10393207a-332
©2023 aviyehuda.com | Design: Newspaperly WordPress Theme