Data Engineering: Strategies for data retrieval on multi-dimensional data

You’ve likely heard about the benefits of partitioning data by a single dimension to boost retrieval performance. It’s a common practice in relational databases, NoSQL databases, and, notably, data lakes. For example, a very common dimension to partition data in data lakes is by date or time. However, what if your data querying requirements involve multiple dimensions. Let’s say you wish to query your’e data by field A and also by field B or sometimes by field A but other times by field B.
In this post I’ll go over several common options for such case.
For the sake of connivance I’ll give examples on how to implement it on the data lake using standard folder names and parquet to hold the data. You should know however that the paradigms are also valid for other areas like DBs, NoSQL DBs, memory storage and so on.

The default: micro partitions

Micro-partitions is a technique used to sub-partition data within a dataset. Each micro-partition contains metadata for individual fields, providing valuable information for data consumption performance optimization.

For instance, consider a scenario where data is organized into daily partitions stored in Parquet files.

<dataset_root>/day=20240101/data1.parquet
<dataset_root>/day=20240101/data2.parquet
<dataset_root>/day=20240101/data3.parquet

In this setup, each Parquet file (or even each page within a Parquet file) can be referred to as a micro-partition. Parquet files inherently store metadata per file and per page, which can enhance data consumption performance.

Snowflake, also employs micro-partitions by default, only it uses richer metadata and superior indexing capabilities than the simple parquet files. This enhanced metadata and indexing within Snowflake’s micro-partitions contribute to significant performance gains, making micro-partitions a highly beneficial feature within the platform.

The obvious approach: nested partitions

Let’s start with nested partitions. In a typical Hive partitions structure, it looks like this:

<dataset_root>/<FieldA>=<value>/<FieldB>=<value>/data.parquet

While this works well for consistent queries involving both Field A and Field B, it falls short when you need the flexibility to query either field separately. For instance:

//Good for this:
Spark.sql("select * from my_data_set where FieldA=11 and FieldB=22 ")

//Not so good for this:
Spark.sql("select * from my_data_set where FieldA=11")
Spark.sql("select * from my_data_set where FieldB=22")

The reason this method is not usfull for these cases is that for the 2nd type query, all partitions need to be scanned which makes it not as usuful.

The opposite approach: data duplication with separate partitions

Another approach involves duplicating the data and partitioning it once by Field A and once by Field B. The directory structure in a hive like structure might look like this:

<dataset_root>/<FieldA>=<value>/data.parquet

and

<dataset_root>/<FieldB>=<value>/data.parquet

It represents the opposite of the previous option, meaning:

// Good for this:
Spark.sql("select * from my_data_set where FieldA=11"); 
Spark.sql("select * from my_data_set where FieldB=22");

// Not good for this:
Spark.sql("select * from my_data_set where FieldA=11 and FieldB=22 ");

Also, maintaining data consistency becomes more challenging in this scenario.

Best of Both Worlds? Partitioning by field A + externally indexing by field B

A widely adopted strategy in databases. The advantage here is that the index serves as a reference to the data, not a copy of it.

In the data lake world it means partition the data by fieldA, same as before

<dataset_root>/<FieldA>=<value>/data.parquet

And in addition mantaining a slim dataset which reffrances the same data files by fieldB values.

In datalakes It’s possible to implement it yourself, although usually it is implemented using some additional data catalog. This is also one of the advantages of using lakehouses(like databricks data lakehouse) since you get it out of the box.
It’s ideal for cases where you need to query the data based on specific values for field B.

Spark.sql("select * from my_data_set where FieldB=22");

However, it’s less suitable for queries involving a range of values for field B,

Spark.sql("select * from my_data_set where FieldB>22");

The reason it is not as useful is since the indexed keys are not stored in a continuous manner on the machine like partitions usually do.

Often useful: partitioning by field A + sorting or bucketing by range field B:

This is an improvement over the micro partitions approach. Here we partition the data by fieldA as you normally do, but you make sure that inside each partition the data is sorted by fieldB or grouped by range by field B.

Here is one example of how to implement it using Spark:

    // first partition the data by fieldA and than sort it by fieldB
    val sortedDF = df.repartitionByRange($"fieldA", $"fieldB").

    // than write the data in a partitioned manner
    sortedDF.write
      .mode(SaveMode.Overwrite) 
      .partitionBy("fieldA")
      .parquet("/dataset_root")

In the example above, data will be written partitioned by field A. But inside each partitioned the data will be divided to files (micro-partitioned) also by field B.

The used theologies need to support this of course. In case of parquet it works well since parquet holds metadata for each field which includes min and max values. Most of the technologies (like apache spark) take this into account so they are able to skip files which do not include the required values for field-B.

This is a solid choice for various use cases, while it is not the best approach for queries like this:

Spark.sql("select * from my_data_set where FieldB=22 ");

Spark.sql("select * from my_data_set where FieldB>22 ");

Since it means going over all partitions. However since the data is grouped by fieldB within the partitions at least some of the files may be skipped.
This approach is particularly useful when field B contains a wide range of possible values. It can also be a beneficial design when field B’s values are unevenly distributed (skwed).

This is why this paradigm is very common in multiple technologies, for example: clustering in BigQuery, sort key in DynamoDB. clustering inside micro partitions in snowflake and so on.

The secret weapon – Z-order

A less common but important option is to index or partition by a Z-order. In this case, the data partition will be a composite of both fields A and B:

<dataset_root>/<A combination of FieldA+fieldB>

This method is actually ideal for all of the query types mentioned so far.
The secret is in the method which combines the 2 fields together, it makes sure that keys with similar values are stored in proximity to one another, and this holds true for both fields that make up the partition. So no matter wether your’e retrieving data based on one of the fields or both, whether you need a precise value or range of values, this method will surely help.

The implementation of this is not very common though and quite complex. Currently the most common implementations are by hosted environments like Databricks lakehouse.

Conclusion:

Choosing the right strategy for multi-dimensional data querying depends on your specific use case. Each approach has its strengths and trade-offs. Whether you go for nested partitions, data duplication, external indexing, sorting, or Z-indexing, understanding these strategies equips you to make informed decisions based on your data lake architecture and querying needs.