3 Things To Consider When Partitioning Your Data Lake

Evan Morris / 3 min read.
February 22, 2022

Partitioning is a common way to speed up queries on a data lake. Managed query services such as Amazon Athena recommend it as a best practice for query performance. Their partitioning documentation can make it look easy to implement, and may give the impression that once you have partitioned a table you can forget about it. However, as the size and variety of your data grow, things can get more complicated than expected. In this article, we discuss traditional partitioning practices and their problems. We also look at indexing and nano-blocking technologies that readers could consider to mitigate those problems.

What does typical data lake partitioning look like?

Data lake partitioning is a technique designed to improve query performance by distributing files into separate storage locations according to the value of a selected column. For example, if you have customer data and partition by residential postcode, the customer data is split into one location per postcode value. A query that filters on postcode then reads only the matching locations instead of the entire dataset, which reduces cost and speeds up the response.
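To make the idea concrete, here is a minimal Python sketch of the storage layout and of how a filter narrows the read scope. The sample records and function names are made up for illustration, and the `postcode=<value>` folder naming is the Hive-style convention rather than anything specific to this article.

```python
from collections import defaultdict

# Hypothetical sample rows; in a real lake these would live in files on S3.
CUSTOMERS = [
    {"id": 1, "name": "Ann", "postcode": "12345"},
    {"id": 2, "name": "Bob", "postcode": "67890"},
    {"id": 3, "name": "Cy", "postcode": "12345"},
]


def partition_rows(rows, column, base="s3://SAMPLE_BUCKET/users"):
    """Group rows under one Hive-style prefix per distinct value of `column`."""
    layout = defaultdict(list)
    for row in rows:
        prefix = f"{base}/{column}={row[column]}/"
        layout[prefix].append(row)
    return dict(layout)


def prefixes_to_scan(layout, column, value):
    """Partition pruning: keep only the prefixes that can match the filter."""
    wanted = f"/{column}={value}/"
    return [prefix for prefix in layout if prefix.endswith(wanted)]


layout = partition_rows(CUSTOMERS, "postcode")
print(sorted(layout))
print(prefixes_to_scan(layout, "postcode", "12345"))
```

A query for postcode 12345 only has to read one of the two prefixes here; the rows stored under the other prefix are never touched.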

In Athena, you can partition your data before or after creating a table. When you create a new table, you declare the partition columns in the PARTITIONED BY clause of the CREATE TABLE statement.
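A table definition with a partition column might look like the statement built by the following Python sketch. The table name `users`, the `id` and `name` columns, the Parquet format, and the helper name `create_table_ddl` are assumptions for illustration; only the bucket path and the postcode column come from this article.

```python
def create_table_ddl(table, columns, partition_columns, location):
    """Build an Athena CREATE EXTERNAL TABLE statement with PARTITIONED BY.

    `columns` and `partition_columns` are lists of (name, type) pairs.
    """
    regular = {name for name, _ in columns}
    for name, _ in partition_columns:
        # A partition column is declared only in PARTITIONED BY,
        # not repeated in the regular column list.
        if name in regular:
            raise ValueError(f"{name} must not appear in both column lists")
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "STORED AS PARQUET\n"  # assumed file format
        f"LOCATION '{location}'"
    )


ddl = create_table_ddl(
    table="users",  # hypothetical table name
    columns=[("id", "string"), ("name", "string")],
    partition_columns=[("postcode", "string")],
    location="s3://SAMPLE_BUCKET/users/",
)
print(ddl)
```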

After your table is created with Hive-compatible partitions, you need to update the metadata in the catalogue by running the MSCK REPAIR TABLE command. The command scans the table's S3 location for Hive-style partition folders (for example, postcode=12345) and adds any it finds to the table's metadata, which makes them visible to Athena queries. It takes the table name as its only argument.
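The sketch below shows the repair statement together with a small check of whether an object key follows the Hive layout that the command can discover. The table name `users`, the file name `part-0000.parquet`, and both helper names are hypothetical.

```python
def hive_partition_values(key, table_prefix):
    """Return {column: value} parsed from a Hive-style key.

    Returns None if the key is outside the table prefix or if any folder
    between the prefix and the file name is not of the form column=value.
    """
    if not key.startswith(table_prefix):
        return None
    folders = key[len(table_prefix):].split("/")[:-1]  # drop the file name
    values = {}
    for folder in folders:
        column, sep, value = folder.partition("=")
        if not sep or not column or not value:
            return None
        values[column] = value
    return values or None


def msck_repair_statement(table):
    """The statement that registers Hive-style partitions for `table`."""
    return f"MSCK REPAIR TABLE {table}"


print(hive_partition_values("users/postcode=12345/part-0000.parquet", "users/"))
print(hive_partition_values("users/postcode/12345/part-0000.parquet", "users/"))
print(msck_repair_statement("users"))
```

The second key uses plain folder names instead of `postcode=12345`, so the check returns None; data laid out that way is the case where the repair command cannot help.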

When your data is not laid out in Hive format, you cannot use the MSCK command. Instead, you have to add each partition manually by altering the table. For example, to load the data in s3://SAMPLE_BUCKET/users/postcode/12345, you run an ALTER TABLE ADD PARTITION statement that names the partition value and its S3 location.

Every additional partition has to be added manually with the same kind of statement.
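As a rough sketch, the statement for the example path could be generated as follows. The table name `users` and the helper name are assumptions, and the code simply rejects values that contain a single quote instead of trying to escape them.

```python
def add_partitions_statement(table, column, values, base_location):
    """Build one ALTER TABLE statement that registers several partitions.

    Each value maps to the non-Hive folder <base_location>/<column>/<value>/.
    """
    clauses = []
    for value in values:
        if "'" in value:
            raise ValueError(f"unsupported character in partition value: {value!r}")
        clauses.append(
            f"PARTITION ({column} = '{value}') "
            f"LOCATION '{base_location}/{column}/{value}/'"
        )
    return f"ALTER TABLE {table} ADD IF NOT EXISTS\n  " + "\n  ".join(clauses)


statement = add_partitions_statement(
    table="users",  # hypothetical table name
    column="postcode",
    values=["12345"],
    base_location="s3://SAMPLE_BUCKET/users",
)
print(statement)
```

Passing more values in the list produces one statement with several PARTITION clauses, which saves a round trip per partition.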

Problems of traditional partitioning

As we have seen, creating a new partitioned table is not difficult. However, what if you want to change the partition column later? And if the data layout is not Hive compatible, you have to add each partition manually with a SQL statement. Traditionally, one way to change the partition column is to recreate the table with a CTAS (CREATE TABLE AS SELECT) query. For example, let's say we want to partition by country_code instead.
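A CTAS statement for that change could be built along these lines. The source table `users`, the target table `users_by_country`, its S3 location, the column list, and the Parquet format are all assumptions for illustration.

```python
def ctas_repartition(source, target, columns, partition_column, location):
    """Build an Athena CTAS statement that rewrites `source` into `target`,
    partitioned by `partition_column`."""
    # Athena expects partition columns to come last in the SELECT list.
    ordered = [c for c in columns if c != partition_column] + [partition_column]
    select_list = ", ".join(ordered)
    return (
        f"CREATE TABLE {target}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"  # assumed output format
        f"  external_location = '{location}',\n"
        f"  partitioned_by = ARRAY['{partition_column}']\n"
        ") AS\n"
        f"SELECT {select_list}\n"
        f"FROM {source}"
    )


ctas = ctas_repartition(
    source="users",  # hypothetical existing table
    target="users_by_country",  # hypothetical new table
    columns=["id", "country_code", "name", "postcode"],
    partition_column="country_code",
    location="s3://SAMPLE_BUCKET/users_by_country/",
)
print(ctas)
```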

After recreating the table, you may want to delete the old partitioned data from S3. The CTAS query has to scan all of your data to repartition it, which makes it an expensive operation.

Also, when your data layout is incompatible with Hive, manually adding each partition can become demanding maintenance work. You can write scheduled jobs that detect new values in the partition column and run the add-partition statement, but some teams may see that as a maintenance burden of its own.
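The core of such a scheduled job can be sketched in a few lines of Python. This version assumes the job has already collected two lists: the folder names found under the table's S3 location, and the partition values already registered (for example from the output of SHOW PARTITIONS). How those lists are fetched is left out, and all names are hypothetical.

```python
def plan_partition_updates(table, column, s3_folders, registered, base_location):
    """Return one ALTER TABLE statement per folder that is not yet registered."""
    new_values = sorted(set(s3_folders) - set(registered))
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({column} = '{value}') "
        f"LOCATION '{base_location}/{value}/'"
        for value in new_values
    ]


statements = plan_partition_updates(
    table="users",  # hypothetical table name
    column="postcode",
    s3_folders=["12345", "67890", "24680"],  # found by listing S3
    registered=["12345"],  # already in the catalogue
    base_location="s3://SAMPLE_BUCKET/users/postcode",
)
for statement in statements:
    print(statement)
```

Running the job when nothing has changed produces an empty list, so it is safe to schedule it frequently.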

Indexing and nano-blocking

As discussed above, partitioning in Athena needs ongoing attention. As the variety and size of your data grow, it is likely to take up more of your data team's time. To avoid this, you might look at third-party solutions. One example is Varada's indexing and nano-blocking technology.

Their product offers an autonomous indexing feature based on nano-blocking. In contrast to the partition-based approach, which is limited to a few columns, their indexing lets users choose any column. It automatically decides which data to index and which index to use on each nano-block, where a nano-block consists of 64K rows of a single column.

Then, each nano-block is mapped back to the original data. A nano-block can contain metadata, index, Lucene, and transformation.

Each nano-block can use a different indexing algorithm, including bitmap, dictionary, trees, and others, and their technology decides which one is best for each block.
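The general idea of choosing an index per block can be illustrated with a toy Python sketch. This is not Varada's algorithm: the selection rule and its thresholds are invented for illustration, and "64K rows" is read here as 65,536, which is also an assumption.

```python
BLOCK_ROWS = 64 * 1024  # assumed reading of "64K rows"


def split_into_blocks(column_values, block_rows=BLOCK_ROWS):
    """Cut a single column into consecutive fixed-size blocks."""
    return [
        column_values[start:start + block_rows]
        for start in range(0, len(column_values), block_rows)
    ]


def choose_index(block):
    """Toy heuristic (invented): pick an index type from the block's cardinality."""
    distinct = len(set(block))
    if distinct <= 16:
        return "bitmap"  # very few distinct values
    if distinct <= len(block) // 2:
        return "dictionary"  # many repeated values
    return "tree"  # mostly unique values


def plan_indexes(column_values, block_rows=BLOCK_ROWS):
    """Return the chosen index type for each block of the column."""
    return [choose_index(b) for b in split_into_blocks(column_values, block_rows)]


# Small blocks of 100 rows keep the demonstration readable.
column = (
    ["US"] * 100
    + [f"city-{i % 40}" for i in range(100)]
    + [f"id-{i}" for i in range(100)]
)
print(plan_indexes(column, block_rows=100))
```

The point of the sketch is only that the decision is made block by block, so different parts of the same column can end up with different index types.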

Wrapping up

This article discussed the challenges people can face with traditional data lake partitioning. Although partitioning can improve query performance, it is not free. Data teams have to dedicate time to planning and maintaining partitions for each table, and that becomes a big challenge as the types and amounts of data increase. Auto-indexing with nano-blocking is designed to minimize this maintenance burden. If partitioning starts consuming your data team's resources, adopting these newer technologies can be a good option.
