Datafloq

3 Things To Consider When Partitioning Your Data Lake

Evan Morris / 3 min read.
February 22, 2022

Partitioning is a common way to speed up queries on a data lake. Managed query services such as Amazon Athena recommend it as a best practice for query performance. Their partitioning documentation can make it look easy to implement, and may give the impression that once you set up a partition you can forget about it. However, as the size and variety of your data grow, things can get more complicated than expected. In this article, we discuss traditional partitioning practices and their problems. We also look at indexing and nano-blocking technologies that readers could consider to mitigate those problems.

What does typical data lake partitioning look like?

Data lake partitioning is a technique designed to improve query performance by distributing files into separate storage locations according to the value of a selected column. For example, if you have customer data and partition by residential postcode, the files are split by postcode value. A query that filters on postcode then reads only the matching locations instead of the entire dataset, which reduces cost and speeds up the response.

In Athena, you can partition your data before or after creating a table. When you create a new table, you define the partition column in the CREATE TABLE statement with a PARTITIONED BY clause.
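The original listing for this step is not available, so here is a minimal sketch of what such a statement could look like. It is written as a small Python helper that assembles the Athena DDL as a string. The table name `users`, the column names, the Parquet format, and the bucket `SAMPLE_BUCKET` are illustrative assumptions, not details taken from a real schema.

```python
# Sketch: build an Athena CREATE TABLE statement with one partition column.
# Table, column, format, and bucket names are illustrative assumptions.

def create_partitioned_table_sql(table, columns, partition_column, location):
    """Return Athena DDL for an external Parquet table partitioned by one column."""
    part_name, part_type = partition_column
    column_lines = ",\n".join(f"  {name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"{column_lines}\n"
        f")\n"
        # The partition column is declared here, not in the regular column list.
        f"PARTITIONED BY ({part_name} {part_type})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )

ddl = create_partitioned_table_sql(
    table="users",
    columns=[("id", "bigint"), ("name", "string"), ("country_code", "string")],
    partition_column=("postcode", "string"),
    location="s3://SAMPLE_BUCKET/users/",
)
print(ddl)
```

Note that the partition column appears only in the PARTITIONED BY clause. Its values come from the storage layout rather than from the data files themselves.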

After your table is created with Hive-compatible partitions, you need to update the metadata in the catalogue by running the MSCK REPAIR TABLE command. The command scans the table's S3 location for Hive-style partition folders (for example, postcode=12345) and adds any partitions it finds to the table's metadata.
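As a rough sketch, the repair statement and the folder layout it expects could be produced like this. The table name `users` and the bucket `SAMPLE_BUCKET` are assumptions carried over for illustration.

```python
# Sketch: the MSCK statement plus the Hive-style key=value prefix it looks for.
# Table and bucket names are illustrative assumptions.

def msck_repair_sql(table):
    """Return the statement that loads Hive-style partitions into the catalogue."""
    return f"MSCK REPAIR TABLE {table};"

def hive_partition_prefix(base_location, column, value):
    """Hive-compatible layouts encode each partition as a column=value folder."""
    return f"{base_location.rstrip('/')}/{column}={value}/"

repair = msck_repair_sql("users")
prefix = hive_partition_prefix("s3://SAMPLE_BUCKET/users", "postcode", "12345")
print(repair)  # MSCK REPAIR TABLE users;
print(prefix)  # s3://SAMPLE_BUCKET/users/postcode=12345/
```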

When your data is not in Hive format, you cannot use the MSCK command. Instead, you have to add each partition manually by altering the table. For example, to load the data in s3://SAMPLE_BUCKET/users/postcode/12345, you can run an ALTER TABLE ADD PARTITION query that points the partition at that location.

Each additional partition has to be added manually in the same way.
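A minimal sketch of that query, again assembled in Python, could look as follows. The table name `users` and the trailing slash on the location are assumptions. The path itself is the one from the example above.

```python
# Sketch: register one non-Hive-style S3 prefix as a partition.
# The table name is an illustrative assumption.

def add_partition_sql(table, column, value, location):
    """Return an ALTER TABLE statement mapping one partition value to a location."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS\n"
        f"PARTITION ({column} = '{value}')\n"
        f"LOCATION '{location}';"
    )

stmt = add_partition_sql(
    table="users",
    column="postcode",
    value="12345",
    location="s3://SAMPLE_BUCKET/users/postcode/12345/",
)
print(stmt)
```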

Problems of traditional partitioning

As we have seen, it is not difficult to create a new table with partitioning. However, what if you want to change the partitioning column later? Furthermore, if the data layout is not Hive compatible, you have to add each partition manually by running a SQL command. Traditionally, one option for changing the partitioning column has been to recreate the table with a CTAS (CREATE TABLE AS SELECT) query. For example, let's say we want to partition by country_code.
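A CTAS query for this could be sketched as below. The new table name `users_by_country`, the source columns, the Parquet format, and the target location are illustrative assumptions. The helper moves the partition column to the end of the SELECT list, because Athena expects partition columns to come last in a CTAS query.

```python
# Sketch: recreate a table partitioned by a different column using CTAS.
# Table names, columns, format, and location are illustrative assumptions.

def ctas_repartition_sql(new_table, source_table, columns, partition_column, location):
    """Return a CTAS statement that rewrites source_table partitioned by partition_column."""
    # Athena requires partition columns to be last in the SELECT list.
    ordered = [c for c in columns if c != partition_column] + [partition_column]
    select_list = ", ".join(ordered)
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{location}',\n"
        f"  partitioned_by = ARRAY['{partition_column}']\n"
        f") AS\n"
        f"SELECT {select_list}\n"
        f"FROM {source_table};"
    )

ctas = ctas_repartition_sql(
    new_table="users_by_country",
    source_table="users",
    columns=["id", "name", "postcode", "country_code"],
    partition_column="country_code",
    location="s3://SAMPLE_BUCKET/users_by_country/",
)
print(ctas)
```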


After recreating the table, you may want to delete the data stored under the old partitioning from S3. The CTAS query scans all of your data to repartition it, which makes it an expensive operation.

Also, when your data is not Hive compatible, adding each partition manually can become demanding maintenance work. You can write scheduled jobs that detect new values in the partition column and run the partition-adding command, but some teams will see this as a maintenance burden of its own.
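The core of such a scheduled job can be sketched in a few lines. This version only computes which statements to run. How the observed values are collected (for example, by listing S3 prefixes) and how the statements are submitted to Athena are left out. The function name, table name, and location layout are hypothetical.

```python
# Sketch: find partition values that are not registered yet and emit the
# ALTER TABLE statements for them. Names and layout are illustrative assumptions.

def missing_partition_statements(table, column, base_location, observed, registered):
    """Return one ADD PARTITION statement per observed value not yet registered."""
    base = base_location.rstrip("/")
    new_values = sorted(set(observed) - set(registered))
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({column} = '{value}') "
        f"LOCATION '{base}/{value}/';"
        for value in new_values
    ]

statements = missing_partition_statements(
    table="users",
    column="postcode",
    base_location="s3://SAMPLE_BUCKET/users/postcode/",
    observed=["12345", "67890", "12345", "24680"],  # values seen in storage
    registered={"12345"},                           # values already in the catalogue
)
for statement in statements:
    print(statement)
```

Even in this reduced form, the job has to be scheduled, monitored, and kept in sync with the storage layout, which is the maintenance cost described above.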

Indexing and nano-blocking

As discussed above, partitioning in Amazon Athena needs ongoing attention. As the variety and size of your data grow, it is likely to take up more of your data team's time. To avoid this, you might look for a third-party solution. One example is Varada's indexing and nano-blocking technology.

Their product offers an autonomous indexing feature based on nano-blocking. In contrast to the partition-based approach, which is limited to a few columns, their indexing lets users choose any column. It automatically decides which data to index and which index to use for each nano-block, a unit consisting of 64K rows of a single column.

Then, each nano-block is mapped back to the original data. A nano-block can contain metadata, index, Lucene, and transformation.

Each nano-block can use a different indexing algorithm, such as bitmap, dictionary, or tree indexes, and their technology decides which one is best for each block.
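To make the idea concrete, here is a toy illustration of per-block index selection. It is not Varada's implementation: the selection heuristic, its thresholds, and the reading of "64K" as 65,536 rows are all assumptions made purely for demonstration.

```python
# Toy sketch: split one column into fixed-size blocks and pick an index type
# per block. The heuristic and thresholds are hypothetical, not Varada's logic.

NANO_BLOCK_ROWS = 64 * 1024  # assumed meaning of "64K rows"

def split_into_nano_blocks(column_values, block_rows=NANO_BLOCK_ROWS):
    """Cut a single column into consecutive blocks of at most block_rows values."""
    return [
        column_values[start:start + block_rows]
        for start in range(0, len(column_values), block_rows)
    ]

def choose_index(block):
    """Pick an index type from the block's cardinality (hypothetical rule)."""
    distinct = len(set(block))
    if distinct <= 16:
        return "bitmap"      # very few distinct values
    if distinct / len(block) < 0.5:
        return "dictionary"  # many repeated values
    return "tree"            # mostly unique values

column = (
    ["NL"] * NANO_BLOCK_ROWS
    + [f"city-{i % 1000}" for i in range(NANO_BLOCK_ROWS)]
    + [f"user-{i}" for i in range(1000)]
)
plan = [(len(block), choose_index(block)) for block in split_into_nano_blocks(column)]
print(plan)
```

The point of the sketch is only that the choice is made per block, so different parts of the same column can end up with different index types.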

Wrapping up

This article discussed the challenges people can face when using traditional data lake solutions. Although partitioning can improve query performance, it is not free. Data teams have to dedicate time to planning and running partitioning for each table, and as the types and amounts of data increase, this can become a big challenge. Auto-indexing with nano-blocking is designed to minimize that maintenance burden. If partitioning starts consuming your data team's resources, adopting these newer technologies can be a good option.

Categories: Big Data
Tags: Big Data, data lake

About Evan Morris

Evan is known for his boundless energy and enthusiasm. He works as a freelance networking analyst and is an avid blog writer, particularly on technology, cybersecurity, and forthcoming threats that can compromise sensitive data. With vast experience in ethical hacking, Evan has been able to express his views articulately.
