Consistent databases and performant data warehouses are essential when working with big data. Data warehouse concurrency refers to a setup where many users can work simultaneously so that business intelligence can be performed in real-time and on a large scale. It’s the cornerstone for good data quality, solid evaluation, and creating a user-friendly data platform.
Concurrency in Data Warehouses
In data warehouses, the requirements are somewhat different from those of normal databases. Data warehouses focus on querying data rather than modifying it. Therefore, ACID (Atomicity, Consistency, Isolation, Durability) compliance is less strictly enforced. However, it is still relevant.
The first aim in data warehouses is that many users be able to work simultaneously on the system. A few users running ten queries against a handful of rows or tables is easy to handle, but scaling to thousands or millions of users is impossible to coordinate by hand. Everyone must be able to work with the same real-time data without negatively impacting other users. This is the only way to guarantee a modern and targeted analysis process in the age of big data.
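Why is this easier for warehouses than for transactional databases? Because analytical workloads are read-mostly, many sessions can query the same snapshot without locking each other out. A minimal sketch (a toy in-memory table, not any vendor's engine) of 100 sessions querying concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy in-memory "fact table": (user_id, revenue) rows.
ROWS = [(i % 100, i * 0.5) for i in range(10_000)]

def run_query(user_id):
    """A read-only analytical query: total revenue for one user.

    Because no session modifies ROWS, any number of these queries
    can run concurrently without locks or isolation conflicts.
    """
    return sum(rev for uid, rev in ROWS if uid == user_id)

# Simulate 100 sessions issuing queries against the same snapshot.
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(run_query, range(100)))

print(len(results))  # 100 answers, all from one consistent snapshot
```

The moment writers enter the picture (ETL loads, updates), engines need extra machinery such as snapshot isolation, which is where the real engineering effort of concurrent warehouses lies.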
Different Solutions
Firebolt
Firebolt is a 3rd generation data warehouse, the newest of the three generations. It continues the idea of fully cloud-operated SaaS technologies while increasing performance, computing choices, and control. Additionally, it provides different pricing models.
It plays an important role with respect to data warehouse concurrency. Performance does not collapse as users are added: queries should always offer a reasonable response time, regardless of whether 1, 100, or 1,000 users are involved.
Firebolt gives the following statistics for an application case:
- 1,000 queries (100 different sessions × 10 queries each) on a 4-billion-row dataset
- 0.1 sec average execution time
- An engine that costs $3.6/h
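A quick back-of-the-envelope check using only the figures above shows what that implies per query (a rough sketch; real billing depends on how long the engine actually stays up, not on summed query time):

```python
# Figures from Firebolt's example above.
queries = 1000               # 100 sessions x 10 queries
avg_seconds = 0.1            # average execution time per query
engine_cost_per_hour = 3.6   # USD

# If all queries ran back to back, total engine time would be:
total_hours = queries * avg_seconds / 3600   # ~0.028 h of engine time

# Engine cost for that workload, and the resulting cost per query:
workload_cost = total_hours * engine_cost_per_hour
cost_per_query = workload_cost / queries

print(f"{workload_cost:.2f} USD total, {cost_per_query:.6f} USD/query")
# -> 0.10 USD total, 0.000100 USD/query
```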
Firebolt achieves this through modern technologies and methods, such as sparse indexing, granular data pruning, and vectorized processing. This guarantees unlimited manual scaling of users and fast usage even in times of big data.
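The core idea behind sparse indexing and data pruning can be shown in a few lines (a simplified sketch, not Firebolt's actual implementation): data is stored in sorted blocks, and the index keeps only the min/max key per block, so a query can skip every block whose range cannot match.

```python
# Sorted data split into blocks; the sparse index stores one
# (min_key, max_key) entry per block instead of one entry per row.
BLOCK_SIZE = 4
data = list(range(32))  # already sorted by key
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
sparse_index = [(b[0], b[-1]) for b in blocks]

def lookup(key):
    """Scan only the blocks whose [min, max] range can contain the key."""
    scanned = 0
    for (lo, hi), block in zip(sparse_index, blocks):
        if lo <= key <= hi:          # otherwise the block is pruned
            scanned += 1
            if key in block:
                return key, scanned
    return None, scanned

value, blocks_scanned = lookup(13)
print(value, blocks_scanned)  # finds 13 after scanning 1 of 8 blocks
```

The same pruning logic is what lets column-store engines touch only a tiny fraction of a multi-billion-row table per query, which is why per-query work, and hence contention between concurrent users, stays low.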
Google BigQuery
Google’s BigQuery is a serverless, easily scalable, and cost-efficient multi-cloud data warehouse made especially for business agility. It is a 2nd generation data warehouse based entirely on easy-to-use SaaS technology. One doesn’t have to provision individual instances or virtual machines to make use of BigQuery; BigQuery allocates computing resources on its own as needed. You can also reserve computing capacity ahead of time in the form of slots, which are virtual CPUs.
Similar to Firebolt, new concepts prevent an exponential increase in response times as users are added. However, BigQuery limits a project to 100 concurrent interactive queries by default. BigQuery quotes a cost of $5.00 per TB scanned (on demand), with the first 1 TB per month free. Alternatively, a monthly flat rate with 100 slots costs $2,000.
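Using only the prices quoted above, you can work out where the flat rate starts to pay off (a sketch of the arithmetic; actual BigQuery bills depend on further factors such as storage and streaming):

```python
# Figures quoted above: $5/TB on demand (first 1 TB/month free),
# vs a $2,000/month flat rate for 100 slots.
ON_DEMAND_PER_TB = 5.00
FREE_TB = 1
FLAT_RATE = 2000.00

def on_demand_cost(tb_scanned_per_month):
    """Monthly on-demand cost after the free tier."""
    return max(tb_scanned_per_month - FREE_TB, 0) * ON_DEMAND_PER_TB

# Break-even: the flat rate pays off once monthly scanning exceeds
# 2000 / 5 + 1 = 401 TB.
break_even_tb = FLAT_RATE / ON_DEMAND_PER_TB + FREE_TB

print(on_demand_cost(100))   # 495.0 USD: on demand is cheaper here
print(break_even_tb)         # 401.0 TB per month
```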
Amazon Redshift
Amazon Redshift is a 1st generation cloud data warehouse and possibly the best-known tool on this list. With Redshift, the idea is also to process as much data from as many users as possible. However, compared to the other solutions, more manual effort is required: for example, you take care of elasticity and query scalability yourself.
Note, however, that Redshift predates the other two cloud data warehouses on this list, so it can no longer keep up with the 2nd and 3rd generation tools. For example, the limit for concurrent queries is 50 by default. Amazon quotes the price at $0.25–$13 per node on demand.
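What does a hard concurrency limit mean in practice? Queries beyond the limit queue up and run in waves. A simple queueing sketch (the 50-query limit is from the text; the workload numbers are hypothetical):

```python
import math

# Hypothetical workload: 200 users each submit one 2-second query
# at the same moment, against a default limit of 50 concurrent queries.
users = 200
query_seconds = 2.0
concurrency_limit = 50

# Queries run in waves of at most `concurrency_limit` at a time.
waves = math.ceil(users / concurrency_limit)      # 4 waves
worst_case_wait = (waves - 1) * query_seconds     # last wave queues for 6 s
total_wall_clock = waves * query_seconds          # 8 s instead of 2 s

print(waves, worst_case_wait, total_wall_clock)
```

This is why a low default concurrency ceiling matters at scale: per-query latency stays constant, but the wait time for late arrivals grows linearly with the number of users above the limit.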
Best Practices
So, what are the best practices for selecting a data warehouse when looking at concurrency? First, you need to narrow down the providers or tools themselves. If you want something efficient but also easy to use, you should look at solutions from the second and third generations. Other factors that should be taken into consideration are:
- Using cloud-native and SaaS-based technologies, so you don’t have to worry about scalability and keeping the system running.
- Elasticity through decoupled storage and user-controlled computing power for better performance.
- Knowing your business needs and metrics. This means your requirements for scalability and performance should be clear from the outset.
Of course, the performance/cost ratio should always be taken into account when making a selection. But scalability and concurrency should be priorities.
Summary
While the ACID principles should always apply to a classic database to ensure good data quality and analysis, they are no longer as strict when it comes to data warehouses. Here, the ability to provide a lot of data to many users and enable concurrent queries is the most important factor. The latest data warehouse technologies solve the problem of high concurrency through good scalability and enable companies to perform data analysis on a large scale.