Database and Data Warehousing

Data warehousing is used for reporting and data analysis. Big data, however, requires different data warehouses than the traditional ones used over the past 10-20 years. Multiple open source data warehouses and databases are available for different purposes:


Infobright offers a scalable data warehouse that can store up to 50 terabytes of data. Its data compression reaches ratios of up to 40:1, which improves performance. In addition to the open source edition, the company offers commercial products based on the same technology. It is designed especially for analysing large amounts of machine-generated data. The latest edition is capable of near real-time analysis.


Cassandra is a NoSQL database that was initially created by Facebook; today it is managed by the Apache Software Foundation. The database is mainly used by large organisations that have massive, active data sets. Companies such as Twitter, Cisco and Netflix use it. Commercial support and services are also available for Cassandra.

Apache HBase

HBase is another Apache product, and it offers linear and modular scalability. It is the non-relational data store for Hadoop. HBase is used by companies that need random, real-time read/write access to big data. The objective of HBase is to host very large tables (billions of rows by millions of columns) on commodity hardware.
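
To make the "billions of rows by millions of columns" idea more concrete, here is a minimal sketch of a sparse table in plain Python. It is not the HBase API; the `put` and `get` helpers, the row keys and the column names are hypothetical and only illustrate that such a table stores the cells that were actually written, rather than a full grid.

```python
from collections import defaultdict

# Toy stand-in for a sparse wide table: row key -> {column name -> value}.
# This is an illustration only, not how HBase is implemented or accessed.
table = defaultdict(dict)

def put(row_key, column, value):
    """Write a single cell; only cells that are written take up space."""
    table[row_key][column] = value

def get(row_key, column):
    """Random read of a single cell; returns None if it was never written."""
    return table.get(row_key, {}).get(column)

# Hypothetical rows with different sets of columns.
put("user#1001", "profile:name", "Ada")
put("user#1001", "stats:logins", 42)
put("user#1002", "profile:name", "Grace")

# Count the cells that are actually stored.
stored_cells = sum(len(cols) for cols in table.values())

print(get("user#1001", "stats:logins"))  # 42
print(get("user#1002", "stats:logins"))  # None
print(stored_cells)                      # 3
```

Each row may carry a different set of columns, so the number of possible columns can be very large without every row paying for them.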


Riak is an open source distributed database that is scalable and fault-tolerant. It is architected to replicate and retrieve data intelligently so that read and write operations remain available even when failures occur. Users can even lose access to nodes without losing data. Riak’s customers include, among others, the Danish government, Boeing and


Infinispan is a Java-based data grid platform that was designed for multi-core architectures. It provides distributed cache capabilities. The objective is to have parallel data structures make the most of multi-core and multi-processor architectures. Infinispan is not only available for Java, but also for PHP, Python, Ruby, C, etc. Infinispan also offers a REST API to make connecting with other websites easy.


Bigdata is a distributed database that can scale from a single system to thousands of machines. It is a horizontally scaled storage system, and its architecture supports data-intensive, high-performance distributed computing on commodity clusters. It can cope with many different data models, applications and workloads.


Hypertable is a NoSQL database that offers fast performance and efficient use of resources. It runs on top of a distributed file system such as Hadoop DFS and is written almost completely in C++. It offers comprehensive support for languages such as Java, PHP, Python, Perl and Ruby. Although the software is completely open source, paid support is also available.


Terrastore is a document store providing advanced scalability. It relies on industry-proven clustering technology (Terracotta) and can be accessed via the HTTP protocol. It supports event processing, range queries, data partitioning, and MapReduce querying and processing functions. All queries and updates are distributed to the nodes, thereby minimizing traffic and spreading computational load.

Apache Hive

Hive is Hadoop’s data warehouse, and it uses a SQL-like language called HiveQL. It promises easy data summarization, ad-hoc queries and other analysis of big data. It provides a mechanism to project structure onto data, while allowing map/reduce programmers to plug in custom mappers and reducers. Hive is an open source volunteer project under the Apache Software Foundation.


Globals is the free database developed by InterSystems. It is a fast and scalable NoSQL database offering multi-dimensional array storage. Its API gives a rich approach to data modelling, is easy to use and is fully adaptable. Globals is used by hundreds of thousands of sites.


Firebird is a relational database that can run on Linux, Windows and various UNIX platforms. It offers high performance and powerful language support for stored procedures and triggers.

Oracle Berkeley DB

The Oracle Berkeley DB family of open source, embeddable databases provides developers with fast, reliable, local persistence with zero administration. Berkeley DB enables the development of custom data management solutions, without the overhead traditionally associated with such custom projects.


MariaDB is a backward-compatible, drop-in replacement branch of the MySQL® Database Server. It includes all major open source storage engines plus the Maria storage engine.


H2 is a very fast Java SQL database with embedded and server modes, in-memory databases and a browser-based console application. It has a very small footprint of only 1.5 MB.


HyperSQL is a SQL relational database engine written in Java. It offers a small, fast database engine that has in-memory and disk-based tables and supports embedded and server modes. It also includes tools such as a command line SQL tool and GUI query apps.


The Drizzle project is building a database optimized for cloud and net applications. It is being designed for massive concurrency on modern multi-CPU, multi-core architectures. The code is originally derived from MySQL.


MonetDB is an open source column-oriented database management system developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is a database system for high-performance applications in data mining, OLAP, GIS, XML Query, and text and multimedia retrieval.
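
As a rough illustration of what "column-oriented" means, the sketch below stores the same hypothetical records once row by row and once column by column in plain Python. It does not use MonetDB itself; the field names and values are made up, and the point is only that an analytical query such as a sum can read the columns it needs and skip the rest.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "north", "sales": 120.0},
    {"id": 2, "region": "south", "sales": 80.0},
    {"id": 3, "region": "north", "sales": 200.0},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column only has to touch that column.
total_sales = sum(columns["sales"])

# A filtered aggregate reads two columns and ignores "id" entirely.
north_sales = sum(
    sale
    for region, sale in zip(columns["region"], columns["sales"])
    if region == "north"
)

print(total_sales)  # 400.0
print(north_sales)  # 320.0
```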


SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world.
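
A short sketch of what "serverless" and "zero-configuration" look like in practice, using the `sqlite3` module from Python's standard library. The table name, column names and values are hypothetical; an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

# Open an in-memory database: no server process and no configuration file.
conn = sqlite3.connect(":memory:")

# Hypothetical table used only for this example.
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, source TEXT, value REAL)"
)

# Using the connection as a context manager wraps the inserts in a single
# transaction: it commits on success and rolls back if an exception is raised.
with conn:
    conn.executemany(
        "INSERT INTO events (source, value) VALUES (?, ?)",
        [("sensor-a", 1.5), ("sensor-b", 2.5), ("sensor-a", 3.0)],
    )

rows = conn.execute(
    "SELECT source, SUM(value) FROM events GROUP BY source ORDER BY source"
).fetchall()

print(rows)  # [('sensor-a', 4.5), ('sensor-b', 2.5)]
```

Passing a file path instead of `":memory:"` would keep the whole database in a single ordinary file.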


RethinkDB is built to store JSON documents and to scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and it is easy to set up and learn.
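
For readers unfamiliar with joins and group by over JSON documents, here is a minimal sketch in plain Python. It deliberately does not use RethinkDB or its query language; the `authors` and `posts` documents and their fields are hypothetical, and the code only shows the kind of result such queries produce.

```python
from collections import defaultdict

# Hypothetical JSON-like documents in two "tables".
authors = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
posts = [
    {"id": 10, "author_id": 1, "title": "Notes"},
    {"id": 11, "author_id": 1, "title": "More notes"},
    {"id": 12, "author_id": 2, "title": "Compilers"},
]

# Join: attach the author's name to each post via posts.author_id = authors.id.
authors_by_id = {author["id"]: author for author in authors}
joined = [
    {"title": post["title"], "author": authors_by_id[post["author_id"]]["name"]}
    for post in posts
]

# Group by: count the joined documents per author.
posts_per_author = defaultdict(int)
for doc in joined:
    posts_per_author[doc["author"]] += 1

print(dict(posts_per_author))  # {'Ada': 2, 'Grace': 1}
```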
