Data Analytics Platforms
Several tools are available that effectively act as complete data platforms. They allow data analytics to be performed with a single, integrated package.
Hadoop is currently the best-known open-source big data tool. It supports data-intensive distributed applications that run in parallel on large clusters of commodity hardware. It is licensed under the Apache License, version 2.0. A Hadoop cluster is reliable and extremely scalable, and it processes data according to the MapReduce computational model. Hadoop is written in the Java programming language and is used by a global community of contributors.
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. It is easy for developers to use, as applications can be written in Java, Python or Scala. According to the project, programs can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark comes with several libraries: Spark SQL, Spark Streaming, the MLlib machine learning library, and GraphX. It scales to thousands of nodes and is fault-tolerant.
Storm, which was open-sourced by Twitter and is now an Apache project, is a real-time distributed computation system. It does for real-time processing what Hadoop does for batch processing: it provides a set of general primitives for performing real-time analyses. Storm is easy to use and it works with any programming language. It is very scalable and fault-tolerant.
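The "general primitives" mentioned above are called spouts (stream sources) and bolts (processing steps) in Storm's own terminology, and they are wired together into a topology. The following is a minimal single-process sketch of that idea using Python generators. It does not use Storm's API, and all function names are hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream, emitting one tuple at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: consume sentences and emit one word per output tuple."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keep a running count and emit the updated total after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences):
    """Wire spout -> split -> count, the way a topology chains its components."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


if __name__ == "__main__":
    for update in run_topology(["storm is fast", "storm scales"]):
        print(update)
```

Unlike a batch job, each result is emitted as soon as the corresponding input arrives, so the counts are always up to date. In a real deployment each component would run as many parallel tasks spread over the cluster.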
MapReduce was originally developed by Google and has since been adopted by many big data tools, Hadoop among them. It is a programming model and software framework that can process vast amounts of data in parallel across a large cluster of computer nodes. MapReduce libraries have been written in many programming languages, so the model can be used from most of them. MapReduce can work with structured and unstructured data.
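To make the model concrete, here is a minimal word-count sketch in plain Python. It runs in a single process and only imitates the map, shuffle and reduce phases. It is not Hadoop's or Google's API, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each document (or block) would be mapped on a different
    # node; here the "nodes" are simply iterations of a loop.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data tools", "big clusters process big data"]))
```

Because each map call sees only its own record and each reduce call sees only one key, both phases can be spread over many machines without coordination. That independence is what lets the model scale.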
HPCC stands for ‘High-Performance Computing Cluster’ and was developed by LexisNexis Risk Solutions. It is an alternative to Hadoop that claims to offer ‘superior performance’. Both a free and a paid version are available. It works with structured and unstructured data and scales from one to thousands of nodes, offering high-performance, parallel big data processing.
Hortonworks provides a purely open-source Hadoop distribution. It allows users to capture, process and share data at any scale and in any format in a simple and cost-effective manner. Apache Hadoop is the core component of the Hortonworks architecture.
Dremel is an interactive ad hoc query system developed by Google. It offers analysis of read-only nested data. The system is extremely scalable, to thousands of machines and petabytes of data. By combining multi-level execution trees with a columnar data layout, it can run queries over massive, trillion-row tables in a matter of seconds.
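The two ideas named above can be illustrated with a small Python sketch. It is not Dremel's implementation: it uses flat rather than nested records, and the field names and data are invented. Each "leaf" stores its shard column by column and computes a partial aggregate, and a "root" combines the partial results.

```python
def to_columns(records):
    """Pivot row-oriented records into one list per field (a columnar layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def leaf_sum(columns, value_field, filter_field, wanted):
    """Leaf server: scan only the two columns the query needs."""
    return sum(
        value
        for value, key in zip(columns[value_field], columns[filter_field])
        if key == wanted
    )


def root_sum(shards, value_field, filter_field, wanted):
    """Root server: fan the query out to every leaf and combine the partial sums."""
    return sum(leaf_sum(shard, value_field, filter_field, wanted) for shard in shards)


# Hypothetical data, split over two leaf servers.
shard_a = to_columns([
    {"url": "a.example", "country": "US", "bytes": 120},
    {"url": "b.example", "country": "DE", "bytes": 300},
])
shard_b = to_columns([
    {"url": "c.example", "country": "US", "bytes": 80},
])

total_us_bytes = root_sum([shard_a, shard_b], "bytes", "country", "US")
```

The query never touches the "url" column, which is the main benefit of a columnar layout for analytical queries that read a few fields from very wide tables.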
Apache Drill is part of the Apache Incubator. Based on Dremel, it offers a distributed system for performing interactive analyses of large-scale datasets. At the moment it is still incubating, but the goal is for it to eventually become a massively scalable platform that can process petabytes of data in seconds across up to 10,000 servers.
Greenplum HD allows users to start with big data analytics without the need to build an entirely new project. Greenplum HD is offered as software or can be used in a pre-configured Data Computing Appliance module. It consists of a complete data analysis platform that combines Hadoop and the Greenplum database into a single Data Computing Appliance.
SAMOA is a platform for mining big data streams. It is a distributed streaming machine learning (ML) framework that provides a programming abstraction for distributed streaming ML algorithms.
Ikanow focuses on developing products that enable the uninhibited fusion and analysis of big data using open-source technology. The company has created an open-source analytics platform.