Data Aggregation

Data aggregation is the process of combining scattered data from numerous sources into a single, consolidated form. A common objective is to combine sources so that the output is smaller than the combined input, which makes it practical to process massive amounts of data both in batch jobs and in real-time applications, reduces network traffic, and improves overall performance.


Sqoop, another Apache-licensed product, is a tool designed to move bulk data efficiently between Hadoop and structured data stores such as relational databases. With Sqoop, most of this process is automated. Sqoop uses MapReduce to import and export the data, which provides both parallel operation and fault tolerance.
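As an illustrative sketch of such an import, the command below pulls a relational table into HDFS using Sqoop's standard `import` tool; the hostname, database, table, user, and target directory are placeholders, and the command assumes a running Hadoop cluster with Sqoop installed.

```shell
# Import the (hypothetical) "orders" table from a MySQL database into HDFS,
# splitting the work across 4 parallel map tasks. -P prompts for the password.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/sales/orders \
  --num-mappers 4
```

The `--num-mappers` flag controls the degree of parallelism: Sqoop partitions the table (by default on its primary key) and each map task imports one slice, which is where the fault tolerance and parallel operation mentioned above come from.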


Flume, also an Apache product, is a distributed system for efficiently collecting, aggregating, and moving massive amounts of log data. Its objective is to move data between applications and Hadoop. Its architecture is simple and based on streaming data flows. It is fault tolerant and supports online analytic applications.
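A Flume agent is wired together in a properties file as a source, a channel, and a sink. The fragment below is a minimal single-agent sketch that tails an application log and delivers the events to HDFS; the agent name, log path, and NameNode address are illustrative placeholders.

```properties
# Minimal Flume agent: exec source -> memory channel -> HDFS sink.
# "agent1", the log path, and the HDFS URL are placeholder names.
agent1.sources  = tail-src
agent1.channels = mem-ch
agent1.sinks    = hdfs-sink

# Source: follow an application log file as new lines are appended.
agent1.sources.tail-src.type     = exec
agent1.sources.tail-src.command  = tail -F /var/log/app/app.log
agent1.sources.tail-src.channels = mem-ch

# Channel: buffer events in memory between source and sink.
agent1.channels.mem-ch.type     = memory
agent1.channels.mem-ch.capacity = 10000

# Sink: write the buffered events into HDFS.
agent1.sinks.hdfs-sink.type      = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/logs
agent1.sinks.hdfs-sink.channel   = mem-ch
```

The channel decouples the source from the sink, which is what lets Flume absorb bursts of log traffic and recover from transient sink failures in a streaming data flow.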


Chukwa, an Apache incubator project, is built on top of the MapReduce framework and the Hadoop Distributed File System, and is therefore scalable and robust. It includes a powerful and flexible toolkit for monitoring, analysing, and displaying results, so as to make the best use of big data.
