Before diving deep into how Apache Spark works, let's understand the jargon of Apache Spark. Job: a piece of code which reads some input from HDFS or a local filesystem, performs some computation on the data, and writes some output data. Stages: jobs are divided into stages. Stages are classified as map or reduce stages (it's easier to understand if you have worked on Hadoop and want to … [Read more...] about Apache Spark – A Basic Understanding
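The map/reduce stage split described above can be sketched without Spark itself. The following is a plain-Python analogue of a word-count job, assuming a small in-memory dataset in place of an HDFS read; in real Spark the same two phases would run as distributed tasks.

```python
from collections import defaultdict

# A tiny in-memory "input" standing in for data read from HDFS or local disk.
lines = ["spark makes big data simple", "big data needs spark"]

# Map stage: each record is transformed independently into (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle + reduce stage: pairs are grouped by key and their counts summed.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n
```

After the reduce stage, `counts["spark"]` is 2 and `counts["simple"]` is 1 — each key's total across all input records, which is exactly what Spark's shuffle boundary between stages produces.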
25 Must Know Big Data Terms To Impress Your Date
Big Data can be intimidating! If you are new to Big Data, please read "What is Big Data?" to get started. With the basic concepts under your belt, let's focus on some key terms to impress your date, boss, or family. So let's get going with this list. Algorithm: a mathematical formula or statistical process used to perform an analysis of data. How is an algorithm related … [Read more...] about 25 Must Know Big Data Terms To Impress Your Date
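To make the "Algorithm" entry concrete, here is a minimal sketch of one such statistical process — computing the arithmetic mean and population standard deviation of a dataset. The function name and the sample values are illustrative, not from the original article.

```python
import math

def mean_and_stddev(values):
    """A simple statistical algorithm: arithmetic mean and
    population standard deviation of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Population variance: average squared deviation from the mean.
    variance = sum((x - mean) ** 2 for x in values) / n
    return mean, math.sqrt(variance)

m, s = mean_and_stddev([2, 4, 4, 4, 5, 5, 7, 9])  # mean 5.0, stddev 2.0
```

The same recipe — a fixed sequence of arithmetic steps applied to data — is what "algorithm" means throughout the rest of the glossary.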
Real-Time Kafka Data Ingestion into HBase via PySpark
Streaming data is becoming an essential part of every data integration project nowadays — if not a core requirement, then second nature. The advantages gained from real-time data streaming are many. To name a few: real-time analytics and decision making, better resource utilization, data pipelining, support for microservices, and much more. Python has many modules out there … [Read more...] about Real-Time Kafka Data Ingestion into HBase via PySpark
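A minimal sketch of such an ingestion pipeline, assuming the `kafka-python` and `happybase` modules and hypothetical topic, table, and field names (`sensor-events`, `sensor_readings`, `id`/`ts`/`temp`) — none of which come from the original article. The message-to-row transform is kept as a pure function so it can be reasoned about separately from the Kafka and HBase wiring.

```python
import json

def event_to_hbase_row(raw_bytes):
    """Turn a JSON Kafka message into an HBase row key and column map.
    The field names ('id', 'ts', 'temp') are illustrative assumptions."""
    event = json.loads(raw_bytes)
    row_key = f"{event['id']}-{event['ts']}"
    columns = {b"d:temp": str(event["temp"]).encode()}
    return row_key.encode(), columns

def run_pipeline():
    # Hypothetical wiring: requires a reachable Kafka broker and an
    # HBase Thrift server, so it is defined here but not executed.
    from kafka import KafkaConsumer   # assumption: kafka-python package
    import happybase                  # assumption: happybase package
    consumer = KafkaConsumer("sensor-events",
                             bootstrap_servers="localhost:9092")
    table = happybase.Connection("localhost").table("sensor_readings")
    for msg in consumer:
        key, cols = event_to_hbase_row(msg.value)
        table.put(key, cols)
```

For example, the message `b'{"id": "s1", "ts": 100, "temp": 21.5}'` maps to row key `b"s1-100"` with column `d:temp` set to `b"21.5"`. Keeping the transform pure makes it easy to unit-test the pipeline without a running broker.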
How to Overcome Big Data Analytics Limitations With Hadoop
Hadoop is an open source project developed under the Apache Software Foundation; its 1.0 release arrived back in 2011. That initial version had a variety of bugs, so a more stable release followed in August. Hadoop is a great tool for big data analytics because it is highly scalable, flexible, and cost-effective. However, there are also some challenges that big data analytics professionals need to be aware of. The good … [Read more...] about How to Overcome Big Data Analytics Limitations With Hadoop
Five Patterns of Big Data Integration
As reliance on Hadoop and Spark grows for data management, processing and analytics, data integration strategies should evolve to exploit big data platforms in support of digital business, Internet of Things (IoT) and analytics use cases. While Hadoop is used for batch data processing, Spark supports low-latency processing. Integration leaders should understand the various … [Read more...] about Five Patterns of Big Data Integration