
How to Integrate Sqoop in a Data Ingestion Layer with Hive and Oozie in a Big Data Application?

Sqoop is a tool that transfers data in bulk between an RDBMS and a Hadoop system. This blog post focuses on integrating Sqoop with other projects in the Hadoop ecosystem inside Big Data applications. Working for a Big Data solution provider, I learned these techniques, and here I will show how to schedule a Sqoop job with Oozie and how to load data from Sqoop into a Hive data warehouse on Hadoop (or even HBase). I have included the code for each solution, which should make it easy to follow.

Environment

Java: JDK 1.7

Cloudera version: CDH 4.6

Database: MySQL

Initial steps

Here we assume that we already have some data on HDFS in our Hadoop cluster and some data in our MySQL database. Now we need to integrate the jobs with Oozie and Hive. Let's understand it with the code!
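
To make the examples concrete, assume a small table like the following in the MySQL database myintegrate. This schema is purely hypothetical (the later examples reference a student table with a LastName column), so adjust it to your own data:

-- Hypothetical sample table; the column names match the queries used later.
CREATE TABLE student (
  id INT PRIMARY KEY,
  FirstName VARCHAR(50),
  LastName VARCHAR(50)
);
INSERT INTO student VALUES (1, 'Harsh', 'HN');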

Code walkthrough

Below is an Oozie workflow that triggers a Sqoop import job to move the data from MySQL to Hadoop:

<workflow-app name="mysqoop-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="mysqoopaction">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>myjt:8021</job-tracker>
            <name-node>nn:8020</name-node>
            <command>import --table People --connect ...</command>
        </sqoop>
        <ok to="next"/>
        <error to="error"/>
    </action>
    ...
</workflow-app>
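
To actually submit this workflow, you would upload the workflow directory to HDFS and run it with the Oozie CLI. Below is a minimal sketch; the Oozie server URL and the HDFS application path are assumptions for illustration.

# job.properties (hostname and path are illustrative)
nameNode=hdfs://nn:8020
oozie.wf.application.path=${nameNode}/user/hn/mysqoop-wf

Then submit and start the job:

oozie job -oozie http://localhost:11000/oozie -config job.properties -run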

Now, how do we set additional properties for Sqoop when integrating with Oozie? The workflow file below answers this question by passing the property to Sqoop through a configuration block.

For example, to set the number of statements Sqoop executes per transaction when moving the data:

<workflow-app name="mysqoop-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="mysqoopaction">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>myjt:8021</job-tracker>
            <name-node>nn:8020</name-node>
            <configuration>
                <property>
                    <name>sqoop.export.statements.per.transaction</name>
                    <value>10</value>
                </property>
            </configuration>
            <command>import --table People --connect ...</command>
        </sqoop>
        <ok to="next"/>
        <error to="error"/>
    </action>
    ...
</workflow-app>
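
For comparison, the same property can also be set straight on the Sqoop command line using Hadoop's generic -D option, which must come before the tool-specific arguments. A sketch (the connect string is elided, as in the workflow above):

sqoop import -D sqoop.export.statements.per.transaction=10 \
  --table People --connect ...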

The command below imports the data into Hadoop and also loads it into Hive:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/myintegrate \
  --username hn \
  --password hn \
  --table student \
  --hive-import
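
If the Hive table should have a different name than the source table, Sqoop also offers --hive-table, and --create-hive-table to fail fast if the target already exists. A sketch (the table name student_hive is made up for illustration):

sqoop import \
  --connect jdbc:mysql://mysql.example.com/myintegrate \
  --username hn \
  --password hn \
  --table student \
  --hive-import \
  --hive-table student_hive \
  --create-hive-table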

To import the data into Hadoop and Hive as partitioned data, the command below is helpful; note that it loads everything into a single Hive partition per run.

sqoop import \
  --connect jdbc:mysql://mysql.example.com/myintegrate \
  --username hn \
  --password hn \
  --table student \
  --hive-import \
  --hive-partition-key day \
  --hive-partition-value "2016-09-11"
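
After a partitioned import completes, a quick check from the Hive shell confirms the partition exists (the output line is illustrative):

show partitions student;
-- expected output: day=2016-09-11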

Verify the Result

Now, one question remains: how can we verify that the data was actually imported from the database? First, we can trigger the Oozie workflow for the first two commands to make sure the XML schemas in the workflow files are valid.
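
The Oozie CLI also ships with a validate command that checks a workflow XML file against the schema without running anything, which makes a cheap first check:

oozie validate workflow.xml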

For Hive integration, after the import job completes, we can list the tables in Hive with show tables and run a few queries to make sure the table contains the data from MySQL:

show tables;
select * from student;
select * from student where LastName = 'HN';
select * from student where LastName like '%HN%';
select count(*) from student;
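
We can also inspect the files Sqoop wrote on HDFS. Assuming the default Hive warehouse location (/user/hive/warehouse; adjust if your configuration differs):

hadoop fs -ls /user/hive/warehouse/student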

So, that is the way to integrate Sqoop with Oozie and Hive in a data ingestion layer. Hope this blog helps you understand the integration. If any point or piece of code shared here is unclear, or if you know of other issues (or solutions) with Oozie-Sqoop-Hive jobs on Hadoop, please comment below and ask.
