How does data get split in Sqoop?

0 votes

I have a query that I would like to execute using Sqoop:

Select deptid, avg(salary) from emp group by deptid

Another query is as follows:

sqoop import --connect jdbc:mysql://myserver:1202/ --username=u1 --password=p1 --query 'Select * from emp where empid< 1123 and $CONDITIONS' --split-by empid --target-dir /uname/emp/salary

$CONDITIONS and --split-by are used to perform parallel processing, i.e. to import data efficiently. The former splits the rows based on a condition and the latter uses min/max logic on the primary key. What is the difference between the two ($CONDITIONS and --split-by)? If we use both in the same Sqoop statement, which clause gets priority?

Please help me with this one.

Jul 16, 2019 in Big Data Hadoop by nitinrawat895
• 11,380 points
8,936 views

1 answer to this question.

0 votes

I will lay the answer out as a series of steps for better understanding.

1. The degree of parallelism is controlled by -m <n> or --num-mappers <n>. The default value of --num-mappers is 4.

2. --split-by <column-name> tells Sqoop which column to split the work on.

3. $CONDITIONS is a placeholder that Sqoop uses internally to achieve this splitting.

For example, take your query:

sqoop import --connect jdbc:mysql://myserver:1202/ --username u1 --password p1 --query 'select * from emp where $CONDITIONS' --split-by empId --target-dir /temp/emp -m 4

Say my empId is uniformly distributed from 1 to 100.

Now, Sqoop will take the --split-by column and find its min and max values using this query:

SELECT MIN(empId), MAX(empId) FROM (Select * From emp WHERE (1 = 1) ) t1

Notice that Sqoop replaced $CONDITIONS with (1 = 1).
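(A side note about a standard Sqoop option, not something from your command: if this MIN/MAX query is expensive on a large table, or the default bounds are unsuitable, you can supply your own bounding query with --boundary-query, e.g. --boundary-query 'SELECT MIN(empId), MAX(empId) FROM emp'.)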

In our case, the min and max values are 1 and 100.

As the number of mappers is 4, Sqoop will divide the query into 4 splits:

Creating input split with lower bound 'empId >= 1' and upper bound 'empId < 25'
Creating input split with lower bound 'empId >= 25' and upper bound 'empId < 50'
Creating input split with lower bound 'empId >= 50' and upper bound 'empId < 75'
Creating input split with lower bound 'empId >= 75' and upper bound 'empId <= 100'
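For intuition, here is a minimal Python sketch of that boundary arithmetic. It is my illustration, not Sqoop's actual source; the real IntegerSplitter in Sqoop's Java code handles rounding and edge cases more carefully.

# Roughly mimic how Sqoop carves [min, max] of the --split-by column
# into one half-open range per mapper (the last range is inclusive).
def split_ranges(col, min_val, max_val, num_mappers):
    step = (max_val - min_val + 1) // num_mappers  # rows per mapper, roughly
    ranges, lo = [], min_val
    for i in range(num_mappers):
        if i == num_mappers - 1:
            ranges.append(f"{col} >= {lo} AND {col} <= {max_val}")
        else:
            hi = min_val - 1 + step * (i + 1)
            ranges.append(f"{col} >= {lo} AND {col} < {hi}")
            lo = hi
    return ranges

for r in split_ranges("empId", 1, 100, 4):
    print(r)  # prints the same four ranges listed above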

Now $CONDITIONS comes into the picture again: in each mapper's query it is replaced by one of the range predicates above.

The first mapper will fire a query like this:

Select * From emp WHERE empId >= 1 AND empId < 25

and so on for the other 3 mappers.

The results from all the mappers are written to the target HDFS directory, each mapper producing its own part file (part-m-00000 through part-m-00003 in this example).

Regarding your other query:

select deptid, avg(salary) from emp group by deptid

you will specify

--query 'select deptid, avg(salary) from emp group by deptid where $CONDITIONS'

Sqoop will first convert it to

select deptid, avg(salary) from emp group by deptid where (1 = 0)

to fetch column metadata.

I believe this query won't run in an RDBMS, because a WHERE clause cannot come after GROUP BY; try the above query (with WHERE (1 = 0)) directly in MySQL.

So you will not be able to use this query as written to fetch data using Sqoop.
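That said, a commonly used workaround (my sketch, not part of the original question) is to move $CONDITIONS into a WHERE clause before the GROUP BY, alias the aggregate column, and split on the grouping key:

sqoop import --connect jdbc:mysql://myserver:1202/ --username u1 --password p1 --query 'select deptid, avg(salary) as avg_salary from emp where $CONDITIONS group by deptid' --split-by deptid --target-dir /uname/emp/salary

Splitting by deptid (the GROUP BY column) keeps the per-department averages correct, since every row of a given department falls into exactly one mapper's range.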

Hope this helps.

answered Jul 16, 2019 by ravikiran
• 4,620 points
Hi,

Can you please explain: if I understood correctly and the query has a condition like empid < 200 and the number of mappers is 4, it splits the data into (1-49), (50-99), (100-149), (150-200), all inclusive. What if the primary key is not uniformly distributed, or there is no such key in the table? Will it use the concept of sampling in that case?
