Error when running multiple workflows in Oozie 4.1.0

Problem :

I installed oozie 4.1.0 on a Linux machine by following the steps at
http://gauravkohli.com/2014/08/26/apache-oozie-installation-on-hadoop-2-4-1/

hadoop version - 2.6.0 
maven - 3.0.4 
pig - 0.12.0

Cluster setup:

MASTER NODE running: NameNode, ResourceManager, proxy server.

SLAVE NODE running: DataNode, NodeManager.

When I run a single workflow job, it succeeds.
But when I try to run more than one workflow job at the same time, both jobs stay in the ACCEPTED state.

Inspecting the error log, I narrowed the problem down to the following:

014-12-24 21:00:36,758 [JobControl] INFO  org.apache.hadoop.ipc.Client  - Retrying connect to server: 172.16.***.***/172.16.***.***:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-12-25 09:30:39,145 [communication thread] INFO  org.apache.hadoop.ipc.Client  - Retrying connect to server: 172.16.***.***/172.16.***.***:52406. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-12-25 09:30:39,199 [communication thread] INFO  org.apache.hadoop.mapred.Task  - Communication exception: java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable: no further information; Host Details : local host is: "SystemName/127.0.0.1"; destination host is: "172.16.***.***":52406; 
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
 at org.apache.hadoop.ipc.Client.call(Client.java:1415)
 at org.apache.hadoop.ipc.Client.call(Client.java:1364)
 at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
 at $Proxy9.ping(Unknown Source)
 at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:742)
 at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.SocketException: Network is unreachable: no further information
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
 at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
 at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
 at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
 at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
 at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
 at org.apache.hadoop.ipc.Client.call(Client.java:1382)
 ... 5 more

Heart beat
Heart beat
.
.

With the jobs above still running, if I manually kill any one launcher job (hadoop job -kill <launcher-job-id>), all the remaining jobs succeed. So I think the problem is that when more than one launcher job runs simultaneously, the jobs deadlock.

If anyone knows the cause of this problem and how to solve it, please help.

Solution :

I tried the solutions below and they work perfectly for me.

1) Change the Hadoop scheduler from the capacity scheduler to the fair scheduler. On a small cluster, each queue is assigned a fixed amount of memory (2048 MB), enough to complete a single MapReduce job. If more than one MapReduce job runs in a single queue, the jobs deadlock.

Solution: add the properties below to yarn-site.xml

  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>file:/%HADOOP_HOME%/etc/hadoop/fair-scheduler.xml</value>
  </property>
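
The allocation file that yarn.scheduler.fair.allocation.file points to is not shown in the original answer. A minimal sketch of what fair-scheduler.xml could look like follows, assuming two queues. The queue name launcher2 is borrowed from the workflow snippet at the end of this post; the weight and maxRunningApps values are illustrative assumptions, not the author's actual settings.

```xml
<?xml version="1.0"?>
<!-- Hypothetical fair-scheduler.xml; queue names and values are assumptions -->
<allocations>
  <!-- Queue used when a job does not name a queue explicitly -->
  <queue name="default">
    <weight>1.0</weight>
  </queue>
  <!-- Separate queue so that jobs submitted here do not compete
       with the default queue for the same share of memory -->
  <queue name="launcher2">
    <weight>1.0</weight>
    <!-- Illustrative cap on concurrently running applications -->
    <maxRunningApps>5</maxRunningApps>
  </queue>
</allocations>
```

With equal weights the fair scheduler divides cluster memory evenly between the two queues when both have pending work, so a job in one queue cannot hold every container while a job in the other waits.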

2) By default, the total memory allotted to each NodeManager (yarn.nodemanager.resource.memory-mb) is 8 GB.

So if we run two MapReduce programs, the memory Hadoop needs exceeds 8 GB and the jobs deadlock.

Solution: increase the NodeManager's total memory using the following properties in yarn-site.xml

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>20960</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
  </property>

So if a user wants to run more than two MapReduce programs, they need to either add NodeManagers or increase the total memory available to Hadoop. (Note: increasing this size reduces the memory left for the rest of the system. The properties above are enough to run 10 MapReduce programs concurrently.)

The problem is with the queue. When we run jobs in the same queue (default) with the cluster setup above, the ResourceManager is responsible for running the MapReduce jobs on the slave node. Because the slave node lacks resources, the jobs running in that queue deadlock.

To overcome this issue, we need to split the MapReduce jobs by triggering them in different queues.

You can do this by setting the following in the pig action inside your Oozie workflow.xml

<configuration>
  <property>
    <name>mapreduce.job.queuename</name>
    <value>launcher2</value>
  </property>
</configuration>
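
For context, here is a sketch of how that configuration block might sit inside a complete pig action. The action name pig-node, the script name script.pig, the ${jobTracker} and ${nameNode} parameters, and the end/fail transitions are all hypothetical placeholders, not taken from the original workflow. The comment about the oozie.launcher. prefix describes a commonly used Oozie convention; whether it is needed for this setup is an assumption.

```xml
<!-- Hypothetical pig action; names and parameters are placeholders -->
<action name="pig-node">
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- Queue for the MapReduce jobs that Pig itself submits -->
      <property>
        <name>mapreduce.job.queuename</name>
        <value>launcher2</value>
      </property>
      <!-- Assumption: properties prefixed with oozie.launcher. are passed
           to the Oozie launcher job, so a property such as
           oozie.launcher.mapred.job.queue.name could be added here to
           place the launcher in its own queue -->
    </configuration>
    <script>script.pig</script>
  </pig>
  <ok to="end"/>
  <error to="fail"/>
</action>
```

The queue named in the value must exist in the scheduler configuration, otherwise the submission may be rejected or placed according to the scheduler's placement rules.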

NOTE: This solution is only for a SMALL CLUSTER SETUP.
