Job hanging problems in pseudo-distributed YARN clusters

Often, a user of YARN (with MR2) on CDH5 reports that their job submits and then hangs indefinitely, with a pattern observable in the submission log such as the below:

~> sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 1 10
Number of Maps = 1
Samples per Map = 10
Starting Job
Running job: job_14345324521245_0001
… Hangs …

This is very often seen on small or pseudo-distributed clusters, such as those with just one running NodeManager role, or no more than 2-3.

The reason behind this behaviour is simply that the resources configured on the cluster are far too low for the job’s own resource requests to be satisfied. For instance, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores may carry very small values, such that the MR2 job’s Application Master alone consumes all of them, leaving no room for the actual map tasks to be allocated after it. As an example, if the lone NodeManager offers only 2 GB, the AM’s default request of 1.5 GB (often rounded up to 2 GB by the scheduler’s minimum allocation) can occupy everything, leaving no room for even a single 1 GB map task.

The solution, therefore, is to increase the above-mentioned properties in the cluster’s yarn-site.xml and restart the YARN service. Cloudera Manager users will find the same configuration property names searchable under CM -> YARN -> Configuration.
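By way of a sketch, such an override in yarn-site.xml might look like the below; the 4 GB / 4 vcore figures are assumed example values for a small test host, not a recommendation:

<!-- Illustrative resource sizing for a single NodeManager host. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <!-- Total memory (MB) this NodeManager may hand out to containers -->
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <!-- Total virtual cores this NodeManager may hand out to containers -->
  <value>4</value>
</property>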

Personally, I like leaving these values at their defaults of 8 GB memory and 8 CPU cores on my 1-node clusters. Being just a test-mode cluster that I’d tear down later anyhow, I don’t go chasing an ideal configuration for its size.

The Cloudera documentation also carries a good guide on configuring resources on the cluster.

Easier way to check and share failed job logs in Cloudera Manager

Since Cloudera Manager 5.2.0, CDH5 YARN users can download application and job logs (to check or share) in a much simpler manner than browsing the Job History Server or ResourceManager Web UIs while trying to locate failed map/reduce tasks or Spark application and worker containers.

Checking failed job logs has been tedious enough for newcomers, but sharing them with vendor support or community lists for insight into a mystifying problem has typically been even more difficult.

[Image: Failed jobs and applications! (source: imgur.com)]

Often, in my work at Cloudera Support and on the many community lists and forums where I sometimes hang out, such as the Cloudera Community, I’ve observed users share only partial information, which makes troubleshooting difficult for the people trying to help them.

The good bit, going forward, is that Cloudera Manager users now have a much easier and more complete way to do this.

Starting in CM 5.2.0, users of YARN in CDH5 can visit the YARN service’s Applications tab to track all MapReduce (MR2) jobs and applications, download all of the container/task logs at the push of a single button, and share the downloaded file with others when seeking help with an issue.

Getting to this is simple. First off, you need to visit CM’s YARN service page, and then the Applications tab under it, as shown below:

[Image: CM -> YARN -> Applications (source: imgur.com)]

Then look for your job on the new page, or locate it by ID if it’s not visible in the top list. Drilling down into your job, you’ll finally see a view such as the one below:

[Image: Viewing a failed job on the Applications view (source: imgur.com)]

On this view, the button on the right side of the job pane can download diagnostic data, which includes all job logs and other information packed into a single gzipped tar-ball.

[Image: Download job data by clicking on this (source: imgur.com)]

The downloaded file can also be used for single-point analysis (such as via grep) of the errors or messages observed across all the tasks/containers of the job/application.
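As a sketch of that, assuming the downloaded bundle were named job-diagnostics.tar.gz (a placeholder name here, not CM’s actual naming):

~> mkdir job-diagnostics && tar xzf job-diagnostics.tar.gz -C job-diagnostics
~> grep -ril 'exception' job-diagnostics/     # list every file mentioning an exception
~> grep -rn 'Error' job-diagnostics/ | less   # browse matching lines across all tasks/containers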

[Image: The downloaded JSON and raw files within the tar-ball archive for the job (source: imgur.com)]

Fixing connection breakages in older Belkin routers

A recent power cable burnout near the local transformer in my area caused a power problem that ended up frying most of my power adapters at home, which made me take down that spares box I’d forgotten about for a good 4 years.

Within it I found an old router+modem I’d purchased years ago and forgotten all about: a Belkin F7D1401 v1, with its power adapter. Ecstatic about having an alternative route back onto the internet, rather than having to travel a bit to get a replacement adapter for the existing device, I promptly tried to set up the router to connect to my ISP.

Upon initial startup and connection, I received an IP of the form 169.254.x.x, a link-local address my machine had assigned itself because the router had not actually handed one out. After a bit of fiddling I got past this by holding the router’s reset button for about 15 seconds instead of the immediate click-and-release reset. After the hard reset, the router was able to assign my devices its expected range of 192.168.2.x addresses.

The rest was a breeze, and the router was promptly serving me worthless social media pages in no time. All smooth, except that occasionally, but not quite randomly, the internet link would go down and reconnect, making browsing a frustrating activity.

Looking at the System Log on the router, I noticed messages of the form “ADSL Media Down!”, with about 14 seconds of service pause before it attempted a PPPoE reconnect (which can take a few additional seconds too). Searching the web for others having this problem, I couldn’t come to a solution: some only had it during the evenings and attributed it to their ISP doing strange things then; others had delved down the path of line noise increasing at night, etc. Fun reading for the science, but no real solution. I did not suspect my ISP, as I’d know if the line really had a gap in service long enough to cause a reconnect, and monitoring for that revealed no such thing.

I had done a firmware upgrade to the v2 version available on the Belkin website after I began using the router, and felt that was to blame; but there had to be a workaround for whatever function was incorrectly reinitialising the router, and on a schedule at that.

I found the culprit in my case: the router had a “Reinitialise Automatically” function enabled, meant to maintain the router’s efficiency by clearing up its memory periodically, and the period was set to a proper 3 AM every Tuesday. Not taking its word for it, and this being only the second such schedule-based function configuration I found in the UI, I disabled it and, BAM, problem solved. I haven’t had connection resets since. My city’s periodic power outages will take care of the router reboots instead, thank you.

Erlang: Using the timer:tc function in escripts

Both escripts and the timer:tc functions are very useful Erlang features for writing simple test code to profile functions.

There is one tiny issue with using timer:tc/1 or timer:tc/2 from within escripts though: they will not work in the interpreted execution mode.

For example, you may have an escript such as below:

#!/usr/bin/env escript
-module(test).

-export([generate_list/1]).

generate_list(N) ->
  lists:seq(1, N).

main(_) ->
  io:format("time: ~p~n", [timer:tc(generate_list, [1000])]).

Running this with “escript test.erl” would yield an error:

escript: exception error: undefined function test:generate_list/1
  in function  timer:tc/2 (timer.erl, line 179)

The issue is that the default escript mode is “interpreted”, not “compiled”; in interpreted mode, the dynamic call cannot find the locally defined function at all.

To fix the execution, one has to add a “-mode(compile).” line to the script, and also use the timer:tc/3 form instead of timer:tc/1 or timer:tc/2.

A fixed script of the above example would thus be:

#!/usr/bin/env escript
-mode(compile).

-module(test).

-export([generate_list/1]).

generate_list(N) ->
  lists:seq(1, N).

main(_) ->
  io:format("time: ~p~n", [timer:tc(?MODULE, generate_list, [1000])]).

I think the same should hold for apply calls too, or any function that uses the Module-Function-Arguments style of dynamic calls.
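As a quick sketch of that claim, the same fix would apply to an apply/3 call, following the compiled-mode pattern above:

#!/usr/bin/env escript
-mode(compile).

-module(test).

-export([generate_list/1]).

generate_list(N) ->
  lists:seq(1, N).

main(_) ->
  %% apply/3 resolves the function by Module-Function-Arguments at
  %% runtime, just like timer:tc/3, so it too needs the compiled mode
  %% and an explicit ?MODULE to find generate_list/1.
  io:format("result: ~p~n", [apply(?MODULE, generate_list, [1000])]).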