Friday, December 12, 2014

Running Spark on Yarn from Outside the Cluster (a remote machine)

In figuring out how to run Spark on Yarn from a machine that wasn't part of the cluster, I found that I (like a few others in the forums) was confused about how it works.  I was trying to follow along here:

http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/

However, those instructions (and a lot of the documentation I found) seem to be oriented around running the client from a node in the Yarn cluster.  Here's one thing that confused people:

You Don't Have to Install Spark On The Cluster

That's right, extracting the tar file only has to be done on the machine you want to launch from--that could be a node in the cluster or a completely different machine.  It doesn't have to have Hadoop/Yarn on it at all, and you don't need to drop any bits on the cluster.  That probably confuses people who were used to installing things on the cluster before Yarn.  I believe that with Spark on Yarn, the Spark client ships everything Yarn needs to set the job up at runtime.
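For example, on the launch machine the whole "install" can be as little as this (the file name is just a placeholder for whichever YARN-enabled build you downloaded):

tar -xzf spark-<version>-bin-hadoop2.4.tgz
cd spark-<version>-bin-hadoop2.4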

But what about that "export YARN_CONF_DIR=/etc/hadoop/conf" thing?  How does that work if I'm running remotely?  Well, at first I thought that was supposed to point to the configuration on the cluster.  But as I tried working with the command line arguments, I realized there was no way Spark could know where the cluster was, since I wasn't giving it a URL.  So I scp'd the contents of /etc/hadoop/conf from my cluster to my non-cluster machine and pointed YARN_CONF_DIR at it.  Maybe there is a better way, but it worked.
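Concretely, that looked something like this on the launch machine (the host name, user, and local directory here are just examples):

# copy the Hadoop/Yarn client configs from a cluster node
scp -r myusername@cluster-node:/etc/hadoop/conf ~/hadoop-conf

# point Spark at the copied configs
export YARN_CONF_DIR=~/hadoop-conf

# then launch from the Spark directory, e.g. the bundled SparkPi example
# (the examples jar name varies by Spark version)
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 2 --executor-memory 512m \
    lib/spark-examples*.jar 10

Swap --master yarn-cluster for --master yarn-client if you want the driver to run on your local machine instead of on the cluster.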

That may be all you need to get both cluster and client modes working from outside the cluster.  Then again, since you are off-cluster, you are more likely to hit permission errors (like I did):

Permission denied: user=myusername, access=WRITE, inode="/user":hdfs:hdfs:drwxr-xr-x

If you see this, you just need to provision, on the cluster, the user you are running as locally--probably something along the lines of:

# the user name you run as on the launch machine
MyUsername=someusername
sudo useradd $MyUsername

# give that user a home directory in HDFS and hand over ownership
sudo -u hdfs hadoop fs -mkdir /user/$MyUsername
sudo -u hdfs hadoop fs -chown -R $MyUsername /user/$MyUsername
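You can then sanity-check the new home directory and its ownership from a cluster node with something like:

sudo -u hdfs hadoop fs -ls /user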

Anyway, once you get to permission errors, it probably means you've got your Spark configuration right--especially if you see your cluster URL showing up in the console logs.

And kudos to the Spark devs for good error messages--I hit this one trying to run with bits I'd gotten on a thumb drive from them at an event:

Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.

This was easily resolved by just downloading the tar file from the Hortonworks article and using that (shame on me for being lazy...).
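If you'd rather build Spark yourself instead of grabbing a pre-built tarball, the build has a YARN profile; from the Spark source tree it's roughly the following (check the Building Spark docs for the flags matching your Hadoop version):

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package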