That depends on which Spark queries are examined here.
Before any specifics, note that Hadoop (YARN) jobs have an annoyingly large overhead, which is especially obvious when running simple things on small data sets. That overhead only becomes relatively small when you run the "real" thing: distributed and/or complex jobs on huge data sets, where it is minor compared to the job runtime.
For larger jobs, the overall performance may depend on how well the cluster resources are allocated, which the Spark resource allocation settings can influence.
For smaller jobs, the overhead should be decreased. However, for pure Spark operators, which you can recognize by the Spark (star) icon, there is no general way to achieve that. For Hive-based operators, marked with the Hive (bee) icon, the overhead can be greatly decreased when Hive-on-Spark is enabled on the cluster. In the following screenshot from the Resource Manager interface of the cluster (accessible via a web browser at <resource_manager_host>:8088 by default), you can distinguish between the two types of job by looking at the User column: the first is a Hive-on-Spark job, the second is a pure Spark job.
The overhead of Hive-on-Spark jobs can be decreased via the "Connection pool" settings in the Preferences, although the default heuristics should already provide good results when operations are executed frequently.
Let me know if you can share your challenges more specifically.
Also, please note that you can expect performance improvement from upgrading to Spark 2.x.
Switching to Spark 2.x for Radoop is very simple, because the required Spark archive can be uploaded to HDFS and Radoop can already use it. No need to install or upgrade any services on the cluster side.
The Spark assembly jar could not be found at the specified location. Since it is a local address, the file (or the directory, in the case of Spark 2.x) must exist on all nodes at the specified path. For example, the default Assembly Jar Location is "local:///opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar"; in that case, this path must exist on all nodes: /opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar.
If it is somewhere else, the address must be modified. It is also possible to download an arbitrary Spark library from spark.apache.org, upload it to HDFS, specify an HDFS location (with the prefix "hdfs://"), and choose the proper Spark version.
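To make the difference between the two address types concrete, here is a minimal sketch of how a given Assembly Jar Location could be checked by hand. The function name check_assembly_location is hypothetical (it is not part of Radoop), and the local check only covers the node you run it on, so it would have to be repeated on every node.

```shell
# Hypothetical helper (not part of Radoop): report how an
# Assembly Jar Location would need to be verified.
check_assembly_location() {
  location="$1"
  case "$location" in
    local://*)
      # Strip the "local://" prefix to get the filesystem path.
      path="${location#local://}"
      # A local address must exist at this path on EVERY node;
      # this only checks the node the function runs on.
      if [ -e "$path" ]; then
        echo "local path present on this node: $path"
      else
        echo "local path missing on this node: $path"
      fi
      ;;
    hdfs://*)
      # Only print the command; running it needs a Hadoop client.
      echo "HDFS location; verify with: hadoop fs -test -e $location"
      ;;
    *)
      echo "unrecognized prefix: $location"
      return 1
      ;;
  esac
}

check_assembly_location "local:///opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar"
```

For an HDFS location the sketch only prints the check to run, since hadoop fs -test -e needs a working Hadoop client on the machine.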
That is already some progress!
The client knows the number of DataNodes from the NameNode's response.
The client almost certainly won't be able to access the DataNodes directly, only through a SOCKS proxy, so the traffic goes through a master node.
You need to follow the instructions of "Configuring SOCKS Proxy and SSH tunneling" at
In this case, you don't need to create tunnels one by one. Only one additional tunnel is needed for Hive, see the description.
Or is it something you have already configured?
This thread may also be helpful.
But I also wrote "If this does not work, the multi-line default value that is described in the link above for this property can be copy-pasted to the value cell instead." What happens if you set the value from here?
$HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
I wonder if these env variables work on the cluster or not....
I'm happy to say the Spark suggestion worked and now I can get Radoop connections working completely.
As promised here is the list of things to do to get to this happy place.
Create an EMR cluster and use the advanced options to select Hadoop, Pig, Spark, Hive and Mahout.
Log on to the master node and determine the internal IP address of the eth0 interface using the command line.
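As a rough sketch of that step: the address can be read from the output of ifconfig eth0, or extracted with the ip tool as below. The helper name first_ipv4 is made up for illustration, and the interface name eth0 is taken from the step above; it may differ on other instance types.

```shell
# Hypothetical helper: print the first IPv4 address found in
# `ip -4 -o addr show` output read from standard input.
first_ipv4() {
  # In the one-line (-o) format the fields are:
  #   index:  interface  inet  address/prefix  ...
  awk '$3 == "inet" { split($4, parts, "/"); print parts[1]; exit }'
}

# Usage on the master node (assumes the interface is named eth0):
#   ip -4 -o addr show dev eth0 | first_ipv4
```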
While logged in, there are some configuration steps needed to make the environment work. These are described in the Radoop documentation here. I observed that Java did not need any special configuration because EMR is up to date. The commands to create various staging locations in HDFS are required. I've repeated them below:
hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history
hadoop fs -chmod -R 777 /tmp/hadoop-yarn
hadoop fs -mkdir /user
hadoop fs -chmod 777 /user
An earlier version of Spark needs to be installed. Here are the steps.
wget -O /home/hadoop/spark-1.6.3-bin-hadoop2.6.tgz https://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
cd /home/hadoop
tar -xzvf spark-1.6.3-bin-hadoop2.6.tgz
hadoop fs -put spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar /tmp/
Continue to follow the instructions to set up the network connection. Use the IP address found above as the NameNode address, Resource Manager Address and JobHistory Server Address. Don't be tempted to use any other name or IP address since it will not work.
Set the Hive Server address to localhost.
Set the Hive port to 1235.
Set the Spark version to Spark 1.6 and set the assembly jar location to
Set the advanced Hadoop parameters as follows
dfs.client.use.legacy.blockreader true
hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.SocksSocketFactory
hadoop.socks.server localhost:1234
Now create the SOCKS connection. On Linux the command is like this.
ssh -i <yourkey>.pem -N -D 1234 -L localhost:1235:<ifconfig ip address>:10000 hadoop@<nameofmaster>
In the command above, things between <> need to be provided by information from the environment you are in.
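To show how the placeholders fit together, here is a small sketch that assembles the same command from its parts and prints it rather than running it. The function name build_tunnel_cmd and the example key file, IP address, and host name are invented for illustration; the ports 1234 (SOCKS) and 1235 (local Hive port) match the settings used in the steps above.

```shell
# Hypothetical helper: print the tunnel command for the given values.
# Arguments: key file, internal IP of the master (from ifconfig),
# master host name, optional SOCKS port, optional local Hive port.
build_tunnel_cmd() {
  key_file="$1"
  hive_ip="$2"
  master_host="$3"
  socks_port="${4:-1234}"   # must match hadoop.socks.server
  hive_port="${5:-1235}"    # must match the Hive port in the connection
  # -N: no remote command, -D: SOCKS proxy, -L: forward local Hive port
  # to HiveServer2 on port 10000 of the master's internal address.
  printf 'ssh -i %s -N -D %s -L localhost:%s:%s:10000 hadoop@%s\n' \
    "$key_file" "$socks_port" "$hive_port" "$hive_ip" "$master_host"
}

# Example values only; replace them with your own.
build_tunnel_cmd mykey.pem 172.31.5.10 ec2-master.example.com
```

Printing the command first makes it easy to confirm that the SOCKS port and the forwarded Hive port agree with the connection settings before opening the tunnel.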
On Windows, use PuTTY to create the SOCKS connection. The Radoop documentation gives a nice picture here. Make sure you replace hive-internal-address with the IP address determined using the ifconfig command.
Now you can run the Radoop connection tests and with luck, all will be well...
We don't support MapR at the moment. In fact, what we don't support is mainly MapR's security, which is quite different from that in the other distributions. If you tried to connect to an unsecured MapR using Radoop, you would be able to.
That said, supporting MapR is in our mid-term plans. It won't be there in the immediate future, but it's certainly on our radar.
This issue and a couple of other things have been fixed in the recent 7.4.1 release of Radoop.
The upgrade is available via the Marketplace from Studio or at https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_radoop.
Release Notes are available at:
RapidMiner Radoop allows you to use the Cloudera Hadoop distribution. This allows you to push data science processes to the Hadoop cluster and run them in a distributed fashion.
I also found the list of compatible service providers: http://docs.rapidminer.com/radoop/installation/compatibility.html