That depends on which Spark queries are examined here.
Before getting into specifics: Hadoop (YARN) jobs have an annoyingly large startup overhead, which is especially noticeable when running simple operations on small data sets. That overhead only becomes relatively small when you run the "real" thing: distributed and/or complex jobs on huge data sets, where it is minor compared to the job runtime.
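To make the overhead argument concrete, here is a rough back-of-the-envelope sketch. The 30-second startup cost and both job runtimes are made-up illustrative numbers, not measurements from any cluster.

```python
def overhead_share(startup_seconds: float, work_seconds: float) -> float:
    """Fraction of total wall-clock time spent on the fixed job startup cost."""
    return startup_seconds / (startup_seconds + work_seconds)

# Hypothetical figure: 30 s of YARN/Spark startup per job.
STARTUP = 30.0

small_job = overhead_share(STARTUP, work_seconds=5.0)     # simple operation, small data
large_job = overhead_share(STARTUP, work_seconds=3600.0)  # one hour of actual work

print(f"small job: {small_job:.0%} of the runtime is overhead")
print(f"large job: {large_job:.1%} of the runtime is overhead")
```

With these assumed numbers the small job spends most of its time waiting for startup, while for the large job the same fixed cost is under one percent.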
For larger jobs, the overall performance may depend on how well the cluster resources are allocated. The Spark resource allocation settings can have an effect on that.
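As a hedged illustration of what "resource allocation settings" means here, the sketch below lists a few standard Spark properties and estimates how much of the cluster a job would request. The property names are regular Spark settings; the values are arbitrary examples, and where you enter them depends on your Radoop connection configuration.

```python
# Illustrative values only; tune them to your cluster's actual capacity.
spark_settings = {
    "spark.executor.instances": "10",   # number of executors requested from YARN
    "spark.executor.cores": "2",        # cores per executor
    "spark.executor.memory": "4g",      # heap size per executor
    "spark.dynamicAllocation.enabled": "false",
}

def requested_totals(settings: dict) -> tuple:
    """Return (total cores, total executor heap in GB) the job asks for.

    Rough estimate: ignores the driver and the per-executor memory
    overhead that YARN adds on top of the heap. Assumes memory is
    given in whole gigabytes with a "g" suffix.
    """
    instances = int(settings["spark.executor.instances"])
    cores = int(settings["spark.executor.cores"])
    memory_gb = int(settings["spark.executor.memory"].rstrip("g"))
    return instances * cores, instances * memory_gb

total_cores, total_memory_gb = requested_totals(spark_settings)
print(f"requests {total_cores} cores and {total_memory_gb} GB of executor memory")
```

Comparing these totals with what the cluster can actually provide is a quick sanity check: a request that is too large waits for containers, while one that is too small leaves resources idle.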
For smaller jobs, the goal is to decrease the overhead itself. However, for pure Spark operators (you can recognize them by the Spark (star) icon) there is no general way to achieve that. For Hive-based operators (look for the Hive (bee) icon), the overhead can be greatly decreased when Hive-on-Spark is enabled on the cluster. In the following screenshot from the Resource Manager interface of the cluster (accessible via a web browser at <resource_manager_host>:8088 by default), you can distinguish between the two types of job by looking at the User column: the first is a Hive-on-Spark job, the second is a pure Spark job.
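If you prefer to check the User column without opening the browser, the same information is available from the YARN ResourceManager REST API on the same port. The following is a minimal sketch: the host name is yours to fill in, the sample payload is hand-written, and the user names in it are hypothetical (it assumes Hive-on-Spark jobs run under a Hive service user and pure Spark jobs under your own user).

```python
import json
from collections import Counter
from urllib.request import urlopen

def fetch_apps(rm_host: str, port: int = 8088) -> dict:
    """Download the application list from the YARN ResourceManager REST API."""
    with urlopen(f"http://{rm_host}:{port}/ws/v1/cluster/apps") as response:
        return json.load(response)

def jobs_per_user(payload: dict) -> Counter:
    """Count applications per submitting user (the "User" column in the web UI)."""
    # The API returns {"apps": null} when no applications are listed.
    apps = (payload.get("apps") or {}).get("app", [])
    return Counter(app["user"] for app in apps)

# Hand-written sample in the shape the API returns; names are hypothetical.
sample = {
    "apps": {
        "app": [
            {"id": "application_1_0001", "user": "hive", "name": "Hive on Spark"},
            {"id": "application_1_0002", "user": "alice", "name": "Radoop Spark job"},
        ]
    }
}

print(jobs_per_user(sample))
```

Against a live cluster you would call `jobs_per_user(fetch_apps("<resource_manager_host>"))` instead of using the sample.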
The overhead of Hive-on-Spark jobs can be decreased via the "Connection pool" settings in the Preferences, although the default heuristics should already provide good results when operations are executed frequently.
Let me know if you can share your challenges more specifically.
Also, please note that you can expect performance improvement from upgrading to Spark 2.x.
Switching to Spark 2.x for Radoop is very simple: the required Spark archive only needs to be uploaded to HDFS, and Radoop can then use it. There is no need to install or upgrade any services on the cluster side.