Increase Radoop Performance

kevin_m Member Posts: 5 Contributor I
edited November 2018 in Help

Hello, is it possible to increase the performance or speed of the spark-query? If so, how? Thanks in advance!

Best Answers

  • phellinger Employee, Member Posts: 101   RM Engineering
    Solution Accepted

    Hi,

     

    That depends on which Spark queries you mean.

    Before any specifics, let me note that Hadoop (YARN) jobs have an annoyingly large overhead, which is especially noticeable when running simple operations on small data sets. The overhead only becomes relatively small when you run the "real" thing: distributed and/or complex jobs on huge data sets, where it is minor compared to the job runtime.

     

    For larger jobs, overall performance may depend on how well the cluster resources are allocated. The Spark settings related to resource allocation affect that.
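
To make "resource allocation settings" more concrete, here is a minimal sketch that collects a few standard Spark properties and adds up what a static allocation would request from YARN. The property names are Spark's own; the values are placeholders rather than recommendations, and the helper functions are hypothetical. How these properties are passed to Radoop (for example through advanced Spark parameters on the connection) is an assumption to check against your Radoop version.

```python
# Standard Spark property names; the values are placeholders, not tuning advice.
spark_settings = {
    "spark.executor.instances": "8",
    "spark.executor.cores": "2",
    "spark.executor.memory": "4g",
    "spark.dynamicAllocation.enabled": "false",
}


def total_requested(settings):
    """Return (total cores, total executor heap in GB) for a static allocation.

    Assumes spark.executor.memory is given in whole gigabytes ("4g").
    The memory overhead YARN adds per executor is not included.
    """
    instances = int(settings["spark.executor.instances"])
    cores = int(settings["spark.executor.cores"])
    mem_gb = int(settings["spark.executor.memory"].rstrip("g"))
    return instances * cores, instances * mem_gb


def as_property_lines(settings):
    """Render the settings as key=value lines, sorted by key."""
    return [f"{key}={value}" for key, value in sorted(settings.items())]


if __name__ == "__main__":
    print(total_requested(spark_settings))
    print("\n".join(as_property_lines(spark_settings)))
```

Comparing the totals with what the cluster's YARN queue can actually grant is a quick sanity check before running a large job.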

     

    For smaller jobs, the goal is to reduce the overhead. For pure Spark operators, which you can recognize by the Spark (star) icon, there is no general way to achieve that. For Hive-based operators, which carry the Hive (bee) icon, the overhead can be greatly reduced when Hive-on-Spark is enabled on the cluster.

    In the following screenshot from the Resource Manager interface of the cluster (accessible via a web browser at <resource_manager_host>:8088 by default), you can distinguish between the two types of job by looking at the User column: the first is a Hive-on-Spark job, the second is a pure Spark job.

    [Screenshot: Screen Shot 2017-07-12 at 14.42.55.png]
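
The same User column can also be read without the browser. The sketch below assumes the Resource Manager exposes the standard YARN REST endpoint `/ws/v1/cluster/apps` on the same port and returns JSON; the function names are hypothetical, and the user names in the usage note are invented for illustration.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_apps(rm_host, port=8088):
    """Fetch the application list from the YARN Resource Manager REST API."""
    url = f"http://{rm_host}:{port}/ws/v1/cluster/apps"
    with urlopen(url) as resp:
        return json.load(resp)


def apps_by_user(payload):
    """Group applications by submitting user.

    Returns {user: [(name, applicationType, elapsedTime in ms), ...]}.
    An empty cluster reports "apps": null, which is treated as no applications.
    """
    apps = (payload.get("apps") or {}).get("app", [])
    grouped = defaultdict(list)
    for app in apps:
        grouped[app["user"]].append(
            (app["name"], app["applicationType"], app["elapsedTime"])
        )
    return dict(grouped)
```

Calling `apps_by_user(fetch_apps("<resource_manager_host>"))` would then show, per user, which jobs ran and how long they took, which makes it easier to see whether a Hive-on-Spark job or a pure Spark job is carrying the overhead.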

    The overhead of Hive-on-Spark jobs can be reduced via the "Connection pool" settings in the Preferences, although the default heuristics should already give good results when operations are executed frequently.

     

    Let me know if you can share your challenges more specifically.

     

    Best,

    Peter

     

  • phellinger Employee, Member Posts: 101   RM Engineering
    Solution Accepted

    Also, please note that you can expect performance improvement from upgrading to Spark 2.x.

    Switching to Spark 2.x for Radoop is very simple: the required Spark archive can be uploaded to HDFS, and Radoop can use it from there. There is no need to install or upgrade any services on the cluster side.

     

    Peter
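
The upload step mentioned in this answer can be sketched as two `hdfs dfs` commands, built here as argument lists. This is only an illustration: the archive file name and the HDFS directory in the usage note are hypothetical, and the Radoop connection setting that must then point at the uploaded archive is not covered here and should be looked up in the Radoop documentation.

```python
import subprocess


def upload_commands(local_archive, hdfs_dir):
    """Build the HDFS shell commands that copy a local archive to hdfs_dir."""
    return [
        # Create the target directory (and any parents) if it does not exist.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Copy the archive, overwriting an existing file of the same name.
        ["hdfs", "dfs", "-put", "-f", local_archive, hdfs_dir],
    ]


def run(commands):
    """Execute the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run(upload_commands("spark-2.x-archive.tgz", "/tmp/radoop-spark"))` would need to be executed on a machine with an HDFS client configured for the cluster; both arguments are placeholders.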
