SparkRM, Hive, TEZ, Python, R, PySpark, SparkR - What is the Sequence? Or, The Radoop Matryoshka

by RMStaff 2 weeks ago - edited a week ago

Question: If I put a Hive operator inside a SparkRM, does it become a Spark job?

No, you can only use standard RapidMiner operators inside SparkRM; you cannot use a Hive operator. However, you can configure Hive to use Spark as its execution engine, and then all the Hive operators in Radoop run on Spark. There is a Hive option for that (hive.execution.engine) that you can set in the connection.

Question: If using Hortonworks and Hive with embedded TEZ, do my Hive operators automatically leverage TEZ?

As in the previous question, you just need to set the hive.execution.engine property in the connection to “tez”.
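For reference, the execution-engine setting described in the two answers above is a single key/value pair added to the connection’s Advanced Hive Parameters. A sketch of the possible values (these are the standard values Hive accepts; “mr” is the classic MapReduce default):

```
hive.execution.engine = spark
# or
hive.execution.engine = tez
# or the classic default
hive.execution.engine = mr
```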

Question: Can I execute Python or R inside a Radoop nest, and will it execute on the cluster?

You can use SparkR or PySpark with the “Spark Script” operator. That would be the easiest way.

If, for example, you need a package that is not available in SparkR, then you can do it with SparkRM as above, but again, you need to have R (and the required packages) installed on every cluster node, in the same path.
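A minimal sketch of what a script for the “Spark Script” operator can look like. Assumption: the operator invokes an entry-point function (named rm_main here) with the input data as a Spark DataFrame and stores whatever it returns; the exact entry-point contract is not spelled out in the post, so check your Radoop version’s documentation.

```python
# Sketch of a PySpark script for Radoop's "Spark Script" operator.
# ASSUMPTION: the operator calls rm_main(df) with the input as a Spark
# DataFrame and uses the returned DataFrame as the operator's output;
# verify the entry-point name and signature against your Radoop docs.

def rm_main(df):
    # Example transformation: keep only rows whose hypothetical
    # "value" column is positive. Runs distributed on the cluster.
    return df.filter(df["value"] > 0)
```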

Question: Can I run Hive operators on Spark without Hadoop?

No, we don’t integrate with Spark without Hadoop; you need a Hive server and YARN installed. You can, however, use Spark as Hive’s execution engine.

Question: When writing PySpark, where should I execute the code? Radoop nest, SparkRM or Studio?

Use the “Spark Script” operator, which should be placed inside the Radoop nest.

Submit Radoop jobs to specific queues

by RMStaff 3 weeks ago

Depending on the cluster configuration and the nature of the Radoop jobs, there may be a need to allocate the jobs to specific queues.

 

Such needs can be handled by specifying the queues to use in the Radoop configuration dialog.

  • For MapReduce jobs, add "mapred.job.queue.name" as the property and the queue name as the value in the Advanced Hadoop Parameters.
  • For Hive, add the same property to the Advanced Hive Parameters.
  • For Hive on Tez, the "tez.queue.name" property controls the queue.
  • For Spark, add "spark.yarn.queue" to the Advanced Spark Parameters.
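For example, to route every engine’s jobs to a queue named analytics (the queue name is an example, not from the original post), the key/value pairs entered in the respective parameter lists would look like this:

```
# Advanced Hadoop Parameters
mapred.job.queue.name = analytics

# Advanced Hive Parameters
mapred.job.queue.name = analytics
# or, when Hive runs on Tez:
tez.queue.name = analytics

# Advanced Spark Parameters
spark.yarn.queue = analytics
```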

How to Schedule Radoop Process

by RMStaff on ‎09-23-2016 06:00 AM

Scheduling is a functionality provided by RapidMiner Server. A Radoop workflow needs to be saved in a RapidMiner Server-based repository, and it can then be scheduled like any standard process.

Before you run Radoop workflows on the Server, you will also need to ensure that RapidMiner Radoop is installed on the Server.

Instructions for installation are here 

http://docs.rapidminer.com/radoop/installation/radoop-server-install.html

 

You will also need to configure Radoop connections on the server

http://docs.rapidminer.com/radoop/installation/configuring-radoop-connections.html

 

Ensure that the Radoop connections you intend to use are defined on the Server. Once the process is saved on RapidMiner Server, you can use the scheduling capabilities of the Server to manage Radoop workflows.

To see the steps for scheduling a process from the Studio interface, refer to this article:

http://docs.rapidminer.com/server/how-to/schedule-a-process/schedule-from-studio.html

To see the steps for scheduling a process from the Server interface, refer to this article:

http://docs.rapidminer.com/server/how-to/schedule-a-process/schedule-from-server.html

 

Moving data from RDBMS to Hadoop/Hive

by RMStaff on ‎08-09-2016 07:49 AM - edited on ‎09-21-2016 10:42 AM by Community Manager
There is often a need to move data from relational databases to Hadoop to start leveraging the power of Hadoop.

Depending on the use case, it may be a one-time effort or you may need to do this periodically. RapidMiner provides a way to do this very easily using the RapidMiner Studio client and the Radoop extension. This article describes setting up a RapidMiner workflow to import data from a relational data store into Hadoop.

 

You can download the two products from here 

https://my.rapidminer.com/nexus/account/index.html#downloads

or get in touch with us at https://rapidminer.com/contact-sales-request-demo/

 

  • Drag the Radoop Nest operator into a new process canvas
  • Configure the Radoop connection (the details for setting up Radoop connections are here: http://docs.rapidminer.com/radoop/installation/configuring-radoop-connections.html)
  • Provide a table prefix (table prefixes are used by temporary objects, which are automatically deleted in most cases)
  • Double-click on the Radoop Nest operator
  • Drag the "Read Database" operator from the Extensions >> Radoop >> Data Access >> Read group
  • Configure the Read Database operator; it allows you to use a predefined database connection, a JDBC URL, or a JNDI name for the connection. You can build a query, use a table name, or specify a SQL file to define the source.
  • Then connect the out port of Read Database to the "Store in Hive" operator

 

 

 

  • You can use the Store in Hive configuration options to determine how the data is stored and partitioned, and whether it should use external tables, custom storage, or a custom SerDe.
  • The Store in Hive operator can also drop the table first if it already exists.
  • In cases where you need to append to an existing Hive table, use the Append to Hive operator instead.
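Roughly, the drop-and-create and append behaviours map to HiveQL statements like the following (table and column names are made up for illustration):

```sql
-- "Store in Hive" with the drop-first option enabled, approximately:
DROP TABLE IF EXISTS imported_customers;
CREATE TABLE imported_customers (id INT, name STRING);

-- "Append to Hive" then corresponds to inserting into the existing table:
INSERT INTO TABLE imported_customers VALUES (1, 'example');
```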


 

 

  • To run the process now, you can hit the blue play button at the top


 

  • You can also schedule the process to run if you have RapidMiner server installed and configured.

 


 

 

You can add more than one of these read/store pairs.

Sometimes there may be a need to do some data preparation before the data is actually stored; such workflows are easy to build with RapidMiner.

 


 

 

 


Store in Hive using Delimited Row Format

by RMStaff on ‎08-04-2016 02:44 PM - edited on ‎09-20-2016 06:06 AM by Community Manager

RapidMiner Radoop’s “Store in Hive” operator is a versatile operator that allows you to save data in Hive tables or in external tables. This article describes how to enable custom storage and use a DELIMITED row format while storing.

Please ensure that the advanced parameters are enabled when you need to use DELIMITED format.


Click the “Custom Storage” checkbox to expose additional options. Note that you cannot use the custom storage handler option together with a custom row format.

Then, in the row format dialog, change “default format” to DELIMITED. In the additional settings, provide the detailed settings for the target table’s row format.

Please note that older hive versions may not support all of the settings.
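In HiveQL terms, the DELIMITED settings described above correspond to a table definition like the following sketch (table name, columns, and delimiter characters are illustrative examples, not values from the article):

```sql
-- Roughly what the DELIMITED row format settings translate to:
CREATE TABLE example_table (id INT, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```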

 


 

 Download RapidMiner Radoop for free today from http://bit.ly/RadoopDL

Custom storage handlers on Hadoop when using Radoop "Store in Hive"

by RMStaff ‎08-04-2016 02:33 PM - edited ‎08-09-2016 08:22 AM

When using RapidMiner Radoop's "Store in Hive" operator, there may be a need to use custom storage handlers.

Storage handlers make it possible for Hive to access data stored and managed by other systems.

RapidMiner’s “Store in Hive” operator provides a lot of flexibility when it comes to saving data in Hive tables or in external tables on HDFS or Amazon S3.

Additionally, custom storage handlers may allow you to use Hypertable, Cassandra, JDBC, MongoDB, or Google Spreadsheets, as documented in the Hive documentation.

 

To enable custom storage, ensure that the advanced parameters are visible.

Now click the “Custom Storage” checkbox to explore the options for using custom storage handlers.

 


 

Once you click the "custom storage" option, additional options become available.

When providing the custom storage handler, you need to ensure that its class exists in the CLASSPATH of the Hive server.
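For orientation, this is what a storage-handler-backed table looks like in HiveQL. The HBase handler below ships with Hive and its class and property names are real; the table and column names are made-up examples:

```sql
-- Illustrative: a Hive table backed by a custom storage handler (HBase).
CREATE TABLE hbase_example (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "example");
```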

 

 


 

The user-defined SerDe properties can then be added by clicking the “Edit List” button.

Please note that the SerDe properties are case sensitive.


 

 

 

  Download RapidMiner Radoop for free today from http://bit.ly/RadoopDL

Store in hive using custom SerDe

by RMStaff ‎08-04-2016 03:01 PM - edited ‎08-04-2016 03:05 PM

RapidMiner Radoop’s “Store in Hive” operator is a versatile operator that allows you to save data in Hive tables or in external tables. This article describes how to enable custom storage and use a custom SerDe while storing.

Please ensure that the advanced parameters are enabled when you need to use a custom SerDe.


Once the custom storage option is clicked, you will have additional options; change the row format box to "Custom SerDe".


 

Then provide the SerDe class name. Please ensure that it exists in the classpath of the Hive server.

Additional SerDe properties can be set by clicking the "Edit List" option. These case-sensitive key-value pairs are passed on to the table's SerDe.
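In HiveQL terms, the SerDe class name and the "Edit List" key-value pairs end up in a table definition like the sketch below. The JsonSerDe class shown ships with Hive's HCatalog module; the table name and the example property are illustrative only:

```sql
-- Illustrative: custom SerDe plus SERDEPROPERTIES from "Edit List".
CREATE TABLE json_example (id INT, payload STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ('serialization.encoding' = 'UTF-8')
STORED AS TEXTFILE;
```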

 

 

For a list of built-in SerDes, and to learn how to write your own SerDe, see https://cwiki.apache.org/confluence/display/Hive/SerDe

 

You can also select additional Hive file format settings or Impala file format settings in the additional options available. Please note that older Hive versions may not support some of the file formats. The Hive file formats supported as of version 7.2 (Aug 2016) of Radoop are TEXTFILE, RCFILE, ORC, SEQUENCEFILE, PARQUET, and custom format.

 

Additional options for the input format and output format are exposed when the custom format option is selected.
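The custom format option corresponds to naming explicit input/output format classes in the DDL. The two classes below ship with Hadoop and Hive respectively; the table name is an example:

```sql
-- Illustrative: explicit input/output format classes for a custom format.
CREATE TABLE custom_fmt (id INT)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```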

Change storage location on Hadoop

by RMStaff on ‎08-04-2016 01:44 PM

RapidMiner Radoop allows you to do code-free data preparation, blending, and cleansing in a distributed fashion on Hadoop. Often there is a need to store this data in Hadoop after the data cleansing steps are completed. Radoop’s “Store in Hive” operator is an excellent way to store data in Hive in general, but sometimes there is a need to control the location (directory) where it is stored rather than relying on Hive to manage it.

 

To see the options needed for this, make sure you have selected to show the advanced parameters for the operator.

 


To specify a custom location, you can still use the “Store in Hive” operator and specify the custom location in the corresponding parameter.

 


 

 

The path can be an external location on HDFS or on Amazon S3. For Amazon S3, use the s3://&lt;bucket&gt;/&lt;path&gt; or s3n://&lt;bucket&gt;/&lt;path&gt; format to specify the destination directory (it will be created if it does not exist). Please note that in this case the target directory cannot be checked or emptied beforehand, since it cannot be accessed directly without AWS credentials.
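In HiveQL terms, a custom storage location corresponds to a LOCATION clause like the sketch below (bucket, path, table, and columns are made-up examples):

```sql
-- Illustrative: an external table at a custom S3 location.
CREATE EXTERNAL TABLE sales_clean (id INT, amount DOUBLE)
LOCATION 's3n://my-bucket/radoop/sales_clean/';
```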

Store in Hadoop Hive using Custom partitioning

by RMStaff on ‎08-04-2016 01:33 PM

When working with Hadoop, RapidMiner Radoop provides a code-free way to store data in Hive. For retrieval performance reasons, especially if you are filtering data based on specific columns, you can achieve significant gains by storing the data in partitions.

 

The RapidMiner Radoop extension for Hadoop processing provides the ability to define partitioning rules based on one or more columns. Rows with different values are then handled separately by Hive. This article describes the steps to enable partitioning during the Store in Hive step.

 

To see the option, click the "Show advanced parameters" option in the Parameters view of the store operator.


Then click the "Select Attributes" option for the partition by parameter.


 

In the pop-up window, move the attributes you want to partition by from the left list to the right list.

 


 

You can also change the order of partitioning by moving attributes up or down in the right-hand list.

 

If your attribute is not visible in the left-hand list, you can type its name manually and then click the plus icon to add it to the list.

 

 

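In HiveQL terms, partitioning by two attributes, with the order given by the right-hand list, corresponds to a PARTITIONED BY clause like this sketch (table and column names are examples):

```sql
-- Illustrative: a table partitioned by country, then year.
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (country STRING, year INT);
```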