Radoop on Amazon EMR fails to initialize

Andrew RapidMiner Certified Expert, RapidMiner Certified Master, Member Posts: 47 Guru
edited August 2020 in Help

I'm very close to getting Radoop working with an Amazon EMR cluster. My setup involves RapidMiner Studio and Radoop on a Windows laptop, which has full, unfettered firewall access to the EMR machines. I am not using SOCKS (although I started with that). I am using the latest Spark, Hive and Hadoop components that Amazon makes available.


The full connection test fails at the point where components are being uploaded to the /tmp/radoop/_shared/db_default/ HDFS location. I can see that the data nodes are being contacted on port 50010, and it looks like this fails from my laptop because the IP addresses are not known. I have tried the dfs.client.use.datanode.hostname true/false workaround, and I can see that it changes the name the client attempts to use: with one setting the node is <name>/<ipaddress>:50010 (which is odd), while with the other it is <ipaddress>:50010 (which is believable but doesn't resolve).
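
For reference, this is the client-side property I was toggling (it can be supplied as an advanced Hadoop parameter in the Radoop connection, in the same key/value form used later in this thread):

dfs.client.use.datanode.hostname  true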

I don't have the luxury of installing RapidMiner components on the EMR cluster, so my question is: what is the best way to expose the data nodes to the PC running RapidMiner Studio and Radoop?


Best Answer

  • Andrew RapidMiner Certified Expert, RapidMiner Certified Master, Member Posts: 47 Guru
    Solution Accepted

    Hello Peter,

    I'm happy to say the Spark suggestion worked and now I can get Radoop connections working completely.

    As promised, here is the list of things to do to get to this happy place.


    Create an EMR cluster and use the advanced options to select Hadoop, Pig, Spark, Hive and Mahout.


    Log on to the master node and determine the internal IP address of the eth0 interface using the command line. 

    ifconfig
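
    To narrow the output to just the address line, something like this works (an illustrative one-liner; it assumes the interface is named eth0 and the older net-tools output format):

    ifconfig eth0 | grep 'inet '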

    While logged in, there are some configuration steps needed to make the environment work. These are described in the Radoop documentation. I observed that Java did not need any special configuration; EMR is up to date. The commands to create various staging locations in HDFS are required; I've repeated them below.

    hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history
    hadoop fs -chmod -R 777 /tmp/hadoop-yarn
    hadoop fs -mkdir /user
    hadoop fs -chmod 777 /user
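
    Optionally, you can sanity-check that the directories and permissions look right:

    hadoop fs -ls -R /tmp/hadoop-yarn
    hadoop fs -ls /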

    An earlier version of Spark needs to be installed, because the Spark 2.x libraries that EMR ships are only present on the master node, so submitted jobs cannot find them on the workers (see Peter's note below). Here are the steps.

    wget -O /home/hadoop/spark-1.6.3-bin-hadoop2.6.tgz https://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
    cd /home/hadoop
    tar -xzvf spark-1.6.3-bin-hadoop2.6.tgz
    hadoop fs -put spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar /tmp/
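
    You can confirm that the assembly landed where the connection will look for it:

    hadoop fs -ls /tmp/spark-assembly-1.6.3-hadoop2.6.0.jar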

    Continue to follow the instructions to set up the network connection. Use the IP address found above as the NameNode address, Resource Manager address and JobHistory Server address. Don't be tempted to use any other name or IP address; it will not work.


    Set the Hive Server address to localhost.


    Set the Hive port to 1235.


    Set the Spark version to Spark 1.6 and set the assembly jar location to

    hdfs:///tmp/spark-assembly-1.6.3-hadoop2.6.0.jar

    Set the advanced Hadoop parameters as follows. These make the Hadoop client route its RPC traffic through the SOCKS proxy that will listen on localhost:1234:

    dfs.client.use.legacy.blockreader  true
    hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.SocksSocketFactory
    hadoop.socks.server localhost:1234

    Now create the SOCKS connection. On Linux, the command looks like this:

    ssh -i <yourkey>.pem -N -D 1234 -L localhost:1235:<ifconfig ip address>:10000  hadoop@<nameofmaster>

    In the command above, replace the values in angle brackets with information from your environment.
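
    For example, with every value filled in purely as a placeholder (the key file, internal IP and master hostname below are all hypothetical):

    # all values below are placeholders
    ssh -i mykey.pem -N -D 1234 -L localhost:1235:172.31.0.10:10000 hadoop@ec2-12-34-56-78.compute-1.amazonaws.com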

    On Windows, use PuTTY to create the SOCKS connection. The Radoop documentation gives a nice picture of the required settings. Make sure you replace hive-internal-address with the IP address determined using the ifconfig command.
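
    If you prefer the command line on Windows, PuTTY's plink should accept the equivalent flags (a sketch only: it assumes your key has been converted to .ppk, and I have only verified the GUI route):

    plink -ssh -i <yourkey>.ppk -N -D 1234 -L 1235:<hive-internal-address>:10000 hadoop@<nameofmaster>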


    Now you can run the Radoop connection tests and with luck, all will be well...


    yay!


    Andrew


Answers

  • zprekopcsak RapidMiner Certified Expert, Member Posts: 47 Guru

    Hi Andrew,

    You will need to use some networking trick, because the datanode IP addresses that you are receiving from the cluster are AWS-internal IP addresses that your PC cannot route to. The dfs.client.use.datanode.hostname setting will not do the trick, as the Hadoop services are not exposed on the public-facing IPs.

    If you can start another EC2 instance in the same local network (VPC in AWS lingo) as the EMR cluster, then I suggest installing a RapidMiner Server on that EC2 instance and enabling the Radoop Proxy. See here for more details: https://docs.rapidminer.com/radoop/installation/networking-setup.html#radoop-proxy

    If you cannot start another instance, then you either need to set up a SOCKS proxy or a VPN.

    Best, Zoltan

  • Andrew RapidMiner Certified Expert, RapidMiner Certified Master, Member Posts: 47 Guru

    Hello Zoltan

    I initially tried SOCKS but couldn't make it work; a misconfiguration of some sort, I assume. Can I be confident that it will eventually be possible using the SOCKS approach? I just need to be sure that I will get it working before I spend time on it. I promise to write up what I did.


    regards


    Andrew

  • Andrew RapidMiner Certified Expert, RapidMiner Certified Master, Member Posts: 47 Guru

    I have almost got it working; the last remaining failure is in the Spark location.

    [Jun 9, 2017 12:11:17 PM] SEVERE: The Spark job could not succeed for any supported Spark Version. It seems that the specified assembly jar or its location is incorrect: local:///usr/lib/spark/jars


    And yet on the EMR master node, I can see local jar files at that location. Is there a specific file that is needed?

  • phellinger Employee, Member Posts: 103 RM Engineering

    Hi Andrew,


    I was able to reproduce your problem on EMR-5.6.0 with Spark 2.1.

    It's important to note that Amazon is quite agile in pushing out new EMR versions :) and sometimes the latest versions have changes that affect the initial RapidMiner connection setup. Let me take a look at this one, but it may take some time.

    Meanwhile, you can always use Spark 1.6 on this cluster as well: just download it from http://spark.apache.org, put the assembly on HDFS, and change the Radoop connection to point to that. For example, run these commands as the hadoop user on the master (I hope I have no typos there):


    wget -O /home/hadoop/spark-1.6.3-bin-hadoop2.6.tgz https://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
    cd /home/hadoop
    tar -xzvf spark-1.6.3-bin-hadoop2.6.tgz
    hadoop fs -put spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar /tmp/
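
    Then point the Radoop connection's Spark assembly jar location at the uploaded file, as in the accepted answer above:

    hdfs:///tmp/spark-assembly-1.6.3-hadoop2.6.0.jar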

    [Screenshot: Screen Shot 2017-06-09 at 16.53.32.png]


    Best,

    Peter

  • Andrew RapidMiner Certified Expert, RapidMiner Certified Master, Member Posts: 47 Guru

    Oops, I made a typo in the instructions.

    It should be:

    wget -O /home/hadoop/spark-1.6.3-bin-hadoop2.6.tgz https://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz

    Also, the SOCKS instructions for Windows PuTTY are incorrect. The address to use is localhost. Confusing, but it seems to work.

  • phellinger Employee, Member Posts: 103 RM Engineering

    Hi Andrew,

    Thanks for the great summary!


    The only thing I did not get is the localhost address comment on Windows. Do you mean you had to use "localhost" as the address (with port 10000) instead of the Hive node's IP address? I would expect that to only work if the HiveServer2 ran on the master node.


    Best,

    Peter

  • Andrew RapidMiner Certified Expert, RapidMiner Certified Master, Member Posts: 47 Guru

    Hello Peter


    I have these PuTTY settings.


    [Screenshot: PuTTY settings (Capture.PNG)]

    If I change the local port 1235 setting to other likely candidate names or IP addresses, I get a failure in the Quick Test of the Radoop connection.


    regards


    Andrew

  • phellinger Employee, Member Posts: 103 RM Engineering

    We've made a small update to the Amazon EMR guide at https://docs.rapidminer.com/radoop/installation/distribution-notes.html.

    Both Spark 1.x and Spark 2.x can be used easily. The most efficient configuration is described there: upload the Spark assembly (1.x) or the Spark jars (2.x) to HDFS in a compressed format and provide the HDFS URL in the Radoop connection.
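
    For Spark 2.x, a minimal sketch of that approach (assuming Spark is installed under /usr/lib/spark on the master and the zip utility is available; the archive name is my own choice):

    cd /usr/lib/spark/jars
    zip /tmp/spark-jars.zip *.jar
    hadoop fs -put /tmp/spark-jars.zip /tmp/
    # then use hdfs:///tmp/spark-jars.zip as the Spark location in the Radoop connection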


    (The error came from the fact that Spark libraries are only installed on the master node, so the submitted jobs could not find them on worker nodes.)


    Best,

    Peter
