Connecting to CDH5 in an EC2 instance

pau_fernandez_q Member Posts: 2 Contributor I
edited September 2019 in Help

Dear all,

 

I have recently launched an EC2 instance running CDH 5.11. All services seem to be up and running, and I have run several tests to validate the installation.

 

I have also installed RapidMiner Studio on my desktop, along with the Radoop extension, and I am now trying to connect to my Hadoop cluster. The EC2 instance is not configured to use Elastic IPs; I am using tunnels through an SSH session.

 

I am trying to pass the full test to validate the connection. Initially, the configuration was imported from Cloudera Manager; I then modified several properties to match my environment. The Hive, Java version, MapReduce, and NameNode networking tests pass, but I am stuck on the upload of a jar file to HDFS. I suspect the problem is related to the warnings raised earlier by the DataNode networking test:

 

 WARNING: Reverse DNS lookup failed! Expected hostname for ip <public-ip>: <fqdn>, but received <public DNS>.

 WARNING: DataNode port 50010 on the ip/hostname <fqdn> cannot be reached. Please check that you can access the DataNodes of your cluster.

 

I believe the tunnel on port 50010 is working, but there is something I am missing. The output of netstat shows that this port is listening on all addresses (0.0.0.0).
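
Both warnings can be reproduced from the client machine with a few lines of Python. This is only a minimal sketch, not what Radoop runs internally: the function names are made up for illustration, and the IP and hostname you pass in are whatever applies to your cluster.

```python
import socket


def reverse_dns_matches(ip, expected_hostname, resolver=socket.gethostbyaddr):
    """Return (matches, resolved_name) for a reverse DNS lookup of ip.

    resolver is injectable so the check can be exercised without a network.
    """
    try:
        resolved, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError.
        return False, None
    return resolved.lower() == expected_hostname.lower(), resolved


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

Calling port_is_reachable with the DataNode hostname reported in the warning and port 50010 shows whether the client can open a connection to the address it is actually told to use. A successful connect only proves that something accepts connections at that address; it does not prove that HDFS writes will succeed.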

 

Things I have tried:

 

- Edited my local hosts file to resolve the public IP to the internal server hostname. Radoop then complains that the server is unreachable.

- Formatted the NameNode after deleting all data in the HDFS data directory.

- Set dfs.client.use.datanode.hostname and dfs.datanode.use.datanode.hostname to true in the client configuration.

- Tried to upload a file using another client, such as Toad. Same error.

- Tried to set dfs.datanode.address on the server to hostname:port. Cloudera Manager does not allow this; only the port number can be set.

- Edited dfs.datanode.address in the client configuration. This does not change Radoop's behaviour.
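
For reference, the two hostname-related client properties from the list above can be written out in the standard Hadoop configuration XML layout. The snippet below is a small sketch that renders them with Python's standard library; the helper name render_hadoop_properties and the CLIENT_OVERRIDES dictionary are made up for illustration, and where the resulting properties need to be placed for Radoop to pick them up is not covered here.

```python
import xml.etree.ElementTree as ET


def render_hadoop_properties(properties):
    """Render a dict of Hadoop settings as <configuration> XML text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# The two client-side properties mentioned in the list above.
CLIENT_OVERRIDES = {
    "dfs.client.use.datanode.hostname": "true",
    "dfs.datanode.use.datanode.hostname": "true",
}

if __name__ == "__main__":
    print(render_hadoop_properties(CLIENT_OVERRIDES))
```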

 

The error when trying to upload the jar file is the following:

[----] SEVERE: File /tmp/radoop/_shared/db_default/radoop_hive-v4_UPLOADING_1498636293395_dy8gaul.jar could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation.

 

Somehow the client knows the number of DataNodes in the HDFS service. Does that mean the SSH tunnel on port 50010 is working? Can someone point me in the right direction?

 

Thank you!!

Answers

  • pau_fernandez_q Member Posts: 2 Contributor I

    Hi phellinger,

     

    Thanks a lot, this was helpful. I had not read this documentation and was trying a thousand tunnels.

     

    I am now able to pass the quick test. The full test fails at the Hive table load. The error tells me to check user permissions for LOAD or CREATE statements, which I have already done, and they seem to be fine.

     

    Can you point me in the right direction?

     

    Thank you in advance!

     

    Best,
    Pau

  • phellinger Employee, Member Posts: 95 RM Engineering

    Hi Pau,

     

    Great!

     

    The Hive load test uploads an HDFS file to a temp dir, and uses the LOAD DATA Hive statement that will effectively move the file to the Hive warehouse directory.
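
To make the second step concrete, here is a rough sketch of what such a statement looks like, built in Python. The helper name, the path, and the table name are hypothetical and only for illustration; the exact statement Radoop issues is visible in the log.

```python
def build_load_statement(hdfs_path, table):
    """Build a Hive LOAD DATA statement for a file that is already in HDFS."""
    if "'" in hdfs_path:
        # Keep the sketch simple: refuse paths that would need quote escaping.
        raise ValueError("path must not contain single quotes")
    # Without the LOCAL keyword, the path refers to HDFS and Hive moves the file.
    return f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"


if __name__ == "__main__":
    # Hypothetical temp file and table name.
    print(build_load_statement("/tmp/radoop/example/data.csv", "radoop_test"))
```

Because the file is moved rather than copied, permissions on both the temp directory and the warehouse directory can matter, not only the Hive privileges for LOAD or CREATE.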

    If you enable the Log panel in Studio (View -> Show Panel -> Log) and set the log level (right click on the panel -> Set log level -> FINER), you will see the details.

     

    Can you share more details (the log), either by PM or here?

     

    Best,

    Peter
