
Process Failed: HiveQL problem

kumar_anant0308 Member Posts: 5 Contributor II
edited December 2018 in Product Feedback - Resolved

I've installed Hadoop using Ambari Server (Hortonworks Data Platform HDP-2.6). The connection to Radoop passed without any errors, and I am able to store and retrieve data in Hive from RapidMiner using that connection. However, when I run any Spark-related process, for example the k-means tutorial process in Radoop, I get the following error:

com.rapidminer.operator.OperatorException: HiveQL problem (org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException Line 1:17 Invalid path ''/tmp/radoop/admin/tmp_1525790714419a_csh8jcj/'': No files matching path hdfs://node.server.com:8020/tmp/radoop/admin/tmp_1525790714419a_csh8jcj)
SEVERE: Process failed: HiveQL problem (org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException Line 1:17 Invalid path ''/tmp/radoop/admin/tmp_1525790714419a_csh8jcj/'': No files matching path hdfs://node.server.com:8020/tmp/radoop/admin/tmp_1525790714419a_csh8jcj)

I'm also able to run Spark programs from the terminal, but I get this error when I run them through Radoop.

Can anybody suggest a solution for this?


Fixed and Released · Last Updated

This is a Hive bug; a workaround is available until it is fixed on the Hive side. Please see https://docs.rapidminer.com/latest/radoop/troubleshooting/known-errors.html for more details. RAD-1657

Comments

  • kumar_anant0308 Member Posts: 5 Contributor II

    The process works fine when we run the k-means clustering (not the Radoop operator) inside SparkRM with the "merge output" and "resolve schema conflicts" options enabled.

    But the same error still persists for the tutorial process.

  • phellinger Employee, Member Posts: 103 RM Engineering

    Hi,

    Are you referring to the "Full Test" when you say "The connection to Radoop passed without any errors"?

    It would be helpful to run the Full Test (it may take some time), as it includes a Spark job test as well.

    If the Full Test succeeds, there may be a hidden error in the submitted Spark job, which could result in missing output, but that is just a guess. Do you have access to the Resource Manager UI of the cluster (typically at port 8088)?
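
    If you can reach a cluster node over a terminal, the same information is also available from the YARN command line; a minimal sketch (it assumes the YARN client is configured on that node, and the application id is just a placeholder):

        # List recent applications and locate the Spark job submitted by Radoop
        yarn application -list -appStates ALL

        # Inspect the logs of a finished application (substitute the real id)
        yarn logs -applicationId <application_id>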

    Best,

    Peter

  • kumar_anant0308 Member Posts: 5 Contributor II

    The Full Test passed without any errors, and I'm able to access the Resource Manager too. The process (Spark job) shows its status as "SUCCEEDED". I only ran into a problem with the Radoop k-means tutorial process.

    In the Hadoop Data tab, we are able to view the clustered data, but the process still ends with the same SemanticException error.

    (Attachment: 8.png)
  • mborbely Member Posts: 14 Contributor II

    Hi,

    This problem occurs when Radoop tries to load the output of the Spark job into a Hive table. My first assumption would be that the user running Hive jobs cannot access the files created by the Spark job (which are owned by user 'admin'). We have to find the exact reason behind this, and for that we need to know a bit more about your configuration.

    1. Is your cluster secured?
    2. What kind of authorization model do you use in Hive?
    3. Do you have hive.server2.enable.doAs enabled? You can find this in Ambari as "Run as end user instead of Hive user" under Hive settings. (One way to check the live values from a client is sketched below.)
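
    Just for reference, such a check from a client machine might look like the following sketch; it assumes beeline is available and that HiveServer2 listens on the default port 10000 of node.server.com (the host is taken from the error message above and may well differ in your setup):

        # Ask HiveServer2 for the effective values of the relevant properties
        beeline -u jdbc:hive2://node.server.com:10000 -n admin \
          -e "SET hive.server2.enable.doAs; SET hive.server2.authentication; SET hive.security.authorization.enabled;"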

    On the other hand, your sentence "In the Hadoop Data tab, we are able to view the clustered data" makes me doubt this assumption. Just to be sure: you can see it among the tables, and not through the "import files from HDFS" option, right? Are you 100% sure you see the result of the currently executed process? Because if you are, we have to look in entirely different directions.
    If you disable the "cleaning" option on the Radoop Nest and, after running the process, list the contents of the folder (the one from the Invalid path, as user admin), do you see your data files, or just an empty folder?
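
    A minimal sketch of that check from the command line, run with the credentials of user admin (the specific temporary folder name below is the one from the error message; a new run with cleaning disabled will of course produce a different one):

        # List Radoop's temporary directory and the folder from the "Invalid path" error
        hdfs dfs -ls /tmp/radoop/admin/
        hdfs dfs -ls hdfs://node.server.com:8020/tmp/radoop/admin/tmp_1525790714419a_csh8jcj/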

  • kumar_anant0308 Member Posts: 5 Contributor II

    1. We've not enabled any kind of security.

    2. There is no authentication (hive.server2.authentication is set to NONE)

    3. The "hive.server2.enable.doAs" attribute is set to "true", but "enabled" is set to 'F' (in the configRadoop_xml.txt).

    We can see the clustered data among the tables (please see ClusteredData.png).

    We tried with the "cleaning" option disabled, and we are able to list the contents of the folder '/tmp/radoop/admin/' (HDFSls.png), but the folder on which the HiveQL exception occurs is empty (RstudioLog.png).

  • mborbely Member Posts: 14 Contributor II
    Solution Accepted

    Hi,

    Thanks for your response. This seems to be a Hive bug. I need to share a few details to explain the problem. First off, if such an operation fails, Radoop simply retries it. If the second try also fails, you see an error message, but only for the second try. This is rather unfortunate when the first failure is something different, which can only happen if the first try has some unwanted side effects, and that is exactly what happened in this case. Since you didn't have a chance to see and report the original problem, I reproduced the issue on one of our test clusters. It seems to be caused by the "hive.warehouse.subdir.inherit.perms" setting. Because of it, Hive tries to take ownership of the data files located under its warehouse directory. After issuing the load command, Hive first moves the file to the warehouse directory, then changes its ownership. This is where the problem occurs: in your setup this operation is not permitted for user hive, because it is neither the owner of the file nor a superuser. However, this shouldn't be a problem, because the Hive docs (https://cwiki.apache.org/confluence/display/Hive/Permission+Inheritance+in+Hive) state that "Failure by Hive to inherit will not cause operation to fail." Therefore it's a bug in Hive.
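
    To make the failing step concrete, the ownership change Hive attempts is roughly equivalent to the command below. This is only an illustration: the warehouse path is the usual HDP default, and the table directory and file name are placeholders, not taken from your job.

        # With hive.warehouse.subdir.inherit.perms=true, Hive tries to make the moved
        # data file (still owned by 'admin') match the owner of the warehouse directory.
        # In HDFS only a superuser may change a file's owner, so this step is rejected;
        # per the Hive docs that failure should not fail the operation, but here it does.
        hdfs dfs -chown hive /apps/hive/warehouse/<table_dir>/<data_file>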

    However, by this time the files have already been moved into the warehouse directory and can actually be queried through Hive. This is why you can see the tables in the Hadoop Data view. The second try then fails with a different error, since the files to be imported are no longer present at their original location.

    Luckily there are some workarounds for this problem:

    1. Since you have an unsecured cluster, you can set the Hadoop username on the Radoop connection to "hive". This is very simple because you don't have to change anything in your cluster configuration, which you might not have access to. On the other hand, all your jobs (Spark, MR, etc.) will run in the name of user hive, and all the output files will be owned by this account. If that doesn't cause any problems for you, I'd suggest this option.
    2. You can set hive.warehouse.subdir.inherit.perms to false in Ambari (and restart all affected services). This way Hive won't try to change the owner of these files to "hive". Please note that this change affects all other users of the cluster, so I suggest you only do it if you are 100% sure you won't break anything else.
    3. You can also set hive.mv.files.thread to 0 in Ambari. This disables certain optimizations and thus bypasses the problematic code, but it might have some performance impact on the cluster. (A quick way to check the current values of these two properties is sketched after this list.)
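
    Before changing anything in Ambari, you may want to confirm the current values of the properties involved in workarounds 2 and 3. A sketch of such a check, assuming beeline and the default HiveServer2 port 10000 (the host is taken from the error message and may differ in your environment):

        beeline -u jdbc:hive2://node.server.com:10000 -n admin \
          -e "SET hive.warehouse.subdir.inherit.perms; SET hive.mv.files.thread;"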

    Side note: you mentioned that the "hive.server2.enable.doAs" attribute is set to "true" but "enabled" is set to 'F' in configRadoop_xml.txt. The thing is, these settings don't come from Radoop; they are cluster-side configurations that are added to the connection entry during the Ambari import as disabled Advanced Settings, mainly for informational purposes. In fact, Radoop cannot even override this particular property dynamically. So this is the actual configuration your cluster uses, regardless of it being disabled on the Radoop connection.

  • kumar_anant0308 Member Posts: 5 Contributor II
    Thanks for being so responsive and for contributing so much of your valuable time and knowledge. That was an excellent solution, and option 1 worked for me.