06-19-2017 05:18 AM
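I'm trying to get our RapidMiner Server on Ubuntu 16 to access the internet from behind a proxy, but it keeps failing. As far as I know I entered the right details, but without success.
I've tried adding the following to bin/standalone.conf, first with the port as part of the proxyHost value and then with a separate proxyPort property: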
JAVA_OPTS="$JAVA_OPTS -Dhttp.proxySet=true -Dhttp.proxyHost=http://[our_proxy_ip]:[our_port]"
JAVA_OPTS="$JAVA_OPTS -Dhttps.proxySet=true -Dhttps.proxyHost=http://[our_proxy_ip]:[our_port]"
JAVA_OPTS="$JAVA_OPTS -Dhttp.proxySet=true -Dhttp.proxyHost=http://[our_proxy_ip] -Dhttp.proxyPort=[our_port]"
JAVA_OPTS="$JAVA_OPTS -Dhttps.proxySet=true -Dhttps.proxyHost=http://[our_proxy_ip] -Dhttps.proxyPort=[our_port]"
Neither variant makes any difference; the server still cannot reach the internet (which we need for web mining).
Any other ideas on how I can troubleshoot this? I cannot find any suggestions other than the above, so I'm stuck now.
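One detail worth checking in the settings above: the JVM's http.proxyHost and https.proxyHost properties expect a bare host name or IP address without a scheme prefix such as http://, and the port belongs in the separate proxyPort property. The following is a minimal Python sketch, not a confirmed fix for this setup, that splits a proxy URL into those two values and prints a matching JAVA_OPTS line. The function name java_proxy_flags and the address 10.0.0.1:3128 are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit


def java_proxy_flags(proxy_url):
    """Turn a proxy URL into JVM -D flags with a bare host and a separate port."""
    parts = urlsplit(proxy_url)
    if parts.hostname is None or parts.port is None:
        raise ValueError("expected a URL like http://host:port")
    flags = []
    for scheme in ("http", "https"):
        # The host value carries no scheme prefix; the port is its own property.
        flags.append("-D%s.proxyHost=%s" % (scheme, parts.hostname))
        flags.append("-D%s.proxyPort=%d" % (scheme, parts.port))
    return " ".join(flags)


# Hypothetical proxy address, used only for illustration.
print('JAVA_OPTS="$JAVA_OPTS %s"' % java_proxy_flags("http://10.0.0.1:3128"))
```

The printed line can be compared with the entries in bin/standalone.conf. Whether these properties are honored at all may still depend on the Java version and on how the application opens its connections.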
06-19-2017 10:37 AM
Indeed, depending on the Java version used, the '-Dhttp.proxy' parameters are not guaranteed to work. A good way to set the server's proxy details is to do it with a process. I have attached a proxy-enforcing process which directly sets the RapidMiner Server proxy settings. Please run the included operator after the server has started, or just before your web crawling process. With this operator you can even set proxy parameters dynamically if required, e.g. when there are multiple proxies or you need to turn the proxy on and off.
06-23-2017 12:56 PM - edited 06-23-2017 01:02 PM
@homburg, would you perhaps have any other ideas? This is quite a serious problem for us, since it means we cannot access the internet. Given that we need RMS for web mining, going online is crucial.
If it helps, we get the following error message when trying to use a Python urllib call within RMS:
<urlopen error [Errno -5] No address associated with hostname>
Note that curling the test URL, or calling it from Python directly, works fine on the server, so we are able to access the internet outside of RMS, just not inside RMS. Everything else (like apt-get) also works fine; it is only internet access through RMS that fails. Is RapidMiner using non-default ways to communicate with the internet that might be blocked by our network?
Are there ways I can test the routing when using RMS to access the internet, to see where it gets blocked?
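The error quoted above, "[Errno -5] No address associated with hostname", is the kind of message raised when name resolution fails, which suggests the process tried to look up the target host directly instead of handing the request to a proxy. A minimal Python sketch of two checks that can be run from inside an RMS Python script to narrow this down is shown here: one tests DNS resolution of the target host, the other tests whether a TCP connection to the proxy can be opened. The function names check_dns and check_tcp, and the addresses in the commented usage lines, are hypothetical.

```python
import socket


def check_dns(hostname, resolver=socket.getaddrinfo):
    """Return (True, addresses) if hostname resolves, else (False, error text)."""
    try:
        infos = resolver(hostname, 80)
    except socket.gaierror as exc:
        # Name resolution failed, as in the urlopen error quoted above.
        return False, str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return True, addresses


def check_tcp(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (hypothetical target and proxy address):
# print(check_dns("example.com"))
# print(check_tcp("10.0.0.1", 3128))
```

If check_dns fails for the target while check_tcp succeeds for the proxy, the script can reach the proxy but is not using it, and the request needs an explicit proxy configuration.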
06-23-2017 01:16 PM - edited 06-23-2017 01:19 PM
Well, that is a different situation. When you are running urllib, the Python executor tries to connect directly to the given resource; no RapidMiner proxy is involved then (as it would be if, for example, you were running the Web Mining extension).
In this case, please add a proper proxy configuration to your Python script by creating a proxy-enabled URL opener.
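A minimal sketch of such a proxy-enabled opener, assuming Python 3 and its standard urllib.request module (under Python 2 the equivalent classes live in urllib2). The helper name build_proxy_opener and the proxy address 10.0.0.1:3128 are hypothetical; substitute your own proxy host and port.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical proxy address, used only for illustration.
opener = build_proxy_opener("http://10.0.0.1:3128")

# Fetch a page through the proxy (left commented out: it needs network access).
# response = opener.open("http://example.com/", timeout=10)
# html = response.read()
```

Calling urllib.request.install_opener(opener) would additionally make plain urllib.request.urlopen calls in the same script use the proxy.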
06-23-2017 04:22 PM
Thanks @homburg, adding the proxy to the Python code does indeed do the trick.
Worst case, I can now fetch URLs this way, but I'm still puzzled about how to get the web operators running, as these offer quite some value as well.
I think by now I tried every possible combination with no luck, so any tip or trick that I can use to do some troubleshooting is welcome.
06-24-2017 04:52 PM
Hi @homburg, not sure if it was you or someone else who marked this topic as solved, but I opened it again as it is not really solved.
There is a workaround, which is to use Python and no longer use the web operator set at all, but that does not fix the problem. I remain unable to use the toolset behind a proxy server, and it is still pretty important to understand why, so that we can use the web crawler logic as intended.
Some questions: the process you provided seems to work fine for the Studio setup, but is it also applicable to the actual server? Note that our physical server does not have Studio installed; Studio runs on other machines. Does that have an impact? The server is a bare setup running only the server software.
I believe the main components behind the web page extractors are based on crawler4j, and these have the default proxy settings set to null. Do the JBoss settings have an impact on these, or should I use something similar to what you created but addressing the web component code directly? And if so, how would that work?
Note to whoever is reading this: please do not close this before it is actually solved, thanks!
06-28-2017 12:07 PM - edited 06-28-2017 12:09 PM
Hi @kayman, usually setting these RapidMiner proxy parameters should be sufficient to control web mining extension operators and the proxy used. I have successfully applied proxy parameters in combination with a "Get Page" operator.
You mentioned that your settings work in Studio but not for RM Server. Does your server run inside the same network segment, using the same proxy / dial-out permissions?