server proxy issues

kayman Member Posts: 662 Unicorn
edited December 2018 in Help

I'm trying to get our RapidMiner Server (Ubuntu 16) to access the internet from behind a proxy, but it keeps failing. I entered the right info (as far as I know), but without success.

 

I've added the following to bin/standalone.conf:

 

JAVA_OPTS="$JAVA_OPTS -Dhttp.proxySet=true -Dhttp.proxyHost=http://[our_proxy_ip]:[our_port]"
JAVA_OPTS="$JAVA_OPTS -Dhttps.proxySet=true -Dhttps.proxyHost=http://[our_proxy_ip]:[our_port]"

 

or this

 

JAVA_OPTS="$JAVA_OPTS -Dhttp.proxySet=true -Dhttp.proxyHost=http://[our_proxy_ip] -Dhttp.proxyPort=[our_port]"
JAVA_OPTS="$JAVA_OPTS -Dhttps.proxySet=true -Dhttps.proxyHost=http://[our_proxy_ip] -Dhttps.proxyPort=[our_port]"

 

but neither makes any difference; the server still cannot access the internet (for web mining).
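
One thing that may be worth checking: as far as the standard Java networking properties go, `http.proxyHost` and `https.proxyHost` expect a bare host name or IP without an `http://` scheme, and the port goes in the separate `proxyPort` property. The sketch below only illustrates that split. The proxy address is a placeholder and `java_proxy_opts` is a made-up helper name, not part of RapidMiner.

```python
from urllib.parse import urlsplit


def java_proxy_opts(proxy_url):
    """Build JVM proxy flags from a proxy URL (hypothetical helper)."""
    parts = urlsplit(proxy_url)
    host, port = parts.hostname, parts.port
    if host is None or port is None:
        raise ValueError("expected a URL like http://host:port")
    flags = []
    for scheme in ("http", "https"):
        # Host and port are passed as separate properties, without a scheme.
        flags.append(f"-D{scheme}.proxyHost={host}")
        flags.append(f"-D{scheme}.proxyPort={port}")
    return " ".join(flags)


# Placeholder address; substitute the real proxy.
print(java_proxy_opts("http://10.0.0.1:3128"))
```

The printed string could be appended to `JAVA_OPTS` in bin/standalone.conf instead of the flags shown above. Whether that is enough for the server is not confirmed in this thread.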

 

Any other ideas on how I can troubleshoot this? I can't find anything beyond the above, so I'm stuck now.

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Proxy issues are not my strong suit; maybe @homburg can help?

  • homburg Moderator, Employee, Member Posts: 114 RM Data Scientist

    Indeed, depending on the Java version used, the '-Dhttp.proxy' parameters are not guaranteed to work. A good way to set server proxy details is to do it with a process. I have attached a proxy-enforcing process which will directly set the RapidMiner Server proxy settings. Please just run the included operator after the server has started or prior to your web crawling process. With this operator you can even set proxy parameters dynamically if required, e.g. when there are multiple proxies or you need to turn the proxy on and off.

  • kayman Member Posts: 662 Unicorn

    Cool. I can't test now, but it seems pretty promising for sure. I'll update once I've been able to test it.

  • kayman Member Posts: 662 Unicorn

    Doesn't seem to make a difference tbh. Is there a specific port used by the server to communicate with websites?

  • kayman Member Posts: 662 Unicorn

    @homburg, would you perhaps have any other ideas? This is quite a serious problem for us since it means we cannot access the internet. Given that we need RMS to do web mining, it is crucial to go online.

     

    If it helps, we get the following error message when trying to use a Python urllib command with RMS:

    <urlopen error [Errno -5] No address associated with hostname>

     

    Note that curling the test URL, or calling it in Python directly, works fine on the server, so we are able to access the internet outside of RMS, just not inside RMS. Everything else (apt-get etc.) also works fine; it is only using RMS to access the internet that fails. Is RapidMiner using non-default ways to communicate with the internet that might be blocked by our network?

     

    Are there ways I can test the routing when using RMS to access the internet, to see where it gets blocked?
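
    As a rough sketch of one such test: the urlopen error above is a name-resolution failure, so a first check is whether the process can resolve host names at all. The snippet could be run from a Python operator inside RMS and again from a plain shell on the same machine, to compare the two. The function name `can_resolve` and the targets are only examples.

```python
import socket


def can_resolve(hostname, port=80):
    """Return True if the current process can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror as err:
        # Same error family as the urlopen failure quoted above.
        print(f"{hostname}: resolution failed ({err})")
        return False


# A literal IP needs no DNS lookup, so this checks the call itself.
print(can_resolve("127.0.0.1"))
# Then try the real target, e.g. can_resolve("www.example.com").
```

    If the literal IP works but a host name fails only inside RMS, that would point at DNS settings in the RMS process environment rather than at the proxy itself.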

     

  • homburg Moderator, Employee, Member Posts: 114 RM Data Scientist

    Well, that is a different situation. When you are running urllib, the Python executor tries to connect directly to the given resource; no RapidMiner proxy is involved then (as it would be if, e.g., you were running the web mining extension).

     

    In this case please enrich your Python script with a proper proxy config by creating a proxy-enabled URL opener.
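
    A minimal sketch of such an opener, assuming Python 3's `urllib.request` (on Python 2 the same classes live in `urllib2`). The proxy address is a placeholder and `build_proxy_opener` is just an illustrative name.

```python
import urllib.request

# Placeholder proxy address; replace with the real host and port.
PROXY_URL = "http://10.0.0.1:3128"


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_URL)
# opener.open("http://www.example.com") would now go through the proxy;
# urllib.request.install_opener(opener) makes urlopen() use it as well.
```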

  • kayman Member Posts: 662 Unicorn

    Thanks @homburg, adding the proxy to the Python code does the trick indeed.

    Worst case, I can now fetch URLs this way, but I'm still puzzled about how to get the web operators running, as these also offer quite some value.

     

    I think by now I've tried every possible combination with no luck, so any tip or trick I can use for troubleshooting is welcome.

     

     

     

  • kayman Member Posts: 662 Unicorn

    Hi @homburg, not sure if it was you or someone else who marked this topic as solved, but I opened it again as it is not really solved.

     

    There is a workaround, which is to use Python and drop the whole web operator set, but that does not fix the problem. I remain unable to use the toolset behind a proxy server, and it is still pretty important to understand why, so that we can use the web crawler logic as intended.

     

    Some questions: the script you provided seems to work fine for the Studio setup, but is it also applicable to the actual server? Note that our physical server does not have a Studio version installed; those are on other machines. Does that have an impact? The server is a bare setup running only the server software.

     

    I believe the main components behind the web page extractors are based on crawler4j, and these have the default proxy settings set to null. Do the JBoss settings have an impact on these, or should I use something similar to what you created, but addressing the web component code directly? And if so, how would that work?

     

     

    Note to whoever is reading this: do not close this before it is actually solved, thanks!

     

  • homburg Moderator, Employee, Member Posts: 114 RM Data Scientist

    Hi @kayman, usually setting these RapidMiner proxy parameters should be sufficient to control web mining extension operators and the proxy used. I have successfully applied proxy parameters in combination with a "Get Page" operator. 

    You mentioned that your settings work in Studio but not for RM Server. Does your server run inside the same network segment, using the same proxy and dial-out permissions?

     

    Cheers,

    Helge

  • kayman Member Posts: 662 Unicorn

    hi @homburg,

     

    Both are indeed running behind the same proxy, but with some small differences in setup. Whereas my local desktop requires me to authenticate, the server is able to communicate with the outside without authentication (or at least not at server level). How all these things work is a bit beyond my know-how, tbh.

     

    Funny (yet annoying) thing is that I need to turn the proxy off to be able to communicate with the server, and turn it on to be able to crawl, so I lose the connection. The script does however modify the settings on my Studio panel, so it works for Studio; I'm just not sure if the same goes for the server instance.

     

    What I did notice earlier today was that, for instance, the Dropbox connection also fails on the server instance. It gave me 'error: java.net.UnknownHostException: api.dropbox.com', which pointed me to some posts on other forums mentioning that JBoss actually expects to find the hostname rather than the IP to work with. Since our server is not named (yet) but only uses an IP, I'm wondering if the root cause might be there.

     

    Our network team is spending some time on it as we speak, but anything that can point them in the right direction is welcome.

  • homburg Moderator, Employee, Member Posts: 114 RM Data Scientist

    Configuring a proxy for RapidMiner Studio causes the application to route all traffic via this proxy. For server connections, which usually reside inside the internal network, it is therefore required to set up proxy exceptions. You may add the IPs of local entities to the "No proxy for" list, separated by a "|" character. This way you do not need to turn off the proxy configuration for server connections. Please note that this field (as well as all others within this tab) is only active when proxy configuration is set to manual mode.
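
    To illustrate the format of such a list, here is a small sketch with made-up addresses. The helper name `build_no_proxy_list` is hypothetical; Java's own `http.nonProxyHosts` property uses the same "|" separator.

```python
def build_no_proxy_list(hosts):
    """Join host entries with '|' for a "No proxy for" style field."""
    cleaned = [h.strip() for h in hosts if h.strip()]
    return "|".join(cleaned)


# Hypothetical internal addresses.
print(build_no_proxy_list(["localhost", "127.0.0.1", "10.0.0.5"]))
```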

     

    With regard to the server proxy issue, I assume the global server proxy is not yet correctly configured. Probably this is also the case for the operating system itself. Is it, for example, possible to dial out using a simple curl command?

     

     

  • kayman Member Posts: 662 Unicorn

    Hi @homburg,

     

    Yes, I can curl out without a problem; web browsers running on the system also work as expected, and I can get internet connections using Python. Actually, this is the temporary solution that lets me use the app in the meantime: instead of using the web operators I'm using a Python operator. Bit clumsy, but it does the trick.

    All of my paths lead to Java and probably one of its hidden/obscure settings, but no success so far.
