RapidMiner

Get page error in Web Mining Package

RM Partner

Hi All,

I'm working at a client's company as a project member. Because the client wants us to collect information from a web site, I am trying to use the "Get Page" operator. Because the address is HTTPS, we cannot access the web page we need.

Even though the https URL can be opened in a web browser such as IE or Chrome, we cannot access it with the "Get Page" operator. It fails with the message below:

[Screenshot: error.PNG]

I'm sure it's a network security issue at this company. (I can access the same page with the "Get Page" operator from other places.)

What I want to know is what I should ask the company's network manager to do in order to remove this block.

I look forward to your advice.

Thanks

7 REPLIES
RM Staff

Re: Get page error in Web Mining Package

Hi,

If I try to connect from the command line, I get this:

Resolving www.naver.com... 104.121.126.27
Connecting to www.naver.com|104.121.126.27|:443... connected.
ERROR: cannot verify www.naver.com's certificate, issued by `/C=US/O=GeoTrust Inc./CN=GeoTrust SSL CA - G3':
Unable to locally verify the issuer's authority.

 

That might explain the error.
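The same check can be reproduced without wget. The following is a minimal Python sketch, assuming a standard Python 3.7+ installation; the function names are made up for illustration and are not part of RapidMiner. It performs a TLS handshake with the default trust store and returns the verification error text, if any.

```python
import socket
import ssl
from typing import Optional


def make_verifying_context() -> ssl.SSLContext:
    """Build a client context that verifies the certificate chain and host name."""
    return ssl.create_default_context()


def verification_error(host: str, port: int = 443, timeout: float = 10.0) -> Optional[str]:
    """Return None if the certificate verifies, otherwise the verification message.

    Network errors (DNS failure, refused connection, timeout) are not handled
    here and propagate to the caller.
    """
    context = make_verifying_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return None
    except ssl.SSLCertVerificationError as exc:
        return exc.verify_message


# Example call (needs network access, so it is left as a comment):
# print(verification_error("www.naver.com"))
```

Running the commented call from inside and outside the company network should show whether the failure depends on the network.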

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Community Manager

Re: Get page error in Web Mining Package

Hmm, very strange. I have no problem here:

[Screenshots: Screen Shot 2017-09-04 at 3.21.06 PM.png, Screen Shot 2017-09-04 at 3.21.19 PM.png]

I agree with Martin: try the command line. If you're on Unix, I would run "curl -v https://www.naver.com" to see the handshake. You should see a 200 OK response and so forth:

$ curl -v https://www.naver.com
* Rebuilt URL to: https://www.naver.com/
*   Trying 23.66.210.98...
* TCP_NODELAY set
* Connected to www.naver.com (23.66.210.98) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: ssl.pstatic.net
* Server certificate: GeoTrust SSL CA - G3
* Server certificate: GeoTrust Global CA
> GET / HTTP/1.1
> Host: www.naver.com
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Server: NWS
< Content-Type: text/html; charset=UTF-8
< Cache-Control: no-cache, no-store, must-revalidate
< Pragma: no-cache
< P3P: CP="CAO DSP CURa ADMa TAIa PSAa OUR LAW STP PHY ONL UNI PUR FIN COM NAV INT DEM STA PRE"
< X-Frame-Options: SAMEORIGIN
< X-EdgeConnect-MidMile-RTT: 30
< X-EdgeConnect-Origin-MEX-Latency: 8
< X-EdgeConnect-MidMile-RTT: 203
< X-EdgeConnect-Origin-MEX-Latency: 8
< X-EdgeConnect-Cache-Status: 0
< Date: Mon, 04 Sep 2017 19:25:10 GMT
< Transfer-Encoding:  chunked
< Connection: keep-alive
< Connection: Transfer-Encoding
< 
<!doctype html>
<html lang="ko" class="svgless">
<head>
<meta charset="utf-8">
<meta name="Referrer" content="origin">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=1100">
<meta name="apple-mobile-web-app-title" content="NAVER" />
<meta property="og:title" content="네이버">

If that works, you can use the Execute Program operator in RapidMiner and insert the same curl statement instead of using the Get Page operator. It does pretty much the same thing. :-)
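To illustrate what running curl as an external program amounts to, here is a rough Python sketch. It is a stand-in for the idea only, not how the Execute Program operator is implemented; the function names are hypothetical, and it assumes a curl executable is on the PATH.

```python
import subprocess
from typing import List


def build_curl_command(url: str, verbose: bool = False) -> List[str]:
    """Assemble the curl argument list; --verbose shows the TLS handshake."""
    command = ["curl", "--silent", "--show-error", "--location"]
    if verbose:
        command.append("--verbose")
    command.append(url)
    return command


def fetch_with_curl(url: str, timeout: float = 30.0) -> str:
    """Run curl and return the page body; raises CalledProcessError on failure."""
    result = subprocess.run(
        build_curl_command(url),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        timeout=timeout,
        check=True,  # a certificate failure makes curl exit with code 60
    )
    return result.stdout


# Example call (needs curl and network access, so it is left as a comment):
# html = fetch_with_curl("https://www.naver.com")
```

Note that this only moves the request to another program. If curl cannot verify the certificate on a given network, the call fails in the same way.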

 

Scott

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
RM Partner

Re: Get page error in Web Mining Package

Thank you for the good answers.

But I don't have enough knowledge to understand your replies.

Could you please explain a command-line script for Windows that would work similarly to the Get Page operator?

If it is an authority problem, what should I ask the network manager of this company?

Community Manager

Re: Get page error in Web Mining Package

Hello @user194372 - well, as @mschmitz said, it is likely an SSL certificate that is expired or something like that. In very general terms, your computer is trying to protect you by not allowing you to connect to a website that offers an SSL connection but does not have a properly registered SSL certificate. There is a LOT of information on SSL certificates, and on error messages to this effect, on the internet.
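As one concrete example of such a check, the expiry test can be sketched in a few lines of Python. This is an illustration only: the function name is made up, and the date string follows the "notAfter" format that Python's ssl module reports for a certificate.

```python
import ssl
import time
from typing import Optional


def is_expired(not_after: str, now: Optional[float] = None) -> bool:
    """Return True if a certificate's notAfter date lies in the past.

    not_after uses the form "Jan  5 09:34:43 2018 GMT"; now is a Unix
    timestamp and defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return current > expires_at
```

Expiry is only one of several checks. The issuer chain and the host name are verified as well, and any of them can produce a certificate error.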

 

As for cURL statements on Windows, it is my understanding that cURL is not native to Windows, but people often install and use it. I will defer to other Windows users here on the forum...
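If installing curl on Windows is not an option, a script using only the Python standard library behaves the same on Windows and Unix. The sketch below is an assumption about what a Get Page equivalent could look like, not the operator's actual code; the function names and the User-Agent string are hypothetical.

```python
import urllib.request


def build_request(url: str, user_agent: str = "get-page-sketch/0.1") -> urllib.request.Request:
    """Create a GET request with an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def get_page(url: str, timeout: float = 30.0) -> str:
    """Download a page and decode it using the charset the server declares.

    Certificate verification stays enabled, so an untrusted chain raises
    urllib.error.URLError instead of returning content.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs network access, so it is left as a comment):
# print(get_page("https://www.naver.com")[:200])
```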

 

Scott

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
RM Partner

Re: Get page error in Web Mining Package

Dear Team,

I tried to access the URL through curl and got the message below.

* Rebuilt URL to: https://www.naver.com/
*   Trying 202.179.177.21...
* TCP_NODELAY set
* Connected to www.naver.com (202.179.177.21) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: D:\Profiles\apjt0187\Downloads\curl-7.55.1-win32-mingw\bin\curl-ca-bundle.crt
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS alert, Server hello (2):
* SSL certificate problem: self signed certificate in certificate chain
* stopped the pause stream!
* Closing connection 0
curl: (60) SSL certificate problem: self signed certificate in certificate chain

More details here: https://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
the bundle, the certificate verification probably failed due to a
problem with the certificate (it might be expired, or the name might
not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
the -k (or --insecure) option.
HTTPS-proxy has similar options --proxy-cacert and --proxy-insecure.

 

Can you tell what the problem is in this network environment?

Community Manager

Re: Get page error in Web Mining Package

So yes, that just confirms that it is an SSL certificate issue: "curl: (60) SSL certificate problem: self signed certificate in certificate chain". The chain presented for the site contains a "self-signed" certificate, so it is not externally verified, etc. Time to go to GoDaddy and buy a real one.
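The curl message above points to a --cacert option for supplying an alternate CA file. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that the self-signed certificate belongs to a device on the company network rather than to the site, since the same page reportedly loads from other places. In that case the network manager could be asked for the company's root CA certificate. The Python sketch below shows the equivalent of --cacert; the function name and the file name company-root-ca.pem are hypothetical.

```python
import ssl
from typing import Optional


def context_with_extra_ca(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Return a verifying TLS context, optionally trusting one extra CA file.

    The system's default CA certificates stay loaded; ca_file (PEM format)
    is added on top of them. Verification is never switched off.
    """
    context = ssl.create_default_context()
    if ca_file is not None:
        context.load_verify_locations(cafile=ca_file)
    return context


# Example use (hypothetical file name, needs network access):
# import urllib.request
# ctx = context_with_extra_ca("company-root-ca.pem")
# html = urllib.request.urlopen("https://www.naver.com", context=ctx).read()
```

Adding a trusted CA is preferable to curl's -k option, which turns verification off entirely.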

 

Scott

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
RM Certified Expert

Re: Get page error in Web Mining Package

... or go to Let's Encrypt and get one for free ;-)

--
Balázs Bárány
Data Scientist, Vienna
https://datascientist.at