"Get page error in Web Mining Package"

user194372 Member Posts: 14 Contributor II
edited June 2019 in Help

Hi All,

 

I am working at a client's company as a project member. The client wants us to collect information from a web site, so I am trying to use the "Get Page" operator. Because the address uses HTTPS, we have trouble accessing the page we need.

 

Although the https URL opens in a web browser such as IE or Chrome, we cannot access it with the "Get Page" operator, which fails with the message below:

error.PNG

I'm sure it is a network security issue at this company. (I can access the same page with the "Get Page" operator from other places.)

What I want to know is what I should ask the company's network manager to do in order to remove this block.

 

I look forward to your advice.

 

Thanks

 

 


Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    If I try to connect from the command line, I get this:

    Resolving www.naver.com... 104.121.126.27
    Connecting to www.naver.com|104.121.126.27|:443... connected.
    ERROR: cannot verify www.naver.com's certificate, issued by `/C=US/O=GeoTrust Inc./CN=GeoTrust SSL CA - G3':
    Unable to locally verify the issuer's authority.

     

    which might explain the error.
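
A minimal Python sketch of the same kind of check, for readers who would rather script it than read command-line output. It assumes Python 3.7 or later; the function name `probe_tls` is made up for this illustration and is not part of RapidMiner or of the tool used above. It attempts a verified TLS handshake and reports whether the failure is a certificate problem or a plain connection problem.

```python
import socket
import ssl


def probe_tls(host, port=443, timeout=5.0):
    """Try a verified TLS handshake and return (ok, message)."""
    # The default context verifies the certificate chain against the
    # system CA store and checks that the certificate matches the host.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, "handshake OK, protocol " + str(tls.version())
    except ssl.SSLCertVerificationError as exc:
        # The chain could not be verified, the same class of failure
        # as the "cannot verify ... certificate" message above.
        return False, "certificate verification failed: " + exc.verify_message
    except OSError as exc:
        # DNS failure, refused connection, timeout, or another TLS error.
        return False, "connection failed: " + str(exc)
```

Calling `probe_tls("www.naver.com")` from inside and outside the company network and comparing the two messages would show whether only the internal network fails verification.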

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hmm very strange.  I have no problem here:

     

    Screen Shot 2017-09-04 at 3.21.06 PM.png
    Screen Shot 2017-09-04 at 3.21.19 PM.png

     

    I agree with Martin - try the cmd line.  If you're in Unix, I would do "curl -v https://www.naver.com" to see the handshaking.  You should see a 200 OK response and so forth:

     

    $ curl -v https://www.naver.com
    * Rebuilt URL to: https://www.naver.com/
    * Trying 23.66.210.98...
    * TCP_NODELAY set
    * Connected to www.naver.com (23.66.210.98) port 443 (#0)
    * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    * Server certificate: ssl.pstatic.net
    * Server certificate: GeoTrust SSL CA - G3
    * Server certificate: GeoTrust Global CA
    > GET / HTTP/1.1
    > Host: www.naver.com
    > User-Agent: curl/7.54.0
    > Accept: */*
    >
    < HTTP/1.1 200 OK
    < Server: NWS
    < Content-Type: text/html; charset=UTF-8
    < Cache-Control: no-cache, no-store, must-revalidate
    < Pragma: no-cache
    < P3P: CP="CAO DSP CURa ADMa TAIa PSAa OUR LAW STP PHY ONL UNI PUR FIN COM NAV INT DEM STA PRE"
    < X-Frame-Options: SAMEORIGIN
    < X-EdgeConnect-MidMile-RTT: 30
    < X-EdgeConnect-Origin-MEX-Latency: 8
    < X-EdgeConnect-MidMile-RTT: 203
    < X-EdgeConnect-Origin-MEX-Latency: 8
    < X-EdgeConnect-Cache-Status: 0
    < Date: Mon, 04 Sep 2017 19:25:10 GMT
    < Transfer-Encoding: chunked
    < Connection: keep-alive
    < Connection: Transfer-Encoding
    <
    <!doctype html>
    <html lang="ko" class="svgless">
    <head>
    <meta charset="utf-8">
    <meta name="Referrer" content="origin">
    <meta http-equiv="Content-Script-Type" content="text/javascript">
    <meta http-equiv="Content-Style-Type" content="text/css">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=1100">
    <meta name="apple-mobile-web-app-title" content="NAVER" />
    <meta property="og:title" content="네이버">

    If that works, you can use the Execute Program operator in RapidMiner and just insert the same curl statement instead of using the Get Page operator.  Pretty much the same thing.  :)
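
As a rough illustration of what "run the same curl statement from a program" amounts to, here is a small Python sketch. It assumes `curl` is installed and on the PATH; the helper names `build_curl_command` and `fetch_with_curl` are invented for this example, and it says nothing about how the Execute Program operator itself is configured.

```python
import subprocess


def build_curl_command(url, verbose=True):
    """Return the argument list for the curl call (nothing is executed)."""
    cmd = ["curl"]
    if verbose:
        cmd.append("-v")  # print the TLS handshake and headers to stderr
    cmd.append(url)
    return cmd


def fetch_with_curl(url, timeout=30):
    """Run curl and return (exit_code, page_body, diagnostics)."""
    result = subprocess.run(
        build_curl_command(url),
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # keep going if the page has undecodable bytes
        timeout=timeout,
    )
    # Exit code 0 means curl completed the transfer; the page is on
    # stdout and the verbose handshake log is on stderr.
    return result.returncode, result.stdout, result.stderr
```

The exit code is the first thing to look at: a non-zero value means curl never delivered the page, and the diagnostics text then carries the reason.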

     

    Scott

  • user194372 Member Posts: 14 Contributor II

    Thank you for the good answers.

    However, I don't have enough background to follow your replies.

    Could you please explain which command-line script on Windows would work similarly to the Get Page operator?

    If it is an authorization problem, what should I ask the company's network manager?

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @user194372 - well as @mschmitz said, it is likely an SSL certificate that is expired or something like that.  In very general terms, your computer is trying to protect you by not allowing you to connect to a website that is trying to offer an SSL connection but does not have a properly registered SSL certificate.  There is a LOT of information on SSL certificates, and error messages to this effect, on the internet.

     

    As for cURL statements on Windows, it is my understanding that it is not native but people often use this.  But I will defer to other Windows users here on the forum...

     

    Scott

  • user194372 Member Posts: 14 Contributor II

    Dear Team,

     

    I tried to access the URL with curl and got the message below.

     

    * Rebuilt URL to: https://www.naver.com/
    *   Trying 202.179.177.21...
    * TCP_NODELAY set
    * Connected to www.naver.com (202.179.177.21) port 443 (#0)
    * ALPN, offering h2
    * ALPN, offering http/1.1
    * Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
    * successfully set certificate verify locations:
    *   CAfile: D:\Profiles\apjt0187\Downloads\curl-7.55.1-win32-mingw\bin\curl-ca-bundle.crt
      CApath: none
    * TLSv1.2 (OUT), TLS handshake, Client hello (1):
    * TLSv1.2 (IN), TLS handshake, Server hello (2):
    * TLSv1.2 (IN), TLS handshake, Certificate (11):
    * TLSv1.2 (OUT), TLS alert, Server hello (2):
    * SSL certificate problem: self signed certificate in certificate chain
    * stopped the pause stream!
    * Closing connection 0
    curl: (60) SSL certificate problem: self signed certificate in certificate chain
    More details here: https://curl.haxx.se/docs/sslcerts.html

    curl performs SSL certificate verification by default, using a "bundle"
    of Certificate Authority (CA) public keys (CA certs). If the default
    bundle file isn't adequate, you can specify an alternate file
    using the --cacert option.
    If this HTTPS server uses a certificate signed by a CA represented in
    the bundle, the certificate verification probably failed due to a
    problem with the certificate (it might be expired, or the name might
    not match the domain name in the URL).
    If you'd like to turn off curl's verification of the certificate, use
    the -k (or --insecure) option.
    HTTPS-proxy has similar options --proxy-cacert and --proxy-insecure.

     

    Can you tell what the problem is in this network environment?
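
When curl is run from a script, the key line in output like the above can be picked out automatically. The following is only a sketch: the helper names are hypothetical, and it relies on curl's convention of reporting failures as `curl: (<exit code>) <message>`, where exit code 60 means the peer certificate could not be verified against the known CA certificates.

```python
import re

# curl reports failures as "curl: (<exit code>) <message>".
CURL_ERROR = re.compile(r"^curl: \((\d+)\) (.+)$", re.MULTILINE)


def parse_curl_error(output):
    """Return (exit_code, message) from curl output, or None if no error line."""
    match = CURL_ERROR.search(output)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def is_untrusted_chain(output):
    """True when curl reported exit code 60 (certificate not verifiable)."""
    error = parse_curl_error(output)
    return error is not None and error[0] == 60
```

Applied to the output above, `parse_curl_error` would return the code 60 together with the "self signed certificate in certificate chain" message, which is the detail worth passing on to whoever manages the network.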

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    so yes that just confirms that it is an SSL certificate issue: "curl: (60) SSL certificate problem: self signed certificate in certificate chain"  The site is using a "self-signed" certificate so it is not externally verified, etc...  Time to go to GoDaddy and buy a real one.

     

    Scott
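
One hedged reading of the curl output: the original poster reports that the same page loads from other networks, so the self-signed certificate may come from something inside the company network (for example a TLS-inspecting proxy) rather than from the site itself. This is an assumption, not something the thread confirms. If it holds, the usual remedy is to trust the company's root CA in addition to the public ones, which is what curl's `--cacert` option does. Below is a minimal Python sketch of that idea; the function name and the idea of a PEM file supplied by the network team are hypothetical.

```python
import ssl


def context_with_extra_ca(extra_ca_file=None):
    """Build a verifying TLS context, optionally trusting one more CA file."""
    # Start from the system trust store; verification stays switched on.
    context = ssl.create_default_context()
    if extra_ca_file is not None:
        # Hypothetical path, e.g. a PEM file handed out by the network team.
        context.load_verify_locations(cafile=extra_ca_file)
    return context
```

The resulting context can be passed to `urllib.request.urlopen(url, context=...)`. Unlike `-k` / `--insecure`, this keeps certificate verification enabled. Whether and how the Get Page operator can be given an extra CA is not covered in this thread.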

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    ... or go to Let's Encrypt and get one for free ;-)
