
Retrieving HTML Pages Requiring Login

byarno Member Posts: 1 Contributor I
edited November 2018 in Help

Hello,

I am trying to download a series of HTML pages that require a user login, using the Get Pages operator. I tried the solution proposed in the post below, but the data I need did not come through (all non-password-protected portions of the page did). I also tried setting the user agent, and I am currently logged in to the site in my browser. I am trying to download pages from insider.espn.com, so maybe they have more stringent security settings, or maybe the problem is that only certain portions of the page are password protected. Any ideas on how to get the password-protected information are appreciated.

Here is an example of the link I am working with.  The information I'm interested in is the "status report" section:

http://insider.espn.com/nfl/draft/player/_/id/8743/draftyear/2005

Here is the solution I tried:

http://community.rapidminer.com/t5/RapidMiner-Studio/Need-to-crawl-webpages-requiring-login-details/m-p/20711/highlight/true#M15386

--Brian

Answers

  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,996 RM Engineering

    Hi,

    Due to Cross-Site Request Forgery (CSRF) prevention and other security measures, crawling password-protected pages of modern websites can be quite difficult. Unfortunately, I'm afraid you won't have much luck in those cases. That's also the reason why our new Web Crawler operator ("Crawl Web") only supports basic auth (the one with the ugly dialog popup asking for credentials).

    Regards,

    Marco
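
For readers wondering what "basic auth" amounts to on the wire, here is a minimal Python sketch, outside RapidMiner, of how an HTTP Basic `Authorization` header is built. The function names and the example URL and credentials are hypothetical and only illustrate the scheme. This is not how the Crawl Web operator is implemented, and it will not help with form-based logins protected by CSRF tokens, such as the site in the question appears to use.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic Authorization header.

    The scheme is simply "Basic " followed by base64("username:password").
    Note that base64 is an encoding, not encryption, so this should only
    ever be sent over HTTPS.
    """
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


def build_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying Basic credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


if __name__ == "__main__":
    # Hypothetical URL and credentials, for illustration only.
    req = build_request("https://example.com/protected", "alice", "secret")
    print(req.get_header("Authorization"))
    # Sending it would be: urllib.request.urlopen(req).read()
```

Because the credentials travel in a single request header, there is no login form, session cookie, or hidden token to reproduce, which is why a crawler can support this scheme much more easily than a form-based login.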
