"Crawl Web -- Enable Basic Auth"

carlcarl Member Posts: 30 Guru
edited June 2019 in Help

Hi - I'm brand new to RapidMiner as of this week (using Studio 7.3). I'm using Crawl Web to access the web page http://www.thetimes.co.uk/search?q= (with added search parameters), and I can successfully return a set of news articles. However, each search result returns only the first few paragraphs of each article because my login has not been recognized. I've entered the correct account credentials in "Enable Basic Auth". Any ideas please?

Best Answer

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    I remember watching this video https://www.youtube.com/watch?v=-Sr3i7klRHM a while back, and I believe they passed Twitter OAuth credentials in RapidMiner using Generate User Data and something else. This was right before we came out with the Twitter operators, but if you hack this you might be able to get into your login.

     

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

     

    Hi...yes, I have had the same issues with this operator. You're doing everything correctly, but I have found the "basic auth" feature rather hit-or-miss. It's basic <grin>. Note the help documentation says only to use this over https because it places the auth credentials in the header. But on a news site like the New York Times (where I have a subscription), that's not how the login works. I am not an expert in authentication so will defer to others on the differences here.

     

    That said, I have gotten this kind of thing to work in RapidMiner but it will not be one click like you are hoping...


    Scott
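
To illustrate the point above about credentials being placed in the header, here is a minimal Python sketch of what HTTP Basic authentication sends with each request. The function name and the "user"/"secret" credentials are made up for illustration; this is not RapidMiner's implementation, only the general header format.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used by HTTP Basic auth.

    The credentials are only base64-encoded, not encrypted, which is
    why this scheme should only be used over https.
    """
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical credentials, for illustration only.
header = basic_auth_header("user", "secret")

# Anyone who can read the header can reverse the encoding:
scheme, encoded = header["Authorization"].split(" ", 1)
decoded = base64.b64decode(encoded).decode("utf-8")
print(scheme, decoded)
```

A site that logs you in through a web form never looks at this header, so sending it makes no difference to what the crawler gets back.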

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Might I suggest checking out our Mozenda extension. Although you pay some $ to Mozenda, you can scrape things way more easily.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Thanks, Tom.  I forgot about Mozenda because it is never an option for me (it requires a Windows client) and it is very expensive.  But for those with Windows and a budget of $99+ per month, it is certainly a good option.


    Scott

  • carlcarl Member Posts: 30 Guru

    Thanks for the responses. I did take a brief look at Mozenda (looks interesting), but was hoping there might be an alternative approach for the same reasons as Scott, i.e. because I use a MacBook and because of the cost. I know it is an option if I install a virtual machine program like VMware Fusion, so I may yet have to reconsider. The Times login is https://login.thetimes.co.uk, so I had hoped that maybe I'd just misstepped in my set-up of Enable Basic Auth.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    That's a very helpful video.  Thanks, Tom.

  • carlcarl Member Posts: 30 Guru

    Thanks Thomas.  I haven't had chance to try the video idea as I'm wrestling with Process Documents from the Web at the moment.  But will take a look when I get chance.

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    That's a great Youtube video!  Looks like it's also using one of my example processes from back in the day too!  #FeelingProud

    http://community.rapidminer.com/t5/RapidMiner-Server/SOLVED-Open-File-with-basic-authentication-in-RapidAnalytics/m-p/24073

    You might need to change a bit of the XML on this link to convert it from 5.3 to 7.3 formatting. 

     

    I have a whole set of template processes somewhere around that set up OAuth integration for a couple of email marketing APIs (Silverpop & DotMailer) as well as Twitter authentication.

     

  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering

    Hi,

     

    Basic auth is the authentication where your browser shows the ugly input dialog box overlay.

    If you have a form login (a login embedded in the web page), that's not basic auth anymore. The problem is that those logins would theoretically be supported, but due to Cross-Site Request Forgery prevention, it almost never works :( Thus it was excluded from the operator.

     

    Regards,

    Marco
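
The distinction Marco describes can be sketched in Python as follows. This is an illustration only: the helper names, the sample login page and the field name csrf_token are all hypothetical, and the real Times login form may look quite different. A basic-auth server answers with status 401 and a WWW-Authenticate: Basic header, whereas a form login typically serves an ordinary HTML page whose form carries a hidden per-session token that has to be sent back along with the credentials.

```python
from html.parser import HTMLParser


def uses_basic_auth(status: int, headers: dict) -> bool:
    """Return True if a response is an HTTP Basic auth challenge."""
    challenge = ""
    for name, value in headers.items():
        if name.lower() == "www-authenticate":
            challenge = value
    return status == 401 and challenge.strip().lower().startswith("basic")


class HiddenFieldParser(HTMLParser):
    """Collect hidden <input> fields, where CSRF tokens usually live."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden[attributes["name"]] = attributes.get("value") or ""


# Hypothetical login page, made up for this example.
SAMPLE_LOGIN_PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

parser = HiddenFieldParser()
parser.feed(SAMPLE_LOGIN_PAGE)
hidden_fields = parser.hidden
print(hidden_fields)
```

A crawler that only sends an Authorization header never fetches or returns such a token, which is consistent with the form login being rejected even when the credentials are correct.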
