No results from "Process Documents from Web"

tu_162092 Member Posts: 13 Contributor I
edited December 2018 in Help
Hello,

I have a problem with the "Process Documents from Web" operator. No matter how I configure it, no URLs are ever found, even though the process still worked a few months ago and the URL structure hasn't changed.

I tried it with different domains, but unfortunately RapidMiner never finds any URLs.

What could be the reason? It would be great if someone could help me :)!

Greetings
Tim

Answers

  • kayman Member Posts: 662 Unicorn
    Are you behind a firewall or proxy?
    As far as I know the operator still works as before (at least for me :-)), so unless your network has changed it should be fine.

    Can you still access the Marketplace? That is usually a good indication that you can at least reach the internet through RapidMiner. If not, check Preferences -> Proxy.

    Another possible scenario is that your site changed protocol and is no longer using http but https. So while the URL might still look the same at first glance, your request might get blocked.
  • tu_162092 Member Posts: 13 Contributor I

    Thanks for your answer.

    I am not using a proxy, and the connection to the Marketplace works. But no matter which URL I try to crawl, no URLs are ever found.

    If the URL structure hasn't changed since my old processes, they should still work, right?

    I really can't explain it.
  • kayman Member Posts: 662 Unicorn
    Could you share your process? There is indeed no reason why it shouldn't work, but without more details it's hard to know where to look.
  • tu_162092 Member Posts: 13 Contributor I
    I can't post pictures or links because I'm still new to the community. Can I send you the pictures by e-mail?
  • tu_162092 Member Posts: 13 Contributor I
    OK, here are the pictures....
  • kayman Member Posts: 662 Unicorn
    Could you try using .* as the pattern?
    Your current expression is /*, which in regex terms means 'zero or more slashes', so it only matches up to the slash itself, no matter how many slashes there are.

    Using /.* (dot star) you state 'give me anything after the slash, however many characters there are.'

    One thing I always recommend is to first fetch the main page, or one of the links, directly before trying the crawl logic. That way you are sure you can actually get the page one way or another.
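    To see the difference outside RapidMiner, here is a minimal Python sketch; the patterns and the URL are just examples built from the ones in this thread:

        import re

        url = "https://www.gelbeseiten.de/reisebueros/berlin"

        # "/*" only matches zero or more slash characters, so a pattern
        # ending in /* stops matching at the path segments of the URL.
        print(re.fullmatch(r"https://www\.gelbeseiten\.de/reisebueros/*", url))
        # -> None (the "/berlin" part is not covered)

        # "/.*" matches a slash followed by any characters, so the whole
        # remaining path is covered and the URL is accepted.
        print(re.fullmatch(r"https://www\.gelbeseiten\.de/reisebueros/.*", url))
        # -> <re.Match object ...>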
  • tu_162092 Member Posts: 13 Contributor I
    URL: https://www.gelbeseiten.de/reisebueros/berlin
    URLs I want to crawl: https://www.gelbeseiten.de/gsbiz/

    I can't post pictures because I'm still new to the community, so here is a link to a Google Drive folder with pictures of the process.

    https://drive.google.com/drive/folders/1PWt9zS2azBoR5DAhwI8Y17zetBTauUJ1?usp=sharing
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @tu_162092 I am sorry about the permissions, but we are getting an increasing number of spammers here and this is the only way to block them. If you have more issues, please send me a DM.

    Scott

  • kayman Member Posts: 662 Unicorn
    hi @tu_162092, did you try my suggestion (using .* instead of *)? Your screenshots suggest otherwise.
  • tu_162092 Member Posts: 13 Contributor I
    Okay, this is embarrassing :D. You were right, with .* it works. Many thanks for your help!!!!

    But now I have another question you might be able to help me with.

    How do I adjust the crawling rules so that the crawler follows the links on the entry page, then goes to the next page and does the same there again?

    Greetings
    Tim
  • kayman Member Posts: 662 Unicorn
    No problem, it happens to me on a regular basis too :-)

    As for the rules, if I recall correctly this is handled by setting the max crawl depth; try changing it to 3 or more.
    With a depth of 2 it takes the main page and the next level; with 3 it also takes the level after that, and so on.
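    As a rough sketch of that depth bookkeeping (this is not RapidMiner's internal code, just an illustration; get_links is a hypothetical helper that returns the links found on a page):

        # Conceptual sketch of a depth-limited crawl.
        def crawl(start_url, max_depth, get_links):
            seen = {start_url}
            frontier = [start_url]        # depth 1: the entry page itself
            for depth in range(2, max_depth + 1):
                next_frontier = []
                for url in frontier:
                    for link in get_links(url):
                        if link not in seen:
                            seen.add(link)
                            next_frontier.append(link)
                frontier = next_frontier  # depth 2, 3, ...: one more level per pass
            return seen

    With max_depth = 2 only the entry page's links are collected; each extra level of depth adds one more round of link following.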

  • tu_162092 Member Posts: 13 Contributor I
    @kayman Thanks again for your tip. I will test it :).

You can ignore the posts with the screenshots! They were blocked at first and have now all been unlocked.

    @sgenzer No problem!
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    As @kayman says, the crawl depth determines how many consecutive pages will be followed. But be careful with this, because it can greatly increase the number of results returned, and this operator can be quite slow. You can try making the crawling rule more page-specific, which sometimes helps. You should also determine whether you really need both rule types (follow and store; you have both in the screenshot above). Typically both are not needed: storing is useful if you want to keep all the raw HTML files, but if you are processing everything in RapidMiner and converting it into an ExampleSet, then you usually don't need both. (A toy sketch of the two rule types follows this post.)

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
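    To make the follow/store distinction concrete, here is a toy Python classification of the two rule types; the regexes are plausible examples built from the URLs posted earlier in the thread, not the operator's actual configuration:

        import re

        # Follow rules decide which links the crawler walks into;
        # store rules decide which pages end up in the result set.
        # Both patterns below are assumptions for illustration.
        FOLLOW_RULE = re.compile(r"https://www\.gelbeseiten\.de/reisebueros/.*")
        STORE_RULE = re.compile(r"https://www\.gelbeseiten\.de/gsbiz/.*")

        def classify(url):
            return {"follow": bool(FOLLOW_RULE.match(url)),
                    "store": bool(STORE_RULE.match(url))}

        print(classify("https://www.gelbeseiten.de/reisebueros/berlin"))
        # -> {'follow': True, 'store': False}: walk into listing pages
        print(classify("https://www.gelbeseiten.de/gsbiz/some-profile"))
        # -> {'follow': False, 'store': True}: keep profile pages as results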
  • tu_162092 Member Posts: 13 Contributor I
    @Telcontar120 Thank you for your help.

    I would like RapidMiner to open each profile (e.g. https://www.gelbeseiten.de/gsbiz/f0c65462-3748-48d8-85be-8635269ca1fd) in a Yellow Pages listing (e.g. https://www.gelbeseiten.de/reisebueros/frankfurt-am-main) and pull the information there. Once it has opened all the profiles on one page, it should go to the next page and open all the profiles there again, until I have the information from all profiles on all pages.

    With the above process I can either open all profiles on one page or open all listing pages; I can't do both.

    Unfortunately I am a real beginner. Can I solve this with this operator, or do I need other operators?
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Did you see the suggested solution that @kayman provided? It seems like it does what you are requesting (or could be adapted pretty easily). Using a Loop with the page number in the URL query is definitely a workable solution; I have used it several times in the past myself. (A sketch of that looping idea follows this post.)

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
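    One way to picture the loop idea outside RapidMiner is a small Python sketch using requests and BeautifulSoup; note that the ?page= query parameter is an assumption about how the site paginates, so check a real page-2 listing URL for the actual parameter name:

        import re
        import requests
        from bs4 import BeautifulSoup

        BASE = "https://www.gelbeseiten.de/reisebueros/berlin"
        # Profile pages look like https://www.gelbeseiten.de/gsbiz/...
        PROFILE = re.compile(r"https://www\.gelbeseiten\.de/gsbiz/.*")

        def profile_links(page_number):
            # "?page=" is an assumed pagination parameter.
            html = requests.get(f"{BASE}?page={page_number}").text
            soup = BeautifulSoup(html, "html.parser")
            return [a["href"] for a in soup.find_all("a", href=True)
                    if PROFILE.match(a["href"])]

        # Loop over listing pages, collecting profile URLs from each,
        # and stop as soon as a page yields no profiles.
        all_profiles = []
        for page in range(1, 100):
            links = profile_links(page)
            if not links:
                break
            all_profiles.extend(links)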
  • tu_162092 Member Posts: 13 Contributor I

    I'm really sorry I didn't get back to you until now. I have tested @kayman's process and it works. Thank you very much for your help! This is a very cool community.

    I already wish you a Merry Christmas :)!
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    very glad to have you here @tu_162092. Happy holidays to you as well!

    Scott