
Crawl Web - empty results (PHP script)

mspiess Member Posts: 4 Contributor I
edited November 2018 in Help

Hello there!

 

I'm a social scientist learning to use RapidMiner for data/text mining and text analysis. 

 

I've been trying to apply "Crawl Web" to the following address http://www.scielo.br/scielo.php?script=sci_issuetoc&pid=0102-690920180001&lng=pt&nrm=iso with no crawling rules and a depth of 1, but I keep getting empty results.

 

I wonder if this is caused by the target page's PHP script. If so, does anyone know a workaround for this issue?

 

Also, are there any hints on setting the crawling rules so that I get only the links with a specific link text? For example, in the URL above, I'm mostly interested in the pages linked with the text "Texto em Português".
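
To make the link-text requirement concrete, here is a minimal sketch in plain Python (outside RapidMiner) of what "keep only links whose link text is Texto em Português" means. The class name `LinkTextFilter` and the sample markup are assumptions made up for illustration; they are not taken from the real page and do not describe how Crawl Web evaluates its rules.

```python
from html.parser import HTMLParser


class LinkTextFilter(HTMLParser):
    """Collect href values of <a> tags whose visible text equals a target string."""

    def __init__(self, wanted_text):
        super().__init__()
        self.wanted_text = wanted_text
        self.matches = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace before comparing the link text.
            link_text = " ".join("".join(self._text).split())
            if link_text == self.wanted_text:
                self.matches.append(self._href)
            self._href = None


# Hypothetical sample markup, not copied from the real page.
sample_html = """
<ul>
  <li><a href="/article1?lng=pt">Texto em Português</a></li>
  <li><a href="/article1?lng=en">Text in English</a></li>
  <li><a href="/article2?lng=pt"> Texto em
      Português </a></li>
</ul>
"""

parser = LinkTextFilter("Texto em Português")
parser.feed(sample_html)
matching_links = parser.matches
print(matching_links)
```

The comparison is done on the anchor's visible text rather than on the URL, which is the distinction the question is about: two links can have similar URLs and still differ only in their link text.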

 

Greetings from Brazil,

Maiko Spiess

Best Answer

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Solution Accepted

    So it seems that there is a bot block on that site. If you uncheck "ignore robot exclusion", you get results (I crawled only two pages, just to test). Ethically, I cannot tell you to do this unless you own the site OR have explicit permission from the owner to crawl his/her site.
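
A rough way to see what a robot exclusion rule does, sketched with Python's standard `urllib.robotparser`. The `robots.txt` content and the `MyCrawler` user agent below are hypothetical examples, not the real rules published by the site; the sketch only shows how a crawler that honors such rules would decide to skip a URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real file served by the site.
robots_txt = """
User-agent: *
Disallow: /scielo.php
"""

rp = RobotFileParser()
# Parse the rules from memory so the sketch needs no network access.
rp.parse(robots_txt.splitlines())

url = (
    "http://www.scielo.br/scielo.php"
    "?script=sci_issuetoc&pid=0102-690920180001&lng=pt&nrm=iso"
)

# "MyCrawler" is an assumed user-agent name used only for this example.
allowed = rp.can_fetch("MyCrawler", url)
allowed_home = rp.can_fetch("MyCrawler", "http://www.scielo.br/")

print("issue page allowed:", allowed)
print("home page allowed:", allowed_home)
```

Under these assumed rules, a polite crawler returns nothing for the issue page, which would look like an empty result even though the page itself loads fine in a browser.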


    Scott

     

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @mspiess - welcome to the community. Have you tried looking at other threads in the community? A quick search turned up a thread that may be useful: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Crawl-Web-with-follow-link-with-matching-url-returning-empty/m-p/38561#M26480

     

    Scott

     

  • mspiess Member Posts: 4 Contributor I


    Hi @sgenzer! Thanks for replying.

     

    I had checked the thread you mentioned before posting my own, but I kept getting empty results. I figured that if I ran the operator without any rules, it should return all the pages within the specified depth. I then tried this with a different URL and it worked okay. However, for this particular page I am still getting empty results.

     

    So, crawl rules aside, I'm still wondering if this is something related to the page's PHP script. Any thoughts?

     

    Greetings,

    Maiko

  • mspiess Member Posts: 4 Contributor I

    Okay! Got it!

     

    Thank you for your attention.
