Crawl Web - empty results (PHP script)

mspiess Member Posts: 4 Contributor I
edited November 2018 in Help

Hello there!

 

I'm a social scientist learning to use RapidMiner for data/text mining and text analysis. 

 

I've been trying to apply "Crawl Web" to the following address, http://www.scielo.br/scielo.php?script=sci_issuetoc&pid=0102-690920180001&lng=pt&nrm=iso, with no crawling rules and a depth of 1, but I keep getting empty results.

 

I wonder if this is caused by the target page's PHP script. If so, does anyone know a workaround for this issue?

 

Also, does anyone have hints on setting the crawling rules so that I get only the links with a specific link text? For example, in the URL above, I'm mostly interested in the pages linked with the text "Texto em Português".
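
To make the link-text requirement concrete, here is a minimal sketch in Python rather than in RapidMiner itself. It collects every link in a page and keeps only those whose visible text contains a given phrase. The sample HTML, the `example.org` base URL, and the names `LinkTextCollector` and `links_with_text` are all hypothetical and only illustrate the idea; the real page's markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup standing in for an issue's table of contents.
SAMPLE_HTML = """
<ul>
  <li><a href="/article/1/pt">Texto em Português</a></li>
  <li><a href="/article/1/en">Text in English</a></li>
  <li><a href="/pdf/1.pdf">PDF</a></li>
</ul>
"""


class LinkTextCollector(HTMLParser):
    """Collect (absolute_url, link_text) pairs for each <a href=...> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so the text compares cleanly.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def links_with_text(html, base_url, wanted):
    """Return URLs whose link text contains `wanted` (case-insensitive)."""
    collector = LinkTextCollector(base_url)
    collector.feed(html)
    collector.close()
    return [url for url, text in collector.links
            if wanted.lower() in text.lower()]


if __name__ == "__main__":
    for url in links_with_text(SAMPLE_HTML, "http://example.org/issue",
                               "Texto em Português"):
        print(url)
```

A rule of this kind filters on what the reader sees in the link, not on the URL pattern, which is useful when the target URLs share no distinctive substring.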

 

Greetings from Brazil,

Maiko Spiess

Best Answer

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Solution Accepted

    So it seems that there is a bot block on that site. If you set the operator to ignore the site's robot exclusion rules (the "ignore robot exclusion" parameter), you get results (I crawled only two pages, just to test). Ethically, though, I cannot tell you to do this unless you own the site OR have explicit permission from the owner to crawl it.
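
To check for a bot block like this before crawling, one option is to test a URL against the site's robots.txt rules. The sketch below uses Python's standard `urllib.robotparser` and is independent of RapidMiner. The robots.txt content, the `example.org` URLs, the `my-crawler` user agent, and the helper name `is_allowed` are all hypothetical, chosen so the example runs offline; they are not the actual rules of the site discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
# A real site's rules live at <site>/robots.txt and may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /scielo.php
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/scielo.php?script=sci_issuetoc",
                "http://example.org/about.html"):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", url) else "blocked"
        print(url, "->", verdict)
```

A crawler that obeys robot exclusion skips every URL that a check like this reports as blocked, which would be consistent with a crawl that returns no pages.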


    Scott

     

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @mspiess - welcome to the community. Have you tried looking at other threads in the community? A quick search turned up a thread that may be useful: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Crawl-Web-with-follow-link-with-matching-url-returning-empty/m-p/38561#M26480

     

    Scott

     

  • mspiess Member Posts: 4 Contributor I


    Hi @sgenzer! Thanks for replying.

     

    I had checked the thread you mentioned before posting my own, but I kept getting empty results. I figured that if I ran the operator without any rules, it should return all the pages within the specified depth. I then tried this with a different URL and it worked fine. However, on this particular page I am still getting empty results.

     

    So, crawl rules aside, I'm still wondering whether this is related to the page's PHP script. Any thoughts?

     

    Greetings,

    Maiko

  • mspiess Member Posts: 4 Contributor I

    Okay! Got it!

     

    Thank you for your attention.
