Crawl Web - empty results (PHP script)

mspiess
mspiess New Altair Community Member
edited November 2024 in Community Q&A

Hello there!

 

I'm a social scientist learning to use RapidMiner for data/text mining and text analysis. 

 

I've been trying to apply the "Crawl Web" operator to the following address: http://www.scielo.br/scielo.php?script=sci_issuetoc&pid=0102-690920180001&lng=pt&nrm=iso with no crawling rules and a depth of 1, but I keep getting empty results.

 

I wonder if this is caused by the target page's PHP script. If so, does anyone know a workaround for this issue?

 

Also, does anyone have hints on setting the crawling rules so that I get only the links with a specific link text? For example, in the URL above, I'm mostly interested in the pages linked with the text "Texto em Português".

 

Greetings from Brazil,

Maiko Spiess
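
For the link-text part of the question, here is a minimal sketch of the idea outside RapidMiner, using only Python's standard library. It is not the Crawl Web operator's own rule mechanism. The `SAMPLE_HTML` snippet, its URLs, and the names `LinkTextFilter` and `links_with_text` are hypothetical and only illustrate keeping links whose visible text matches "Texto em Português"; the real page's markup may differ.

```python
from html.parser import HTMLParser


class LinkTextFilter(HTMLParser):
    """Collect href values of <a> tags whose visible text contains a phrase."""

    def __init__(self, wanted_text):
        super().__init__()
        self.wanted_text = wanted_text.casefold()
        self.matches = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and repeated spaces before comparing.
            text = " ".join("".join(self._text).split())
            if self.wanted_text in text.casefold():
                self.matches.append(self._href)
            self._href = None


def links_with_text(html, wanted_text):
    """Return the hrefs of all links whose text contains wanted_text."""
    parser = LinkTextFilter(wanted_text)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical markup, not copied from the site in question.
SAMPLE_HTML = """
<ul>
  <li><a href="/article/1?lang=pt">Texto em Português</a></li>
  <li><a href="/article/1?lang=en">Text in English</a></li>
  <li><a href="/article/2?lang=pt">  Texto em
      Português </a></li>
</ul>
"""

print(links_with_text(SAMPLE_HTML, "Texto em Português"))
```

The comparison is case-insensitive and ignores extra whitespace, since link text in real pages is often wrapped across lines.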



Answers

  • sgenzer
    sgenzer
    Altair Employee

    hello @mspiess - welcome to the community. Have you tried looking at other threads in the community? A quick search revealed a thread that may be useful. https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Crawl-Web-with-follow-link-with-matching-url-returning-empty/m-p/38561#M26480

     

    Scott

     

  • mspiess
    mspiess New Altair Community Member


    Hi @sgenzer! Thanks for replying.

     

    I had checked the thread you mentioned before posting my own, but I kept getting empty results. I figured that if I ran the operator without any rules, it should return all the pages within the specified depth. I then tried this with a different URL and it worked fine. However, on this particular page I am still getting empty results.

     

    So, crawl rules aside, I'm still wondering if this is something related to the page's PHP script. Any thoughts?

     

    Greetings,

    Maiko

  • sgenzer
    sgenzer
    Altair Employee
    Answer ✓

    So it seems that there is a bot block on that site. If you uncheck "obey robot exclusion" (that is, tell the crawler to ignore the site's robot exclusion rules), you get results (I crawled only two pages just to test). Ethically, I cannot tell you to do this unless you own the site OR have explicit permission from the owner to crawl their site.


    Scott

     

  • mspiess
    mspiess New Altair Community Member

    Okay! Got it!

     

    Thank you for your attention.
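
As a footnote to the accepted answer, a crawler can check a site's robot exclusion rules before fetching anything. Below is a minimal sketch using Python's standard `urllib.robotparser`. The `SAMPLE_ROBOTS_TXT` rules, the `example.org` URLs, the `MyCrawler` user agent, and the helper name `is_allowed` are all assumptions for illustration; they are not the actual rules published by the site discussed in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks every crawler from one script.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /scielo.php
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = "http://example.org/scielo.php?script=sci_issuetoc"
open_page = "http://example.org/about.html"

print(is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler", blocked))
print(is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler", open_page))
```

To check a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real robots.txt instead of parsing an inline string. A crawler that obeys a rule like the one above would return nothing for the blocked path, which matches the empty results described in the question.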
