Crawl Web - empty results (PHP script)

Hello there!
I'm a social scientist learning to use RapidMiner for data/text mining and text analysis.
I've been trying to apply the "Crawl Web" operator to the following address: http://www.scielo.br/scielo.php?script=sci_issuetoc&pid=0102-690920180001&lng=pt&nrm=iso with no crawling rules applied and a depth of 1, but I keep getting empty results.
I wonder if this is caused by the target page's PHP script. If so, does anyone know a workaround for this issue?
Also, are there any hints on setting the crawling rules so that I get only the links with a specific link text? For example, in the URL above, I'm mostly interested in the pages linked with the text "Texto em Português".
Greetings from Brazil,
Maiko Spiess
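To make the "links with a specific link text" idea concrete, here is a minimal sketch in plain Python, outside RapidMiner. It only illustrates the filtering step and is not how the Crawl Web operator works internally. The sample HTML, the `pid=X1` values, the `example.org` base URL, and the names `LinkTextParser` and `links_with_text` are all made up for illustration; they are not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTextParser(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so the comparison is robust.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def links_with_text(html, wanted, base_url=""):
    """Return absolute URLs of links whose text contains `wanted`."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    wanted = wanted.casefold()
    return [urljoin(base_url, href)
            for href, text in parser.links
            if wanted in text.casefold()]


# Hypothetical markup, loosely shaped like a table-of-contents page.
SAMPLE_HTML = """
<a href="/scielo.php?script=sci_abstract&amp;pid=X1">Resumo em Português</a>
<a href="/scielo.php?script=sci_arttext&amp;pid=X1">Texto em Português</a>
<a href="/scielo.php?script=sci_pdf&amp;pid=X1">Português (pdf)</a>
"""

matches = links_with_text(SAMPLE_HTML, "Texto em Português", "http://example.org/")
print(matches)
```

With the sample markup above, only the second link is kept, because its text is the only one containing "Texto em Português". On a real page the HTML would have to be downloaded first, and only if the site's owner permits automated access.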
Answers
Hello @mspiess, welcome to the community. Have you tried looking at other threads in the community? A quick search turned up a thread that may be useful: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Crawl-Web-with-follow-link-with-matching-url-returning-empty/m-p/38561#M26480
Scott
Hi @sgenzer! Thanks for replying.
I had checked the thread you mentioned before posting my own, but I kept getting empty results. I figured that if I ran the operator without any rules, it should return all the pages within the specified depth. I then tried this with a different URL and it worked fine. However, for this particular page I am still getting empty results.
So, crawl rules aside, I'm still wondering if this is something related to the page's PHP script. Any thoughts?
Greetings,
Maiko
So it seems that there is a bot block on that site. If you uncheck "ignore robot exclusion", you get results (I crawled only two pages just to test). So ethically I cannot tell you to do this unless you own the site OR have explicit permission from the owner to crawl his/her site.
Scott
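For readers who want to see what a robot exclusion check looks like in practice, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content below is invented for illustration and is not a copy of the real site's file; the `example.org` URLs, the `my-crawler` user agent, and the helper name `is_allowed` are likewise assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every crawler from one script path.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /scielo.php
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `robots_txt` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler",
                     "http://example.org/scielo.php?script=sci_issuetoc")
allowed = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler",
                     "http://example.org/index.html")
print(blocked, allowed)
```

Under these sample rules the first URL is refused and the second is permitted. A crawler that obeys such rules would skip the refused page, which would look like an empty result. To check a live site instead of a sample string, `RobotFileParser` also offers `set_url()` followed by `read()`.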
Okay! Got it!
Thank you for your attention.