RapidMiner

Crawl Web with follow_link_with_matching_url returning empty

SOLVED
Contributor II


Hi, I discovered RapidMiner recently and I am impressed with its usability.

I am crawling the following website and downloading some information from it (it is in Portuguese):

 

http://www.portaldatransparencia.gov.br/servidores/Servidor-ListaServidores.asp?bogus=1&Pagina=1

 

I have a rule that downloads all pages matching this regular expression:

 

.+Servidor-DetalhaServidor.+|.+Servidor-DetalhaRemuneracao.+

 

And it is working great. But on this page there is a "next" button that shows more results. The "next" button sends me to the following URL:

 

http://www.portaldatransparencia.gov.br/servidores/Servidor-ListaServidores.asp?bogus=1&Pagina=2

 

So I inserted a "follow_link_with_matching_url" rule with the following regular expression:

 

.+Servidor-ListaServidores.+

 

But when I insert this rule I get empty results. Why is this happening?
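Just to rule out the expression itself, a quick check outside RapidMiner (a small Python sketch, not part of the process) confirms that the pagination URL does match the follow rule:

import re

# Crawling rules from the process
follow_rule = r".+Servidor-ListaServidores.+"
store_rule = r".+Servidor-DetalhaServidor.+|.+Servidor-DetalhaRemuneracao.+"

# The pagination URL the "next" button points to
next_page = ("http://www.portaldatransparencia.gov.br/servidores/"
             "Servidor-ListaServidores.asp?bogus=1&Pagina=2")

print(bool(re.match(follow_rule, next_page)))  # True -> the follow rule matches the URL
print(bool(re.match(store_rule, next_page)))   # False -> list pages are only followed, not stored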

 

Best Regards

Alan

9 REPLIES
Moderator

Re: Crawl Web with follow_link_with_matching_url returning empty

Hi, would you please share the process? That would make it easier to inspect. Thanks!

Contributor II

Re: Crawl Web with follow_link_with_matching_url returning empty

Hi, the process is attached! Thanks.


Contributor II

Re: Crawl Web with follow_link_with_matching_url returning empty

Hi Thomas,

 

When you say "process", do you mean this XML document:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="34">
        <parameter key="url" value="http://www.portaldatransparencia.gov.br/servidores/Servidor-ListaServidores.asp"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".+Servidor-DetalhaServidor.+|.+Servidor-DetalhaRemuneracao.+"/>
          <parameter key="follow_link_with_matching_url" value=".+Servidor-ListaServidores.+"/>
        </list>
        <parameter key="max_crawl_depth" value="2"/>
        <parameter key="retrieve_as_html" value="true"/>
        <parameter key="enable_basic_auth" value="false"/>
        <parameter key="add_content_as_attribute" value="false"/>
        <parameter key="write_pages_to_disk" value="false"/>
        <parameter key="include_binary_content" value="false"/>
        <parameter key="output_dir" value="D:\Users\alan\Documents\webcrowling"/>
        <parameter key="output_file_extension" value="html"/>
        <parameter key="max_pages" value="1000"/>
        <parameter key="max_page_size" value="1000"/>
        <parameter key="delay" value="200"/>
        <parameter key="max_concurrent_connections" value="100"/>
        <parameter key="max_connections_per_host" value="50"/>
        <parameter key="user_agent" value="rapidminer-web-mining-extension-crawlerMozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36"/>
        <parameter key="ignore_robot_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Best Regards,

Alan

 

Moderator

Re: Crawl Web with follow_link_with_matching_url returning empty

Hi, I was speaking with @Edin_Klapic regarding this. It might be related to some JavaScript on that page. I believe he might have a workaround.

RMStaff
Solution
Accepted by topic author alanbontempo
05-12-2017 02:51 PM

Re: Crawl Web with follow_link_with_matching_url returning empty

Hi Alan,

 

as @Thomas_Ott pointed out, I strongly suspect that the problem is that the pagination is driven by JavaScript. Unfortunately, the Web crawler is not capable of following such links.

If I understand you correctly, you want to extract the data from every page. The website seems quite simply structured, so you could, for example, access every page incrementally.

In this case you could use the Loop operator and set its number of iterations parameter to the maximum number of pages.

Then you could directly access each page using the URL

http://www.portaldatransparencia.gov.br/servidores/Servidor-ListaServidores.asp?bogus=1&Pagina=%{iteration}

Here the macro iteration reflects the page number.

Please turn off parallel execution for the Loop operator; otherwise your IP might get blacklisted.
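If you want to check the incremental-paging idea outside RapidMiner first, the same logic looks roughly like this in Python (the maximum of 100 pages is just a placeholder):

import time
import urllib.request

# Placeholder: replace with the real maximum number of pages
MAX_PAGES = 100

BASE = ("http://www.portaldatransparencia.gov.br/servidores/"
        "Servidor-ListaServidores.asp?bogus=1&Pagina={page}")

for page in range(1, MAX_PAGES + 1):
    # Fetch one list page per iteration, like the Loop operator with %{iteration}
    with urllib.request.urlopen(BASE.format(page=page)) as response:
        html = response.read().decode("latin-1", errors="replace")
    # ... extract or store the interesting content here ...
    time.sleep(0.2)  # sequential requests with a small delay, the manual equivalent of disabling parallel execution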

 

I hope this gets you nearer to the expected result,

Edin

 

Contributor II

Re: Crawl Web with follow_link_with_matching_url returning empty

Hi Edin,

 

Thank you very much. The Loop operator worked perfectly.

 

Best Regards

Alan

Contributor II

Re: Crawl Web with follow_link_with_matching_url returning empty

Hi @Thomas_Ott

 

I used the Loop operator but I am facing a very annoying problem. In each loop iteration the Crawl Web operator overwrites the files, since they are always written as, for example, 0.html to 14.html.

 

Is it possible to make the Crawl Web operator give the saved files different names? I am trying to rename the files after the loop, but I have not found a solution yet.

 

Best Regards,

Alan

 

 

RMStaff

Re: Crawl Web with follow_link_with_matching_url returning empty

Hi Alan,

 

Let me try a wild guess: you might need to build the file name you write by combining macros.

Or try using the macro %{a}. It reflects the number of executions of the corresponding operator.
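As a rough illustration of the renaming idea, here is a Python sketch that moves the crawler's fixed output names to per-iteration names (the page_ prefix and the hard-coded iteration number are only examples):

import shutil
from pathlib import Path

output_dir = Path(r"D:\Users\alan\Documents\webcrowling")  # output directory from your process
iteration = 3  # inside the loop this would come from the %{iteration} macro

# Rename the crawler's 0.html, 1.html, ... so the next iteration cannot overwrite them
for page_file in sorted(output_dir.glob("[0-9]*.html")):
    target = output_dir / f"page_{iteration}_{page_file.name}"
    shutil.move(str(page_file), str(target))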

In case this does not work, could you please share your process again?

 

Best,

Edin

Moderator

Re: Crawl Web with follow_link_with_matching_url returning empty

I think you can use a macro value to give them all a unique name. You might try appending %{iteration} or %{t} to the file extension name, e.g. txt_%{iteration} or txt_%{t}; the %{t} macro is the system time.