[SOLVED] RM5 does not store the pages according to the specified rules...

leoderja Member Posts: 4 Contributor I
edited November 2018 in Help
I am trying to crawl an online newspaper. I specified rules for navigating through the previous editions, and I need to store only the individual news pages (matching_url = .+deportes/8.+), not the index pages where they are listed...
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
   <process expanded="true" height="-20" width="-50">
     <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="84" y="53">
       <parameter key="url" value="http://www.pagina12.com.ar"/>
       <list key="crawling_rules">
         <parameter key="follow_link_with_matching_url" value=".+principal/index.+|.+deportes/index.+|.+deportes/8.+"/>
         <parameter key="store_with_matching_url" value=".+deportes/8.+"/>
       </list>
       <parameter key="output_dir" value="C:\Users\USR\Desktop\FILES"/>
       <parameter key="extension" value="html"/>
       <parameter key="max_depth" value="9999999"/>
       <parameter key="domain" value="server"/>
       <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20100101 Firefox/11.0"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
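To sanity-check the two rules, here is a minimal check (plain Python, not RapidMiner code; it assumes the operator applies the patterns as full matches, like Java's String.matches) against URLs taken from the log below:

import re

# Crawling rules from the process above.
FOLLOW = r".+principal/index.+|.+deportes/index.+|.+deportes/8.+"
STORE = r".+deportes/8.+"

urls = [
    "http://www.pagina12.com.ar/diario/principal/index.html",
    "http://www.pagina12.com.ar/diario/deportes/index-2012-03-31.html",
    "http://www.pagina12.com.ar/diario/deportes/8-190886-2012-04-01.html",
]

for url in urls:
    print(url,
          "FOLLOW" if re.fullmatch(FOLLOW, url) else "-",
          "STORE" if re.fullmatch(STORE, url) else "-")

The article URLs (deportes/8-...) do match the store rule, so the patterns themselves look correct.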
But this does not work... Please see the log below: nothing is stored...
Apr 1, 2012 11:36:36 PM INFO: Process //NewLocalRepository/Pruebas/Crawler starts
Apr 1, 2012 11:36:36 PM INFO: Loading initial data.
Apr 1, 2012 11:36:37 PM INFO: Discarded page "http://www.pagina12.com.ar" because url does not match filter rules.
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-31.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190886-2012-04-01.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190902-2012-04-01.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190872-2012-04-01.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190897-2012-04-01.html
Apr 1, 2012 11:36:39 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-31.html" because url does not match filter rules.
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-30.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index-2012-03-31.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190801-2012-03-31.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190811-2012-03-31.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190840-2012-03-31.html
Apr 1, 2012 11:36:42 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-30.html" because url does not match filter rules.
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-29.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index-2012-03-30.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190725-2012-03-30.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190718-2012-03-30.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190743-2012-03-30.html
Apr 1, 2012 11:36:44 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-29.html" because url does not match filter rules.
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-28.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index-2012-03-29.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190641-2012-03-29.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190635-2012-03-29.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190650-2012-03-29.html
Apr 1, 2012 11:36:47 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-28.html" because url does not match filter rules.
bla, bla, bla...
bla, bla, bla...
bla, bla, bla...
What could be wrong? Is there a bug in RM5's web crawler, or am I doing something wrong?

Thank you in advance.
Leonardo Der Jachadurian Gorojans

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Leonardo,

    try to reduce the max depth and/or adjust your FOLLOW rules. The operator first descends, and only stores pages on its way back up from the recursion. You seem to recurse (almost) infinitely deep, which could indicate an error in your FOLLOW rules. Reducing the max depth, however, can also help.
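
    Roughly speaking, the behaviour is like the following sketch (a simplified Python sketch, not the actual Crawl Web implementation; fetch, extract_links and store stand in for the download, link-extraction and storage steps):

    import re

    def crawl(url, depth, max_depth, visited, fetch, extract_links,
              follow_rule, store_rule, store):
        """Depth-first crawl: descend along FOLLOW matches first,
        then store pages matching the STORE rule while unwinding."""
        if depth > max_depth or url in visited:
            return
        visited.add(url)
        page = fetch(url)                          # download the page
        for link in extract_links(page):
            if re.fullmatch(follow_rule, link):    # descend first ...
                crawl(link, depth + 1, max_depth, visited, fetch,
                      extract_links, follow_rule, store_rule, store)
        if re.fullmatch(store_rule, url):          # ... store on the way back up
            store(url, page)

    With circular links and a huge max depth, the descent only bottoms out once every reachable page has been visited, so nothing gets stored for a very long time.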

    Best, Marius
  • leoderja Member Posts: 4 Contributor I
    Dear Marius: thanks for explaining how RM's crawler works internally.

    This online newspaper has a lot of circular link paths. I have rearranged the RM crawl process to make the date navigation iterative with the Loop operator, and from each date's index page I use the web crawler to get the individual news pages. Please see the process below (a small sketch of the generated index URLs follows the XML)...
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
        <process expanded="true" height="541" width="165">
          <operator activated="true" class="loop" compatibility="5.2.003" expanded="true" height="76" name="Loop" width="90" x="45" y="30">
            <parameter key="set_iteration_macro" value="true"/>
            <parameter key="iterations" value="99"/>
            <process expanded="true" height="541" width="346">
              <operator activated="true" class="generate_macro" compatibility="5.2.003" expanded="true" height="76" name="Generate Macro" width="90" x="45" y="30">
                <list key="function_descriptions">
                  <parameter key="URL" value="&quot;http://www.pagina12.com.ar/diario/deportes/index-&amp;quot;+date_str_custom(date_add(date_now(),-%{iteration}, DATE_UNIT_DAY), &quot;yyyy-MM-dd&quot;)+&quot;.html&quot;"/>
                </list>
              </operator>
              <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="246" y="75">
                <parameter key="url" value="%{URL}"/>
                <list key="crawling_rules">
                  <parameter key="follow_link_with_matching_url" value="http://www\.pagina12\.com\.ar/diario/deportes/8-.+"/>
                  <parameter key="store_with_matching_url" value="http://www\.pagina12\.com\.ar/diario/deportes/8-.+"/>
                </list>
                <parameter key="output_dir" value="C:\Users\USR\Desktop\FILES"/>
                <parameter key="extension" value="html"/>
                <parameter key="domain" value="server"/>
              </operator>
              <operator activated="false" class="print_to_console" compatibility="5.2.003" expanded="true" height="60" name="Print to Console" width="90" x="112" y="345">
                <parameter key="log_value" value="%{URL}"/>
              </operator>
              <connect from_port="input 1" to_op="Generate Macro" to_port="through 1"/>
              <connect from_op="Crawl Web" from_port="Example Set" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Loop" to_port="input 1"/>
          <connect from_op="Loop" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
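    For reference, the URL that the Generate Macro builds in each iteration is equivalent to this small sketch (plain Python; the macro expression uses date_add and date_str_custom with the iteration macro):

    from datetime import date, timedelta

    def deportes_index_url(iteration: int) -> str:
        # Index page of the "Deportes" section for (today - iteration) days,
        # formatted as yyyy-MM-dd, just like the macro expression above.
        day = date.today() - timedelta(days=iteration)
        return ("http://www.pagina12.com.ar/diario/deportes/index-"
                + day.strftime("%Y-%m-%d") + ".html")

    # E.g. the first three iterations of the Loop:
    for i in range(1, 4):
        print(deportes_index_url(i))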
    Thank you.
    Best regards, Leonardo
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    what's the problem with that process? Without looking at it in detail, I saw that it grabs some pages...
    If it does not do what you expected, did you try the hints from my first post? With those I got your first process working.

    Best,
    Marius
  • leoderja Member Posts: 4 Contributor I
    Hi Marius:

    I followed your recommendations, adjusting the crawling rules (I made them more specific to avoid unwanted paths) and trying several depths (0, 9, 10, 20, 50, 99, 999), but it does not work as I need.

    What I need is to crawl every page in the "Deportes" section of this online newspaper (Pagina12.com), from today back several years (say 5 years).

    With high depths (>= 99), I reach at most 1062 stored pages, and after that the process stops without errors. With a depth of 9, I only get 96 pages stored...

    The solution that I posted was able to obtain all the pages.

    Where can I get more detailed documentation about RM's web crawler (or documentation for the library that this operator uses)?


    Thank you.