
"Crawling rules"

Xannix Member Posts: 21 Contributor II
edited June 2019 in Help
Hi,
I'm not sure whether I'm misunderstanding the method, but I can't figure out how to use the "store_with_matching_content" parameter.

I would like to store pages which contain a specific word (for example "euro"). I've tried to write:

a) Just the word: euro
b) A regular expression, for example: .*euro.*

What is the problem? Could someone explain this to me?

Thanks : )

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you have to enter a valid regular expression.
    Please post the process so that I can take a look at your parameters.
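    A minimal sketch of why a bare keyword fails, assuming the operator checks the pattern against the entire page text with Java's full-match semantics (Matcher.matches()), which the rest of this thread suggests:

    import java.util.regex.Pattern;

    public class MatchSemantics {
        public static void main(String[] args) {
            String page = "<html>\n<body>Price: 100 euro</body>\n</html>";

            // matches() succeeds only if the pattern covers the WHOLE input,
            // so a bare keyword can never match a full page:
            System.out.println(Pattern.compile("euro").matcher(page).matches());     // false

            // ".*euro.*" still fails, because by default . does not match newlines:
            System.out.println(Pattern.compile(".*euro.*").matcher(page).matches()); // false

            // find() would locate the keyword as a substring:
            System.out.println(Pattern.compile("euro").matcher(page).find());        // true
        }
    }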

    Greetings,
      Sebastian
  • colo Member Posts: 236 Maven
    I tried to use this rule a few days ago without success. The other rules seem to work as expected, but there might be an issue with matching the regular expression for store_with_matching_content. I entered several expressions, and even .* didn't bring up any results. Is this a usage problem or a little bug? ;)
  • Xannix Member Posts: 21 Contributor II
    Hi colo,
    I have the same problem; all the other rules work fine, but not this one. Here is my example, crawling the Rapid-I website:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="145" width="212">
          <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
            <parameter key="url" value="http://rapid-i.com/index.php?lang=en"/>
            <list key="crawling_rules">
              <parameter key="2" value="http://rapid-i\.com/.*"/>
              <parameter key="1" value=".*Rapid.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="max_pages" value="2"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    what exactly happens with this rule? Does the operator always return an empty set, or does it not finish at all?

    Greetings,
      Sebastian
  • colo Member Posts: 236 Maven
    Hello Sebastian,

    it doesn't even return an empty set. There simply are no results: after the process finishes, the prompt for switching to the results perspective shows up as usual, but there is only the empty result overview and nothing else...

    Regards,
    Matthias
  • haddock Member Posts: 849 Maven
    Greets to all,

    Well, it is actually possible to get something from the web crawler - the code below builds word vectors from the recent posts in this forum - but if you want to mine more than a few pages, I'm not sure the websphinx library is that robust; its last version was released in 2002. Furthermore, if I insert print statements in appropriate places and build the operators from scratch, I can see results that are, shall we say, intriguing. Anyways, here's the creepy crawler...
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="-20" width="-50">
          <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="53" y="53">
            <parameter key="url" value="http://rapid-i.com/rapidforum/index.php?action=recent"/>
            <list key="crawling_rules">
              <parameter key="0" value="http://rapid-i.com/rapidforum.*"/>
              <parameter key="2" value="http://rapid-i.com/rapidforum.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Documents and Settings\Administrator\My Documents\WebCrawler"/>
            <parameter key="max_pages" value="10"/>
            <parameter key="max_depth" value="3"/>
            <parameter key="max_threads" value="12"/>
            <parameter key="user_agent" value="haddock checking rapid-miner-crawler"/>
            <parameter key="obey_robot_exclusion" value="false"/>
            <parameter key="really_ignore_exclusion" value="true"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="360" y="46">
            <list key="specify_weights"/>
            <process expanded="true" height="353" width="808">
              <operator activated="true" class="web:unescape_html" expanded="true" height="60" name="Unescape Content" width="90" x="187" y="28"/>
              <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="400" y="26"/>
              <operator activated="true" class="text:filter_stopwords_english" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="543" y="26"/>
              <connect from_port="document" to_op="Unescape Content" to_port="document"/>
              <connect from_op="Unescape Content" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    On the other hand, if I use a RetrievePagesOperator on the output of an RSS Feed operator, everything works fine.


    Toodles


  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I switched the regular expression to dotall mode, so that . also matches line breaks. This solves the issue of the expression not matching the document, but matching a regular expression against a 120 KB website takes far too long. I think we will have to bury this option in its current incarnation.
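    A minimal sketch of both alternatives in plain java.util.regex; the 120 KB figure and the operator internals are from the post above, not from this code:

    import java.util.regex.Pattern;

    public class DotallVsContains {
        public static void main(String[] args) {
            String page = "<html>\n<body>Price: 100 euro</body>\n</html>";

            // DOTALL makes . match line breaks too, so the full-match regex works,
            // but the engine can backtrack heavily on large pages:
            Pattern p = Pattern.compile(".*euro.*", Pattern.DOTALL);
            System.out.println(p.matcher(page).matches()); // true

            // plain string matching performs the same keyword check in linear time:
            System.out.println(page.contains("euro"));     // true
        }
    }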
    Any idea how to replace it, besides simply switching to string matching?

    Greetings,
      Sebastian

    PS:
    If anybody knows another powerful open-source web crawler that's usable from Java, I would be glad to replace that "creepy" sphinx.

  • haddock Member Posts: 849 Maven
    Greets Seb,

    I'm cannibalising the sphinx at the moment, working on tokens rather than strings, and also using the header fields (description, keywords, etc.), which are regex-friendly and can be pre-fetched. I've also started looking at Heritrix. Something may emerge  ;)

    Ciao
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    thanks for the hint on Heritrix. This really seems worth the effort. Uhm, now I only need somebody to pay me for implementing this. Any volunteers? :)
    Does anybody have negative experiences with this crawler? Otherwise I will add it to the feature request list.

    Greetings,
      Sebastian
  • Xannix Member Posts: 21 Contributor II
    So... uhmm... isn't it possible to crawl with the "store_with_matching_content" parameter at all?

    Currently, I do it this way:

    [1] Crawl Web ->
    [2] Generate Extract ->
    [3] Filter Examples

    [1]: I don't use "store_with_matching_content"
    [2]: I extract the text with XPath, because the "attribute_value_filter" parameter of the "Filter Examples" operator doesn't work if it finds any HTML tag. Is that normal?
    [3]: I keep only the examples whose content matches.

    I know this works (see the sketch below), but I don't think it's efficient...
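    For readers following along, a rough sketch of what steps [2] and [3] boil down to, with a hypothetical stripTags helper standing in for the XPath extraction; the actual RapidMiner operators do considerably more:

    import java.util.regex.Pattern;

    public class FilterAfterCrawl {
        // hypothetical helper: crude tag stripping standing in for step [2]
        static String stripTags(String html) {
            return html.replaceAll("<[^>]*>", " ");
        }

        public static void main(String[] args) {
            String page = "<html>\n<body>Price: 100 euro</body>\n</html>";

            // a full-match filter on the raw HTML fails: . does not cross line breaks
            System.out.println(Pattern.matches(".*euro.*", page));     // false

            // filtering the extracted plain text instead (step [3]) succeeds
            System.out.println(stripTags(page).contains("euro"));      // true
        }
    }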

    Any idea?

    Thanks : ))
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    this depends on the regular expression used, but I guess you will have to switch to dotall mode, because normally there's a line break after the tag, and by default the . character does not match line breaks.

    Greetings,
      Sebastian
  • Xannix Member Posts: 21 Contributor II
    Hi,
    where can I find the "dotall mode" option?

    Thanks
  • Xannix Member Posts: 21 Contributor II
    Sorry, I realized I was wrong...

    I've been testing again; if you want to find the word "Euro" in the content, you can write:

    [\S\s]*Euro[\S\s]*

    Maybe it's a little slow, but it works.
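    The trick works because the character class [\S\s] ("non-whitespace or whitespace") matches any character, including line breaks, so no dotall flag is needed. A quick check in plain Java:

    import java.util.regex.Pattern;

    public class CharClassWorkaround {
        public static void main(String[] args) {
            String page = "<html>\n<body>Price: 100 Euro</body>\n</html>";

            // [\S\s] matches newlines as well, unlike the default behaviour of .
            System.out.println(Pattern.matches("[\\S\\s]*Euro[\\S\\s]*", page)); // true
        }
    }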

    Thanks for all : )
  • colo Member Posts: 236 Maven
    Hello Xannix,

    if you want to use options/modifiers in your expressions, you can enable them with an inline flag (?x) at the start of your regex, where "x" specifies which option to use; for the "dotall" option this would be "s". I think it's an easy and clean way to set all options at the beginning of your regex. For your "Euro" search it would read as follows:

    (?s).*Euro.*
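    For completeness, a quick check that the inline flag behaves the same as compiling with Pattern.DOTALL:

    import java.util.regex.Pattern;

    public class InlineDotall {
        public static void main(String[] args) {
            String page = "<html>\n<body>Price: 100 Euro</body>\n</html>";

            // (?s) switches on dotall mode for the rest of the expression...
            System.out.println(Pattern.matches("(?s).*Euro.*", page)); // true

            // ...which is equivalent to passing the flag explicitly:
            System.out.println(Pattern.compile(".*Euro.*", Pattern.DOTALL)
                    .matcher(page).matches());                         // true
        }
    }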
  • Xannix Member Posts: 21 Contributor II
    Hi colo, thanks, I'll try it : )