
"Crawling rules"

Xannix Member Posts: 21 Contributor II
edited June 2019 in Help
Hi,
I'm not sure whether I'm misunderstanding the method, but I can't figure out how to use the "store_with_matching_content" parameter.

I would like to store pages which contain a specific word (for example "euro"). I've tried to write:

a) Just the word: euro
b) A regular expression, for example: .*euro.*

What is the problem? Could someone explain this to me?

Thanks : )

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you have to enter a valid regular expression.
    Please post the process so that I can take a look at your parameters.
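    A minimal sketch of why a bare keyword fails, assuming the operator checks the pattern against the entire page text with Java's full-match semantics (Matcher.matches()), which the rest of this thread suggests:

    import java.util.regex.Pattern;

    public class MatchSemantics {
        public static void main(String[] args) {
            String page = "<html>\n<body>Price: 100 euro</body>\n</html>";

            // matches() succeeds only if the pattern covers the WHOLE input,
            // so a bare keyword can never match a full page:
            System.out.println(Pattern.compile("euro").matcher(page).matches());     // false

            // ".*euro.*" still fails, because by default . does not match newlines:
            System.out.println(Pattern.compile(".*euro.*").matcher(page).matches()); // false

            // find() would locate the keyword as a substring:
            System.out.println(Pattern.compile("euro").matcher(page).find());        // true
        }
    }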

    Greetings,
      Sebastian
  • colo Member Posts: 236 Maven
    I tried to use this rule a few days ago without success. The other rules seem to work as expected, but there might be an issue with matching the regular expression for store_with_matching_content. I entered several expressions, and even .* didn't bring up any results. Is this a usage problem or a little bug? ;)
  • Xannix Member Posts: 21 Contributor II
    Hi colo,
    I have the same problem; all the other rules work fine, but not this one. Here is my example, crawling the Rapid-I website:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="145" width="212">
          <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
            <parameter key="url" value="http://rapid-i.com/index.php?lang=en"/>
            <list key="crawling_rules">
              <parameter key="2" value="http://rapid-i\.com/.*"/>
              <parameter key="1" value=".*Rapid.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="max_pages" value="2"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    what exactly happens with this rule? Does the operator always return an empty set, or does it not finish at all?

    Greetings,
      Sebastian
  • colo Member Posts: 236 Maven
    Hello Sebastian,

    it doesn't even return an empty set. There simply are no results: after the process finishes, the prompt for switching to the results perspective shows up as usual, but there is only the empty result overview and nothing else...

    Regards,
    Matthias
  • haddock Member Posts: 849 Maven
    Greets to all,

    Well, it is actually possible to get something from the web crawler - the code below builds word vectors from the recent posts in this forum - but if you want to mine more than a few pages, I'm not sure the websphinx library is that robust; its last version was released in 2002. Furthermore, if I insert print statements in appropriate places and build the operators from scratch, I can see results that are, shall we say, intriguing. Anyways, here's the creepy crawler...
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="-20" width="-50">
          <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="53" y="53">
            <parameter key="url" value="http://rapid-i.com/rapidforum/index.php?action=recent"/>
            <list key="crawling_rules">
              <parameter key="0" value="http://rapid-i.com/rapidforum.*"/>
              <parameter key="2" value="http://rapid-i.com/rapidforum.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Documents and Settings\Administrator\My Documents\WebCrawler"/>
            <parameter key="max_pages" value="10"/>
            <parameter key="max_depth" value="3"/>
            <parameter key="max_threads" value="12"/>
            <parameter key="user_agent" value="haddock checking rapid-miner-crawler"/>
            <parameter key="obey_robot_exclusion" value="false"/>
            <parameter key="really_ignore_exclusion" value="true"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="360" y="46">
            <list key="specify_weights"/>
            <process expanded="true" height="353" width="808">
              <operator activated="true" class="web:unescape_html" expanded="true" height="60" name="Unescape Content" width="90" x="187" y="28"/>
              <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="400" y="26"/>
              <operator activated="true" class="text:filter_stopwords_english" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="543" y="26"/>
              <connect from_port="document" to_op="Unescape Content" to_port="document"/>
              <connect from_op="Unescape Content" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    On the other hand, if I use a RetrievePagesOperator on the output of an RSS Feed operator, everything works fine.


    Toodles


  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I switched the regular expression to dotall mode, so that . also matches line breaks. This solves the issue of the expression not matching the document, but matching a regular expression against a 120 KB website takes far too long. I think we will have to bury this option in its current incarnation.
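    A minimal sketch of both alternatives in plain java.util.regex; the 120 KB figure and the operator internals are from the post above, not from this code:

    import java.util.regex.Pattern;

    public class DotallVsContains {
        public static void main(String[] args) {
            String page = "<html>\n<body>Price: 100 euro</body>\n</html>";

            // DOTALL makes . match line breaks too, so the full-match regex works,
            // but the engine can backtrack heavily on large pages:
            Pattern p = Pattern.compile(".*euro.*", Pattern.DOTALL);
            System.out.println(p.matcher(page).matches()); // true

            // plain string matching performs the same keyword check in linear time:
            System.out.println(page.contains("euro"));     // true
        }
    }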
    Any idea how to replace it, besides simply switching to string matching?

    Greetings,
      Sebastian

    PS:
    If anybody knows another powerful open-source web crawler that's usable from Java, I would be glad to replace that "creepy" sphinx.

  • haddock Member Posts: 849 Maven
    Greets Seb,

    I'm cannibalising the sphinx at the moment, working on tokens rather than strings, and also using the header fields (description, keywords, etc.), which are regex-friendly and can be pre-fetched. I've also started looking at Heritrix. Something may emerge  ;)

    Ciao
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    thanks for the hint on Heritrix. This really seems worth the effort. Uhm, now I only need somebody to pay me for implementing this. Any volunteers? :)
    Does anybody have negative experiences with this crawler? Otherwise I will add it to the feature request list.

    Greetings,
      Sebastian
  • Xannix Member Posts: 21 Contributor II
    So... uhmm... isn't it possible to crawl with the "store_with_matching_content" parameter at all?

    Currently, I do it this way:

    [1] Crawl Web ->
    [2] Generate Extract ->
    [3] Filter Examples

    [1]: I don't use "store_with_matching_content"
    [2]: I extract the text with XPath, because the "attribute_value_filter" parameter of the "Filter Examples" operator doesn't work if it finds any HTML tag. Is that normal?
    [3]: I keep only the examples whose content matches.

    I know this works (see the sketch below), but I don't think it's efficient...
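    For readers following along, a rough sketch of what steps [2] and [3] boil down to, with a hypothetical stripTags helper standing in for the XPath extraction; the actual RapidMiner operators do considerably more:

    import java.util.regex.Pattern;

    public class FilterAfterCrawl {
        // hypothetical helper: crude tag stripping standing in for step [2]
        static String stripTags(String html) {
            return html.replaceAll("<[^>]*>", " ");
        }

        public static void main(String[] args) {
            String page = "<html>\n<body>Price: 100 euro</body>\n</html>";

            // a full-match filter on the raw HTML fails: . does not cross line breaks
            System.out.println(Pattern.matches(".*euro.*", page));     // false

            // filtering the extracted plain text instead (step [3]) succeeds
            System.out.println(stripTags(page).contains("euro"));      // true
        }
    }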

    Any idea?

    Thanks : ))
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    this depends on the regular expression used, but I guess you will have to switch to dotall mode, because normally there's a line break after the tag, and by default the . character does not match line breaks.

    Greetings,
      Sebastian
  • Xannix Member Posts: 21 Contributor II
    Hi,
    where can I find the "dotall mode" option?

    Thanks
  • Xannix Member Posts: 21 Contributor II
    Sorry, I realized I was wrong...

    I've been testing again; if you want to find the word "Euro" in the content, you can write:

    [\S\s]*Euro[\S\s]*

    Maybe it's a little slow, but it works.
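    The trick works because the character class [\S\s] ("non-whitespace or whitespace") matches any character, including line breaks, so no dotall flag is needed. A quick check in plain Java:

    import java.util.regex.Pattern;

    public class CharClassWorkaround {
        public static void main(String[] args) {
            String page = "<html>\n<body>Price: 100 Euro</body>\n</html>";

            // [\S\s] matches newlines as well, unlike the default behaviour of .
            System.out.println(Pattern.matches("[\\S\\s]*Euro[\\S\\s]*", page)); // true
        }
    }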

    Thanks for all : )
  • colo Member Posts: 236 Maven
    Hello Xannix,

    if you want to use options/modifiers in your expressions, you can enable them with an inline flag (?x) at the start of your regex, where "x" specifies which option to use; for the "dotall" option this would be "s". I think it's an easy and clean way to set all options at the beginning of your regex. For your "Euro" search it would read as follows:

    (?s).*Euro.*
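    For completeness, a quick check that the inline flag behaves the same as compiling with Pattern.DOTALL:

    import java.util.regex.Pattern;

    public class InlineDotall {
        public static void main(String[] args) {
            String page = "<html>\n<body>Price: 100 Euro</body>\n</html>";

            // (?s) switches on dotall mode for the rest of the expression...
            System.out.println(Pattern.matches("(?s).*Euro.*", page)); // true

            // ...which is equivalent to passing the flag explicitly:
            System.out.println(Pattern.compile(".*Euro.*", Pattern.DOTALL)
                    .matcher(page).matches());                         // true
        }
    }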
  • Xannix Member Posts: 21 Contributor II
    Hi colo, thanks, I'll try it : )