"crawler and emailextractor"

xtraplus · May 2012

Hi,

My program reads a list of URLs from an Excel file, crawls them, and should extract something from each page. But whatever XPath query I try, nothing is displayed in the results. The log says that the results are saved, but there is nothing there.

This is my code:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
    <process expanded="true" height="415" width="815">
      <operator activated="true" class="read_excel" compatibility="5.2.006" expanded="true" height="60" name="Read Excel" width="90" x="112" y="75">
        <parameter key="excel_file" value="C:\Dokumente und Einstellungen\Home\Eigene Dateien\Rapidminer\test.xls"/>
        <parameter key="imported_cell_range" value="A1:A3"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="url.true.nominal.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
      </operator>
      <operator activated="true" class="loop_examples" compatibility="5.2.006" expanded="true" height="112" name="Loop Examples" width="90" x="581" y="75">
        <process expanded="true" height="460" width="709">
          <operator activated="true" class="extract_macro" compatibility="5.2.006" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="30">
            <parameter key="macro" value="weburl"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="url"/>
            <parameter key="example_index" value="%{example}"/>
          </operator>
          <operator activated="true" class="web:process_web" compatibility="5.2.000" expanded="true" height="60" name="Process Documents from Web" width="90" x="179" y="120">
            <parameter key="url" value="%{weburl}"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value=".+onta.+|.+about.+|.+info.+|.+suppo.+|.+impre.+"/>
            </list>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="max_depth" value="3"/>
            <parameter key="delay" value="500"/>
            <parameter key="max_threads" value="5"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0 "/>
            <process expanded="true" height="605" width="974">
              <operator activated="true" class="text:extract_information" compatibility="5.2.002" expanded="true" height="60" name="Extract Information" width="90" x="380" y="255">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="mail" value="//h:@href"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Extract Macro" from_port="example set" to_port="example set"/>
          <connect from_op="Process Documents from Web" from_port="example set" to_port="output 2"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="72"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Loop Examples" to_port="example set"/>
      <connect from_op="Loop Examples" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Everything works except the extraction.

Can you help me please?

Regards
Ben

MariusHelf · May 2012

Hi Ben,

Is the Extract Information operator executed at all (try setting a breakpoint there)? Your Process Documents from Web operator only has a follow rule; for the pages to be processed you probably also have to specify a store rule. Additionally, the output ports do not match: you connected the first output of the Loop Examples operator to the process output, but inside the loop you connected the results to the second output.
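A rough sketch of what those two changes could look like in the process XML. The follow rule and the operator and port names are copied from the posted process. The store rule key `store_with_matching_url` and its catch-all pattern `.+` are assumptions that are not confirmed anywhere in this thread, so check them against the rule types offered in the crawling rules dialog.

```xml
<!-- Sketch only: the store rule key and its ".+" pattern are assumptions. -->
<list key="crawling_rules">
  <!-- existing rule from the posted process: which links the crawler follows -->
  <parameter key="follow_link_with_matching_url" value=".+onta.+|.+about.+|.+info.+|.+suppo.+|.+impre.+"/>
  <!-- assumed store rule: pages whose URL matches are passed to the inner process -->
  <parameter key="store_with_matching_url" value=".+"/>
</list>

<!-- Inside Loop Examples: send the crawl result to the same numbered output
     that the outer process reads from, which is "output 1" in the posted XML. -->
<connect from_op="Process Documents from Web" from_port="example set" to_port="output 1"/>
```

A narrower store pattern, for example the same expression as the follow rule, would probably keep the number of processed pages down.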

Best,
Marius

xtraplus · May 2012

Hi Marius,

Thank you for your reply. I corrected the connections, added a store rule, and set a breakpoint at the extractor. It seems to be reached.

I get results with the default XPath setting, without any query specified.

When I try this query:

//h:@href

it does not work and the operator turns yellow.

What's wrong with my query?

I also noticed that in the results view each source URL gets its own example set.

So I think I would have problems storing the results of several URLs in one file.

Do you know what I mean?

I think I have to somehow merge the results so that they get stored as one example set.

But how?

MariusHelf · May 2012

If your XPath should extract the href attribute from the a tags, it should look similar to this:

//h:a/@href

if I remember the syntax correctly.
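The original query fails because `h:@href` puts the namespace prefix directly in front of the attribute step, with no element name in between. The attribute has to be selected from an element, here `h:a`. As a minimal sketch, the relevant part of the Extract Information operator could look like this. It assumes that the `h:` prefix is already bound to the HTML namespace by the operator, as the empty `namespaces` list in the posted process suggests. The second query and its name `mailto` are hypothetical additions for illustration.

```xml
<operator activated="true" class="text:extract_information" compatibility="5.2.002" expanded="true" name="Extract Information">
  <parameter key="query_type" value="XPath"/>
  <list key="xpath_queries">
    <!-- element step h:a first, then the attribute step @href -->
    <parameter key="mail" value="//h:a/@href"/>
    <!-- hypothetical second query: only links whose href starts with "mailto:" -->
    <parameter key="mailto" value="//h:a[starts-with(@href, 'mailto:')]/@href"/>
  </list>
  <list key="namespaces"/>
</operator>
```

Each entry in `xpath_queries` is a name and an XPath expression. Both expressions above are plain XPath 1.0.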

xtraplus · May 2012

Thanks.

What do you recommend for merging the results into one data view?
