
How to stop the Get Pages module stopping the process when it cannot read a URL

davidellis Member Posts: 4 Contributor I
I have a process that reads an Excel file, gets pages, and then processes the results. With a dataset of 98 records it runs perfectly. If I add another 500 records, I get random read URL errors.

I have checked all the URLs and they work perfectly, and my internet connection is solid. I found a solution on the forum based on a Handle Exception module, but it doesn't seem to make any difference and I am not sure how it works.

Any ideas how to fix the errors or, failing that, how to skip those URLs?

Answers

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi David,

     

    A long time after your post, I ran into the same problem. It can be worked around by looping over the URLs and putting Get Page inside Handle Exception, so that a URL that cannot be read is skipped instead of stopping the process:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="34">
    <parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main():&#10; &#10; data2 = pandas.DataFrame({'link':['https://www.presseportal.de/blaulicht/pm/70116/3951184','https://www.nonexisting.ar', 'https://www.tu-dortmund.de/uni/de/Einstieg/aktuelles/meldungen/2018-01/18-01-31-Do-camp-ing/index.html']})&#10;&#10; # connect 2 output ports to see the results&#10; return data2"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="8.2.000" expanded="true" height="68" name="Extract Macro" width="90" x="246" y="34">
    <parameter key="macro" value="number_examples"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="concurrency:loop" compatibility="8.2.000" expanded="true" height="82" name="Loop" width="90" x="447" y="34">
    <parameter key="number_of_iterations" value="%{number_examples}"/>
    <process expanded="true">
    <operator activated="true" class="filter_example_range" compatibility="8.2.000" expanded="true" height="82" name="Filter Example Range" width="90" x="112" y="34">
    <parameter key="first_example" value="%{iteration}"/>
    <parameter key="last_example" value="%{iteration}"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="8.2.000" expanded="true" height="68" name="Extract Macro (2)" width="90" x="246" y="34">
    <parameter key="macro" value="link"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="link"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="handle_exception" compatibility="8.2.000" expanded="true" height="82" name="Handle Exception" width="90" x="514" y="34">
    <process expanded="true">
    <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34">
    <parameter key="url" value="%{link}"/>
    <list key="query_parameters"/>
    <list key="request_properties"/>
    </operator>
    <connect from_op="Get Page" from_port="output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <process expanded="true">
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Extract Macro (2)" to_port="example set"/>
    <connect from_op="Handle Exception" from_port="out 1" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="715" y="34">
    <parameter key="create_word_vector" value="false"/>
    <parameter key="keep_text" value="true"/>
    <process expanded="true">
    <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="380" y="34"/>
    <connect from_port="document" to_op="Extract Content" to_port="document"/>
    <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Execute Python" from_port="output 1" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Loop" to_port="input 1"/>
    <connect from_op="Loop" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Strangely enough, the Loop Examples operator seems to be broken, so I emulated it with the normal Loop operator.
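
For anyone unsure what the Loop plus Handle Exception combination does, the same idea can be sketched in plain Python. This is only an illustration under stated assumptions, not part of the RapidMiner process: `fetch_page` and `fetch_all` are hypothetical helper names, and the standard-library `urllib` stands in for Get Page. Each URL is fetched inside its own try block, so one failing URL is recorded and skipped while the loop carries on.

```python
import urllib.error
import urllib.request


def fetch_page(url, timeout=10):
    """Download one URL and return its body as text (stand-in for Get Page)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_page):
    """Fetch every URL; collect failures instead of aborting the whole run.

    Hypothetical helper: the try/except plays the role of Handle Exception,
    and the for loop plays the role of the Loop operator.
    """
    pages = {}    # url -> page text for URLs that could be read
    skipped = []  # (url, error message) for URLs that could not be read
    for url in urls:
        try:
            pages[url] = fetch(url)
        except (urllib.error.URLError, OSError, ValueError) as exc:
            # URLError/OSError cover network failures, ValueError a malformed URL.
            skipped.append((url, str(exc)))
    return pages, skipped
```

Calling `pages, skipped = fetch_all(list_of_urls)` then gives the readable pages plus a list of the URLs that failed, which can be inspected or retried later.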

     

    It would be nice if the Get Pages operator could simply ignore "not found" responses!

     

    Regards,

    Sebastian
