
getpages fails, reason: iso-8859-15

Mr_MBMr_MB Member Posts: 2 Contributor I
edited November 2018 in Help
Hello,

I have used RapidMiner for a week now, and so far I really like the program; everything has worked fine. In the past week, after watching the tutorial videos, I did some web crawling and text mining.

Now I am doing the same thing I always do at the beginning of the process, which has worked so far: loading URLs from an Excel sheet and then using the Get Pages operator to acquire the HTML. (I didn't post my whole process here because, in general, it is working.) This time, though, some of the URLs in my Excel sheet seem not to work, as the Get Pages operator fails. I get the following message:

Process failed
could not read document
Reason: "iso-8859-15"

If I pick only some random URLs from my Excel sheet, everything works properly. I would like to know if I can do something about this error in general, or how I can find out which URLs are not working so I can filter them.

Thanks a lot in advance

Mr.MB

Answers

    Nils_WoehlerNils_Woehler Member Posts: 463 Maven
    Hi,

    to test whether your URLs work, you can use a process like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.007">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.007" expanded="true" name="Process">
        <process expanded="true" height="442" width="826">
          <operator activated="true" class="read_excel" compatibility="5.2.007" expanded="true" height="60" name="Read Excel" width="90" x="179" y="120">
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="loop_examples" compatibility="5.2.007" expanded="true" height="76" name="Loop Examples" width="90" x="313" y="120">
            <process expanded="true" height="460" width="844">
              <operator activated="true" class="extract_macro" compatibility="5.2.007" expanded="true" height="60" name="Extract Macro" width="90" x="112" y="30">
                <parameter key="macro" value="URL"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="att1"/>
                <parameter key="example_index" value="%{example}"/>
              </operator>
              <operator activated="true" class="print_to_console" compatibility="5.2.007" expanded="true" height="76" name="Print to Console" width="90" x="313" y="30">
                <parameter key="log_value" value="%{URL}"/>
              </operator>
              <operator activated="true" class="web:get_webpage" compatibility="5.2.001" expanded="true" height="60" name="Get Page" width="90" x="313" y="120">
                <parameter key="url" value="%{URL}"/>
                <list key="query_parameters"/>
                <list key="request_properties"/>
              </operator>
              <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_op="Print to Console" to_port="through 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Loop Examples" to_port="example set"/>
          <connect from_op="Loop Examples" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    There you can see in the log which pages are retrieved and which page breaks your process.

    Best,
    Nils
    Mr_MBMr_MB Member Posts: 2 Contributor I
    Hello,

    Thanks a lot, Nils. This helps me clear out the list of broken links for now. I still don't get why some links are not working: if I just copy-paste them into my browser, they open fine. I will have to do some web mining in the next few days, where I need all of the links to work. Any idea how to do that?

    The funny thing is, now that I have filtered out some broken URLs, with another URL (http://www.landfill.com/landfill-mining-and-reclamation/) I get the same "process failed" message as before, only now the reason is: "utf-8". How can I get the Get Pages operator to read all URLs?

    Thanks a lot in advance

    Best regards,
    Mr.MB
    Nils_WoehlerNils_Woehler Member Posts: 463 Maven
    Hi,

    apparently the site is encoded in UTF-8, but our parser detects the encoding as "utf-8" (with quotation marks included). Sadly, that quoted value is not accepted by the InputStream that reads the page.
    I've fixed the bug, and it will work with the next update of the web extension, but currently it is not possible to crawl this page.
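
    For reference, the fix amounts to stripping the quotation marks from the charset token before handing it to the decoder. Here is a minimal sketch in Python of that idea (a hypothetical helper, not the extension's actual code): some servers send the charset parameter as a quoted-string, e.g. `text/html; charset="utf-8"`, and a decoder given the raw token `"utf-8"` (quotes included) rejects it, which matches the error in this thread.

```python
def parse_charset(content_type, default="utf-8"):
    """Extract the charset token from a Content-Type header value.

    Handles the quoted-string form some servers use, e.g.
    'text/html; charset="utf-8"', by stripping surrounding
    quotation marks before returning the encoding name.
    """
    # Parameters follow the media type, separated by semicolons.
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset":
            # Strip single or double quotes around the token.
            return value.strip().strip("\"'") or default
    return default
```

    A reader built with the cleaned name (e.g. `parse_charset('text/html; charset="utf-8"')` giving `utf-8`) then decodes the page normally.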

    Best,
    Nils