RapidMiner

Contributor II jhiller

Problem with extensional Operator "Get Pages"

Hi,

 

I have a problem with the "Get Pages" operator from the Web Mining extension.

It seems that the operator has an encoding problem with UTF-8 characters such as "Ü".

With Mozilla Firefox I get a JSON response with results when calling the URL "https://itunes.apple.com/search?term="Google Übersetzer"&entity=software&country=de&media=software&limit=5".

When I call this URL via the "Get Pages" operator, I get a JSON result, but it contains no search results.

 

This is my test process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="7.5.001" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
        <parameter key="target_function" value="random"/>
        <parameter key="number_examples" value="1"/>
        <parameter key="number_of_attributes" value="1"/>
        <parameter key="attributes_lower_bound" value="-10.0"/>
        <parameter key="attributes_upper_bound" value="10.0"/>
        <parameter key="gaussian_standard_deviation" value="10.0"/>
        <parameter key="largest_radius" value="10.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.5.001" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
        <list key="function_descriptions">
          <parameter key="att1" value="&quot;https://itunes.apple.com/search?term=\&quot;Google Übersetzer\&quot;&amp;entity=software&amp;country=de&amp;media=software&amp;limit=5&quot;"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="getPage" width="90" x="313" y="34">
        <parameter key="link_attribute" value="att1"/>
        <parameter key="page_attribute" value="html"/>
        <parameter key="random_user_agent" value="false"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0"/>
        <parameter key="connection_timeout" value="2000"/>
        <parameter key="read_timeout" value="2000"/>
        <parameter key="follow_redirects" value="true"/>
        <parameter key="accept_cookies" value="none"/>
        <parameter key="cookie_scope" value="global"/>
        <parameter key="request_method" value="POST"/>
        <parameter key="delay" value="random"/>
        <parameter key="delay_amount" value="5000"/>
        <parameter key="min_delay_amount" value="2000"/>
        <parameter key="max_delay_amount" value="5000"/>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="getPage" to_port="Example Set"/>
      <connect from_op="getPage" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Can you reproduce the issue? Do you think this is a bug in the operator, or do I have to escape the URL, and if so, how?

 

Regards

Johannes

5 REPLIES
RM Certified Expert

Re: Problem with extensional Operator "Get Pages"

I get a Bad Request (400) if I just plug the URL into a single Get Page operator. I think Apple is blocking people like us from using their service. Maybe @Edin_Klapic has an idea about this.

RM Staff

Re: Problem with extensional Operator "Get Pages"

Hi Johannes,

 

I tried your URL with various RapidMiner operators: Get Pages, Get Page, Enrich Data by Webservice, and Open File (from URL) in combination with Read Document.

None of them delivered the desired output, but I can confirm that I got the same result you did.

 

Regarding your Encoding question:

For your use case I tried to encode the part you mentioned, but this did not help:

http://itunes.apple.com/search?term="Google Übersetzer"&entity=software&country=de&media=software&limit=5
==>
http://itunes.apple.com/search?term=Google%DCbersetzer&entity=software&country=de&media=software&limit=5

When I load the URL in my browser, a .txt file is downloaded to my computer. I suspect that this is the problem.

If you can try this with a website that returns only a JSON string as the result, we should be able to get this going.

 

Best regards,

Edin

 

Contributor II jhiller

Re: Problem with extensional Operator "Get Pages"

Hi,

 

Thanks a lot for your work!

I'm sorry for the late response. There was a mistake in my process: the user agent must be randomized. The following process shows my problem better.

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="7.4.001" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
        <parameter key="text" value="https://itunes.apple.com/search?term=&quot;Google Übersetzer&quot;&amp;entity=software&amp;country=de&amp;media=software&amp;limit=5"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="7.4.001" expanded="true" height="68" name="Create Document (2)" width="90" x="45" y="136">
        <parameter key="text" value="https://itunes.apple.com/search?term=&quot;Whatsapp&quot;&amp;entity=software&amp;country=de&amp;media=software&amp;limit=5"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="7.4.001" expanded="true" height="103" name="Documents to Data" width="90" x="246" y="34">
        <parameter key="text_attribute" value="att1"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
      </operator>
      <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="getPages" width="90" x="447" y="34">
        <parameter key="link_attribute" value="att1"/>
        <parameter key="page_attribute" value="html"/>
        <parameter key="random_user_agent" value="true"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0"/>
        <parameter key="connection_timeout" value="2000"/>
        <parameter key="read_timeout" value="2000"/>
        <parameter key="follow_redirects" value="true"/>
        <parameter key="accept_cookies" value="none"/>
        <parameter key="cookie_scope" value="global"/>
        <parameter key="request_method" value="POST"/>
        <parameter key="delay" value="random"/>
        <parameter key="delay_amount" value="5000"/>
        <parameter key="min_delay_amount" value="2000"/>
        <parameter key="max_delay_amount" value="5000"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Documents to Data" to_port="documents 2"/>
      <connect from_op="Documents to Data" from_port="example set" to_op="getPages" to_port="Example Set"/>
      <connect from_op="getPages" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

As you can see, the process works for the second row but not for the first row, which contains the special character. So I still think that this is an encoding problem in the implementation of the "Get Pages" operator.

 

Best Regards,

Johannes

RM Staff
Solution

Re: Problem with extensional Operator "Get Pages"

The link needs to be encoded as follows:

https://itunes.apple.com/search?term="Google+%C3%9Cbersetzer"&entity=software&country=de&media=software&limit=5

My first suggestion, %DC as the encoding for the letter Ü, was only partly correct: %DC is the ISO-8859-1 (Latin-1) encoding, whereas for UTF-8 it needs to be %C3%9C.
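
To illustrate the difference between the two encodings, here is a minimal sketch in Python (not part of the RapidMiner process; the variable names are only illustrative). It percent-encodes "Ü" once as Latin-1 and once as UTF-8 using the standard library:

```python
from urllib.parse import quote

# Percent-encode "Ü" (U+00DC) under two different character encodings.
latin1_encoded = quote("Ü", encoding="latin-1")  # one byte: 0xDC
utf8_encoded = quote("Ü", encoding="utf-8")      # two bytes: 0xC3 0x9C

print(latin1_encoded)  # %DC
print(utf8_encoded)    # %C3%9C
```

A server that decodes the query string as UTF-8 cannot interpret a lone %DC as a valid character, which would be consistent with the empty search result.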

 

You can test such URLencoding related stuff on various websites (e.g. here).
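
If you prefer to check the encoding locally instead of on a website, the following Python sketch builds the encoded link from the plain search term. The helper name `build_search_url` is hypothetical, and keeping the double quotes unescaped is an assumption that simply mirrors the working link above:

```python
from urllib.parse import quote_plus, unquote_plus

BASE = "https://itunes.apple.com/search"  # endpoint as used in this thread


def build_search_url(term: str) -> str:
    """Hypothetical helper: wrap the term in quotes and UTF-8 percent-encode it."""
    # quote_plus encodes as UTF-8 by default and turns spaces into "+";
    # safe='"' leaves the double quotes literal, as in the working link.
    encoded_term = quote_plus(f'"{term}"', safe='"')
    return (
        f"{BASE}?term={encoded_term}"
        "&entity=software&country=de&media=software&limit=5"
    )


url = build_search_url("Google Übersetzer")
print(url)

# Round trip: decoding the encoded term gives back the original text.
decoded = unquote_plus('"Google+%C3%9Cbersetzer"')
print(decoded)
```

The resulting string can then be used as the value of the link attribute (att1 in the processes above) before it is passed to Get Pages.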

 

Best,

Edin

Contributor II jhiller

Re: Problem with extensional Operator "Get Pages"

Thanks a lot. The solution is working!
