Problem with the extension operator "Get Pages"

jhillerjhiller Member Posts: 12 Contributor II
edited November 2018 in Help

Hi,

 

I have a problem with the "Get Pages" operator from the Web Mining Extension.

It seems that the operator has an encoding problem with UTF-8 characters such as "Ü".

In Mozilla Firefox, calling the URL "https://itunes.apple.com/search?term="Google Übersetzer"&entity=software&country=de&media=software&limit=5" returns a JSON response with results.

When I call the same URL via the "Get Pages" operator, I get a JSON response but without any search results.

 

This is my test process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.5.001" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
<parameter key="target_function" value="random"/>
<parameter key="number_examples" value="1"/>
<parameter key="number_of_attributes" value="1"/>
<parameter key="attributes_lower_bound" value="-10.0"/>
<parameter key="attributes_upper_bound" value="10.0"/>
<parameter key="gaussian_standard_deviation" value="10.0"/>
<parameter key="largest_radius" value="10.0"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.5.001" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
<list key="function_descriptions">
<parameter key="att1" value="&quot;https://itunes.apple.com/search?term=\&quot;Google Übersetzer\&quot;&amp;entity=software&amp;country=de&amp;media=software&amp;limit=5&quot;"/>
</list>
<parameter key="keep_all" value="true"/>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="getPage" width="90" x="313" y="34">
<parameter key="link_attribute" value="att1"/>
<parameter key="page_attribute" value="html"/>
<parameter key="random_user_agent" value="false"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0"/>
<parameter key="connection_timeout" value="2000"/>
<parameter key="read_timeout" value="2000"/>
<parameter key="follow_redirects" value="true"/>
<parameter key="accept_cookies" value="none"/>
<parameter key="cookie_scope" value="global"/>
<parameter key="request_method" value="POST"/>
<parameter key="delay" value="random"/>
<parameter key="delay_amount" value="5000"/>
<parameter key="min_delay_amount" value="2000"/>
<parameter key="max_delay_amount" value="5000"/>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="getPage" to_port="Example Set"/>
<connect from_op="getPage" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

Can you reproduce the issue? Do you think this is a bug in the operator, or do I have to escape the URL, and if so, how?

 

Regards

Johannes

Best Answer

  • Edin_KlapicEdin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist
    Solution Accepted

    The link needs to be encoded as follows:

    https://itunes.apple.com/search?term="Google+%C3%9Cbersetzer"&entity=software&country=de&media=software&limit=5

    My first suggestion, %DC as the encoding for the letter Ü, was only partly correct: %DC is the ISO-8859-1 (Latin-1) encoding, whereas for UTF-8 it needs to be %C3%9C.

     

    You can test URL encoding questions like this on various websites (e.g. here).

     

    Best,

    Edin
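
The difference between the two encodings mentioned above can be checked outside RapidMiner. The following is a minimal sketch in Python (not part of the original answer, and not a RapidMiner feature); it uses the standard library's `urllib.parse.quote_plus` and assumes the double quotes around the search term should stay literal, as in the encoded link above. The variable names are illustrative only.

```python
from urllib.parse import quote_plus

# Search term from the question, including the literal double quotes.
term = '"Google Übersetzer"'

# Percent-encode the term as UTF-8; spaces become "+", and the double
# quotes are kept as-is (safe='"') to match the link in the answer.
utf8_term = quote_plus(term, safe='"', encoding="utf-8")

# The same term encoded as Latin-1 (ISO-8859-1) yields %DC for "Ü".
latin1_term = quote_plus(term, safe='"', encoding="latin-1")

url = (
    "https://itunes.apple.com/search?term="
    + utf8_term
    + "&entity=software&country=de&media=software&limit=5"
)

print(utf8_term)
print(latin1_term)
print(url)
```

Under these assumptions, the UTF-8 variant produces the %C3%9C form and the Latin-1 variant produces the %DC form, which is why the first suggestion looked plausible but did not match what the server expects.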

Answers

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    It's giving me a bad request (400) if I just plug the URL into a single Get Page. I think it's Apple preventing people like us from using their stuff. Maybe @Edin_Klapic has an idea about this.

  • Edin_KlapicEdin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist

    Hi Johannes,

     

    I tried your URL with various RapidMiner operators: Get Pages, Get Page, Enrich Data by Webservice, and Open File (from URL) in combination with Read Document.

    None of them delivered the desired output, but I can confirm that I got the same result you did.

     

    Regarding your encoding question:

    In your use case I tried to encode the part you mentioned, but this did not help:

    http://itunes.apple.com/search?term="Google Übersetzer"&entity=software&country=de&media=software&limit=5
    ==>

    When I load the URL in my browser, a .txt file is downloaded to my computer. I suspect that this is where the problem lies.

    If you can try this with a website that returns only a JSON string as the result, we should be able to get this going.

     

    Best regards,

    Edin

     

  • jhillerjhiller Member Posts: 12 Contributor II

    Hi,

     

    Thanks a lot for your work!

    I'm sorry for the late response. There was a mistake in my process. The user agent must be randomized. The following process shows my problem better.

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="7.4.001" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
    <parameter key="text" value="https://itunes.apple.com/search?term=&quot;Google Übersetzer&quot;&amp;entity=software&amp;country=de&amp;media=software&amp;limit=5"/>
    <parameter key="add label" value="false"/>
    <parameter key="label_type" value="nominal"/>
    </operator>
    <operator activated="true" class="text:create_document" compatibility="7.4.001" expanded="true" height="68" name="Create Document (2)" width="90" x="45" y="136">
    <parameter key="text" value="https://itunes.apple.com/search?term=&quot;Whatsapp&quot;&amp;entity=software&amp;country=de&amp;media=software&amp;limit=5"/>
    <parameter key="add label" value="false"/>
    <parameter key="label_type" value="nominal"/>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="7.4.001" expanded="true" height="103" name="Documents to Data" width="90" x="246" y="34">
    <parameter key="text_attribute" value="att1"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    </operator>
    <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="getPages" width="90" x="447" y="34">
    <parameter key="link_attribute" value="att1"/>
    <parameter key="page_attribute" value="html"/>
    <parameter key="random_user_agent" value="true"/>
    <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0"/>
    <parameter key="connection_timeout" value="2000"/>
    <parameter key="read_timeout" value="2000"/>
    <parameter key="follow_redirects" value="true"/>
    <parameter key="accept_cookies" value="none"/>
    <parameter key="cookie_scope" value="global"/>
    <parameter key="request_method" value="POST"/>
    <parameter key="delay" value="random"/>
    <parameter key="delay_amount" value="5000"/>
    <parameter key="min_delay_amount" value="2000"/>
    <parameter key="max_delay_amount" value="5000"/>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Create Document (2)" from_port="output" to_op="Documents to Data" to_port="documents 2"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="getPages" to_port="Example Set"/>
    <connect from_op="getPages" from_port="Example Set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    You can see that the process works for the second row, but it does not work for the first row, which contains the special character. So I still think that this is an encoding problem in the implementation of the "Get Pages" operator.

     

    Best Regards,

    Johannes
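
For a whole column of URLs like the two rows above, the query part could be percent-encoded before the links are fetched. The following is a minimal sketch in Python using only the standard library; it runs outside RapidMiner, the helper name `encode_query` is hypothetical, and it assumes UTF-8 is the encoding the server expects and that double quotes may stay literal.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def encode_query(raw_url: str) -> str:
    """Return raw_url with its query string percent-encoded as UTF-8."""
    parts = urlsplit(raw_url)
    # Split the query into (name, value) pairs; already-encoded values are
    # decoded here, so encoding the same URL twice gives the same result.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Re-encode: spaces become "+", non-ASCII characters become UTF-8
    # percent escapes, and double quotes are left untouched (assumption).
    query = urlencode(pairs, safe='"', encoding="utf-8")
    return urlunsplit(parts._replace(query=query))


rows = [
    'https://itunes.apple.com/search?term="Google Übersetzer"&entity=software&country=de&media=software&limit=5',
    'https://itunes.apple.com/search?term="Whatsapp"&entity=software&country=de&media=software&limit=5',
]

for row in rows:
    print(encode_query(row))
```

In this sketch the second row comes back unchanged because it contains only ASCII characters and no spaces, which is consistent with that row already working in the process.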

  • jhillerjhiller Member Posts: 12 Contributor II

    Thanks a lot. The solution is working!
