UTF-8 encoded text doesn't come out right from the Get Page operator

s_nektarijevic · December 2018

Dear RapidMiners,

I am having an issue with the Get Page operator and UTF-8 encoding.

I am scraping the content of this web page:

https://www.fda.gov/RegulatoryInformation/Guidances/default.htm

According to the html code I get out of Get Page, this page uses UTF-8:

The problem is that, for example, FDA’s comes out as FDAâs.

I tried enforcing the right encoding by checking the "override encoding" box in the Get Page operator, but if I do that, I get an error message:

"Encoding 'SYSTEM' is not supported"

Any idea how to solve this (without having to manually search and replace the unwanted characters, please!)?

Many thanks in advance for any kind of input!

Snežana

Marco_Boeck · December 2018

Hi,

This works just fine for me:

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="34">
        <parameter key="url" value="https://www.fda.gov/RegulatoryInformation/Guidances/default.htm"/>
        <parameter key="random_user_agent" value="false"/>
        <parameter key="connection_timeout" value="10000"/>
        <parameter key="read_timeout" value="10000"/>
        <parameter key="follow_redirects" value="true"/>
        <parameter key="accept_cookies" value="none"/>
        <parameter key="cookie_scope" value="global"/>
        <parameter key="request_method" value="GET"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
        <parameter key="override_encoding" value="true"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
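For readers unsure what forcing an encoding changes, here is a minimal Python sketch of the general idea. It is not how the Get Page operator is implemented; `decode_body` and its parameters are hypothetical names made up for this illustration. It assumes the server delivers UTF-8 bytes while the client would otherwise decode them with a Windows-1252 default.

```python
from typing import Optional


def decode_body(raw: bytes, declared: Optional[str] = None,
                override: Optional[str] = None) -> str:
    """Decode response bytes; an explicit override wins over the declared charset.

    Hypothetical helper for illustration only.
    """
    charset = override or declared or "utf-8"
    return raw.decode(charset, errors="replace")


# The page bytes as a UTF-8 server would send them (U+2019 is the curly apostrophe).
raw = "FDA\u2019s guidance".encode("utf-8")

# Decoded with the wrong single-byte charset, the apostrophe turns into three characters.
wrong = decode_body(raw, declared="cp1252")

# Forcing UTF-8 restores the original text.
forced = decode_body(raw, declared="cp1252", override="utf-8")

print(repr(wrong))
print(repr(forced))
```

If the text is still garbled after forcing UTF-8 at this step, the damage is probably introduced somewhere later in the pipeline rather than during the download.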

kayman · December 2018

Is your process itself also using UTF-8?
When you click into your main window, you can also define the encoding for the process itself in the parameters. I typically set this to UTF-8 as well, and do the same under settings -> preferences -> general -> encoding.

s_nektarijevic · December 2018

Dear @kayman ,

Many thanks for your suggestion! However, it didn't really help resolve my case :-(

I am not sure whether I am doing things right, but I adjusted the settings as you suggested and reran the process, and got the same result as before. I also tried restarting RapidMiner after adjusting the settings, but nothing changed. I am not exactly sure where the problem is, but no matter which encoding settings I choose (I tried SYSTEM, UTF-8 and ISO-8859-1 for fun), I get the same result.

In any case, what I see straight out of Get Page is different from what I see in the final Example Set. Here is an example:

After Get Page:

CVM GFI #108 Registering with CVMâ\200\203s Electronic Submission System

In the final Example Set:

Registering with CVMâ€™s Electronic Submission System

Any idea what is still wrong?

Many thanks in advance for any kind of input!

Snezana
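The garbled strings in this thread match a well-known pattern: the curly apostrophe ’ (U+2019) is three bytes in UTF-8, and reading those bytes with a single-byte charset yields â plus two more characters. The `\200` in the "After Get Page" sample is the octal form of byte 0x80, the second byte of that sequence. The Python sketch below reproduces the pattern and shows one way to check whether text was mis-decoded in this way. The `repair` helper is a hypothetical name, and the assumption that Windows-1252 is the wrong charset involved is a guess based on the â€™ output, not something confirmed for RapidMiner.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the apostrophe in "FDA’s".
apostrophe = "\u2019"
utf8_bytes = apostrophe.encode("utf-8")  # three bytes: 0xE2 0x80 0x99

# Reading those bytes as Windows-1252 gives three visible characters.
as_cp1252 = utf8_bytes.decode("cp1252")

# Reading them as ISO-8859-1 gives 'â' plus two invisible control characters,
# which would display as just 'â'.
as_latin1 = utf8_bytes.decode("latin-1")


def repair(text: str, wrong_charset: str = "cp1252") -> str:
    """Undo one round of mis-decoding; return the input unchanged if that fails.

    Hypothetical helper for illustration only.
    """
    try:
        return text.encode(wrong_charset).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repr(as_cp1252))
print(repr(as_latin1))
print(repair("CVM\u00e2\u20ac\u2122s Electronic Submission System"))
```

If `repair` restores the text, the bytes arrived intact and were only decoded with the wrong charset at some step, which narrows the search to the operator or setting that does the decoding.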


Altair RapidMiner Community
