Read Greek, Danish etc. HTML pages

mike075i Member Posts: 11 Contributor II
edited December 2018 in Product Feedback - Resolved

Hi guys,

 

I am new to RapidMiner Studio. I want to run a web scraping task that crawls some Greek (and later Danish etc.) HTML sites and extracts their content. In the resulting columns, all the Greek letters look garbled, as the screenshot shows.

 

01_.JPG

 

The Process Documents from Data operator contains the following two components.

02_.JPG

One idea was to add the Keep Document Parts operator with a regular expression that handles Unicode, so I entered \p{L} (letters of all languages) in the extraction regex parameter, following this article: Java regex for support Unicode?. But that did not fix the problem. So my questions are:

 

1. What regular expression is the right one?

2. Is there any other way to get columns that contain the Greek letters correctly?

 

Thank you in advance for help


Fixed and Released · Last Updated

WE-38

Comments

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    Did you try changing the main process encoding to UTF-8? You can get to that setting by clicking on the white area of "Process".

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mike075i Member Posts: 11 Contributor II

    Hi, yes, but it didn't fix the problem. Below I have also posted a screenshot of the output of the Extract Content component.

     

     

    01.png

    I ran the same process with Read RSS Feed in the main process instead of the Crawl Web component, and the encoding works fine. I don't know why this problem occurs with the Crawl Web component :(

  • jwpfau Employee, Member Posts: 274 RM Engineering

    This looks like ISO-8859-7 interpreted as UTF-8 to me. Do you have the URL of the crawled website?
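
A quick way to test an encoding hypothesis like this outside RapidMiner is to reproduce both kinds of mismatch on a known string and compare the result with the screenshot. The snippet below is only a sketch: the sample word is made up (it is not taken from the crawled page), and ISO-8859-1 is an assumed stand-in for whatever single-byte charset might have been detected.

```python
# Hypothetical sample text (not taken from the crawled page).
sample = "ΕΛΛΗΝΙΚΑ"

utf8_bytes = sample.encode("utf-8")

# Case 1: UTF-8 bytes decoded with a single-byte charset.
# Every Greek letter occupies two bytes in UTF-8, so it turns into two characters.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# Decoding with the matching charset restores the text.
restored = utf8_bytes.decode("utf-8")

# This kind of damage is reversible as long as no bytes were lost.
repaired = as_latin1.encode("iso-8859-1").decode("utf-8")

# Case 2: ISO-8859-7 bytes decoded as UTF-8.
# The bytes are not valid UTF-8, so they become U+FFFD replacement characters.
greek_bytes = sample.encode("iso-8859-7")
as_utf8 = greek_bytes.decode("utf-8", errors="replace")

print(repr(as_latin1))
print(repr(as_utf8))
```

If the garbled output shows pairs of accented Latin characters, the first case is the likelier one; if it shows replacement characters, the second is.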

  • mike075i Member Posts: 11 Contributor II

    I have tested ISO-8859-7 too, but the same issue remains. The site is this one: https://www.google.gr/intl/el/policies/privacy/archive/. I have to crawl all the past privacy policies (Greek) and gather some information from every page. I want to mention that there is no such problem with the Read RSS Feed operator, but I don't need an RSS reader for my purpose.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @mike075i - so I can get this working on my computer but I needed to do two things:

    (a) Make sure I had Roboto font installed with Greek characters (I'm not sure this is necessary)

    (b) override the encoding to UTF-8

     

    (note that you did not post your XML process so I just did Get Page of this URL: https://www.google.gr/intl/el/policies/privacy/archive/20160325/)

     

    Scott

     

    Screen Shot 2018-04-19 at 8.53.37 PM.png
    Screen Shot 2018-04-19 at 8.55.15 PM.png

  • mike075i Member Posts: 11 Contributor II

    Oh sorry, my fault, I forgot to post my XML code. Here it is:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="136">
    <parameter key="url" value="https://www.google.gr/intl/el/policies/privacy/archive/"/>
    <list key="crawling_rules">
    <parameter key="follow_link_with_matching_url" value=".+privacy/archive.+"/>
    </list>
    </operator>
    <operator activated="false" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="TEST" width="90" x="45" y="493">
    <parameter key="url" value="http://www.samos.aegean.gr/st/"/>
    <list key="crawling_rules"/>
    </operator>
    <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="313" y="136">
    <parameter key="link_attribute" value="Link"/>
    <parameter key="random_user_agent" value="true"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="238">
    <parameter key="vector_creation" value="Binary Term Occurrences"/>
    <parameter key="add_meta_information" value="false"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="313" y="187"/>
    <operator activated="false" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="340"/>
    <connect from_port="document" to_op="Extract Content" to_port="document"/>
    <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Crawl Web" from_port="example set" to_op="Get Pages" to_port="Example Set"/>
    <connect from_op="Get Pages" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

     

    > (b) override the encoding to UTF-8

    Where is this setting located (in which component)? I was not able to find it :smileysad:.

     

     

     

  • jwpfau Employee, Member Posts: 274 RM Engineering
    Solution Accepted

    I filed a bug report for the wrong encoding detection.

    I hope this works for you:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="34">
    <parameter key="url" value="https://www.google.gr/intl/el/policies/privacy/archive/"/>
    <list key="crawling_rules">
    <parameter key="follow_link_with_matching_url" value=".+privacy/archive.+"/>
    </list>
    </operator>
    <operator activated="true" class="concurrency:loop_values" compatibility="8.1.003" expanded="true" height="82" name="Loop Values" width="90" x="246" y="34">
    <parameter key="attribute" value="Link"/>
    <parameter key="iteration_macro" value="link"/>
    <process expanded="true">
    <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34">
    <parameter key="url" value="%{link}"/>
    <list key="query_parameters"/>
    <list key="request_properties"/>
    <parameter key="override_encoding" value="true"/>
    <parameter key="encoding" value="UTF-8"/>
    </operator>
    <connect from_op="Get Page" from_port="output" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:process_documents" compatibility="8.2.000-SNAPSHOT" expanded="true" height="103" name="Process Documents" width="90" x="447" y="34">
    <parameter key="vector_creation" value="Binary Term Occurrences"/>
    <parameter key="add_meta_information" value="false"/>
    <process expanded="true">
    <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content (4)" width="90" x="447" y="34"/>
    <connect from_port="document" to_op="Extract Content (4)" to_port="document"/>
    <connect from_op="Extract Content (4)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Crawl Web" from_port="example set" to_op="Loop Values" to_port="input 1"/>
    <connect from_op="Loop Values" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
    <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process> 
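
The process above replaces Get Pages with a Loop Values over the Link attribute and a Get Page operator that has override_encoding set to true and encoding set to UTF-8. The idea behind such an override can be sketched in a few lines of Python. This is not RapidMiner's implementation: the function name decode_page, its parameters, and the wrongly detected ISO-8859-1 charset are assumptions made for illustration.

```python
from typing import Optional


def decode_page(raw: bytes, detected: Optional[str], override: Optional[str] = None) -> str:
    """Decode raw page bytes, preferring an explicit override over the detected charset.

    Hypothetical helper: falls back to UTF-8 when neither charset is given.
    """
    charset = override or detected or "utf-8"
    return raw.decode(charset, errors="replace")


# Made-up page content, encoded as UTF-8 like the real page is assumed to be.
raw = "ΕΛΛΗΝΙΚΑ".encode("utf-8")

# Trusting a wrong detection garbles the text.
garbled = decode_page(raw, detected="iso-8859-1")

# Forcing UTF-8 ignores the wrong detection and yields the Greek letters.
fixed = decode_page(raw, detected="iso-8859-1", override="utf-8")

print(repr(garbled))
print(fixed)
```

The override only helps when the forced charset really matches the bytes the server sends, so it is worth checking a single page with Get Page before looping over all links.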

     

  • mike075i Member Posts: 11 Contributor II
    Solution Accepted

    Thank you very much, this solution has fixed my problem. Thumbs up :smileyhappy:.
