Process Web Spanish

Xannix Member Posts: 21 Contributor II
edited November 2018 in Help
Hi everyone!
I'm trying to use "Process Web" on Spanish-language pages and I'm having problems with accented characters.
The web page declares "charset=iso-8859-1", so I set the encoding parameter to "iso-8859-1", but it doesn't work. (I have tried all the usual encodings.)
The curious thing is that "Crawl Web" works, but only if I check "write pages into files"; if I don't, it doesn't work either.

Is this a bug?

Does anyone know how I can solve it?

Thanks : )

Answers

  • Xannix Member Posts: 21 Contributor II
    I've discovered that this problem happens only sometimes, and I don't know why.
    In this process you can see that the attribute "Introduccion" has different values depending on the method:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" expanded="true" name="Process">
       <parameter key="encoding" value="ISO-8859-1"/>
       <process expanded="true" height="325" width="685">
         <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
           <parameter key="url" value="http://www.madrimasd.org/informacionidi/noticias/default.asp?Page=1&amp;Tipo=2"/>
           <list key="crawling_rules">
             <parameter key="2" value="http://www.madrimasd.org/noticias/.*"/>
             <parameter key="0" value="http://www.madrimasd.org/noticias/.*"/>
           </list>
           <parameter key="add_pages_as_attribute" value="true"/>
           <parameter key="output_dir" value="C:\"/>
           <parameter key="extension" value="htm"/>
           <parameter key="max_pages" value="3"/>
           <parameter key="delay" value="100"/>
           <parameter key="max_threads" value="3"/>
           <parameter key="max_page_size" value="1000"/>
         </operator>
         <operator activated="true" class="web:process_web" expanded="true" height="60" name="Process Web" width="90" x="45" y="120">
           <parameter key="url" value="http://www.madrimasd.org/informacionidi/noticias/default.asp?Page=1&amp;Tipo=2"/>
           <list key="crawling_rules">
             <parameter key="2" value="http://www.madrimasd.org/noticias/.*"/>
             <parameter key="0" value="http://www.madrimasd.org/noticias/.*"/>
           </list>
           <parameter key="add_pages_as_attribute" value="true"/>
           <parameter key="max_pages" value="3"/>
           <parameter key="delay" value="100"/>
           <parameter key="max_threads" value="3"/>
           <process expanded="true" height="422" width="752">
             <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="112" y="30"/>
             <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="313" y="30">
               <parameter key="query_type" value="XPath"/>
               <list key="string_machting_queries"/>
               <list key="regular_expression_queries"/>
               <list key="regular_region_queries"/>
               <list key="xpath_queries">
                 <parameter key="Introduccion" value="//h:p/text()"/>
               </list>
               <list key="namespaces"/>
               <list key="index_queries"/>
             </operator>
             <connect from_port="document" to_op="Transform Cases" to_port="document"/>
             <connect from_op="Transform Cases" from_port="document" to_op="Extract Information" to_port="document"/>
             <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" breakpoints="after" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="246" y="30">
           <parameter key="source_attribute" value="Page"/>
           <parameter key="query_type" value="XPath"/>
           <list key="string_machting_queries">
             <parameter key="parrafismo" value="&lt;p&gt;.&lt;/p&gt;"/>
           </list>
           <list key="regular_expression_queries">
             <parameter key="Jurjur" value="Sin(.*)Blasco"/>
           </list>
           <list key="regular_region_queries"/>
           <list key="xpath_queries">
             <parameter key="Introduccion" value="//h:p/text()"/>
           </list>
           <list key="namespaces"/>
           <list key="index_queries"/>
           <parameter key="value_seperator" value="***"/>
         </operator>
         <connect from_op="Crawl Web" from_port="Example Set" to_op="Generate Extract" to_port="Example Set"/>
         <connect from_op="Process Web" from_port="example set" to_port="result 2"/>
         <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I think this is an issue with the encoding of the web page. It's rather difficult to always detect the correct encoding if the web page doesn't specify it. We usually assume UTF-8 if nothing is specified in the HTML document.
    You could try requesting the web pages manually in an appropriate terminal program and check whether the encoding is correct. If it is not, you could file a bug in the tracker with a detailed example process. This would make my life much easier and would speed up the fix :)

    Greetings,
      Sebastian
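
As a rough illustration of the mismatch described above, the following Python sketch shows what happens when bytes sent as ISO-8859-1 are decoded as UTF-8. The sample text is a made-up stand-in for a page body, not content from the real site, and the sketch is independent of how RapidMiner decodes pages internally.

```python
# Sample Spanish text; a hypothetical stand-in for a page body.
text = "Introducción a la investigación"

# Bytes as a server would send them for a page declared as ISO-8859-1.
raw = text.encode("iso-8859-1")

# Decoding with a UTF-8 default cannot interpret the accented bytes,
# so each one is replaced by the U+FFFD replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the charset the page declares restores the accents.
right = raw.decode("iso-8859-1")

print(wrong)
print(right)
```

If the first printed line shows a replacement symbol where the accents should be and the second line reads correctly, the bytes themselves are fine and only the decoding step picked the wrong charset.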
  • Xannix Member Posts: 21 Contributor II
    I'm not sure if I understand you...

    I can see these pages in my browser, and I've found this in the source code of the page:
    <META http-equiv=Content-Type content="text/html; charset=iso-8859-1"> (I'm not sure if this is what you are referring to)

    You told me to request the web pages in an appropriate terminal program... (A browser? Sorry, I don't know what you mean.)

    In the example, you can see that the "Process Web" operator replaces the accented characters with a symbol, but with the "Crawl Web" operator the accents are written correctly (but only if "write pages into files" is checked).

    I would like to help fix it, but I don't know how.

    Thanks for everything
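
A minimal Python sketch of how a charset declared in a META tag like the one quoted above could be picked up and used for decoding. The function name `sniff_meta_charset` and the sample page bytes are hypothetical, and the UTF-8 fallback only mirrors the default mentioned earlier in the thread; this is not RapidMiner's actual implementation.

```python
import re


def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in an HTML meta tag, or `default` if none is found."""
    # Only the start of the document is inspected; non-ASCII bytes are ignored
    # because the declaration itself is plain ASCII.
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", head, flags=re.IGNORECASE)
    return match.group(1).lower() if match else default


# Hypothetical page bytes using the META tag from the post; \xf3 is "ó" in ISO-8859-1.
page = (
    b'<html><head><META http-equiv=Content-Type '
    b'content="text/html; charset=iso-8859-1"></head>'
    b"<body><p>Introducci\xf3n</p></body></html>"
)

charset = sniff_meta_charset(page)
print(charset)
print(page.decode(charset))
```

Running the same check against the bytes of a real page (saved to disk, for example) would show whether the declared charset matches the bytes the server actually sends.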
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I have added a bug to the bug tracker. We will solve it as soon as possible.

    Greetings,
      Sebastian
  • Xannix Member Posts: 21 Contributor II
    Thanks for everything : )