How to set DTD parameter in FeatureExtraction (rapidminer UI)

skarab Member Posts: 10 Contributor II
edited November 2018 in Help
I am asking because I keep getting an IOException thrown from FeatureExtraction:

Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Regards,
skarab

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,526   Unicorn
    Hi,
    I'm sorry, but what exactly are you doing? It would be easiest to post the process and add a short explanation. And to motivate other users to answer your questions, it could be a smart move to add something like "hello" at the start of your message...

    Greetings,
      Sebastian
  • skarab Member Posts: 10 Contributor II
    I am parsing an HTML page, and here is the process code:
    <operator name="FeatureExtraction" class="FeatureExtraction" breakpoints="before,within,after">
        <list key="texts">
            <parameter key="tmp_file" value="%{parent_path}\tmp%{file_name}\%{file_name}"/>
        </list>
        <parameter key="default_content_type" value="html"/>
        <parameter key="default_content_encoding" value="UTF-8"/>
        <parameter key="default_content_language" value="pl"/>
        <parameter key="use_content_attributes" value="true"/>
        <parameter key="id_attribute_type" value="long"/>
        <list key="attributes">
            <parameter key="html" value="/h:html"/>
        </list>
        <list key="namespaces">
            <!-- I tried to set it in namespaces -->
            <parameter key="html" value="C:\\workspace-rapidminer\xhtml1-transitional.dtd"/>
        </list>
    </operator>
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,526   Unicorn
    Hi,
    I don't think the namespace is needed, nor is it correctly defined. So the easiest solution would be to delete this parameter...
    Anyway, it is only used for XPath queries on more complicated XML objects... I have never had to use it for HTML...

    Greetings,
      Sebastian
  • skarab Member Posts: 10 Contributor II
    Hi,

    Defining the namespace makes no difference in my case; I still get this exception... I am using Java 1.6.0.16 on Vista.

    Regards
    Skarab
  • skarab Member Posts: 10 Contributor II
    Hi,

    I solved the problem...

    First I removed the page's original DOCTYPE declaration, matched by the regex
    <!DOCTYPE html PUBLIC [^>]*>
    using TextCleaner.

    After that I prepended a DOCTYPE that points to a local copy of the DTD:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "C:\workspace-rapidminer\xhtml1-transitional.dtd" >
    using SingleTextObjectInput, with its text parameter set to:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "C:\workspace-rapidminer\xhtml1-transitional.dtd" >%{loop_value}

    Here is my brute-force solution (I get the HTML page as a TextObject):

    <operator name="TextCleaner" class="TextCleaner">
        <parameter key="deletion_regex" value="&lt;!DOCTYPE html PUBLIC [^&gt;]*&gt;"/>
    </operator>
    <operator name="TextObject2ExampleSet" class="TextObject2ExampleSet">
        <parameter key="keep_text_object" value="true"/>
        <parameter key="text_attribute" value="my_doc_text"/>
        <parameter key="label_attribute" value="my_doc_label"/>
    </operator>
    <operator name="ValueIterator" class="ValueIterator" expanded="yes">
        <parameter key="attribute" value="my_doc_text"/>
        <operator name="SingleTextObjectInput" class="SingleTextObjectInput">
            <parameter key="text" value="&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;C:\workspace-rapidminer\xhtml1-transitional.dtd&quot; &gt;%{loop_value}"/>
        </operator>
    </operator>



    Regards,
    Wojtek
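
For readers who want to try the same idea outside RapidMiner, the two steps in the accepted solution (delete the page's DOCTYPE, then prepend one that names a local DTD) can be sketched in plain Python. This is only an illustration of the text rewriting, not of what the RapidMiner operators do internally; the function name use_local_dtd and the constant names are made up for this sketch, and the DTD path is simply the one quoted in the thread.

```python
import re

# Same pattern as the TextCleaner deletion_regex used in the thread.
DOCTYPE_RE = re.compile(r"<!DOCTYPE html PUBLIC [^>]*>")

# Replacement DOCTYPE copied from the thread; the path is the poster's
# local copy of the DTD, so adjust it to wherever your own copy lives.
LOCAL_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"C:\\workspace-rapidminer\\xhtml1-transitional.dtd" >'
)


def use_local_dtd(page: str, local_doctype: str = LOCAL_DOCTYPE) -> str:
    """Hypothetical helper: strip the page's DOCTYPE and prepend a local one."""
    # Step 1: remove the original declaration that points at www.w3.org.
    stripped = DOCTYPE_RE.sub("", page)
    # Step 2: put the local declaration in front of the remaining markup.
    return local_doctype + stripped


if __name__ == "__main__":
    sample = (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
        "<html><body>test</body></html>"
    )
    print(use_local_dtd(sample))
```

As in the process above, the new DOCTYPE is placed at the very start of the text, so anything that originally preceded the old declaration ends up after the new one.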