[SOLVED] JDom - Comment data cannot start with a hyphen

Scotty · November 2011

Good Afternoon,

I am using xpath to extract information from html documents that have been saved on my PC from a webcrawl.

Everything seems to work OK, except that occasionally I get the following error:

the data "-10" is not legal for a JDOM comment: Comment data cannot start with a hyphen

When inspecting the html I find



which seems to be causing the problem.

Any ideas of how to get around this?

Many Thanks
Scott

Scotty · November 2011

It would appear that this was a bug in JDOM 1.0 that has been fixed in JDOM 1.1.

Removing check that a comment not start with a hyphen. A careful reading
of production 15 in the XML 1.0 spec indicates leading hyphens are in
fact allowed.

taken from http://jdom.markmail.org/message/b45honrv3crcmqux posted 4 years ago.
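
For reference, production 15 of the XML 1.0 spec reads `Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'`. As I read it, every hyphen in the comment data must be followed by a non-hyphen character, so a leading hyphen is fine while "--" anywhere or a trailing "-" is not. A minimal Python sketch of that rule follows; the function name is made up for illustration, it ignores the Char range check, and it is not JDOM's actual code.

```python
def is_legal_comment_data(data: str) -> bool:
    """Check comment data against production 15 of XML 1.0 (hyphen rule only).

    Hypothetical helper for illustration; it does not validate that every
    character is in the XML Char range.
    """
    # Each '-' must be followed by a non-hyphen character, which is
    # equivalent to: no "--" inside the data and no '-' at the very end.
    return "--" not in data and not data.endswith("-")


# The data from the error message starts with a hyphen but is legal.
print(is_legal_comment_data("-10"))
```

By this reading "-10" passes, which matches the fix described in the quoted message.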

If this is the case, what does one need to do to solve the problem?

Thanks
S

Scotty · November 2011

Here is an example of the problem.

Any ideas?

Thanks
Scott

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.014">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
    <process expanded="true" height="449" width="710">
      <operator activated="true" class="web:get_webpage" compatibility="5.1.004" expanded="true" height="60" name="Get Page" width="90" x="45" y="75">
        <parameter key="url" value="http://www.talktalkmembers.com/forums/forumdisplay.php?f=9&amp;order=desc&amp;page=13"/>
        <list key="query_parameters"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.1.003" expanded="true" height="94" name="Process Documents" width="90" x="179" y="75">
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true" height="449" width="710">
          <operator activated="true" class="text:extract_information" compatibility="5.1.003" expanded="true" height="60" name="Extract Information" width="90" x="112" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Test" value="//h:h1/text()"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

manwann · December 2011

Hi, I have the same problem. I'm using the dataset from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-51/www/co-training/data/. When I try to load the data into RapidMiner using the 'process documents to files' operator, it gives me the same error. I then put the 'remove documents parts' operator inside it, with the regular expression <![^>]*> as its parameter, but the error still appears.

I would appreciate your help. Thanks.

Nils_Woehler · December 2011

Hi,

thanks for the hint. At the moment we are using JDom 1.0 but we will update it to the latest library version soon.

Until then you could use the 'Remove documents parts' operator with this regular expression: 
This removes every comment that starts with a hyphen, allowing the Extract Information operator to work correctly.
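
The regular expression itself did not survive in this copy of the post, so the following is only a sketch of a pattern that would do what is described, not necessarily the one originally posted. It is written in Python to make it easy to try outside RapidMiner; the function name is made up, and a `(?s)` prefix would be the assumed equivalent of `re.DOTALL` if the same pattern were pasted into the operator.

```python
import re

# Matches an HTML/XML comment whose data starts with a hyphen, e.g. <!---10-->.
# The negative lookahead (?!->) skips the empty comment <!---->, whose third
# hyphen belongs to the closing delimiter rather than to the comment data.
HYPHEN_COMMENT = re.compile(r"<!---(?!->).*?-->", re.DOTALL)


def strip_hyphen_comments(html: str) -> str:
    """Remove comments that begin with a hyphen; leave all other markup alone."""
    return HYPHEN_COMMENT.sub("", html)


sample = "a<!---10-->b<!-- keep -->c"
print(strip_hyphen_comments(sample))
```

Only the offending comments are dropped, so ordinary comments and the rest of the page reach the XPath step unchanged.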

Regards,
Nils
