[SOLVED] JDom - Comment data cannot start with a hyphen

Scotty Member Posts: 6 Contributor II
edited November 2018 in Help
Good Afternoon,

I am using XPath to extract information from HTML documents that have been saved on my PC from a web crawl.

Everything seems to work OK, except that occasionally I get the following error:

the data "-10" is not legal for a JDOM comment: Comment data cannot start with a hyphen

When inspecting the HTML I find a comment beginning with a hyphen, which seems to be causing the problem.

Any ideas of how to get around this?

Many Thanks


  • Scotty Member Posts: 6 Contributor II
    It would appear that this was a bug in JDOM 1.0 that has been fixed in JDOM 1.1. The relevant changelog entry reads:

    "Removing check that a comment not start with a hyphen. A careful reading
    of production 15 in the XML 1.0 spec indicates leading hyphens are in
    fact allowed."

    Taken from http://jdom.markmail.org/message/b45honrv3crcmqux, posted 4 years ago.

    If this is the case, what does one need to do to solve the problem?
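
The changelog's reading of production 15 can be checked with a small sketch. This is not JDOM code: it assumes Python's `re` module, the function name `is_legal_comment_data` is hypothetical, and `Char` is simplified to "any character", so the XML character-range rules are not enforced.

```python
import re

# Production 15 of the XML 1.0 spec:
#   Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Simplification (assumption): "Char" is treated as any character.
_COMMENT_DATA = re.compile(r"(?:[^-]|-[^-])*")


def is_legal_comment_data(data: str) -> bool:
    """Return True if `data` may appear between '<!--' and '-->'."""
    return _COMMENT_DATA.fullmatch(data) is not None


print(is_legal_comment_data("-10"))   # True: a leading hyphen is allowed
print(is_legal_comment_data("a--b"))  # False: "--" may not occur inside
print(is_legal_comment_data("10-"))   # False: data may not end with "-"
```

Under this grammar the data "-10" from the error message is legal, which matches the reason given for removing the check in JDOM 1.1.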

  • Scotty Member Posts: 6 Contributor II
    Here is an example of the problem.

    Any ideas?

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.014">
      <operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
        <process expanded="true" height="449" width="710">
          <operator activated="true" class="web:get_webpage" compatibility="5.1.004" expanded="true" height="60" name="Get Page" width="90" x="45" y="75">
            <parameter key="url" value="http://www.talktalkmembers.com/forums/forumdisplay.php?f=9&amp;order=desc&amp;page=13"/>
            <list key="query_parameters"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.1.003" expanded="true" height="94" name="Process Documents" width="90" x="179" y="75">
            <parameter key="create_word_vector" value="false"/>
            <process expanded="true" height="449" width="710">
              <operator activated="true" class="text:extract_information" compatibility="5.1.003" expanded="true" height="60" name="Extract Information" width="90" x="112" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="Test" value="//h:h1/text()"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • manwann Member Posts: 7 Contributor II
    Hi, I have the same problem. I'm using the dataset from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-51/www/co-training/data/. When I try to load the data into RapidMiner using the 'Process Documents from Files' operator, it gives me the same error. Inside that operator I then added the 'Remove Document Parts' operator with the regular expression <![^>]*> as its parameter, but the error still appears.

    I would appreciate your help. Thanks.
  • Nils_Woehler Member Posts: 463 Maven

    Thanks for the hint. At the moment we are using JDOM 1.0, but we will update to the latest version of the library soon.

    Until then you could use the 'Remove Document Parts' operator with this regular expression: <!---.*-->
    This removes every comment that begins with a hyphen, which allows the Extract Information operator to work correctly.
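
The behaviour of the suggested expression can be sketched outside RapidMiner. The example below assumes Python's `re` module, whose greedy and non-greedy matching agrees with Java regular expressions for this simple pattern. How the operator applies the expression internally is not shown in this thread. The sample `html` string and the `NON_GREEDY` variant are illustrative assumptions, not part of the original suggestion.

```python
import re

# Pattern suggested in the thread: "<!--" followed by a hyphen,
# then anything (greedy, same line only), then "-->".
SUGGESTED = re.compile(r"<!---.*-->")

# Hypothetical stricter variant: stops at the first "-->" and may span lines.
NON_GREEDY = re.compile(r"<!---.*?-->", re.DOTALL)

# Made-up sample: one offending comment and one ordinary comment on one line.
html = "<p>a</p><!---10 --><h1>Title</h1><!-- ok --><p>b</p>"

print(SUGGESTED.sub("", html))   # <p>a</p><p>b</p>
print(NON_GREEDY.sub("", html))  # <p>a</p><h1>Title</h1><!-- ok --><p>b</p>
```

In this sample the greedy `.*` runs to the last `-->` on the line and so also removes the `h1` element, while the non-greedy variant removes only the offending comment. Whether this matters depends on how comments are laid out in the crawled pages.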
