"HTML Content Extraction Error"

mgstick Member Posts: 5 Contributor II
edited May 2019 in Help
I'm crawling the web and then attempting to extract the content of the downloaded HTML pages using the Web Mining -> HTML Processing -> Extract Content operator.

My process successfully crawls the web and writes the HTML pages to disk. It then passes the Documents returned by the crawler to the Loop Collection operator, whose sub-process begins with the Unescape HTML Document operator and is then supposed to process each Document with the Extract Content operator. Unescape HTML Document at least seems to execute on each Document, but when the process reaches Extract Content I get the following error (see below for the full error message output):

          Process failed: org/apache/commons/lang/StringEscapeUtils (ProcessThread.run())
            java.lang.NoClassDefFoundError: org/apache/commons/lang/StringEscapeUtils

I've downloaded the Apache Commons Lang jar file (commons-lang-2.5.jar) and tried to make it available to RapidMiner, but with no luck. I tried adding it to my default CLASSPATH, adding it to the CLASSPATH from within my Terminal via the set command, and adding it explicitly on the RapidMiner command line, i.e. java -classpath lib/commons-lang-2.5.jar -jar lib/rapidminer.jar. In every case I still get the java.lang.NoClassDefFoundError: org/apache/commons/lang/StringEscapeUtils error.

I don't know what to try next. Any help would be greatly appreciated.

Thanks in advance for your help.
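
A note on the classpath attempts described above: when the JVM is started with -jar, the named jar file is the source of all user classes, and both the CLASSPATH variable and the -classpath option are ignored. That would explain why none of the three attempts changed the outcome. A minimal diagnostic sketch follows; the class name ClasspathCheck and its isLoadable helper are hypothetical and not part of RapidMiner. It prints the classpath the JVM actually uses and reports whether a given class can be found on it.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize = false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the error message; another name may be passed as an argument
        String name = args.length > 0
                ? args[0]
                : "org.apache.commons.lang.StringEscapeUtils";
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Running it as java -cp .:lib/commons-lang-2.5.jar ClasspathCheck (with ; instead of : on Windows) should show whether the jar itself is usable. Note that this only checks the plain application classpath, not whatever class loading RapidMiner performs internally.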


2010-10-28 11:58:03 SEVERE: Process failed: org/apache/commons/lang/StringEscapeUtils (ProcessThread.run())
  java.lang.NoClassDefFoundError: org/apache/commons/lang/StringEscapeUtils
      com.rapidminer.operator.web.html.HTMLTextExtractionOperator.doWork(HTMLTextExtractionOperator.java:324)
      com.rapidminer.operator.text.io.AbstractTokenProcessor.doWork(AbstractTokenProcessor.java:60)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
      com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
      com.rapidminer.operator.collections.CollectionIterationOperator.doWork(CollectionIterationOperator.java:90)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
      com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
      com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:368)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.Process.run(Process.java:863)
      com.rapidminer.Process.run(Process.java:770)
      com.rapidminer.Process.run(Process.java:765)
      com.rapidminer.Process.run(Process.java:755)
      com.rapidminer.gui.ProcessThread.run(ProcessThread.java:65)
Caused by:
  java.lang.ClassNotFoundException: org.apache.commons.lang.StringEscapeUtils
      java.net.URLClassLoader$1.run(URLClassLoader.java:202)
      java.security.AccessController.doPrivileged(Native Method)
      java.net.URLClassLoader.findClass(URLClassLoader.java:190)
      java.lang.ClassLoader.loadClass(ClassLoader.java:307)
      java.lang.ClassLoader.loadClass(ClassLoader.java:248)
      com.rapidminer.operator.web.html.HTMLTextExtractionOperator.doWork(HTMLTextExtractionOperator.java:324)
      com.rapidminer.operator.text.io.AbstractTokenProcessor.doWork(AbstractTokenProcessor.java:60)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
      com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
      com.rapidminer.operator.collections.CollectionIterationOperator.doWork(CollectionIterationOperator.java:90)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
      com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
      com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:368)
      com.rapidminer.operator.Operator.execute(Operator.java:768)
      com.rapidminer.Process.run(Process.java:863)
      com.rapidminer.Process.run(Process.java:770)
      com.rapidminer.Process.run(Process.java:765)
      com.rapidminer.Process.run(Process.java:755)
      com.rapidminer.gui.ProcessThread.run(ProcessThread.java:65)
2010-10-28 11:58:03 SEVERE: Here:          Process[1] (Process)
          subprocess 'Main Process'
            +- Crawl Web[1] (Crawl Web)
            +- Multiply[1] (Multiply)
            +- Data to Documents[1] (Data to Documents)
            +- Loop Collection[1] (Loop Collection)
          subprocess 'Iteration'
                  +- Unescape HTML Document[1] (Unescape HTML Document)
      ==>        +- Extract Content[1] (Extract Content)
                  +- Write Document[0] (Write Document) (ProcessThread.run())
Answers

  • mgstick Member Posts: 5 Contributor II
    Hi,

    I found a way to add jar files to RapidMiner's CLASSPATH by modifying the RapidMinerGUI script and then launching RapidMiner using that script.

    Thanks (though no one replied) :(
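
    The post does not say what was changed in the RapidMinerGUI script. One plausible reading is that the script launches Java with an explicit classpath and main class rather than with -jar, so an extra jar can simply be appended to that classpath. The sketch below builds such a command line under that assumption; the LaunchCommand class is hypothetical, and the main class name com.rapidminer.gui.RapidMinerGUI is assumed from the script's name and the package seen in the stack trace, not confirmed by the thread.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class LaunchCommand {

    /** Builds "java -cp <jars joined by the platform path separator> <mainClass>". */
    static List<String> build(String mainClass, String... jars) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows
        cmd.add(String.join(File.pathSeparator, jars));
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) {
        // Assumed main class and jar locations; adjust to the actual installation
        List<String> cmd = build(
                "com.rapidminer.gui.RapidMinerGUI",
                "lib/rapidminer.jar",
                "lib/commons-lang-2.5.jar");
        System.out.println(String.join(" ", cmd));
    }
}
```

    Unlike the -jar form, a command built this way lets the JVM see every jar listed after -cp.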