Problem with text processing plugin

datasunny Member Posts: 11 Contributor II
edited November 2018 in Help
Hi all,

I encountered a problem with the RapidMiner text processing plugin.
The program was working fine before, but it now fails for some text files that contain non-ASCII characters.
The setup uses the "Process Documents from Files" operator, which contains:
Transform Cases -> Tokenize -> Filter Stopwords -> Stem -> Filter Tokens (by Length)

Is this a bug in the text processing plugin, or is something wrong with my setup/program? Thanks.

--------------------------------------------------------------------------------------------------
SEVERE: Process failed: operator cannot be executed (The name "lnêäûð6ûonxßvâisÿˆïqwòb-ûfåàãwcû-kžîžìeî" is not legal for JDOM/XML Namespace prefixs: Namespace prefixes cannot contain the character "ˆ".). Check the log messages...
org.jdom.IllegalNameException: The name "lnêäûð6ûonxßvâisÿˆïqwòb-ûfåàãwcû-kžîžìeî" is not legal for JDOM/XML Namespace prefixs: Namespace prefixes cannot contain the character "ˆ".
...
...
---------------------------------------------------------------------------------------------------
Exception in thread "main" org.jdom.IllegalNameException: The name "home" is not legal for JDOM/XML attributes: XML names cannot begin with the character "h".
at org.jdom.Attribute.setName(Attribute.java:361)
at org.jdom.Attribute.<init>(Attribute.java:228)
at org.jdom.Attribute.<init>(Attribute.java:276)
at org.jdom.DefaultJDOMFactory.attribute(DefaultJDOMFactory.java:93)
at org.jdom.input.SAXHandler.startElement(SAXHandler.java:544)
at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:388)
at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:851)
at com.rapidminer.operator.text.io.filereader.HTMLFileReader.readStream(HTMLFileReader.java:72)
at com.rapidminer.operator.text.io.filereader.AbstractFileReader.readFile(AbstractFileReader.java:37)
at com.rapidminer.operator.text.io.FileDocumentInputIterator.next(FileDocumentInputIterator.java:94)
at com.rapidminer.operator.text.io.FileDocumentInputIterator.next(FileDocumentInputIterator.java:43)
at com.rapidminer.operator.text.io.AbstractDocumentInputOperator.doWork(AbstractDocumentInputOperator.java:228)
at com.rapidminer.operator.Operator.execute(Operator.java:833)
at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:379)
at com.rapidminer.operator.Operator.execute(Operator.java:833)
at com.rapidminer.Process.run(Process.java:925)
at com.rapidminer.Process.run(Process.java:848)
at com.rapidminer.Process.run(Process.java:807)
at com.rapidminer.Process.run(Process.java:802)
at com.rapidminer.Process.run(Process.java:792)
at Filter.filter(PornFilter.java:84)
at Filter.main(PornFilter.java:128)

Answers

  • Nils_Woehler Member Posts: 463  Guru
    Hi,

    This seems to be an encoding problem. Did you try using another encoding? It can be set with the expert parameter "encoding".
    If this does not help can you please post a short process that helps us to reproduce the error? How to post a process is described here: http://rapid-i.com/rapidforum/index.php/topic,4654.0.html
    Furthermore is it possible to also send some part of the data that produced the error? Without it, the error is hard to reproduce.

    Best,
    Nils
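Before changing the "encoding" parameter, it can help to find out which input files are not valid in the encoding you expect. The following is a minimal Python sketch, assuming the files sit in a single directory and that UTF-8 is the expected encoding. The function name `files_failing_decode` is hypothetical and not part of RapidMiner.

```python
from pathlib import Path


def files_failing_decode(directory, encoding="utf-8"):
    """Return the names of files in `directory` whose bytes do not decode as `encoding`."""
    failing = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        try:
            path.read_bytes().decode(encoding)
        except UnicodeDecodeError:
            failing.append(path.name)
    return failing
```

Files reported by this check are candidates for a different "encoding" setting. Single-byte encodings such as Latin-1 accept any byte sequence, so the check is only informative for strict encodings like UTF-8.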
  • ielhassani Member Posts: 10 Contributor II
    Hi all,
    I have exactly the same problem:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
        <description>Reads collections of text from a set of directories, assigning each directory to a class (as specified by parameter text_directories), and transforms them into a TF-IDF or other word vector. Finally, an SVM is applied to model the input texts.</description>
        <parameter key="send_mail" value="always"/>
        <process expanded="true" height="377" width="480">
          <operator activated="true" class="text:process_document_from_file" compatibility="5.2.003" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
            <list key="text_directories">
              <parameter key="1" value="C:\Users\ahmad.ali\Documents\dossier-rapidminer\testmining\1"/>
              <parameter key="2" value="C:\Users\ahmad.ali\Documents\dossier-rapidminer\testmining\2"/>
              <parameter key="3" value="C:\Users\ahmad.ali\Documents\dossier-rapidminer\testmining\3"/>
            </list>
            <parameter key="prune_below_rank" value="5.0"/>
            <parameter key="prune_above_rank" value="5.0"/>
            <process expanded="true" height="490" width="570">
              <operator activated="true" class="web:extract_html_text_content" compatibility="5.2.000" expanded="true" height="60" name="Extract Content" width="90" x="112" y="30"/>
              <operator activated="true" class="text:tokenize" compatibility="5.2.003" expanded="true" height="60" name="Tokenize" width="90" x="315" y="30"/>
              <operator activated="true" class="text:generate_n_grams_characters" compatibility="5.2.003" expanded="true" height="60" name="Generate n-Grams (Characters)" width="90" x="450" y="30"/>
              <connect from_port="document" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Generate n-Grams (Characters)" to_port="document"/>
              <connect from_op="Generate n-Grams (Characters)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="x_validation" compatibility="5.1.002" expanded="true" height="112" name="Validation" width="90" x="313" y="30">
            <description>A cross-validation evaluating a decision tree model.</description>
            <process expanded="true" height="654" width="466">
              <operator activated="true" class="naive_bayes" compatibility="5.2.006" expanded="true" height="76" name="Naive Bayes" width="90" x="112" y="30"/>
              <connect from_port="training" to_op="Naive Bayes" to_port="training set"/>
              <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="654" width="466">
              <operator activated="true" class="apply_model" compatibility="5.2.006" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="5.2.006" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    And this is part of the data that produced the error:
    http://www.4shared.com/rar/AEL4eo-k/textmining.html

    Thank you.
  • Nils_Woehler Member Posts: 463  Guru
    Hi,

    What encoding do the HTML files use? When I open them, they look like this:

    ßÊÈ ÓÇáã ÇáÑÍÈí: ÊäØáÞ Çáíæã ÇáÏæÑÉ ÇáÈÑÇãÌíÉ ÇáÌÏíÏÉ ááÊáíÝÒíæä æÇáÇÐÇÚÉ æÈÑäÇãÌ ÇáÔÈÇÈ æÇáÊí ÊÓÊãÑ ØæÇá ÇÔåÑ
    which is not valid HTML.

    Best,
    Nils
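Text that looks like the sample above is often the result of decoding bytes with the wrong single-byte code page. As an illustration only: if one assumes the sample is Arabic text stored as Windows-1256 but displayed as Latin-1 (an assumption, since the thread does not confirm the files' encoding), the original characters can be recovered by reversing the wrong decode. The helper name `redecode` is hypothetical.

```python
def redecode(garbled, wrong="latin-1", actual="cp1256"):
    """Undo a wrong decode: turn the text back into bytes, then decode with `actual`."""
    return garbled.encode(wrong).decode(actual)


# First words of the garbled line quoted above.
sample = "ßÊÈ ÓÇáã ÇáÑÍÈí"
recovered = redecode(sample)
```

If `recovered` reads as sensible text in the expected language, the guessed code page is a good candidate for the operator's "encoding" parameter.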
  • vc126m Member Posts: 9 Contributor I

    Hi,

    I am executing the process through its .rmp file, and I get the error below when I try to run it:

    <em class="error">The operator class 'IntentsOperator' is unknown.</em>
    com.rapidminer.io.process.XMLImporter.attribute_not_found_unknown
    com.rapidminer.io.process.XMLImporter.attribute_not_found_unknown
    com.rapidminer.io.process.XMLImporter.attribute_not_found_unknown
    com.rapidminer.io.process.XMLImporter.attribute_not_found_unknown
    com.rapidminer.io.process.XMLImporter.attribute_not_found_unknown
    com.rapidminer.io.process.XMLImporter.attribute_not_found_unknown
    com.rapidminer.io.process.XMLImporter.attribute_not_found_unknown
    com.rapidminer.io.process.XMLImporter.attribute_not_found_unknown
    <em class="error">The output port <var>intents</var> is unknown at operator <var>intents</var>.</em>
    -- ADDING MACROS--
    test : test
    No filename given for result file, using stdout for logging results!
    Process C:\Users\vc126m\Documents\rapidminercommandexecutor\.RapidMiner5\repositories\Local Repository\processes\intents test.rmp starts
    Process failed: The dummy operator intents (replacing IntentsOperator) cannot be executed.
    Here: Process[1] (Process)
    subprocess 'Main Process'
    ==> +- intents[1] (dummy)
    Process not successful
    341 [Thread-3] INFO org.eclipse.jetty.util.log - Logging initialized @5395ms
    384 [Thread-3] INFO spark.embeddedserver.jetty.EmbeddedJettyServer - == Spark has ignited ...
    385 [Thread-3] INFO spark.embeddedserver.jetty.EmbeddedJettyServer - >> Listening on 0.0.0.0:4567
    387 [Thread-3] INFO org.eclipse.jetty.server.Server - jetty-9.3.6.v20151106
    421 [Thread-3] INFO org.eclipse.jetty.server.ServerConnector - Started [email protected]{HTTP/1.1,[http/1.1]}{0.0.0.0:4567}
    421 [Thread-3] INFO org.eclipse.jetty.server.Server - Started @5476ms

    Process finished with exit code 1

     

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,127  RM Data Scientist

    Hi,

    Your RapidMiner engine cannot find the extension that provides your custom operator "intents", so you need to make sure that this extension is loaded as well.

    Best,

    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
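To see which operator classes a process file refers to, and therefore which ones the engine must be able to resolve, the process XML can be scanned directly. The following is a minimal Python sketch, assuming the .rmp file uses the same `<operator class="...">` layout as the process XML posted earlier in this thread. The function name `operator_classes` and the sample XML string are made up for illustration.

```python
import xml.etree.ElementTree as ET


def operator_classes(process_xml):
    """Return the sorted, de-duplicated `class` values of all <operator> elements."""
    root = ET.fromstring(process_xml)
    return sorted({op.get("class") for op in root.iter("operator") if op.get("class")})


# Illustrative sample only, not the poster's actual process file.
sample = """<process version="5.3.013">
  <operator class="process" name="Process">
    <process>
      <operator class="read_csv" name="Read CSV"/>
      <operator class="IntentsOperator" name="intents"/>
    </process>
  </operator>
</process>"""

classes = operator_classes(sample)
```

Any class in the result that is not a core operator has to be provided by a loaded extension; in the log above, that would be 'IntentsOperator'. For a real file, the bytes of the .rmp file can be passed to the function instead of the sample string.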
  • vc126m Member Posts: 9 Contributor I

    Hello mschmitz, thank you very much for your reply. Could you please explain in detail how I can add that extension?

    Thanks,

    venkat

  • vc126m Member Posts: 9 Contributor I

    Hello, could you please help me with how to add the extension? I am using RM version 5.3.013.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    So it sounds like you created your own extension and compiled it? Is it a JAR file? If so, you will have to place it in your /.RapidMiner/extensions directory and then restart RapidMiner. I'm not sure where that directory is in v5.3.

    Did you use the developer's guide to make the extension? https://docs.rapidminer.com/developers/

  • vc126m Member Posts: 9 Contributor I

    I added the rapidminer.jar file to the ./RM/extensions directory, but I am still getting the same error:

    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter addMessage
    INFO: <em class="error">The operator class 'entity-extract' is unknown.</em>
    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter parseOperator
    INFO: The parameter 'host_name' is unknown for operator 'entity extract' (" dummy ")."
    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter parseOperator
    INFO: The parameter 'host_port' is unknown for operator 'entity extract' (" dummy ")."
    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter parseOperator
    INFO: The parameter 'path' is unknown for operator 'entity extract' (" dummy ")."
    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter parseOperator
    INFO: The parameter 'groupid' is unknown for operator 'entity extract' (" dummy ")."
    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter parseOperator
    INFO: The parameter 'libraryname' is unknown for operator 'entity extract' (" dummy ")."
    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter parseOperator
    INFO: The parameter 'languagename' is unknown for operator 'entity extract' (" dummy ")."
    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter parseOperator
    INFO: The parameter 'Username' is unknown for operator 'entity extract' (" dummy ")."
    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter parseOperator
    INFO: The parameter 'password' is unknown for operator 'entity extract' (" dummy ")."
    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter addMessage
    INFO: <em class="error">The input port <var>write</var> is unknown at operator <var>entity extract</var>.</em>
    May 24, 2017 2:27:20 PM com.rapidminer.io.process.XMLImporter addMessage
    INFO: <em class="error">The output port <var>entityextract</var> is unknown at operator <var>entity extract</var>.</em>
    May 24, 2017 2:27:20 PM CommandLine serverLoop
    SEVERE: -- ADDING MACROS--
    May 24, 2017 2:27:20 PM CommandLine serverLoop
    SEVERE: test : test
    May 24, 2017 2:27:20 PM com.rapidminer.tools.WrapperLoggingHandler log
    INFO: No filename given for result file, using stdout for logging results!
    May 24, 2017 2:27:20 PM com.rapidminer.Process run
    INFO: Process C:\Users\vc126m\Documents\rapidminercommandexecutor\.RapidMiner5\repositories\Local Repository\processes\entityextract.rmp starts
    May 24, 2017 2:27:20 PM CommandLine serverLoop
    SEVERE: Process failed: The dummy operator entity extract (replacing entity-extract) cannot be executed.
    com.rapidminer.operator.UserError: The dummy operator entity extract (replacing entity-extract) cannot be executed.
    at com.rapidminer.operator.DummyOperator.doWork(DummyOperator.java:88)
    at com.rapidminer.operator.Operator.execute(Operator.java:867)
    at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
    at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:711)
    at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:375)
    at com.rapidminer.operator.Operator.execute(Operator.java:867)
    at com.rapidminer.Process.run(Process.java:949)
    at com.rapidminer.Process.run(Process.java:873)
    at com.rapidminer.Process.run(Process.java:832)
    at com.rapidminer.Process.run(Process.java:827)
    at com.rapidminer.Process.run(Process.java:817)
    at CommandLine.serverLoop(CommandLine.java:153)
    at CommandLine.results(CommandLine.java:202)
    at CommandLine.main(CommandLine.java:220)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

    May 24, 2017 2:27:20 PM CommandLine serverLoop
    SEVERE: Here: Process[1] (Process)
    subprocess 'Main Process'
    +- Read CSV[1] (Read CSV)
    ==> +- entity extract[1] (dummy)
    May 24, 2017 2:27:20 PM CommandLine serverLoop
    SEVERE: Process not successful

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761   Unicorn

    I wonder if there's a Studio versioning issue. V5.3 is pretty old; have you tried it with v7.5?
