[NullPointerException] Text classification problem

Duha Member Posts: 12 Contributor II
edited October 2019 in Help
Hi!

I'm trying to apply the code from https://blog.codecentric.de/en/2013/03/java-based-machine-learning-by-classification/ to my process, which tests the classification of Arabic texts. I built the training and testing as two separate processes; now I only need to run the testing process.
Here's the XML:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true" height="414" width="762">
      <operator activated="true" class="retrieve" compatibility="5.3.000" expanded="true" height="60" name="Retrieve" width="90" x="112" y="75">
        <parameter key="repository_entry" value="wordlistAr"/>
      </operator>
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="30">
        <list key="text_directories">
          <parameter key="أخبار" value="C:\Users\WINDOWS 7\Desktop\rapid2\AraTest\New folder"/>
        </list>
        <parameter key="file_pattern" value="*"/>
        <parameter key="extract_text_only" value="true"/>
        <parameter key="use_file_extension_as_type" value="true"/>
        <parameter key="content_type" value="txt"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="false"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <process expanded="true" height="414" width="762">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_arabic" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (Arabic)" width="90" x="45" y="165"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="45" y="255">
            <parameter key="max_length" value="1"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Arabic)" to_port="document"/>
          <connect from_op="Filter Stopwords (Arabic)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="retrieve" compatibility="5.3.000" expanded="true" height="60" name="Retrieve (2)" width="90" x="447" y="30">
        <parameter key="repository_entry" value="modelAr"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.3.000" expanded="true" height="76" name="Apply Model" width="90" x="514" y="165">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Process Documents from Files" to_port="word list"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
And here is the Java code:
// Path to process-definition
final String processPath =
    "C:/Users/WINDOWS 7/.RapidMiner5/repositories/NewLocalRepository/TestNews.rmp";

// Init RapidMiner
RapidMiner.setExecutionMode(ExecutionMode.COMMAND_LINE);
RapidMiner.init();

try {
    // Load process
    final com.rapidminer.Process process =
        new com.rapidminer.Process(new File(processPath));

    // Load learned model
    final RepositoryLocation locWordList = new RepositoryLocation(
        "//NewLocalRepository/modelAr.model");
    final IOObject wordlist = ((IOObjectEntry)
        locWordList.locateEntry()).retrieveData(null);

    // Load Wordlist
    final RepositoryLocation locModel = new RepositoryLocation(
        "//NewLocalRepository/wordlistAr.wordlist");
    final IOObject model = ((IOObjectEntry)
        locModel.locateEntry()).retrieveData(null);

    final IOContainer ioInput = new IOContainer(new IOObject[] { wordlist, model });
    process.run(ioInput);
    process.run(ioInput);
    final long start = System.currentTimeMillis();
    final IOContainer ioResult = process.run();
    final long end = System.currentTimeMillis();
    System.out.println("T:" + (end - start));

    // Print some results
    final SimpleExampleSet ses = ioResult.get(SimpleExampleSet.class);
    for (int i = 0; i < Math.min(5, ses.size()); i++) {
        final Example example = ses.getExample(i);
        final Attributes attributes = example.getAttributes();

        final String id = example.getValueAsString(attributes.getId());
        final String prediction = example.getValueAsString(
            attributes.getPredictedLabel());

        System.out.println("Path: " + id + ":\tPrediction:" + prediction);
    }
} catch (Exception e) {
    e.printStackTrace();
}
}
It says the problem is with this line:

final IOObject wordlist = ((IOObjectEntry)
  locWordList.locateEntry()).retrieveData(null);

Thank you in advance

Answers

  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    when you debug your code you will see that

    locWordList.locateEntry()
    returns null, so calling retrieveData(null) on the result throws the NullPointerException. The reason is that you're trying to load a repository location which most likely does not exist. Check the path and make sure it matches exactly the one you see in the RapidMiner Studio GUI. If I had to guess, I'd say the correct path is more likely "//NewLocalRepository/modelAr".

    Unrelated: Why do you run the process 3 times in a row, discarding the results of the first 2 executions?

    Regards,
    Marco
  • Duha Member Posts: 12 Contributor II
    Hi!

    I did as you said and the process started. But I got the error "Cannot resolve relative repository location 'C:\Users\WINDOWS 7\.RapidMiner5\repositories\NewLocalRepository\wordlistAr'. Process is not associated with a repository."

    So I associated the process with the repository and retrieved the XML from it:
    RepositoryLocation pLoc = new RepositoryLocation("//NewLocalRepository/TestNews");
    ProcessEntry pEntry = (ProcessEntry) pLoc.locateEntry();
    String processXML = pEntry.retrieveXML();
    Process process = new Process(processXML);
    But I still get the same "Cannot resolve relative repository location" error, although the path to "wordlistAr" in the process is not relative! :-\
    What should I do?

    Regarding the multiple runs, they were there in the original code but I forgot to comment them out.

    Thank you
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    please post the current process XML and the current Java code if you still need help. Your last post was a bit confusing ;)

    Regards,
    Marco
  • Duha Member Posts: 12 Contributor II

    This is the XML of the testing process
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true" height="414" width="762">
          <operator activated="true" class="retrieve" compatibility="5.3.000" expanded="true" height="60" name="Retrieve" width="90" x="112" y="75">
            <parameter key="repository_entry" value="C:\Users\WINDOWS 7\.RapidMiner5\repositories\NewLocalRepository\wordlistAr"/>
          </operator>
          <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="30">
            <list key="text_directories">
              <parameter key="أخبار" value="C:\Users\WINDOWS 7\Desktop\rapid2\AraTest\New folder"/>
            </list>
            <parameter key="file_pattern" value="*"/>
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <process expanded="true" height="414" width="762">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_arabic" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (Arabic)" width="90" x="45" y="165"/>
              <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="45" y="255">
                <parameter key="max_length" value="1"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Arabic)" to_port="document"/>
              <connect from_op="Filter Stopwords (Arabic)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
              <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="retrieve" compatibility="5.3.000" expanded="true" height="60" name="Retrieve (2)" width="90" x="447" y="30">
            <parameter key="repository_entry" value="C:\Users\WINDOWS 7\.RapidMiner5\repositories\NewLocalRepository\modelAr"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.3.000" expanded="true" height="76" name="Apply Model" width="90" x="514" y="165">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Process Documents from Files" to_port="word list"/>
          <connect from_op="Process Documents from Files" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    And this is the Java code:
    RapidMiner.setExecutionMode(ExecutionMode.COMMAND_LINE);
    RapidMiner.init();

    try {
        RepositoryLocation pLoc = new RepositoryLocation("//NewLocalRepository/TestNews");
        ProcessEntry pEntry = (ProcessEntry) pLoc.locateEntry();
        String processXML = pEntry.retrieveXML();
        Process process = new Process(processXML);

        // Load learned model
        final RepositoryLocation locWordList = new RepositoryLocation(
            "//NewLocalRepository/modelAr");
        final IOObject wordlist = ((IOObjectEntry)
            locWordList.locateEntry()).retrieveData(null);

        // Load Wordlist
        final RepositoryLocation locModel = new RepositoryLocation(
            "//NewLocalRepository/wordlistAr");
        final IOObject model = ((IOObjectEntry)
            locModel.locateEntry()).retrieveData(null);

        final IOContainer ioInput = new IOContainer(new IOObject[] { wordlist, model });
        final IOContainer ioResult = process.run(ioInput);

        // Print some results
        final SimpleExampleSet ses = ioResult.get(SimpleExampleSet.class);
        for (int i = 0; i < Math.min(5, ses.size()); i++) {
            final Example example = ses.getExample(i);
            final Attributes attributes = example.getAttributes();

            final String id = example.getValueAsString(attributes.getId());
            final String prediction = example.getValueAsString(
                attributes.getPredictedLabel());

            System.out.println("Path: " + id + ":\tPrediction:" + prediction);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    }
    I'm getting this error: "Cannot resolve relative repository location 'C:\Users\WINDOWS 7\.RapidMiner5\repositories\NewLocalRepository\wordlistAr'. Process is not associated with a repository."


    Thank you very much
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    your "Retrieve" operators do not specify a repository location but an absolute path on the file system. That is not what these operators are for; they only work with repositories. If your process and your data are located in the same folder of the same repository, you can simply change the repository entry values to "wordlistAr" and "modelAr". They will then be looked up right next to the process.

    Also you are giving your process input data. That is not necessary as you have not connected the input ports on the left side of the process. That's where the input data would appear. If loading data via operators, no input data is needed.

    Regards,
    Marco
  • Duha Member Posts: 12 Contributor II
    I changed the repository entry value in the XML for both Retrieve operators as you said, and modified the code so that it no longer passes input:
    RapidMiner.setExecutionMode(ExecutionMode.COMMAND_LINE);
    RapidMiner.init();

    try {
        RepositoryLocation pLoc = new RepositoryLocation("//NewLocalRepository/TestNews");
        ProcessEntry pEntry = (ProcessEntry) pLoc.locateEntry();
        String processXML = pEntry.retrieveXML();
        Process process = new Process(processXML);
        final IOContainer ioResult = process.run();

        // Print some results
        final SimpleExampleSet ses = ioResult.get(SimpleExampleSet.class);
        for (int i = 0; i < Math.min(5, ses.size()); i++) {
            final Example example = ses.getExample(i);
            final Attributes attributes = example.getAttributes();

            final String id = example.getValueAsString(attributes.getId());
            final String prediction = example.getValueAsString(
                attributes.getPredictedLabel());

            System.out.println("Path: " + id + ":\tPrediction:" + prediction);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    }
    But I still get the same error, although the process and the data are in the same repository folder :'(
    Why does it say "Process is not associated with a repository."? I did associate it with a repository!

    Thank you
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    I just noticed you're loading your process but don't set its location. You basically take the XML and build a process from it, which is fine, but that process then knows nothing about where it originally came from. To fix that, add one line after your process creation:

    Process process = new Process(processXML);
    process.setProcessLocation(new RepositoryProcessLocation(pLoc));
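Why the location matters can be illustrated with a toy model of how a relative entry name is resolved. This is only a sketch, not RapidMiner's implementation: the class `ToyResolver`, its `resolve` method, and the rule that only names starting with `//` count as absolute repository locations are assumptions made for illustration.

```java
import java.util.Optional;

/** Toy model (hypothetical, not RapidMiner code) of resolving a Retrieve entry. */
final class ToyResolver {

    /**
     * @param entry           value of the repository_entry parameter
     * @param processLocation absolute location of the process, empty if it was never set
     */
    static String resolve(String entry, Optional<String> processLocation) {
        // Assumption: only "//Repository/..." counts as an absolute repository location.
        if (entry.startsWith("//")) {
            return entry;
        }
        // Everything else, including "C:\...", is treated as relative to the process.
        String base = processLocation.orElseThrow(() -> new IllegalStateException(
            "Cannot resolve relative repository location '" + entry
                + "'. Process is not associated with a repository."));
        // Swap the process name for the entry name: same folder as the process.
        return base.substring(0, base.lastIndexOf('/') + 1) + entry;
    }

    public static void main(String[] args) {
        Optional<String> location = Optional.of("//NewLocalRepository/TestNews");
        System.out.println(resolve("wordlistAr", location));
        try {
            resolve("wordlistAr", Optional.empty());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model a process built from a bare XML string corresponds to the `Optional.empty()` case, which is why a relative name such as "wordlistAr" cannot be resolved until a location is attached.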
    Regards,
    Marco
  • Duha Member Posts: 12 Contributor II
    Thank you very much it does work now, and I got the results  :D

    But there's one critical problem: I get wrong classification predictions. I calculated the prediction accuracy: when I run the process in the RapidMiner GUI it's about 80%, but when I run the same process in Java it drops sharply to about 11%, although both are testing the same dataset.
    Also, I get exactly the same predictions every time I run the process in Java.


    Another question, please: I'm planning to integrate the process with an Android application. I know it's not efficient, but I need it as a temporary solution.
    I want to take user input (a String) and pass it to the process instead of reading from files on the computer. Is there such a thing in RapidMiner? How can I do that?

    Thanks a lot
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    1)
    - are you using the same version of the Text extension in both GUI mode and your code?
    - is the random seed for the process identical in both GUI mode and your code?

    2) I think you want to create a document from the user input? If so, probably the easiest way is to use a macro. Replace the "Process Documents from Files" operator with a "Create Document" operator which delivers its data to a "Process Documents" operator. Before executing the process, set the macro like so:

    process.getMacroHandler().addMacro("user_input", "yourUserData");

    For the process itself, see below:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="6.4.000" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
            <parameter key="repository_entry" value="wordlistAr"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="6.4.000" expanded="true" height="60" name="Retrieve (2)" width="90" x="246" y="30">
            <parameter key="repository_entry" value="modelAr"/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="6.4.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="120">
            <parameter key="text" value="%{user_input}"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="6.4.000" expanded="true" height="94" name="Process Documents" width="90" x="246" y="120">
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.4.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:filter_stopwords_arabic" compatibility="6.4.000" expanded="true" height="60" name="Filter Stopwords (Arabic)" width="90" x="179" y="30"/>
              <operator activated="true" class="text:generate_n_grams_terms" compatibility="6.4.000" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="313" y="30">
                <parameter key="max_length" value="1"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Arabic)" to_port="document"/>
              <connect from_op="Filter Stopwords (Arabic)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
              <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="apply_model" compatibility="6.4.000" expanded="true" height="76" name="Apply Model" width="90" x="380" y="30">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Process Documents" to_port="word list"/>
          <connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model" to_port="model"/>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
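The `%{user_input}` placeholder in the Create Document operator above is filled in from the macro before the operator runs. A toy model of that substitution may make the mechanism clearer; it is a sketch only, and `ToyMacros` and `expand` are hypothetical names, not part of the RapidMiner API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy macro expander (hypothetical, for illustration only). */
final class ToyMacros {
    // Matches %{name} and captures the macro name.
    private static final Pattern MACRO = Pattern.compile("%\\{([^}]+)\\}");

    /** Replaces every %{name} with its value; unknown macros are left untouched. */
    static String expand(String text, Map<String, String> macros) {
        Matcher m = MACRO.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            String value = macros.get(m.group(1));
            out.append(value != null ? value : m.group());
            last = m.end();
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        Map<String, String> macros = new HashMap<>();
        macros.put("user_input", "yourUserData");
        System.out.println(expand("%{user_input}", macros));
    }
}
```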
    Regards,
    Marco
  • Duha Member Posts: 12 Contributor II
    Thanks for the process, I'll try it.

    About the extension: yes, it's the same version. I just put it in the lib/plugins folder. Is this all I have to do for the extension to work?
    How do I check whether the random seed is identical?

    Regards,
    Duha
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    yes, the lib/plugins folder is correct. You can call

    try {
        // Print the random seed configured on the root operator of the process
        System.out.println(myProcess.getRootOperator().getParameter(ProcessRootOperator.PARAMETER_RANDOM_SEED));
    } catch (UndefinedParameterError e) {
        // The parameter is not defined on this operator
        e.printStackTrace();
    }
    to check the random seed of the process and compare it with the random seed in the GUI. If that is not the cause, you can send me the data your process uses via PM, and I will have a look at what's going on.

    Regards,
    Marco
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    I just had a quick glance at what you sent me. I seem to get identical results in GUI and Java execution mode. However I do not have the time to dig deeper into it.
    If you do have support access with your Studio license, please contact us at https://support.rapidminer.com/ and we will investigate the issue further.
    Otherwise, if you are certain it is a bug on our end, you can file a bug at http://bugs.rapidminer.com/.

    Regards,
    Marco
  • Duha Member Posts: 12 Contributor II
    Ok thank you very much.
    Just to make sure: how do I correctly add the text mining extension to my Java project?

    Regards,
    Duha
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    when building your extension, you have to specify that you depend on the Text Extension. With our new Gradle extension mechanism, it looks like this:

    build.gradle

    [...]
    extensionConfig {
        name 'My Extension'

        namespace 'my_ext'

        dependencies {
            extension namespace: 'text', version: '5.3.002'
        }
    }
    [...]
    The dependency is there so that a) the Text Extension is downloaded when your extension is downloaded from the marketplace and b) you have access to its code at compile time.
    A dev kit of sorts is scheduled to be released within the next 30 days, so developing an extension for Studio 6.x should be significantly easier soon.

    Regards,
    Marco
  • Duha Member Posts: 12 Contributor II
    Thanks a lot
  • Duha Member Posts: 12 Contributor II
    Hi!
    Finally the problem is solved.
    The whole issue was the encoding of the testing text files. They are supposed to be UTF-8 because the text is Arabic; however, they were saved as ANSI.
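A file saved in a legacy single-byte encoding can often be detected before the process runs, because its bytes are usually not well-formed UTF-8. The following is a minimal sketch using only the Java standard library; the class name `Utf8Check` and the sample "ANSI" bytes are illustrative assumptions, not data from the thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Checks whether raw file bytes are well-formed UTF-8 (illustrative sketch). */
final class Utf8Check {

    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw on bad input instead of substituting characters.
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // An Arabic word written with Unicode escapes and encoded as UTF-8.
        byte[] utf8 = "\u0623\u062E\u0628\u0627\u0631".getBytes(StandardCharsets.UTF_8);
        // Hypothetical bytes as a single-byte Arabic code page might store the same word.
        byte[] ansi = {(byte) 0xC3, (byte) 0xCE, (byte) 0xC8, (byte) 0xC7, (byte) 0xD1};
        System.out.println("UTF-8 bytes valid: " + isValidUtf8(utf8));
        System.out.println("ANSI bytes valid:  " + isValidUtf8(ansi));
    }
}
```

Running such a check over the bytes of each test file (for example via `java.nio.file.Files.readAllBytes`) would flag files that need to be re-saved as UTF-8.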

    If you don't mind, I would like an explanation of the behavior of the following classifiers. I tested 3 classifiers on the same testing data but used 3 different datasets for training, and the accuracy results are as follows:


    - Naive Bayes:
    Testing accuracy using Data1 for training : 49%
    Testing accuracy using Data2 for training: 76%
    Testing accuracy using Data3 for training:  89%

    - SVM:
    Testing accuracy using Data1 for training: 50%
    Testing accuracy using Data2 for training: 25%
    Testing accuracy using Data3 for training:  72%

    - K-NN
    Testing accuracy using Data1 for training: 22%
    Testing accuracy using Data2 for training: 95%
    Testing accuracy using Data3 for training:  47%

    Note: Data1 (500 files, short texts), Data2 (500 files, long texts), Data3 (1615 files, long and short texts)

    The expected result was that the testing accuracy would keep increasing from Data1 to Data3. However, this is observed only with Naive Bayes, while the other two go up and down.

    Thank you,
    Duha
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hello Duha,

    why do you think the third one should be the best to train on?

    Did you use a validation technique like X-Validation? If you do, you will get not just the accuracy but also its standard deviation. This will help you interpret those results.
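To show what "accuracy plus standard deviation" means in practice, here is a small sketch that aggregates per-fold accuracies. The fold values are hypothetical, not results from this thread, and using the sample standard deviation (dividing by n - 1) is a choice made for the example, not a statement about how X-Validation computes it.

```java
/** Mean and sample standard deviation of per-fold accuracies (illustrative). */
final class FoldStats {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    static double sampleStdDev(double[] values) {
        double mean = mean(values);
        double squares = 0.0;
        for (double v : values) {
            squares += (v - mean) * (v - mean);
        }
        // Divide by n - 1 because the folds are treated as a sample.
        return Math.sqrt(squares / (values.length - 1));
    }

    public static void main(String[] args) {
        // Hypothetical accuracies of a 5-fold cross-validation.
        double[] folds = {0.78, 0.81, 0.74, 0.85, 0.79};
        System.out.printf("accuracy: %.3f +/- %.3f%n", mean(folds), sampleStdDev(folds));
    }
}
```

If two classifiers differ by less than roughly one standard deviation, the difference between them may well be noise rather than a real effect of the training set.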

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Duha Member Posts: 12 Contributor II
    Hi!

    I assumed they would behave similarly to what I found with Naive Bayes (the richer the training texts are in words, the more accurate the testing results). I think Data3 is the richest in content and yields the best wordlist to be used in testing. Am I wrong?

    Regards,
    Duha
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    you are partly right and partly wrong. Usually it is better to add more examples (= texts) to the training. The idea is that more examples add more information.
    In text mining, however, you also get more attributes the more examples you add. Most likely many of these attributes carry no information about the label, and in that case the learner might get confused.

    If you think about k-NN, you can easily imagine that. If you add more dimensions which are just uniformly distributed, the distance measure will be heavily influenced by those attributes and k-NN will get confused. For the SVM I would expect that you need a higher C to get the same results.
    You should try feature selection. I would suggest using Weight by SVM with Select by Weights and then training the algorithm afterwards. Pruning (in Process Documents) might also help.
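The k-NN argument can be made concrete with a small simulation. It is a sketch under simplifying assumptions: one informative dimension, uniform noise in all others, Euclidean distance, and a fixed seed; the class and method names are hypothetical. It measures how much farther the wrong-class point is than the right-class point as noise dimensions are added.

```java
import java.util.Random;

/** Shows how uninformative dimensions shrink the gap between near and far points. */
final class DistanceContrast {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    /** Relative gap (far - near) / near between the query's two candidate neighbours. */
    static double contrast(int noiseDims, long seed) {
        Random rnd = new Random(seed);
        double[] query = new double[1 + noiseDims];
        double[] sameClass = new double[1 + noiseDims];
        double[] otherClass = new double[1 + noiseDims];
        // Dimension 0 carries the label information.
        query[0] = 0.1;
        sameClass[0] = 0.0;
        otherClass[0] = 1.0;
        // All remaining dimensions are uniform noise, unrelated to the label.
        for (int i = 1; i <= noiseDims; i++) {
            query[i] = rnd.nextDouble();
            sameClass[i] = rnd.nextDouble();
            otherClass[i] = rnd.nextDouble();
        }
        double d1 = distance(query, sameClass);
        double d2 = distance(query, otherClass);
        return (Math.max(d1, d2) - Math.min(d1, d2)) / Math.min(d1, d2);
    }

    public static void main(String[] args) {
        for (int dims : new int[] {0, 10, 100, 1000}) {
            System.out.printf("noise dims = %4d, contrast = %.3f%n", dims, contrast(dims, 42L));
        }
    }
}
```

With no noise the two candidates are clearly separated; as noise dimensions pile up, both distances are dominated by the noise and become nearly equal, so the nearest neighbour is picked almost at random. Feature selection and pruning work by removing such dimensions.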


    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Duha Member Posts: 12 Contributor II

    Thanks a lot for the explanation. Now I get it.
  • mobile_miner Member Posts: 1 Contributor I
    Hi Duha,

    did you manage to integrate RapidMiner with Android? The app crashes as soon as I call RapidMiner.setExecutionMode(RapidMiner.ExecutionMode.APPSERVER);
    @RM: Apart from file system access, I don't see many obstacles to running RM on Android. Will there be any support for this, or would you recommend another way to perform the Apply Model task?
    Best
    John
  • Duha Member Posts: 12 Contributor II
    Hi!

    Sorry. Yes, the problem was solved, but honestly it's been a long time and I really don't remember anything.
    You can create a new thread explaining your problem, and I'm sure the experts here will help you.

    Regards,
    Duha