
"FeatureExtraction from XML LibSVM Java"

jorno Member Posts: 7 Contributor II
edited May 2019 in Help
Hi all,

First of all, I would like to thank the RapidMiner guys for their great product!
Thanks a lot for the examples, the documentation and, of course, the wizards!
I would also like to thank Michael Wurst for the tutorial on his website (nemoz.org)!

----------------------------
I'm a newbie student and I have an assignment to classify URLs.
I read a lot of documentation and searched the forums, but I still have 2 problems (RapidMiner version 4.2).

I created one XML file per URL containing its features, sorted into 2 folders:
.\train\news\www.news1.de.xml
.\train\news\www.news2.de.xml
.\train\porn\www.porn1.de.xml
.\train\porn\www.porn2.de.xml

Each XML file looks like:
<myXML>
      <title> my title </title>
      <keywords> my keywords </keywords>
      <numberOfPages> 6 </numberOfPages>
</myXML>
----------------------------
1. When I run the process file (below) in RapidMiner with LibSVM, it says:
  "Message: This learning scheme does not have sufficient capabilities for the given data set: polynominal attributes not supported"
    I tried the "06_ExtractionAndWordVecotor.xml" example, but it gave me the same error.

2. I tried to load the model from Java, but I cannot work out how to load the features themselves instead of the whole text
  (TextInput instead of SingleTextInput?). The simple example works, but without the features.

I would really appreciate your help!
Thanks a lot for everything!
Jorno

---------------------------------------------
RAPID MINER CONFIGURATION FILE
---------------------------------------------
<?xml version="1.0" encoding="windows-1252"?>
<process version="4.4">

  <operator name="Root" class="Process" expanded="yes">   
      <parameter key="logverbosity"  value="init"/>
      <parameter key="random_seed"  value="2001"/>
      <parameter key="encoding"  value="SYSTEM"/>
      <operator name="Extractor" class="FeatureExtraction">
          <list key="texts">
            <parameter key="news"  value=".\train\news"/>
            <parameter key="porn"  value=".\train\porn"/>
          </list>
          <parameter key="default_content_type"  value=""/>
          <parameter key="default_content_encoding"  value="UTF-8"/>
          <parameter key="default_content_language"  value="english"/>
          <parameter key="use_content_attributes"  value="false"/>
          <parameter key="id_attribute_type"  value="long"/>
          <list key="attributes">
            <parameter key="title"  value="//*/title/text() "/>
            <parameter key="#numberOfPages"  value="//*/numberOfPages/text()"/>
            <parameter key="keywords"  value="//*/keywords/text()"/>
          </list>
          <list key="namespaces">
          </list>
      </operator>
      <operator name="TextInput" class="TextInput" expanded="yes">
          <list key="texts">
            <parameter key="news"  value=".\train\news"/>
            <parameter key="porn"  value=".\train\porn"/>
          </list>
          <parameter key="default_content_type"  value=""/>
          <parameter key="default_content_encoding"  value="UTF-8"/>
          <parameter key="default_content_language"  value="english"/>
          <parameter key="prune_below"  value="-1"/>
          <parameter key="prune_above"  value="-1"/>
          <parameter key="vector_creation"  value="TFIDF"/>
          <parameter key="use_content_attributes"  value="false"/>
          <parameter key="use_given_word_list"  value="false"/>
          <parameter key="return_word_list"  value="true"/>
          <parameter key="output_word_list"  value=".\train\training_words.txt"/>
          <parameter key="id_attribute_type"  value="long"/>
          <list key="namespaces">
          </list>
          <parameter key="create_text_visualizer"  value="true"/>
          <parameter key="on_the_fly_pruning"  value="-1"/>
          <parameter key="extend_exampleset"  value="true"/>
          <operator name="StringTokenizer" class="StringTokenizer">
          </operator>
          <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
          </operator>
          <operator name="TokenLengthFilter" class="TokenLengthFilter">
              <parameter key="min_chars"  value="3"/>
              <parameter key="max_chars"  value="2147483647"/>
          </operator>
          <operator name="PorterStemmer" class="PorterStemmer">
          </operator>
      </operator>
      <operator name="LibSVMLearner" class="LibSVMLearner">
          <parameter key="keep_example_set"  value="false"/>
          <parameter key="svm_type"  value="C-SVC"/>
          <parameter key="kernel_type"  value="linear"/>
          <parameter key="degree"  value="3"/>
          <parameter key="gamma"  value="0.0"/>
          <parameter key="coef0"  value="0.0"/>
          <parameter key="C"  value="0.0"/>
          <parameter key="nu"  value="0.5"/>
          <parameter key="cache_size"  value="80"/>
          <parameter key="epsilon"  value="0.0010"/>
          <parameter key="p"  value="0.1"/>
          <list key="class_weights">
          </list>
          <parameter key="shrinking"  value="true"/>
          <parameter key="calculate_confidences"  value="false"/>
          <parameter key="confidence_for_multiclass"  value="true"/>
      </operator>
      <operator name="ModelWriter" class="ModelWriter">
          <parameter key="model_file"  value=".\train\training_model.mod"/>
          <parameter key="overwrite_existing_file"  value="true"/>
          <parameter key="output_type"  value="Binary"/>
      </operator>
  </operator>

</process>





-----------------------------------------------
JAVA CODE
-----------------------------------------------
import java.io.File;
import java.io.IOException;

import com.rapidminer.RapidMiner;
import com.rapidminer.example.Example;
import com.rapidminer.example.ExampleSet;
import com.rapidminer.operator.IOContainer;
import com.rapidminer.operator.Model;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorChain;
import com.rapidminer.operator.OperatorCreationException;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.tools.OperatorService;

public class RapidMinerTextClassifier
{

  private OperatorChain wvtoolOperator;
  private Operator modelApplier;
  private Model model;

  public RapidMinerTextClassifier(File modelFile, File wordListFile)
        throws IOException, OperatorCreationException, OperatorException
  {

      //System.setProperty(RapidMiner.PROPERTY_RAPIDMINER_HOME, "C:\\Program Files\\Rapid-I\\RapidMiner\\lib"); //  "rapidminer.home"
      //System.setProperty("rapidminer.home", "D:\\Applications\\RapidMiner-4.2");
      System.setProperty("rapidminer.home", "C:\\Program Files\\Rapid-I\\RapidMiner");
     
      String pluginDirString = new File("C:\\Program Files\\Rapid-I\\RapidMiner\\lib\\plugins").getAbsolutePath();
      System.setProperty(RapidMiner.PROPERTY_RAPIDMINER_INIT_PLUGINS_LOCATION, pluginDirString);

      RapidMiner.init(false, false, false, true);
     
      // Create the text input operator and set the path to the word list you stored using Rapid Miner
      // As there is only a single text, we use the SingleTextInput operator
      wvtoolOperator = (OperatorChain) OperatorService.createOperator("SingleTextInput"); // I need TextInput ?????????????
     
      wvtoolOperator.setParameter("input_word_list", wordListFile.getAbsolutePath());

      // Add additional processing steps.
      // Note the setup must be same as the one you used when creating the classification model
      wvtoolOperator.addOperator(OperatorService.createOperator("StringTokenizer"));
      wvtoolOperator.addOperator(OperatorService.createOperator("EnglishStopwordFilter"));
      wvtoolOperator.addOperator(OperatorService.createOperator("TokenLengthFilter"));
      wvtoolOperator.addOperator(OperatorService.createOperator("PorterStemmer"));

      // Create the model applier
      modelApplier = OperatorService.createOperator("ModelApplier");

      // Load the model into a field of the class
      Operator modelLoader = OperatorService.createOperator("ModelLoader");
      modelLoader.setParameter("model_file", modelFile.getAbsolutePath());
      IOContainer container = modelLoader.apply(new IOContainer());
      model = container.get(Model.class);

  }

  public String apply(String text) throws OperatorException
  {

      // Set the text
      wvtoolOperator.setParameter("text", text);     
      //wvtoolOperator.setParameter("title", text);
      //wvtoolOperator.setParameter("keywords", text);
      //wvtoolOperator.setParameter("numberOfPages", int);
     
     
      // Call the text input operator
      IOContainer container = wvtoolOperator.apply(new IOContainer(model));

      // Call the model applier (the model was added already before calling the text input)
      container = modelApplier.apply(container);

      // Obtain the example set from the io container. It contains only a single example with our text in it.
      ExampleSet eset = container.get(ExampleSet.class);
      Example e = eset.iterator().next();

      // Print the label mapping and the confidence for each class, then return the predicted class name
      System.out.println(eset.getAttributes().getPredictedLabel().getMapping() + " " + e.getConfidence("porn") + " " + e.getConfidence("news"));
      return eset.getAttributes().getPredictedLabel().getMapping().mapIndex( (int)e.getPredictedLabel() );

  }

  public static void main(String args[]) throws Exception
  {
     
      // Create a text classifier
      RapidMinerTextClassifier tr = new RapidMinerTextClassifier(
            new File(
                  "C:\\Main\\eclipse\\workspace\\octopus\\RapidMiner\\train\\training_model.mod"),
            new File(
                  "C:\\Main\\eclipse\\workspace\\octopus\\RapidMiner\\train\\training_words.txt"));

      // Call the classifier with texts
      System.out.println("Test1:" + tr.apply("povrai xflick resolution gif"));
      System.out.println("Test2:" + tr.apply("workstation intel switch"));
      System.out.println("Test3:" + tr.apply("sex porn sex povrai xflick resolution gif"));

  }

}

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    First of all: please update to the current version, 4.4. I don't even remember which problems occurred back then...

    And now to your problem: you are trying to load structured information from an XML file, but you are using one of the TextInput operators, which are designed only for unstructured (plain) text. You have two possibilities. The first is to generate comma-separated files from your XML files and use the normal ExampleSource. For example, such a file could look like this:
    news, my title,my keywords, 6
    news, my title2,my keywords2, 4
    ...

    Another, perhaps easier, method for extracting data from structured files is the FeatureExtraction operator of the text plugin. There you can specify XPath expressions to extract the content of each of your three XML nodes. Each expression is assigned to its own attribute. But then you would have to do this in two steps to generate the correct label, because it's not inside the XML and hence cannot be extracted...
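
To make the two options above concrete, here is a minimal Java sketch (plain JDK, not RapidMiner code) that evaluates the same XPath expressions used in the process file against the sample XML and joins the results into one comma-separated row. The class name XmlFeatureRow and its methods are made up for illustration, the label is passed in by the caller because it is not part of the XML, and fields containing commas are not escaped.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of RapidMiner.
class XmlFeatureRow {

    // Builds one comma-separated row: label, title, keywords, numberOfPages.
    static String toCsvRow(String label, String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the string value of the first match.
            String title = xpath.evaluate("//*/title/text()", doc).trim();
            String keywords = xpath.evaluate("//*/keywords/text()", doc).trim();
            String pages = xpath.evaluate("//*/numberOfPages/text()", doc).trim();
            return String.join(",", label, title, keywords, pages);
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract features", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<myXML><title> my title </title>"
                + "<keywords> my keywords </keywords>"
                + "<numberOfPages> 6 </numberOfPages></myXML>";
        System.out.println(toCsvRow("news", xml));
    }
}
```

Rows produced this way have the same shape as the comma-separated example above, with the folder name (news or porn) supplied as the label.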

    Greetings,
      Sebastian
  • jorno Member Posts: 7 Contributor II
    Thanks a lot for your reply.

    I am sorry, I think I really am a newbie, because I didn't understand.

    1. As you can see in my configuration file, I used FeatureExtraction and XPath like you said (I am using version 4.4). I would be more than grateful if you could help me understand which operators/parameters I need to change for the model to run.

    2. The Java code is a different question: how do I add features to the code?

    Thanks a lot and sorry for the trouble,
    Thanks again
    Jorno.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Jorno,
    don't worry about that. It's a very complex field; nobody understands it all at once.

    For your first point: you don't need the text input at all. There isn't any plain text to load! All you want is to import the information stored in your XML files. So remove the TextInput operator.

    One hint: this is a complex setup for a beginner. Try to split it into substeps: first only load the data, so that all your features are stored as attributes of the appropriate type. Then try to learn a model, and finally do a validation. Do one step after the other...

    Greetings,
      Sebastian
  • jorno Member Posts: 7 Contributor II
    Thanks a lot for all your help, Sebastian!
    I spent over a week of days and nights :( on this since your last answers and I think I learned a lot,
    and I think it is working now :)

    What I did (XML attached):
    1. I added AttributeSubsetPreprocessing/Nominal2String for all the XPath attributes.
    2. I used StringTextInput (because I needed the stemmer etc.) with remove_original_attributes=true.

    The problem is that I think it treats all the feature strings as one chunk of text and not as separate strings for each feature
    (e.g. different word weights for words in "title" and for words in "description").
    Meaning: I think that the "title"/"keywords" features should have more influence than the "parseText" (all page text) feature, but I don't see that in the model :(

    Am I right? How do I do it?

    Thanks again !
    Jorno

    <operator name="Root" class="Process" expanded="yes">
        <description text="Octopus"/>
        <operator name="Extractor" class="FeatureExtraction">
            <list key="texts">
              <parameter key="news" value=".\train\news"/>
              <parameter key="porn" value=".\train\porn"/>
            </list>
            <parameter key="default_content_encoding" value="UTF-8"/>
            <parameter key="default_content_language" value="english"/>
            <list key="attributes">
              <parameter key="title" value="//*/title/text() "/>
              <parameter key="#redirectCount" value="//*/redirectCount/text()"/>
              <parameter key="description" value="//*/description/text()"/>
              <parameter key="keywords" value="//*/keywords/text()"/>
              <parameter key="parseText" value="//*/parseText/text()"/>
              <parameter key="metaAbstract" value="//*/metaAbstract/text()"/>
            </list>
            <list key="namespaces">
            </list>
        </operator>
        <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="attribute_name_regex" value="title|description|keywords|parseText|metaAbstract"/>
            <operator name="Nominal2String" class="Nominal2String">
            </operator>
        </operator>
        <operator name="StringTextInput" class="StringTextInput" expanded="yes">
            <parameter key="remove_original_attributes" value="true"/>
            <parameter key="return_word_list" value="true"/>
            <parameter key="output_word_list" value="OctopusWordList.txt"/>
            <list key="namespaces">
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
                <parameter key="min_chars" value="3"/>
            </operator>
            <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="PorterStemmer" class="PorterStemmer">
            </operator>
        </operator>
        <operator name="LibSVMLearner" class="LibSVMLearner">
            <parameter key="keep_example_set" value="true"/>
            <parameter key="kernel_type" value="linear"/>
            <list key="class_weights">
            </list>
            <parameter key="calculate_confidences" value="true"/>
        </operator>
        <operator name="ModelWriter" class="ModelWriter">
            <parameter key="model_file" value="OctopusModel.mod"/>
            <parameter key="output_type" value="Binary"/>
        </operator>
    </operator>
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Jorno,
    nice to hear that you got it! And learning can never be bad :)

    Back to the topic: your assumption is correct. StringTextInput treats all string attributes as one. The only way around this is to iterate over all the nominal attributes that should be converted into strings and convert them one after the other. Inside this iteration, you would then use StringTextInput and afterwards rename all the new features generated by it. For example, if you have a string attribute "title", you could rename each word attribute to "title_word".

    You will probably have to familiarize yourself with the FeatureIterator operator, regular expressions in general and ChangeAttributeNamesReplace.
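
The effect of the per-attribute renaming can be sketched outside RapidMiner. The plain-Java example below is only an illustration of the idea, not of how StringTextInput works internally: it counts the tokens of each field separately and names every resulting attribute "<field>_<token>", so that "news" in the title and "news" in the page text end up as two different features. The class PrefixedWordVector, its methods and the simple tokenizer are assumptions made for this sketch; stemming and stopword filtering are left out.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not part of RapidMiner.
class PrefixedWordVector {

    // Counts lower-cased tokens with at least minChars characters in one field
    // and names each resulting attribute "<field>_<token>".
    static Map<String, Integer> countField(String field, String text, int minChars) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() >= minChars) {
                counts.merge(field + "_" + token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Merges the per-field counts of one document into a single vector.
    // Keys cannot collide because every key carries its field name as prefix.
    static Map<String, Integer> vectorize(Map<String, String> fields, int minChars) {
        Map<String, Integer> vector = new TreeMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            vector.putAll(countField(field.getKey(), field.getValue(), minChars));
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Sports news today");
        fields.put("parseText", "Today in the news: more sports and more news");
        System.out.println(vectorize(fields, 3));
    }
}
```

With separate attributes per field, a linear learner is free to assign a different weight to a word depending on where it occurred.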

    If you get this to work, you may call yourself an experienced RapidMiner user :)

    Greetings,
      Sebastian
  • jorno Member Posts: 7 Contributor II
    Hi Sebastian, thanks again!
    I guess that I am not an experienced RapidMiner user (although I read so much documentation, so many forum posts etc.) :(

    I tried to build the process as you said, but I see 3 weird issues:

    1. After each iteration most of the features change their names, but not all of them?!
      This happens although I renamed all the features that don't contain the "feature_" string to "feature_<loop_feature>", using the "^[^feature_].*$" regex ...

    2. After all the FeatureIterator iterations I am not getting the manipulated ExampleSet, but the original ExampleSet with the nominal values?!
     (I tried playing with the work_on_input parameter, with no success ...)

    3. I also wonder how it can create the "output_word_list" for all the attributes ...

    I am so desperate ...
    and to think that afterwards I will also need to call the model from my Java code :(

    Thank you so much for your help!
    jorno

    <operator name="Root" class="Process" expanded="yes">
        <operator name="Extractor" class="FeatureExtraction">
           <list key="texts">
             <parameter key="news" value=".\train\news"/>
             <parameter key="porn" value=".\train\porn"/>
           </list>
           <parameter key="default_content_encoding" value="UTF-8"/>
           <parameter key="default_content_language" value="english"/>
           <list key="attributes">
             <parameter key="feature_title" value="//*/title/text() "/>
             <parameter key="#feature_redirectCount" value="//*/redirectCount/text()"/>
             <parameter key="feature_description" value="//*/description/text()"/>
             <parameter key="feature_keywords" value="//*/keywords/text()"/>
             <parameter key="feature_parseText" value="//*/parseText/text()"/>
             <parameter key="feature_metaAbstract" value="//*/metaAbstract/text()"/>
           </list>
           <list key="namespaces">
           </list>
       </operator>
       <operator name="FeatureIterator" class="FeatureIterator" expanded="yes">
           <parameter key="type_filter" value="nominal"/>
           <operator name="Nominal2String on current attribute only" class="AttributeSubsetPreprocessing" expanded="yes">
               <parameter key="condition_class" value="attribute_name_filter"/>
               <parameter key="attribute_name_regex" value="%{loop_feature}"/>
               <parameter key="deliver_inner_results" value="true"/>
               <operator name="Nominal2String (2)" class="Nominal2String">
               </operator>
           </operator>
           <operator name="StringTextInput" class="StringTextInput" expanded="yes">
               <parameter key="remove_original_attributes" value="true"/>
               <parameter key="return_word_list" value="true"/>
               <parameter key="output_word_list" value="C:\Main\eclipse\workspace\octopus\RapidMiner\OctopusWordList.txt"/>
               <list key="namespaces">
               </list>
               <parameter key="create_text_visualizer" value="true"/>
               <operator name="StringTokenizer" class="StringTokenizer">
               </operator>
               <operator name="TokenLengthFilter" class="TokenLengthFilter">
                   <parameter key="min_chars" value="3"/>
               </operator>
               <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
               </operator>
               <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
               </operator>
               <operator name="PorterStemmer" class="PorterStemmer">
               </operator>
           </operator>
           <operator name="ChangeAttributeNamesReplace" class="ChangeAttributeNamesReplace">
               <parameter key="attributes" value="^[^feature_].*$"/>
               <parameter key="replace_what" value="^"/>
               <parameter key="replace_by" value="%{loop_feature}_"/>
               <parameter key="apply_on_special" value="false"/>
           </operator>
       </operator>
       <operator name="LibSVMLearner" class="LibSVMLearner">
           <parameter key="keep_example_set" value="true"/>
           <parameter key="kernel_type" value="linear"/>
           <list key="class_weights">
           </list>
           <parameter key="calculate_confidences" value="true"/>
       </operator>
       <operator name="ModelWriter" class="ModelWriter">
           <parameter key="model_file" value="OctopusModel.mod"/>
           <parameter key="output_type" value="Binary"/>
       </operator>
    </operator>
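
A possible explanation for issue 1, assuming the attributes parameter of ChangeAttributeNamesReplace is interpreted as a java.util.regex pattern (an assumption, not verified against RapidMiner 4.4): "[^feature_]" is a negated character class, so "^[^feature_].*$" matches names whose first character is none of f, e, a, t, u, r or _, rather than names that do not start with "feature_". Word attributes such as "tennis" or "england" are therefore skipped. A negative lookahead expresses the intended condition. The class FeaturePrefixRegex below is a hypothetical demo in plain Java.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the two patterns, independent of RapidMiner.
class FeaturePrefixRegex {

    // Pattern from the process above: excludes a set of first characters.
    static final Pattern CHAR_CLASS = Pattern.compile("^[^feature_].*$");

    // Alternative: matches every name that does not start with "feature_".
    static final Pattern LOOKAHEAD = Pattern.compile("^(?!feature_).*$");

    static boolean matches(Pattern pattern, String name) {
        return pattern.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {"sport", "tennis", "england", "feature_title"};
        for (String name : names) {
            System.out.println(name
                    + " charClass=" + matches(CHAR_CLASS, name)
                    + " lookahead=" + matches(LOOKAHEAD, name));
        }
    }
}
```

In this demo "sport" is matched by both patterns, "tennis" and "england" only by the lookahead version, and "feature_title" by neither.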

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi again,
    I didn't think of this behavior. Hm. It's less elegant, but I will post a process below which shows a way around it...
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="sum"/>
        </operator>
        <operator name="IOStorer" class="IOStorer">
            <parameter key="name" value="es"/>
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="remove_from_process" value="false"/>
        </operator>
        <operator name="FeatureIterator" class="FeatureIterator" expanded="yes">
            <parameter key="filter" value=".*"/>
            <operator name="IOConsumer" class="IOConsumer">
                <parameter key="io_object" value="ExampleSet"/>
            </operator>
            <operator name="IORetriever" class="IORetriever">
                <parameter key="name" value="es"/>
                <parameter key="io_object" value="ExampleSet"/>
            </operator>
            <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" breakpoints="after" expanded="yes">
                <parameter key="condition_class" value="attribute_name_filter"/>
                <parameter key="attribute_name_regex" value="%{loop_feature}"/>
                <operator name="BinDiscretization" class="BinDiscretization">
                </operator>
            </operator>
            <operator name="IOStorer (2)" class="IOStorer">
                <parameter key="name" value="es"/>
                <parameter key="io_object" value="ExampleSet"/>
            </operator>
        </operator>
        <operator name="IORetriever (2)" class="IORetriever">
            <parameter key="name" value="es"/>
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
    </operator>
    By the way: if you rename the attributes, you have to rename them to something that includes the source attribute. Otherwise attributes might end up with the same name again...

    The output word list is now only one part of the needed preprocessing. You can easily save it using the option in the StringTextInput operator; remember to include the source attribute name in the file name, otherwise you will overwrite it...
    When applying the model you have to use the appropriate list for the given source attribute and afterwards do the renaming again... You might want to put the renaming into a separate process and call it during training and applying with the ProcessEmbedder, which turns that process into a sort of preprocessing model...

    Greetings,
      Sebastian
  • jorno Member Posts: 7 Contributor II
    Hi Sebastian,
    I am investing most of my time (except for 6-7 hours of sleep :) ) in RapidMiner,
    and I think I am finally getting into it
    (although I know nothing about AI/NLP algorithms and I am not a Java expert).
    I read the website RSS regularly, so I might even contribute one day soon :)

    I implemented your IORetriever idea, and it works fine.
    I had 2 issues, and I think I have the solutions:
    1. It creates 5 word files, one for each feature, so I think I will use the new "Script Operator" to add the feature name to each word and append all the words to one file, so I can use it from my Java code.
    2. For the classification process I need to build an ExampleSet (unlike the SingleTextInput classification example), so I thought of using the TextObject2ExampleSet.java file as example code.

    I haven't implemented my Java code yet, but I will do it soon.
    In the meantime I did some other stuff, and I think I misunderstand some basic RapidMiner concepts.

    1. The word_list concept
    --------------------
    a. I cannot understand why we need the word list at all. Why isn't the model file enough?
    b. As I understand the word file, it has counters for the number of documents, which seems weird to me
         (I think it is supposed to be a weighted count of words per category, no? I am probably missing something.)
         I think this is the reason that, when the model doesn't know how to categorize, it gives me the category that contains the most documents
         (maybe because it contains the most "http" words?)
    c. How would I know the threshold above which the model is sure of its category?
    d. Maybe I should create a "general" category that indicates low confidence?
    e. Maybe I should take the 2-3 best categories? How do I do that?

    2. Performance
    --------------
    I took the basic SingleTextInput classification example and used 45 classes/categories instead of 2.
    The model size was amazing: OctopusModel.mod is 3.6M and OctopusWordList.txt is 422K!
    But its apply() classification method in Java is rather slow. Can I do something about it?
    I have a good amount of RAM (it takes a lot of it) and configured Java accordingly, so that shouldn't be the problem.
    It is not a big deal, I just wonder if my parameters are OK.

    These are the parameters :
    StringTextInput:prune_below = 10 ( i tried several parameters to reduce the size)
    992 examples
    4256 string attributes
    Total number of Support Vectors: 809
    Bias (offset): -0.321 
    number of classes: 45

    Thanks a lot for everything !!!!!
    Jorno
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Jorno,
    I'm not quite sure that I understand what you are going to do. But whatever it is, it seems that you are willing to get there fast :)

    One small note: the easiest way to get a large amount of text into an example set is to construct an example set with a string attribute and then use the StringTextInput operator. That's probably easier than implementing a new operator...

    Unfortunately I cannot give you the theoretical background for understanding why all the data in the word list and the model must be saved; it would simply exceed the scope of this forum. If you'd like, you could participate in one of our seminars or webinars for more detailed information. In short: it is all needed for calculating different things...
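
As a rough illustration of why per-word document counts matter when vector_creation is set to TFIDF: a new text can only be turned into the same kind of vector the model was trained on if the number of training documents containing each word is known. The sketch below uses one common textbook definition, tf-idf = (term count / document length) * log(N / df). The exact formula and normalization used by the text plugin may differ, and the class TfIdfSketch and the document frequencies in main are hypothetical.

```java
// Hypothetical illustration of a textbook TF-IDF weight, not the plugin's exact formula.
class TfIdfSketch {

    // termCount: occurrences of the word in this document
    // docLength: number of tokens in this document
    // totalDocs: number of training documents (N)
    // docsContainingTerm: number of training documents containing the word (df)
    static double tfIdf(int termCount, int docLength, int totalDocs, int docsContainingTerm) {
        double tf = (double) termCount / docLength;
        double idf = Math.log((double) totalDocs / docsContainingTerm);
        return tf * idf;
    }

    public static void main(String[] args) {
        int totalDocs = 992; // number of examples mentioned in the thread
        // The document frequencies below are made-up values for illustration.
        System.out.println("frequent word: " + tfIdf(3, 100, totalDocs, 990));
        System.out.println("rare word:     " + tfIdf(3, 100, totalDocs, 20));
    }
}
```

A word that appears in nearly every document gets a weight close to zero, while a word concentrated in few documents gets a much larger one.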

    OK, finally I will give you at least one small piece of theory: an SVM can only distinguish between two classes, because it uses a separating hyperplane. If you have 45 classes, as in your case, you have to find a way to transform the problem into one with only 2 classes. One major approach is to learn 45 models: one for every class against all other classes. During application you have to apply these 45 models and assign the example to the one class that has the highest confidence when predicted against all others. Everything clear? :)
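
The decision step of this one-against-all scheme can be sketched in plain Java. The example assumes that each of the per-class models has already produced a confidence for its own class; how those values are obtained from RapidMiner is not shown. It also covers the earlier questions about a confidence threshold, a "general" fallback category and taking the 2-3 best categories. The class OneVsRestDecision, its methods and the numbers in main are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of RapidMiner.
class OneVsRestDecision {

    // Returns up to k class names ordered by descending confidence.
    static List<String> topK(Map<String, Double> confidences, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(confidences.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    // Returns the most confident class, or the fallback category when
    // even the best confidence stays below the threshold.
    static String decide(Map<String, Double> confidences, double threshold, String fallback) {
        List<String> best = topK(confidences, 1);
        if (best.isEmpty() || confidences.get(best.get(0)) < threshold) {
            return fallback;
        }
        return best.get(0);
    }

    public static void main(String[] args) {
        // Made-up confidences, one per "class against all others" model.
        Map<String, Double> confidences = new LinkedHashMap<>();
        confidences.put("news", 0.20);
        confidences.put("sports", 0.70);
        confidences.put("porn", 0.10);
        System.out.println(topK(confidences, 2));
        System.out.println(decide(confidences, 0.5, "general"));
    }
}
```

A suitable threshold is not something the code can supply; it would have to be chosen by checking predictions on held-out examples.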

    Greetings,
      Sebastian
  • jorno Member Posts: 7 Contributor II
    Thanks a lot, Sebastian,

    I didn't try anything unusual. My tutor's assignment is to classify websites into news, porn, entertainment, sports etc.
    So I just took the http://nemoz.org/joomla/content/view/65/53/lang,de/ example and used 45 "classes"/"groups". The SVM seems to classify rather OK for the 45 groups on the full page text. Forgive me, but I don't understand the theory behind it (word_list etc.).

    Then I tried to do it not for the full page text but for each "feature" (title/keywords...), similar to the XML I posted in this thread, and it seems very hard: (1) loading the "features" into an ExampleSet in Java, (2) the word list for each feature, etc.

    Believe me, one of my biggest dreams is to take one of these courses:
    http://rapid-i.com/content/view/73/148/
    http://rapid-i.com/content/view/87/149/
    http://rapid-i.com/content/view/125/150/
    but they cost too much for someone like me, plus flights. I searched for a webinar on your site but found nothing. Can you ask questions in the webinar :) ? How much does it cost? I would be happy to know.

    I really understand your consulting model, and I appreciate it a lot!
    You gave the world the open source, and I believe the whole community thanks you!
    Maybe you could think about a business model for "small" questions (like http://www.liveperson.com)? Just a small thought ...

    In any case, thank you so much for your great product and your help so far!
    Jorno
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Jorno,
    I think if you only use one of these features, it will not contain enough information to be sure about the class. You might try it yourself: look only at the content of this feature and guess the correct class. You probably won't be too successful...

    We are currently working with a webinar provider to build up the infrastructure. The webinars will be announced soon.

    In fact we do have something for smaller problems: telephone consulting, billed per hour. And less than an hour isn't really enough for such a complex field...

    Greetings,
      Sebastian