RapidMiner

RM Staff

Re: Machine Learning, Prediction-SVM,Data-Mining

Hi,

What does your current process look like? Did you check our getting started video on X-Val (cross-validation)? https://rapidminer.com/training/videos/ (last video)

Cheers,

Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Contributor II arnab840

Re: Machine Learning, Prediction-SVM,Data-Mining

I uploaded the process

RM Certified Expert

Re: Machine Learning, Prediction-SVM,Data-Mining

Please post the XML of the process and the data file.

Contributor II arnab840

Re: Machine Learning, Prediction-SVM,Data-Mining

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.000" expanded="true" height="68" name="Retrieve Opn_score" width="90" x="45" y="34">
<parameter key="repository_entry" value="../Score/Opn_score"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.000" expanded="true" height="82" name="Set Role" width="90" x="45" y="136">
<parameter key="attribute_name" value="Openness"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.6.000" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="187"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="179" y="34">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="179" y="34"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="split_data" compatibility="7.6.000" expanded="true" height="103" name="Split Data" width="90" x="313" y="187">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<operator activated="true" class="support_vector_machine_linear" compatibility="7.6.000" expanded="true" height="82" name="SVM (Linear)" width="90" x="313" y="34"/>
<operator activated="true" class="apply_model" compatibility="7.6.000" expanded="true" height="82" name="Apply Model" width="90" x="447" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.000" expanded="true" height="68" name="Retrieve Heidi_Klum" width="90" x="45" y="340">
<parameter key="repository_entry" value="../Celebrity_processed_data/Refined/Heidi_Klum"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.6.000" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="45" y="238"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="380" y="340">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="45" y="34"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="246" y="34"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="7.6.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="447" y="187">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve Opn_score" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Split Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
<connect from_op="Split Data" from_port="partition 1" to_op="SVM (Linear)" to_port="training set"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="SVM (Linear)" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Retrieve Heidi_Klum" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
<connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

RM Certified Expert

Re: Machine Learning, Prediction-SVM,Data-Mining

Here's how I would do this project. I used the M5Rules model from the Weka extension and got an RMSE of 0.68 with some basic text processing (only transforming cases and tokenizing with pruning). I'm sure this can be optimized to get a lower RMSE.

 

Now go and visit my site for more RapidMiner tutorials. I would be very happy if you did and signed up for my newsletter

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.6.000" expanded="true" height="68" name="Retrieve Training_Agreeableness_Score" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../data/Training_Agreeableness_Score"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.6.000" expanded="true" height="82" name="Set Role" width="90" x="179" y="34">
        <parameter key="attribute_name" value="Agreeableness"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.6.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
        <parameter key="prune_method" value="percentual"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.000" expanded="true" height="145" name="Validation" width="90" x="581" y="34">
        <parameter key="sampling_type" value="shuffled sampling"/>
        <process expanded="true">
          <operator activated="true" class="weka:W-M5Rules" compatibility="7.3.000" expanded="true" height="82" name="W-M5Rules" width="90" x="253" y="34"/>
          <connect from_port="training set" to_op="W-M5Rules" to_port="training set"/>
          <connect from_op="W-M5Rules" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.6.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="7.6.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <connect from_op="Performance" from_port="example set" to_port="test set results"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.6.000" expanded="true" height="68" name="Retrieve Testing_J_Lopez" width="90" x="45" y="238">
        <parameter key="repository_entry" value="../data/Testing_J_Lopez"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.6.000" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="313" y="289">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="447" y="289">
        <parameter key="prune_method" value="percentual"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="45" y="34"/>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="179" y="34"/>
          <connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.6.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="782" y="238">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Retrieve Training_Agreeableness_Score" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Validation" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
      <connect from_op="Retrieve Testing_J_Lopez" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Contributor II arnab840

Re: Machine Learning, Prediction-SVM,Data-Mining

Dear Thomas,

Thank you so much for your help, but I guess I wasn't able to explain the problem properly! I also ran the process with SVM and got an RMSE of 0.601, which is even better than M5Rules, but this is not my problem.

My training data (text + score) contains the text of 250 anonymous Facebook/Twitter users. The Agreeableness score (on a scale of 1 to 5) is what those users gave themselves in a questionnaire-based psychology test. My goal is to predict the personality of different Twitter users (the test cases) from their tweets, using a model trained on that dataset. Now, when I get the result and go to Statistics in the Results panel, you will see that the average is 3.601. I just uploaded a screenshot!

 

RM Certified Expert

Re: Machine Learning, Prediction-SVM,Data-Mining

Ah, I see that now. It appears that your training set has only a few examples with low scores compared to many more in the higher score range, so things seem a bit skewed. My next thought is to maybe try some z-transformations. Hmm.
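
For reference, here is a minimal sketch of what a z-transformation does, written in plain Python outside RapidMiner. The scores are made up for illustration, and using the population standard deviation is an assumption (a tool may use the sample standard deviation instead). Note that the transformation is linear: it shifts and rescales the values but keeps their order and their skew.

```python
from statistics import mean, pstdev

def z_transform(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical scores on the 1-5 scale, skewed toward the high end.
# These are NOT the values from the thread's training set.
scores = [1.5, 3.5, 4.0, 4.0, 4.5, 4.5, 5.0]
z_scores = z_transform(scores)

for raw, z in zip(scores, z_scores):
    print(f"{raw:.1f} -> {z:+.3f}")
```

Because the order of the examples is unchanged, a z-transformation on its own would not be expected to fix an imbalance between low and high scores.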

Contributor II arnab840

Re: Machine Learning, Prediction-SVM,Data-Mining

Just copied it one more time to show! This is the problem: for every Twitter account it gives similar results, even after the z-transformation I just did!

And finally I fetched celebrity data (tweets) from Twitter as a testing set and applied that model to predict their personality, i.e. how open or how shy they are on a scale of 1 to 5. I get an average test result of 4.229 for Gal Gadot, but when I fetched some more celebrities it still gives me similar results: Heidi Klum 4.207, Hillary Clinton 4.229, Donald Trump 4.206, Leonardo DiCaprio 4.209!
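
Near-identical averages for every account may mean the model outputs almost the same number regardless of the text. One way to check this, sketched here in plain Python with made-up numbers (not the thread's data), is to compare the model's RMSE against the RMSE of always predicting the mean label, and to look at how much the predictions vary compared to the labels. The function names are hypothetical helpers, not RapidMiner operators.

```python
from math import sqrt
from statistics import mean, pstdev

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sqrt(mean([(a - p) ** 2 for a, p in zip(actual, predicted)]))

def mean_baseline_rmse(actual):
    """RMSE of a 'model' that always predicts the mean of the labels."""
    mu = mean(actual)
    return rmse(actual, [mu] * len(actual))

def spread_ratio(actual, predicted):
    """Standard deviation of the predictions relative to that of the labels.
    A value near 0 means the model outputs almost the same number every time."""
    label_spread = pstdev(actual)
    if label_spread == 0:
        raise ValueError("labels are constant; the ratio is undefined")
    return pstdev(predicted) / label_spread

# Hypothetical labels and predictions for illustration only.
labels = [2.0, 3.0, 4.0, 4.0, 5.0]
predictions = [3.5, 3.6, 3.6, 3.7, 3.6]

print("model RMSE:   ", round(rmse(labels, predictions), 3))
print("baseline RMSE:", round(mean_baseline_rmse(labels), 3))
print("spread ratio: ", round(spread_ratio(labels, predictions), 3))
```

If the model's RMSE is close to the baseline RMSE and the spread ratio is close to 0, the model has learned little beyond the average score, which would be consistent with every account getting roughly the same result.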

RM Certified Expert

Re: Machine Learning, Prediction-SVM,Data-Mining

I looked at this some more and I think the direction to go in is the text processing. You might want to review the type of tokenization you're doing and try something other than non-letters. When I add English stop word filtering and filter out tokens of <4 characters and >25 characters, I get something like 4 words in the resulting set.

 

I think a lot of information is being lost during tokenization and I would review that.
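
To illustrate how much a short tweet can lose, here is a rough sketch in plain Python that mimics splitting on non-letters, then filtering stop words and tokens outside 4 to 25 characters. The tweet and the tiny stop word list are invented for the example, and the functions are hypothetical stand-ins, not the Text Processing extension's operators.

```python
import re

# A tiny, hypothetical stop word list; real lists are much longer.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in", "so", "i", "my", "for", "this"}

def tokenize_non_letters(text):
    """Lowercase, then split on anything that is not a letter.
    Digits, hashtags, mentions and emoticons all act as separators."""
    return [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]

def filter_tokens(tokens, min_chars=4, max_chars=25):
    """Drop stop words and tokens shorter than min_chars or longer than max_chars."""
    return [
        t for t in tokens
        if t not in STOPWORDS and min_chars <= len(t) <= max_chars
    ]

# Made-up tweet for illustration only.
tweet = "So happy 4 my bday!! <3 #blessed @mom :)"
tokens = tokenize_non_letters(tweet)
kept = filter_tokens(tokens)

print(tokens)
print(kept)
```

In this toy case the emoticons, the digit and the short words disappear, so only three tokens remain. With real tweets, it may be worth checking the word list after each step to see where the vocabulary shrinks.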

Contributor II arnab840

Re: Machine Learning, Prediction-SVM,Data-Mining

I tried this already and yes, you are right. But I guess when I set the label I am not defining any negative attribute, so the SVM just matches the text and gives me a positive result. That might also be the problem.
