RapidMiner

Machine Learning, Prediction-SVM,Data-Mining

RMStaff

Re: Machine Learning, Prediction-SVM,Data-Mining

Hi,

 

What does your current process look like? Did you check our getting-started video for X-Val? https://rapidminer.com/training/videos/ (last video)

 

Cheers,

Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Regular Contributor

Re: Machine Learning, Prediction-SVM,Data-Mining

I uploaded the process


Moderator

Re: Machine Learning, Prediction-SVM,Data-Mining

Post the XML of the process and data file. 

Regular Contributor

Re: Machine Learning, Prediction-SVM,Data-Mining

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.6.000" expanded="true" height="68" name="Retrieve Opn_score" width="90" x="45" y="34">
<parameter key="repository_entry" value="../Score/Opn_score"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.6.000" expanded="true" height="82" name="Set Role" width="90" x="45" y="136">
<parameter key="attribute_name" value="Openness"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.6.000" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="187"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="179" y="34">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="179" y="34"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="split_data" compatibility="7.6.000" expanded="true" height="103" name="Split Data" width="90" x="313" y="187">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
</operator>
<operator activated="true" class="support_vector_machine_linear" compatibility="7.6.000" expanded="true" height="82" name="SVM (Linear)" width="90" x="313" y="34"/>
<operator activated="true" class="apply_model" compatibility="7.6.000" expanded="true" height="82" name="Apply Model" width="90" x="447" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.000" expanded="true" height="68" name="Retrieve Heidi_Klum" width="90" x="45" y="340">
<parameter key="repository_entry" value="../Celebrity_processed_data/Refined/Heidi_Klum"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.6.000" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="45" y="238"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="380" y="340">
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="45" y="34"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="246" y="34"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="7.6.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="447" y="187">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve Opn_score" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Split Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
<connect from_op="Split Data" from_port="partition 1" to_op="SVM (Linear)" to_port="training set"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="SVM (Linear)" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Retrieve Heidi_Klum" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
<connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>


Moderator

Re: Machine Learning, Prediction-SVM,Data-Mining

Here's how I would do this project. I used the M5Rules model from the Weka extension and got an RMSE of 0.68 with some basic text processing (only transforming cases and tokenizing with pruning). I'm sure this can be optimized to get a lower RMSE.

 


 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.6.000" expanded="true" height="68" name="Retrieve Training_Agreeableness_Score" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../data/Training_Agreeableness_Score"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.6.000" expanded="true" height="82" name="Set Role" width="90" x="179" y="34">
        <parameter key="attribute_name" value="Agreeableness"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.6.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
        <parameter key="prune_method" value="percentual"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.000" expanded="true" height="145" name="Validation" width="90" x="581" y="34">
        <parameter key="sampling_type" value="shuffled sampling"/>
        <process expanded="true">
          <operator activated="true" class="weka:W-M5Rules" compatibility="7.3.000" expanded="true" height="82" name="W-M5Rules" width="90" x="253" y="34"/>
          <connect from_port="training set" to_op="W-M5Rules" to_port="training set"/>
          <connect from_op="W-M5Rules" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.6.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="7.6.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <connect from_op="Performance" from_port="example set" to_port="test set results"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.6.000" expanded="true" height="68" name="Retrieve Testing_J_Lopez" width="90" x="45" y="238">
        <parameter key="repository_entry" value="../data/Testing_J_Lopez"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.6.000" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="313" y="289">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="447" y="289">
        <parameter key="prune_method" value="percentual"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="45" y="34"/>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="179" y="34"/>
          <connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.6.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="782" y="238">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Retrieve Training_Agreeableness_Score" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Validation" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
      <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
      <connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
      <connect from_op="Retrieve Testing_J_Lopez" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
      <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Regular Contributor

Re: Machine Learning, Prediction-SVM,Data-Mining

Dear Thomas,

Thank you so much for your help, but I guess I wasn't able to explain the problem properly! I also ran the process with SVM and got an RMSE of 0.601, which is even better than M5Rules, but this is not my problem.

My training data (text + score) contains the text of 250 anonymous Facebook/Twitter users, where the Agreeableness score (on a scale of 1 to 5) is what each user reported about themselves in a questionnaire-based psychology test. My goal is to predict the personality of different Twitter users (the test cases) from their tweets, using a model trained on that dataset. Now, when I get the result and go to the statistics in the results panel, you will see that the average is 3.601. I just uploaded a screenshot!
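
One way to check whether a low RMSE actually means the model has learned anything is to compare it against a constant baseline that always predicts the mean label. The sketch below is only an illustration in Python, not part of the RapidMiner process; the `labels` list is made up and the `rmse` helper is a hypothetical name. It shows that the baseline's RMSE equals the standard deviation of the labels, so a model whose RMSE is close to that value may be doing little more than predicting the average.

```python
from math import sqrt
from statistics import mean, pstdev


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


# Made-up labels on the 1-5 scale (NOT the real training data).
labels = [4.0, 3.5, 4.5, 3.0, 4.0, 2.0, 4.5, 3.5]

# Baseline "model": always predict the mean of the training labels.
baseline_predictions = [mean(labels)] * len(labels)
baseline_rmse = rmse(labels, baseline_predictions)

# The baseline RMSE is the population standard deviation of the labels.
print(f"baseline RMSE = {baseline_rmse:.3f}, label std = {pstdev(labels):.3f}")
```

If the standard deviation of the real Agreeableness scores turns out to be close to the reported RMSE, that would be consistent with the model predicting nearly the same value for everyone.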

 


Moderator

Re: Machine Learning, Prediction-SVM,Data-Mining


Ah, I see that now. It appears that your training set has only a few examples with low scores compared with many more examples in the higher score range. It seems that things are a bit skewed. My next thought is to maybe try some z-transformations. Hmm.
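
For reference, a z-transformation subtracts the mean and divides by the standard deviation. Here is a minimal sketch in Python (not RapidMiner XML); the `scores` list is invented for illustration and `z_transform` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing to scale, return zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up, skewed labels on the 1-5 scale (NOT the real training data).
scores = [4.0, 4.5, 3.5, 4.0, 1.5, 4.5]
z_scores = z_transform(scores)

print([round(z, 2) for z in z_scores])
```

Keep in mind that this is a linear rescaling: it changes the location and scale of the values but not the shape of the distribution, so a skewed set of labels stays skewed afterwards.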

Regular Contributor

Re: Machine Learning, Prediction-SVM,Data-Mining

I just copied it one more time to show! This is the problem: for every Twitter account it gives similar results, even after the z-transformation I just did!

Finally, I fetched celebrity tweets from Twitter as a test set and applied the model to predict their personality: how open or how shy they are on a scale of 1 to 5. I get an average test result of 4.229 for Gal Gadot, but when I fetched some more celebrities it still gives me very similar results: Heidi Klum 4.207, Hillary Clinton 4.229, Donald Trump 4.206, Leonardo DiCaprio 4.209!
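
To put a number on how similar these results are, here is a small Python sketch (outside RapidMiner) that computes the range and standard deviation of the five averages quoted above. The values are the ones from this post; the variable names are just for illustration.

```python
from statistics import pstdev

# Average predicted scores per celebrity, as quoted in this post.
celebrity_averages = {
    "Gal Gadot": 4.229,
    "Heidi Klum": 4.207,
    "Hillary Clinton": 4.229,
    "Donald Trump": 4.206,
    "Leonardo DiCaprio": 4.209,
}

values = list(celebrity_averages.values())
spread = max(values) - min(values)  # difference between highest and lowest average
std = pstdev(values)                # population standard deviation of the averages

print(f"range = {spread:.3f}, std = {std:.4f}")
```

The gap between the highest and lowest average is far smaller than the RMSE of 0.601 from cross-validation, so the differences between these accounts are well within the model's error.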

Moderator

Re: Machine Learning, Prediction-SVM,Data-Mining

I was looking more at this, and I think the direction to go in is the text processing. You might want to review the type of tokenization you're doing and use something other than non-letters. When I add English stop word filtering and filter out tokens shorter than 4 characters or longer than 25 characters, I get only about 4 words in the resulting set.

 

I think a lot of information is being lost during tokenization and I would review that.
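
To illustrate how quickly tokens disappear, here is a rough Python sketch of the same kind of pipeline: split on non-letters, drop stop words, then keep only tokens of 4 to 25 characters. It only approximates what the RapidMiner operators do. The tweet text and the tiny `STOPWORDS` set are made up for the example, and the function names are hypothetical.

```python
import re

# Tiny illustrative stop word list (the real English list is much longer).
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it", "so", "my", "i"}


def tokenize_non_letters(text):
    """Lowercase the text and split it on every run of non-letter characters."""
    return [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]


def filter_tokens(tokens, min_len=4, max_len=25, stopwords=STOPWORDS):
    """Drop stop words and tokens outside the given length range."""
    return [t for t in tokens if t not in stopwords and min_len <= len(t) <= max_len]


# Made-up tweet, NOT taken from the actual data set.
tweet = "OMG so happy 2day!! luv my new job :) #blessed @ the NYC office"

tokens = tokenize_non_letters(tweet)
kept = filter_tokens(tokens)

print(len(tokens), "tokens before filtering:", tokens)
print(len(kept), "tokens after filtering:", kept)
```

In this toy case the emoticon, the digit in "2day", and all the short slang words are gone after filtering, which is the kind of information loss that could matter for tweets.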

Regular Contributor

Re: Machine Learning, Prediction-SVM,Data-Mining

I tried this already, and yes, you are right. But I guess that when I set the label I am not defining any negative attribute, so the SVM is just matching the text and giving me a positive result. That might also be the problem.