Relating multiple datasets and clustering: please help me

shahabshahab Member Posts: 8 Contributor II
edited December 2018 in Help

Hi everybody,

I want to read three datasets (CSV files): the first is user data with a user ID and other attributes, the second is movie data with a movie ID and other attributes, and the third is rating data with a user ID, a movie ID and other attributes.

After reading these three datasets, I want to use k-means clustering to cluster users based on their movie ratings. Can you help me?

Best Answer

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted

    Hi @shahab,

     

    Can you share your dataset(s) (and, if possible, your process) so that we can better understand and help you?

     

    Regards,

     

    Lionel

Answers

  • shahabshahab Member Posts: 8 Contributor II

    Sure.

    Because one of the files is large, I uploaded it to a file-hosting site:

    http://s8.picofile.com/file/8332549950/ratings.csv.html

    I have three datasets:

    1. Users, with some user data and a unique identifier (User ID).

    2. Movies, with some attributes such as genre names and a unique identifier (Movie ID).

    3. Ratings, with the User ID, the Movie ID and the users' ratings of the movies.

    I want to cluster users based on age, gender and movie ratings with k-means clustering.

    How can I do this? Thanks.

     

     

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @shahab,

     

    Here is a possible solution to cluster your data:

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="9.0.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.0.001" expanded="true" name="Process">
        <process expanded="true">
          <!-- Read users.csv: UserID, Gender, Age, Occupation, Zip-code -->
          <operator activated="true" class="read_csv" compatibility="9.0.001" expanded="true" height="68" name="Read CSV" width="90" x="112" y="85">
            <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Movies_ratings\users.csv"/>
            <parameter key="skip_comments" value="true"/>
            <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
            <list key="annotations"/>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="UserID.true.integer.attribute"/>
              <parameter key="1" value="Gender.true.polynominal.attribute"/>
              <parameter key="2" value="Age.true.integer.attribute"/>
              <parameter key="3" value="Occupation.true.integer.attribute"/>
              <parameter key="4" value="Zip-code.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
          </operator>
          <!-- Read ratings.csv: UserID, MovieID, Rating, Timestamp -->
          <operator activated="true" class="read_csv" compatibility="9.0.001" expanded="true" height="68" name="Read CSV (2)" width="90" x="112" y="187">
            <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Movies_ratings\ratings.csv"/>
            <parameter key="skip_comments" value="true"/>
            <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
            <list key="annotations"/>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="UserID.true.integer.attribute"/>
              <parameter key="1" value="MovieID.true.integer.attribute"/>
              <parameter key="2" value="Rating.true.integer.attribute"/>
              <parameter key="3" value="Timestamp.true.integer.attribute"/>
              <parameter key="4" value="att5.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
          </operator>
          <!-- Join users (left) and ratings (right) on UserID -->
          <operator activated="true" class="concurrency:join" compatibility="9.0.001" expanded="true" height="82" name="Join" width="90" x="380" y="85">
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="UserID" value="UserID"/>
            </list>
          </operator>
          <!-- Keep only the attributes used for clustering -->
          <operator activated="true" class="select_attributes" compatibility="9.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="514" y="85">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="Age|Gender|Rating"/>
          </operator>
          <!-- Convert the nominal Gender attribute to numeric values for k-means -->
          <operator activated="true" class="nominal_to_numerical" compatibility="9.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="648" y="85">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Gender"/>
            <list key="comparison_groups"/>
          </operator>
          <!-- k-means clustering with the operator's default parameters -->
          <operator activated="true" class="concurrency:k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering" width="90" x="849" y="85"/>
          <!-- Visualize the cluster model -->
          <operator activated="true" class="model_simulator:cluster_model_visualizer" compatibility="9.0.001" expanded="true" height="82" name="Cluster Model Visualizer" width="90" x="983" y="187"/>
          <!-- Wiring between the operators -->
          <connect from_op="Read CSV" from_port="output" to_op="Join" to_port="left"/>
          <connect from_op="Read CSV (2)" from_port="output" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
          <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Cluster Model Visualizer" to_port="model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Cluster Model Visualizer" to_port="clustered data"/>
          <connect from_op="Cluster Model Visualizer" from_port="visualizer output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    I hope it helps,

     

    Regards,

     

    Lionel

  • shahabshahab Member Posts: 8 Contributor II

    Hi Mr Lionel,

    I used your solution, but I couldn't solve my problem with it.

    Let me describe my data and application again.

    I have three datasets:

    Users dataset: contains user attributes, namely UserID; Gender; Age; Occupation; Zip-code.

    UserID is a unique ID, and Gender and Age are used for my sample and model.

    Movies dataset: consists of MovieID; Title; Genres.

    In this dataset MovieID is unique and the others are movie attributes.

    Ratings dataset: has UserID and MovieID as identifiers, and the ratings that users gave to movies of each genre.

    I want to cluster my users based on age, gender and favorite genre.

    Here, gender has two values (man, woman) and age can fall into four age ranges.

    Please help me.

    Thank you so much.

     

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @shahab,

    "I used your solution, but I couldn't solve my problem with it."

    Can you be more explicit?

    Personally, I don't know what to add to the process I shared...

     

    Regards,

     

    Lionel

  • shahabshahab Member Posts: 8 Contributor II

    Hi Mr Lionel, and thanks a lot.

    I attached the datasets that I uploaded previously.

    They are in the messages above.

    Do you need me to upload them again?

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @shahab,

     

    No need to upload your datasets again: I used these datasets to build the process I shared.

    You said: "I used your solution, but I couldn't solve my problem with it." But I don't understand why the process I shared doesn't solve your problem. Could you be more explicit about this "problem"? As I said, I don't know what to add to (or remove from) the process I shared.

     

    Regards,

     

    Lionel

     

  • shahabshahab Member Posts: 8 Contributor II

    Hi Lionel, and thanks a lot.

    Would you describe your process in detail?

    This process has two Read CSV operators, while we have three datasets in total.

    Which of the datasets must be imported in the two operators?

    Regards,

    Shahab

     

     

     

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @shahab,

     

    According to your first post, you want to cluster the data based on "age, gender and movie ratings".

    In the users.csv dataset, I have (among others) the following attributes: UserID, Gender and Age.

    In the ratings.csv dataset, I have (among others) the following attributes: UserID and Rating.

    Then I apply the Join operator to these two datasets, with UserID as the key attribute.

    The resulting dataset contains, among others, the following attributes: UserID, Gender, Age and Rating.

    Then I select only these three attributes (Gender, Age and Rating) and apply a clustering model...

    ...which, a priori, gives you what you want.

    So, in conclusion, there is no need for your third dataset (the Movies dataset).
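
    Outside RapidMiner, the join and selection steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the process itself: the sample rows and the helper name join_on_user_id are made up for the example, the join is assumed to be an inner join, and only the attribute names come from the shared process.

```python
# Hypothetical sample rows; only the attribute names come from the thread.
users = [
    {"UserID": 1, "Gender": "F", "Age": 25},
    {"UserID": 2, "Gender": "M", "Age": 35},
]
ratings = [
    {"UserID": 1, "MovieID": 10, "Rating": 5},
    {"UserID": 1, "MovieID": 11, "Rating": 3},
    {"UserID": 2, "MovieID": 10, "Rating": 4},
    {"UserID": 3, "MovieID": 12, "Rating": 2},  # no matching user: dropped by an inner join
]


def join_on_user_id(left, right):
    """Inner join of two lists of dicts on the UserID key (hypothetical helper)."""
    by_id = {row["UserID"]: row for row in left}
    joined = []
    for row in right:
        user = by_id.get(row["UserID"])
        if user is not None:
            joined.append({**user, **row})
    return joined


joined = join_on_user_id(users, ratings)

# Keep only the attributes used for clustering (the Select Attributes step).
selected = [{key: row[key] for key in ("Age", "Gender", "Rating")} for row in joined]

for row in selected:
    print(row)
```

    Note that the joined table has one row per rating, not one row per user, so a clustering run on it groups individual ratings. If one row per user is wanted, the ratings would first have to be aggregated per UserID (for example, averaged).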

     

    NB: If you want to cluster the data based on "age, gender and favorite genre" (as in another of your posts), you do indeed have to join the Movies dataset to the other datasets, so that in the end a single dataset holds the following attributes: UserID, Age, Gender and genre.

    After that, you can perhaps use the Aggregate operator to obtain the "favorite genre" for each UserID (and thus for each age and gender).
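
    As a rough illustration of that aggregation idea, here is a small Python sketch. It rests on several assumptions that are not confirmed in the thread: that the Genres attribute lists genres separated by "|", that the "favorite genre" of a user is the genre with the highest summed rating, and that ties are broken alphabetically. The sample rows and the function name favorite_genres are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample data; the "|" separator in Genres is an assumption.
movies = {
    10: "Comedy|Romance",
    11: "Action",
    12: "Comedy",
}
ratings = [
    {"UserID": 1, "MovieID": 10, "Rating": 5},
    {"UserID": 1, "MovieID": 12, "Rating": 4},
    {"UserID": 1, "MovieID": 11, "Rating": 2},
    {"UserID": 2, "MovieID": 11, "Rating": 5},
]


def favorite_genres(ratings, movies):
    """Return {UserID: genre with the highest total rating} (assumed definition)."""
    totals = defaultdict(Counter)
    for rating in ratings:
        genres = movies.get(rating["MovieID"])
        if genres is None:
            continue  # rating for an unknown movie: skipped, as in an inner join
        for genre in genres.split("|"):
            totals[rating["UserID"]][genre] += rating["Rating"]
    # Highest total first; ties are broken alphabetically to stay deterministic.
    return {
        user: min(counts, key=lambda genre: (-counts[genre], genre))
        for user, counts in totals.items()
    }


favorites = favorite_genres(ratings, movies)
print(favorites)
```

    The resulting favorite genre per UserID could then be joined back to the users table, giving one row per user with Age, Gender and favorite genre.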

     

    I hope it helps,

     

    Regards,

     

    Lionel

     

     

  • shahabshahab Member Posts: 8 Contributor II
    Hi Mr Lionel, and thanks for your explanation. But we have some problems with this part:
    "NB: If you want to cluster the data based on "age, gender and favorite genre" (as in another of your posts), you do indeed have to join the Movies dataset to the other datasets, so that in the end a single dataset holds the following attributes: UserID, Age, Gender and genre.

    After that, you can perhaps use the Aggregate operator to obtain the "favorite genre" for each UserID (and thus for each age and gender)."

    How can we combine all the datasets and produce output in the following format?

    Clusters of users described by ID, gender and age range (for example, we want to categorize users by age range, such as 5-10 as children (male and female), 10-18 as teenagers, and so on), together with their favorite genres.

    Please help us. Thanks a lot.

    Sincerely, Shahab

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    I am confused by your description of your desired task.  Do you actually want to cluster based on age and gender along with movie rating, or do you actually want to cluster only based on movie rating, based on a predefined set of gender and age splits?  Because these are two different tasks.
    If you want to do the latter, then you will need to create your age and gender bins and then run a separate clustering analysis for each of them (which you can do using loops).  This will give you clusters based on movie ratings within each group defined by age/gender.
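
    The binning and splitting step can be sketched as follows in Python. This is a minimal illustration under stated assumptions: the bin edges, the bin labels and the sample rows are hypothetical (the thread only mentions that there are four age ranges), and the per-group clustering itself is left as a comment.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bin edges and labels; the thread does not define the four ranges.
AGE_EDGES = [10, 18, 40]  # exclusive upper bounds of the first three bins
AGE_LABELS = ["child", "teenager", "adult", "senior"]


def age_range(age):
    """Map an age to its bin label: <10, 10-17, 18-39, 40 and over."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


# Hypothetical sample rows.
rows = [
    {"UserID": 1, "Gender": "F", "Age": 9, "Rating": 5},
    {"UserID": 2, "Gender": "F", "Age": 25, "Rating": 3},
    {"UserID": 3, "Gender": "M", "Age": 25, "Rating": 4},
    {"UserID": 4, "Gender": "F", "Age": 31, "Rating": 1},
]

# Split the rows into one group per (age range, gender) combination.
groups = defaultdict(list)
for row in rows:
    groups[(age_range(row["Age"]), row["Gender"])].append(row)

for key, members in sorted(groups.items()):
    # A separate clustering run on the ratings of `members` would go here.
    print(key, [member["UserID"] for member in members])
```
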
    If you want age, gender, and movie rating all to be used in clustering, make sure you have normalized your data first. But this is not going to give you discrete user categorizations (e.g., all females aged 20-29) for defining your clusters as you have described above.
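
    To illustrate why normalization matters here: Age spans a much wider numeric range than Rating or an encoded Gender, so without scaling it would dominate the distances that k-means uses. The Python sketch below applies a z-score normalization and then a plain Lloyd-style k-means. It is not RapidMiner's implementation; the sample rows, the 0/1 gender encoding, the choice of the first k points as initial centroids and k=2 are all assumptions made for the example.

```python
from statistics import mean, pstdev


def z_normalize(points):
    """Scale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*points))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # constant column: avoid dividing by 0
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100):
    """Plain Lloyd's algorithm; the first k points are the initial centroids."""
    centroids = list(points[:k])
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids.append(
                tuple(mean(col) for col in zip(*members)) if members else centroids[c]
            )
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels


# Hypothetical rows: (Age, Gender encoded as 0/1, Rating).
data = [(18, 0, 5), (20, 0, 4), (55, 1, 1), (60, 1, 2)]
labels = k_means(z_normalize(data), k=2)
print(labels)
```
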
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • shahabshahab Member Posts: 8 Contributor II
    Hi, and thanks. Please help me do both of these operations.
