User k-NN - How to get the list of user IDs that a recommendation is generated from?

RionArisu Member Posts: 13 Contributor I
edited January 2021 in Help
Taking the user - book rating dataset as example:
User | Book | Rating
User 1 | Book 1 | 5
User 2 | Book 1 | 4
User 2 | Book 3 | 3
...

We can use the User k-NN operator from the Recommender extension to find out which books we should recommend to users, based on the similarity of their book preferences to those of other users. The output looks like:
User | Recommended book
User 1 | Book 3
User 1 | Book 5
User 2 | Book 6
...
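
To make the setup concrete, here is a minimal sketch in Python (the names and ratings are made up, not from a real dataset) of the rating data as a user-to-ratings map, together with the question user-based filtering starts from: which books, rated by others, has a given user not rated yet?

```python
# Toy version of the rating table above: user -> {book: rating}.
ratings = {
    "User 1": {"Book 1": 5},
    "User 2": {"Book 1": 4, "Book 3": 3},
}

def unseen_books(user, ratings):
    """Books rated by at least one other user but not by `user`."""
    seen = set(ratings[user])
    return {book for r in ratings.values() for book in r} - seen

print(unseen_books("User 1", ratings))  # {'Book 3'}
```

User k-NN then scores these unseen books using the ratings of the user's most similar neighbours.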

However, is there any way to find out who these 'similar' users are that the recommendation comes from?
Current output: We recommend Book 3 to User 1
Expected output: We recommend Book 3 to User 1 because of 85% similarity to User X

I have tried using the Cross Distances operator to calculate the distance between different users and find the users with the shortest distance. However, cross distance treats both of these scenarios as similar:
 1. Two users who have read the same book
 2. Two users who have not read the same book
while User k-NN's similarity is based solely on #1, two users who have read the same book.
Hence, it turns out that the book recommendations are not always drawn from the users with the shortest Euclidean distance.
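
Here is a minimal sketch of why the two disagree, assuming (as an illustration, not as the exact behaviour of either operator) zero-filled rating vectors where 0 means "not read": Euclidean distance over the full vectors rewards shared zeros, while a similarity restricted to co-rated books ignores them.

```python
import math

# Hypothetical zero-filled rating vectors over five books (0 = not read).
u = [5, 0, 3, 0, 0]  # user A
v = [5, 0, 3, 0, 0]  # user B: same two books read, same ratings
w = [5, 4, 3, 2, 1]  # user C: read everything, agrees on the co-rated books

# Euclidean distance over the full vectors: the shared zeros
# ("neither has read it") pull A and B together.
print(math.dist(u, v))  # 0.0
print(math.dist(u, w))  # > 0, although C agrees wherever both rated

# Cosine similarity restricted to co-rated books, which is closer to
# how user-based k-NN compares two users:
def corated_cosine(a, b):
    idx = [i for i in range(len(a)) if a[i] and b[i]]
    if not idx:
        return 0.0
    dot = sum(a[i] * b[i] for i in idx)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in idx))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in idx))
    return dot / (norm_a * norm_b)

print(corated_cosine(u, w))  # 1.0: perfect agreement where both rated
```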

Answers

  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    When you apply the k-NN model you get not only the classification (prediction) but also a confidence for each class. You can use these attributes to figure out the many-to-many relationships between users and books, e.g. using the Aggregate operator.
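    For an unweighted vote, the confidence k-NN reports for a class is simply the share of the k nearest neighbours carrying that label. A tiny illustration in Python (the neighbour labels are made up):

    ```python
    from collections import Counter

    # Labels of a user's 3 nearest neighbours (hypothetical).
    neighbour_labels = ["Book 3", "Book 3", "Book 5"]

    counts = Counter(neighbour_labels)
    confidences = {label: n / len(neighbour_labels) for label, n in counts.items()}
    print(confidences)  # {'Book 3': 0.666..., 'Book 5': 0.333...}
    ```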
  • RionArisu Member Posts: 13 Contributor I
    edited January 2021
    The standard k-NN model does provide a confidence for each class. However, the operator I'm using for this use case is "User k-NN" from the Recommender extension, and if I'm not mistaken, its default output only provides the list of recommendations, without a confidence for each class.

    P/S: Sorry, I did not mention clearly in the description earlier that it's "User k-NN"; I have re-edited the description now.
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    Ok, I know this is not a very elegant hack, but after you've got your recommendations you could use them as labels to train a standard k-NN model, which could give you the confidence factors for each label rather than calculating a separate distance matrix.
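    A rough sketch of this hack, using scikit-learn's k-NN as a stand-in for RapidMiner's k-NN operator (the rating matrix and the recommendation labels are made up):

    ```python
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # One row per user: zero-filled ratings over five books (illustrative).
    X = np.array([
        [5, 0, 3, 0, 0],
        [4, 0, 3, 0, 0],
        [0, 4, 0, 5, 1],
    ])
    # Labels = the recommendation each user got from the recommender step.
    y = ["Book 5", "Book 5", "Book 2"]

    knn = KNeighborsClassifier(n_neighbors=2, metric="cosine")
    knn.fit(X, y)

    # Applying the model back to the training set yields, per user,
    # a confidence (neighbour vote share) for every label.
    print(knn.classes_)
    print(knn.predict_proba(X))
    ```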
  • RionArisu Member Posts: 13 Contributor I
    Thanks for your suggestion.
    I have tried to perform the steps below:
    1. Use "User k-NN" (k=3, not weighted) to generate recommendations
    2. Use "k-NN" (k=3, not weighted, cosine similarity) on the same dataset, using the user column as the label

    However, after I take the 3 users with the highest confidence factors from step 2, I'm not able to match them with the results generated at step 1.

    For example, step 1 recommended Book 5 to User 1.
    However, in step 2, none of the 3 most similar users has rated Book 5.

    Do you have any idea what could possibly cause the difference?
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    edited January 2021
    I am sure you have great reasons to use the Recommender extension, but if it does not give you what is needed, perhaps you can simply use k-NN, since that is the recommender model you are using anyway; it would return all the info you require in one hit.

    However, if you decide to use a two-step solution with two k-NNs, one for recommendations and another for identifying the likely recommender, the discrepancy you get may be due to several issues.

    (1) It seems that the initial k-NN uses the Euclidean metric while the latter k-NN relies on the cosine metric, and the two metrics are very different (if you were to measure the similarity between two stars, the first measures the physical distance between each star and the observer, and the latter the angular separation between the stars in the eye of the observer). So if you use different similarity measures you will end up with different nearest neighbours in each case - I think this explains your results (see the sketch after point (4)).

    (2) Assume that both k-NNs use the same similarity measure. If there is confusion between several possible answers, you may decide to use a 1-NN in the latter case to get the single best match for your recommender.

    (3) It is also possible that several neighbours are equidistant from your solution, and a random equidistant neighbour may be returned - not necessarily the same one for both models.

    (4) It is also possible that you are getting different answers because the two models undertake different pre-processing steps, e.g. elimination of missing values, conversion of nominals to numerical values, or normalisation/standardisation of values. So if there are any pre-processing options available in both, switch them off and do the pre-processing manually.
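
    To illustrate point (1), a minimal sketch (made-up 2-D vectors) where the nearest neighbour of the same query differs under the two metrics:

    ```python
    import math

    query = [1, 1]
    a = [2, 2]  # same direction as the query, but further away in space
    b = [1, 0]  # close in space, but pointing a different way

    def cosine_distance(p, q):
        dot = sum(x * y for x, y in zip(p, q))
        return 1 - dot / (math.hypot(*p) * math.hypot(*q))

    # Euclidean: b is the nearest neighbour (~1.41 vs 1.0).
    print(math.dist(query, a), math.dist(query, b))
    # Cosine: a is the nearest neighbour (0.0 vs ~0.29).
    print(cosine_distance(query, a), cosine_distance(query, b))
    ```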

    Note that a typical process for measuring distances with k-NN, in systems which (unlike RapidMiner) do not return the likelihoods of neighbour recommendations directly, is to create the k-NN model first and then apply it to the training set to obtain the likelihood measurements (which often come separately).
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    I suspect that the k-NN in the recommender is using mixed measures, as your data is a mix of numerical and nominal attributes.
  • RionArisu Member Posts: 13 Contributor I
    Thank you very much for your insights.
    From the description of "User k-NN", it seems to be using cosine similarity as well. I have used the same pre-processing steps for both, except for one extra step (pivoting the data) when feeding the data to "k-NN". Nonetheless, I understand your point; (3) may still happen even if all other steps are the same.

    If I would like to build a user-based collaborative filtering recommender system from scratch to align both results (recommendations and nearest neighbours), would these be the steps to follow (a sketch is included after the list)? Or is there an easier way of doing this?
    Step 1. Use the k-NN or cosine similarity operator to find the top nearest neighbours for each user
    Step 2. Loop over all users; for all the items not rated by each user, calculate the average score given by his neighbours
    Step 3. Loop over all users; select the top-scoring items from the result of Step 2
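    A from-scratch sketch of these three steps in pure Python (the ratings, k, and top_n values are illustrative assumptions). Because the recommendations are derived directly from the stored neighbour list, both outputs stay aligned by construction:

    ```python
    import math

    ratings = {
        "User 1": {"Book 1": 5, "Book 2": 3},
        "User 2": {"Book 1": 4, "Book 3": 3},
        "User 3": {"Book 2": 4, "Book 3": 5, "Book 4": 2},
    }

    def cosine(r1, r2):
        """Cosine similarity over co-rated books only."""
        common = set(r1) & set(r2)
        if not common:
            return 0.0
        dot = sum(r1[b] * r2[b] for b in common)
        n1 = math.sqrt(sum(r1[b] ** 2 for b in common))
        n2 = math.sqrt(sum(r2[b] ** 2 for b in common))
        return dot / (n1 * n2)

    def recommend(user, ratings, k=2, top_n=2):
        # Step 1: top-k nearest neighbours by cosine similarity.
        neighbours = sorted(
            ((cosine(ratings[user], ratings[other]), other)
             for other in ratings if other != user),
            reverse=True)[:k]
        # Step 2: average the neighbours' ratings for unrated books.
        scores = {}
        for book in {b for _, other in neighbours for b in ratings[other]}:
            if book in ratings[user]:
                continue
            votes = [ratings[other][book] for _, other in neighbours
                     if book in ratings[other]]
            scores[book] = sum(votes) / len(votes)
        # Step 3: top-scoring books, returned together with the
        # neighbours they came from.
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        return top, neighbours

    print(recommend("User 1", ratings))
    # ([('Book 3', 4.0), ('Book 4', 2.0)], [(1.0, 'User 3'), (1.0, 'User 2')])
    ```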

    P/S:
    The key reason I'm using "User k-NN" from the extension is that it provides the recommendation result directly (Step 3), while "k-NN" only provides the most similar users (Step 1). However, in my case both results are required.