
How to analyze negative attribute values

atul_kotwale Member Posts: 5 Contributor II
edited December 2018 in Help

Hi,

I am trying to build a prediction model that predicts the category of a case from its description. I have two training data sets. The first contains the case ID, description, and category.

ID   Description      Category
1    "some txt"       A
2    "some text 2"    B

 

The second data set contains the following rows, which tell me which category a case should not fall into.

 

ID   Description            Category
1    "some other txt"       notA
2    "some other text 2"    notB

 

I want to train my model using both data sets, but I am having trouble feeding the second data set to the model. I want to feed it in such a way that it gives the model correct information. Any help would be great. Thanks!

 

Best Answers

  • atul_kotwale Member Posts: 5 Contributor II
    Solution Accepted

    Hi @kypexin

    Thanks for the reply. I am also considering not including the negative results, but I have one more thought: what if I convert the negative data set to the format below, assigning 0 to the category that is not possible and 1 to every category that is still possible?

     

    ID   Description            A   B   C
    1    "some other txt"       0   1   1
    2    "some other text 2"    1   0   1

     

    and similarly convert the positive data set to the format below:

    ID   Description      A   B   C
    1    "some txt"       1   0   0
    2    "some text 2"    0   1   0

     

    If I feed the above data to my model, would it confuse the model?

     

    Thanks
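The conversion described in this post can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the category list A, B, C is taken from the tables above, the "not" prefix is assumed to be the only way a negative label is written, and `encode_label` is a hypothetical helper name, not a RapidMiner operator.

```python
# Assumption: the full set of categories is known up front.
CATEGORIES = ["A", "B", "C"]


def encode_label(label, categories=CATEGORIES):
    """Turn a label into one 0/1 flag per category.

    A positive label such as "A" gives a single 1.
    A negative label such as "notA" gives 0 for A and 1 for every other category.
    """
    if label in categories:
        return [1 if c == label else 0 for c in categories]
    if label.startswith("not"):
        excluded = label[len("not"):]
        if excluded in categories:
            return [0 if c == excluded else 1 for c in categories]
    raise ValueError(f"unrecognized label: {label!r}")


# Placeholder rows copied from the tables in the thread.
positive = [("some txt", "A"), ("some text 2", "B")]
negative = [("some other txt", "notA"), ("some other text 2", "notB")]

rows = [(text, encode_label(label)) for text, label in positive + negative]
for text, flags in rows:
    print(text, flags)
```

Note that rows coming from the negative data set end up with more than one 1, so the result is a multi-label table rather than a single-label one.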

     

     

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    Solution Accepted

    Hi @atul_kotwale,

    That's one way of doing it, yes.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • atul_kotwale Member Posts: 5 Contributor II
    Solution Accepted

    Thanks @mschmitz

Answers

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @atul_kotwale

     

    I am afraid you will have to think about reformulating the task. You cannot have 'negative' labels like the ones you described.

     

    For example, if "some other text 2" = notB, then it is either A or notA, and notA here means the third category, C.

    On the other hand, "some txt" = A is obviously also notB.

     

    So you can only have an example that belongs to some category; you cannot label an example as not belonging to some category.

  • kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @atul_kotwale

     

    Yes.

    Your first example is marked as both B and C, which again is not possible in terms of ML data.

    There should be only one "1" in each row if you want to predict category A, B, or C for any given description.

    But this is a slightly different task from what you initially had in mind: this way you just categorize each text separately, and not much more. For example, both "some other text 2" and "some txt" belong to category A (as I understand it, that's not what you want to achieve).

     

    More generally speaking, you cannot feed the model two different data sets whose categories have different meanings.

    The model should still work with a single data set, in our case this one, where all examples are actually different:

     

    ID   Description            A   B   C
    1    "some other txt"       0   1   1
    2    "some other text 2"    1   0   1
    3    "some txt"             1   0   0
    4    "some text 2"          0   1   0
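The "only one 1 in each row" rule from this post is easy to check mechanically. Here is a small sketch in plain Python, assuming the combined table above is held as a list of tuples; `is_single_label` is a hypothetical helper name used only for illustration.

```python
# The combined table from the post: (ID, description, flags for A, B, C).
table = [
    (1, "some other txt", [0, 1, 1]),
    (2, "some other text 2", [1, 0, 1]),
    (3, "some txt", [1, 0, 0]),
    (4, "some text 2", [0, 1, 0]),
]


def is_single_label(flags):
    """A row is usable for single-label classification only if exactly one flag is 1."""
    return sum(flags) == 1


# Collect the IDs of rows that carry more than one (or no) category.
ambiguous = [row_id for row_id, _, flags in table if not is_single_label(flags)]
print(ambiguous)  # [1, 2]
```

Rows 1 and 2, the ones that came from the negative data set, are the ones flagged.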

  • atul_kotwale Member Posts: 5 Contributor II

    @kypexin Thanks. I got it now.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    Hi @atul_kotwale,

    One idea for using it is to build a "Not_A model". You then score the other data set with it and use confidence(not_a) as a new variable for further modelling.

     

    BR,

    Martin

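A rough sketch of the "Not_A model" idea in plain Python may help make it concrete. Several things here are assumptions and not part of the thread: the learner is a tiny hand-written Naive Bayes standing in for whatever learner you would actually use, the helper names `train_binary_nb` and `confidence_true` are hypothetical, and the texts are the placeholder rows from the question, which are far too small to train anything meaningful.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train_binary_nb(texts, targets):
    """Fit a minimal multinomial Naive Bayes; targets are booleans."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter(targets)
    for text, target in zip(texts, targets):
        word_counts[target].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def confidence_true(model, text):
    """Return the model's estimate of P(target is True | text)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for cls in (True, False):
        total_words = sum(word_counts[cls].values())
        # Laplace smoothing on both the class prior and the word likelihoods.
        score = math.log((doc_counts[cls] + 1) / (total_docs + 2))
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[cls][word] + 1) / (total_words + len(vocab))
                )
        log_scores[cls] = score
    peak = max(log_scores.values())
    exp_scores = {cls: math.exp(s - peak) for cls, s in log_scores.items()}
    return exp_scores[True] / (exp_scores[True] + exp_scores[False])


# Step 1: train a "Not_A model" on the negative data set (notA vs. everything else).
negative = [("some other txt", "notA"), ("some other text 2", "notB")]
not_a_model = train_binary_nb(
    [text for text, _ in negative],
    [label == "notA" for _, label in negative],
)

# Step 2: score the positive data set and keep the confidence as a new attribute.
positive = [("some txt", "A"), ("some text 2", "B")]
scored = [
    {
        "text": text,
        "label": label,
        "confidence(not_a)": confidence_true(not_a_model, text),
    }
    for text, label in positive
]
for row in scored:
    print(row)
```

The `scored` rows would then go into the final classifier together with the text features. Repeating step 1 for notB and notC would give one extra confidence column per category.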
  • atul_kotwale Member Posts: 5 Contributor II

    Hi @mschmitz

     

    Thanks for the reply. If I understand correctly, you mean I should build a model using the negative data set and then apply it to the positive data set. The output will contain three new columns (confidence(not_a), confidence(not_b), confidence(not_c)), and I should include these new columns for further training?
