[SOLVED] Really basic question, I think I'm applying models wrong.
My first read database gets all of the values from the documents (20k)
My second read database (1k documents) has a value isGood = 1 if the value is good, -2 if the value is bad, and a bunch of other really bad ideas. I set isGood to label. Should I actually only be passing true/false, or is an integer okay?
I use nominal to text to get the "data" field as text.
I then process the document, looking for word frequencies etc.
Is my Naive bayes even in the right place?
My end goal is that I feed it 1000 known good documents and it can find very similar documents from the first read database... I want my confidence score to be based on document similarity.
I am getting an output that contains confidence, but I'm not sure how to present my output. I don't come from a statistical background, so I'm learning on my feet. I appreciate I have a lot to learn, so in 3 weeks' time I'm going to read some books/content about how to use RapidMiner and ML in general. I can only apologize for my ignorance!
TL;DR:
Can I use an integer as a label?
Am I using Naive Bayes and Apply Model correctly?
How can I view my data in an easy-to-interpret way? Ideally something like a list of document IDs with their confidence rating.
Thanks guys!
Answers
As far as your variables go, I don't think there is a technical reason why you can't use integers, however the spread of your variables is odd. I would use 1 and 0 (1 is good, 0 is not good) if I were using integers. Someone else will need to say whether there needs to be a numeric to nominal process in there on your label. That is how my job is set up.
Regarding output, what you need to do is save the output of Apply Model, either to a CSV file or to the repository. Then you can extract the fields you need from it (ID and prediction(yes)).
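A minimal sketch of that extraction step, assuming the Apply Model output was written to CSV with columns named `id`, `prediction(label)` and `confidence(true)`. Those column names and the sample rows are assumptions for illustration only, so adjust them to whatever your export actually contains.

```python
import csv
import io

# Hypothetical export of the Apply Model output; the column names are assumptions.
SAMPLE_CSV = """id,prediction(label),confidence(true),confidence(false)
17,true,0.91,0.09
42,false,0.12,0.88
58,true,0.67,0.33
"""

def rank_by_confidence(csv_text, id_col="id", conf_col="confidence(true)"):
    """Return (id, confidence) pairs sorted from most to least confident."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row[id_col], float(row[conf_col])) for row in reader]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)

ranked = rank_by_confidence(SAMPLE_CSV)
for doc_id, confidence in ranked:
    print(f"{doc_id}\t{confidence:.2f}")
```

That gives the "list of document IDs with their confidence rating" asked for above. To read a real file, pass its contents in place of `SAMPLE_CSV`.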
BTW, I'm one step less of a newbie than you are, so I hope others will jump in and correct both of us. However, I am sure about your reads being backwards, so you should start with fixing that.
The learning scheme naive bayes does not have sufficient capabilities for handling an example set with only one label
But "include special attributes" is ticked, so is keep text and add meta information, any idea what I could be doing wrong?
some additions from my side:
- are you sure that your training data contains more than one value for isGood? If it contains only examples of one class, that could cause the error message.
- For Text Processing it is very important to use the same word list for training and application. Thus you have to connect the "wor" output of the Process Documents operator in the training branch to the "wor" input in the application branch. That way it is guaranteed that training and application example sets contain the same word vectors.
- do your integer values in isGood imply an order, or are they actually categories? In the latter case you should convert the label to a nominal value, so Naive Bayes will perform a classification. If it is left to Integer, it will perform a regression.
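To illustrate the last two points outside RapidMiner, here is a rough sketch in plain Python. It is not what Process Documents does internally; the tokenizer, the sample documents, and the rule that maps isGood to "good"/"bad" are all assumptions made for the example. The point is only that the word list is built once from the training documents and then reused for the application documents, and that the label is treated as a category rather than a number.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real text processing."""
    return text.lower().split()

def build_word_list(documents):
    """Collect the sorted vocabulary of the training documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})

def to_vector(document, word_list):
    """Count only words in the shared word list, so every vector has the same columns."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in word_list]

def to_nominal(is_good):
    """Assumed mapping: 1 means good, any other integer is treated as bad."""
    return "good" if is_good == 1 else "bad"

# Hypothetical sample data.
training_docs = ["school holiday dates", "buy cheap tickets"]
training_labels = [to_nominal(value) for value in (1, -2)]
application_docs = ["holiday dates for the school term"]

word_list = build_word_list(training_docs)                        # built once, from training only
train_vectors = [to_vector(doc, word_list) for doc in training_docs]
apply_vectors = [to_vector(doc, word_list) for doc in application_docs]
```

Words that only occur in the application documents ("for", "the", "term") are dropped, because the model never saw them during training. If each branch built its own word list, the two example sets would have different columns.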
Best,
Marius
I extended my DB structure to support a label field and set any that are known positive matches as true and any known negatives as false.
I use these MySQL select queries:
SELECT label, data, isGood, school_list_holiday_sources.id
FROM school_list_holiday_data
INNER JOIN school_list_holiday_sources
  ON school_list_holiday_data.id = school_list_holiday_sources.id
WHERE school_list_holiday_sources.label = "true"
   OR isGood = -2 AND school_list_holiday_sources.label = "false"
LIMIT 0,50
This select gets the items with a label of true or false. Naive Bayes learns from these.
SELECT label, data, isGood, school_list_holiday_sources.id
FROM school_list_holiday_data
INNER JOIN school_list_holiday_sources
  ON school_list_holiday_data.id = school_list_holiday_sources.id
WHERE label != 1 AND isGood = 0
ORDER BY score DESC
LIMIT 0,10
This select gets all of the items that don't have a true or false label.
My output data doesn't have any confidence rating. Should it?
It looks like this:
Thanks!
PS if someone could add me on skype/other IM service I'd be happy to screen share and work on this in real time?
if you applied the classification model: yes, your output should contain predictions and confidences. It would be helpful if you posted your process as XML here, so we can check the setup. You get the XML code via the XML tab at the top of the process view in RapidMiner. Just copy the text from there into your next answer, and please use the #-button on top of the input box for that.
Best,
Marius
http://beta.etherpad.org/p/rapidminer
please try this one and let me know if it works:
Cheers,
Ingo
Are you sure that you have pressed the green check icon after inserting the XML (I frequently forget this ;D )? The difference is really small: I have just connected the word list output port of the first text processing operator to the word list input port of the second one. This is definitely necessary, since otherwise the resulting example sets would differ and a prediction would not be possible. This should actually also be stated in the log, by the way.
Another thing which comes to mind is the fact that your query delivers an attribute label, which gets the role "label" during training but not during testing. Remove this, or also set the role to label before model application. Here is the suggested process:
If these things are not the reason, I am afraid I would have to look into the data and the transformed data, i.e. the two example sets which are actually delivered to the learner. Do they really contain regular attributes? Are those the same for training and testing?
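If it helps, that last comparison can also be done by hand once you have written down the attribute names and roles of both example sets (for example at a breakpoint). The sketch below is only an illustration in Python: the attribute names and roles are made up, and `compare_attributes` is a hypothetical helper, not a RapidMiner function.

```python
def compare_attributes(training, application):
    """Report regular attributes that appear in only one of the two example sets."""
    train_regular = {name for name, role in training.items() if role == "regular"}
    apply_regular = {name for name, role in application.items() if role == "regular"}
    return {
        "only_in_training": sorted(train_regular - apply_regular),
        "only_in_application": sorted(apply_regular - train_regular),
    }

# Hypothetical attribute -> role mappings for the two example sets.
training = {"label": "label", "id": "id", "holiday": "regular", "school": "regular"}
application = {"label": "regular", "id": "id", "holiday": "regular", "term": "regular"}

report = compare_attributes(training, application)
print(report)
```

In this made-up case the report shows `label` turning up as a regular attribute on the application side only, which is exactly the role mismatch described above, plus word attributes that differ between the two sets.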
Cheers,
Ingo
If you want I can share my screen via skype and we can make modifications in real time?
Skype: johny_mac
If somebody else has more time and wants to dive deeper into this: the next thing I would check is what is delivered to the learner (see my questions below) and to the operator Apply Model together with the log messages. If the dimension is really high, maybe another learner would also be more appropriate. Just my 2c.
Cheers,
Ingo
Would anyone be willing to just do it as a side job and not charge the 200 euros per hour but maybe 20 euros for 5 minutes of your time or maybe I can donate some money to charity or to your favorite open source project?
as Ingo said above: please check your data, and also your SQL queries. To me it seems a bit odd that you said that you want to use isGood as label, but are fetching a label column from the database. Next, in your screenshot of the data the columns for label and isGood are almost empty. Please check that you are fetching correct data sets by putting a breakpoint on the Read Database operators.
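As a rough illustration of what to look for at that breakpoint, the sketch below counts the label values in a handful of rows and flags the case where fewer than two classes are present. The rows are invented and the column name `label` is taken from the query above; nothing here calls RapidMiner itself.

```python
from collections import Counter

def summarize_label(rows, label_col="label"):
    """Count each label value, treating None and empty strings as missing."""
    counts = Counter()
    for row in rows:
        value = row.get(label_col)
        counts["<missing>" if value in (None, "") else str(value)] += 1
    return counts

def is_trainable(counts):
    """A classifier needs at least two distinct, non-missing label values."""
    classes = [value for value in counts if value != "<missing>"]
    return len(classes) >= 2

# Hypothetical rows, as they might come back from the training query.
rows = [
    {"id": 1, "label": "true"},
    {"id": 2, "label": ""},
    {"id": 3, "label": None},
    {"id": 4, "label": "true"},
]
counts = summarize_label(rows)
print(dict(counts), is_trainable(counts))
```

With rows like these, only one real class ("true") is present and half the labels are missing, so a learner has nothing to separate.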
Best,
Marius
I actually get this in my results which I think means something is working right:
Can anyone please confirm?
Thanks
If you still don't get valid results, again check the following:
Did you:
- connect the word list output of the Process Documents operator in the training branch to the word list input of Process Documents in the Apply branch?
- double check that you read correct data from both Read Database operators?
- if you don't use isGood, don't retrieve it from the database.
- find out why the label attribute is empty after Process Documents, and try to fix it. Is it already empty directly after the Read Database operators?
Best, Marius
In answer to your questions:
Yes.
Yes.
Removed isGood
I'm running some more tests now, will reply once they are completed.
Thanks
Didn't you get a warning or error in the "Problems" view at the bottom of RapidMiner saying something like "The example set must contain at least one text attribute"?
Best, Marius
Include special attributes not checked.
Didn't get any warnings..
View:
XML is this:
I refer you to earlier posts in this thread from Ingo and to the help for this operator...