Correct ARFF Format?

Legacy User Member Posts: 0 Newbie
edited November 2018 in Help
Hi all,

I'm running a Naive Bayes classifier on a set of keywords/keyphrases and then using the resulting model to predict the label attribute for an unclassified set of keywords/keyphrases. However, I'm running into a strange problem: the result of the applied model shows a ? whenever there is a space between keywords. Could I be formatting my ARFFs incorrectly?

Here is my training set:
@RELATION c_training

@ATTRIBUTE keywords STRING
@ATTRIBUTE change {up,down,neutral}

@DATA
'delay acquisition',down
'facing the same conundrum',down
'restructuring',down
'delay acquisition',up
'divestiture',down
'profit dissipated',down
'delay acquisition',up
'profits up', up
'profits down', down
'delay acquisition',up
'delay acquisition',up
'delay acquisition',up
'delay acquisition',up
And here is my test set:
@RELATION c_test

@ATTRIBUTE keywords STRING

@DATA
'profit dissipated'
Any help would be appreciated.

Thank you.

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    as far as I remember, Arff uses double quotes (") instead of single quotes ('). Could that be the reason?

    Cheers,
    Ingo
  • Legacy User Member Posts: 0 Newbie
    Nope, I tried converting the single quotes to double quotes in both the training and test data. The problem occurs both in the GUI and when using the JAR as a library.

    The end result of the above training and test data (with double quotes) is
    ? down 0.24710424710424708 0.752895752895753 0.0
    instead of the expected
    profit dissipated down 0.24710424710424708 0.752895752895753 0.0
    But the problem is strange. If in the test data we change "profit dissipated" to "profit  dissipated" (with 2 spaces) it works fine.
  • Legacy User Member Posts: 0 Newbie
    Also, this warning comes up. I'm not sure what it means, but perhaps it is related.
    G Apr 6, 2009 11:37:31 AM: [Warning] Distribution: The number of nominal values is not the same for training and application for attribute 'keywords', training: 5, application: 1
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    You are right: there is no difference between single and double quotes, and both are supported. But you missed one important thing: your meta data description is not right. Your attribute is not actually a "string" attribute but a nominal (categorical) one, for which you have to define all occurring values. If you do this correctly for both the training and the test data, at least the Naive Bayes error should be gone. Since I also do not see any other issue during data loading, I assume the output problem will be fixed by that as well.

    So a correct Arff for training would look like

    @RELATION c_training

    @ATTRIBUTE keywords {'delay acquisition','facing the same conundrum','restructuring','divestiture','profit dissipated','profits down'}

    @ATTRIBUTE change {up,down,neutral}

    @DATA
    'delay acquisition',down
    'facing the same conundrum',down
    'restructuring',down
    'delay acquisition',up
    'divestiture',down
    'profit dissipated',down
    'delay acquisition',up
    'profit dissipated', up
    'profits down', down
    'delay acquisition',up
    'delay acquisition',up
    'delay acquisition',up
    'delay acquisition',up
    and for testing accordingly

    @RELATION c_test

    @ATTRIBUTE keywords {'delay acquisition','facing the same conundrum','restructuring','divestiture','profit dissipated','profits down'}

    @DATA
    'profit dissipated'

    Please check the meta data view to verify that everything is set up correctly. Instead of using Arff, you could also use the Attribute Editor of RM if you do not want to type in the different values yourself. Alternatively, you could load the data from Arff using a string attribute and write it out with the ExampleSetWriter (both the meta data file .aml and the data file .dat). Then you could use the same basic .aml file for your test data.
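
If typing the value list by hand is error-prone, the nominal declaration could also be generated from the data. The following is only a rough sketch in Python, not part of RapidMiner: the helper names `quote`, `nominal_spec`, and `build_arff` are made up for illustration, the sample rows are a small subset of the data above, and the backslash escaping of quotes is an assumption based on the common ARFF convention. The point is that one shared declaration is built once and reused, so the training and test files list the same values in the same order.

```python
def quote(value):
    """Quote a nominal value for ARFF (assumed convention: backslash-escape quotes)."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def nominal_spec(values):
    """Build a {...} nominal declaration, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(values))
    return "{" + ",".join(quote(v) for v in unique) + "}"


def build_arff(relation, keyword_spec, rows, labels=None):
    """Return ARFF text; rows are (keyword, label) pairs, or 1-tuples for test data."""
    lines = [f"@RELATION {relation}", "", f"@ATTRIBUTE keywords {keyword_spec}"]
    if labels is not None:
        lines.append("@ATTRIBUTE change {" + ",".join(labels) + "}")
    lines += ["", "@DATA"]
    for row in rows:
        keyword, rest = row[0], row[1:]
        lines.append(",".join([quote(keyword), *rest]))
    return "\n".join(lines) + "\n"


# Hypothetical sample rows, taken from the thread's data for illustration only.
train = [
    ("delay acquisition", "up"),
    ("profit dissipated", "down"),
    ("profits down", "down"),
]
test = [("profit dissipated",)]

# One shared declaration, so both files declare identical nominal values.
spec = nominal_spec([row[0] for row in train + test])
train_arff = build_arff("c_training", spec, train, labels=["up", "down", "neutral"])
test_arff = build_arff("c_test", spec, test)
print(test_arff)
```

Building the declaration from the union of training and test values means a phrase that appears only in the test set is still declared, although a model trained without that phrase has no statistics for it.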

    Cheers,
    Ingo
  • Legacy User Member Posts: 0 Newbie
    Thanks, Ingo.  That was the problem.  All fixed now!