
FrequencyDiscretization with use_sqrt_of_examples

emanueleemanuele Member Posts: 3 Contributor I
edited November 2018 in Help
I'm using RapidMiner to extract frequent itemsets with FP-Growth.
My dataset also contains numerical attributes, and I want to discretize them with the FrequencyDiscretization operator. I also want RapidMiner to choose the number of bins for me, as the square root of the number of examples.
In some circumstances, some attributes in my dataset contain null values for all examples. Obviously, such attributes do not need to be discretized.
Nevertheless, if I set the property "use_sqrt_of_examples" to true and the dataset has an attribute (even a textual one) with all null values, RapidMiner does not complete the process and throws this exception:


G Nov 24, 2009 9:25:51 AM: [Fatal] ArrayIndexOutOfBoundsException occured in 1st application of FrequencyDiscretization (FrequencyDiscretization)
G Nov 24, 2009 9:25:51 AM: [Fatal] Process failed: operator cannot be executed (-1). Check the log messages...


Does anyone know how I can perform the discretization so that RapidMiner chooses the number of bins for me, while avoiding the problem described above?



Thanks in advance.


Emanuele

Answers

  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    there are two options for treating the missing values. You could either replace all missing values using a MissingValueReplenishment operator, or you could remove the completely useless attribute with RemoveUselessAttributes.
    But be careful with the latter, because depending on the data it will remove different attributes.

    Greetings,
      Sebastian
  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    in the upcoming RapidMiner 5, the problem with FrequencyDiscretization seems to be gone; at least, I cannot reproduce it. You might post your process here, but please replace every file-related operator (such as ExampleSource) with a generator. Otherwise I cannot reproduce the behavior.

    Greetings,
      Sebastian
  • emanueleemanuele Member Posts: 3 Contributor I
    Thank you, Sebastian, for the very quick answers.
    I solved my problem by inserting an "Attribute filter" operator before FrequencyDiscretization, with condition_class="missing_values" and max_fraction_of_missings="0.999". This way, all columns containing only null values are removed, which avoids the problem with FrequencyDiscretization.

    Anyway, I realized that in my application I should let the user statically choose the number of bins and their ranges. So I think I should use UserBasedDiscretization, but in my application each attribute needs a specific number of bins with specific ranges. It seems that UserBasedDiscretization assigns the same number of bins with the same ranges to every attribute.

    Is it possible to perform a user-based discretization that specifies, for each numerical attribute, its own number of bins, each with its own range?

    Thanks in advance.

    Emanuele
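
For readers who want to see the logic of the workaround above outside RapidMiner, here is a minimal Python sketch: drop every column whose values are all missing, then bin the remaining numeric columns by frequency using the square root of the number of examples as the bin count. It is not RapidMiner's implementation. The function names, the use of `None` for missing values, and rounding the square root down are all assumptions made for illustration.

```python
import math

def drop_all_missing(columns):
    """Keep only columns that have at least one non-missing (non-None) value."""
    return {name: values for name, values in columns.items()
            if any(v is not None for v in values)}

def frequency_discretize(values, n_bins):
    """Equal-frequency binning by rank: returns a bin index 0..n_bins-1 per value.

    Missing values stay None. Tied values may end up in different bins,
    because the split is made purely on sorted position.
    """
    present = sorted((v, i) for i, v in enumerate(values) if v is not None)
    bins = [None] * len(values)
    for rank, (_, i) in enumerate(present):
        bins[i] = rank * n_bins // len(present)
    return bins

def discretize_table(columns):
    """Drop all-missing columns, then bin the rest with floor(sqrt(n)) bins."""
    if not columns:
        return {}
    n_examples = len(next(iter(columns.values())))
    # Assumption: round the square root down, with at least one bin.
    n_bins = max(1, math.isqrt(n_examples))
    cleaned = drop_all_missing(columns)
    return {name: frequency_discretize(vals, n_bins)
            for name, vals in cleaned.items()}

# Hypothetical data: "income" is entirely missing and must be skipped.
data = {
    "age": [23, 35, 41, 52, 29, 60, 18, 44, 37],
    "income": [None] * 9,
    "score": [1.0, None, 3.0, 2.0, 5.0, 4.0, None, 6.0, 7.0],
}
result = discretize_table(data)
```

With nine examples the sketch uses three bins, and the all-missing `income` column never reaches the binning step, which is the same effect the attribute filter has in the process.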
  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    in RapidMiner 4.x it's a little more complicated, but you could do it this way:
    Put the BinDiscretization operator inside an AttributeSubsetPreprocessing operator. There you can select just one attribute, or several using regular expressions.
    You would have to insert this operator combination several times, once per attribute.

    If this is not suitable or simply not elegant enough, you could wrap one of these combinations inside a parameter iteration switched to list mode. You would have to give it a list of settings of attributes and bin numbers. To hold the results (which might be discarded, but I don't remember exactly), you could use the IOStore and IORetrieve operators.

    Greetings,
      Sebastian
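
The per-attribute idea in the reply above (one discretization step per attribute, driven by a list of settings) can be pictured in plain Python as a loop over a settings table. This is only a sketch under assumptions, not a RapidMiner process: the function names and the `settings` format are hypothetical. An integer setting means "this many equal-width bins", and a list means "use exactly these cut points", which covers the case where each attribute needs its own ranges.

```python
from bisect import bisect_right

def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equal-width ranges.

    Assumes the column has at least one non-missing value.
    """
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]

def apply_edges(values, edges):
    """Map each value to the index of its range; None stays None.

    A value equal to a cut point falls into the upper range.
    """
    return [None if v is None else bisect_right(edges, v) for v in values]

def discretize_per_attribute(columns, settings):
    """Apply a separate discretization to each attribute named in settings.

    settings maps an attribute name to either an int (number of equal-width
    bins) or a list of explicit cut points. Other columns pass through as-is.
    """
    result = dict(columns)
    for name, spec in settings.items():
        if isinstance(spec, int):
            edges = equal_width_edges(columns[name], spec)
        else:
            edges = sorted(spec)
        result[name] = apply_edges(columns[name], edges)
    return result

# Hypothetical data and settings, one entry per attribute.
columns = {
    "age": [18, 25, 40, 67],
    "income": [1000, 2500, 4000, 9000],
    "label": ["a", "b", "a", "b"],
}
settings = {"age": [30, 60], "income": 2}
binned = discretize_per_attribute(columns, settings)
```

Here `age` gets three user-defined ranges (below 30, 30 up to 60, 60 and above) while `income` gets two equal-width bins, and `label` is left untouched because it has no entry in the settings.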