FrequencyDiscretization with use_sqrt_of_examples

emanuele
emanuele New Altair Community Member
edited November 2024 in Community Q&A
I'm using RapidMiner for extracting frequent itemsets with FP-Growth.
My dataset also contains numerical attributes, and I want to discretize them with the FrequencyDiscretization operator. Moreover, I want RapidMiner to choose the number of bins for me as the square root of the number of examples.
In some circumstances, some attributes in my dataset contain null values for all the examples. Obviously, in such situations these attributes do not need to be discretized.
Despite this, if I set the property "use_sqrt_of_examples" to true and there is an attribute (even a textual one) with all null values, RapidMiner does not complete the process and throws this exception:


G Nov 24, 2009 9:25:51 AM: [Fatal] ArrayIndexOutOfBoundsException occured in 1st application of FrequencyDiscretization (FrequencyDiscretization)
G Nov 24, 2009 9:25:51 AM: [Fatal] Process failed: operator cannot be executed (-1). Check the log messages...


Does anyone know how I can perform the discretization so that RapidMiner chooses the number of bins for me, while avoiding the above-mentioned problem?
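
To make the intended behavior concrete, here is a minimal Python sketch of equal-frequency discretization with the number of bins taken as the square root of the number of examples. It is not RapidMiner's implementation: the function name `equal_frequency_bins`, the use of `None` for null values, and the rounding of the square root are all assumptions made for illustration. The point is only that an all-null attribute can be skipped instead of causing a failure.

```python
import math


def equal_frequency_bins(values, n_bins=None):
    """Assign each non-missing value to an equal-frequency bin index.

    Missing values are represented as None and stay None in the result.
    If n_bins is None, it defaults to round(sqrt(number of examples)),
    which is an assumption about how such a setting might behave.
    """
    present = sorted(v for v in values if v is not None)
    if not present:
        # Every value is missing: there is nothing to discretize.
        return list(values)
    if n_bins is None:
        n_bins = max(1, round(math.sqrt(len(values))))
    # Never ask for more bins than there are known values.
    n_bins = min(n_bins, len(present))
    n = len(present)
    # Upper boundary of bin k: the value at rank ceil((k + 1) * n / n_bins).
    cuts = [present[math.ceil((k + 1) * n / n_bins) - 1]
            for k in range(n_bins - 1)]

    def bin_of(v):
        for k, cut in enumerate(cuts):
            if v <= cut:
                return k
        return n_bins - 1

    return [None if v is None else bin_of(v) for v in values]
```

With nine examples this gives three bins of three values each, and a column containing only `None` is returned unchanged.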



Thanks in advance.


Emanuele

Answers

  • land
    land New Altair Community Member
    Hi,
    there are two options for treating the missing values. You could either replace all missing values using a MissingValueReplenishment operator, or you could remove the completely useless attribute by choosing RemoveUselessAttributes.
    But be careful with the latter, because depending on the data it will remove different attributes.
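
The first option, replacing missing values, can be sketched in Python as follows. This only illustrates the idea and is not what MissingValueReplenishment does internally: the function name `replace_missing_with_mean`, the use of `None` for missing values, and the `fallback` parameter are assumptions. Note that an attribute with no known values has no mean, so some fixed fallback is needed in that case.

```python
def replace_missing_with_mean(values, fallback=0.0):
    """Replace None entries with the mean of the known values.

    If no value is known at all, the hypothetical `fallback` is used,
    because a mean cannot be computed for an all-missing column.
    """
    present = [v for v in values if v is not None]
    fill = sum(present) / len(present) if present else fallback
    return [fill if v is None else v for v in values]
```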

    Greetings,
      Sebastian
  • land
    land New Altair Community Member
    Hi,
    at least in the upcoming RapidMiner 5 the problem with FrequencyDiscretization seems to be gone; at least I cannot reproduce it. You might post your process here, but please replace every file-related operator (such as ExampleSource operators) with a generator. Otherwise I cannot reproduce the behavior.

    Greetings,
      Sebastian
  • emanuele
    emanuele New Altair Community Member
    Thank you, Sebastian, for the very quick answers.
    I solved my problem by inserting an "Attribute filter" operator before FrequencyDiscretization, specifying condition_class="missing_values" and max_fraction_of_missings="0.999". This way, all the columns containing only null values are removed, which avoids the problem with FrequencyDiscretization.
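
For readers outside RapidMiner, the filtering step described above can be sketched in Python roughly like this. The function `drop_mostly_missing`, the dict-of-lists data layout, and the exact comparison (keep a column when its missing fraction is at most the threshold) are assumptions for illustration; only the parameter name and the 0.999 value come from the post.

```python
def drop_mostly_missing(columns, max_fraction_of_missings=0.999):
    """Keep only columns whose fraction of None values is within the limit.

    `columns` maps an attribute name to its list of values. With the
    default threshold, a column is dropped only if (almost) all of its
    values are missing.
    """
    kept = {}
    for name, values in columns.items():
        missing = sum(1 for v in values if v is None)
        # Treat an empty column as fully missing.
        fraction = missing / len(values) if values else 1.0
        if fraction <= max_fraction_of_missings:
            kept[name] = values
    return kept
```

Columns that are only partly missing survive the filter, so they still reach the discretization step.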

    Anyway, I realized that in my application I should let the user statically choose the number of bins and their ranges. So I think I should use UserBasedDiscretization, but in my application each attribute needs a specific number of bins with specific ranges. It seems that UserBasedDiscretization assigns the same number of bins with the same ranges to every attribute.

    Is it possible to perform a user-based discretization that specifies, for each numerical attribute, its own number of bins, each with its own range?

    Thanks in advance.

    Emanuele
  • land
    land New Altair Community Member
    Hi,
    in RapidMiner 4.x it's a little bit more complicated, but you could do it this way:
    Put the BinDiscretization operator into an AttributeSubsetPreprocessing operator. There you can select just one attribute, or several using regular expressions.
    You would then have to insert this operator combination several times, once per attribute.

    If this is not suitable or simply not elegant enough, you could wrap one of these combinations inside a parameter iteration switched to list mode. You would have to give it a list of settings of attributes and bin numbers. For holding the results (which might otherwise be discarded, but I don't remember exactly) you could use the IOStore and IORetrieve operators.
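
The underlying idea, one discretization setting per attribute, can be sketched in Python as below. This is not a RapidMiner process: the function names, the dict-of-lists layout, and the `settings` mapping from attribute name to a list of upper bin boundaries are all assumptions chosen to illustrate per-attribute bins with user-defined ranges.

```python
from bisect import bisect_left


def discretize_with_edges(values, upper_edges):
    """Map each value to the index of the first bin whose upper edge is >= it.

    Values above the last edge fall into one extra, final bin.
    None (missing) values are left as None.
    """
    return [None if v is None else bisect_left(upper_edges, v)
            for v in values]


def discretize_per_attribute(columns, settings):
    """Apply a separate list of upper bin edges to each configured attribute.

    Attributes that do not appear in `settings` are passed through unchanged.
    """
    result = dict(columns)
    for name, upper_edges in settings.items():
        result[name] = discretize_with_edges(columns[name], sorted(upper_edges))
    return result
```

For example, `settings = {"age": [18, 65], "income": [2000]}` would give "age" three bins and "income" two, each with its own boundaries.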

    Greetings,
      Sebastian
