2 weeks ago
Hi! I've been using the NB (Kernel) algorithm for my classification problem, and I chose the greedy estimation mode.
I also used the Optimize Parameters (Grid) operator to find the best combination of bandwidth and number of kernels. I set the range of the bandwidth parameter from 0.01 to 0.1, and the range of the number of kernels from 1 to 20.
Are these ranges reasonable, and what exactly does the "number of kernels" parameter stand for? I've spent the past few days searching the literature for recommended ranges and for an explanation of the "number of kernels" parameter, but without success.
I would appreciate your help and insights.
2 weeks ago
Let's start with the meaning of the parameters.
I assume you know how Naive Bayes works in general. If not, I recommend the following blog post:
For nominal / categorical attributes, we estimate the probability of each attribute value within a class by simply counting how often that value occurs in the class and dividing by the number of examples in that class. But what do we do for numerical values? In a simple implementation, the probabilities for numerical values are derived from a single distribution (usually Gaussian) which is fitted to the data.
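To make the single-Gaussian case concrete, here is a minimal sketch in plain Python. The attribute values and the name `values_class_a` are made up for illustration, and this is not how any particular tool implements it internally:

```python
import math
from statistics import mean, pstdev

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical values of one numerical attribute for a single class.
values_class_a = [4.9, 5.1, 5.0, 5.3, 4.7]

# "Fitting" the single Gaussian just means estimating its mean and spread.
mu = mean(values_class_a)
sigma = pstdev(values_class_a)  # population standard deviation

# Class-conditional likelihood p(x | class A) for a new value x = 5.0.
# Naive Bayes multiplies such terms over all attributes and the class prior.
likelihood = gaussian_pdf(5.0, mu, sigma)
```

One such Gaussian is fitted per class and per numerical attribute, so a class whose values have two separate peaks is still described by a single bump.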
A kernel-based distribution replaces this simple unimodal distribution with one consisting of an additive overlay of multiple Gaussian distributions. See here for more information: https://en.wikipedia.org/wiki/Kernel_density_estimation
The "number of kernels" is simply the number of Gaussian distributions used in this overlay. If the number is high, the distribution becomes more complex / wiggly, which can lead to overfitting your data. If it is too small, you might miss important peaks in your data.
The width (bandwidth) parameter is simply the width of each of those single kernels. Wider kernels lead to smoother distribution curves; narrower kernels make the curve wiggle more.
Of course, there is no single range that works for all data sets. I typically try between 1 and 10 for the number of kernels and a width between 0.1 and 0.5 so that the distribution does not get too wiggly.
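Here is a rough sketch of how the two parameters interact, in plain Python. Note the assumption: the kernel centers are placed at the means of equally sized chunks of the sorted data, which is my own simplification for illustration and not the greedy procedure the operator actually uses. The helper names `fit_kernels` and `kernel_density` are hypothetical as well.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_kernels(values, n_kernels):
    """Return (center, weight) pairs, one per kernel.

    Assumption for illustration: split the sorted data into contiguous
    chunks and use each chunk's mean as a center, weighted by chunk size.
    """
    data = sorted(values)
    n = len(data)
    k = min(n_kernels, n)  # never more kernels than data points
    kernels = []
    for j in range(k):
        chunk = data[j * n // k:(j + 1) * n // k]
        kernels.append((sum(chunk) / len(chunk), len(chunk) / n))
    return kernels

def kernel_density(x, kernels, bandwidth):
    """Additive overlay of Gaussians that all share the same bandwidth."""
    return sum(w * gaussian_pdf(x, c, bandwidth) for c, w in kernels)

# Hypothetical attribute values with two clear peaks (around 1 and 3).
values = [1.0, 1.1, 0.9, 1.2, 3.0, 3.1, 2.9, 3.2]

one_kernel = fit_kernels(values, 1)   # a single bump in the middle
two_kernels = fit_kernels(values, 2)  # one bump per peak

valley_one = kernel_density(2.05, one_kernel, 0.3)
valley_two = kernel_density(2.05, two_kernels, 0.3)
peak_two = kernel_density(1.05, two_kernels, 0.3)
```

With one kernel, the density is highest in the empty valley between the two peaks. With two kernels, the valley gets a low density and the peaks a high one, which is exactly the "missed peaks" effect of too few kernels.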
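If you want to see what a grid search over these two ranges does in principle, here is a small sketch in plain Python. Several things are assumptions for illustration only: the data is made up, the kernel placement is the same simplified chunking idea (not the greedy mode), and the score is the log-likelihood of a held-out sample for a single attribute, whereas in your process the grid optimizer would score the classification performance of the whole model.

```python
import math
from itertools import product

def fit_kernels(values, n_kernels):
    """(center, weight) pairs from contiguous chunks of the sorted data."""
    data = sorted(values)
    n = len(data)
    k = min(n_kernels, n)
    kernels = []
    for j in range(k):
        chunk = data[j * n // k:(j + 1) * n // k]
        kernels.append((sum(chunk) / len(chunk), len(chunk) / n))
    return kernels

def log_likelihood(points, kernels, bandwidth):
    """Sum of log densities of the points under the kernel overlay."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    total = 0.0
    for x in points:
        density = sum(
            w * math.exp(-0.5 * ((x - c) / bandwidth) ** 2) / norm
            for c, w in kernels
        )
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total

# Hypothetical training and validation values of one attribute in one class.
train = [1.0, 1.1, 0.9, 1.2, 3.0, 3.1, 2.9, 3.2]
valid = [0.95, 1.15, 3.05, 2.95]

kernel_grid = range(1, 11)                  # 1 to 10 kernels
bandwidth_grid = [0.1, 0.2, 0.3, 0.4, 0.5]  # width from 0.1 to 0.5

# Try every combination and keep the one with the best held-out score.
best_score, best_k, best_h = max(
    (log_likelihood(valid, fit_kernels(train, k), h), k, h)
    for k, h in product(kernel_grid, bandwidth_grid)
)
```

Scoring on held-out data matters here: on the training data alone, more kernels and narrower widths almost always look better, which is the overfitting mentioned above.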
Hope this helps,