# "[SOLVED] Reconstruction of the model in excel"

Imagine you have a model learnt with W-BayesNet or W-BFTree (both from the WEKA package).

Now you want to reproduce all the maths behind the model on a sheet of paper or in Excel, so that you can still use the model you learnt even if you don't have RapidMiner at hand.

I managed to do it for k-means clustering, for W-LADTree and for binominal BF-Tree algorithms. Let's take a look at a simple BF-Tree model output:

=========================================

W-BFTree

Best-First Decision Tree

indicator1 < -0.01016

| indicator2 < 0.01842: Class2(7.0/2.0)

| indicator2 >= 0.01842: Class1(103.0/29.0)

indicator1 >= -0.01016

| indicator3 < 0.00926

| | indicator4 < -7.7E-4: Class2(24.0/1.0)

| | indicator4 >= -7.7E-4: Class2(20.0/12.0)

| indicator3 >= 0.00926: Class1(14.0/5.0)

Size of the Tree: 9

Number of Leaf Nodes: 5

=========================================

I can describe this model in Excel with the formula:

class1 probability = IF(indicator1 < -0.01016;IF(indicator2 < 0.01842;2/9;103/132);IF(indicator3 < 0.00926;IF(indicator4 < -7.7E-4;1/25;12/32);14/19))

class2 probability = 1 - class1 probability
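For anyone who prefers code to a spreadsheet, the same nested IF can be sketched in Python. This is only a transcription of the Excel formula above; the function name is made up for illustration, and it relies on my reading of the leaf counts (first number = cases of the named class, second number = cases of the other class).

```python
def class1_probability(indicator1, indicator2, indicator3, indicator4):
    """Class1 probability for the binominal BF-Tree shown above.

    Mirrors the nested Excel IF; the fractions follow the post's reading
    of the leaf counts, which is an assumption about WEKA's output.
    """
    if indicator1 < -0.01016:
        if indicator2 < 0.01842:
            return 2 / 9        # leaf Class2(7.0/2.0)
        return 103 / 132        # leaf Class1(103.0/29.0)
    if indicator3 < 0.00926:
        if indicator4 < -7.7e-4:
            return 1 / 25       # leaf Class2(24.0/1.0)
        return 12 / 32          # leaf Class2(20.0/12.0)
    return 14 / 19              # leaf Class1(14.0/5.0)


def class2_probability(indicator1, indicator2, indicator3, indicator4):
    """With only two classes, class2 gets whatever class1 does not."""
    return 1 - class1_probability(indicator1, indicator2, indicator3, indicator4)
```

Calling `class1_probability(-0.02, 0.0, 0.0, 0.0)` walks the first branch and returns 2/9, the same value the Excel formula gives for that row.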

But if I have a polynominal label (class1 to class8), I can't reconstruct RapidMiner's calculations because the information in the output is insufficient. Look at this model output:

=========================================

W-BFTree

Best-First Decision Tree

indicator1 < -0.01016

| indicator2 < 0.01842: Class2(7.0/2.0)

| indicator2 >= 0.01842: Class1(103.0/29.0)

indicator1 >= -0.01016

| indicator3 < 0.00926

| | indicator4 < -7.7E-4: Class3(24.0/1.0)

| | indicator4 >= -7.7E-4: Class4(20.0/12.0)

| indicator3 >= 0.00926: Class5(14.0/5.0)

Size of the Tree: 9

Number of Leaf Nodes: 5

=========================================

Let's say we have an example where indicator1 < -0.01016 and indicator2 < 0.01842. All you can say is that it belongs to class2 with probability 7/9. But how is the remaining 2/9 distributed among the other classes? You can't tell from this output, even though RapidMiner gives you a confidence level for every single class in its results. I do some postprocessing on those confidences, so it is really important for me to be able to reproduce these hidden calculations. Does anyone know how?

The same goes for some other learning methods, for instance W-BayesNet. I was unable to determine how its confidence levels are calculated from the model output.
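To make the gap concrete, here is a small Python sketch of what a single leaf annotation does and does not tell you. The function name and the regular expression are my own illustration, and it again assumes that in `Class2(7.0/2.0)` the first number counts cases of the named class and the second counts all other cases.

```python
import re

# Matches leaf annotations such as "Class2(7.0/2.0)".
LEAF_PATTERN = re.compile(r"\s*(\w+)\((\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)\)\s*")


def leaf_summary(leaf_text):
    """Return (label, probability of label, probability mass left over).

    The leftover mass is a single number: the tree printout does not say
    how it is split among the remaining classes.
    """
    match = LEAF_PATTERN.fullmatch(leaf_text)
    if match is None:
        raise ValueError("not a leaf annotation: %r" % leaf_text)
    label = match.group(1)
    named = float(match.group(2))   # cases of the named class (assumed)
    others = float(match.group(3))  # cases of all other classes (assumed)
    total = named + others
    return label, named / total, others / total


label, known, leftover = leaf_summary("Class2(7.0/2.0)")
```

For the leaf above this yields `Class2` with 7/9 known and 2/9 left over, and with eight classes there is no way to split that 2/9 from the printout alone.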

## Answers

For those who care: to get access to all the probabilities, you need to export the model to XML, which contains everything. Then you should write a program to extract the data from the XML. It should search the XML file for the "<m__isLeaf>true</m__isLeaf>" string and, when it finds one, look for the next "<m__Distribution..." occurrence, after which you will find the number of cases attributable to each class.

The same goes for other learning methods that don't show all of this information in the "Result Overview".

I understand that with many classes, printing the full distribution would overload the output, but as a suggestion it would be nice to have an option in RapidMiner to see the full distribution of probabilities among the classes.
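The extraction steps above could be sketched in Python roughly as follows. Treat it as a sketch under assumptions, not as the actual export format: the code only relies on the leaf marker string and on the "<m__Distribution" prefix; the sample XML, the exact tag name in it, the nested `<double>` elements and all function names are hypothetical and need to be checked against a real export. It also assumes that the confidences are simply the leaf counts divided by their sum.

```python
import re

LEAF_MARKER = "<m__isLeaf>true</m__isLeaf>"
DIST_PREFIX = "<m__Distribution"      # rest of the tag name is not known here
DIST_CLOSE_PREFIX = "</m__Distribution"
# A number sitting between two tags, e.g. ">7.0<".
NUMBER = re.compile(r">\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*<")


def leaf_distributions(xml_text):
    """Return one list of per-class case counts for every leaf found."""
    results = []
    pos = 0
    while True:
        leaf_at = xml_text.find(LEAF_MARKER, pos)
        if leaf_at == -1:
            break
        # Next distribution block after the leaf marker.
        dist_at = xml_text.find(DIST_PREFIX, leaf_at)
        if dist_at == -1:
            break
        # Assumption: the block is closed by a tag with the same prefix.
        end_at = xml_text.find(DIST_CLOSE_PREFIX, dist_at)
        if end_at == -1:
            break
        body_start = xml_text.find(">", dist_at) + 1
        block = xml_text[body_start:end_at]
        results.append([float(x) for x in NUMBER.findall(block)])
        pos = end_at
    return results


def to_confidences(counts):
    """Normalise case counts to probabilities (assumed confidence rule)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical fragment, made up for illustration only.
SAMPLE_XML = """
<node>
  <m__isLeaf>false</m__isLeaf>
  <m__Distribution><double>30.0</double><double>80.0</double><double>1.0</double></m__Distribution>
  <node>
    <m__isLeaf>true</m__isLeaf>
    <m__Distribution><double>1.0</double><double>7.0</double><double>1.0</double></m__Distribution>
  </node>
</node>
"""

leaves = leaf_distributions(SAMPLE_XML)
confidences = [to_confidences(counts) for counts in leaves]
```

Non-leaf nodes are skipped because their marker reads `false`, so only the distribution that follows a `true` marker is collected. If the real export stores the counts in a different layout, only the `NUMBER` pattern and the closing-tag search should need adjusting.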