Normalization Issue

Darrell Member Posts: 16 Maven
edited November 2018 in Help
OK... hopefully this will make sense to you because I'm thoroughly confused...

Using Version 4.2, the first file I use, "ModelBuilder_v42.xml," builds the model to predict the change in price.  Specifically, the "ModelBuilder" file imports the raw data, normalizes the data, creates a simple linear regression model, writes the model to a file, reloads the model, then applies the model to the previous example set using the ModelApplier.  After I run the file, the "Meta Data View" shows the following statistics for the label and prediction respectively, "avg = 0.390 +/- 7.132" and "avg = 0.390 +/- 0.261."  In addition, the statistics of all the regular attributes are "avg = 0 +/- 1." Therefore, everything appears to look good thus far.

However, my second file, "ModelLoader_v42.xml," is used to import new raw data, load the model, apply the model, and save the results to a comma-separated file.  But when I run this file using the same raw data file as before, the "Meta Data View" shows the following statistics for the label and prediction respectively, "avg = 0.390 +/- 7.132" and "avg = 8.846 +/- 1.677."  In addition, the statistics for all the regular attributes do not appear to be normalized, i.e. "avg = 65.074 +/- 16.351, avg = 0.337 +/- 2.242, etc."  Therefore, even though I selected "return_preprocessing_model" in the "Normalization" operator in the model builder file--neither the regular attributes nor the predictions appear to remain normalized.

Now this is where it really gets confusing.  Using Version 4.1, when I build the model using the same operators and the same raw data as before, the statistics are as follows for the label and prediction respectively, "avg = 0.390 +/- 7.132" and "avg = -1.238 +/- 2.720."  And the statistics for the regular attributes appear really off, i.e. "avg = -4.223 +/- 0.004, avg = -0.217 +/- 0.199, etc." for the same attributes as above.  Moreover, when I load and run the model, the statistics for the label and prediction respectively are "avg = 0.390 +/- 7.132" and "avg = 0.390 +/- 0.261."  In addition, now the statistics of all the regular attributes are normalized again, i.e. "avg = 0 +/- 1."

What is really strange is that the results I got using version 4.2 with "ModelBuilder_v42.xml" (which I could not duplicate using the "ModelLoader" file) are the same results I got after creating and loading the model using version 4.1.

Could I have corrupted the results while trying to repeat the process?  Or should I have uninstalled Version 4.1 before I installed version 4.2?

Please let me know how I can transfer the xml and data file to you for verification...

Thanks again,

Darrell

Answers

  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi,

    first of all a short remark: it would be nice if you posted subsequent answers to the same thread (if they are related to the same question and subject). Then it is easier to follow.

    Now concerning your problem...
    Darrell wrote:

    Using Version 4.2, the first file I use, "ModelBuilder_v42.xml," builds the model to predict the change in price.  Specifically, the "ModelBuilder" file imports the raw data, normalizes the data, creates a simple linear regression model, writes the model to a file, reloads the model, then applies the model to the previous example set using the ModelApplier.  After I run the file, the "Meta Data View" shows the following statistics for the label and prediction respectively, "avg = 0.390 +/- 7.132" and "avg = 0.390 +/- 0.261."  In addition, the statistics of all the regular attributes are "avg = 0 +/- 1." Therefore, everything appears to look good thus far.

    However, my second file, "ModelLoader_v42.xml," is used to import new raw data, load the model, apply the model, and save the results to a comma-separated file.  But when I run this file using the same raw data file as before, the "Meta Data View" shows the following statistics for the label and prediction respectively, "avg = 0.390 +/- 7.132" and "avg = 8.846 +/- 1.677."  In addition, the statistics for all the regular attributes do not appear to be normalized, i.e. "avg = 65.074 +/- 16.351, avg = 0.337 +/- 2.242, etc."  Therefore, even though I selected "return_preprocessing_model" in the "Normalization" operator in the model builder file--neither the regular attributes nor the predictions appear to remain normalized.
    Well, did you actually save both the preprocessing model and the regression model to a file, and then load and apply both of them to your new data? As far as I understand from your process descriptions, you normalized the data in your example set, learned a model, saved it, loaded it again and applied it (on the same example set, which was already normalized). Then of course the data is still normalized and the model is applied to the normalized data. If, however, you apply the regression model to new data that has not been normalized before, you can't expect it to be normalized. What I am trying to say is: in your first file, save the preprocessing model before applying the learner; in your second file, load and apply it before loading and applying the regression model. This should do the trick.
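To make the advice above concrete, here is a minimal sketch in plain Python rather than RapidMiner. The data values and all function names are made up for illustration; it only assumes that normalization is a z-transformation and that the learner is a one-feature least-squares fit. It shows that a regression fitted on normalized data gives shifted predictions on raw data unless the stored normalization parameters are applied first.

```python
from statistics import mean, pstdev

def fit_zscore(values):
    """Return (mean, std) so the same transform can be reused on new data."""
    return mean(values), pstdev(values)

def apply_zscore(values, params):
    """Apply a previously fitted z-transformation."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

def fit_line(xs, ys):
    """Ordinary least squares for one feature: y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

def predict(xs, line):
    slope, intercept = line
    return [slope * x + intercept for x in xs]

# Hypothetical training data, not taken from the thread.
raw_x = [50.0, 60.0, 70.0, 80.0]
y = [-1.0, 0.0, 1.0, 2.0]

norm_params = fit_zscore(raw_x)                        # the "preprocessing model"
line = fit_line(apply_zscore(raw_x, norm_params), y)   # the "regression model"

# Loader without the preprocessing model: regression applied to raw data.
wrong = predict(raw_x, line)
# Loader with both models: normalize first, then apply the regression.
right = predict(apply_zscore(raw_x, norm_params), line)

print("mean prediction on raw data:       ", mean(wrong))
print("mean prediction on normalized data:", mean(right))
```

The second mean matches the label mean, while the first is far off, which is the same kind of symptom as the shifted prediction average described in the question.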
    Darrell wrote:

    Please let me know how I can transfer the xml and data file to you for verification...
    Well, simply copy and paste them into the forum post and enclose them in the tags the forum supplies for program code.

    Hope this was helpful,
    Tobias
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    you have to save / load / apply both models as Tobias has pointed out, or alternatively you can use the ModelGrouper operator. I have posted an answer on how this works here:

    http://rapid-i.com/rapidforum/index.php/topic,211.0.html

    Cheers,
    Ingo
  • Darrell Member Posts: 16 Maven
    First, sorry for starting a separate thread on this same topic.  Since my first reply timed out, I didn't realize I had created a new thread when I reposted my reply.  Also, I wanted to compliment your team on providing such a great product!

    Regarding the "normalization" issue, I used the "ModelGrouper" operator in version 4.2 as suggested, but I still can't get my model predictions to correlate between versions 4.1 and 4.2.  However, using your suggestion I was able to get all of the attributes to normalize correctly, but the prediction values are still very different.  I'm sure that I must have a logic error somewhere, but I just can't find it.

    Below are copies of the files I used for testing with Version 4.1 and Version 4.2.  While the avg and std of the prediction using version 4.1 appear correct, the avg and std of the prediction using version 4.2 appear faulty.

    Copy of "ModelBuilder_v41"
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Program Files\Rapid-I\temp\sector_attributes_v41.aml"/>
        </operator>
        <operator name="Normalization" class="Normalization">
            <parameter key="return_preprocessing_model" value="true"/>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="LinearRegression" class="LinearRegression">
                <parameter key="eliminate_colinear_features" value="false"/>
                <parameter key="feature_selection" value="none"/>
                <parameter key="keep_example_set" value="true"/>
            </operator>
            <operator name="ModelWriter" class="ModelWriter">
                <parameter key="model_file" value="C:\Program Files\Rapid-I\temp\test_v41.mod"/>
            </operator>
        </operator>
    </operator>
    Copy of "ModelLoader_v41"
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Program Files\Rapid-I\temp\sector_attributes_v41.aml"/>
        </operator>
        <operator name="ModelLoader" class="ModelLoader">
            <parameter key="model_file" value="C:\Program Files\Rapid-I\temp\test_v41.mod"/>
        </operator>
        <operator name="ModelApplier" class="ModelApplier">
            <list key="application_parameters">
            </list>
        </operator>
        <operator name="CSVExampleSetWriter" class="CSVExampleSetWriter">
            <parameter key="column_separator" value=","/>
            <parameter key="csv_file" value="C:\Program Files\Rapid-I\temp\tempResults.csv"/>
        </operator>
    </operator>
    After running both programs in Version 4.1, the statistics of the prediction are "avg = 0.390 +/- 0.261." 

    Copy of "ModelBuilder_v42"
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Program Files\Rapid-I\temp\sector_attributes_v42.aml"/>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="Normalization" class="Normalization">
                <parameter key="return_preprocessing_model" value="true"/>
            </operator>
            <operator name="LinearRegression" class="LinearRegression">
                <parameter key="feature_selection" value="none"/>
                <parameter key="keep_example_set" value="true"/>
            </operator>
            <operator name="ModelGrouper" class="ModelGrouper">
            </operator>
            <operator name="ModelWriter" class="ModelWriter">
                <parameter key="model_file" value="C:\Program Files\Rapid-I\temp\test_v42.mod"/>
            </operator>
        </operator>
    </operator>
    Copy of "ModelLoader_v42"
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Program Files\Rapid-I\temp\sector_attributes_v42.aml"/>
        </operator>
        <operator name="ModelLoader" class="ModelLoader">
            <parameter key="model_file" value="C:\Program Files\Rapid-I\temp\test_v42.mod"/>
        </operator>
        <operator name="ModelApplier" class="ModelApplier">
            <list key="application_parameters">
            </list>
        </operator>
        <operator name="CSVExampleSetWriter" class="CSVExampleSetWriter">
            <parameter key="column_separator" value=","/>
            <parameter key="csv_file" value="C:\Program Files\Rapid-I\temp\tempResults.csv"/>
        </operator>
    </operator>
    After running both programs using Version 4.2, the statistics of the prediction are "avg = 8.846 +/- 1.677." 

    Therefore, I still can't figure out why the prediction average and standard deviation are so different in Version 4.2.

    Thanks again for your great support and your fantastic product.

    best regards,

    Darrell
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Darrell,

    ok, I see. I think I found the cause of the problem. The ModelGrouper operator adds the models in reverse order, beginning with the last one and ending with the first one. That means the prediction model is applied first and the data is normalized afterwards. You could use an IOSelector to exchange the order of both models before grouping them, as in the following process:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="sum"/>
        </operator>
        <operator name="Normalization" class="Normalization">
            <parameter key="return_preprocessing_model" value="true"/>
        </operator>
        <operator name="LinearRegression" class="LinearRegression">
            <parameter key="eliminate_colinear_features" value="false"/>
            <parameter key="feature_selection" value="none"/>
        </operator>
        <operator name="IOSelector" class="IOSelector">
            <parameter key="io_object" value="Model"/>
            <parameter key="select_which" value="2"/>
        </operator>
        <operator name="ModelGrouper" class="ModelGrouper">
        </operator>
        <operator name="ModelWriter" class="ModelWriter">
            <parameter key="model_file" value="model_group_lin_test.mod"/>
        </operator>
    </operator>
    You can check the order by looking at the combined model. The models will be applied in the order in which they are listed in the grouped model.
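The effect of the model order can be illustrated with a small sketch in plain Python. This is not RapidMiner code: the two functions are hypothetical stand-ins for the normalization and regression models, and all numbers are made up. It shows why a reversed group still normalizes the attributes but produces predictions computed from the raw values.

```python
from statistics import mean

# Hypothetical stand-ins for the two stored models; all values are made up.
MU, SIGMA = 65.0, 10.0          # parameters of the normalization model
SLOPE, INTERCEPT = 2.0, 0.5     # regression fitted on normalized data

def normalize(table):
    """Preprocessing model: z-transform attribute x, keep other columns."""
    return {**table, "x": [(v - MU) / SIGMA for v in table["x"]]}

def regress(table):
    """Prediction model: add a prediction column from the current x values."""
    return {**table, "prediction": [SLOPE * v + INTERCEPT for v in table["x"]]}

def apply_group(models, table):
    """Apply grouped models one after another, in list order."""
    for model in models:
        table = model(table)
    return table

raw = {"x": [55.0, 65.0, 75.0]}

# Reversed group: the prediction is computed on raw x, then x is normalized.
reversed_order = apply_group([regress, normalize], raw)
# Intended group: x is normalized first, then the prediction is computed.
intended_order = apply_group([normalize, regress], raw)

print("attributes identical:", reversed_order["x"] == intended_order["x"])
print("mean prediction, reversed order:", mean(reversed_order["prediction"]))
print("mean prediction, intended order:", mean(intended_order["prediction"]))
```

In both cases the attribute ends up normalized, so looking only at the attribute statistics does not reveal the problem; only the prediction statistics differ.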

    Cheers,
    Ingo
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Another note: I just changed the latest CVS version in a way that the IOSelector will no longer be necessary and the models are grouped in the "creation order" (unless this was changed by other operators).

    Cheers,
    Ingo
  • Darrell Member Posts: 16 Maven
    Ingo,

    Using the IOSelector operator as you described fixed the issue.  Thanks again for all your great support!  I don't know how long it would have taken me to figure that one out, if ever.

    best regards,

    Darrell