"TextMining using LibSVMLearner -- does sort order of Excel input file matter?"

wotsiznamiz Member Posts: 9 Contributor II
edited May 23 in Help
I am using the following code to text-mine a ~10,000 row Excel Record Set.  The Excel file has three columns: (1) the label, (2) the text, and (3) the ID.

I have noticed something peculiar -- when I sort the Excel file differently, the model that is produced is dramatically different.  For example, if I sort on the label column, RapidMiner produces much better results than if I sort on ID.  Should I always be sorting on the label column?  I would have thought that RapidMiner would produce the same results on inputs sorted in any manner.  Is this a bug?  Can I rely on my results after seeing this behavior?

<operator name="Root" class="Process" expanded="yes">
    <parameter key="logfile" value="C:\RapidMiner\NPS_PaymentStatus\log.log"/>
    <parameter key="resultfile" value="C:\RapidMiner\NPS_PaymentStatus\Result_file.res"/>
    <operator name="MemoryCleanUp_START" class="MemoryCleanUp">
    </operator>
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="excel_file" value="C:\RapidMiner\NPS_PaymentStatus\RapidMiner_PaymentStatus_MASTER_MinusNEUTRALS&amp;BLANKS.xls"/>
        <parameter key="first_row_as_names" value="true"/>
        <parameter key="create_label" value="true"/>
        <parameter key="create_id" value="true"/>
        <parameter key="id_column" value="3"/>
    </operator>
    <operator name="Nominal2String" class="Nominal2String">
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="yes">
        <parameter key="remove_original_attributes" value="true"/>
        <parameter key="prune_below" value="10"/>
        <list key="namespaces">
        </list>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="StopwordFilterFile" class="StopwordFilterFile">
            <parameter key="file" value="C:\RapidMiner\NPS_PaymentStatus\STOPWORDS.txt"/>
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
        </operator>
        <operator name="LovinsStemmer" class="LovinsStemmer">
        </operator>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter">
        <parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE.dat"/>
        <parameter key="attribute_description_file" value="C:\RapidMiner\NPS_PaymentStatus\ATTRIBUTE_DESCRIPTION_FILE.aml"/>
        <parameter key="quote_nominal_values" value="false"/>
    </operator>
    <operator name="MemoryCleanUp_02" class="MemoryCleanUp">
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="create_complete_model" value="true"/>
        <parameter key="number_of_validations" value="2"/>
        <operator name="LibSVMLearner" class="LibSVMLearner">
            <parameter key="keep_example_set" value="true"/>
            <parameter key="kernel_type" value="linear"/>
            <parameter key="degree" value="1"/>
            <list key="class_weights">
            </list>
            <parameter key="calculate_confidences" value="true"/>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <parameter key="keep_model" value="true"/>
                <list key="application_parameters">
                </list>
                <parameter key="create_view" value="true"/>
            </operator>
            <operator name="BinominalClassificationPerformance" class="BinominalClassificationPerformance">
                <parameter key="keep_example_set" value="true"/>
                <parameter key="main_criterion" value="AUC"/>
                <parameter key="AUC" value="true"/>
                <parameter key="precision" value="true"/>
                <parameter key="recall" value="true"/>
                <parameter key="lift" value="true"/>
                <parameter key="fallout" value="true"/>
                <parameter key="f_measure" value="true"/>
                <parameter key="false_positive" value="true"/>
                <parameter key="false_negative" value="true"/>
                <parameter key="true_positive" value="true"/>
                <parameter key="true_negative" value="true"/>
                <parameter key="sensitivity" value="true"/>
                <parameter key="specificity" value="true"/>
                <parameter key="youden" value="true"/>
                <parameter key="positive_predictive_value" value="true"/>
                <parameter key="negative_predictive_value" value="true"/>
                <parameter key="psep" value="true"/>
            </operator>
            <operator name="ECS_ModelResults" class="ExampleSetWriter">
                <parameter key="example_set_file" value="C:\RapidMiner\NPS_PaymentStatus\EXAMPLE_SET_FILE_MODEL.dat"/>
                <parameter key="format" value="special_format"/>
                <parameter key="special_format" value="$i $l $p $d"/>
            </operator>
            <operator name="PerformanceWriter" class="PerformanceWriter">
                <parameter key="performance_file" value="C:\RapidMiner\NPS_PaymentStatus\NPS_PaymentStatus.per"/>
            </operator>
            <operator name="ResultWriter" class="ResultWriter">
                <parameter key="result_file" value="C:\RapidMiner\NPS_PaymentStatus\Result_file.res"/>
            </operator>
        </operator>
    </operator>
    <operator name="MemoryCleanUp_END" class="MemoryCleanUp">
    </operator>
</operator>


Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,622  RM Founder
    Hi,

    Your process looks good to me in general (at least judging from the XML code alone  ;) )

    "I would have thought that RapidMiner would produce the same results on inputs sorted in any manner."
    Not necessarily. This depends entirely on the learning scheme. However, with a 2-fold cross-validation alone you probably cannot make any definite statement about the performance of the models. If the dramatic change in prediction performance still holds for a 10-times-repeated 10-fold cross-validation, I would be more worried  ;)

    Cheers,
    Ingo
  • wotsiznamiz Member Posts: 9 Contributor II
    THX!
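
For readers who want to see why row order and a 2-fold split can interact, here is a minimal Python sketch. It is not RapidMiner code and makes no claim about how XValidation builds its folds; whether the effect applies to the process above depends on the sampling settings, which the thread does not state. The helper names (`kfold_indices`, `training_label_counts`, `repeated_kfold`) and the toy labels are hypothetical. It shows two things: a split into contiguous blocks follows the file's row order, and a repeated, shuffled 10-fold scheme like the one suggested in the answer does not.

```python
import random
from collections import Counter


def kfold_indices(n_rows, k, rng=None):
    """Split row indices into k near-equal folds.

    Without rng the folds are contiguous blocks in file order;
    with rng the indices are shuffled before being cut into blocks.
    """
    indices = list(range(n_rows))
    if rng is not None:
        rng.shuffle(indices)
    size, rem = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < rem else 0)
        folds.append(indices[start:end])
        start = end
    return folds


def training_label_counts(labels, folds):
    """For each held-out fold, count the labels left in the training rows."""
    counts = []
    for held_out in folds:
        held = set(held_out)
        counts.append(Counter(lab for i, lab in enumerate(labels) if i not in held))
    return counts


def repeated_kfold(n_rows, k, repeats, seed):
    """Build `repeats` independent shuffled k-fold splits (hypothetical helper)."""
    rng = random.Random(seed)
    return [kfold_indices(n_rows, k, rng) for _ in range(repeats)]


# Toy data: 10 rows sorted by label, split into 2 contiguous folds.
labels_sorted = ["neg"] * 5 + ["pos"] * 5
unshuffled = training_label_counts(labels_sorted, kfold_indices(10, k=2))
print(unshuffled)  # each training half contains only one class

# 10 repetitions of shuffled 10-fold splitting: 100 train/test splits in total.
splits = repeated_kfold(n_rows=100, k=10, repeats=10, seed=0)
print(len(splits), "repeats x", len(splits[0]), "folds")
```

Averaging a performance measure over all 100 splits gives a far more stable estimate than two folds, so a difference between sort orders that survives it is more likely to be real.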