"Applying Feature Selection on text input"

jebadiah · July 2009

Hello. I am new to using RapidMiner so please excuse my ignorance.

I am trying to perform K-Means Clustering on a set of text files. I have downloaded and installed the plug-in needed to input text files. Now, I want to apply Feature Selection to it. However, when I try to, it seems that it needs an ExampleSet to be able to perform the Feature Selection function. Is there a way for me to apply Feature Selection on text input?

Here is what my XML looks like right now:

<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
            <parameter key="blogs" value="D:\Text-files"/>
        </list>
        <parameter key="vector_creation" value="TermFrequency"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="StopwordFilterFile" class="StopwordFilterFile">
            <parameter key="file" value="D:\stop.txt"/>
        </operator>
        <operator name="StopwordFilterFile (2)" class="StopwordFilterFile">
            <parameter key="file" value="D:\punctuations.txt"/>
        </operator>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="8"/>
    </operator>
</operator>

When I try to add the following:

<operator name="BackwardElimination" class="FeatureSelection" expanded="yes">
    <parameter key="selection_direction" value="backward"/>
</operator>

The following error occurs:

Error in: TextInput (TextInput) Error in experiment setup: com.rapidminer.operator.MissingIOObjectException: The operator needs some input of type com.rapidminer.example.ExampleSet which is not provided

Can anyone please suggest something to help me do this? Thank you very much. :-*

jebadiah · July 2009

Hi again. I was able to produce this XML file:

<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
            <parameter key="blogs" value="D:\Blogs-final"/>
        </list>
        <parameter key="vector_creation" value="TermFrequency"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="StopwordFilterFile" class="StopwordFilterFile">
            <parameter key="file" value="C:\Users\Jhermin\Desktop\dyermin\Thesis\src\Files\stop.txt"/>
        </operator>
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="random"/>
        </operator>
    </operator>
    <operator name="BackwardElimination" class="FeatureSelection" breakpoints="after" expanded="yes">
        <parameter key="selection_direction" value="forward"/>
        <parameter key="show_stop_dialog" value="true"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="8"/>
    </operator>
</operator>

but it returns this error:

Root[1] (Process)
+- TextInput[1] (TextInput)
| +- StringTokenizer[1] (StringTokenizer)
| +- StopwordFilterFile[1] (StopwordFilterFile)
| +- ExampleSetGenerator[1] (ExampleSetGenerator)
here ==> | +- BackwardElimination[1] (FeatureSelection)
+- KMeans[0] (KMeans)

I would really appreciate it if anyone has any idea why this error appears. Thanks a lot.

jebadiah · July 2009

No one? Please? I really need to do this. Thanks in advance.

fischer · July 2009

Hi,

Well, the approach you are taking is a bit, umh, ... broken. Feature selection does not work this way. An example of a ForwardSelection is in the samples folder under 05_features/10_ForwardSelection.xml. The important point is: you need to have your learner inside the forward selection. Otherwise, it does not know how to optimize. In general, the FS operator takes an ExampleSet and must contain operators that are able to evaluate such an example set by producing a PerformanceVector.
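To make the "learner inside the selection" idea concrete, here is a minimal Python sketch of a generic greedy forward selection. It is not RapidMiner code and not necessarily how the FeatureSelection operator is implemented. The names `forward_selection` and `toy_performance` are made up for illustration, and `toy_performance` is only a stand-in for "train the learner on this attribute subset and return its cross-validated performance".

```python
from typing import Callable, List, Tuple


def forward_selection(
    n_features: int,
    evaluate: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Greedy wrapper selection: grow the subset while the score improves.

    `evaluate` plays the role of the inner operators (learner plus
    performance evaluation): it receives a list of attribute indices
    and returns one number, where higher is better.
    """
    selected: List[int] = []
    best_score = float("-inf")
    remaining = set(range(n_features))

    while remaining:
        # Score every subset obtained by adding one more attribute.
        candidates = [(evaluate(selected + [f]), f) for f in sorted(remaining)]
        score, feature = max(candidates, key=lambda c: c[0])
        if score <= best_score:
            break  # no candidate improves the performance: stop
        selected.append(feature)
        remaining.remove(feature)
        best_score = score

    return selected, best_score


def toy_performance(subset: List[int]) -> float:
    """Hypothetical score: attributes 1 and 3 help, every attribute costs 1."""
    useful = {1, 3}
    return 10 * len(useful & set(subset)) - len(subset)


if __name__ == "__main__":
    print(forward_selection(5, toy_performance))
```

Without the `evaluate` callback the loop has nothing to compare subsets by, which is the situation of a feature selection operator with no learner and no performance evaluation inside it.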

As an aside, it might turn out that it is a bad idea to try backward elimination on text data.

Best,
Simon

jebadiah · July 2009

Hello, thank you for your reply.

I am currently trying out this XML:

<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
            <parameter key="blogs" value="D:\Blogs"/>
        </list>
        <parameter key="vector_creation" value="TermFrequency"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="StopwordFilterFile" class="StopwordFilterFile">
            <parameter key="file" value="C:\Users\Jhermin\Desktop\dyermin\Thesis\src\Files\stop.txt"/>
        </operator>
    </operator>
    <operator name="FS" class="FeatureSelection" expanded="yes">
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="create_complete_model" value="true"/>
            <parameter key="number_of_validations" value="5"/>
            <parameter key="sampling_type" value="shuffled sampling"/>
            <operator name="NearestNeighbors" class="NearestNeighbors">
                <parameter key="k" value="5"/>
            </operator>
            <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                <operator name="Applier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="Performance" class="Performance">
                </operator>
            </operator>
        </operator>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="3"/>
    </operator>
</operator>

However, it runs very slowly, and it cannot handle about 300 text files: it fails with a Java heap space error. I have tried changing the rapidminerGUI script, but nothing changes. Do you have any idea how I can increase the maximum heap size?

Thank you very much. You are very helpful.

land · July 2009

Hi,
the topic of adjusting the maximum heap size has been discussed in this forum a lot of times. Please use the search button to find one of the discussions and its solutions.

Greetings,
Sebastian

keith · July 2009

Or check the RM Wiki page on the topic: http://rapid-i.com/wiki/index.php?title=Memory_Issues

land · July 2009

Good hint. It seems I'm not used to the Wiki yet.

Marcello_Sandi · July 2009

Hi,

There is an interesting problem with this model. I ran it on my optimized workstation, which has 7 GB of memory dedicated to the JVM and customized JVM arguments.

I used the hardware and graphics examples, and a RuntimeException was caught: java.lang.OutOfMemoryError: GC overhead limit exceeded. That is very strange for such small data sets.

On this workstation I have already run a bag of words (BOW) with 9700 words and 8500 lines.

Watching the process with the top command on Linux, I noticed several Java PIDs while the model was running.

Marcello Sandi

land · July 2009

Hi Marcello,
we don't start any other Java processes, so this is probably an artifact from somewhere else...

We are aware that the feature selection sometimes has problems on example sets with a very large number of attributes. Since such large numbers mostly occur in text mining, and feature selection in text mining is of limited use, the problem was not a top priority.
But with the next major release we will add a more memory-efficient variant.
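For a rough feeling of why a large number of attributes hurts, here is a small Python sketch that counts how many models a plain greedy forward selection would have to train. It assumes every remaining attribute is tried in each round and every candidate is scored with k-fold cross-validation; the actual operator may stop earlier or work differently. The function names and the vocabulary size of 10,000 are hypothetical. The fold count of 5 matches the `number_of_validations` value used earlier in this thread.

```python
from typing import Optional


def forward_selection_evaluations(n_attributes: int,
                                  max_rounds: Optional[int] = None) -> int:
    """Number of attribute subsets scored by plain greedy forward selection.

    Round r (starting at 0) tries each of the n_attributes - r attributes
    that are not selected yet. Without max_rounds this is the worst case
    n * (n + 1) / 2, where the selection never stops early.
    """
    rounds = n_attributes if max_rounds is None else min(max_rounds, n_attributes)
    return sum(n_attributes - r for r in range(rounds))


def model_trainings(n_attributes: int, folds: int,
                    max_rounds: Optional[int] = None) -> int:
    """Each scored subset costs one model training per cross-validation fold."""
    return forward_selection_evaluations(n_attributes, max_rounds) * folds


if __name__ == "__main__":
    vocabulary = 10_000  # assumed number of term attributes, not from the thread
    print("first round only:", model_trainings(vocabulary, folds=5, max_rounds=1))
    print("worst case:", model_trainings(vocabulary, folds=5))
```

Even the first round alone means tens of thousands of trainings for a vocabulary of that size, which fits the observation that the process runs very slowly on text data.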

Greetings,
Sebastian
