# "getting started with text-mining"

Member Posts: 0 Newbie
edited May 2019 in Help
Hi,

I want to take a look at the text-mining part of RapidMiner. I am evaluating some products as part of a case study about text mining.

So what I want to do is:
• to get data input from a file or the web (I did this with the input and the crawler)
• to get the feature extraction running -> I can extract some entities here, right?
• to get the English stop word filter running
I don't know why, but the feature extraction doesn't work with the text I'm using as input. Can you give me or point me to a quick tutorial on how to use these things?

Thanks a lot!

Benjamin

• Member Posts: 0 Newbie
Oh, I didn't see that I can start a feature selection from the wizard. I'll try it that way.
• Member Posts: 0 Newbie
I still can't get this running. I use the example from the wizard with text from a file. The error occurs in the Learner:

Many operators like classification and regression methods or the PerformanceEvaluator require the input example sets to have a label or class attribute. If this is not the case, applying these operators is pointless. If you read the data using an ExampleSource, you can specify the label attribute by using a 'label' tag in the attribute description file.

Any suggestions?
• Moderator, Employee, Member Posts: 291  RM Product Management
Hi Benjamin,

I must admit I do not yet completely understand what you are trying to do. Could you please post the XML representation of your RapidMiner process?

By the way, the error you mention occurs when no label is defined but you try to do supervised learning.
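
To illustrate what the error message means by a 'label' tag in the attribute description file read by `ExampleSource`, here is a rough sketch of such a file. The data file name, the attribute names and the column numbers are made up for illustration, and the XML attributes are written from memory of RapidMiner 4.x, so please compare it with a file generated by the wizard rather than taking it as the exact format.

```xml
<!-- Hypothetical attribute description file (.aml); all names and columns are placeholders -->
<attributeset default_source="example.dat">
    <!-- regular attributes: used as input features -->
    <attribute name="att1" sourcecol="1" valuetype="nominal"/>
    <attribute name="att2" sourcecol="2" valuetype="nominal"/>
    <!-- the 'label' tag marks the column a supervised learner should predict -->
    <label name="class" sourcecol="3" valuetype="nominal"/>
</attributeset>
```

Without an entry like the `label` line, every column is treated as a regular attribute and a learner has nothing to predict.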

Regards,
Tobias
• Member Posts: 0 Newbie
Sorry, I know my description was quite confusing. Let me try to explain it again:

We're doing a study of text-mining software at my university, and we want to compare the tools with each other. We're looking at three different programs, and we also want to look at a "free" program like RapidMiner. To that end, I want to determine what RapidMiner is capable of.

So let's get to the technical part.

I installed the program and selected the feature selection in the wizard. As the input file I used a .txt file containing some random text copied from Wikipedia. I saved the output file under an arbitrary name. When I try to run that, I get the error in the learner. I don't know if I'm missing anything, but the program is able to read the text from the file. I'm on Linux right now and will post later how I did it. But perhaps you already have some suggestions.

• Moderator, Employee, Member Posts: 291  RM Product Management
Hi Benjamin,
Benjamin wrote:

We're doing a study of text-mining software at my university, and we want to compare the tools with each other. We're looking at three different programs, and we also want to look at a "free" program like RapidMiner. To that end, I want to determine what RapidMiner is capable of.

Well, RapidMiner in combination with the text plugin (you did install this too, didn't you?) offers a wide range of possibilities for text mining. The text plugin is mainly responsible for extracting features from texts; after that, all of RapidMiner's functionality can be used to actually mine the loaded data.
Benjamin wrote:

I installed the program and selected the feature selection in the wizard. As the input file I used a .txt file containing some random text copied from Wikipedia. I saved the output file under an arbitrary name. When I try to run that, I get the error in the learner. I don't know if I'm missing anything, but the program is able to read the text from the file. I'm on Linux right now and will post later how I did it. But perhaps you already have some suggestions.

OK, let's see. First of all, you did replace the `ExampleSource` operator in the FeatureSelection template from the wizard with a text-specific input operator (e.g. `TextInput`), right? I am asking because I still do not really know where your problem actually arises. It would therefore be far easier if you posted the process XML in this thread. To do that, simply click on the XML tab on the right side and copy and paste the XML code into the forum. Thanks.

Regards,
Tobias
• Member Posts: 0 Newbie
I'll write in German, as that makes it easier to solve the problem. If you don't speak German, tell me and I'll translate it.

(Translated from the German original.)

I can't get a simple run of the feature selection to work. I would just like to run it once to see what it is capable of. Since I am probably just operating it incorrectly, I took a few screenshots.

Procedure: first, the file that I give it to read. It is something very simple: 10 words, each separated by a space.

Then I start the wizard and choose Feature Selection. In the Make Settings window I choose Start Configuration Wizard. There I load the file and select "use first row as column names". In the next window I leave the attribute value types as nominal. See here: (screenshot not preserved)

In the next window I again select nothing new. I have also tried defining one element as the label, but that didn't help.

As the file name for the attribute file I simply choose an arbitrary one and save to that file. After that, my tree looks like this: (screenshot not preserved)

The corresponding XML is:
```xml
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="C:\Dokumente und Einstellungen\benny\Eigene Dateien\rm_workspace\bla.aml"/>
    </operator>
    <operator name="FS" class="FeatureSelection" expanded="yes">
        <operator name="FSChain" class="OperatorChain" expanded="yes">
            <operator name="XValidation" class="XValidation" expanded="yes">
                <operator name="Learner" class="LibSVMLearner">
                </operator>
                <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                    <operator name="Applier" class="ModelApplier">
                    </operator>
                    <operator name="Evaluator" class="Performance">
                    </operator>
                </operator>
            </operator>
            <operator name="ProcessLog" class="ProcessLog">
                <list key="log">
                    <parameter key="generation" value="operator.FS.value.generation"/>
                    <parameter key="performance" value="operator.FS.value.performance"/>
                </list>
            </operator>
        </operator>
    </operator>
</operator>
```
When I run the whole thing, I get this: (screenshot not preserved)

The error does say something about labels being missing, but it would be great if you could give me more detailed feedback, since I want to read in larger amounts of data as a test and analyze them.

Thanks in any case for the feedback so far.
• Moderator, Employee, Member Posts: 291  RM Product Management
Hi Benjamin,

OK, thanks for the detailed description of your problem. Now I understand what you are doing ... wrong! I will try to clarify the picture somewhat. I hope it is OK that I stick to English, since English is the general language used in this forum and there are a lot of international RapidMiner users.

The reason your process does not work is exactly what the error message says: you did not define a label. You can do this in the wizard, at the step where you are asked to specify special attributes. Alternatively, you can finish the wizard first and then put a `ChangeAttributeRole` operator into your operator tree between the `ExampleSource` operator and the `FeatureSelection` operator. You mentioned that the first way did not solve your problem. To be honest, I doubt that it resulted in the same error. So please try again or use the second way.
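
A minimal sketch of the second way, based on your posted process, might look like the fragment below. The attribute name `att1` is only a placeholder for whichever column should become the label, the `.aml` path is shortened, and the parameter keys `name` and `target_role` are given as I remember them from RapidMiner 4.x, so please check them against the operator's parameter list in your installation.

```xml
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <!-- placeholder path; use your own attribute description file -->
        <parameter key="attributes" value="bla.aml"/>
    </operator>
    <!-- assumption: "att1" stands for the attribute that should be used as the label -->
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
        <parameter key="name" value="att1"/>
        <parameter key="target_role" value="label"/>
    </operator>
    <operator name="FS" class="FeatureSelection" expanded="yes">
        <!-- FSChain with XValidation, Learner, Applier and Evaluator, unchanged from your process -->
    </operator>
</operator>
```

The role change has to happen before `FeatureSelection`, because the learner inside the cross-validation is the operator that requires the label.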

Another remark on your so-called test data: as I tried to tell you in the previous posts, text-mining data is normally not just put into an example set as nominal attributes so that a learner can be applied to it afterwards. You should therefore not use the `ExampleSource` operator but an input operator from the text plugin.
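
As a rough idea of what such a text-specific input could look like, here is a sketch that reads texts from directories and applies a tokenizer and the English stop word filter. Please treat it as an assumption-laden outline: the operator and parameter names are recalled from the RapidMiner 4.x text plugin and may differ in your version, and the class names and directory paths are purely hypothetical.

```xml
<!-- Sketch only: verify operator and parameter names against your installed text plugin -->
<operator name="TextInput" class="TextInput" expanded="yes">
    <!-- each entry maps a class name (which becomes the label) to a directory of text files -->
    <list key="texts">
        <parameter key="classA" value="/path/to/texts/classA"/>
        <parameter key="classB" value="/path/to/texts/classB"/>
    </list>
    <!-- inner operators turn each text into word features -->
    <operator name="StringTokenizer" class="StringTokenizer"/>
    <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter"/>
</operator>
```

With this kind of setup the label comes from the class names in the list, so a learner placed after the input operator has something to predict.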

If you only want to see the feature selection in action, I would recommend applying it to a simple, ordinary data set, e.g. one of the data sets that come with RapidMiner in the samples directory.

Regards,
Tobias