"text input from a single text file using text plugin"

angshu Member Posts: 3 Contributor I
edited May 2019 in Help
Hi,

I am new to the text plugin and am trying to do some text clustering in RapidMiner with it. All my text is in one file, in which each line needs to be treated as a separate document. I tried using SplitSegmenter, but since it creates a new file for every line, disk usage blows up, which will hamper scalability.

Can someone suggest a way to cluster the individual lines of the same file so that I don't have to create separate files?

Appreciate your response
Regards
Angshu

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    this is possible. You have to use a little trick: load the file using the CSVExampleSource operator and configure it so that only one column is created from the file. To do so, specify a text that never occurs in the file as the column separation regular expression. Then insert a Nominal2String operator to change the value type to string. After this, using StringTextInput, you can transform the texts into word vectors for clustering. To simplify your life, I append a sample process:
    <operator name="Root" class="Process" expanded="yes">
        <operator name="CSVExampleSource" class="CSVExampleSource">
            <parameter key="filename" value="C:\Dokumente und Einstellungen\sland\Desktop\test.txt"/>
            <parameter key="read_attribute_names" value="false"/>
            <parameter key="column_separators" value="This text never occures in the file --- sdhaksj dhaskljdh alkdjsh sa"/>
            <parameter key="use_comment_characters" value="false"/>
        </operator>
        <operator name="Nominal2String" class="Nominal2String">
        </operator>
        <operator name="StringTextInput" class="StringTextInput" expanded="yes">
            <list key="namespaces">
            </list>
        </operator>
    </operator>

    Greetings,
      Sebastian
  • ram_nit05 Member Posts: 12 Contributor II
    Hi Angshu,

    Just to add to what Sebastian was saying: in GUI form, you can use the following operator flow.

    1. ExampleSource - configure your input (tab- or comma-delimited; the format of the input fields, e.g. nominal or string; and the type of each variable: label for the dependent variable, attribute for independent variables, id for keys), then save it in an attribute file.

    2. StringTextInput - for generating word vectors. For further info, visit http://kmandcomputing.blogspot.com/2008/06/opinion-mining-with-rapidminer-quick.html

    I had faced the same problem and the flow mentioned above helped.

    Thanks,
    Ram


  • angshu Member Posts: 3 Contributor I
    Thanks Sebastian and Ram, your replies helped a lot

    Best Regards
    Angshu
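
For readers who want to see the idea from the answers above outside of RapidMiner, here is a minimal Python sketch of the same approach: read one file, treat each line as its own document, and turn every document into a word vector. It is not the text plugin's implementation. The function names are made up for illustration, and the TF-IDF weighting is an assumption, since the thread only speaks of "word vectors". The clustering step itself is left out.

```python
import math
import re
from collections import Counter


def lines_to_documents(text):
    """Treat every non-empty line of the input as one document."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def tokenize(document):
    """Lowercase a document and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", document.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumption: plain TF-IDF (relative term frequency times log(N / df)).
    """
    token_lists = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


if __name__ == "__main__":
    # Hypothetical sample input; in practice, read the text from your file.
    sample = "the cat sat\n\nthe dog barked\ncats and dogs\n"
    docs = lines_to_documents(sample)
    vocab, vecs = tfidf_vectors(docs)
    print(len(docs), "documents,", len(vocab), "terms")
```

No per-line files are written here: the documents live in memory as a list, which is the same effect the single-column CSVExampleSource trick achieves inside RapidMiner. The resulting vectors could then be passed to any clustering routine.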