Is it possible to keep track of texts after input?

Legacy User Member Posts: 0 Newbie
edited November 2018 in Help
Hi everyone,

After importing texts with the Text plugin, is it possible to keep the data linked to the source text files throughout the process, for later use?

It seems to me that this could be useful in many situations, but I'm not sure if it can be done.

In my case, I am clustering a few segments of texts. In the graph view of the cluster model, each segment is represented by its corresponding identification number. It would be nice if, by clicking on an id, the full text included in the segment's file could be displayed.

Thanks in advance for your help.

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    This is all possible. Please check out the sample processes in the Text Plugin Samples, which can be downloaded from our SourceForge web page (e.g. 03_DimensionalityReduction.xml). The trick is to use IDs associated with the texts (for example, activate the "long" ids in the TextInput operator) and also to activate the parameter "create_text_visualizer". If you double-click on a point in a plot, a dialog containing the full text will be shown. Here is the sample:

    <operator name="Root" class="Process" expanded="yes">
        <description text="#ylt#h3#ygt#Dimensionality reduction on text documents#ylt#/h3#ygt##ylt#p#ygt#In this experiment, texts are visualized on a 2D area by applying dimensionality reduction. After the experiment finished, select the attributes d1 and d2 as x and y in the scatter plotter and the attribute label as plot#ylt#/p#ygt#. #ylt#p#ygt##ylt#b#ygt#Hint:#ylt#/b#ygt# Double-click on any point to see a pop-up window with the full text.#ylt#/p#ygt#"/>
        <operator name="TextInput" class="TextInput" expanded="yes">
            <parameter key="create_text_visualizer" value="true"/>
            <parameter key="id_attribute_type" value="long"/>
            <list key="texts">
              <parameter key="graphics" value="../data/newsgroup/graphics"/>
              <parameter key="hardware" value="../data/newsgroup/hardware"/>
            </list>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
                <parameter key="min_chars" value="3"/>
            </operator>
            <operator name="PorterStemmer" class="PorterStemmer">
            </operator>
        </operator>
        <operator name="SVDReduction" class="SVDReduction">
        </operator>
    </operator>
    Cheers,
    Ingo
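
The idea behind the answer above (give every text an id that points back to its source file, then resolve the id when a point is clicked) can be sketched outside RapidMiner as well. The following is only a minimal illustration in Python, not how the Text plugin is implemented. The function names `build_id_index` and `full_text` are hypothetical, and using the file path as the id is an assumption about what a "long" id looks like.

```python
from pathlib import Path
import tempfile


def build_id_index(class_dirs):
    """Map a path-based id to (label, source file) for every file in each directory.

    class_dirs mirrors the "texts" list in the process: label -> directory.
    Using the file path as the id is an assumption made for this sketch.
    """
    index = {}
    for label, directory in class_dirs.items():
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():
                index[str(path)] = (label, path)
    return index


def full_text(index, doc_id):
    """Return the full source text for an id, e.g. when a plot point is clicked."""
    _label, path = index[doc_id]
    return path.read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Small self-contained demo using a temporary directory instead of real data.
    with tempfile.TemporaryDirectory() as tmp:
        graphics = Path(tmp) / "graphics"
        graphics.mkdir()
        (graphics / "post1.txt").write_text("A question about rendering.", encoding="utf-8")

        index = build_id_index({"graphics": graphics})
        for doc_id in index:
            print(doc_id, "->", full_text(index, doc_id))
```

Because the id survives tokenizing, filtering, stemming and dimensionality reduction unchanged, any later view (a scatter plot, a cluster graph) only needs the id to retrieve the original text.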