Transforming output from Process Docs to create a word list/document

DataLib Member Posts: 2 Contributor I
edited April 29 in Help
Hi there...

We have a challenge to create word/tag clouds from a database system...

Easy, I thought: create a table with the Document ID as the first column, the word as the second column and the count of that word in the document as the third (we probably wouldn't use the third column, but just in case). That way we could create a word cloud very quickly, no matter which subset of documents the user selects.

So I have set up the job in RapidMiner: it reads the records from the database, including only the Document ID and the full text field, and passes them through the Process Documents operator (tokenise, transform case, filter stop words, filter tokens, stem)... Job done...

Unfortunately no... and here is my problem.

The data that comes out of the Process Documents operator has the Document ID as the first column, but every word that is found then becomes the name of one of the remaining columns... I have looked at Transpose and Pivot, but neither of these does what I need...

We did think about saving the output as CSV and then doing something outside of RapidMiner, but that would make it a manual process rather than something I can automate to run hourly on new records.
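To make the required reshaping concrete, here is a minimal sketch in plain Python of turning the wide table (one column per word) into the long table described above (Document ID, word, count). The sample rows, the `doc_id` column name and the `to_long` helper are hypothetical, and the sketch assumes the cells hold word counts; it illustrates the target shape only and is not a RapidMiner operator.

```python
# Hypothetical wide table, shaped like the output described above:
# one row per document, one column per word, cell = count of that word.
wide_rows = [
    {"doc_id": 1, "cloud": 2, "word": 1, "tag": 0},
    {"doc_id": 2, "cloud": 0, "word": 3, "tag": 1},
]


def to_long(rows, id_column="doc_id"):
    """Turn each (document, word column) cell into its own row."""
    long_rows = []
    for row in rows:
        for column, count in row.items():
            if column == id_column:
                continue  # the ID labels the row; it is not a word
            if count > 0:  # skip words that do not occur in this document
                long_rows.append((row[id_column], column, count))
    return long_rows


long_rows = to_long(wide_rows)
for doc_id, word, count in long_rows:
    print(doc_id, word, count)
```

Each output row is one (Document ID, word, count) triple, which is the layout a word cloud query can filter by document.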

Any thoughts or ideas will be most appreciated.

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,642  RM Founder
    Hi,

    Did you try the operator "De-Pivot"? This should do the job as far as I can tell from your description.

    Cheers,
    Ingo
  • DataLib Member Posts: 2 Contributor I
    Ingo,

    Thanks for the reply. I have had a quick look and it could work if the list of words (and therefore the columns/attributes) stayed the same... but the list of words is already large, and having to set up the attributes in the De-Pivot operator would take a very long time each time the job was run.

    I have had a quick look at the Cut Document operator, and it would appear to do what I want, except it does not allow any other metadata to be passed through, so I cannot tell which document the words relate to.

    Any suggestions you can make would be really appreciated.
    Chris
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,642  RM Founder
    Hi again,

    "I have had a quick look and it could work if the list of words (and therefore the columns/attributes) stayed the same... but the list of words already is large and having to set up the attributes in the de-pivot task would take a very long time each time the job was run."
    You do not need to set them all up manually; you could use regular expressions instead. Maybe the following example will help:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.017">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
        <process expanded="true" height="224" width="681">
          <operator activated="true" class="retrieve" compatibility="5.1.017" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Market-Data"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="5.1.017" expanded="true" height="76" name="Generate Attributes" width="90" x="179" y="30">
            <list key="function_descriptions">
              <parameter key="AMOUNT" value="1"/>
            </list>
          </operator>
          <operator activated="true" class="pivot" compatibility="5.1.017" expanded="true" height="76" name="Pivot" width="90" x="313" y="30">
            <parameter key="group_attribute" value="TID"/>
            <parameter key="index_attribute" value="ITEM"/>
            <parameter key="skip_constant_attributes" value="false"/>
          </operator>
          <operator activated="true" class="de_pivot" compatibility="5.1.017" expanded="true" height="76" name="De-Pivot" width="90" x="447" y="30">
            <list key="attribute_name">
              <parameter key="AMOUNT" value="AMOUNT.*"/>
            </list>
            <parameter key="index_attribute" value="ITEM"/>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Pivot" to_port="example set input"/>
          <connect from_op="Pivot" from_port="example set output" to_op="De-Pivot" to_port="example set input"/>
          <connect from_op="De-Pivot" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Of course, you could also use ".*" to match all attributes, but you should probably exclude the attribute used for identifying the groups. This should do the trick.

    "I have had a quick look at the Cut Document operator, and it would appear to do what I want, except it does not allow for any other meta data to be passed through so I cannot tell what document the words relate to."
    That could also be a possible approach. Maybe you could multiply the data beforehand, use Cut Document in one path and join both data sets afterwards?

    Cheers,
    Ingo
  • ItMakesSense Member Posts: 1 Contributor I
    Hi Chris,

    Did you ever solve your challenge? I'm trying to do the same thing, but without success.

    If I use ".*" as Ingo suggests, I get the following error:

    'attributes have different value types:no conversion is performed.'

    Thanks

    Scott

    EDIT
    I have now realised what I was doing wrong.

    Using the following regular expression did the trick:

    [^id].*
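A note on the expression above: `[^id]` is a character class, so it rejects every attribute name whose first character is "i" or "d", not only the attribute named "id". The sketch below compares it with a negative lookahead, `(?!id$).*`, which skips only the exact name "id". The attribute names and the `select` helper are hypothetical, and the sketch assumes the expression has to match the whole attribute name. It uses Python's `re` module for illustration; Java regular expressions also support negative lookahead, but the behaviour inside De-Pivot is not verified here.

```python
import re

# Hypothetical attribute names; "id" stands for the document ID column.
attributes = ["id", "idea", "data", "cloud", "word"]


def select(pattern, names):
    """Return the names that the pattern matches in full."""
    return [name for name in names if re.fullmatch(pattern, name)]


# Character class: drops every name starting with "i" or "d".
char_class = select(r"[^id].*", attributes)

# Negative lookahead: drops only the exact name "id".
lookahead = select(r"(?!id$).*", attributes)

print(char_class)
print(lookahead)
```

With the character class, words such as "idea" and "data" would silently drop out of the word list, so the lookahead form may be the safer choice when the vocabulary is large.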

