Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.

[SOLVED] Transform Document-Term matrix to flat table?

RWingerterRWingerter Member Posts: 38 Contributor II
edited November 2018 in Help
A newbie question.

I have a simple process which uses „Data to Documents“, „Process Documents“ and „Tokenize“  to turn a list of strings into a wordlist.The second result is my ExampleSet turned into a Document-Term Matrix.

My question is: How can I transform the Document-Term matrix (Document_ID x Term) to a flat table with three attributes (Document_ID, Term, occurrences)?

Regards,

Roland

Answers

  • SkirzynskiSkirzynski Member Posts: 164 Maven
    Hey Roland,

    It is more likely to get an answer by posting a (self-)process with a small chunk of your data. Currently I am not sure what you have and what you want.

    Best
      Marcin
  • RWingerterRWingerter Member Posts: 38 Contributor II
    Hi Marcin,

    thanks for your reply. Here is my example data and my simple process.

    The input is a list of user queries (query_id, query, frequency), which is processed with "Process Documents from Data". The result is a word list and a document-term matrix. In addition, I would like to get a term-document table with Term, Query_ID, and TF*IDF, e.g.

    Term Query_ID TF*IDF
    ---------------------------------
    Term1    1        0.34
    Term1    2        0.23
    Term2    3        1.00

    I tried various things without success. Maybe it's not difficult to do, but I didn't manage.

    Sample data:

    Query_ID;Query;frequency
    1;hautarzt;103921
    2;zahnarzt;101684
    3;augenarzt;89233
    4;frauenarzt;75116
    5;arzt;70755
    6;ärzte;65176
    7;zahnärzte;57836
    8;allgemeinarzt;54111
    9;tierarzt;52387
    10;augenärzte;49855
    11;hautärzte;33141
    12;kinderarzt;32989
    13;kinderärzte;26377
    14;hno arzt;22984
    15;tierärzte;22090
    16;frauenärzte;20694
    17;lungenfacharzt;16468
    18;praktische ärzte;14175
    19;hno-ärzte;13290
    20;hausarzt;12595
    21;hautarztpraxen;12262
    22;allgemeinärzte;11906
    23;ärzte allgemeinmedizin und praktische ärzte;11781
    24;ärzte orthopädie;10833
    25;hals nasen ohrenärzte;5457
    26;hno ärzte;4607
    27;hals nasen ohren arzt;4319
    28;ärzte innere medizin;4053
    29;ärzte urologie;3886
    30;ärzte frauenheilkunde und geburtshilfe;3837
    Code:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="5.3.008" expanded="true" height="60" name="Read CSV" width="90" x="45" y="75">
            <parameter key="csv_file" value="C:\Users\retegniw\Documents\Office\Excel\Sample queries.csv"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Query_ID.true.integer.id"/>
              <parameter key="1" value="Query.true.text.attribute"/>
              <parameter key="2" value="frequency.true.integer.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.3.008" expanded="true" height="76" name="Set Role" width="90" x="179" y="75">
            <parameter key="attribute_name" value="Query_ID"/>
            <parameter key="target_role" value="id"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.3.008" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="75"/>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="75">
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:filter_stopwords_german" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="179" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
              <connect from_op="Filter Stopwords (German)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Any and all help welcome.

    Thank you

    Roland
  • SkirzynskiSkirzynski Member Posts: 164 Maven
    The operator you are looking for is the "De-Pivot" which is indeed not easy to use (in my opinion). Unfortunately it cannot handle special attributes, thus, you have to explicitly exclude Query_ID in the "attribute_name" with a negative lookahead, which is not optimal, but it works.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.009">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_csv" compatibility="5.3.009" expanded="true" height="60" name="Read CSV" width="90" x="45" y="75">
           <parameter key="csv_file" value="/home/marcin/temp/forum-6582.csv"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
           </list>
           <parameter key="encoding" value="windows-1252"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="Query_ID.true.integer.id"/>
             <parameter key="1" value="Query.true.text.attribute"/>
             <parameter key="2" value="frequency.true.integer.attribute"/>
           </list>
         </operator>
         <operator activated="true" class="set_role" compatibility="5.3.009" expanded="true" height="76" name="Set Role" width="90" x="179" y="75">
           <parameter key="attribute_name" value="Query_ID"/>
           <parameter key="target_role" value="id"/>
           <list key="set_additional_roles"/>
         </operator>
         <operator activated="true" class="nominal_to_text" compatibility="5.3.009" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="75"/>
         <operator activated="true" class="text:process_document_from_data" compatibility="5.3.001" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="75">
           <parameter key="keep_text" value="true"/>
           <list key="specify_weights"/>
           <process expanded="true">
             <operator activated="true" class="text:tokenize" compatibility="5.3.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
             <operator activated="true" class="text:filter_stopwords_german" compatibility="5.3.001" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="179" y="30"/>
             <connect from_port="document" to_op="Tokenize" to_port="document"/>
             <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
             <connect from_op="Filter Stopwords (German)" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="select_attributes" compatibility="5.3.009" expanded="true" height="76" name="Select Attributes" width="90" x="246" y="210">
           <parameter key="attribute_filter_type" value="subset"/>
           <parameter key="attributes" value="text|frequency"/>
           <parameter key="invert_selection" value="true"/>
           <parameter key="include_special_attributes" value="true"/>
         </operator>
         <operator activated="true" class="de_pivot" compatibility="5.3.009" expanded="true" height="76" name="De-Pivot" width="90" x="380" y="210">
           <list key="attribute_name">
             <parameter key="TF-IDF" value="^(?!Query_ID).*"/>
           </list>
           <parameter key="index_attribute" value="Term"/>
           <parameter key="create_nominal_index" value="true"/>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
         <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
         <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
         <connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
         <connect from_op="Select Attributes" from_port="example set output" to_op="De-Pivot" to_port="example set input"/>
         <connect from_op="De-Pivot" from_port="example set output" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
  • RWingerterRWingerter Member Posts: 38 Contributor II
    Hi Marcin,

    thank you very much, it works like a charm.
    Marcin wrote:

    The operator you are looking for is the "De-Pivot" which is indeed not easy to use (in my opinion).
    I had looked at the "De-Pivot" operator, but I had no idea how to adress the attribute names. I am not saying I understand your code (that will certainly take a while), but for now I am just happy to have a solution. Thanks again.

    Kind regards

    Roland
Sign In or Register to comment.