"How to get a term-document matrix for SVD?"

B_B_ Member Posts: 70 Maven
edited June 2019 in Help
When I use the ProcessDocument module it creates a document-term matrix. I need to create a term-document matrix to feed to SVD.  I.e.,
              doc1      doc2    doc3
term1
term2
term3
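
To pin down the two layouts, here is a minimal Python sketch, outside RapidMiner, with made-up counts. The names `doc_term`, `terms` and `docs` are illustrative only. Transposing swaps rows and columns, so each row becomes a term and each column a document.

```python
# Document-term matrix: one row per document, one column per term.
# The counts are invented purely for illustration.
terms = ["term1", "term2", "term3"]
docs = ["doc1", "doc2", "doc3"]
doc_term = [
    [1, 0, 2],  # doc1
    [0, 3, 1],  # doc2
    [4, 0, 0],  # doc3
]

# Transpose: row i of the result is column i of the input,
# giving a term-document matrix (terms as rows, documents as columns).
term_doc = [list(column) for column in zip(*doc_term)]

print("      " + "  ".join(docs))
for term, row in zip(terms, term_doc):
    print(term, "  ".join(f"{count:4d}" for count in row))
```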

I placed WordList to Data after it and fed its output to SVD, but that output doesn't have the correct format, and SVD generates an error.

This is representative of how I set up the modules: ProcessDoc (word list) -> (word list) WordList to Data (example set) -> (example set) SVD.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
    <process expanded="true" height="521" width="955">
      <operator activated="true" class="read_excel" compatibility="5.0.0" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
        <parameter key="excel_file" value="R:\Data\restrvw_1.xls"/>
        <list key="annotations"/>
      </operator>
      <operator activated="true" class="replace" compatibility="5.0.0" expanded="true" height="76" name="Replace" width="90" x="45" y="120">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Summary|Good|Bad"/>
        <parameter key="replace_what" value="car"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.0.0" expanded="true" height="76" name="Select Attributes" width="90" x="45" y="255">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Summary|rowid|Rest_Type"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.0.0" expanded="true" height="76" name="Rest_type to label" width="90" x="179" y="165">
        <parameter key="name" value="Car_Type"/>
        <parameter key="target_role" value="label"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.0.0" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="30">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="100"/>
        <list key="specify_weights"/>
        <process expanded="true" height="526" width="806">
          <operator activated="true" class="text:transform_cases" compatibility="5.0.0" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="75"/>
          <operator activated="true" class="text:tokenize" compatibility="5.0.0" expanded="true" height="60" name="Tokenize" width="90" x="246" y="75"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.0.0" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="30"/>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="5.0.0" expanded="true" height="76" name="WordList to Data" width="90" x="447" y="30"/>
      <operator activated="true" class="singular_value_decomposition" compatibility="5.0.8" expanded="true" height="94" name="SVD" width="90" x="604" y="102"/>
      <connect from_op="Read Excel" from_port="output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Rest_type to label" to_port="example set input"/>
      <connect from_op="Rest_type to label" from_port="example set output" to_op="Process Documents from Data" to_port="word list"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_op="SVD" to_port="example set input"/>
      <connect from_op="SVD" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
I also used the Transpose module after ProcessDocument but SVD did not accept the example set from Transpose.

ProcessDoc(example set) ->(example set) Transpose(example set) -> (example set)SVD

How should I set this up?

TIA

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    why exactly did it not accept it? This should be the way to go.

    Greetings,
      Sebastian
  • B_B_ Member Posts: 70 Maven
    Sebastian

    This example shows what I want to do.

    I have some text with labels.  I pass the text into Process Documents to get a doc/term matrix.  I want to swing or pivot the results so I can feed the matrix into SVD with terms as rows and labels as columns.

    WordList to Data does not provide an example set that can be used in later processes. 

    The word list output from Process Documents from Data is perfect but it is not an example set format.  It has the words as rows, and Attribute Name, Total Occurrences and labels as column names.  If I can filter out Attribute Name and Total Occurrences and leave labels as column names then feed this into SVD this will work.  (Note:  If I don't use labels, I can Transpose the example set from Process Documents from Data to get a term/doc matrix.  But I need to use labels instead of document IDs.)

    I've tried several combinations of Transpose, Pivot, Aggregate and Word List to Data and Filter/Set Role Attributes after Process Documents from Data.    I get errors or incomplete results.

    How do I set up the modules to correctly feed the term/label matrix into SVD?


    (Feature enhancement idea: enable SVD to output either row/column or column/row)
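
    On the row/column question, a short reminder of the underlying linear algebra may help. This is general SVD, not a statement about how the RapidMiner SVD operator is implemented. Writing the document-term matrix as A:

```latex
A = U \Sigma V^{\top}
\quad\Longrightarrow\quad
A^{\top} = \left( U \Sigma V^{\top} \right)^{\top} = V \Sigma^{\top} U^{\top}
```

    So transposing the input keeps the singular values and swaps the roles of the left and right singular vectors.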

    Here is a data set (with Level just randomly assigned)

    Excel sheet

    Book Level Title
    B1 Undergrad A Course on Integral Equations
    B2 Undergrad Attractors for Semigroups and Evolution Equations
    B3 Grad Automatic Differentiation of Algorithms: Theory, Implementation, and Application
    B4 Undergrad Geometrical Aspects of Partial Differential Equations
    B5 Undergrad Ideals, Varieties, and Algorithms { An Introduction to Computational Algebraic Geometry and Commutative Algebra
    B6 Grad Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
    B7 Grad Knapsack Problems: Algorithms and Computer Implementations
    B8 Grad Methods of Solving Singular Systems of Ordinary Differential Equations
    B9 Grad Nonlinear Systems
    B10 Undergrad Ordinary Differential Equations
    B11 Undergrad Oscillation Theory for Neutral Differential Equations with Delay
    B12 Grad Oscillation Theory of Delay Differential Equations
    B13 Grad Pseudodifferential Operators and Nonlinear Partial Differential Equations
    B14 Undergrad Sinc Methods for Quadrature and Differential Equations
    B15 Grad Stability of Stochastic Differential Equations with Respect to Semi-Martingales
    B16 Undergrad The Boundary Integral Approach to Static and Dynamic Contact Problems
    B17 Undergrad The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory


    I want to feed this into SVD with labels as the columns (I assume I can feed this example set into the TFIDF module to get term weights before SVD.)

    Term Undergrad Grad
    algorithms 1 2
    delay 1 1
    differential 4 4
    equations 6 4
    integral 2 0
    introduction 1 1
    methods 1 1
    nonlinear 0 2
    ordinary 1 1
    oscillation 1 1
    partial 1 1
    problems 1 1
    systems 0 3
    theory 2 2
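
    As a cross-check on the target table, here is a minimal Python sketch, outside RapidMiner. It rests on three assumptions: a small hand-written stopword list (not RapidMiner's English list), tokenizing on anything that is not a letter, and reading prune_below_absolute = 2 as "keep terms that occur in at least two titles". The names `books`, `tokenize`, `doc_freq` and `by_level` are illustrative only.

```python
import re
from collections import Counter, defaultdict

# (level, title) pairs copied from the sheet above.
books = [
    ("Undergrad", "A Course on Integral Equations"),
    ("Undergrad", "Attractors for Semigroups and Evolution Equations"),
    ("Grad", "Automatic Differentiation of Algorithms: Theory, Implementation, and Application"),
    ("Undergrad", "Geometrical Aspects of Partial Differential Equations"),
    ("Undergrad", "Ideals, Varieties, and Algorithms { An Introduction to Computational Algebraic Geometry and Commutative Algebra"),
    ("Grad", "Introduction to Hamiltonian Dynamical Systems and the N-Body Problem"),
    ("Grad", "Knapsack Problems: Algorithms and Computer Implementations"),
    ("Grad", "Methods of Solving Singular Systems of Ordinary Differential Equations"),
    ("Grad", "Nonlinear Systems"),
    ("Undergrad", "Ordinary Differential Equations"),
    ("Undergrad", "Oscillation Theory for Neutral Differential Equations with Delay"),
    ("Grad", "Oscillation Theory of Delay Differential Equations"),
    ("Grad", "Pseudodifferential Operators and Nonlinear Partial Differential Equations"),
    ("Undergrad", "Sinc Methods for Quadrature and Differential Equations"),
    ("Grad", "Stability of Stochastic Differential Equations with Respect to Semi-Martingales"),
    ("Undergrad", "The Boundary Integral Approach to Static and Dynamic Contact Problems"),
    ("Undergrad", "The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory"),
]

# Assumption: a tiny hand-picked stopword list, not RapidMiner's English list.
STOPWORDS = {"a", "an", "and", "for", "of", "on", "the", "their", "to", "with"}


def tokenize(title):
    """Lowercase the title, split on non-letters, and drop stopwords."""
    tokens = re.split(r"[^a-z]+", title.lower())
    return [t for t in tokens if t and t not in STOPWORDS]


doc_freq = Counter()             # number of titles that contain each term
by_level = defaultdict(Counter)  # term -> Counter({level: occurrences})
for level, title in books:
    tokens = tokenize(title)
    doc_freq.update(set(tokens))
    for token in tokens:
        by_level[token][level] += 1

# Keep terms found in at least two titles (assumed pruning rule).
kept = sorted(term for term, n in doc_freq.items() if n >= 2)

print("Term Undergrad Grad")
for term in kept:
    print(term, by_level[term]["Undergrad"], by_level[term]["Grad"])
```

    Under these assumptions the sketch prints the same 14 rows as the table above.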



    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="">
        <process expanded="true" height="636" width="975">
          <operator activated="true" class="read_excel" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
            <parameter key="excel_file" value="D:\Data\Docs\book_title2.xls"/>
            <list key="annotations"/>
          </operator>
          <operator activated="false" class="nominal_to_numerical" expanded="true" height="94" name="Nominal to Numerical" width="90" x="849" y="525">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Overall_Ra"/>
          </operator>
          <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role (2)" width="90" x="45" y="120">
            <parameter key="name" value="Book"/>
            <parameter key="target_role" value="id"/>
          </operator>
          <operator activated="false" class="join" expanded="true" height="76" name="Join" width="90" x="45" y="525"/>
          <operator activated="false" class="rename_by_replacing" expanded="true" height="76" name="Rename by Replacing" width="90" x="179" y="525">
            <parameter key="attribute_filter_type" value="subset"/>
          </operator>
          <operator activated="false" class="set_role" expanded="true" height="76" name="Set Role (3)" width="90" x="447" y="525">
            <parameter key="name" value="Car_Type"/>
            <parameter key="target_role" value="label"/>
          </operator>
          <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role (4)" width="90" x="45" y="210">
            <parameter key="name" value="Level"/>
            <parameter key="target_role" value="label"/>
          </operator>
          <operator activated="true" class="nominal_to_text" expanded="true" height="76" name="Nominal to Text" width="90" x="45" y="390">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Title"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="30">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="100"/>
            <list key="specify_weights"/>
            <process expanded="true" height="526" width="806">
              <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
              <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="30"/>
              <connect from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:wordlist_to_data" expanded="true" height="76" name="WordList to Data" width="90" x="380" y="165"/>
          <operator activated="true" class="singular_value_decomposition" expanded="true" height="94" name="SVD" width="90" x="581" y="165"/>
          <connect from_op="Read Excel" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Set Role (4)" to_port="example set input"/>
          <connect from_op="Set Role (4)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
          <connect from_op="WordList to Data" from_port="example set" to_op="SVD" to_port="example set input"/>
          <connect from_op="SVD" from_port="example set output" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>



  • haddock Member Posts: 849 Maven
    Hi there,

    I suspect the problem lies in the data pre-processing: if I lay out your data as follows in Words.csv and run the XML below, the operators seem to work fine.
    Book|   Level|      Title
    B1|  Undergrad|  A Course on Integral Equations
    B2|  Undergrad|  Attractors for Semigroups and Evolution Equations
    B3|  Grad|      Automatic Differentiation of Algorithms: Theory, Implementation, and Application
    B4|  Undergrad|  Geometrical Aspects of Partial Differential Equations
    B5|  Undergrad|  Ideals, Varieties, and Algorithms { An Introduction to Computational Algebraic Geometry and Commutative Algebra
    B6|  Grad|      Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
    B7|  Grad|      Knapsack Problems: Algorithms and Computer Implementations
    B8|  Grad|    Methods of Solving Singular Systems of Ordinary Differential Equations
    B9|  Grad|      Nonlinear Systems
    B10|  Undergrad|  Ordinary Differential Equations
    B11|  Undergrad|  Oscillation Theory for Neutral Differential Equations with Delay
    B12|  Grad|      Oscillation Theory of Delay Differential Equations
    B13|  Grad|      Pseudodifferential Operators and Nonlinear Partial Differential Equations
    B14|  Undergrad|  Sinc Methods for Quadrature and Differential Equations
    B15|  Grad|      Stability of Stochastic Differential Equations with Respect to Semi-Martingales
    B16|  Undergrad|  The Boundary Integral Approach to Static and Dynamic Contact Problems
    B17|  Undergrad|  The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
        <process expanded="true" height="386" width="796">
          <operator activated="true" class="read_csv" compatibility="5.0.8" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
            <parameter key="file_name" value="C:\Documents and Settings\Alien\My Documents\RM5\samples\data\Words.csv"/>
            <parameter key="column_separators" value="|"/>
            <parameter key="parse_numbers" value="false"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="Set Role" width="90" x="45" y="120">
            <parameter key="name" value="Book"/>
            <parameter key="target_role" value="id"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="Set Role (2)" width="90" x="45" y="255">
            <parameter key="name" value="Level"/>
            <parameter key="target_role" value="label"/>
          </operator>
          <operator activated="true" breakpoints="before" class="nominal_to_text" compatibility="5.0.8" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="50">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Title"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.0.0" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
            <parameter key="vector_creation" value="Binary Term Occurrences"/>
            <list key="specify_weights"/>
            <process expanded="true" height="356" width="814">
              <operator activated="true" class="text:tokenize" compatibility="5.0.0" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.0.0" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="transpose" compatibility="5.0.8" expanded="true" height="76" name="Transpose" width="90" x="514" y="30"/>
          <operator activated="true" class="text:wordlist_to_data" compatibility="5.0.5" expanded="true" height="76" name="WordList to Data" width="90" x="514" y="120"/>
          <connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Transpose" to_port="example set input"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
          <connect from_op="Transpose" from_port="example set output" to_port="result 1"/>
          <connect from_op="WordList to Data" from_port="word list" to_port="result 2"/>
          <connect from_op="WordList to Data" from_port="example set" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
    In my experience pesky separators like commas in text can mess stuff up quite quickly.
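
    A minimal Python sketch of that effect, outside RapidMiner: the same record is written once with commas and once with "|" as the column separator, and parsed with the standard csv module. The variable names are illustrative only.

```python
import csv
import io

title = "Automatic Differentiation of Algorithms: Theory, Implementation, and Application"

# The same record written with two different column separators.
comma_line = f"B3,Grad,{title}\n"
pipe_line = f"B3|Grad|{title}\n"

# Unquoted commas inside the title are read as extra column breaks.
comma_fields = next(csv.reader(io.StringIO(comma_line), delimiter=","))

# With "|" as the separator the title stays in a single field.
pipe_fields = next(csv.reader(io.StringIO(pipe_line), delimiter="|"))

print(len(comma_fields), comma_fields)
print(len(pipe_fields), pipe_fields)
```

    The comma version splits the title into three pieces, so the row no longer has the Book, Level, Title shape; the pipe version keeps three fields.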

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I'm not quite convinced that the word list does you any good. Even if SVD ran on it, what would the result be? It would be drawn just from the information of how often a word occurred.
    I think you will have to transpose the word vector from the Process Documents operator. I would suggest the following to go further in this matter:
    Replace the data loading with some Create Document operators, and replace Process Documents from Data with Process Documents. Then you can post this process and I will be able to execute it without problems. This way I might get an impression of how to help you.

    Greetings,
      Sebastian
  • B_B_ Member Posts: 70 Maven
    Sebastian,

    Here is setup 1 - with the text in a container.


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="">
        <process expanded="true" height="636" width="975">
          <operator activated="false" class="nominal_to_numerical" expanded="true" height="94" name="Nominal to Numerical" width="90" x="849" y="525">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Overall_Ra"/>
          </operator>
          <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="45" y="75">
            <parameter key="text" value="Book&#9;Level&#9;Title&#10;B1&#9;Undergrad&#9;A Course on Integral Equations&#10;B2&#9;Undergrad&#9;Attractors for Semigroups and Evolution Equations&#10;B3&#9;Grad&#9;Automatic Differentiation of Algorithms: Theory, Implementation, and Application&#10;B4&#9;Undergrad&#9;Geometrical Aspects of Partial Differential Equations&#10;B5&#9;Undergrad&#9;Ideals, Varieties, and Algorithms  An Introduction to Computational Algebraic Geometry and Commutative Algebra&#10;B6&#9;Grad&#9;Introduction to Hamiltonian Dynamical Systems and the N-Body Problem&#10;B7&#9;Grad&#9;Knapsack Problems: Algorithms and Computer Implementations&#10;B8&#9;Grad&#9;Methods of Solving Singular Systems of Ordinary Differential Equations&#10;B9&#9;Grad&#9;Nonlinear Systems&#10;B10&#9;Undergrad&#9;Ordinary Differential Equations&#10;B11&#9;Undergrad&#9;Oscillation Theory for Neutral Differential Equations with Delay&#10;B12&#9;Grad&#9;Oscillation Theory of Delay Differential Equations&#10;B13&#9;Grad&#9;Pseudodifferential Operators and Nonlinear Partial Differential Equations&#10;B14&#9;Undergrad&#9;Sinc Methods for Quadrature and Differential Equations&#10;B15&#9;Grad&#9;Stability of Stochastic Differential Equations with Respect to Semi-Martingales&#10;B16&#9;Undergrad&#9;The Boundary Integral Approach to Static and Dynamic Contact Problems&#10;B17&#9;Undergrad&#9;The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory&#10;"/>
          </operator>
          <operator activated="true" class="text:documents_to_data" expanded="true" height="76" name="Documents to Data" width="90" x="112" y="165">
            <parameter key="text_attribute" value="Title"/>
            <parameter key="label_attribute" value="Level"/>
          </operator>
          <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role (2)" width="90" x="112" y="300">
            <parameter key="name" value="Book"/>
            <parameter key="target_role" value="id"/>
          </operator>
          <operator activated="true" class="nominal_to_text" expanded="true" height="76" name="Nominal to Text" width="90" x="112" y="435">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Title"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="100"/>
            <list key="specify_weights"/>
            <process expanded="true" height="526" width="806">
              <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
              <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="30"/>
              <connect from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="pivot" expanded="true" height="76" name="Pivot" width="90" x="380" y="165">
            <parameter key="group_attribute" value="Level"/>
            <parameter key="index_attribute" value="id"/>
          </operator>
          <operator activated="false" class="transpose" expanded="true" height="76" name="Transpose" width="90" x="581" y="390"/>
          <operator activated="true" class="singular_value_decomposition" expanded="true" height="94" name="SVD" width="90" x="380" y="300">
            <parameter key="dimensions" value="4"/>
          </operator>
          <operator activated="false" class="weka:W-EM" expanded="true" height="76" name="W-EM" width="90" x="648" y="525">
            <parameter key="N" value="4.0"/>
            <parameter key="V" value="true"/>
            <parameter key="add_as_label" value="true"/>
          </operator>
          <operator activated="true" class="dbscan" expanded="true" height="76" name="Clustering" width="90" x="648" y="210">
            <parameter key="epsilon" value="0.05"/>
            <parameter key="min_points" value="2"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <operator activated="false" class="join" expanded="true" height="76" name="Join" width="90" x="45" y="525"/>
          <operator activated="false" class="rename_by_replacing" expanded="true" height="76" name="Rename by Replacing" width="90" x="179" y="525">
            <parameter key="attribute_filter_type" value="subset"/>
          </operator>
          <operator activated="false" class="set_role" expanded="true" height="76" name="Set Role (3)" width="90" x="447" y="525">
            <parameter key="name" value="Car_Type"/>
            <parameter key="target_role" value="label"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Pivot" to_port="example set input"/>
          <connect from_op="Pivot" from_port="example set output" to_op="SVD" to_port="example set input"/>
          <connect from_op="SVD" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Create Document changes the document text to an example set, but the Documents to Data process does not create metadata for id, label and text data that can be passed to later operators.  If I put the text data into a spreadsheet the metadata is created and Set Role and Nominal to Text work correctly to create id and text fields.

    The output of the Process Documents from Data creates a document-term matrix with documents as rows and terms as columns.  I  want to pivot this and have terms as rows and categories as columns instead of documents as columns.
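
    The reshaping described here comes down to two steps: sum the document rows within each category, then transpose. A minimal Python sketch of that logic, outside RapidMiner, using four of the titles and three of the terms. `doc_term`, `label_term` and `term_label` are illustrative names, and the sketch says nothing about which RapidMiner operators perform these steps.

```python
from collections import defaultdict

terms = ["equations", "integral", "systems"]

# (label, term counts) per document: a small stand-in for the
# document-term example set, restricted to three terms.
doc_term = [
    ("Undergrad", [1, 1, 0]),  # B1
    ("Grad",      [1, 0, 1]),  # B8
    ("Grad",      [0, 0, 1]),  # B9
    ("Undergrad", [1, 0, 0]),  # B10
]

# Step 1: sum the document rows within each label.
label_term = defaultdict(lambda: [0] * len(terms))
for label, counts in doc_term:
    label_term[label] = [a + b for a, b in zip(label_term[label], counts)]

# Step 2: transpose so that terms are rows and labels are columns.
labels = ["Undergrad", "Grad"]
term_label = {
    term: [label_term[label][i] for label in labels]
    for i, term in enumerate(terms)
}

print("Term", *labels)
for term, counts in term_label.items():
    print(term, *counts)
```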

    This second process sets up the example set correctly but doesn't pivot correctly to get terms as rows with Level (ie category) as columns.  (Just paste the text info from Create Document into an xls sheet.)

    Thanks for your help.


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="">
        <process expanded="true" height="636" width="975">
          <operator activated="false" class="nominal_to_numerical" expanded="true" height="94" name="Nominal to Numerical" width="90" x="849" y="525">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Overall_Ra"/>
          </operator>
          <operator activated="true" class="read_excel" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
            <parameter key="excel_file" value="C:\Data\book_title2.xls"/>
            <list key="annotations"/>
          </operator>
          <operator activated="true" class="set_role" expanded="true" height="76" name="Set Role (2)" width="90" x="112" y="255">
            <parameter key="name" value="Book"/>
            <parameter key="target_role" value="id"/>
          </operator>
          <operator activated="true" class="nominal_to_text" expanded="true" height="76" name="Nominal to Text" width="90" x="112" y="390">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Title"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="100"/>
            <list key="specify_weights"/>
            <process expanded="true" height="526" width="806">
              <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
              <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="30"/>
              <connect from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" breakpoints="before" class="pivot" expanded="true" height="76" name="Pivot" width="90" x="380" y="165">
            <parameter key="group_attribute" value="Level"/>
            <parameter key="index_attribute" value="id"/>
          </operator>
          <operator activated="true" class="singular_value_decomposition" expanded="true" height="94" name="SVD" width="90" x="380" y="300">
            <parameter key="dimensions" value="4"/>
          </operator>
          <operator activated="true" class="dbscan" expanded="true" height="76" name="Clustering" width="90" x="648" y="210">
            <parameter key="epsilon" value="0.05"/>
            <parameter key="min_points" value="2"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Pivot" to_port="example set input"/>
          <connect from_op="Pivot" from_port="example set output" to_op="SVD" to_port="example set input"/>
          <connect from_op="SVD" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>



    landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    if I replace the Pivot operator with Transpose, I get one word per row, with each column holding that word's weight in one document. Wasn't that what you were looking for?
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="">
        <process expanded="true" height="636" width="975">
          <operator activated="false" class="nominal_to_numerical" compatibility="5.0.0" expanded="true" height="94" name="Nominal to Numerical" width="90" x="849" y="525">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Overall_Ra"/>
          </operator>
          <operator activated="true" class="read_excel" compatibility="5.0.0" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
            <parameter key="excel_file" value="C:\Dokumente und Einstellungen\sland\Desktop\texte.xls"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.0.0" expanded="true" height="76" name="Set Role (2)" width="90" x="112" y="255">
            <parameter key="name" value="Book"/>
            <parameter key="target_role" value="id"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.0.0" expanded="true" height="76" name="Nominal to Text" width="90" x="112" y="390">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Title"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.0.0" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="100"/>
            <list key="specify_weights"/>
            <process expanded="true" height="526" width="806">
              <operator activated="true" class="text:transform_cases" compatibility="5.0.0" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
              <operator activated="true" class="text:tokenize" compatibility="5.0.0" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.0.0" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="380" y="30"/>
              <connect from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" breakpoints="after" class="transpose" compatibility="5.0.8" expanded="true" height="76" name="Transpose" width="90" x="380" y="165"/>
          <operator activated="true" class="singular_value_decomposition" compatibility="5.0.0" expanded="true" height="94" name="SVD" width="90" x="380" y="300">
            <parameter key="dimensions" value="4"/>
          </operator>
          <operator activated="true" class="dbscan" compatibility="5.0.0" expanded="true" height="76" name="Clustering" width="90" x="648" y="210">
            <parameter key="epsilon" value="0.05"/>
            <parameter key="min_points" value="2"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Transpose" to_port="example set input"/>
          <connect from_op="Transpose" from_port="example set output" to_op="SVD" to_port="example set input"/>
          <connect from_op="SVD" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Greetings,
    Sebastian
    B_B_ Member Posts: 70 Maven
    Sebastian

    I am trying to get the term/category matrix, not the term/document matrix. Sorry I wasn't clearer. I want to condense the term/document matrix into a term/category matrix and send that to SVD.

    This is the format I want to send to SVD:

    Term          Undergrad  Grad
    algorithms    1          2
    delay         1          1
    differential  4          4
    equations     6          4
    integral      2          0
    introduction  1          1
    methods       1          1
    nonlinear     0          2
    ordinary      1          1
    oscillation   1          1
    partial       1          1
    problems      1          1
    systems       0          3
    theory        2          2



    The process below produces the right output (term/category), but the resulting example set isn't read by the Filter Attribute operator.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
        <process expanded="true" height="521" width="955">
          <operator activated="true" class="read_excel" compatibility="5.0.0" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
            <parameter key="excel_file" value="C:\Data\booktitle.xls"/>
            <list key="annotations"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="Set Role" width="90" x="45" y="120">
            <parameter key="name" value="Book"/>
            <parameter key="target_role" value="id"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="Set Role (2)" width="90" x="45" y="255">
            <parameter key="name" value="Level"/>
            <parameter key="target_role" value="label"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.0.8" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="255">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Title"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.0.0" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="30">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="100"/>
            <list key="specify_weights"/>
            <process expanded="true" height="526" width="806">
              <operator activated="true" class="text:transform_cases" compatibility="5.0.0" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="75"/>
              <operator activated="true" class="text:tokenize" compatibility="5.0.0" expanded="true" height="60" name="Tokenize" width="90" x="246" y="75"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.0.0" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" breakpoints="after" class="text:wordlist_to_data" compatibility="5.0.0" expanded="true" height="76" name="WordList to Data" width="90" x="447" y="30"/>
          <operator activated="true" class="singular_value_decomposition" compatibility="5.0.8" expanded="true" height="94" name="SVD" width="90" x="604" y="102"/>
          <connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
          <connect from_op="WordList to Data" from_port="example set" to_op="SVD" to_port="example set input"/>
          <connect from_op="SVD" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


    Here is the output of WordList to Data. The output format is correct, but the Filter Attribute operator doesn't see the attributes, so I can't remove "In document" and "in labeled" and keep only "in class (undergrad)" and "in class (Graduate)".


    Row Nbr  Word          In document  in labeled  in class (undergrad)  in class (Graduate)
    1        algorithms    3            3           1                     2
    2        delay         2            2           1                     1
    3        differential  8            8           4                     4
    4        equations     10           10          6                     4
    5        integral      2            2           2                     0
    6        introduction  2            2           1                     1
    7        methods       2            2           1                     1
    8        nonlinear     2            2           0                     2
    9        ordinary      2            2           1                     1
    10       oscillation   2            2           1                     1
    11       partial       2            2           1                     1
    12       problems      2            2           1                     1
    13       systems       3            3           0                     3
    14       theory        4            4           2                     2

    I mentioned in my last post I've tried Pivot to get the term/category matrix, but that didn't work.

    Any ideas how to get WordList to Data to output an example set that SVD can read? Everything before SVD is working, but the example set coming out of WordList to Data isn't readable by SVD. (The Excel sheet used in this example was created by copying and pasting the data below.)

    Book | Level     | Title
    B1   | Undergrad | A Course on Integral Equations
    B2   | Undergrad | Attractors for Semigroups and Evolution Equations
    B3   | Grad      | Automatic Differentiation of Algorithms: Theory, Implementation, and Application
    B4   | Undergrad | Geometrical Aspects of Partial Differential Equations
    B5   | Undergrad | Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra
    B6   | Grad      | Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
    B7   | Grad      | Knapsack Problems: Algorithms and Computer Implementations
    B8   | Grad      | Methods of Solving Singular Systems of Ordinary Differential Equations
    B9   | Grad      | Nonlinear Systems
    B10  | Undergrad | Ordinary Differential Equations
    B11  | Undergrad | Oscillation Theory for Neutral Differential Equations with Delay
    B12  | Grad      | Oscillation Theory of Delay Differential Equations
    B13  | Grad      | Pseudodifferential Operators and Nonlinear Partial Differential Equations
    B14  | Undergrad | Sinc Methods for Quadrature and Differential Equations
    B15  | Grad      | Stability of Stochastic Differential Equations with Respect to Semi-Martingales
    B16  | Undergrad | The Boundary Integral Approach to Static and Dynamic Contact Problems
    B17  | Undergrad | The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory


    Thanks.
    landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you can use the Select Attributes operator even though no attributes are offered for selection. The problem is that the attributes aren't known at design time (the complete documents would have to be parsed for that), so they can't be shown. But you can enter the names manually.

    Greetings,
      Sebastian
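
    A minimal sketch of what entering the names manually could look like in the process XML, placed between WordList to Data and SVD in the WordList to Data process posted above. The attribute names `word`, `in class (Undergrad)` and `in class (Grad)` are assumptions; use the exact column names shown at the breakpoint after WordList to Data. The extra Set Role step, which marks the term column as an id so that only the numeric class counts remain regular attributes, is likewise an assumption about what SVD needs and is not confirmed in this thread. The operator name "Word to id" and the x/y coordinates are placeholders.

    ```xml
    <!-- Fragment only: paste inside the inner <process> element and remove the
         direct connection from WordList to Data to SVD. Attribute names are assumed. -->
    <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="165">
      <parameter key="attribute_filter_type" value="subset"/>
      <!-- Typed by hand, since the design-time attribute list stays empty -->
      <parameter key="attributes" value="word|in class (Undergrad)|in class (Grad)"/>
    </operator>
    <!-- Assumed step: give the term column the id role so it is not a regular attribute -->
    <operator activated="true" class="set_role" expanded="true" height="76" name="Word to id" width="90" x="447" y="300">
      <parameter key="name" value="word"/>
      <parameter key="target_role" value="id"/>
    </operator>
    <connect from_op="WordList to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Word to id" to_port="example set input"/>
    <connect from_op="Word to id" from_port="example set output" to_op="SVD" to_port="example set input"/>
    ```

    The `subset` filter type with a `|`-separated `attributes` value is the same form used by the Select Attributes operator in the first process of this thread, so the names can be typed into the parameter even when the attribute chooser is empty.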