"How get term-document matrix for SVD?"
When I use the ProcessDocument module it creates a document-term matrix. I need to create a term-document matrix to feed to SVD. I.e.,
      doc1 doc2 doc3
term1
term2
term3
I placed WordList to Data after it and fed the output to SVD, but the output from WordList to Data doesn't have the correct format, and SVD generates an error.
This is representative of how I set up the modules: ProcessDoc (word list) -> (word list) WordList to Data (example set) -> (example set) SVD.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
    <process expanded="true" height="521" width="955">
      <operator activated="true" class="read_excel" compatibility="5.0.0" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
        <parameter key="excel_file" value="R:\Data\restrvw_1.xls"/>
        <list key="annotations"/>
      </operator>
      <operator activated="true" class="replace" compatibility="5.0.0" expanded="true" height="76" name="Replace" width="90" x="45" y="120">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Summary|Good|Bad"/>
        <parameter key="replace_what" value="car"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.0.0" expanded="true" height="76" name="Select Attributes" width="90" x="45" y="255">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Summary|rowid|Rest_Type"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.0.0" expanded="true" height="76" name="Rest_type to label" width="90" x="179" y="165">
        <parameter key="name" value="Car_Type"/>
        <parameter key="target_role" value="label"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.0.0" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="30">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="100"/>
        <list key="specify_weights"/>
        <process expanded="true" height="526" width="806">
          <operator activated="true" class="text:transform_cases" compatibility="5.0.0" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="75"/>
          <operator activated="true" class="text:tokenize" compatibility="5.0.0" expanded="true" height="60" name="Tokenize" width="90" x="246" y="75"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.0.0" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="30"/>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="5.0.0" expanded="true" height="76" name="WordList to Data" width="90" x="447" y="30"/>
      <operator activated="true" class="singular_value_decomposition" compatibility="5.0.8" expanded="true" height="94" name="SVD" width="90" x="604" y="102"/>
      <connect from_op="Read Excel" from_port="output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Rest_type to label" to_port="example set input"/>
      <connect from_op="Rest_type to label" from_port="example set output" to_op="Process Documents from Data" to_port="word list"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_op="SVD" to_port="example set input"/>
      <connect from_op="SVD" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
I also used the Transpose module after ProcessDocument, but SVD did not accept the example set from Transpose.
ProcessDoc (example set) -> (example set) Transpose (example set) -> (example set) SVD
How should I set this up?
TIA
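For readers following along outside RapidMiner, the difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of what a transpose does to a document-term matrix; the `transpose` helper and the small count matrix are made up for the example and are not part of any RapidMiner process.

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


# Hypothetical document-term counts: one row per document, one column per term.
doc_term = [
    [1, 0, 2],  # doc1: counts for term1, term2, term3
    [0, 3, 1],  # doc2: counts for term1, term2, term3
]

# After transposing, each row is a term and each column is a document.
term_doc = transpose(doc_term)
for name, row in zip(["term1", "term2", "term3"], term_doc):
    print(name, row)
```

Transposing twice gives back the original layout, so no information is lost by switching between the two forms.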
Why exactly did it not accept it? This should be the way to go.
Greetings,
Sebastian
This example shows what I want to do.
I have some text with labels. I pass the text into Process Documents to get a doc/term matrix. I want to swing or pivot the results so I can feed the matrix into SVD with terms as rows and labels as columns.
WordList to Data does not provide an example set that can be used in later processes.
The word list output from Process Documents from Data is perfect but it is not an example set format. It has the words as rows, and Attribute Name, Total Occurrences and labels as column names. If I can filter out Attribute Name and Total Occurrences and leave labels as column names then feed this into SVD this will work. (Note: If I don't use labels, I can Transpose the example set from Process Documents from Data to get a term/doc matrix. But I need to use labels instead of document IDs.)
I've tried several combinations of Transpose, Pivot, Aggregate and Word List to Data and Filter/Set Role Attributes after Process Documents from Data. I get errors or incomplete results.
How do I set up the modules to correctly feed the term/label matrix into SVD?
(Feature enhancement idea: enable SVD to output either row/column or column/row)
Here is a data set (with Level just randomly assigned)
Excel sheet
Book| Level| Title
B1| Undergrad| A Course on Integral Equations
B2| Undergrad| Attractors for Semigroups and Evolution Equations
B3| Grad| Automatic Differentiation of Algorithms: Theory, Implementation, and Application
B4| Undergrad| Geometrical Aspects of Partial Differential Equations
B5| Undergrad| Ideals, Varieties, and Algorithms { An Introduction to Computational Algebraic Geometry and Commutative Algebra
B6| Grad| Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
B7| Grad| Knapsack Problems: Algorithms and Computer Implementations
B8| Grad| Methods of Solving Singular Systems of Ordinary Differential Equations
B9| Grad| Nonlinear Systems
B10| Undergrad| Ordinary Differential Equations
B11| Undergrad| Oscillation Theory for Neutral Differential Equations with Delay
B12| Grad| Oscillation Theory of Delay Differential Equations
B13| Grad| Pseudodifferential Operators and Nonlinear Partial Differential Equations
B14| Undergrad| Sinc Methods for Quadrature and Differential Equations
B15| Grad| Stability of Stochastic Differential Equations with Respect to Semi-Martingales
B16| Undergrad| The Boundary Integral Approach to Static and Dynamic Contact Problems
B17| Undergrad| The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
I want to feed this into SVD with labels as the columns (I assume I can feed this example set into the TFIDF module to get term weights before SVD.)
Term| Undergrad| Grad
algorithms| 1| 2
delay| 1| 1
differential| 4| 4
equations| 6| 4
integral| 2| 0
introduction| 1| 1
methods| 1| 1
nonlinear| 0| 2
ordinary| 1| 1
oscillation| 1| 1
partial| 1| 1
problems| 1| 1
systems| 0| 3
theory| 2| 2
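To make the target format concrete, here is a minimal Python sketch (outside RapidMiner) of how such a term/category table can be derived from the titles. It assumes lowercasing, splitting on non-letters, a tiny hand-picked stop-word list, and pruning of terms that occur in fewer than two documents; the names `tokenize`, `term_category_matrix` and `STOPWORDS` are invented for the example and are not RapidMiner operators.

```python
import re
from collections import Counter, defaultdict

# (book id, level, title) rows copied from the data set in this post.
BOOKS = [
    ("B1", "Undergrad", "A Course on Integral Equations"),
    ("B2", "Undergrad", "Attractors for Semigroups and Evolution Equations"),
    ("B3", "Grad", "Automatic Differentiation of Algorithms: Theory, Implementation, and Application"),
    ("B4", "Undergrad", "Geometrical Aspects of Partial Differential Equations"),
    ("B5", "Undergrad", "Ideals, Varieties, and Algorithms { An Introduction to Computational Algebraic Geometry and Commutative Algebra"),
    ("B6", "Grad", "Introduction to Hamiltonian Dynamical Systems and the N-Body Problem"),
    ("B7", "Grad", "Knapsack Problems: Algorithms and Computer Implementations"),
    ("B8", "Grad", "Methods of Solving Singular Systems of Ordinary Differential Equations"),
    ("B9", "Grad", "Nonlinear Systems"),
    ("B10", "Undergrad", "Ordinary Differential Equations"),
    ("B11", "Undergrad", "Oscillation Theory for Neutral Differential Equations with Delay"),
    ("B12", "Grad", "Oscillation Theory of Delay Differential Equations"),
    ("B13", "Grad", "Pseudodifferential Operators and Nonlinear Partial Differential Equations"),
    ("B14", "Undergrad", "Sinc Methods for Quadrature and Differential Equations"),
    ("B15", "Grad", "Stability of Stochastic Differential Equations with Respect to Semi-Martingales"),
    ("B16", "Undergrad", "The Boundary Integral Approach to Static and Dynamic Contact Problems"),
    ("B17", "Undergrad", "The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory"),
]

# Assumed stop-word list for this example only; a real English list is much longer.
STOPWORDS = {"a", "an", "and", "for", "of", "on", "the", "to", "with"}


def tokenize(title):
    """Lowercase the title, split on non-letters, and drop stop words."""
    words = re.findall(r"[a-z]+", title.lower())
    return [word for word in words if word not in STOPWORDS]


def term_category_matrix(rows, min_docs=2):
    """Count, per term, how many documents of each level contain it."""
    per_term = defaultdict(Counter)
    for _book, level, title in rows:
        for term in set(tokenize(title)):  # count each document once per term
            per_term[term][level] += 1
    levels = sorted({level for _book, level, _title in rows})
    matrix = {
        term: [per_term[term][level] for level in levels]
        for term in sorted(per_term)
        if sum(per_term[term].values()) >= min_docs  # prune rare terms
    }
    return levels, matrix


levels, matrix = term_category_matrix(BOOKS)
print("Term", *levels)
for term, counts in matrix.items():
    print(term, *counts)
```

Note that the columns come out in alphabetical order (Grad before Undergrad), so they are swapped relative to the table above.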
I suspect the problem lies in the data pre-processing, because if I lay out your data as follows in Words.csv and run the XML, the operators seem to work fine. In my experience, pesky separators like commas in text can mess things up quite quickly.
I'm not quite convinced that the word list does you any good. Even if the SVD would run on it, what would be the result? Drawn just from the information of how often a word occurred?
I think you will have to transpose the word vector from the process documents operator. I would suggest the following to go further in this matter:
Replace the data loading with some Create Document operators, and replace Process Documents from Data with Process Documents. Then you can post this process and I will be able to execute it without problems. This way I might get an impression of how to help you.
Greetings,
Sebastian
Here is setup 1 - with the text in a container. Create Document changes the document text to an example set, but the Documents to Data process does not create metadata for id, label and text data that can be passed to later operators. If I put the text data into a spreadsheet the metadata is created and Set Role and Nominal to Text work correctly to create id and text fields.
The output of the Process Documents from Data creates a document-term matrix with documents as rows and terms as columns. I want to pivot this and have terms as rows and categories as columns instead of documents as columns.
This second process sets up the example set correctly but doesn't pivot correctly to get terms as rows with Level (i.e. category) as columns. (Just paste the text info from Create Document into an xls sheet.)
Thanks for your help.
If I replace the pivoting with a transpose, it seems to me that I get one word per row, and each column expresses the weight in each document. Wasn't that what you were looking for?
Greetings,
Sebastian
I am trying to get the term/category matrix, not the term/document matrix - Sorry I wasn't clearer. I want to condense the term/document matrix to a term/category matrix and send this to SVD.
I want to get this format to send to SVD
Term| Undergrad| Grad
algorithms| 1| 2
delay| 1| 1
differential| 4| 4
equations| 6| 4
integral| 2| 0
introduction| 1| 1
methods| 1| 1
nonlinear| 0| 2
ordinary| 1| 1
oscillation| 1| 1
partial| 1| 1
problems| 1| 1
systems| 0| 3
theory| 2| 2
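As a rough illustration of what SVD would see in this table, the sketch below computes the two singular values of the 14 x 2 term/category matrix in plain Python. It relies on the standard fact that the squared singular values of a matrix A are the eigenvalues of A^T A, which for two columns is a 2 x 2 symmetric matrix with a closed-form solution. The function name is invented for the example, and this is not how the RapidMiner SVD operator is implemented.

```python
import math

# (Undergrad, Grad) counts per term, copied row by row from the table above.
COUNTS = [
    (1, 2), (1, 1), (4, 4), (6, 4), (2, 0), (1, 1), (1, 1),
    (0, 2), (1, 1), (1, 1), (1, 1), (1, 1), (0, 3), (2, 2),
]


def singular_values_two_columns(rows):
    """Singular values of an n x 2 matrix via the eigenvalues of A^T A."""
    a = sum(u * u for u, _ in rows)  # Undergrad column dotted with itself
    b = sum(u * g for u, g in rows)  # Undergrad column dotted with Grad column
    c = sum(g * g for _, g in rows)  # Grad column dotted with itself
    # Eigenvalues of [[a, b], [b, c]] are mean +/- radius.
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    largest = math.sqrt(mean + radius)
    smallest = math.sqrt(max(mean - radius, 0.0))  # guard against rounding below 0
    return largest, smallest


s1, s2 = singular_values_two_columns(COUNTS)
print(s1, s2)
```

With only two category columns there can be at most two singular values, which is worth keeping in mind when deciding whether a term/category matrix carries enough structure for SVD to be useful.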
This produces the right output (term / category), but the example set isn't read by the Filter Attribute operator. Here is the output of Word List to Data. The output format is correct, but the Filter Attribute operator doesn't see the information, so I can't remove In document and in labeled and leave only in class (undergrad) and in class (Graduate).
Row Nbr| Word| In document| in labeled| in class (undergrad)| in class (Graduate)
1| algorithms| 3| 3| 1| 2
2| delay| 2| 2| 1| 1
3| differential| 8| 8| 4| 4
4| equations| 10| 10| 6| 4
5| integral| 2| 2| 2| 0
6| introduction| 2| 2| 1| 1
7| methods| 2| 2| 1| 1
8| nonlinear| 2| 2| 0| 2
9| ordinary| 2| 2| 1| 1
10| oscillation| 2| 2| 1| 1
11| partial| 2| 2| 1| 1
12| problems| 2| 2| 1| 1
13| systems| 3| 3| 0| 3
14| theory| 4| 4| 2| 2
I mentioned in my last post I've tried Pivot to get the term/category matrix, but that didn't work.
Any ideas how to get the results of Word List to Data to output an example set that SVD can read? Everything before SVD is working, but the example set that comes out isn't readable by SVD. (The Excel sheet in the example for this post was copied and pasted from the following data.)
Book| Level| Title
B1| Undergrad| A Course on Integral Equations
B2| Undergrad| Attractors for Semigroups and Evolution Equations
B3| Grad| Automatic Differentiation of Algorithms: Theory, Implementation, and Application
B4| Undergrad| Geometrical Aspects of Partial Differential Equations
B5| Undergrad| Ideals, Varieties, and Algorithms { An Introduction to Computational Algebraic Geometry and Commutative Algebra
B6| Grad| Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
B7| Grad| Knapsack Problems: Algorithms and Computer Implementations
B8| Grad| Methods of Solving Singular Systems of Ordinary Differential Equations
B9| Grad| Nonlinear Systems
B10| Undergrad| Ordinary Differential Equations
B11| Undergrad| Oscillation Theory for Neutral Differential Equations with Delay
B12| Grad| Oscillation Theory of Delay Differential Equations
B13| Grad| Pseudodifferential Operators and Nonlinear Partial Differential Equations
B14| Undergrad| Sinc Methods for Quadrature and Differential Equations
B15| Grad| Stability of Stochastic Differential Equations with Respect to Semi-Martingales
B16| Undergrad| The Boundary Integral Approach to Static and Dynamic Contact Problems
B17| Undergrad| The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
Thanks.
You can use the Select Attributes operator even though it offers no attributes to choose from. The problem is that the attributes aren't known at design time (the complete documents would have to be parsed for this), so they can't be shown. But you can enter them manually.
Greetings,
Sebastian
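The idea of entering attribute names by hand can be sketched in Python as follows. This only mirrors the logic of selecting columns by name once the data exists; the `select_columns` helper is hypothetical, and the column names are taken from the Word List to Data output quoted earlier in the thread.

```python
# Three rows copied from the Word List to Data output shown above.
ROWS = [
    {"Word": "algorithms", "In document": 3, "in labeled": 3,
     "in class (undergrad)": 1, "in class (Graduate)": 2},
    {"Word": "delay", "In document": 2, "in labeled": 2,
     "in class (undergrad)": 1, "in class (Graduate)": 1},
    {"Word": "systems", "In document": 3, "in labeled": 3,
     "in class (undergrad)": 0, "in class (Graduate)": 3},
]


def select_columns(rows, names):
    """Keep only the named columns; raise KeyError if a name is misspelled."""
    return [{name: row[name] for name in names} for row in rows]


# The names are typed in manually because they only exist once the data is built.
KEEP = ["Word", "in class (undergrad)", "in class (Graduate)"]
selected = select_columns(ROWS, KEEP)
for row in selected:
    print(row)
```

Because the names must match exactly, including case and spacing, a typo shows up as an error at run time rather than at design time.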