"Create document operator not working"

JSP007 · April 2014

RapidMiner Studio Starter 6.0.003.
I am doing text mining with five Read Document operators (one per file) feeding Process Documents, followed by Clustering (k-Means).
Works like a charm!

Then I swapped in a Create Document operator. RapidMiner halts and warns "The data contains missing which is not allowed for k-Means. ... Cause Clustering".

Tested:
- Pasted into Create Document the same text that is in one of the files. Same warning.

- Typed a single word into Create Document. Same warning.

Any ideas to go forward would be much appreciated.
Jorge

awchisholm · April 2014

Hello Jorge,

It's best to post your XML process. One thought, however: did you still include a Process Documents operator after the Create Document operator?

regards

Andrew

JSP007 · April 2014

Thank you very much, Andrew. I am a newbie in RapidMiner, so I might have done something wrong. I cannot see what, though: the process looks exactly like the one in the book Data Mining for the Masses.

Yes, I did connect the output of the Create Document operator to the Process Documents operator. That is the only change I made. I have now reduced the number of inputs to three Read Document operators, just to test, and the text in Create Document is only a short phrase.

Here is the XML code; I hope it sheds some light on the problem. BTW, the Text Mining operators are version 5 and I am using RM version 6. Would this matter?

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Linde 01" width="90" x="45" y="75">
<parameter key="file" value="/Users/JSP007/Documents/Dokumentƒ/tema Big Data/tema Övningar Data Mining For The Masses/tema Text Mining Diverse mejl/Linde 01.txt"/>
</operator>
<operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Veronica 01" width="90" x="45" y="165">
<parameter key="file" value="/Users/JSP007/Documents/Dokumentƒ/tema Big Data/tema Övningar Data Mining For The Masses/tema Text Mining Diverse mejl/Veronica 01.txt"/>
</operator>
<operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="45" y="390">
<parameter key="text" value="I am testing here."/>
</operator>
<operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Unknown writer V" width="90" x="45" y="255">
<parameter key="file" value="/Users/JSP007/Documents/Dokumentƒ/tema Big Data/tema Övningar Data Mining For The Masses/tema Text Mining Diverse mejl/Unknown writer Veronica.txt"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="148" name="Process Documents" width="90" x="313" y="165">
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="313" y="30"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="k_means" compatibility="6.0.003" expanded="true" height="76" name="Clustering" width="90" x="447" y="30"/>
<connect from_op="Linde 01" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Veronica 01" from_port="output" to_op="Process Documents" to_port="documents 2"/>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 4"/>
<connect from_op="Unknown writer V" from_port="output" to_op="Process Documents" to_port="documents 3"/>
<connect from_op="Process Documents" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

awchisholm · April 2014

Hello Jorge

I can see a problem, hopefully the same one you are seeing. I don't have your data files, so I commented those operators out. Now the Create Document operator by itself, with the tokenizing inside the Process Documents operator, leads to an example set containing one example. This means the k-means operator does not have enough examples to work with, and an error occurs.

If you duplicate the Create Document operator so there are two inputs to the Process Documents operator, you should get two examples in the example set and k-means will happily cluster these.
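For anyone unsure what "duplicate" means in the process XML, here is a minimal sketch of the lines that might be added to the process posted above. The operator name "Create Document (2)", its text, its canvas coordinates and the target port "documents 5" are illustrative assumptions, not taken from a working process.

```xml
<!-- Hypothetical second Create Document operator; name, text and position are placeholders -->
<operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="480">
  <parameter key="text" value="A second test sentence."/>
</operator>
<!-- Assumes "documents 5" is the next free input port on Process Documents -->
<connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 5"/>
```

In the GUI this should amount to copying and pasting the Create Document operator and wiring the copy to the next free input of Process Documents.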

As an aside, the k-means measure type is set to BregmanDivergences - I have no idea what this actually does but it's the default. You might want to change it to something like NumericalMeasures.

As another aside, the k-means numerical measure is traditionally set to CosineSimilarity when doing text mining things. I would be the first to say however, that this is not a hard rule and you should do whatever fits what you are doing.
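As a rough sketch, the two settings mentioned above might look like this on the Clustering operator in the process XML. The parameter keys `measure_types` and `numerical_measure` are assumed to be what the k-means operator writes out; check them against an exported process before relying on them.

```xml
<operator activated="true" class="k_means" compatibility="6.0.003" expanded="true" height="76" name="Clustering" width="90" x="447" y="30">
  <!-- Assumed key: switches the measure family away from BregmanDivergences -->
  <parameter key="measure_types" value="NumericalMeasures"/>
  <!-- Assumed key: cosine similarity, the usual choice for word vectors -->
  <parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
```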

regards

Andrew

JSP007 · April 2014

You are all too kind, Mathew, and I really appreciate that.
I am afraid you are talking above my level.

I am sorry I cannot send a picture but the "picture" is this:

Read Document (RD) ---> Process Document (PD) --> Clustering k-Means (C)

I have SIX text files that are read with six RD and go into a PD with SIX Doc inputs. The PD output goes into the C.
It all works as expected, Clustering works as expected, etc.

I then want to use a CD rather than a RD for one of the texts, and paste text (or type text) into a CD operator. This is the ONLY change in the system.

Create Document (CD) ---> Process Document (PD) --> Clustering (C)
PD now has SEVEN Doc inputs.
BUT it all stops with a warning: "The data contains missing values that are not allowed in k-Means"
----
And that is all. I just compared with some colleagues, and they all do this without any problem whatsoever. I have even used the same files. The only difference is that I use RapidMiner Studio Starter and they use RapidMiner 5.
Is this a bug in RapidMiner Studio Starter 6?

I have only one PD and one C (k-Means). I tried to understand your suggestion of duplicating the operator, but I do not really understand what to do (remember that I am a newbie in RapidMiner).
I can work with everything else but not with CD.
Very puzzled I am.

awchisholm · April 2014

On the "Process Documents" operator, deselect the "add meta information" check box.

Does that change things?
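In the process XML, the same change would presumably appear as a parameter on the Process Documents operator. This is a sketch only: it assumes the check box is stored under the key `add_meta_information`, and the inner subprocess is abbreviated to a comment.

```xml
<operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="148" name="Process Documents" width="90" x="313" y="165">
  <!-- Assumed key for the "add meta information" check box -->
  <parameter key="add_meta_information" value="false"/>
  <process expanded="true">
    <!-- Tokenize, Filter Stopwords (English), Transform Cases and their connections stay as they are -->
  </process>
</operator>
```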

regards

Andrew

JSP007 · April 2014

YES! It does change something, it does change everything. Everything works just fine now.

But is the meta data not important (in this case)? Maybe not; the question then is *when* it is important.
Never mind, I will learn that while working and learning about RapidMiner.

Now, I would like to mark this post as solved (to help other people) but I do not really know how to do it here. I did choose Thumbs up and wrote SOLVED in the Subject but usually there are other ways.

--> Thank you ever so much, Matthew, you have been of great help!

awchisholm · April 2014

Hello Jorge

I think it's a bug - the presence of special attributes with missing values should not upset the clustering operator.

regards

Andrew

MariusHelf · May 2014

awchisholm wrote:

Hello Jorge

I think it's a bug - the presence of special attributes with missing values should not upset the clustering operator.

regards

Andrew

Yes, this is a bug that has crept into our latest release. It's on our developers' list.

Best regards,
Marius
