"FP-Growth process fails"

hhassanien Member Posts: 2 Contributor I

Hello,

 

The attached process failed on the FP-Growth operator with the following error:

Process Failed

 

Exception: java.lang.StackOverflowError

1 vote

Fixed and Released · Last Updated

8.2.0

Comments

  • hhassanien Member Posts: 2 Contributor I

    Please also find the process attached herewith.

  • Pavithra_Rao Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 123 RM Data Scientist

    Hi @hhassanien

    Could you please share the data files that you used in the attached process?

     

    Also, sharing the log files will help debug the issue more easily.

    The Studio logs can be found in:

    C:\users\<username>\.RapidMiner\

     

    Cheers

     

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hi @hhassanien - yes, that looks like a problem. Pushing to Product Feedback.

     

    [EDIT: @Pavithra_Rao I used "Data Mining for the Masses" pdf and got the same error. It's attached. Modified XML below.]

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="238">
    <parameter key="file" value="/Users/GenzerConsulting/Desktop/DataMiningForTheMasses.pdf"/>
    </operator>
    <operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="246" y="238">
    <parameter key="vector_creation" value="Binary Term Occurrences"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="85"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="45" y="187"/>
    <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="289"/>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="45" y="391"/>
    <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="45" y="493">
    <parameter key="max_length" value="4"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="246" y="187">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="version"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="313" y="289">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="aasher"/>
    <parameter key="regular_expression" value="asher"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (6)" width="90" x="447" y="289">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="document"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="187">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="hyperone"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (5)" width="90" x="514" y="85">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="page"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (7)" width="90" x="581" y="187">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="process"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="715" y="85">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="author"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
    <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
    <connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
    <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
    <connect from_op="Filter Tokens (6)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
    <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Tokens (5)" to_port="document"/>
    <connect from_op="Filter Tokens (5)" from_port="document" to_op="Filter Tokens (7)" to_port="document"/>
    <connect from_op="Filter Tokens (7)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
    <connect from_op="Filter Tokens (4)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="numerical_to_binominal" compatibility="8.1.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="380" y="85"/>
    <operator activated="true" class="fp_growth" compatibility="8.1.001" expanded="true" height="82" name="FP-Growth" width="90" x="514" y="85"/>
    <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="Process Documents" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
    <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
    <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
    <connect from_op="FP-Growth" from_port="example set" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>


    Scott

     

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist

    Hi @hhassanien,

     

    Thanks for sharing the data and process. Do you want to use the FP-Growth algorithm to find groups of keywords that always co-occur in certain documents?

     

    There are only 5 documents here, and after text processing you get a very wide table: 5 rows by roughly 50,000 columns - 10,000 times more columns than rows! Such a small number of transactions combined with such a huge number of items will cause heap-space issues, because every set of keywords appearing together in even a single document forms a rule with at least 20% support (1/5 = 0.2) and 100% confidence, which results in millions of association rules for 50k keywords.
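The blow-up can be illustrated outside RapidMiner with a brute-force frequent-itemset search in plain Python (the keyword sets below are made up, not the real 50k-term corpus): with only 5 transactions, any itemset contained in even a single document already clears a 0.2 minimum support, so the number of frequent itemsets grows exponentially with document size.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force frequent-itemset search (illustration only)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            support = sum(set(combo) <= t for t in transactions) / n
            if support >= min_support:
                frequent[combo] = support
                found = True
        if not found:  # no itemset of this size is frequent; larger ones can't be
            break
    return frequent

# 5 "documents", each a tiny keyword set; a single doc with k keywords
# alone contributes 2**k - 1 itemsets at support 1/5 = 0.2.
docs = [{"sap", "fico", "ledger"}, {"sap", "mm", "stock"},
        {"sap", "sd", "order"}, {"sap", "hcm"}, {"sap", "idoc"}]
sets_02 = frequent_itemsets(docs, min_support=0.2)
sets_09 = frequent_itemsets(docs, min_support=0.9)
print(len(sets_02), len(sets_09))  # → 23 1
```

Raising the minimum support from 0.2 to 0.9 collapses the result from every subset of every document down to the single keyword shared by all five - which is why the modified process below sets min_support to 0.9.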

     

    Ideally we want input data with more transactions (usually > 200 rows) for market basket analysis with FP-Growth. So here are some workarounds for your document analysis:

    1. You can add more documents to increase the number of examples, and reduce the number of columns by pruning keywords or filtering tokens. I modified the text-mining process a little by adding pruning on the corpus. With this dimensionality reduction, the binominal data set fed into FP-Growth shrinks to 5 by 400. It still created 16 million frequent item sets (keywords).

    [Attachment: freq-items.PNG]

    Warning: the process below may need at least 2 minutes to run FP-Growth on the reduced data set on a laptop with 32 GB of RAM. If you need to create association rules out of the frequent item sets from FP-Growth, run it on a server with even more memory.

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="free_memory" compatibility="8.1.001" expanded="true" height="68" name="Free Memory" width="90" x="45" y="34"/>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="FICO" width="90" x="179" y="34">
    <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\FICO.pdf"/>
    <parameter key="content_type" value="pdf"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="MM" width="90" x="179" y="136">
    <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\MM.pdf"/>
    <parameter key="content_type" value="pdf"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="SD" width="90" x="179" y="238">
    <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\SD.pdf"/>
    <parameter key="content_type" value="pdf"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="HCM" width="90" x="179" y="340">
    <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\HCM.pdf"/>
    <parameter key="content_type" value="pdf"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Integration" width="90" x="179" y="442">
    <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\Integration.pdf"/>
    <parameter key="content_type" value="pdf"/>
    </operator>
    <operator activated="true" breakpoints="after" class="text:process_documents" compatibility="8.1.000" expanded="true" height="187" name="Process Documents" width="90" x="447" y="85">
    <parameter key="vector_creation" value="Term Frequency"/>
    <parameter key="add_meta_information" value="false"/>
    <parameter key="prune_method" value="absolute"/>
    <parameter key="prune_below_absolute" value="3"/>
    <parameter key="prune_above_absolute" value="5"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
    <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
    <operator activated="false" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="179" y="136"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="447" y="34"/>
    <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34"/>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="715" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="version"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="849" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="aasher"/>
    <parameter key="regular_expression" value="asher"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (6)" width="90" x="983" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="document"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="1117" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="hyperone"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (5)" width="90" x="1251" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="page"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (7)" width="90" x="1385" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="process"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="1519" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="author"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="false" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="1519" y="136">
    <parameter key="max_length" value="3"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
    <connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
    <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
    <connect from_op="Filter Tokens (6)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
    <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Tokens (5)" to_port="document"/>
    <connect from_op="Filter Tokens (5)" from_port="document" to_op="Filter Tokens (7)" to_port="document"/>
    <connect from_op="Filter Tokens (7)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
    <connect from_op="Filter Tokens (4)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">remove those words that show in every document and remove those words only showed in one doc</description>
    </operator>
    <operator activated="true" class="numerical_to_binominal" compatibility="8.1.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="648" y="34"/>
    <operator activated="true" breakpoints="after" class="fp_growth" compatibility="8.1.001" expanded="true" height="82" name="FP-Growth" width="90" x="782" y="34">
    <parameter key="find_min_number_of_itemsets" value="false"/>
    <parameter key="max_number_of_retries" value="10"/>
    <parameter key="min_support" value="0.9"/>
    </operator>
    <operator activated="true" class="create_association_rules" compatibility="8.1.001" expanded="true" height="82" name="Create Association Rules" width="90" x="916" y="34">
    <parameter key="min_confidence" value="1.0"/>
    </operator>
    <connect from_op="FICO" from_port="output" to_op="Process Documents" to_port="documents 5"/>
    <connect from_op="MM" from_port="output" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="SD" from_port="output" to_op="Process Documents" to_port="documents 2"/>
    <connect from_op="HCM" from_port="output" to_op="Process Documents" to_port="documents 3"/>
    <connect from_op="Integration" from_port="output" to_op="Process Documents" to_port="documents 4"/>
    <connect from_op="Process Documents" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
    <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
    <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
    <connect from_op="Create Association Rules" from_port="rules" to_port="result 1"/>
    <connect from_op="Create Association Rules" from_port="item sets" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="147"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
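The pruning step in the process above (per the description note: drop words that show up in every document as well as words that show up in only one) can be sketched in plain Python rather than process XML. The term counts and the min_df/max_df band below are illustrative assumptions, not the operator's actual parameters:

```python
def prune_terms(doc_term_counts, min_df, max_df):
    """Keep only terms whose document frequency (number of documents
    containing the term) lies in [min_df, max_df]."""
    df = {}
    for doc in doc_term_counts:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    keep = {t for t, n in df.items() if min_df <= n <= max_df}
    return [{t: c for t, c in doc.items() if t in keep}
            for doc in doc_term_counts]

# Hypothetical term counts for 5 documents (not the real corpus).
docs = [
    {"sap": 4, "ledger": 2, "invoice": 1},
    {"sap": 3, "invoice": 2, "stock": 5},
    {"sap": 2, "invoice": 1, "order": 3},
    {"sap": 6, "payroll": 2},
    {"sap": 1, "idoc": 4},
]
# "sap" (df = 5, appears everywhere) and the one-off terms (df = 1)
# are both dropped; "invoice" (df = 3) survives.
pruned = prune_terms(docs, min_df=3, max_df=4)
print(pruned)
```

Dropping both the ubiquitous and the one-off terms is what cuts the item count down before FP-Growth ever sees the data.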

    2. Transpose your document-term matrix to get a new data matrix with 5 columns; then you can use pairwise word-word distances to find groups of words with high similarity.

    [Attachment: similarity-results.PNG]

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="free_memory" compatibility="8.1.001" expanded="true" height="68" name="Free Memory" width="90" x="45" y="34"/>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="FICO" width="90" x="179" y="34">
    <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\FICO.pdf"/>
    <parameter key="content_type" value="pdf"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="MM" width="90" x="179" y="136">
    <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\MM.pdf"/>
    <parameter key="content_type" value="pdf"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="SD" width="90" x="179" y="238">
    <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\SD.pdf"/>
    <parameter key="content_type" value="pdf"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="HCM" width="90" x="179" y="340">
    <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\HCM.pdf"/>
    <parameter key="content_type" value="pdf"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Integration" width="90" x="179" y="442">
    <parameter key="file" value="C:\Users\YuanyuanHuang\Documents\RMCommunity\Integration.pdf"/>
    <parameter key="content_type" value="pdf"/>
    </operator>
    <operator activated="true" breakpoints="after" class="text:process_documents" compatibility="8.1.000" expanded="true" height="187" name="Process Documents" width="90" x="447" y="85">
    <parameter key="vector_creation" value="Term Frequency"/>
    <parameter key="add_meta_information" value="false"/>
    <parameter key="prune_below_absolute" value="3"/>
    <parameter key="prune_above_absolute" value="5"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
    <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
    <operator activated="false" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="179" y="136"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="447" y="34"/>
    <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34">
    <parameter key="min_chars" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (3)" width="90" x="715" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="version"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="849" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="aasher"/>
    <parameter key="regular_expression" value="asher"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (6)" width="90" x="983" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="document"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="1117" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="hyperone"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (5)" width="90" x="1251" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="page"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (7)" width="90" x="1385" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="process"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (4)" width="90" x="1519" y="34">
    <parameter key="condition" value="equals"/>
    <parameter key="string" value="author"/>
    <parameter key="invert condition" value="true"/>
    </operator>
    <operator activated="false" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="1653" y="85">
    <parameter key="max_length" value="4"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
    <connect from_op="Filter Tokens (3)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
    <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
    <connect from_op="Filter Tokens (6)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
    <connect from_op="Filter Tokens (2)" from_port="document" to_op="Filter Tokens (5)" to_port="document"/>
    <connect from_op="Filter Tokens (5)" from_port="document" to_op="Filter Tokens (7)" to_port="document"/>
    <connect from_op="Filter Tokens (7)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
    <connect from_op="Filter Tokens (4)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126"/>
    </operator>
    <operator activated="true" class="transpose" compatibility="8.1.001" expanded="true" height="82" name="Transpose" width="90" x="581" y="85"/>
    <operator activated="true" class="data_to_similarity" compatibility="8.1.001" expanded="true" height="82" name="Data to Similarity" width="90" x="715" y="85"/>
    <operator activated="true" class="similarity_to_data" compatibility="8.1.001" expanded="true" height="82" name="Similarity to Data (2)" width="90" x="849" y="85"/>
    <operator activated="true" class="sort" compatibility="8.1.001" expanded="true" height="82" name="Sorted Similarity" width="90" x="983" y="85">
    <parameter key="attribute_name" value="DISTANCE"/>
    <parameter key="sorting_direction" value="decreasing"/>
    </operator>
    <connect from_op="FICO" from_port="output" to_op="Process Documents" to_port="documents 5"/>
    <connect from_op="MM" from_port="output" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="SD" from_port="output" to_op="Process Documents" to_port="documents 2"/>
    <connect from_op="HCM" from_port="output" to_op="Process Documents" to_port="documents 3"/>
    <connect from_op="Integration" from_port="output" to_op="Process Documents" to_port="documents 4"/>
    <connect from_op="Process Documents" from_port="example set" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Data to Similarity" to_port="example set"/>
    <connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data (2)" to_port="similarity"/>
    <connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data (2)" to_port="exampleSet"/>
    <connect from_op="Similarity to Data (2)" from_port="exampleSet" to_op="Sorted Similarity" to_port="example set input"/>
    <connect from_op="Sorted Similarity" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="147"/>
    </process>
    </operator>
    </process>
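The transpose-then-similarity idea can be sketched in plain Python using cosine similarity (the matrix and term names below are toy assumptions; Data to Similarity offers several numerical measures, cosine among them):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy document-term matrix: 5 documents (rows) x 4 terms (columns).
terms = ["posting", "ledger", "invoice", "stock"]
doc_term = [
    [2, 1, 3, 0],
    [0, 0, 1, 4],
    [1, 1, 2, 0],
    [3, 2, 4, 0],
    [0, 0, 0, 2],
]

# The "Transpose" step: one 5-dimensional vector per term, so the
# similarity computation now compares terms instead of documents.
term_vectors = list(zip(*doc_term))

# The "Data to Similarity" + "Sorted Similarity" steps, pairwise.
pairs = []
for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        pairs.append((terms[i], terms[j],
                      cosine(term_vectors[i], term_vectors[j])))
pairs.sort(key=lambda p: p[2], reverse=True)
for a, b, s in pairs:
    print(a, b, round(s, 3))
```

Terms that rise and fall together across the five documents end up at the top of the sorted list, which is exactly the grouping the transposed process produces.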

    3. Run word2vec (available in the Word2Vec extension from the Marketplace) on the documents to extract the vocabulary and its context with a deep-learning neural network.

    Please check out the knowledge base article by Dr. Martin Schmitz:

    https://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Synonym-Detection-with-Word2Vec/ta-p/43860

     

    Cheers,

    YY

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    wow - thank you @Pavithra_Rao for such a detailed and helpful response!

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Unfortunately we're going to decline to fix this, for two reasons: 1) as @Pavithra_Rao showed, there is a good workaround for this, and in fact what she shows is likely best practice anyway; 2) the FP-Growth operator is being rebuilt from the ground up right now. :)

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist

    We will have an improved FP-Growth operator in our next release, 8.2.

    It will be much faster with the new data core implementation, and it will also be compatible with transactional data like:

    TransactionID                       item1|item2|item3|item4

    Kudos to @gmeier !
